Subject: Re: [HTCondor-users] File last modification time or job last write() attribute?
From: MIRON LIVNY <miron@xxxxxxxxxxx> Date: 05/27/2016 01:50 AM
> You are right, Dimitri.
>
> The reason I used C was to make the point that the definition of "stuck"
> has an impact on the frequency of the probe. I can see cases where the
> probe is expensive.
>
> If all goes well, we will probe at a very low frequency.
With this comment in mind, I tweaked my hook script
to only chirp if the value has changed. Thanks!
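
A minimal sketch of that change-only reporting, in Python. The state file, the attribute name WatchedFileMTime, and the helper names are hypothetical and only illustrate the idea. condor_chirp set_job_attr is the real command, but whether it is usable depends on how the job and pool are set up, so the runner is passed in rather than assumed.

```python
import os
import subprocess


def changed_value(path, state_file):
    """Return the file's mtime (whole seconds) if it differs from the value
    last recorded in state_file, otherwise None. Records the new value."""
    mtime = int(os.stat(path).st_mtime)
    try:
        with open(state_file) as f:
            last = int(f.read().strip())
    except (OSError, ValueError):
        last = None  # nothing reported yet, or unreadable state
    if mtime == last:
        return None
    with open(state_file, "w") as f:
        f.write(str(mtime))
    return mtime


def chirp_command(attr, value):
    # Argument list for: condor_chirp set_job_attr <attr> <value>
    return ["condor_chirp", "set_job_attr", attr, str(value)]


def report_if_changed(path, state_file, attr="WatchedFileMTime",
                      run=subprocess.check_call):
    """Chirp the mtime only when it has changed since the last report."""
    value = changed_value(path, state_file)
    if value is not None:
        run(chirp_command(attr, value))
    return value
```

When the file is quiet, the probe costs one stat() and one small read, and nothing is sent to the schedd.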
Perhaps this idea could be expressed more generally
as a "watchdog service" for a job.
The linked article tells a sad tale of the demise
of the Clementine mission for want of a few lines of hardware-WDT enablement
code. Although they had implemented a thruster timeout in software,
that froze too when the processor hung. Clementine's mission to near-Earth
asteroid Geographos had to be abandoned for want of the fuel
that was spewed out during the hang. http://www.ganssle.com/watchdogs.htm
As we know, HTCondor's startd can provide the equivalent
of Clementine's unused hardware watchdog, outside the purview of the
job. There are already a number of job characteristics that can be
evaluated by a periodic_hold expression, such as BlockWrites, BlockReads,
BytesSent, RecentBlockWrites, ResidentSetSize_RAW, RemoteSysCpu
/ RemoteUserCpu, RemoteWallClockTime, and so on. And from what I gathered
at the delightful and informative HTCondor Week 2016
- http://research.cs.wisc.edu/htcondor/HTCondorWeek2016/
- there will be even more stats available on a variety of other
aspects of the job in future revisions.
I considered using RecentBlockWrites to watchdog the
job in our situation, but the trouble is
that other elements of the job may be writing things unrelated
to the hung element. That activity is also reflected in
RecentBlockWrites, creating a "noise floor" which would require
testing to characterize in order to avoid false positives.
CPU utilization is another potential sensor to use,
but without a "RecentRemoteUserCpu" it's tricky to make
decisions based on it. In one case we're looking for the overall utilization
since job startup to fall below about 20% - a safe
noise floor for the job in question - but if the job has been running
a long time, the cumulative average has a long tail and takes a long time to drop that far.
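
To put a number on that long tail, here is a small Python sketch. It assumes a simplified model that is not from any HTCondor source: the job runs at a steady utilization until it hangs, and uses no CPU at all afterwards. The function names are made up for illustration.

```python
def cumulative_utilization(cpu_seconds, wall_seconds):
    """Overall CPU utilization since job start (0.0 if no wall time yet)."""
    return cpu_seconds / wall_seconds if wall_seconds > 0 else 0.0


def hang_time_to_trip(busy_seconds, threshold, busy_utilization=1.0):
    """Seconds of zero-CPU hang needed before the cumulative utilization
    falls to the threshold, for a job that was busy for busy_seconds."""
    cpu = busy_seconds * busy_utilization
    # Solve cpu / (busy_seconds + hang) = threshold for hang.
    return max(0.0, cpu / threshold - busy_seconds)
```

Under those assumptions, a job that ran flat out for ten hours has to sit hung for roughly another forty before its since-startup utilization reaches 20%, which is why a windowed "Recent" statistic would be far more useful here.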
The last modification time of a given file is really
just another statistical sensor. (I'm also looking at adding a
regexp which can be looked for in the tail end of the specified file.)
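
A rough Python sketch of that tail-end regexp check, assuming the file is text and that only the last few kilobytes matter. The function name and the max_bytes default are hypothetical choices.

```python
import os
import re


def tail_matches(path, pattern, max_bytes=4096):
    """Return True if pattern is found in the last max_bytes bytes of path."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Read only the tail, so the cost stays bounded for large files.
        f.seek(max(0, size - max_bytes))
        tail = f.read().decode("utf-8", errors="replace")
    return re.search(pattern, tail) is not None
```

Reading a fixed-size tail keeps the probe cheap no matter how large the log has grown.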
So perhaps a direction which could be explored is
another "periodic" type - we have periodic_hold, periodic_remove, periodic_checkpoint, and even periodic_memory_sync... what about a "periodic_run"
or "periodic_info" directive for condor_submit?
It would be given an input-transferred executable
to be run by the startd at the standard periodic
interval, bringing what currently takes an "update_job_info" hook
and "+" notation into submit-native functionality. The executable would receive a copy
of the job classad on stdin, and deliver an update classad on
stdout. It would probably need to be handled asynchronously like the
job info hook is.
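
To make the idea concrete, here is a sketch in Python of what such an executable might look like. Everything about it is hypothetical: no "periodic_info" directive exists, the WatchedFile and WatchedFileAge attributes are invented for the example, and the parser only handles simple "Name = Value" lines rather than the full classad syntax.

```python
import os
import sys
import time


def parse_ad(text):
    """Parse simple 'Name = Value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def build_update(ad, now=None):
    """Build the update ad: the age in seconds of the file named by the
    hypothetical WatchedFile attribute, if that file exists."""
    now = time.time() if now is None else now
    update = {}
    watched = ad.get("WatchedFile", "").strip('"')
    if watched and os.path.exists(watched):
        update["WatchedFileAge"] = int(now - os.stat(watched).st_mtime)
    return update


def format_ad(update):
    """Render the update ad as 'Name = Value' lines."""
    return "".join(f"{k} = {v}\n" for k, v in sorted(update.items()))


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Job ad in on stdin, update ad out on stdout; a real script would
    # finish by calling main().
    stdout.write(format_ad(build_update(parse_ad(stdin.read()))))
```

The script only reports a measurement. Deciding what counts as "stuck" would stay in an ordinary expression, for instance a periodic_hold that compares the hypothetical WatchedFileAge against a limit.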
The executable would only be responsible for updating
classads based on specific details it's looking for, while
the actual watchdog trigger and action would be handled by the
other periodic_* expressions.
Another use case that comes to mind is to use "strace"
in an info script, to attach to the running process based
on the JobPid attribute, and look for patterns in its execution to detect problematic behavior such as an infinite
loop or a hung call.
This certainly gives a nice length of stout rope to
users, but when you see folks parsing the stdout of condor_q
in a watchdog script they wrote themselves, you realize that they
already have quite a bit of gallows rope on hand to begin with.
I suppose there's nothing inherently wrong with doing
this with an update_job_info hook, aside from the constraints that
have always existed in the hook mechanisms, such as the inability
(as far as I know) to mix different hooks together, since it's
not possible to specify a comma-delimited list of hook keywords.