
Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps



Thanks, Todd

On 7/8/22 15:34, Todd L Miller wrote:
I have an application which is configured to checkpoint and self-exit every 60 minutes.
I am confused by the output of condor_userlog (see below):
Nothing in the rest of HTCondor is (presently) expected to know 
anything at all about what self-checkpointing jobs are doing. We'll 
probably fix that at some point. As far as I know, however, the CPU 
usage for each individual execution of the job should be correct, so a 
reading of 0 seems like a problem.
FWIW, the job I happen to be looking at is still running with:

$ condor_q 58968166 -af RemoteUserCpu -af CumulativeRemoteUserCpu

107.0 31916.0

So I guess if the usage comes from the end-of-job summary in the userlog, that's not too surprising. I'll try finding a completed one to compare with.
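For anyone wanting to compare the two numbers across jobs, here is a minimal sketch of parsing that condor_q output. The function name is my own, and the reading of the attributes (first value for the current execution, second value summed over all executions) is an assumption on my part, not something confirmed in this thread.

```python
def parse_cpu_attrs(line):
    """Parse one line of `condor_q <id> -af RemoteUserCpu -af CumulativeRemoteUserCpu`.

    Returns (remote_user_cpu, cumulative_remote_user_cpu) in seconds.
    Hypothetical helper: it only splits the two whitespace-separated values.
    """
    remote, cumulative = (float(value) for value in line.split())
    return remote, cumulative


# Values copied from the condor_q output above.
remote, cumulative = parse_cpu_attrs("107.0 31916.0")

# Convert the cumulative figure to hours for easier comparison with wall times.
cumulative_hours = cumulative / 3600.0

print(f"RemoteUserCpu: {remote:.0f} s")
print(f"CumulativeRemoteUserCpu: {cumulative:.0f} s ({cumulative_hours:.2f} h)")
```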

- The wall times all being less than 2 hours seems suspicious to me: I'm guessing > 1 hour corresponds to cases where the job resumes on the same host after a checkpoint?
The job will always _try_ to resume on the same host after a 
checkpoint. It should only switch hosts if it gets preempted (which I 
think you have turned off) or evicted for running too long or whatever, 
or if the execute node breaks in some way.
Right, and now I see the 4 hour continuous run in the Host/Job section 
so that makes sense.
Before we reconfigured to a 1 hour interval, we were running with the default 8 hours and saw wall times of that same order. Are we really just getting unlucky here and getting evicted a few minutes after each resume?
I'd have to take a look at the actual job log; I suspect 
condor_userlog is misunderstanding something. If the only change you 
made to the application is how often it checkpoints -- and it's running 
on the same machines, etc -- you shouldn't see any changes to how long 
it runs before being evicted.
I *think* a new issue might have manifested since I saw the 8 hour runs: 
we've got a few people with those socket disconnect messages every 50 
mins to 1.5 hours. I did track down some stuff in the shadowlog that 
looked a lot like:
https://lists.cs.wisc.edu/archive/htcondor-users/2011-June/msg00096.shtml

But that's totally unrelated to the userlog query and I'll start a separate thread/issue when I've got a bit more data.
Thanks

- ToddM
--
James Alexander Clark
LIGO Laboratory
California Institute of Technology
email:  james.clark@xxxxxxxx
Tel. (cell):  413-230-1412