Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps
- Date: Fri, 08 Jul 2022 17:19:32 -0400
 
- From: James Alexander Clark <jaclark@xxxxxxxxxxx>
 
- Subject: Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps
 
Thanks, Todd
On 7/8/22 15:34, Todd L Miller wrote:
I have an application which is configured to checkpoint and self-exit 
every 60 minutes.
I am confused by the output of condor_userlog (see below):
Nothing in the rest of HTCondor is (presently) expected to know 
anything at all about what self-checkpointing jobs are doing. We'll 
probably fix that at some point. As far as I know, however, the CPU 
usage for each individual execution of the job should be correct, so a 
reading of 0 seems like a problem.
FWIW, the job I happen to be looking at is still running with:
$ condor_q 58968166 -af RemoteUserCpu -af CumulativeRemoteUserCpu
107.0 31916.0
So I guess if the usage comes from the end-of-job summary in the 
userlog, that's not too surprising.  I'll try finding a completed one to 
compare with.
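
For comparing the two numbers across several jobs, a minimal sketch of splitting that condor_q output line into its two values. It assumes the output is two whitespace-separated numbers as shown above and that both are in seconds; the function name parse_cpu_usage is hypothetical and not part of HTCondor.

```python
def parse_cpu_usage(line):
    """Split one line of `condor_q ... -af` output into two floats.

    Assumes the line holds RemoteUserCpu followed by
    CumulativeRemoteUserCpu, separated by whitespace.
    """
    remote_user_cpu, cumulative_remote_user_cpu = (float(x) for x in line.split())
    return remote_user_cpu, cumulative_remote_user_cpu


# Values copied from the condor_q output quoted above.
current, cumulative = parse_cpu_usage("107.0 31916.0")
print(f"RemoteUserCpu: {current} s")
print(f"CumulativeRemoteUserCpu: {cumulative} s ({cumulative / 3600:.1f} h)")
```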
- The wall times all being less than 2 hours seems suspicious to me:  
I'm guessing > 1 hour corresponds to cases where the job resumes on 
the same host after a checkpoint?
The job will always _try_ to resume on the same host after a 
checkpoint. It should only switch hosts if it gets preempted (which I 
think you have turned off) or evicted for running too long or whatever, 
or if the execute node breaks in some way.
Right, and now I see the 4 hour continuous run in the Host/Job section 
so that makes sense.
Before we reconfigured to a 1 hour interval, we were running with the 
default 8 hours and saw wall times of that same order. Are we really 
just getting unlucky here and getting evicted a few minutes after each 
resume?
I'd have to take a look at the actual job log; I suspect 
condor_userlog is misunderstanding something. If the only change you 
made to the application is how often it checkpoints -- and it's running 
on the same machines, etc -- you shouldn't see any changes to how long 
it runs before being evicted.
I *think* a new issue might have manifested since I saw the 8 hour runs: 
we've got a few people with those socket disconnect messages every 50 
mins to 1.5 hours.  I did track down some stuff in the shadowlog that 
looked a lot like:
https://lists.cs.wisc.edu/archive/htcondor-users/2011-June/msg00096.shtml
But that's totally unrelated to the userlog query and I'll start a 
separate thread/issue when I've got a bit more data.
Thanks
- ToddM
--
James Alexander Clark
LIGO Laboratory
California Institute of Technology
email:  james.clark@xxxxxxxx
Tel. (cell):  413-230-1412