
Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps



I have an application which is configured to checkpoint and self-exit every 60 minutes.
I am confused by the output of condor_userlog (see below):
	Nothing in the rest of HTCondor is (presently) expected to know 
anything at all about what self-checkpointing jobs are doing.  We'll 
probably fix that at some point.  As far as I know, however, the CPU usage 
for each individual execution of the job should be correct, so a reading 
of 0 seems like a problem.
- The wall times all being less than 2 hours seems suspicious to me: I'm guessing that wall times over 1 hour correspond to cases where the job resumes on the same host after a checkpoint?
	The job will always _try_ to resume on the same host after a 
checkpoint.  It should only switch hosts if it gets preempted (which I 
think you have turned off) or evicted for running too long or whatever, or 
if the execute node breaks in some way.
Before we reconfigured to a 1 hour interval, we were running with the default 8 hours and saw wall times of that same order. Are we really just getting unlucky here and getting evicted a few minutes after each resume?
	I'd have to take a look at the actual job log; I suspect 
condor_userlog is misunderstanding something.  If the only change you made 
to the application is how often it checkpoints -- and it's running on the 
same machines, etc. -- you shouldn't see any change in how long it runs 
before being evicted.
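
	To cross-check condor_userlog against the raw job log, a rough 
sketch like the following could help.  It assumes each event starts with a 
header line of the form "NNN (cluster.proc.subproc) YYYY-MM-DD HH:MM:SS ..." 
(older logs use a different date format and would need a different 
pattern), and that 001 marks an execute event while 004 and 005 mark 
eviction and termination.  The function name run_durations is made up for 
illustration; it is not part of HTCondor.

```python
import re
from datetime import datetime

# Assumed header layout, e.g.:
#   001 (123.000.000) 2023-01-01 10:05:00 Job executing on host: <...>
# Logs written with a different timestamp format will not match this pattern.
HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
)

EXECUTE = "001"              # job started executing
END_OF_RUN = {"004", "005"}  # job evicted, job terminated


def run_durations(log_text):
    """Return {(cluster, proc): [seconds, ...]}, one entry per execution.

    Each duration is the time between an execute event and the next
    evict or terminate event for the same job.
    """
    started = {}
    durations = {}
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if not match:
            continue  # body lines of an event, or "..." separators
        code, cluster, proc, stamp = match.groups()
        job = (int(cluster), int(proc))
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if code == EXECUTE:
            started[job] = when
        elif code in END_OF_RUN and job in started:
            elapsed = (when - started.pop(job)).total_seconds()
            durations.setdefault(job, []).append(elapsed)
    return durations
```

	Feeding it the contents of the job log gives one wall time per 
execution, which can be compared directly with the figures condor_userlog 
reports.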
- ToddM