I have an application which is configured to checkpoint and self-exit every 60 minutes. I am confused by the output of condor_userlog (see below):
Nothing in the rest of HTCondor is (presently) expected to know anything at all about what self-checkpointing jobs are doing. We'll probably fix that at some point. As far as I know, however, the CPU usage for each individual execution of the job should be correct, so a reading of 0 seems like a problem.
- The wall times all being less than 2 hours seems suspicious to me: I'm guessing > 1 hour corresponds to cases where the job resumes on the same host after a checkpoint?
The job will always _try_ to resume on the same host after a checkpoint. It should only switch hosts if it gets preempted (which I think you have turned off) or evicted for running too long or whatever, or if the execute node breaks in some way.
Before we reconfigured to a 1 hour interval, we were running with the default 8 hours and saw wall times of that same order. Are we really just getting unlucky here and getting evicted a few minutes after each resume?
I'd have to take a look at the actual job log; I suspect condor_userlog is misunderstanding something. If the only change you made to the application is how often it checkpoints -- and it's running on the same machines, etc. -- you shouldn't see any change in how long it runs before being evicted.
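If you want to cross-check the condor_userlog numbers yourself, here is a minimal sketch of the arithmetic. It assumes you have already copied the start and stop timestamps for each execution out of the job log by hand. The event names ("execute", "evict", "terminate"), the tuple layout, and the timestamps are all made up for illustration; this is not HTCondor's log format and it does not use any HTCondor API.

```python
from datetime import datetime

def wall_times(events):
    """Pair each 'execute' with the next 'evict' or 'terminate' and
    return the elapsed wall time of each run, in minutes."""
    runs = []
    started = None
    for kind, stamp in events:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "execute":
            started = when
        elif kind in ("evict", "terminate") and started is not None:
            runs.append((when - started).total_seconds() / 60)
            started = None
    return runs

# Hypothetical timestamps, not taken from any real job log.
events = [
    ("execute", "2024-01-01 00:00:00"),
    ("evict", "2024-01-01 01:05:00"),
    ("execute", "2024-01-01 01:10:00"),
    ("terminate", "2024-01-01 02:40:00"),
]
print(wall_times(events))
```

If the per-run times you get this way disagree with what condor_userlog reports, that would point at condor_userlog rather than at the job.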
- ToddM