Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps
- Date: Fri, 08 Jul 2022 17:19:32 -0400
- From: James Alexander Clark <jaclark@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps
Thanks, Todd
On 7/8/22 15:34, Todd L Miller wrote:
I have an application which is configured to checkpoint and self-exit
every 60 minutes.
I am confused by the output of condor_userlog (see below):
Nothing in the rest of HTCondor is (presently) expected to know
anything at all about what self-checkpointing jobs are doing. We'll
probably fix that at some point. As far as I know, however, the CPU
usage for each individual execution of the job should be correct, so a
reading of 0 seems like a problem.
FWIW, the job I happen to be looking at is still running with:
$ condor_q 58968166 -af RemoteUserCpu -af CumulativeRemoteUserCpu
107.0 31916.0
So I guess if the usage comes from the end-of-job summary in the
userlog, that's not too surprising. I'll try finding a completed one to
compare with.
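
For anyone reading those two numbers, here is a minimal Python sketch that parses the condor_q output line above and prints both values as h:mm:ss. It assumes both attributes are reported in seconds; the helper names parse_cpu_attrs and fmt_hms are made up for illustration and are not part of HTCondor.

```python
def parse_cpu_attrs(line):
    """Split one whitespace-separated output line into two floats.

    Assumes the line holds RemoteUserCpu followed by
    CumulativeRemoteUserCpu, both in seconds (an assumption).
    """
    remote, cumulative = (float(field) for field in line.split())
    return remote, cumulative


def fmt_hms(seconds):
    """Format a duration in seconds as h:mm:ss."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"


# The output line quoted above for job 58968166.
remote, cumulative = parse_cpu_attrs("107.0 31916.0")
print("RemoteUserCpu:          ", fmt_hms(remote))
print("CumulativeRemoteUserCpu:", fmt_hms(cumulative))
```

Read that way, the current execution has used under two minutes of user CPU, while the cumulative figure is close to nine hours.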
- The wall times all being less than 2 hours seems suspicious to me:
I'm guessing > 1 hour corresponds to cases where the job resumes on
the same host after a checkpoint?
The job will always _try_ to resume on the same host after a
checkpoint. It should only switch hosts if it gets preempted (which I
think you have turned off) or evicted for running too long or whatever,
or if the execute node breaks in some way.
Right, and now I see the 4 hour continuous run in the Host/Job section
so that makes sense.
Before we reconfigured to a 1 hour interval, we were running with the
default 8 hours and saw wall times of that same order. Are we really
just getting unlucky here and getting evicted a few minutes after each
resume?
I'd have to take a look at the actual job log; I suspect
condor_userlog is misunderstanding something. If the only change you
made to the application is how often it checkpoints -- and it's running
on the same machines, etc -- you shouldn't see any changes to how long
it runs before being evicted.
I *think* a new issue might have manifested since I saw the 8 hour runs:
we've got a few people with those socket disconnect messages every 50
mins to 1.5 hours. I did track down some stuff in the shadowlog that
looked a lot like:
https://lists.cs.wisc.edu/archive/htcondor-users/2011-June/msg00096.shtml
But that's totally unrelated to the userlog query and I'll start a
separate thread/issue when I've got a bit more data.
Thanks
- ToddM
--
James Alexander Clark
LIGO Laboratory
California Institute of Technology
email: james.clark@xxxxxxxx
Tel. (cell): 413-230-1412