Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps

Date: Fri, 8 Jul 2022 14:34:12 -0500 (CDT)
From: Todd L Miller <tlmiller@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps

I have an application which is configured to checkpoint and self-exit every 60 minutes.
I am confused by the output of condor_userlog (see below):

Nothing in the rest of HTCondor is (presently) expected to know anything at all about what self-checkpointing jobs are doing. We'll probably fix that at some point. As far as I know, however, the CPU usage for each individual execution of the job should be correct, so a reading of 0 seems like a problem.

- The wall times all being less than 2 hours seems suspicious to me: I'm guessing > 1 hour corresponds to cases where the job resumes on the same host after a checkpoint?

The job will always _try_ to resume on the same host after a checkpoint. It should only switch hosts if it gets preempted (which I think you have turned off) or evicted for running too long or whatever, or if the execute node breaks in some way.

Before we reconfigured to a 1 hour interval, we were running with the default 8 hours and saw wall times of that same order. Are we really just getting unlucky here and getting evicted a few minutes after each resume?

I'd have to take a look at the actual job log; I suspect condor_userlog is misunderstanding something. If the only change you made to the application is how often it checkpoints -- and it's running on the same machines, etc -- you shouldn't see any changes to how long it runs before being evicted.
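
For readers unfamiliar with the "checkpoint and self-exit" pattern this thread is about, here is a minimal Python sketch of it. Nothing in it comes from the poster's application: the function name `run`, the JSON state file, and the step counter are all hypothetical, and the exit code 85 is an assumption that would have to match whatever `checkpoint_exit_code` the submit file sets. The point is only that each execution is a separate process run that ends by writing its state and exiting, and the next one picks up from that state.

```python
import json
import os
import time

# Assumption: this value must match checkpoint_exit_code in the submit file.
CHECKPOINT_EXIT_CODE = 85


def run(state_path, total_steps, interval_seconds, clock=time.monotonic):
    """Do up to total_steps units of work, checkpointing after interval_seconds.

    Returns the exit code the process should use: CHECKPOINT_EXIT_CODE right
    after writing a checkpoint, or 0 once all of the work is finished.
    """
    # Resume from the previous checkpoint if one exists.
    step = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            step = json.load(f)["step"]

    started = clock()
    while step < total_steps:
        step += 1  # placeholder for one unit of real work
        if step < total_steps and clock() - started >= interval_seconds:
            # Write to a temporary file and rename it, so an interruption
            # mid-write cannot leave a corrupt checkpoint behind.
            tmp_path = state_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"step": step}, f)
            os.replace(tmp_path, state_path)
            return CHECKPOINT_EXIT_CODE
    return 0
```

A real job would end with something like `sys.exit(run("state.json", total_steps, 3600))` and list the state file among the checkpoint files to transfer. Under this pattern, each checkpoint exit ends one execution and the resume starts another.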


- ToddM

Follow-Ups:
- Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps
  - From: James Alexander Clark

References:
- [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps
  - From: James Alexander Clark
