[HTCondor-users] Checkpointing job restarts in expiring pilot
- Date: Fri, 26 Jul 2024 18:11:13 +0200
- From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
- Subject: [HTCondor-users] Checkpointing job restarts in expiring pilot
Hi,
I'm testing self-checkpointing jobs. I had them working, but I
noticed a job repeatedly restarting in an expiring pilot,
i.e. when time() > GLIDEIN_ToRetire (time() here comes from
Python).
I suspect there might be some time misalignment, possibly due to
Daylight Saving Time.
The START expression, from .machine.ad, has:
START = <stuff> && ((GLIDEIN_ToRetire =?= undefined)
|| (CurrentTime < GLIDEIN_ToRetire))
I do not know the value of CurrentTime, but inspecting the job output
logs, Python's time() is 3600 + GLIDEIN_ToRetire + <a few random
seconds> when the job restarts after a previous checkpoint exit.
That makes me think that CurrentTime might not be DST-aware.
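
As a rough cross-check of the arithmetic, here is a minimal Python sketch that mirrors only the time clause of the START expression, using values copied from the log excerpt further down. The function name start_time_clause is hypothetical, and None stands in for a ClassAd undefined value. Note that time.time() counts seconds since the epoch in UTC, so it does not shift with DST, and that in this log GLIDEIN_ToDie - GLIDEIN_ToRetire is exactly 3600, which may be worth ruling out as the source of the 3600-second offset.

```python
# Values copied from the log excerpt in this message.
GLIDEIN_TO_RETIRE = 1721859509
GLIDEIN_TO_DIE = 1721863109
# time() logged at the second "Checkpoint exit", about 2 s before the
# following "Wrapper Start".
TIME_NEAR_RESTART = 1721863110


def start_time_clause(current_time, glidein_to_retire=None):
    """Hypothetical mirror of:
    (GLIDEIN_ToRetire =?= undefined) || (CurrentTime < GLIDEIN_ToRetire)
    """
    if glidein_to_retire is None:  # stands in for ClassAd 'undefined'
        return True
    return current_time < glidein_to_retire


# If CurrentTime matched Python's time(), the clause would be False here.
print(start_time_clause(TIME_NEAR_RESTART, GLIDEIN_TO_RETIRE))

# How far past GLIDEIN_ToRetire the restart happened, and the gap
# between the two pilot deadlines.
print(TIME_NEAR_RESTART - GLIDEIN_TO_RETIRE)
print(GLIDEIN_TO_DIE - GLIDEIN_TO_RETIRE)
```

If the clause is False for these values, the restart could only pass START if CurrentTime was more than an hour behind time(), or if START was not re-evaluated before the restart; the sketch cannot tell which.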
The .machine.ad reports MyCurrentTime, which looks consistent with
Python's time.time() at first start. I assume the .update.ad has the
updated value for that.
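
To compare MyCurrentTime against time.time() directly, a small parser along these lines might help. It is only a sketch: int_attrs and clock_skew are hypothetical helper names, the parser handles only simple "Name = integer" lines, and the MyCurrentTime value in SAMPLE_AD is made up for illustration (the two GLIDEIN values are from the log below).

```python
import re

# Matches simple integer attributes such as "MyCurrentTime = 1721848425".
AD_LINE = re.compile(r"^\s*(\w+)\s*=\s*(-?\d+)\s*$")


def int_attrs(ad_text):
    """Collect integer-valued attributes from ClassAd text (hypothetical helper)."""
    attrs = {}
    for line in ad_text.splitlines():
        match = AD_LINE.match(line)
        if match:
            attrs[match.group(1)] = int(match.group(2))
    return attrs


def clock_skew(ad_text, now):
    """Seconds by which `now` is ahead of MyCurrentTime, or None if it is absent."""
    my_time = int_attrs(ad_text).get("MyCurrentTime")
    return None if my_time is None else now - my_time


# Illustrative ad text; MyCurrentTime here is NOT taken from a real ad.
SAMPLE_AD = """\
MyCurrentTime = 1721848425
GLIDEIN_ToRetire = 1721859509
GLIDEIN_ToDie = 1721863109
Name = "slot1@example"
"""

print(clock_skew(SAMPLE_AD, 1721848430))
```

In a running job, the text would presumably be read from the file named by the _CONDOR_MACHINE_AD environment variable, or from .update.ad in the scratch directory, and `now` would be time.time().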
I copy here a few log lines from a successful job which restarted
several times in the same pilot that it had left before expiration.
Lines ending with "Checkpoint exit" are logged when the wrapper finds
GLIDEIN_ToDie - time() < 400. The wrapper then terminates the
executable and exits with checkpoint_exit_code = 85.
Lines with "Wrapper Start" report time in GMT; the other ones are in
local time.
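
The wrapper's decision can be sketched roughly as follows. This is not the actual wrapper, just a minimal Python illustration assuming the check is "fewer than 400 seconds left before GLIDEIN_ToDie"; the function names are hypothetical, while the threshold and exit code are the ones quoted above.

```python
CHECKPOINT_EXIT_CODE = 85  # checkpoint_exit_code quoted in this message
MARGIN_SECONDS = 400       # threshold quoted in this message


def seconds_to_die(now, glidein_to_die):
    """Seconds remaining before the pilot's GLIDEIN_ToDie deadline."""
    return glidein_to_die - now


def wrapper_exit_code(now, glidein_to_die, margin=MARGIN_SECONDS):
    """Return 85 when it is time to checkpoint, or None to keep running."""
    if seconds_to_die(now, glidein_to_die) < margin:
        return CHECKPOINT_EXIT_CODE
    return None


# Values from log line 205: 225 seconds were left, so the wrapper exits.
print(wrapper_exit_code(1721862884, 1721863109))
```

A real wrapper would call something like this periodically with time.time(), terminate the executable when the result is not None, and then pass the code to sys.exit().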
[sdalpra@igwn-sn outjobs]$ egrep -n -A3 'Wrapper Start|GLIDEIN_ToDie=' std_548.7.out
2-Wed Jul 24 19:13:45 2024: Wrapper Start
3-Wed Jul 24 21:13:45 2024: GLIDEIN_ToDie is 1721863109 = 07/24/24-23:18:29
4-Wed Jul 24 21:13:45 2024: GLIDEIN_ToRetire is 1721859509 = 07/24/24-22:18:29
--
205:Thu Jul 25 01:14:44 2024: time()=1721862884: GLIDEIN_ToDie=1721863109, GLIDEIN_ToRetire=1721859509. Terminating run_FH_analysis.sh and Checkpoint exit
207-Wed Jul 24 23:14:46 2024: Wrapper Start
208-Thu Jul 25 01:14:46 2024: GLIDEIN_ToDie is 1721863109 = 07/24/24-23:18:29
209-Thu Jul 25 01:14:46 2024: GLIDEIN_ToRetire is 1721859509 = 07/24/24-22:18:29
--
260:Thu Jul 25 01:18:30 2024: time()=1721863110: GLIDEIN_ToDie=1721863109, GLIDEIN_ToRetire=1721859509. Terminating run_FH_analysis.sh and Checkpoint exit
262-Wed Jul 24 23:18:32 2024: Wrapper Start
263-Thu Jul 25 01:18:32 2024: GLIDEIN_ToDie is 1721863109 = 07/24/24-23:18:29
264-Thu Jul 25 01:18:32 2024: GLIDEIN_ToRetire is 1721859509 = 07/24/24-22:18:29
--
315:Thu Jul 25 01:19:40 2024: time()=1721863180: GLIDEIN_ToDie=1721863109, GLIDEIN_ToRetire=1721859509. Terminating run_FH_analysis.sh and Checkpoint exit
317-Wed Jul 24 23:19:42 2024: Wrapper Start
318-Thu Jul 25 01:19:42 2024: GLIDEIN_ToDie is 1721863109 = 07/24/24-23:18:29
319-Thu Jul 25 01:19:42 2024: GLIDEIN_ToRetire is 1721859509 = 07/24/24-22:18:29
--
370:Thu Jul 25 01:19:49 2024: time()=1721863189: GLIDEIN_ToDie=1721863109, GLIDEIN_ToRetire=1721859509. Terminating run_FH_analysis.sh and Checkpoint exit
372-Wed Jul 24 23:24:13 2024: Wrapper Start
373-Wed Jul 24 16:24:13 2024: GLIDEIN_ToDie is 1721889127 = 07/25/24-06:32:07
374-Wed Jul 24 16:24:13 2024: GLIDEIN_ToRetire is 1721885527 = 07/25/24-05:32:07
One problem here is that there is a sort of "race condition"
between the wrapper exiting for checkpoint and the pilot being
terminated when time() > GLIDEIN_ToDie.
If the latter wins, the job fails and restarts from scratch.
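
To put numbers on that race, the following Python sketch extracts, from the four "Checkpoint exit" lines quoted above, how many seconds were left before GLIDEIN_ToDie at each exit. The log fragments are copied from this message; the helper name slack_seconds is hypothetical.

```python
import re

# Relevant part of the four "Checkpoint exit" lines from the log above.
LOG_LINES = [
    "205:Thu Jul 25 01:14:44 2024: time()=1721862884: GLIDEIN_ToDie=1721863109,",
    "260:Thu Jul 25 01:18:30 2024: time()=1721863110: GLIDEIN_ToDie=1721863109,",
    "315:Thu Jul 25 01:19:40 2024: time()=1721863180: GLIDEIN_ToDie=1721863109,",
    "370:Thu Jul 25 01:19:49 2024: time()=1721863189: GLIDEIN_ToDie=1721863109,",
]

PATTERN = re.compile(r"time\(\)=(\d+): GLIDEIN_ToDie=(\d+)")


def slack_seconds(lines):
    """Return GLIDEIN_ToDie - time() for every line that carries both values."""
    slacks = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            now, to_die = int(match.group(1)), int(match.group(2))
            slacks.append(to_die - now)
    return slacks


SLACKS = slack_seconds(LOG_LINES)
print(SLACKS)  # -> [225, -1, -71, -80]
```

A negative value means the checkpoint exit was logged after GLIDEIN_ToDie had already passed, so only the first exit in this excerpt had any margin left.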
Thanks for any hints,
Stefano