
[HTCondor-users] Checkpointing job restarts in expiring pilot



Hi,

I'm testing Self-Checkpointing Jobs. I had it working, but I noticed the job repeatedly restarting in an expiring pilot,
i.e. when time() > GLIDEIN_ToRetire (time() here comes from Python).
I suspect there might be some time misalignment, possibly due to Daylight Saving Time.

The START expression, from .machine.ad, has:

START = <stuff> && ((GLIDEIN_ToRetire =?= undefined) || (CurrentTime < GLIDEIN_ToRetire))

I do not know the value of CurrentTime, but inspecting job output logs, Python time() is 3600 + GLIDEIN_ToRetire + <few random secs>
when it restarts after a previous exit for checkpoint. That makes me think that CurrentTime might not be DST aware.
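
For reference, a minimal Python sketch of the comparison, using the epoch values copied from the log excerpt further down. time.time() counts seconds since the Unix epoch in UTC, so it does not shift with DST or with the local time zone. The sketch only checks the arithmetic on the logged numbers; it says nothing about how CurrentTime is evaluated inside the START expression. The names as_utc, retire_to_die_gap and past_retire are mine, for illustration only.

```python
from datetime import datetime, timezone

# Epoch values copied from the job log in this message.
GLIDEIN_TO_RETIRE = 1721859509
GLIDEIN_TO_DIE = 1721863109
LOGGED_TIME = 1721862884  # time() printed just before the first checkpoint exit


def as_utc(epoch):
    """Format an epoch value as UTC, in the same style as the wrapper log."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%m/%d/%y-%H:%M:%S")


# Gap between the two pilot deadlines, in seconds.
retire_to_die_gap = GLIDEIN_TO_DIE - GLIDEIN_TO_RETIRE
# How far past GLIDEIN_ToRetire the job was still running, in seconds.
past_retire = LOGGED_TIME - GLIDEIN_TO_RETIRE

print("ToRetire:", as_utc(GLIDEIN_TO_RETIRE))
print("ToDie:   ", as_utc(GLIDEIN_TO_DIE))
print("ToDie - ToRetire =", retire_to_die_gap)
print("time() - ToRetire =", past_retire)
```

With these numbers GLIDEIN_ToDie is exactly 3600 seconds after GLIDEIN_ToRetire, so an offset of about one hour past GLIDEIN_ToRetire could also simply mean that the restart happens close to GLIDEIN_ToDie.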

The .machine.ad reports MyCurrentTime, which looks consistent with Python time.time() at first start. I assume the .update.ad has the updated value for that.
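
A rough sketch of how the integer attributes can be pulled out of the ad file from Python. The helper read_ad_ints is hypothetical (not part of HTCondor); it assumes the file contains one "Name = Value" pair per line, and that the path is available through the _CONDOR_MACHINE_AD environment variable, falling back to .machine.ad in the current directory.

```python
import os


def read_ad_ints(path, names):
    """Return {name: int} for the requested attributes found in an ad file.

    Assumes one 'Name = Value' pair per line; lines whose value is not a
    plain integer (expressions, strings) are skipped.
    """
    found = {}
    with open(path) as ad:
        for line in ad:
            key, sep, value = line.partition("=")
            key = key.strip()
            if sep and key in names:
                try:
                    found[key] = int(value.strip())
                except ValueError:
                    pass  # not a plain integer, ignore
    return found


# Assumption: the job environment points to the machine ad; otherwise look
# for .machine.ad in the current directory.
ad_path = os.environ.get("_CONDOR_MACHINE_AD", ".machine.ad")
if os.path.exists(ad_path):
    wanted = {"GLIDEIN_ToRetire", "GLIDEIN_ToDie", "MyCurrentTime"}
    print(read_ad_ints(ad_path, wanted))
```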

I copy here a few log lines from a successful job which restarted several times in the same pilot that it had left before expiration.
Lines ending with "Checkpoint exit" are when the wrapper finds that fewer than 400 seconds are left before GLIDEIN_ToDie. The wrapper terminates the executable and exits with checkpoint_exit_code = 85.
Lines with "Wrapper Start" report time in GMT. The other ones are in local time.
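
The check made by the wrapper can be sketched roughly as follows. This is not the actual wrapper code: the function names are made up for illustration, and the condition is read as "fewer than 400 seconds left before GLIDEIN_ToDie".

```python
import time

CHECKPOINT_EXIT_CODE = 85  # checkpoint_exit_code used by the wrapper
MARGIN_SECONDS = 400       # safety margin before GLIDEIN_ToDie


def seconds_left(to_die, now=None):
    """Seconds remaining before GLIDEIN_ToDie (negative once it has passed)."""
    if now is None:
        now = time.time()
    return to_die - now


def should_checkpoint(to_die, now=None, margin=MARGIN_SECONDS):
    """True when the wrapper should stop the executable and exit for checkpoint."""
    return seconds_left(to_die, now) < margin


# Example with the values from the first "Checkpoint exit" line of the log:
# 1721863109 - 1721862884 = 225 seconds left, which is below the margin.
if should_checkpoint(1721863109, now=1721862884):
    print("would terminate the executable and exit with", CHECKPOINT_EXIT_CODE)
```

Once time() is past GLIDEIN_ToDie the remaining time is negative, so a job restarted in the same pilot hits the check again almost immediately.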

[sdalpra@igwn-sn outjobs]$ egrep -n -A3 'Wrapper Start|GLIDEIN_ToDie=' std_548.7.out
2-Wed Jul 24 19:13:45 2024: Wrapper Start
3-Wed Jul 24 21:13:45 2024: GLIDEIN_ToDie is 1721863109 = 07/24/24-23:18:29
4-Wed Jul 24 21:13:45 2024: GLIDEIN_ToRetire is 1721859509 = 07/24/24-22:18:29
--
205:Thu Jul 25 01:14:44 2024: time()=1721862884: GLIDEIN_ToDie=1721863109, GLIDEIN_ToRetire=1721859509. Terminating run_FH_analysis.sh and Checkpoint exit
207-Wed Jul 24 23:14:46 2024: Wrapper Start
208-Thu Jul 25 01:14:46 2024: GLIDEIN_ToDie is 1721863109 = 07/24/24-23:18:29
209-Thu Jul 25 01:14:46 2024: GLIDEIN_ToRetire is 1721859509 = 07/24/24-22:18:29
--
260:Thu Jul 25 01:18:30 2024: time()=1721863110: GLIDEIN_ToDie=1721863109, GLIDEIN_ToRetire=1721859509. Terminating run_FH_analysis.sh and Checkpoint exit
262-Wed Jul 24 23:18:32 2024: Wrapper Start
263-Thu Jul 25 01:18:32 2024: GLIDEIN_ToDie is 1721863109 = 07/24/24-23:18:29
264-Thu Jul 25 01:18:32 2024: GLIDEIN_ToRetire is 1721859509 = 07/24/24-22:18:29
--
315:Thu Jul 25 01:19:40 2024: time()=1721863180: GLIDEIN_ToDie=1721863109, GLIDEIN_ToRetire=1721859509. Terminating run_FH_analysis.sh and Checkpoint exit
317-Wed Jul 24 23:19:42 2024: Wrapper Start
318-Thu Jul 25 01:19:42 2024: GLIDEIN_ToDie is 1721863109 = 07/24/24-23:18:29
319-Thu Jul 25 01:19:42 2024: GLIDEIN_ToRetire is 1721859509 = 07/24/24-22:18:29
--
370:Thu Jul 25 01:19:49 2024: time()=1721863189: GLIDEIN_ToDie=1721863109, GLIDEIN_ToRetire=1721859509. Terminating run_FH_analysis.sh and Checkpoint exit
372-Wed Jul 24 23:24:13 2024: Wrapper Start
373-Wed Jul 24 16:24:13 2024: GLIDEIN_ToDie is 1721889127 = 07/25/24-06:32:07
374-Wed Jul 24 16:24:13 2024: GLIDEIN_ToRetire is 1721885527 = 07/25/24-05:32:07


One problem here is that there is a sort of "race condition" between the wrapper exiting for checkpoint and the pilot being terminated when time() > GLIDEIN_ToDie.
If the latter wins, the job fails and restarts from scratch.

Thanks for any hint
Stefano