Subject: [HTCondor-users] Checkpointing job restarts in expiring pilot
Hi,
I'm testing Self-Checkpointing Jobs. I had it working; however, I
noticed it repeatedly restarting into an expiring pilot,
i.e. when time() > GLIDEIN_ToRetire (time() here comes from
python).
I suspect there might be some time misalignment, possibly due to
Daylight Saving Time.
I do not know the value of CurrentTime, but inspecting job output
logs, python time() is 3600 + GLIDEIN_ToRetire + <few random
secs>
when it restarts after a previous exit for checkpoint. That makes me
think that CurrentTime might not be DST aware.
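
To make the comparison concrete, here is a minimal sketch of the check I am doing by hand on the logs. The function names, the 120-second slack and any numbers passed in are illustrative assumptions, not taken from my wrapper; only GLIDEIN_ToRetire is a real attribute. Note that python time.time() returns seconds since the epoch and is not itself shifted by DST, so an offset of about one hour would have to come from the other side of the comparison.

```python
import time


def seconds_past_retire(glidein_to_retire, now=None):
    """Return how many seconds `now` is past GLIDEIN_ToRetire.

    A negative result means the retire time has not been reached yet.
    """
    if now is None:
        # Epoch seconds; this value does not change with DST.
        now = time.time()
    return int(now) - int(glidein_to_retire)


def looks_like_one_hour_shift(delta, slack=120):
    """True if `delta` is one hour plus at most `slack` seconds.

    `slack` is a hypothetical allowance for the "few random secs".
    """
    return 3600 <= delta <= 3600 + slack
```

Feeding the restart timestamps from the job logs into these two helpers is enough to see whether every restart lands about 3600 seconds after GLIDEIN_ToRetire.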
The .machine.ad reports MyCurrentTime which looks consistent with
python time.time() at first start. I assume the .update.ad has the
updated value for that.
I copy here a few log lines from a successful job which restarted
several times in the same pilot that it left before expiration.
Lines ending with "Checkpoint exit" are when the wrapper evaluates
GLIDEIN_ToDie < 400. The wrapper terminates the executable and
exits with checkpoint_exit_code = 85
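
The wrapper check is roughly the following sketch. It assumes that "GLIDEIN_ToDie < 400" means fewer than 400 seconds remain before GLIDEIN_ToDie, and that the ad is available as plain "Name = value" lines (as in the .machine.ad file). The helper names are hypothetical and this is not the actual wrapper code.

```python
import time

CHECKPOINT_EXIT_CODE = 85  # checkpoint_exit_code from the submit file
MARGIN_SECS = 400          # threshold used in the wrapper check


def read_int_attr(ad_text, name):
    """Return the integer value of `name` from "Name = value" lines, or None."""
    for line in ad_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == name:
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


def should_checkpoint(ad_text, now=None, margin=MARGIN_SECS):
    """True if fewer than `margin` seconds remain before GLIDEIN_ToDie."""
    to_die = read_int_attr(ad_text, "GLIDEIN_ToDie")
    if to_die is None:
        return False  # attribute missing: nothing to compare against
    if now is None:
        now = time.time()
    return to_die - now < margin
```

When should_checkpoint() returns True, the wrapper stops the executable and calls sys.exit(CHECKPOINT_EXIT_CODE).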
Lines with "Wrapper Start" report time in GMT. The other ones in
Local Time.
One problem here is that there is a sort of "race condition"
between the wrapper exiting for checkpoint and the pilot being
terminated when time() > GLIDEIN_ToDie.
If the latter wins, the job fails and restarts from scratch.
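
A rough way to reason about this race, as a sketch only: if the wrapper polls every poll_interval seconds and needs shutdown_secs to stop the executable and exit, then in the worst case the threshold is crossed right after a poll, so the margin has to exceed the sum of the two. Both parameter names and any values used with them are hypothetical; I have not measured either quantity.

```python
def min_safe_margin(poll_interval, shutdown_secs):
    """Smallest margin (seconds before GLIDEIN_ToDie) that still lets the
    wrapper finish its checkpoint exit in the worst case.

    Worst case: the remaining time drops below the margin just after a
    poll, so it is only noticed poll_interval seconds later, and the
    wrapper then needs shutdown_secs more to terminate and exit.
    """
    return poll_interval + shutdown_secs


def wrapper_wins(margin, poll_interval, shutdown_secs):
    """True if the wrapper exits before the pilot is terminated."""
    return margin > min_safe_margin(poll_interval, shutdown_secs)
```

With the current 400-second margin, the wrapper should win as long as the polling interval plus the shutdown time stays below 400 seconds.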