
Re: [HTCondor-users] Checkpointing job restarts in expiring pilot



Thanks Todd,

On 29/07/24 22:02, Todd L Miller via HTCondor-users wrote:
The starter doesn't evaluate the startd's START expression when "restarting" self-checkpointing jobs.
In this case the self-checkpointing job most likely restarts in the same place it just left (at least, this is what I observe), even when the remaining lifetime of the glidein is too short to do anything useful. It could then be evicted before it exits to checkpoint again.

Is there any possible way to prevent this?

I haven't been able to reproduce any problems with CurrentTime being wrong when evaluating the START expression.

It's also dangerous, generally, to assume that file transfer of your job can complete during retirement time (although in this case it looks like the glide-in is giving you an hour's warning?)

My understanding of the logic is that no "new" job starts when time() > GLIDEIN_ToRetire, which is 1 hour less than GLIDEIN_ToDie.
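
To make the timing concrete, here is a minimal Python sketch of that logic as I understand it. It is not HTCondor or glideinWMS code: the function names, the one-hour margin constant and the `seconds_to_next_checkpoint` parameter are assumptions made up for illustration. The second check is the kind of test I would like to see applied before a self-checkpointing job restarts in the same slot.

```python
# Illustrative sketch only; names and the fixed margin are assumptions,
# not part of HTCondor or glideinWMS.

RETIRE_MARGIN_S = 3600  # assumed: GLIDEIN_ToRetire = GLIDEIN_ToDie - 1 hour


def glidein_to_retire(glidein_to_die: int, margin: int = RETIRE_MARGIN_S) -> int:
    """Time after which the pilot should accept no new job."""
    return glidein_to_die - margin


def accepts_new_job(now: int, glidein_to_die: int) -> bool:
    """No "new" job starts once now > GLIDEIN_ToRetire."""
    return now <= glidein_to_retire(glidein_to_die)


def worth_restarting(now: int, glidein_to_die: int,
                     seconds_to_next_checkpoint: int) -> bool:
    """Hypothetical check: restart only if the job can reach its next
    checkpoint before the pilot dies."""
    return now + seconds_to_next_checkpoint <= glidein_to_die


if __name__ == "__main__":
    to_die = 10_000
    now = 9_000  # already past ToRetire (6400), 1000 s before ToDie
    print(accepts_new_job(now, to_die))         # a new job would be refused
    print(worth_restarting(now, to_die, 1800))  # a restart would not pay off either
```

In the situation I observe, the first check is applied to new jobs only, while the restarted job is subject to neither.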


-- you have no a-priori idea how long your job will sit in the AP's transfer queue, even if the actual data transfer takes almost no time at all. This is why we have self-checkpointing jobs at all (because otherwise you would just set ON_EXIT_OR_REMOVE and the signal sent appropriately).

Stefano



-- ToddM