Re: [HTCondor-users] Checkpointing job restarts in expiring pilot
- Date: Sat, 3 Aug 2024 15:59:52 +0200
- From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Checkpointing job restarts in expiring pilot
Thanks Todd,
On 29/07/24 22:02, Todd L Miller via HTCondor-users wrote:
    The starter doesn't evaluate the startd's START expression when
"restarting" self-checkpointing jobs.
In this case the self-checkpointing job most likely restarts in the same
place it just left (at least, this is what I observe) regardless of the
remaining lifetime of the glidein being too short to do anything useful.
Then, it could probably be evicted before exiting for checkpointing again.
Is there any possible way to prevent this?
    I haven't been able to reproduce any problems with CurrentTime
being wrong when evaluating the START expression.
    It's also dangerous, generally, to assume that file transfer of
your job can complete during retirement time (although in this case it
looks like the glide-in is giving you an hour's warning?)
My understanding of the logic is that no "new" job starts when time() >
GLIDEIN_ToRetire, which is 1 hour earlier than GLIDEIN_ToDie.
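
To make that rule concrete, here is a rough Python sketch. It is only an illustration, not HTCondor or glideinWMS code: the function names, the fixed one-hour margin, and the checkpoint-interval parameter are assumptions made for the example. The second function shows the kind of check that would be needed to avoid restarting a self-checkpointing job in a pilot that is about to die.

```python
# Illustrative model only; these helpers are hypothetical and do not
# correspond to any HTCondor or glideinWMS API.

RETIRE_MARGIN_S = 3600  # assumption: GLIDEIN_ToRetire = GLIDEIN_ToDie - 1 hour


def to_retire(to_die: int, margin_s: int = RETIRE_MARGIN_S) -> int:
    """Return the retire timestamp derived from the die timestamp."""
    return to_die - margin_s


def may_start_new_job(now: int, to_die: int) -> bool:
    """No "new" job starts once now > GLIDEIN_ToRetire."""
    return now <= to_retire(to_die)


def worth_restarting(now: int, to_die: int, checkpoint_interval_s: int) -> bool:
    """Hypothetical check for a restarting self-checkpointing job:
    is there enough lifetime left to reach the next checkpoint?"""
    return (to_die - now) >= checkpoint_interval_s


if __name__ == "__main__":
    to_die = 10_000  # arbitrary example timestamp, in seconds
    print(may_start_new_job(6_400, to_die))        # at the retire time
    print(may_start_new_job(6_401, to_die))        # just past the retire time
    print(worth_restarting(9_000, to_die, 1_800))  # 1000 s left, 1800 s needed
```

The point of the second check is that a restart is not treated as a "new" job, so nothing compares the remaining pilot lifetime with the time the job needs before its next checkpoint.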
-- you have no a-priori idea how long your job will sit in the AP's
transfer queue, even if the actual data transfer takes almost no time
at all. This is why we have self-checkpointing jobs at all (because
otherwise you would just set ON_EXIT_OR_REMOVE and the signal sent
appropriately).
Stefano
-- ToddM
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/