Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Checkpointing job restarts in expiring pilot

Date: Sat, 3 Aug 2024 15:59:52 +0200
From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Checkpointing job restarts in expiring pilot

Thanks Todd,

On 29/07/24 22:02, Todd L Miller via HTCondor-users wrote:

    The starter doesn't evaluate the startd's START expression when "restarting" self-checkpointing jobs.

In this case the self-checkpointing job most likely restarts in the same place it just left (at least, this is what I observe), even when the remaining lifetime of the glidein is too short to do anything useful. It could then be evicted before it exits to checkpoint again.


Is there any possible way to prevent this?

    I haven't been able to reproduce any problems with CurrentTime being wrong when evaluating the START expression.
    It's also dangerous, generally, to assume that file transfer of your job can complete during retirement time (although in this case it looks like the glide-in is giving you an hour's warning?)

My understanding of the logic is that no "new" job starts when time() > GLIDEIN_ToRetire, which is 1 hour less than GLIDEIN_ToDie.
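
As a rough sketch of that logic as I understand it (the function names and the RETIRE_MARGIN constant are hypothetical, made up for illustration, and are not part of HTCondor or glideinWMS; the one-hour gap is the value mentioned above):

```python
import time

# Assumed gap between GLIDEIN_ToRetire and GLIDEIN_ToDie, in seconds
# (one hour, as described in this thread; not read from any real config).
RETIRE_MARGIN = 3600


def may_start_new_job(glidein_to_retire, now=None):
    """Return True while the pilot would still accept a "new" job.

    Mirrors the rule: no new job starts when time() > GLIDEIN_ToRetire.
    """
    if now is None:
        now = time.time()
    return now <= glidein_to_retire


def remaining_lifetime(glidein_to_die, now=None):
    """Seconds left before the pilot dies (never negative)."""
    if now is None:
        now = time.time()
    return max(0, glidein_to_die - now)


# Illustrative timestamps only.
to_die = 10000
to_retire = to_die - RETIRE_MARGIN
print(may_start_new_job(to_retire, now=6000))
print(may_start_new_job(to_retire, now=7000))
print(remaining_lifetime(to_die, now=7000))
```

The point is that a restarting self-checkpointing job is not subject to the first check, so it can resume with only `remaining_lifetime()` seconds left, whatever that happens to be.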

-- you have no a-priori idea how long your job will sit in the AP's transfer queue, even if the actual data transfer takes almost no time at all. This is why we have self-checkpointing jobs at all (because otherwise you would just set ON_EXIT_OR_REMOVE and the signal sent appropriately).


Stefano



-- ToddM
