Re: [HTCondor-users] Checkpoints in HTCondor
- Date: Thu, 09 Jan 2020 10:58:44 -0600 (CST)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Checkpoints in HTCondor
It works as documented in the 8.9 manual under "Self-Checkpointing
Applications," except that the "CheckpointExitCode" attribute should be
"SuccessCheckpointExitCode."
This will, of course, be fixed in the next release. :)
> It's hard to even tell it has happened without looking at it carefully.
> The NumJobStarts attribute and other checkpoint-related attributes don't
> even change - but that may just be an incomplete implementation of an
> undocumented feature in my 8.8.7 installation.
I'm curious to know what people think of this decision. The
exit-and-restart system arose from the need to make sure that (a) the
application was done writing its checkpoint and (b) the application
wouldn't try to update the checkpoint until HTCondor was done transferring
it off the local machine (back to the schedd's SPOOL directory,
presently). As such, I've been regarding it as an implementation detail,
where incrementing NumJobStarts would actually be less useful and more
confusing than retaining its current value.
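For anyone following along, the exit-and-restart cycle described above can be sketched roughly as follows. This is only an illustration of the application's side of the pattern, not HTCondor code: the exit code 85, the file name in the usage note, and the function names are all assumptions made for the example. The one real requirement is that the exit code the application uses matches the job's SuccessCheckpointExitCode attribute.

```python
import json
import os

# Assumed value for illustration; it must match the job's
# SuccessCheckpointExitCode attribute.
CHECKPOINT_EXIT_CODE = 85


def load_state(path):
    """Return the saved state, or a fresh one on the very first start."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(path, state):
    """Write the checkpoint fully, then move it into place in one step."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers never see a half-written checkpoint


def run(path, total_steps, checkpoint_every):
    """Resume from the last checkpoint and return this run's exit code."""
    state = load_state(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["total"] += state["step"]  # stand-in for real work
        if state["step"] % checkpoint_every == 0 and state["step"] < total_steps:
            save_state(path, state)
            # Exit here: (a) the checkpoint is complete, and (b) the
            # application cannot touch it while it is being transferred.
            return CHECKPOINT_EXIT_CODE
    save_state(path, state)
    return 0  # normal completion
```

A real job would end with something like `sys.exit(run("state.json", 10, 3))` (the file name is hypothetical). Each time the process exits with the checkpoint code, it is started again and picks up from the saved step, which is why the restarts look like one continuous job from the outside.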
The other checkpoint-related attributes were inherited from the
(removed) standard universe, and we intend to remove them as we clean up
the code. Do let us know if any of those specific attributes are of
interest for self-checkpointing vanilla-universe jobs, or what other
attributes you find use cases for.
- ToddM