Subject: Re: [HTCondor-users] using idle computers in computer labs for CFD jobs
From: Peter Ellevseth <Peter.Ellevseth@xxxxxxxxxx> Date: 03/01/2016 04:49 AM
> I have not done that much research on checkpointing yet, so forgive my
> ignorance. I just have a question on the concept of checkpointing. Is the
> point to just give some sort of initial value to the job, or does
> checkpointing involve some sort of memory dump from where the simulation
> can continue?
>
> For example, if running a set of CFD jobs with Ansys, it is possible to
> tell the solver to 'start from this file'. Will that be considered checkpointing?
> From what I read here:
> http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/
> this seems to be the case.
>
> What is then the process of checkpointing? When the job gets a vacate
> signal, will it then run some checkpointing routine? Or will it always
> check for checkpoint information when a job starts?
Hi Peter,
Since you said "CFD" and "Ansys,"
I suppose it's safe to assume that you mean "Fluent." ;-)
I put some thought into this question just recently
as it happens (although I don't yet have any pools running Fluent), and here's
the gist of what I came up with.
The Fluent docs indicate that a checkpoint is triggered
by the presence of a flag file in /tmp - namely /tmp/check-fluent or /tmp/exit-fluent.
When Fluent checkpoints, it runs to the end of the current iteration
and then saves a "case" and a "data" file containing
its forward progress. It then either continues running or exits, depending on which flag
file it found. When it restarts, if it finds a valid case and data file,
it picks up where it left off.
Of course, the default use of /tmp presumes that there
are no other instances of Fluent running on the machine in question, which
would not necessarily be the case for an HTCondor exec node - they may not
even be your own Fluent runs.
This is where the MOUNT_UNDER_SCRATCH knob on Linux comes into play. By
listing "/tmp" and "/var/tmp" in that setting, each job gets its own private
/tmp, with /tmp looped back into $_CONDOR_SCRATCH_DIR/tmp. Then when the flag
file /tmp/check-fluent is created, it's actually stored in the job's scratch
directory and visible only to that job.
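In the execute nodes' configuration that would be something like (a plain
comma-separated list of directories):

MOUNT_UNDER_SCRATCH = /tmp,/var/tmp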
Now, this type of checkpoint is distinct from the
standard universe's checkpoint, as it's managed internally by the application rather
than the standard universe wrapper applied by condor_compile. For Fluent and
similar applications which can't be relinked in this way, we need to figure out
how to signal Fluent itself to checkpoint periodically.
The <Keyword>_HOOK_UPDATE_JOB_INFO job hook looked to be a good way
to do this. There may be more clever ideas proffered by the talented folks
on the list, but we can both look forward to those. This hook runs eight
seconds after the job starts and once every five minutes after that, for as
long as the job is running on a machine.
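Wiring it up would look roughly like this in the execute node's
configuration - the "FLUENT" keyword and the script path are just
placeholders I'm inventing here:

STARTD_JOB_HOOK_KEYWORD = FLUENT
FLUENT_HOOK_UPDATE_JOB_INFO = /usr/local/libexec/fluent_checkpoint_hook

(If you'd rather pick the keyword per job, I believe +HookKeyword = "FLUENT"
in the submit description does the same thing.)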
Needless to say, we don't want the script to trigger a Fluent checkpoint
every five minutes, so it can use the job ClassAd it's handed to check the
JobCurrentStartDate attribute to see if enough time has elapsed to take a
first checkpoint, and/or look at an existing checkpoint's age to see if it's
due for another.
We could also have it look for a "FluentCheckpointInterval"
attribute in the job ClassAd, so we could say something like this in
the submit description:
+FluentCheckpointInterval = 2700
... to tell the hook script that it should checkpoint every 45 minutes
(2700 seconds). Maybe it would default to once an hour.
At the appointed time based on the start time or age
of the prior case and data files, the script would simply create the check-fluent
file in the job's loop-mounted /tmp directory and exit, thus triggering
the Fluent internal checkpoint.
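Purely as an untested sketch - the attribute names and the hourly default
come from above, and whether the hook actually sees the job's loop-mounted
/tmp or needs to write under $_CONDOR_SCRATCH_DIR/tmp directly is something
to verify - that update hook could look roughly like this in Python:

#!/usr/bin/env python3
# Sketch of a FLUENT_HOOK_UPDATE_JOB_INFO script.
# The starter hands the hook a copy of the job ClassAd on standard input;
# a naive "Attr = value" parse is enough for the attributes used here.
import os, sys, time

ad = {}
for line in sys.stdin:
    name, sep, value = line.partition("=")
    if sep:
        ad[name.strip()] = value.strip().strip('"')

now = time.time()
start = float(ad.get("JobCurrentStartDate", now))
try:
    interval = float(ad.get("FluentCheckpointInterval", 3600))
except ValueError:
    interval = 3600.0  # fall back to hourly if the attribute isn't a plain number

# With MOUNT_UNDER_SCRATCH, the job's private /tmp lives under
# $_CONDOR_SCRATCH_DIR/tmp.
scratch = os.environ.get("_CONDOR_SCRATCH_DIR", "")
tmpdir = os.path.join(scratch, "tmp") if scratch else "/tmp"
stamp = os.path.join(scratch or "/tmp", ".last-fluent-checkpoint")

last = os.path.getmtime(stamp) if os.path.exists(stamp) else start
if now - last >= interval:
    # Creating this flag file tells Fluent to checkpoint at the end of the
    # current iteration and keep running.
    open(os.path.join(tmpdir, "check-fluent"), "w").close()
    open(stamp, "w").close()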
To preserve the checkpoint across runs, you'd of course
need to set this parameter in your submit description:
when_to_transfer_output = ON_EXIT_OR_EVICT
This will cause HTCondor to save your scratch directory
when eviction occurs, allowing Fluent to find the previously-created case
and data files and pick up where that checkpoint left off.
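Pulling the submit-side pieces together, the relevant part of the submit
description might then look something like this (the executable name and the
custom attributes are just the scheme sketched above, not anything built in):

universe                  = vanilla
executable                = run_fluent.sh
should_transfer_files     = YES
when_to_transfer_output   = ON_EXIT_OR_EVICT
+HookKeyword              = "FLUENT"
+FluentCheckpointInterval = 2700
queue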
This covers periodic checkpointing, but ideally we'd also have on-demand
checkpointing, so that the job could be told to checkpoint during the
eviction process rather than losing up to a full checkpoint interval's worth
of work.
At first glance you'd think this could be handled
by defining a "FLUENT_HOOK_EVICT_CLAIM" hook, which instead
of creating a "check-fluent" file, would create the "exit-fluent" file.
However, unlike the job status hook, the evict claim hook runs under the user ID of the condor_startd,
which would usually be the "condor" user. This means that the
hook script wouldn't have access to the job's scratch directory, and thus couldn't
create the flag file in the scratch-looped /tmp directory.
It's possible to configure Fluent to look for the flag files in some other
location, so that might offer a path forward.
I think the alternative would be a wrapper script around the Fluent
executable that recognizes HTCondor's eviction signal (SIGTERM, by default)
and creates the exit-fluent flag file when that signal arrives.
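A rough, untested sketch of such a wrapper in Python - it assumes the
soft-kill signal really is SIGTERM (i.e., kill_sig hasn't been changed) and
that Fluent is launched with whatever arguments your jobs normally use:

#!/usr/bin/env python3
# Sketch of a wrapper: start Fluent, and when HTCondor's eviction signal
# arrives, drop the exit-fluent flag file so Fluent checkpoints and exits
# cleanly on its own.
import signal, subprocess, sys

def request_exit(signum, frame):
    # With MOUNT_UNDER_SCRATCH this /tmp is private to the job.
    open("/tmp/exit-fluent", "w").close()

signal.signal(signal.SIGTERM, request_exit)

# Pass through whatever Fluent arguments the submit description supplies.
proc = subprocess.Popen(["fluent"] + sys.argv[1:])
sys.exit(proc.wait())

Whether Fluent itself also sees the signal, and whether the eviction grace
period is long enough for it to finish the current iteration and write the
case and data files, are details that would need testing.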
Also, if I've missed a newer version of the Fluent documentation which
indicates that Fluent can checkpoint in response to a signal rather than the
flag files, that could be another option.