Re: [HTCondor-users] using idle computers in computer labs for CFD jobs
- Date: Tue, 01 Mar 2016 12:28:34 -0500
- From: Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] using idle computers in computer labs for CFD jobs
From: Peter Ellevseth <Peter.Ellevseth@xxxxxxxxxx>
Date: 03/01/2016 04:49 AM
> I have not done that much research on checkpointing yet, so forgive my
> ignorance. I just have a question on the concept of checkpointing. Is the
> point to just give some sort of initial value to the job, or does
> checkpointing involve some sort of memory dump from where the simulation
> can continue?
>
> For example, if running a set of CFD jobs with Ansys, it is possible to
> tell the solver to 'start from this file'. Will that be considered
> checkpointing? From what I read here:
> http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/
> this seems to be the case.
>
> What is then the process of checkpointing? When the job gets a vacate
> signal, will it then run some checkpointing routine? Or will it always
> check for checkpoint information when a job starts?
Hi Peter,
Since you said "CFD" and "Ansys," I suppose it's safe to assume that you
mean "Fluent." ;-)
I put some thought into this question just recently as it happens (although
I don't yet have any pools running Fluent), and here's the gist of what I
came up with.
The Fluent docs indicate that a checkpoint is triggered by the presence of a
flag file in /tmp - namely /tmp/check-fluent or /tmp/exit-fluent. When Fluent
checkpoints, it runs to the end of the current iteration and then saves a
"case" and a "data" file containing its forward progress. It then either
continues running or exits, depending on which flag file it found. When it
restarts, if it finds a valid case and data file, it picks up where it left
off.
Of course, the default use of /tmp presumes that there are no other instances
of Fluent running on the machine in question, which would not necessarily be
the case for an HTCondor exec node - they may not even be your own Fluent
runs.
This is where the MOUNT_UNDER_SCRATCH knob on Linux comes into play. By
specifying "/tmp" and "/var/tmp" for this config, each job gets its own /tmp
directory, by having /tmp looped back into $_CONDOR_SCRATCH_DIR/tmp. Then
when the flag file is created in /tmp/check-fluent, it will actually be
stored in the job's scratch directory and be visible only to that job.
Now, this type of checkpoint is distinct from the standard universe's
checkpoint, as it's managed internally by the application rather than by the
standard universe wrapper applied by condor_compile. For Fluent and similar
applications which can't be relinked in this way, we need to figure out how
to signal Fluent itself to checkpoint periodically.
The <Keyword>_HOOK_UPDATE_JOB_INFO hook looked to be a good way to do this.
There may be more clever ideas proffered by the talented folks on the list,
but we can both look forward to those. By default, this hook runs eight
seconds after startup and once every five minutes after that, while the job
is running on a machine.
In our submit description, we'd have:
+HookKeyword = "FLUENT"
In our pool configuration, we'd have:
FLUENT_HOOK_UPDATE_JOB_INFO = $(LIBEXEC)/fluent_periodic_checkpoint
Our script will then be run eight seconds after startup and every five
minutes after that. Needless to say, we don't want it to trigger a Fluent
checkpoint every five minutes, so we can use the job ClassAd provided to the
script to check the JobCurrentStartDate attribute to see if enough time has
elapsed to take a first checkpoint, and/or look for an existing checkpoint to
see if it's old enough yet.
We could also have it look for a "FluentCheckpointInterval" attribute in the
job ClassAd, so we could say something like this in the submit description:
+FluentCheckpointInterval = 45 * $(MINUTE)
... to tell the hook script that it should checkpoint every 45 minutes. Maybe
it would default to once an hour.
At the appointed time based on the start time or age of the prior case and
data files, the script would simply create the check-fluent file in the job's
loop-mounted /tmp directory and exit, thus triggering the Fluent internal
checkpoint.
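
The decision logic of such a hook script might be sketched in Python roughly
as follows. This is only an illustration under several assumptions of mine:
that the job ClassAd arrives on the script's standard input as plain
"Attr = value" lines, that the case and data files end in ".cas" and ".dat"
and sit in the script's working directory, and that the hook can see the
job's private /tmp. The function names are hypothetical, and only integer
intervals or simple products such as "45 * 60" are understood.

```python
#!/usr/bin/env python3
"""Hypothetical fluent_periodic_checkpoint hook: a sketch, not a tested hook."""
import os
import re
import sys
import time

DEFAULT_INTERVAL = 3600  # seconds; the "once an hour" default suggested above


def parse_classad(text):
    """Parse 'Attr = value' lines into a dict keyed by lowercased attribute name."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip().lower()] = value.strip().strip('"')
    return ad


def interval_seconds(ad):
    """Read FluentCheckpointInterval as an integer or a product like '45 * 60'."""
    raw = ad.get("fluentcheckpointinterval", "")
    if re.fullmatch(r"\s*\d+(\s*\*\s*\d+)*\s*", raw):
        result = 1
        for factor in raw.split("*"):
            result *= int(factor)
        return result
    return DEFAULT_INTERVAL  # attribute missing or too complex for this sketch


def checkpoint_due(ad, now, last_checkpoint=None):
    """True once a full interval has passed since job start or the last checkpoint."""
    start = int(ad.get("jobcurrentstartdate", now))
    reference = max(start, last_checkpoint or 0)
    return now - reference >= interval_seconds(ad)


def newest_mtime(directory, suffixes=(".cas", ".dat")):
    """Newest mtime among case/data files (assumed suffixes), or None if absent."""
    times = [
        entry.stat().st_mtime
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.endswith(suffixes)
    ]
    return max(times) if times else None


def main(flag_dir="/tmp", work_dir="."):
    ad = parse_classad(sys.stdin.read())
    if checkpoint_due(ad, time.time(), newest_mtime(work_dir)):
        # Creating the empty flag file is what asks Fluent to checkpoint.
        open(os.path.join(flag_dir, "check-fluent"), "w").close()


# The ad is expected on stdin; do nothing when run from an interactive terminal.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

Using the later of the job start time and the newest case/data file as the
reference point means a checkpoint left over from an earlier run does not
trigger an immediate new one right after a restart.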
To preserve the checkpoint across runs, you'd of course need to set this
parameter in your submit description:
when_to_transfer_files = ON_EXIT_OR_EVICT
This will cause HTCondor to save your scratch directory when eviction occurs,
allowing Fluent to find the previously-created case and data files and pick
up where that checkpoint left off.
This covers periodic checkpointing, but ideally we'd also like to have
on-demand checkpointing, so that the job could be instructed to checkpoint
during the eviction process, rather than losing up to a full checkpoint
interval's worth of work.
At first glance you'd think this could be handled by defining a
"FLUENT_HOOK_EVICT_CLAIM" hook, which instead of creating a "check-fluent"
file, would create the "exit-fluent" file. However, unlike the job update
hook, the evict claim hook runs as the ID of the condor_startd, which would
usually be the "condor" user. This means that the hook script wouldn't have
access to the job's scratch directory, and thus couldn't create the flag file
in the scratch-looped /tmp directory.
It's possible to configure Fluent to look for the flag file in some other
location, so that might offer some path forward.
I think that the alternative would have to be having a wrapper script around
the Fluent executable which would be able to recognize the eviction signals
from HTCondor and create the exit-fluent flag file when such a signal is
received.
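
A minimal Python sketch of such a wrapper might look like this. It assumes
the eviction signal is SIGTERM and that the real Fluent command line is
passed as the wrapper's arguments; the names make_handler and run are just
for illustration. I haven't tried this against Fluent, and the vacate grace
period would need to be long enough for Fluent to finish its current
iteration and save.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: turn an eviction signal into an exit-fluent flag file."""
import os
import signal
import subprocess
import sys

FLAG_PATH = "/tmp/exit-fluent"  # flag file name from the Fluent docs cited above


def make_handler(flag_path=FLAG_PATH):
    """Build a signal handler that creates the flag file and does nothing else."""
    def handler(signum, frame):
        # Fluent is then expected to finish its iteration, save, and exit itself.
        open(flag_path, "w").close()
    return handler


def run(command, flag_path=FLAG_PATH):
    """Run the solver command; on SIGTERM, request a checkpoint-and-exit."""
    signal.signal(signal.SIGTERM, make_handler(flag_path))
    child = subprocess.Popen(command)
    # wait() resumes after the handler has run, so the wrapper outlives the
    # signal and exits only once the solver itself has finished.
    return child.wait()


# Assumed invocation: wrapper.py <solver command and its arguments>
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run(sys.argv[1:]))
```

Because the wrapper runs as the job itself, it sees the same loop-mounted
/tmp as Fluent does, which sidesteps the permissions problem with the evict
claim hook.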
Also, if I missed a newer version of the Fluent documentation which indicates
that Fluent can checkpoint in response to a signal rather than the flag
files, that could be another option.
Good luck! Let us know how it works out!
-Michael V. Pelletier.