Subject: Re: [HTCondor-users] using idle computers in computer labs for CFD jobs
I've found that quite a few compute-intensive
tools with long-running scenarios have self-checkpointing capabilities
built in, even if it's only to pick up where they left off in a batch of
independent runs - which is, naturally, of limited use when you split up
the batch into one-run jobs and submit them to HTCondor to run all of them
at the same time.
I'm not sure if it's what you're using,
but here's some information on self-checkpointing for ANSYS Fluent jobs,
on page 39:
They mention native LSF and SGE integration,
but also indicate that you can checkpoint a running Fluent job by creating
a /tmp/check-fluent file. You can checkpoint and exit ("vacate"
in HTCondorese) by creating /tmp/exit-fluent.
With HTCondor on Linux and the MOUNT_UNDER_SCRATCH
option, you can bind-mount a tmp and var/tmp directory in the job's scratch
directory so that each job has its own /tmp and /var/tmp. This means that
only a single slot would be affected by creation of a /tmp/check-fluent
file in the job's context, since it would be in ${_CONDOR_SCRATCH_DIR}/tmp/check-fluent.
It would be easy enough to write a wrapper
which traps the HTCondor checkpointing or soft-kill signal and creates
the appropriate file for Fluent - SIGTSTP would be tmp/exit-fluent, and
SIGUSR2 would be tmp/check-fluent (see p.475 in the 8.2.9 manual), and
the soft-kill signal defaults to SIGTERM in the vanilla universe.
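As a rough illustration of the wrapper idea, here's a minimal Python 3
sketch. It's not tested against Fluent itself. The names make_handlers
and run are made up for this example, and mapping SIGTERM to exit-fluent
is my assumption about what you'd want on a soft-kill. It also assumes
MOUNT_UNDER_SCRATCH is giving the job its own /tmp, so the trigger files
can simply be created under /tmp.

```python
import signal
import subprocess
import sys
from pathlib import Path


def make_handlers(tmp_dir="/tmp"):
    """Map each signal to a handler that creates a Fluent trigger file.

    tmp_dir is a parameter only so the sketch can be tried outside a job;
    inside a job with MOUNT_UNDER_SCRATCH, /tmp is already per-job.
    """
    tmp = Path(tmp_dir)

    def touch(name):
        def handler(signum, frame):
            # Creating the file is the whole request; Fluent polls for it.
            (tmp / name).touch()
        return handler

    return {
        signal.SIGUSR2: touch("check-fluent"),  # checkpoint, keep running
        signal.SIGTSTP: touch("exit-fluent"),   # checkpoint and exit
        signal.SIGTERM: touch("exit-fluent"),   # assumed soft-kill mapping
    }


def run(cmd, tmp_dir="/tmp"):
    """Install the handlers, start the real command, return its exit code."""
    for sig, handler in make_handlers(tmp_dir).items():
        signal.signal(sig, handler)
    child = subprocess.Popen(cmd)
    # Python 3.5+ resumes the wait after a handler runs, so the wrapper
    # stays alive until the child finishes its checkpoint and exits.
    return child.wait()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the wrapper's own name is the command to run.
    sys.exit(run(sys.argv[1:]))
```

The job's executable would then be this wrapper, with the actual solver
command line passed as its arguments. The wrapper only asks for the
checkpoint; it still relies on HTCondor waiting long enough before the
hard kill.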
Fluent defaults to finishing the current
iteration, but can also be directed to complete all iterations in the current
time-step before checkpointing, which would potentially take longer, so
you'd want to set your timeouts in HTCondor (i.e., max vacate time) to
ensure it has enough time to finish a checkpoint.
Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax
michael.v.pelletier@xxxxxxxxxxxx