[Condor-users] file transfer problems with vanilla job
- Date: Wed, 10 Nov 2004 12:56:07 -0600 (CST)
- From: De-Wei Yin <dyin@xxxxxxxxxxxx>
- Subject: [Condor-users] file transfer problems with vanilla job
I plan to run some very long simulations that can go on for months. For
performance reasons I use the Intel Fortran Compiler and Intel Math
Kernel Library, so the jobs must be submitted in the vanilla universe. The
executable code has its own checkpointing mechanism. I want the
checkpoint file and other output files transferred back to the submit
node whenever the job is preempted or vacated from the execute node, or
if the job is removed from the job queue. Condor also needs to be able
to send the checkpoint file and a log file back as input when the job
restarts.
My problem is that the output files are not coming back when the job is
evicted from a node (by Condor or by me using condor_vacate or
condor_hold) or when it is removed from the queue (by me using
condor_rm), and if I do eventually get them to come back, I'm not sure
how to tell Condor which ones to send back to use in restarting the job.
The submit node is in the same pool as the execute nodes (same CM, no
flocking involved), and it does not share a common FILESYSTEM_DOMAIN
with the execute nodes. I am using Condor 6.6.6.
The program that I run basically uses four types of files:
1. "init" file: Contains all the data required to start the job from
scratch. If the "ckpt" file is present, that file is read and the
job continues from the last checkpoint; if "ckpt" does not exist,
then the "init" file is read and the job starts from the beginning.
2. "ckpt" file: Contains the minimal data set needed to continue an
interrupted job, and is periodically overwritten with newer data
sets. The first thing that the program does is search for this
file (using Fortran inquire(file="ckpt",exist=ex)).
3. "rlog" file: The running log of the job, contains some data and
job status information not needed to restart the job. When the job
starts from scratch using an "init" file, a new "rlog" file is
created; when the job restarts from a "ckpt" file, new records are
appended to the existing "rlog" file.
4. "data" file: Written periodically during the job run and numbered
sequentially (not appended or overwritten). Contains data to be
analyzed by other postprocessing tools.
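To make the startup behaviour described above concrete, here is a minimal sketch in Python rather than the actual Fortran. The function name `startup_plan` and its return convention are hypothetical, used only for illustration; the real program does this check with inquire().

```python
import os

def startup_plan(workdir):
    """Return (file to read, mode for opening "rlog") for a job starting in workdir.

    Hypothetical helper mirroring the rules in the list above:
    - if "ckpt" exists, continue from it and append to the existing "rlog";
    - otherwise start from "init" and create a fresh "rlog".
    """
    if os.path.exists(os.path.join(workdir, "ckpt")):
        return "ckpt", "a"   # restart: append new records to rlog
    return "init", "w"       # fresh start: create a new rlog
```

So whether the job restarts or begins from scratch depends only on whether "ckpt" is present in the job's working directory on the execute node.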
Here is what I have in the submit file, which works fine if the job can
run from start to finish without being evicted:
    Universe = vanilla
    Initialdir = .
    Executable = ./main
    Error = ./condor.err
    Log = ./condor.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_input_files = init
    Queue
When the job is evicted or removed, I expect to find the latest "ckpt"
file (if one has already been written), an "rlog" file, and any number
of "data" files. Unfortunately nothing comes back and the job always
restarts from scratch, and I cannot figure out why.
If the job is evicted, then when it restarts, it will need as input:
the "ckpt" if it has already been created, the "init" file in case there
is no "ckpt" file, and the "rlog" if it has been created so that new
records can be appended to it. How do I tell Condor that it needs to
send back these files, especially the "ckpt" and "rlog" files, which
might not yet exist if the job was interrupted early? By the way, any
numbered "data" files that do come back need not be returned to the
execute node, since they are never needed as input.
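In other words, the set of files that should go back to the execute node on a restart depends on what currently exists on the submit node. A rough sketch of that selection in Python follows; `restart_inputs` is a hypothetical helper written for illustration, not something Condor provides.

```python
import os

def restart_inputs(submit_dir):
    """List the files a restarted job needs as input.

    Hypothetical helper: "init" is always sent; "ckpt" and "rlog" are
    sent only if they already exist; "data" files are never sent back.
    """
    inputs = ["init"]
    for name in ("ckpt", "rlog"):
        if os.path.isfile(os.path.join(submit_dir, name)):
            inputs.append(name)
    return inputs
```

What I am missing is a way to express this "send it only if it exists" rule in the submit file itself.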
I've tried adding "transfer_output_files" to the submit script, but ran
into three problems:
1. The "init" file becomes corrupted when it is sent to the execute
node (maybe a bug?).
2. Condor panics if it cannot find an output file explicitly listed in
"transfer_output_files" (e.g., when the "ckpt" file has not yet
been written).
3. An unknown number of "data" files are created, and those not
explicitly listed in "transfer_output_files" are lost.
I would really appreciate any help with this. Thanks!
Dewey
--
Mr. De-Wei Yin, MASc, PEng
Dept of Chemical & Biological Engineering tel: +1 608 262-3370
University of Wisconsin-Madison fax: +1 608 262-5434
1415 Engineering Drive dyin at cae dot wisc dot edu
Madison WI 53706-1691 USA www.engr.wisc.edu/groups/mtsm/