Re: [HTCondor-users] transfer_in/output_files only if they exist
- Date: Fri, 8 Feb 2019 17:36:23 +0000
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] transfer_in/output_files only if they exist
On 2/7/2019 8:00 AM, Duncan Brown wrote:
> Hi Todd,
>
> Is there a way to tell condor that it's OK if a specific file listed in transfer_input_files does not exist (and the same question with an output file)? The use case is using condor file i/o to manage a checkpoint file.
Hi Duncan,
TJ already answered the question above, but I am not certain you need to
do the above to handle your checkpoint file use case. :)
When your submit file has
when_to_transfer_output = ON_EXIT_OR_EVICT
what happens is when your job is evicted, any output files are
transferred back to the SPOOL directory for that job on the submit
machine. When your job is rescheduled to run again, HTCondor first
sends all the files listed in transfer_input_files to the execute node, **and
then subsequently also sends all the files stored in SPOOL**. The
point being your checkpoint file need not be listed explicitly in
transfer_input_files at all... it will get transferred on restart
assuming it was considered output from a previous run.
So imagine you have a job that has input data ('my_input_data'), output
data ('my_output_data'), and it periodically writes a checkpoint file
('ckpt_file'). Your submit file could look like:
executable = foo.exe
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = my_input_data
transfer_output_files = my_output_data ckpt_file
With the above, the only issue may be your job going on hold if it is
evicted before it ever writes out its initial ckpt_file, because the file
will not exist and yet is explicitly declared in transfer_output_files.
To prevent this case, you could make a zero-length ckpt_file on
submission, and add it to transfer_input_files. This way the job will
never go on hold because all files listed in "transfer_output_files"
will always exist. Because HTCondor first sends the input files and
then sends the spool files, on restart after a ckpt HTCondor will first
send the zero-length ckpt_file from transfer_input_files, but then
immediately overwrite it when the ckpt_file contents from the SPOOL
directory (i.e. the ckpt_file contents from the last run) are sent.
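A minimal sketch of the submit file with that workaround applied, assuming
an empty ckpt_file has already been created in the submit directory (for
example with `touch ckpt_file`). The should_transfer_files and queue lines
are not in the example above and are added here only so the fragment is
complete; the file lists are written comma-separated.

```
# Sketch only: assumes a zero-length ckpt_file already exists in the
# submit directory before condor_submit is run.
executable              = foo.exe
# Assumption: file transfer is enabled explicitly for this job.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
# The empty ckpt_file is sent on the first run; on a restart the copy
# from SPOOL is sent afterwards and overwrites it.
transfer_input_files    = my_input_data, ckpt_file
# Both files now always exist on the execute node, even on early eviction.
transfer_output_files   = my_output_data, ckpt_file
queue
```

With this arrangement the job itself presumably has to treat a zero-length
ckpt_file as "no checkpoint yet" and start from the beginning.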
Hope the above helps,
Todd
> The use case is using condor file i/o to manage a checkpoint file. The first time the job is run, the checkpoint file does not exist so the job gets stuck in hold state. I want to be able to tell condor that it's OK that this file is not there.
>
> Cheers,
> Duncan.
>
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685