Re: [HTCondor-users] transfer_in/output_files only if they exist
- Date: Thu, 14 Feb 2019 13:55:35 +0000
- From: Duncan Brown <dabrown@xxxxxxx>
- Subject: Re: [HTCondor-users] transfer_in/output_files only if they exist
Hi Todd,
Follow-up question: is there a way to set something like
periodic_transfer_spool = 3600
so that the contents of the job's spool directory can be transferred back to the shadow's spool periodically? In combination with ON_EXIT_OR_EVICT that would give me periodic checkpointing if the job dies unexpectedly, in addition to when it is cleanly evicted.
I could fake this with some combination of periodic_hold and periodic_release, but my recollection is that hold sends a hard kill and doesn't leave time for a SIGTERM -> checkpoint and exit -> SIGKILL cycle. If my job's checkpoint timer loses sync with condor's periodic_hold, that would be a recipe for badput.
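
A minimal sketch of the job side of this, assuming a Python job that keeps its state in a small JSON file: it resumes from ckpt_file when one is present and non-empty, checkpoints on its own timer (here a step count), and treats SIGTERM as a request to checkpoint and exit. Only the file name ckpt_file comes from Todd's example below; the function names, the JSON state and the ckpt_every parameter are hypothetical, and nothing here calls into HTCondor itself.

```python
import json
import os
import signal
import sys
import tempfile

CKPT_FILE = "ckpt_file"  # name from the example submit file; the rest is hypothetical


def load_checkpoint(path=CKPT_FILE):
    """Return the saved state, or a fresh one if the file is missing or zero-length."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return {"step": 0}
    with open(path) as fh:
        return json.load(fh)


def save_checkpoint(state, path=CKPT_FILE):
    """Write to a temporary file and rename it, so a kill mid-write cannot truncate the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic rename within the same directory


def run(total_steps, ckpt_every=100, path=CKPT_FILE):
    """Resume from the checkpoint, work until done or until SIGTERM arrives, then checkpoint."""
    state = load_checkpoint(path)
    stop = {"requested": False}

    def on_sigterm(signum, frame):
        # Only set a flag here; the main loop finishes its current step and saves.
        stop["requested"] = True

    signal.signal(signal.SIGTERM, on_sigterm)

    while state["step"] < total_steps and not stop["requested"]:
        state["step"] += 1  # stand-in for one unit of real work
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state, path)

    save_checkpoint(state, path)  # final checkpoint on completion or on SIGTERM
    return state


if __name__ == "__main__":
    final = run(total_steps=1000)
    # Exit status convention is an assumption: non-zero if the work is unfinished.
    sys.exit(0 if final["step"] >= 1000 else 1)
```

The zero-length check in load_checkpoint is what lets an empty placeholder ckpt_file stand in for "no checkpoint yet" on the first run. None of this helps if the process receives a hard kill with no preceding SIGTERM, which is the case the question above is about.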
Cheers (from less cold but more snowy Syracuse),
Duncan.
> On Feb 11, 2019, at 3:03 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>
> On 2/11/2019 11:50 AM, Duncan Brown wrote:
>> Hi Todd,
>>
>> Ah, very nice, that's what I need!
>>
>> Cheers,
>> Duncan.
>>
>
> Glad to help!
>
> best regards from cold and snowy Madison,
> Todd
>
>
>
>
>>> On Feb 8, 2019, at 12:36 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>>
>>> On 2/7/2019 8:00 AM, Duncan Brown wrote:
>>>> Hi Todd,
>>>>
>>>> Is there a way to tell condor that it's OK if a specific file listed in transfer_input_files does not exist (and the same question with an output file)? The use case is using condor file i/o to manage a checkpoint file.
>>>
>>> Hi Duncan,
>>>
>>> TJ already answered the question above, but I am not certain you need to
>>> do the above to handle your checkpoint file use case. :)
>>>
>>> When your submit file has
>>>
>>> when_to_transfer_output = ON_EXIT_OR_EVICT
>>>
>>> what happens is when your job is evicted, any output files are
>>> transferred back to the SPOOL directory for that job on the submit
>>> machine. When your job is rescheduled to run again, HTCondor first
>>> sends all the specified transfer_input_files to the execute node, **and
>>> then subsequently also sends all the files stored in SPOOL**. The
>>> point being your checkpoint file need not be listed explicitly in
>>> transfer_input_files at all... it will get transferred on restart
>>> assuming it was considered output from a previous run.
>>>
>>> So imagine you have a job that has input data ('my_input_data'), output
>>> data ('my_output_data'), and it periodically writes a checkpoint file
>>> ('ckpt_file'). Your submit file could look like:
>>>
>>> executable = foo.exe
>>> when_to_transfer_output = ON_EXIT_OR_EVICT
>>> transfer_input_files = my_input_data
>>> transfer_output_files = my_output_data ckpt_file
>>>
>>> With the above, the only issue may be your job going on hold if your job
>>> is evicted before it ever writes out its initial ckpt_file, because it
>>> will not exist and yet is explicitly declared in transfer_output_files.
>>> To prevent this case, you could make a zero-length ckpt_file on
>>> submission, and add it to transfer_input_files. This way the job will
>>> never go on hold because all files listed in "transfer_output_files"
>>> will always exist. Because HTCondor first sends the input files and
>>> then sends the spool files, on restart after a ckpt HTCondor will first
>>> send the zero-length ckpt file from transfer_input_files, but then
>>> immediately overwrite it when the ckpt_file contents from the SPOOL
>>> directory (i.e. the ckpt_file contents from the last run) are sent.
>>>
>>> Hope the above helps,
>>> Todd
>>>
>>>> The first time the job is run, the checkpoint file does not exist so the job gets stuck in hold state. I want to be able to tell condor that it's OK that this file is not there.
>>>>
>>>> Cheers,
>>>> Duncan.
>>>>
>>>
>>>
>>>
--
Duncan Brown Room 263-1, Physics Department
Charles Brightman Professor of Physics Syracuse University, NY 13244
http://dabrown.expressions.syr.edu Phone: 315 443 5993