Re: [HTCondor-users] Appending file output for a vanilla job
- Date: Thu, 04 Mar 2021 14:05:13 +0000
- From: Duncan Brown <dabrown@xxxxxxx>
- Subject: Re: [HTCondor-users] Appending file output for a vanilla job
Hi Thomas,
Thanks, it looks like chirp is the solution. I can use condor_chirp put to send the results back to the submit machine.
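(For reference, a minimal sketch of what that could look like from inside the job, assuming the results accumulate in a local file called results.txt; the file names are illustrative, and on some pools the job may also need +WantIOProxy = True in its submit description for chirp to be available:)

    # Sketch only: copy the accumulating results file back to the submit machine.
    # 'condor_chirp put Local Remote' copies Local from the execute machine to
    # Remote on the submit side, relative to the job's directory there.
    import subprocess

    def push_results(local_path="results.txt", remote_path="results.txt"):
        subprocess.run(["condor_chirp", "put", local_path, remote_path],
                       check=True)

Called after each new result is written, this would keep the submit-side copy close to current even if the sandbox is lost when a machine disappears.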
Cheers,
Duncan.
> On Mar 4, 2021, at 4:28 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
>
> Hi Duncan,
>
> If the results are small enough, maybe you can use `condor_chirp` from within the job to store/update the results as job ClassAd attributes? [1] Alternatively, with condor_chirp the job could probably send a status/result file back or write its results into the job log (with a "grep'able" tag in the log, the results could maybe be harvested from the collected job logs).
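(For a small scalar result, the ClassAd route mentioned here could look roughly like the following from inside the job; the attribute name LastResult is invented for illustration:)

    # Sketch only: publish a small result into the job's own ClassAd via chirp,
    # so it can be read back on the submit side, e.g. with condor_q -af LastResult.
    # Numeric values can be passed as-is; a string would need ClassAd quotes.
    import subprocess

    def publish_result(value):
        subprocess.run(["condor_chirp", "set_job_attr", "LastResult", str(value)],
                       check=True)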
>
> If your jobs' workflows are somewhat complex, maybe they can be realized as a DAG [2] - but that might be overkill for just a few simple jobs.
>
> Cheers,
> Thomas
>
>
> [1]
> https://htcondor.readthedocs.io/en/latest/man-pages/condor_chirp.html?highlight=condor_chirp
>
>
> [2]
> https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#capturing-the-status-of-nodes-in-a-file
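(By way of illustration, the DAG route mentioned above could be as small as a file like the following, with sample.sub standing in for an existing submit file; DAGMan's NODE_STATUS_FILE command appears to be what reference [2] describes:)

    # Sketch only: two independent nodes reusing one submit description, with
    # DAGMan writing node status to a machine-readable file every 30 seconds.
    JOB A sample.sub
    JOB B sample.sub
    NODE_STATUS_FILE node-status.txt 30

This would be submitted with condor_submit_dag rather than condor_submit.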
>
>
> On 03/03/2021 22.50, Duncan Brown via HTCondor-users wrote:
>> Hi all,
>> I'm trying to do something that feels like it should be HTCondor 101, but I am failing to figure it out:
>> We have a Python program running in the vanilla universe that generates results in a loop that looks like:
>> while True:
>>     s = random_number_from( /dev/urandom )
>>     result = calculation_that_takes_about_ten_minutes( s )
>>     print(result)
>> The jobs are running on our OrangeGrid, which consists of transient execute machines with an average lifetime of 4 hours. Our submit file has
>> output = result.$(cluster).$(process)
>> stream_output = true
>> We then accumulate a bunch of results by cat-ing result.$(cluster).$(process) together. This works great while the jobs are running.
>> The problem is that if a job gets evicted from the execute machine and restarted, then the stdout file gets clobbered when the job starts back up again. We would just like to accumulate results from a bunch of jobs. The result files are simple enough that if the job got evicted while it was writing an ASCII line to stdout, we can filter that out.
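(For context, the full submit description in play is presumably something along these lines; the executable name, the extra log file names, and the queue count are invented:)

    # Sketch only: a vanilla-universe submit file matching the setup above.
    universe      = vanilla
    executable    = montecarlo.py
    output        = result.$(cluster).$(process)
    error         = error.$(cluster).$(process)
    log           = job.$(cluster).log
    stream_output = true
    queue 100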
>> I cannot figure out how to prevent condor from clobbering stdout when the job is restarted. I also can't figure out how to stream to files that are not stdout or stderr. Writing to a specific file and using append_files won't work, as the code is Python and not standard universe. The only solution I can come up with is to:
>> 1. Add transfer_input_files = result.$(cluster).$(process) to my submit file,
>> 2. Submit the job into the held state to get the $(cluster) number,
>> 3. Touch a bunch of result.$(cluster).$(process) files so they exist and are zero bytes,
>> 4. Have my program cat result.$(cluster).$(process) to stdout at startup,
>> 5. Write print(result) to stdout and have condor stream stdout.
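(A rough sketch of the program-side half of that workaround: assuming the submit file also passes the result file name via arguments = result.$(cluster).$(process), the job replays any previously transferred results onto stdout before resuming:)

    # Sketch only: replay the previous run's results before generating new ones,
    # so the streamed stdout file starts from where the evicted run left off.
    import os
    import sys

    previous = sys.argv[1]   # e.g. result.1234.0, brought in by transfer_input_files
    if os.path.exists(previous):
        with open(previous) as f:
            sys.stdout.write(f.read())
        sys.stdout.flush()

    # ...then continue with the usual while True loop that prints new results.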
>> It feels like there has to be an easier way of doing this. What's the obvious thing that I'm missing?
>> Cheers,
>> Duncan.
>
--
Duncan Brown                               Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
Physics Graduate Program Director          http://dabrown.expressions.syr.edu