
Re: [HTCondor-users] Long running OSG jobs



James,

Holding a job is different from evicting it. AFAIK a held job's files
are lost even when ON_EXIT_OR_EVICT is used, so a periodic hold will
not bring your checkpoint files back. If you want to test that your
self-checkpointing setup works, you can use condor_vacate_job instead
of condor_rm.
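
To make the test concrete, here is a minimal sketch of the kind of
self-checkpointing loop I have in mind. It is not your code: the file
name checkpoint.json, the functions load_state/save_state/run, and the
step-based checkpoint interval are all hypothetical stand-ins, and it
assumes the job is told to stop with SIGTERM (which, as far as I know,
is what HTCondor sends on a vacate unless kill_sig says otherwise) or
SIGINT.

```python
import json
import os
import signal
import tempfile

# Hypothetical checkpoint file name; it must live in a directory that
# is listed in transfer_output_files so it travels with the job.
CHECKPOINT = "checkpoint.json"

_stop = False  # set by the signal handler, polled by the main loop


def _request_stop(signum, frame):
    """Only record the request; the loop saves state and exits cleanly."""
    global _stop
    _stop = True


def load_state(path=CHECKPOINT):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temp file and rename, so a kill mid-write cannot
    leave a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT, checkpoint_every=100):
    """Resume from the checkpoint if present, then work until finished
    or until a stop signal has been received."""
    state = load_state(path)
    while state["step"] < n_steps and not _stop:
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_state(state, path)
    save_state(state, path)  # final save, also reached after a signal
    return state


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, _request_stop)
    signal.signal(signal.SIGINT, _request_stop)
    run(n_steps=1000)
```

With something like this running on a worker node,
"condor_vacate_job <cluster>.<proc>" should trigger the handler, and
with ON_EXIT_OR_EVICT the checkpoint file should come back to the
submit host. When the job restarts it should pick up from the saved
step rather than from zero.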

Jason Patton

On Fri, Nov 11, 2016 at 4:39 PM, James Clark
<james.clark@xxxxxxxxxxxxxxxxxx> wrote:
> Hello,
>
> I have some long-running (>24 hour) jobs I would like to deploy on the Open
> Science Grid.
>
> The jobs self-checkpoint (by writing out simple text files to disk) every 60
> minutes and when they receive an interrupt signal.
>
> I have enabled file transfer in the submission file, and have added the
> following lines for a periodic hold/release in the hope that the jobs will:
> 1) get evicted
> 2) transfer their working directory (which contains the checkpoint files)
> back to the submission host
> 3) resume under condor and re-send that working directory to the new worker
> node
> 4) identify the presence of the checkpoint files and cleanly resume from
> where they left off
>
> However, the jobs do not appear to transfer the data back in this scenario.
> I have also tried condor_rm, which I would expect to terminate the job and
> send the non-empty working directory back.  This also fails to achieve the
> desired effect.
>
> Some pertinent details from the submission file:
>
> should_transfer_files = YES
> transfer_output_files = $(macrooutputDir)
> transfer_input_files = datafind,$(macrooutputDir)
> when_to_transfer_output = ON_EXIT_OR_EVICT
> periodic_hold = (JobStatus == 2) && (time() - EnteredCurrentStatus > 8*3600)
> periodic_hold_subcode = 12345
> periodic_release = (JobStatus == 5) && (time() - EnteredCurrentStatus > 5*60) && (PeriodicHoldSubCode =?= 12345)
> want_graceful_removal = true
>
> where $(macrooutputDir) is the name of each job's working directory, as
> specified in the dagman file.
>
> Any advice would be greatly appreciated,
> Many thanks,
> James
>
> --
> ===========================================
> James Clark
> Research Scientist
>
> Center for Relativistic Astrophysics
> School of Physics
> Georgia Institute of Technology
> Atlanta GA 30332
> office: Boggs 1-110
> email:  james.clark@xxxxxxxxxxxxxxxxxx
> Tel. (cell):  413-230-1412
> Skype: jamesclark_
> ===========================================