HTCondor Project List Archives




Re: [Condor-devel] Issues with output files and checkpointing



Todd,

> >I'm integrating this patch into Condor for the next release of the
> >developer series. I just have to check it in, which I'll do monday.
> 
> What happens if the ckpt server is down for an hour?  Will jobs
> restart from the start?

Only if the job has accumulated less than MAX_DISCARDED_RUN_TIME
(default 3600) seconds of CPU time.

Note that this has always been the case.  The patch only makes the
schedd remember that the job has restarted from the start (by
removing the LastCkptServer attribute), so that it can never restart
from that checkpoint again.
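
To make the restart rule concrete, here is a minimal sketch in Python.
It is not Condor code: the function name and its arguments are
hypothetical, and only MAX_DISCARDED_RUN_TIME and its default of 3600
come from this thread.  What the job does when it does not restart
from the start is not covered here, so the sketch leaves it out.

```python
MAX_DISCARDED_RUN_TIME = 3600  # seconds of CPU time; default quoted in the thread


def restarts_from_scratch(cpu_seconds, checkpoint_readable,
                          max_discarded=MAX_DISCARDED_RUN_TIME):
    """Hypothetical helper: True if the job would throw away its work.

    A readable checkpoint is always used.  If it cannot be read, the job
    starts over only when it has accumulated less than max_discarded
    seconds of CPU time.
    """
    if checkpoint_readable:
        return False
    return cpu_seconds < max_discarded


# 20 minutes of CPU, checkpoint server down: work is discarded.
print(restarts_from_scratch(1200, checkpoint_readable=False))
# 2 hours of CPU, checkpoint server down: work is not discarded.
print(restarts_from_scratch(7200, checkpoint_readable=False))
```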

The problem scenario is:

1. Job runs and checkpoints before accumulating MAX_DISCARDED_RUN_TIME
   seconds of CPU time.
2. Job restarts, but can't read the checkpoint, so it restarts from
   scratch.
3. Job is evicted without checkpointing.
4. Job restarts and reads the old checkpoint.

The last restart will be inconsistent with the file state because
output files will have been truncated by the first restart.
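
The scenario above can be replayed with a small toy model.  This is a
sketch under stated assumptions, not the schedd's actual logic: the Job
class, the helper functions, and the server name are all made up for
illustration, and the job is assumed to stay under
MAX_DISCARDED_RUN_TIME so that an unreadable checkpoint means a restart
from scratch.  Only the LastCkptServer attribute is taken from the
thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    # Stand-in for the LastCkptServer attribute mentioned in the thread.
    last_ckpt_server: Optional[str] = None
    # Lines currently in the job's output file.
    output: List[str] = field(default_factory=list)
    # Output length the stored checkpoint expects to find on resume.
    ckpt_output_len: Optional[int] = None


def checkpoint(job, server):
    job.last_ckpt_server = server
    job.ckpt_output_len = len(job.output)


def restart(job, checkpoint_readable, patched):
    """Return 'checkpoint' or 'scratch' for this restart."""
    if job.last_ckpt_server is not None and checkpoint_readable:
        return "checkpoint"
    job.output.clear()               # restart from scratch truncates output
    if patched:
        job.last_ckpt_server = None  # forget the stale checkpoint
    return "scratch"


def run_scenario(patched):
    job = Job()
    job.output.extend(["line 1", "line 2"])  # job runs ...
    checkpoint(job, "ckpt.example")          # ... and checkpoints
    modes = [restart(job, checkpoint_readable=False, patched=patched)]
    job.output.append("line 1")              # runs again, evicted, no checkpoint
    modes.append(restart(job, checkpoint_readable=True, patched=patched))
    consistent = (modes[-1] == "scratch"
                  or len(job.output) == job.ckpt_output_len)
    return modes, consistent


print(run_scenario(patched=False))
print(run_scenario(patched=True))
```

Without the patch the final restart resumes from the old checkpoint
against a truncated output file; with it, the job starts from scratch
again and the file state stays consistent.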

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison