
Re: [Condor-users] how to kill job when output dir removed ?

Dr Ian C. Smith wrote:
> Unfortunately the DAGMan bug in 6.8 (mentioned
> again here recently) is a show stopper for us.
Just to be clear, there are two manifestations of this bug. In 6.8, it
causes DAGMan itself to go on hold when it exits abnormally. In 6.9.1,
the bug causes the schedd to crash :(
The problem will be fixed in 6.8.4 and 6.9.2.
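
If you hit the 6.8 manifestation, the held DAGMan job and its hold
reason should be visible in the queue. A quick check, using the
standard -hold option of condor_q:

    condor_q -hold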

> You say that Condor 6.8 will continue to attempt to write to the submit host
> if the filesystem is full. Does Condor actually detect this error
> separately? In other words - if the directory is missing (or unwritable)
> will it put the job on hold or just keep trying?
In 6.8, Condor detects the error writing output, but it doesn't put the
job on hold. The job goes back to the idle state and tries to run again.
In the common case where the write failed because the initial working
directory had been deleted, the job will go on hold when it tries to run
again, since its input can no longer be staged from that directory.
However, if you are writing your output to some other directory, and
there are no problems fetching input files, then the job will run again,
and possibly fail again when it tries to write output.
In 6.9.1, the job goes on hold as soon as it fails to write output.
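
If the 6.8 retry cycle bites you in the meantime, one workaround is a
periodic expression in the submit file that puts the job on hold once it
has cycled through too many starts. A minimal sketch, assuming your
version advertises the NumJobStarts job attribute (older versions may
call it JobRunCount) and picking 3 as an arbitrary threshold:

    # Sketch: stop the idle/retry cycle after a few failed starts.
    # JobStatus == 1 means the job is idle. NumJobStarts counts how
    # many times the job has started (an assumption; check which
    # attribute your schedd advertises, e.g. JobRunCount on older ones).
    universe      = vanilla
    executable    = my_prog
    output        = results/my_prog.out
    error         = results/my_prog.err
    log           = my_prog.log
    periodic_hold = (JobStatus == 1) && (NumJobStarts > 3)
    queue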

Just to make matters more confusing (sorry): the standard, PVM, and
local universes have not yet been incorporated into the new
hold-on-error regime. Jobs in these universes currently exhibit the
traditional behavior of going back to the idle state and trying to run
again when they hit errors caused by missing input files/directories
and/or output failures.
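
Since jobs in those universes never go on hold on their own, one way to
kill such a job automatically when its output directory disappears is a
periodic_remove expression in its submit file. Again just a sketch,
under the same assumption about the start-count attribute:

    # Sketch: remove (rather than hold) a job that keeps cycling back
    # to idle, e.g. because its output directory was deleted.
    # NumJobStarts is an assumption; older versions may use JobRunCount.
    periodic_remove = (JobStatus == 1) && (NumJobStarts > 5)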
--Dan