HTCondor Project List Archives




Re: [Condor-devel] [condor-fw] [nwp@xxxxxxxxxxxxxxxxxxxxxx: errorlog]



On Wed, May 16, 2012 at 02:40:52PM -0500, Nathan Panike wrote:

> On Wed, May 16, 2012 at 12:29:05PM -0500, Todd Tannenbaum wrote:
> 
> > True, my hypothesis was formulated in an absence of information /
> > background... for all I know your job is producing 200 output files
> > and the Total Bytes Sent is reflecting files other than the one empty
> > file.
> > 
> > Another thought --  iirc, you submitted your job to transfer on
> > exit or evict... is your job truly prepared to resume with
> > half-filled output files?  Maybe you want your job to only transfer
> > on exit....
> > 
> > Yet another (depressing) thought... maybe we have a serious
> > regression in the Condor file transfer code.  :(
> > 
> 
> I don't know that it is a recent regression. LMCG has had a policy to
> handle this for ages.  When we run these soma jobs in lmcg, we have
> the following:
> 
> SYSTEM_PERIODIC_HOLD = JobStatus != 3 && ( ExitCode =!= UNDEFINED && ( JobUniverse == $(VANILLA) && (RegExp("\bsoma_align\.[^/]*$",Cmd) && LastBytesSent == 0) || ExitCode != 0) || ExitBySignal == True) && OnExitHold =!= True && (OnExitRemove =?= True || TerminationPending =?= True)
> 
> Dan Forrest says: The SYSTEM_PERIODIC_HOLD expression will put soma
> jobs on hold if they don't transfer any data back.  This is to work
> around a problem with vanilla jobs (usually when they run on Windows,
> but I've seen the same thing with Linux) where there will be a zero
> ExitCode, but the output file size is zero.
> 
> This code was one of the reasons I am running these in CHTC: to see
> if more recent versions of condor are exhibiting this and cognate
> problems.
> 
> Out of 11499 jobs that have completed in my run, there are 459 (4%)
> that exhibited this problem.

Looking back on a conversation I had with Greg Thain (January 2011), I
would add that this got much worse when CHTC was upgraded to 7.5.5,
but for an entirely different reason...

updateFromStarter() is sometimes called with "ExitBySignal = 0",
"ExitCode = 0", and "JobState = Exited", but BEFORE the files are
transferred.  The files are transferred immediately after this update.
This makes the above periodic hold expression useless (I realize
LastBytesSent is an extension I added, but BytesSent is also undefined
at this point).

I patched the LMCG shadow to not evaluate the periodic policies when
ExitCode is set (the exit policies are going to be evaluated very soon
anyway), but this is another problem.
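
The ordering problem and the effect of such a guard can be sketched
as a toy simulation.  This is not the shadow code: the function name
periodic_eval_points, the skip_when_exited flag, and the sample
updates are all hypothetical, and the only thing modeled is "which
snapshots of the job ad would the periodic policy get to see".

```python
# Toy model (hypothetical names): the job ad is a dict, and each entry in
# `updates` stands for one starter update merged into it.

def periodic_eval_points(updates, skip_when_exited):
    """Return the ad snapshots at which periodic policies would be evaluated."""
    ad = {}
    evaluated = []
    for update in updates:
        ad.update(update)
        # Guard in the spirit of the patch described above: once ExitCode
        # is set, leave the decision to the exit policies.
        if skip_when_exited and "ExitCode" in ad:
            continue
        evaluated.append(dict(ad))
    return evaluated


# Assumed sequence: the "Exited" update arrives BEFORE the transfer totals.
updates = [
    {"JobState": "Running"},
    {"JobState": "Exited", "ExitCode": 0, "ExitBySignal": False},
    {"BytesSent": 4096.0},
]

unpatched = periodic_eval_points(updates, skip_when_exited=False)
patched = periodic_eval_points(updates, skip_when_exited=True)

# Snapshots where ExitCode is known but BytesSent is still missing.
premature = [ad for ad in unpatched if "ExitCode" in ad and "BytesSent" not in ad]
```

Without the guard there is one snapshot where ExitCode is set and
BytesSent is still missing; with the guard the periodic policy never
sees an ad that has ExitCode at all.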

I would think that if you put something like:

  PeriodicHold = (ExitCode =!= UNDEFINED) && (BytesSent == 0)

in your submit file you would see some number of jobs go on hold too
because the hold expression evaluates to UNDEFINED (but not at LMCG
because of the above mentioned patch.)
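
For illustration, here is a small sketch of how that expression
evaluates under three-valued ClassAd-style logic.  It is a
simplification, not the ClassAd library: Python's None stands in for
UNDEFINED, only the three operators used in the expression are
modeled, and the helper names are made up for this example.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value (an assumption of this sketch)

def is_not_identical(a, b):
    """Model of =!= : always yields True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is not b
    return a != b

def equals(a, b):
    """Model of == : UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b

def logical_and(a, b):
    """Model of && : False wins; otherwise UNDEFINED propagates."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True

def periodic_hold(ad):
    """(ExitCode =!= UNDEFINED) && (BytesSent == 0); missing attributes read as UNDEFINED."""
    return logical_and(
        is_not_identical(ad.get("ExitCode"), UNDEFINED),
        equals(ad.get("BytesSent"), 0),
    )

still_running = periodic_hold({})                              # ExitCode not set yet
exited_before_transfer = periodic_hold({"ExitCode": 0})        # the sequence described above
exited_no_data = periodic_hold({"ExitCode": 0, "BytesSent": 0})
exited_with_data = periodic_hold({"ExitCode": 0, "BytesSent": 4096})
```

In this model the "exited before transfer" ad is the only case that
comes out UNDEFINED rather than True or False, which is the window
described above.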

Or maybe this is something Todd told me had been fixed, but I don't
remember?  If it has, it isn't a backward compatible fix because I
still see that sequence happening.

-- 
Daniel K. Forrest		Space Science and
dan.forrest@xxxxxxxxxxxxx	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison