HTCondor Project List Archives




Re: [Condor-devel] [condor-fw] [nwp@xxxxxxxxxxxxxxxxxxxxxx: errorlog]



On Wed, May 16, 2012 at 12:29:05PM -0500, Todd Tannenbaum wrote:
> On 5/16/2012 12:09 PM, Nathan Panike wrote:
> >>Couple thoughts:
> >>
> >>1. fix your job?  what is creating the file somarun.1968334.0.xml?
> >>is there any way for your job to create this file, fail to write
> >>anything into it, and still exit with status 0?  methinks there
> >>is, and that is what is happening.
> >
> >The job produces output on every run when I run it by hand. Your
> >hypothesis runs into the line above:
> >
> >	2923130  -  Total Bytes Sent By Job
> >
> >This indicates that the shadow believes output was sent back, and yet
> >the file is empty.
> >
> 
> True, my hypothesis was formulated in an absence of information /
> background... for all I know your job is producing 200 output files
> and the Total Bytes Sent is reflecting files other than the one empty
> file.
> 
> Another thought --  iirc, you submitted your job to transfer on
> exit or evict... is your job truly prepared to resume with
> half-filled output files?  Maybe you want your job to only transfer
> on exit....
> 
> Yet another (depressing) thought... maybe we have a serious
> regression in the Condor file transfer code.  :(
> 

I don't know that it is a recent regression. LMCG has had a policy to
handle this for ages.  When we run these soma jobs in LMCG, we have the
following:

SYSTEM_PERIODIC_HOLD = JobStatus != 3 && ( ExitCode =!= UNDEFINED && ( JobUniverse == $(VANILLA) && (RegExp("\bsoma_align\.[^/]*$",Cmd) && LastBytesSent == 0) || ExitCode != 0) || ExitBySignal == True) && OnExitHold =!= True && (OnExitRemove =?= True || TerminationPending =?= True)

Dan Forrest says: The SYSTEM_PERIODIC_HOLD expression will put soma jobs
on hold if they don't transfer any data back.  This is to work around a
problem with vanilla jobs (usually when they run on Windows, but I've
seen the same thing with Linux) where there will be a zero ExitCode, but
the output file size is zero.
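
For anyone who finds the one-line expression hard to read, here is a rough
Python paraphrase of how I read its grouping (&& binds tighter than ||).
This is a sketch only, not how Condor evaluates it: it assumes $(VANILLA)
expands to universe number 5, it treats a missing attribute as UNDEFINED,
and it ignores the finer points of ClassAd three-valued logic. The function
name should_hold and the dict representation are made up for illustration.

```python
import re

VANILLA = 5   # assumption: what $(VANILLA) expands to in the config
REMOVED = 3   # JobStatus value for a removed job

# Same pattern as in the expression: Cmd's last path component starts with soma_align.
SOMA_CMD = re.compile(r"\bsoma_align\.[^/]*$")


def should_hold(job):
    """job: dict of attribute name -> value; a missing key stands in for UNDEFINED."""
    exit_code = job.get("ExitCode")

    # Vanilla soma job that reported zero bytes sent back.
    soma_with_no_bytes = (
        job.get("JobUniverse") == VANILLA
        and SOMA_CMD.search(job.get("Cmd", "")) is not None
        and job.get("LastBytesSent") == 0
    )

    # ExitCode =!= UNDEFINED && (soma_with_no_bytes || ExitCode != 0)
    bad_exit = exit_code is not None and (soma_with_no_bytes or exit_code != 0)

    # ... || ExitBySignal == True
    finished_badly = bad_exit or job.get("ExitBySignal") is True

    # OnExitRemove =?= True || TerminationPending =?= True
    leaving_queue = (
        job.get("OnExitRemove") is True
        or job.get("TerminationPending") is True
    )

    return (
        job.get("JobStatus") != REMOVED
        and finished_badly
        and job.get("OnExitHold") is not True
        and leaving_queue
    )
```

Read this way, a soma job that exits 0 but reports LastBytesSent == 0 gets
held, as does any job with a nonzero exit code or one killed by a signal,
provided it is otherwise about to leave the queue.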

This code was one of the reasons I am running these in CHTC: to see if
more recent versions of Condor are exhibiting this and cognate problems.

Out of 11499 jobs that have completed in my run, there are 459 (4%) that
exhibited this problem.

> Todd
> 
> p.s. we probably should've had this discussion on condor-developer,
> nothing really UW FW team specific about it...
> 
> >>
> >>or
> >>
> >>2. If (1) is difficult, consider submitting with DAGMan and have the
> >>post script validate the output files.
> >>
> >That is probably what I will have to do.
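
A minimal sketch of what such a post script could look like, in case it is
useful. It assumes the job's exit code is passed as the first argument (for
example via DAGMan's $RETURN macro) followed by the output files to check.
The script name check_output.py and the helper find_bad_outputs are
hypothetical, and "valid" here only means "exists and is non-empty"; it does
not check that the XML is well-formed.

```python
#!/usr/bin/env python3
"""Hypothetical DAGMan POST script: fail the node if an output file is missing or empty."""
import os
import sys


def find_bad_outputs(paths):
    """Return a list of (path, reason) for files that are missing or zero-length."""
    bad = []
    for path in paths:
        if not os.path.isfile(path):
            bad.append((path, "missing"))
        elif os.path.getsize(path) == 0:
            bad.append((path, "empty"))
    return bad


def main(argv):
    # argv[1]: the job's exit code; argv[2:]: output files to validate.
    if len(argv) < 3:
        print("usage: check_output.py RETURN_CODE FILE [FILE ...]", file=sys.stderr)
        return 2
    try:
        job_rc = int(argv[1])
    except ValueError:
        print("RETURN_CODE must be an integer", file=sys.stderr)
        return 2

    if job_rc != 0:
        print(f"job exited with status {job_rc}", file=sys.stderr)
        return 1

    bad = find_bad_outputs(argv[2:])
    for path, reason in bad:
        print(f"{path}: {reason}", file=sys.stderr)
    # A nonzero exit from the POST script marks the node as failed.
    return 1 if bad else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

In the DAG file this would be hooked up with something along the lines of
"SCRIPT POST NodeName check_output.py $RETURN output.xml", where NodeName and
output.xml are placeholders. Adding a RETRY line for the node would then
rerun jobs whose output came back empty.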