
Re: [HTCondor-users] numjobstarts vs numshadowstarts



On 3/23/2015 11:59 AM, Suchandra Thapa wrote:
Are there any situations where numjobstarts will be different than
numshadowstarts?  Is this something that'll occur frequently?

Thanks,
Suchandra

Hi Suchandra,

NumShadowStarts is incremented by the schedd whenever it launches a condor_shadow (or, in the case of a local universe job, when the schedd launches a condor_starter on the submit machine).
NumJobStarts is incremented by the condor_starter or condor_gridmanager 
right before it spawns the job, but after the execute node has been 
successfully claimed and the job's input files have been transferred.
I could imagine several scenarios where they will be different. Some 
examples:
1. If the job specifies a universe that does not launch a shadow (e.g. 
grid universe, local universe), NumJobStarts would exceed NumShadowStarts.
2. If the condor_shadow is successfully started but encounters some 
error before spawning the job, such as an error transferring the input 
files or spawning the job itself (e.g. execute node is missing required 
shared libraries, executable does not exist on the execute node, etc.), 
then NumShadowStarts could exceed NumJobStarts.
3. If the job is a parallel universe job, NumJobStarts is incremented 
for each node (MPI rank) that joins the computation. Thus NumJobStarts 
would likely exceed NumShadowStarts.
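
As a rough illustration of the scenarios above, here is a minimal Python sketch that compares the two counters in a job ad. It assumes the ad has already been read into a plain dict (for instance from condor_q -long output). The function name compare_starts, the sample ads, and the choice to treat a missing attribute as 0 are all assumptions made for this example and are not part of HTCondor.

```python
def compare_starts(ad):
    """Report how NumShadowStarts and NumJobStarts relate in one job ad.

    `ad` is a plain dict of attribute name -> value. A missing attribute
    is treated as 0, which is an assumption made for this sketch.
    """
    shadow = int(ad.get("NumShadowStarts", 0))
    job = int(ad.get("NumJobStarts", 0))
    if shadow > job:
        # e.g. the shadow started but the job never spawned (scenario 2)
        return "shadow > job"
    if job > shadow:
        # e.g. no shadow is used, or several parallel nodes joined
        # (scenarios 1 and 3)
        return "job > shadow"
    return "equal"


# Hypothetical ads, made up purely for illustration.
failed_transfer = {"NumShadowStarts": 3, "NumJobStarts": 1}
parallel_job = {"NumShadowStarts": 1, "NumJobStarts": 4}
plain_job = {"NumShadowStarts": 2, "NumJobStarts": 2}

for ad in (failed_transfer, parallel_job, plain_job):
    print(compare_starts(ad))
```

The comparison only says which counter is larger. It cannot tell you which of the scenarios actually occurred, so the universe and the job's event log still need to be checked.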

Hope the above helps,
Todd