
Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission



On Sat, 8 Aug 2015, Brian Candler wrote:

> Incidentally, by grepping for sleep I just found src/condor_procapi/WISDOM
> which says: "See UniqueProcessId.pdf in this folder for a more indepth
> discussion of how the new ProcAPI ProcessId code works"
>
> And I read that document. But I still don't understand why this PID birthday
> issue applies to condor_dagman running in the scheduler universe, but not to
> a regular job running on a worker.

Ah, the fundamental thing is this:  we want to avoid having two instances 
of DAGMan simultaneously running on the same DAG.  This will goof things 
up because the two DAGMans will be using the same log for their node jobs, 
and the events will get mixed together.

So, to avoid this, DAGMan creates a lock file at startup (which contains 
the UniquePID information).  When DAGMan starts up, it looks for the lock 
file.  If the file exists, DAGMan tries to read the UniquePID info from 
the lock file.  If it succeeds in doing that, and the corresponding 
process is still alive, DAGMan says, "Oops, there's another DAGMan already 
running on this DAG", and exits.  If DAGMan can't read the UniquePID 
info, or that process does not exist, DAGMan assumes that there was an 
earlier instance of DAGMan running on that DAG, but that instance no 
longer exists.  So the just-started DAGMan then continues in recovery 
mode.
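
In case a sketch helps, here is very roughly the shape of that startup
check (Python pseudocode, not the actual DAGMan C++; the lock file name
and the bare-PID liveness probe are stand-ins for the real UniquePID
machinery, which also records the process "birthday" so it can't be
fooled by PID reuse):

    import os
    import sys

    # Hypothetical name for illustration; the real lock file lives next
    # to the DAG and its name is derived from the DAG file name.
    LOCK_FILE = "diamond.dag.lock"

    def pid_is_alive(pid):
        # Crude stand-in for the UniquePID liveness test; a bare PID probe
        # can be fooled by PID reuse, which is exactly what the real
        # UniqueProcessId code guards against.
        try:
            os.kill(pid, 0)          # signal 0: probe without signalling
        except ProcessLookupError:
            return False
        except PermissionError:
            return True              # process exists, owned by someone else
        return True

    if os.path.exists(LOCK_FILE):
        recovery = True              # some DAGMan ran on this DAG before
        try:
            old_pid = int(open(LOCK_FILE).read().split()[0])
        except (OSError, ValueError, IndexError):
            old_pid = None           # can't read the info -> treat as stale
        if old_pid is not None and pid_is_alive(old_pid):
            print("Oops, there's another DAGMan already running on this DAG")
            sys.exit(1)
    else:
        recovery = False

    # No live instance found: record our own info and carry on,
    # in recovery mode if a stale lock file was present.
    with open(LOCK_FILE, "w") as f:
        f.write("%d\n" % os.getpid())
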
Hopefully that all makes sense...

Kent