Hi, for the n-th time I found one of the schedds in my Condor pool dead. (The machine is the "pool master", but otherwise all submit machines, i.e. the cluster head nodes, are configured identically.)
[snip]
A look at the process table shows that the corresponding condor_schedd process is not owned by condor (as on all other submit machines) but by the user who submitted a job cluster before the problem showed up.
Practically the only time the schedd switches its effective UID to that of the submitting user is when it is writing to the user log.
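To make the mechanism concrete, here is a minimal sketch (not Condor's actual code, just an illustration of the euid dance a daemon does around a user-log write; the function name and arguments are made up). The switch only happens when the daemon runs as root:

```python
import os

def append_to_user_log(path, line, user_uid):
    """Append one event line to the user's log, temporarily taking on
    the submitting user's UID. Only root may change its euid, so as an
    unprivileged process we just write directly."""
    switched = False
    if os.geteuid() == 0:
        os.seteuid(user_uid)   # become the submitting user for the write
        switched = True
    try:
        with open(path, "a") as f:
            f.write(line + "\n")
    finally:
        if switched:
            os.seteuid(0)      # restore daemon privileges
```

If the write (or the file lock guarding it) hangs, the process is stuck while still wearing the user's UID, which matches what you saw in the process table.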
My guess: did the user in question submit job(s) with log = /some/path in the submit file, where "/some/path" sits on an NFS server? If so, welcome to the pathetic world of NFS file locking (especially on Linux). A quick workaround would be to have the user place the log files on a local disk, or not specify a log file at all.
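The failure mode is that a POSIX advisory lock on an NFS path goes through the lock daemon and can block forever. A hedged sketch of the pattern (again, illustrative names, not the schedd's real code): taking the lock non-blocking with a bounded retry would at least fail fast instead of wedging the daemon:

```python
import errno
import fcntl
import time

def locked_append(path, line, retries=5, delay=0.2):
    """Append under an exclusive POSIX advisory lock (fcntl).
    Non-blocking with retries, so a dead/slow NFS lockd produces an
    error after ~retries*delay seconds rather than an indefinite hang."""
    with open(path, "a") as f:
        for _ in range(retries):
            try:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError as e:
                if e.errno not in (errno.EACCES, errno.EAGAIN):
                    raise          # a real error, not lock contention
                time.sleep(delay)
        else:
            raise TimeoutError("could not lock " + path)
        try:
            f.write(line + "\n")
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

A blocking fcntl.lockf() call on an NFS mount, by contrast, can sit in the kernel indefinitely, which is exactly the stuck-schedd symptom.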
We need to improve this at some point. Since most users only need locking across the processes on one machine (i.e. the schedd, shadow, gridmanager, and dagman all run on the same box), we could perhaps replace the file lock with a kernel-level mutex. What do folks think?
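One way to picture that proposal (a sketch of the idea, under the assumption that all writers on a host share an ancestor process, which holds for the schedd and the shadows it forks): a single machine-local mutex serializes the user-log writes with no file lock, so NFS never enters the picture:

```python
import multiprocessing

# One machine-local mutex shared by the daemon and everything it forks
# (schedd, shadows, ...). Replaces the per-write NFS file lock; it
# provides no cross-host exclusion, which is the stated trade-off.
user_log_mutex = multiprocessing.Lock()

def write_event(path, line):
    """Append an event line, serialized only against writers on this host."""
    with user_log_mutex:
        with open(path, "a") as f:
            f.write(line + "\n")
```

The obvious caveat is users who really do read or write the same log from several submit machines; they would still need file locking (or no shared log at all).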
Hope this helps,
Todd