Junjun Mao wrote:
Condor has been running fine for a few months, but all the jobs suddenly got killed (some restarted) yesterday. Here is the log from the SchedLog on the master node:

7/31 15:00:03 Shadow pid 10451 for job 573.2 exited with status 100
7/31 15:00:03 match (<10.10.20.64:49539>#1175522523#195) out of jobs (cluster id 181); relinquishing
7/31 15:00:03 Sent RELEASE_CLAIM to startd on <10.10.20.64:49539>
7/31 15:00:03 Match record (<10.10.20.64:49539>, 181, -1) deleted
7/31 15:00:04 Got VACATE_SERVICE from <10.10.20.64:34423>
7/31 15:00:04 Shadow pid 9427 for job 636.0 exited with status 100
7/31 15:00:04 match (<10.10.20.76:46461>#1175522447#121) out of jobs (cluster id 636); relinquishing
7/31 15:00:04 Sent RELEASE_CLAIM to startd on <10.10.20.76:46461>
7/31 15:00:04 Match record (<10.10.20.76:46461>, 636, -1) deleted
7/31 15:00:04 Got VACATE_SERVICE from <10.10.20.76:59431>
7/31 15:00:04 In DedicatedScheduler::reaper pid 22101 has status 1024
7/31 15:00:04 Shadow pid 22101 exited with status 4
7/31 15:00:04 ERROR: Shadow exited with job exception code!

It seems the shadows exited with status 100 or 4. What do statuses 100 and 4 mean? Does this have anything to do with the network or the file system?
Grep for "ERROR" in your shadowlog to see what the problem is (or if you specified a "Log=" line in your job submit file to get a user log, the error will appear in that file as well).
For the interested reader, here are all the condor_shadow exit codes and what they mean:
    4  JOB_EXCEPTION       The job exited with an exception
   44  DPRINTF_ERROR       There is a fatal error with dprintf()
  100  JOB_EXITED          The job exited (not killed)
  101  JOB_CKPTED          The job was checkpointed
  102  JOB_KILLED          The job was killed
  103  JOB_COREDUMPED      The job was killed and a core file produced
  105  JOB_NO_MEM          Not enough memory to start the shadow
  106  JOB_SHADOW_USAGE    Incorrect arguments to condor_shadow
  107  JOB_NOT_CKPTED      The job was kicked off without a checkpoint
  107  JOB_SHOULD_REQUEUE  (!) We define this to the same number, since we want
                           the same behavior. However, "JOB_NOT_CKPTED" doesn't
                           mean much if we're not a standard universe job. The
                           effect of this exit code is that the job is put back
                           in the job queue and run again.
  108  JOB_NOT_STARTED     Can't connect to the startd or the request was refused
  109  JOB_BAD_STATUS      Job status != RUNNING on startup
  110  JOB_EXEC_FAILED     Exec failed for some reason other than ENOMEM
  111  JOB_NO_CKPT_FILE    There is no checkpoint file (lost)
  112  JOB_SHOULD_HOLD     The job should be put on hold
  113  JOB_SHOULD_REMOVE   The job should be removed
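If you want to translate these quickly while scanning a SchedLog, here is a small Python sketch. The dictionary is just the table above typed in, and the regex is only a guess at the "exited with status" lines shown in the original post, so adjust it to your own log format:

    import re
    import sys

    # condor_shadow exit codes, copied from the table above
    SHADOW_EXIT_CODES = {
        4:   "JOB_EXCEPTION",
        44:  "DPRINTF_ERROR",
        100: "JOB_EXITED",
        101: "JOB_CKPTED",
        102: "JOB_KILLED",
        103: "JOB_COREDUMPED",
        105: "JOB_NO_MEM",
        106: "JOB_SHADOW_USAGE",
        107: "JOB_NOT_CKPTED / JOB_SHOULD_REQUEUE",
        108: "JOB_NOT_STARTED",
        109: "JOB_BAD_STATUS",
        110: "JOB_EXEC_FAILED",
        111: "JOB_NO_CKPT_FILE",
        112: "JOB_SHOULD_HOLD",
        113: "JOB_SHOULD_REMOVE",
    }

    # Match lines like "Shadow pid 10451 for job 573.2 exited with status 100"
    pattern = re.compile(r"exited with status (\d+)")

    for line in sys.stdin:
        m = pattern.search(line)
        if m:
            code = int(m.group(1))
            name = SHADOW_EXIT_CODES.get(code, "unknown code")
            print(line.rstrip(), "->", name)

Run it as "python decode_shadow_exits.py < SchedLog" (the script name is made up); it will annotate each shadow exit line with the symbolic name.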