Hi, I am using Condor for a personal pool. The version is:
$CondorVersion: 7.8.2 Sep 30 2012 Debian-7.8.2~dfsg.1-1+deb7u1 $
$CondorPlatform: X86_64-Ubuntu_ $
I found that running condor_suspend on a dagman job can make it crash and go into RECOVERY mode. This is the dagman output from when I issued the suspend command:
......
09/05/13 14:29:44 MultiLogFiles: truncating log file /home/kyle/csf/RS-9/RS-9.log
09/05/13 14:29:44 Submitting Condor Node RS-9.1 job(s)...
09/05/13 14:29:44 submitting: condor_submit -a dag_node_name' '=' 'RS-9.1 -a +DAGManJobId' '=' '327 -a DAGManJobId' '=' '327 -a submit_event_notes' '=' 'DAG' 'Node:' 'RS-9.1 -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 RS-9.1.sub
09/05/13 14:29:44 From submit: Submitting job(s).
09/05/13 14:29:44 From submit: 1 job(s) submitted to cluster 328.
09/05/13 14:29:44 assigned Condor ID (328.0.0)
09/05/13 14:29:44 Just submitted 1 job this cycle...
09/05/13 14:29:44 Currently monitoring 1 Condor log file(s)
09/05/13 14:29:44 Event: ULOG_SUBMIT for Condor Node RS-9.1 (328.0.0)
09/05/13 14:29:44 Number of idle job procs: 1
09/05/13 14:29:44 Of 2 nodes total:
09/05/13 14:29:44  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
09/05/13 14:29:44   ===     ===      ===     ===     ===        ===      ===
09/05/13 14:29:44     0       0        1       0       0          1        0
09/05/13 14:29:44 0 job proc(s) currently held
09/05/13 14:29:54 Currently monitoring 1 Condor log file(s)
09/05/13 14:29:54 Event: ULOG_EXECUTE for Condor Node RS-9.1 (328.0.0)
09/05/13 14:29:54 Number of idle job procs: 0
09/05/13 14:30:04 Currently monitoring 1 Condor log file(s)
09/05/13 14:30:04 Event: ULOG_IMAGE_SIZE for Condor Node RS-9.1 (328.0.0)
09/05/13 14:30:28 Setting maximum accepts per cycle 8.
09/05/13 14:30:28 ******************************************************
09/05/13 14:30:28 ** condor_scheduniv_exec.327.0 (CONDOR_DAGMAN) STARTING UP
09/05/13 14:30:28 ** /usr/bin/condor_dagman
09/05/13 14:30:28 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
09/05/13 14:30:28 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
09/05/13 14:30:28 ** $CondorVersion: 7.8.2 Sep 30 2012 Debian-7.8.2~dfsg.1-1+deb7u1 $
09/05/13 14:30:28 ** $CondorPlatform: X86_64-Ubuntu_ $
09/05/13 14:30:28 ** PID = 15394
09/05/13 14:30:28 ** Log last touched 9/5 14:30:04
09/05/13 14:30:28 ******************************************************
......
I think the line in red is the last output before dagman crashed. This is the terminal session:
kyle@scorpio ~/csf/RS-7 $ condor_q
-- Submitter: scorpio.otitan.com : <127.0.0.1:38147> : scorpio.otitan.com
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
325.0 kyle 9/5 14:20 0+00:00:11 R 0 0.3 condor_dagman
326.0 kyle 9/5 14:20 0+00:00:00 I 0 2.7 csfexec
2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended
kyle@scorpio ~/csf/RS-7 $ condor_suspend 325.0
Job 325.0 suspended
kyle@scorpio ~/csf/RS-7 $ condor_q
-- Failed to fetch ads from: <127.0.0.1:38147> : scorpio.otitan.com
CEDAR:6001:Failed to connect to <127.0.0.1:38147>
kyle@scorpio ~/csf/RS-7 $ condor_q
-- Submitter: scorpio.otitan.com : <127.0.0.1:59970> : scorpio.otitan.com
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
325.0 kyle 9/5 14:20 0+00:00:00 I 0 0.3 condor_dagman
326.0 kyle 9/5 14:20 0+00:00:00 I 0 2.7 csfexec
2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
kyle@scorpio ~/csf/RS-7 $ condor_q
-- Submitter: scorpio.otitan.com : <127.0.0.1:59970> : scorpio.otitan.com
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
325.0 kyle 9/5 14:20 0+00:00:12 R 0 0.3 condor_dagman
326.0 kyle 9/5 14:20 0+00:00:00 I 0 2.7 csfexec
2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended
There was also a job disconnect event for this dag node job.
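
In case it helps with reproducing this, the restart can be located in the dagman output by looking for the "STARTING UP" banner and the "Log last touched" line that follows it. Below is a rough Python sketch that does this. It assumes the log format is exactly as in the excerpt above; the function name find_restarts and the sample list are only for illustration and are not part of Condor.

```python
import re

# Banner line printed when condor_dagman starts, as seen in the excerpt above.
# Assumption: every line begins with an "MM/DD/YY HH:MM:SS" timestamp.
STARTUP = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \*\* .*\(CONDOR_DAGMAN\) STARTING UP"
)
# Line in the startup banner that reports when the log was last written.
TOUCHED = re.compile(r"\*\* Log last touched (.+?)\s*$")


def find_restarts(lines):
    """Return (startup_timestamp, log_last_touched) for each startup banner.

    log_last_touched is None if no matching line follows the banner.
    """
    restarts = []
    pending = None  # timestamp of a banner still waiting for its "last touched" line
    for line in lines:
        started = STARTUP.match(line)
        if started:
            if pending is not None:
                restarts.append((pending, None))
            pending = started.group(1)
            continue
        touched = TOUCHED.search(line)
        if touched and pending is not None:
            restarts.append((pending, touched.group(1)))
            pending = None
    if pending is not None:
        restarts.append((pending, None))
    return restarts


# Sample lines copied from the log excerpt above.
sample = [
    "09/05/13 14:30:04 Event: ULOG_IMAGE_SIZE for Condor Node RS-9.1 (328.0.0)",
    "09/05/13 14:30:28 ** condor_scheduniv_exec.327.0 (CONDOR_DAGMAN) STARTING UP",
    "09/05/13 14:30:28 ** /usr/bin/condor_dagman",
    "09/05/13 14:30:28 ** Log last touched 9/5 14:30:04",
]

for started, touched in find_restarts(sample):
    print(f"dagman started at {started}; log last touched {touched}")
```

To run it on a real file, pass the open file object instead of the sample list, for example find_restarts(open(path)) where path points at the dagman output file.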