Subject: [Condor-users] dagman aborts without creating a rescue dag
Hi,
I was running a DAG from my submit machine (Red Hat Enterprise Linux
AS release 3, Condor version 6.7.8), with all jobs to be executed on a
remote machine (Fedora Core release 3 (Heidelberg), Condor version
6.7.8). Almost the entire DAG completed, but then DAGMan aborted. Here
are the last few lines of the dagman.out file:
8/4 22:04:14 Job submit try 5/6 failed, will try again in >= 16 seconds.
8/4 22:04:32 Submitting Condor Job new_rc_tx_lalapps_inca_ID000261_0 ...
8/4 22:04:32 submitting: condor_submit -a 'dag_node_name =
new_rc_tx_lalapps_inca_ID000261_0' -a '+DAGManJobID = 1897' -a
'submit_event_notes = DAG Node: new_rc_tx_lalapps_inca_ID000261_0' -a
'+DAGParentNodeNames = "lalapps_inca_ID000261"'
new_rc_tx_lalapps_inca_ID000261_0.sub 2>&1
8/4 22:04:32 failed while reading from pipe.
8/4 22:04:32 Read so far: Submitting job(s)ERROR: can't determine proxy
filenamex509 user proxy is required for globus, gt2, gt3, gt4 or
nordugrid jobs
8/4 22:04:32 condor_submit try failed
8/4 22:04:32 submit command was: condor_submit -a 'dag_node_name =
new_rc_tx_lalapps_inca_ID000261_0' -a '+DAGManJobID = 1897' -a
'submit_event_notes = DAG Node: new_rc_tx_lalapps_inca_ID000261_0' -a
'+DAGParentNodeNames = "lalapps_inca_ID000261"'
new_rc_tx_lalapps_inca_ID000261_0.sub 2>&1
8/4 22:04:32 Job submit failed after 6 tries.
8/4 22:04:32 Running POST script of Job new_rc_tx_lalapps_inca_ID000261_0...
8/4 22:04:32 Of 1024 nodes total:
8/4 22:04:32 Done Pre Queued Post Ready Un-Ready Failed
8/4 22:04:32 === === === === === === ===
8/4 22:04:32 1002 0 3 1 0 18 0
8/4 22:04:37 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job
lalapps_inspiral_ID000224 (-1.-1)
8/4 22:04:37 ERROR "Assertion ERROR on (job->GetStatus() ==
Job::STATUS_POSTRUN || recovery)" at line 772 in file dag.C
The user proxies on both machines were still valid for a long time to
come, yet DAGMan aborted without creating a rescue DAG. Is there
possibly a bug in the file dag.C, or what is going on?
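
For anyone comparing node counts while looking at this, here is a minimal sketch of how the status summary in the excerpt above could be pulled out of dagman.out and checked for consistency. It assumes the summary always has the layout shown above (an "Of N nodes total:" line followed by a row of seven counts). The function name parse_node_status and the COLUMNS list are made up for illustration and are not part of Condor.

```python
import re

# Column names as printed in the dagman.out summary header quoted above.
COLUMNS = ["Done", "Pre", "Queued", "Post", "Ready", "Un-Ready", "Failed"]

def parse_node_status(lines):
    """Return (total, counts) for the last status summary found, or None.

    Hypothetical helper: assumes the summary layout seen in the excerpt.
    """
    total = None
    result = None
    for line in lines:
        m = re.search(r"Of (\d+) nodes total", line)
        if m:
            total = int(m.group(1))
            continue
        # A counts row is the timestamp followed by exactly seven integers.
        m = re.match(r"\s*\d+/\d+\s+\d+:\d+:\d+\s+((?:\d+\s+){6}\d+)\s*$", line)
        if m and total is not None:
            counts = dict(zip(COLUMNS, map(int, m.group(1).split())))
            result = (total, counts)
    return result

# The summary lines copied from the log excerpt in this message.
excerpt = """\
8/4 22:04:32 Of 1024 nodes total:
8/4 22:04:32 Done Pre Queued Post Ready Un-Ready Failed
8/4 22:04:32 === === === === === === ===
8/4 22:04:32 1002 0 3 1 0 18 0
""".splitlines()

total, counts = parse_node_status(excerpt)
print("total nodes:", total)
print("per-state counts:", counts)
print("counts add up to total:", sum(counts.values()) == total)
```

For the summary quoted above the seven counts do add up to the 1024 total, so the table itself looks consistent. The open question is the state of the individual node named in the assertion.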