[Condor-users] condor-g job status unknown caused dagman to exit
- Date: Tue, 20 Apr 2010 10:24:01 -0400
- From: Peter Doherty <doherty@xxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor-g job status unknown caused dagman to exit
To the Condor experts,

I'm using Condor-G, and this job (see the logs below) caused a large DAG to exit unexpectedly this morning. I understand the concept of BAD EVENTS (in this case, the job started executing after it was supposed to be done), but I want to know how to prevent this from happening.

From what I can see, the job's remote status became unknown (what causes this? I see it happen a lot, actually), and Condor then gave up on the job and removed it. Later the job called back home, and instead of simply ignoring the late report, DAGMan recorded the execute event, got confused, and exited.
It seems like a bug to me.
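The closest thing to a knob I've found so far is the allow_events setting that shows up in the dagman.out excerpt below. If that corresponds to the DAGMAN_ALLOW_EVENTS macro in condor_config, I'm guessing the workaround would look something like the sketch below, but I haven't worked out which bit of the mask covers the "executing after the job already ended" case, so treat the value as a placeholder:

    # Untested guess at a condor_config change. DAGMAN_ALLOW_EVENTS is a
    # bitmask of bad-event categories that DAGMan tolerates instead of
    # aborting the DAG; 114 is (I believe) the shipped default, so the
    # bit for this particular case would still have to be OR'd in after
    # checking the manual for our Condor version.
    DAGMAN_ALLOW_EVENTS = 114

Is that the right approach, or is there a cleaner way to make DAGMan ignore a late execute event like this?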
Thanks
Peter
job log file:
000 (935774.000.000) 04/19 16:55:14 Job submitted from host: <10.0.10.39:56607>
    DAG Node: 4bbn-01360
017 (935774.000.000) 04/19 18:14:17 Job submitted to Globus
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs
    JM-Contact: https://ff-grid2.unl.edu:38818/1135/1271715184/
    Can-Restart-JM: 1
027 (935774.000.000) 04/19 18:14:17 Job submitted to grid resource
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs
    GridJobId: gt2 ff-grid2.unl.edu/jobmanager-pbs https://ff-grid2.unl.edu:38818/1135/1271715184/
001 (935774.000.000) 04/19 20:25:12 Job executing on host: gt2 ff-grid2.unl.edu/jobmanager-pbs
029 (935774.000.000) 04/19 20:54:49 The job's remote status is unknown
020 (935774.000.000) 04/20 01:51:03 Detected Down Globus Resource
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs
026 (935774.000.000) 04/20 01:51:03 Detected Down Grid Resource
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs
019 (935774.000.000) 04/20 08:52:40 Globus Resource Back Up
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs
025 (935774.000.000) 04/20 08:52:40 Grid Resource Back Up
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs
012 (935774.000.000) 04/20 09:03:34 Job was held.
    Globus error 31: the job manager failed to cancel the job as requested
    Code 2 Subcode 31
013 (935774.000.000) 04/20 09:07:50 Job was released.
    The job attribute PeriodicRelease expression '(NumGlobusSubmits <= 7)' evaluated to TRUE
009 (935774.000.000) 04/20 09:08:17 Job was aborted by the user.
    The job attribute PeriodicRemove expression '(JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)' evaluated to TRUE
030 (935774.000.000) 04/20 09:08:42 The job's remote status is known again
001 (935774.000.000) 04/20 09:08:42 Job executing on host: gt2 ff-grid2.unl.edu/jobmanager-pbs
009 (935774.000.000) 04/20 09:08:42 Job was aborted by the user.
    The job attribute PeriodicRemove expression '(JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)' evaluated to TRUE
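In case it's relevant, the corresponding pieces of the submit file look roughly like this (trimmed to the parts that show up in the log above, so take it as a sketch rather than the exact file; the executable, arguments, and output lines are omitted):

    # Relevant Condor-G submit-file lines (reconstructed sketch)
    universe         = grid
    grid_resource    = gt2 ff-grid2.unl.edu/jobmanager-pbs
    periodic_release = (NumGlobusSubmits <= 7)
    periodic_remove  = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)

The periodic_remove expression removes a job that has sat in the running state (JobStatus == 2) for more than 12000 seconds, and that is what fired here.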
dagman.out file:
04/19 16:56:13 Event: ULOG_SUBMIT for Condor Node 4bbn-01360 (935774.0)
04/19 18:14:20 Event: ULOG_GLOBUS_SUBMIT for Condor Node 4bbn-01360 (935774.0)
04/19 18:14:20 Event: ULOG_GRID_SUBMIT for Condor Node 4bbn-01360 (935774.0)
04/19 20:25:14 Event: ULOG_EXECUTE for Condor Node 4bbn-01360 (935774.0)
04/19 20:54:52 Event: ULOG_JOB_STATUS_UNKNOWN for Condor Node 4bbn-01360 (935774.0)
04/20 01:51:07 Event: ULOG_GLOBUS_RESOURCE_DOWN for Condor Node 4bbn-01360 (935774.0)
04/20 01:51:07 Event: ULOG_GRID_RESOURCE_DOWN for Condor Node 4bbn-01360 (935774.0)
04/20 08:52:41 Event: ULOG_GLOBUS_RESOURCE_UP for Condor Node 4bbn-01360 (935774.0)
04/20 08:52:41 Event: ULOG_GRID_RESOURCE_UP for Condor Node 4bbn-01360 (935774.0)
04/20 09:03:35 Event: ULOG_JOB_HELD for Condor Node 4bbn-01360 (935774.0)
04/20 09:07:53 Event: ULOG_JOB_RELEASED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:23 Event: ULOG_JOB_ABORTED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:23 Node 4bbn-01360 job completed
04/20 09:08:23 Unable to get log file from submit file ../.dag/4bbn.ca (node 4bbn-01360); using default (/opt/osg-shared/home/site/doherty/tmp/phaser/clean/4bbn/group/nodes/../.dag/4bbn.dag.nodes.log)
04/20 09:08:23 Running POST script of Node 4bbn-01360...
04/20 09:08:43 Event: ULOG_JOB_STATUS_KNOWN for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 Event: ULOG_EXECUTE for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 BAD EVENT: job (935774.0.0) executing, total end count != 0 (1)
04/20 09:08:43 ERROR: aborting DAG because of bad event (BAD EVENT: job (935774.0.0) executing, total end count != 0 (1))
04/20 09:08:43 Event: ULOG_JOB_ABORTED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 BAD EVENT: job (935774.0.0) ended, total end count != 1 (2)
04/20 09:08:43 Continuing with DAG in spite of bad event (BAD EVENT: job (935774.0.0) ended, total end count != 1 (2)) because of allow_events setting
04/20 09:08:43 Aborting DAG...