[Condor-users] condor-g job status unknown caused dagman to exit
- Date: Tue, 20 Apr 2010 10:24:01 -0400
- From: Peter Doherty <doherty@xxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor-g job status unknown caused dagman to exit
To the condor experts,
I'm using Condor-G, and this job (see logs below) caused a large DAG to
exit unexpectedly this morning. I understand the concept of BAD EVENTS
(in this case, the job started executing after it was supposed to be
done), but I want to know how to prevent this from happening.
From what I can see, the job's state became unknown. (What causes this?
I actually see it happen a lot.) Condor then abandoned the job, but the
job later called back home. Instead of ignoring the call, Condor heard
it and got confused, and DAGMan exited.
It seems like a bug to me.
Thanks
Peter
job log file:
000 (935774.000.000) 04/19 16:55:14 Job submitted from host: <10.0.10.39:56607>
    DAG Node: 4bbn-01360
017 (935774.000.000) 04/19 18:14:17 Job submitted to Globus
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs
    JM-Contact: https://ff-grid2.unl.edu:38818/1135/1271715184/
    Can-Restart-JM: 1
027 (935774.000.000) 04/19 18:14:17 Job submitted to grid resource
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs
    GridJobId: gt2 ff-grid2.unl.edu/jobmanager-pbs https://ff-grid2.unl.edu:38818/1135/1271715184/
001 (935774.000.000) 04/19 20:25:12 Job executing on host: gt2 ff-grid2.unl.edu/jobmanager-pbs
029 (935774.000.000) 04/19 20:54:49 The job's remote status is unknown
020 (935774.000.000) 04/20 01:51:03 Detected Down Globus Resource
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs
026 (935774.000.000) 04/20 01:51:03 Detected Down Grid Resource
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs
019 (935774.000.000) 04/20 08:52:40 Globus Resource Back Up
    RM-Contact: ff-grid2.unl.edu/jobmanager-pbs
025 (935774.000.000) 04/20 08:52:40 Grid Resource Back Up
    GridResource: gt2 ff-grid2.unl.edu/jobmanager-pbs
012 (935774.000.000) 04/20 09:03:34 Job was held.
        Globus error 31: the job manager failed to cancel the job as requested
        Code 2 Subcode 31
013 (935774.000.000) 04/20 09:07:50 Job was released.
        The job attribute PeriodicRelease expression '(NumGlobusSubmits <= 7)' evaluated to TRUE
009 (935774.000.000) 04/20 09:08:17 Job was aborted by the user.
        The job attribute PeriodicRemove expression '(JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)' evaluated to TRUE
030 (935774.000.000) 04/20 09:08:42 The job's remote status is known again
001 (935774.000.000) 04/20 09:08:42 Job executing on host: gt2 ff-grid2.unl.edu/jobmanager-pbs
009 (935774.000.000) 04/20 09:08:42 Job was aborted by the user.
        The job attribute PeriodicRemove expression '(JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)' evaluated to TRUE
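
For reference, the PeriodicRemove expression in the job log can be transcribed into Python to see when it fires. This is only a sketch, not Condor's ClassAd evaluator. The function name `periodic_remove` is hypothetical, the year 2010 is taken from the mail date, and it is an assumption (not confirmed by the log) that EnteredCurrentStatus still held the time of the first execute event when the job was removed.

```python
from datetime import datetime


def periodic_remove(job_status, current_time, entered_current_status):
    """Python transcription of the PeriodicRemove expression in the job log:
    (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 12000)
    Times are in seconds; JobStatus 2 means Running.
    """
    return job_status == 2 and (current_time - entered_current_status) > 12000


# Assumed timestamps: event 001 (first execute) and the first event 009 (abort).
executing_since = datetime(2010, 4, 19, 20, 25, 12)
first_abort = datetime(2010, 4, 20, 9, 8, 17)
elapsed = int((first_abort - executing_since).total_seconds())

print(elapsed)                         # seconds between the two events
print(periodic_remove(2, elapsed, 0))  # a running job past the 12000 s limit
print(periodic_remove(1, elapsed, 0))  # an idle job never matches
```

The 12000-second limit is 3 hours 20 minutes, so under these assumptions the expression would have been true long before the resource came back up.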
dagman.out file:
04/19 16:56:13 Event: ULOG_SUBMIT for Condor Node 4bbn-01360 (935774.0)
04/19 18:14:20 Event: ULOG_GLOBUS_SUBMIT for Condor Node 4bbn-01360 (935774.0)
04/19 18:14:20 Event: ULOG_GRID_SUBMIT for Condor Node 4bbn-01360 (935774.0)
04/19 20:25:14 Event: ULOG_EXECUTE for Condor Node 4bbn-01360 (935774.0)
04/19 20:54:52 Event: ULOG_JOB_STATUS_UNKNOWN for Condor Node 4bbn-01360 (935774.0)
04/20 01:51:07 Event: ULOG_GLOBUS_RESOURCE_DOWN for Condor Node 4bbn-01360 (935774.0)
04/20 01:51:07 Event: ULOG_GRID_RESOURCE_DOWN for Condor Node 4bbn-01360 (935774.0)
04/20 08:52:41 Event: ULOG_GLOBUS_RESOURCE_UP for Condor Node 4bbn-01360 (935774.0)
04/20 08:52:41 Event: ULOG_GRID_RESOURCE_UP for Condor Node 4bbn-01360 (935774.0)
04/20 09:03:35 Event: ULOG_JOB_HELD for Condor Node 4bbn-01360 (935774.0)
04/20 09:07:53 Event: ULOG_JOB_RELEASED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:23 Event: ULOG_JOB_ABORTED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:23 Node 4bbn-01360 job completed
04/20 09:08:23 Unable to get log file from submit file ../.dag/4bbn.ca (node 4bbn-01360); using default (/opt/osg-shared/home/site/doherty/tmp/phaser/clean/4bbn/group/nodes/../.dag/4bbn.dag.nodes.log)
04/20 09:08:23 Running POST script of Node 4bbn-01360...
04/20 09:08:43 Event: ULOG_JOB_STATUS_KNOWN for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 Event: ULOG_EXECUTE for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 BAD EVENT: job (935774.0.0) executing, total end count != 0 (1)
04/20 09:08:43 ERROR: aborting DAG because of bad event (BAD EVENT: job (935774.0.0) executing, total end count != 0 (1))
04/20 09:08:43 Event: ULOG_JOB_ABORTED for Condor Node 4bbn-01360 (935774.0)
04/20 09:08:43 BAD EVENT: job (935774.0.0) ended, total end count != 1 (2)
04/20 09:08:43 Continuing with DAG in spite of bad event (BAD EVENT: job (935774.0.0) ended, total end count != 1 (2)) because of allow_events setting
04/20 09:08:43 Aborting DAG...
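
The two BAD EVENT messages above can be reproduced with a simplified model of an end-count check. This is a sketch of what the messages appear to describe, not DAGMan's actual implementation. The function `check_events` and the `END_EVENTS` set are hypothetical, and which event names count as "end" events is an assumption.

```python
from collections import defaultdict

# Assumption: these event names are the ones that end a job in this model.
END_EVENTS = {"ULOG_JOB_ABORTED", "ULOG_JOB_TERMINATED"}


def check_events(events):
    """Return bad-event messages for a sequence of (event_name, job_id) pairs."""
    end_count = defaultdict(int)
    problems = []
    for name, job_id in events:
        if name == "ULOG_EXECUTE" and end_count[job_id] != 0:
            # The job claims to be running after it already ended.
            problems.append(
                f"job ({job_id}) executing, total end count != 0 ({end_count[job_id]})"
            )
        elif name in END_EVENTS:
            end_count[job_id] += 1
            if end_count[job_id] != 1:
                # The job ended more than once.
                problems.append(
                    f"job ({job_id}) ended, total end count != 1 ({end_count[job_id]})"
                )
    return problems


# Event order taken from the dagman.out excerpt for node 4bbn-01360.
JOB = "935774.0.0"
events = [(name, JOB) for name in (
    "ULOG_SUBMIT", "ULOG_GLOBUS_SUBMIT", "ULOG_GRID_SUBMIT", "ULOG_EXECUTE",
    "ULOG_JOB_STATUS_UNKNOWN", "ULOG_GLOBUS_RESOURCE_DOWN",
    "ULOG_GRID_RESOURCE_DOWN", "ULOG_GLOBUS_RESOURCE_UP",
    "ULOG_GRID_RESOURCE_UP", "ULOG_JOB_HELD", "ULOG_JOB_RELEASED",
    "ULOG_JOB_ABORTED", "ULOG_JOB_STATUS_KNOWN", "ULOG_EXECUTE",
    "ULOG_JOB_ABORTED",
)]

for message in check_events(events):
    print("BAD EVENT:", message)
```

In this model the first abort at 09:08:23 sets the end count to 1, so the execute event at 09:08:43 and the second abort each trip a check, in the same order as in the log.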