I'm running Condor Stable on Windows. A couple of times I've seen my big DAGs die with incomprehensible "BAD EVENT" errors. The dagman.out log below seems to indicate that job 5886 exits successfully, but then an unexpected ULOG_EXECUTE event arrives for the same job
for no clear reason. There are a bunch of these "BAD EVENT" messages scattered throughout the log alongside "Continuing with DAG in spite of bad event". But then suddenly "Aborting DAG" happens and everything gets condor_rm'ed. I can't figure out what the proximate
cause of the "Aborting DAG" message is.

01/14/12 21:14:25 Event: ULOG_EXECUTE for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)
01/14/12 21:14:25 Number of idle job procs: 1428
01/14/12 21:14:25 Event: ULOG_EXECUTE for Condor Node moeadd_scen0.sim-632752657 (5889.0.0)
01/14/12 21:14:25 Number of idle job procs: 1427
01/14/12 21:14:25 Event: ULOG_ATTRIBUTE_UPDATE for Condor Node reprun_scen0.sim-23890347 (4004.0.0)
01/14/12 21:14:25 Event: ULOG_IMAGE_SIZE for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)
01/14/12 21:14:25 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-118624484 (0.2147483647.1033)
01/14/12 21:14:25 Number of idle job procs: 1428
01/14/12 21:14:25 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-118624484 (0.2147483647.1033)
01/14/12 21:14:25 Node alertadd_scen0.sim-118624484 job proc (0.2147483647.1033) completed successfully.
01/14/12 21:14:25 Node alertadd_scen0.sim-118624484 job completed
01/14/12 21:14:25 Number of idle job procs: 1427
01/14/12 21:14:25 Event: ULOG_JOB_TERMINATED for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)
01/14/12 21:14:25 Node moeadd_scen0.sim-850087584 job proc (5886.0.0) completed successfully.
01/14/12 21:14:25 Node moeadd_scen0.sim-850087584 job completed
01/14/12 21:14:25 Number of idle job procs: 1427
01/14/12 21:14:25 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-22384084 (0.2147483647.1034)
01/14/12 21:14:25 Number of idle job procs: 1428
01/14/12 21:14:25 Event: ULOG_EXECUTE for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)
01/14/12 21:14:25 BAD EVENT: job (5886.0.0) executing, total end count != 0 (1)
01/14/12 21:14:25 ERROR: aborting DAG because of bad event (BAD EVENT: job (5886.0.0) executing, total end count != 0 (1))
01/14/12 21:14:25 Event: ULOG_EXECUTE for Condor Node moeadd_scen0.sim-632752657 (5889.0.0)
01/14/12 21:14:25 Number of idle job procs: 1428
01/14/12 21:14:25 Event: ULOG_ATTRIBUTE_UPDATE for Condor Node reprun_scen0.sim-23890347 (4004.0.0)
01/14/12 21:14:25 Event: ULOG_IMAGE_SIZE for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)
01/14/12 21:14:25 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-118624484 (0.2147483647.1033)
01/14/12 21:14:25 BAD EVENT: job (0.2147483647.1033) submitted, total end count != 0 (1)
01/14/12 21:14:25 Continuing with DAG in spite of bad event (BAD EVENT: job (0.2147483647.1033) submitted, total end count != 0 (1)) because of allow_events setting
01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-118624484 (0.2147483647.1033)
01/14/12 21:14:26 BAD EVENT: job (0.2147483647.1033) ended, total end count != 1 (2)
01/14/12 21:14:26 Continuing with DAG in spite of bad event (BAD EVENT: job (0.2147483647.1033) ended, total end count != 1 (2)) because of allow_events setting
01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)
01/14/12 21:14:26 BAD EVENT: job (5886.0.0) ended, total end count != 1 (2)
01/14/12 21:14:26 Continuing with DAG in spite of bad event (BAD EVENT: job (5886.0.0) ended, total end count != 1 (2)) because of allow_events setting
01/14/12 21:14:26 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-22384084 (0.2147483647.1034)
01/14/12 21:14:26 BAD EVENT: job (0.2147483647.1034) submitted, submit count != 1 (2)
01/14/12 21:14:26 Continuing with DAG in spite of bad event (BAD EVENT: job (0.2147483647.1034) submitted, submit count != 1 (2)) because of allow_events setting
01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-22384084 (0.2147483647.1034)
01/14/12 21:14:26 Node alertadd_scen0.sim-22384084 job proc (0.2147483647.1034) completed successfully.
01/14/12 21:14:26 Node alertadd_scen0.sim-22384084 job completed
01/14/12 21:14:26 Number of idle job procs: 1427
01/14/12 21:14:26 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-763834499 (0.2147483647.1035)
01/14/12 21:14:26 Number of idle job procs: 1428
01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-763834499 (0.2147483647.1035)
01/14/12 21:14:26 Node alertadd_scen0.sim-763834499 job proc (0.2147483647.1035) completed successfully.
01/14/12 21:14:26 Node alertadd_scen0.sim-763834499 job completed
01/14/12 21:14:26 Number of idle job procs: 1427
01/14/12 21:14:26 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-140979186 (0.2147483647.1036)
01/14/12 21:14:26 Number of idle job procs: 1428
01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-140979186 (0.2147483647.1036)
01/14/12 21:14:26 Node alertadd_scen0.sim-140979186 job proc (0.2147483647.1036) completed successfully.
01/14/12 21:14:26 Node alertadd_scen0.sim-140979186 job completed
01/14/12 21:14:26 Number of idle job procs: 1427
01/14/12 21:14:26 Event: ULOG_ATTRIBUTE_UPDATE for Condor Node reprun_scen0.sim-801761742 (4027.0.0)
01/14/12 21:14:26 Aborting DAG...
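
In case it helps with triage, here is a minimal Python sketch (not part of Condor) that separates the bad events DAGMan reported as fatal from the ones it tolerated. It assumes the message wording matches the log above; other versions may phrase these lines differently. The function name `triage_bad_events` and the sample lines are just for illustration.

```python
import re

# Timestamp prefix and "BAD EVENT" wording as they appear in the log above.
STAMP = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) ")
BAD = re.compile(r"BAD EVENT: job \(([\d.]+)\) (\w+),")


def triage_bad_events(lines):
    """Split bad events into (fatal, tolerated) lists.

    Each entry is a (timestamp, job_id, event_word) tuple, for example
    ("01/14/12 21:14:25", "5886.0.0", "executing").
    """
    fatal, tolerated = [], []
    for line in lines:
        bad = BAD.search(line)
        if bad is None:
            continue
        stamp = STAMP.match(line)
        record = (stamp.group(1) if stamp else None, bad.group(1), bad.group(2))
        if "aborting DAG because of bad event" in line:
            fatal.append(record)
        elif "Continuing with DAG in spite of bad event" in line:
            tolerated.append(record)
        # A bare "BAD EVENT" line is repeated inside the line that follows it,
        # so it is skipped here to avoid counting the same event twice.
    return fatal, tolerated


if __name__ == "__main__":
    # Three lines copied from the log above, used as a small sample.
    sample = [
        "01/14/12 21:14:25 BAD EVENT: job (5886.0.0) executing, total end count != 0 (1)",
        "01/14/12 21:14:25 ERROR: aborting DAG because of bad event (BAD EVENT: job (5886.0.0) executing, total end count != 0 (1))",
        "01/14/12 21:14:26 Continuing with DAG in spite of bad event (BAD EVENT: job (5886.0.0) ended, total end count != 1 (2)) because of allow_events setting",
    ]
    fatal, tolerated = triage_bad_events(sample)
    print("fatal:", fatal)
    print("tolerated:", tolerated)
```

To run it over a whole log, pass an open file handle for the dagman.out file instead of `sample`; the function only needs an iterable of lines.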