Mailing List Archives
[Condor-users] condor_dagman.exe in idle after submit jobs completed
- Date: Tue, 14 Jun 2011 08:09:45 -0600
- From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
- Subject: [Condor-users] condor_dagman.exe in idle after submit jobs completed
I am running a DAG that has been going for approximately 6 days. All of
the submitted jobs completed last night, but condor_dagman.exe has not
exited and is sitting in the idle state. Over the 6 days I noticed that
condor_dagman.exe would transition in and out of idle (although jobs were
always running). Has anyone else had similar problems? Our pool consists
of Windows machines only.
Is there a way to get the DAG to complete, and does anyone have any ideas
as to what might be causing this?
thanks,
mike
Here is an excerpt from the dagman.out file, but I do not see any
problems.
06/14/11 07:55:20 ******************************************************
06/14/11 07:55:20 ** condor_scheduniv_exec.55529.0 (CONDOR_DAGMAN) STARTING UP
06/14/11 07:55:20 ** C:\Condor\bin\condor_dagman.exe
06/14/11 07:55:20 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
06/14/11 07:55:20 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
06/14/11 07:55:20 ** $CondorVersion: 7.6.0 Apr 16 2011 BuildID: 327460 $
06/14/11 07:55:20 ** $CondorPlatform: x86_winnt_5.1 $
06/14/11 07:55:20 ** PID = 2140
06/14/11 07:55:20 ** Log last touched 6/14 06:50:31
06/14/11 07:55:20 ******************************************************
06/14/11 07:55:20 Using config source: \\igskbacbfssim\condor$\Secured_Config\Condor_Config\Global\FORTcondor_config
06/14/11 07:55:20 Using local config sources:
06/14/11 07:55:20 \\igskbacbfssim\condor$\Secured_Config\Condor_Config\Local\condor_config_IGSKBACBWS407.local
06/14/11 07:55:20 LISTEN <IP> fd=612
06/14/11 07:55:20 CONNECT bound to <IP> fd=608 peer=<IP>
06/14/11 07:55:20 ACCEPT bound to <IP> fd=32 peer=<IP>
06/14/11 07:55:20 CLOSE <IP> fd=612
06/14/11 07:55:20 LISTEN <IP> fd=612
06/14/11 07:55:20 DaemonCore: private command socket at <IP>
06/14/11 07:55:20 Setting maximum accepts per cycle 4.
06/14/11 07:55:20 DAGMAN_VERBOSITY setting: 3
06/14/11 07:55:20 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
06/14/11 07:55:20 DAGMAN_DEBUG_CACHE_ENABLE setting: False
06/14/11 07:55:20 DAGMAN_SUBMIT_DELAY setting: 0
06/14/11 07:55:20 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
06/14/11 07:55:20 DAGMAN_STARTUP_CYCLE_DETECT setting: False
06/14/11 07:55:20 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
06/14/11 07:55:20 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
06/14/11 07:55:20 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
06/14/11 07:55:20 DAGMAN_RETRY_SUBMIT_FIRST setting: True
06/14/11 07:55:20 DAGMAN_RETRY_NODE_FIRST setting: False
06/14/11 07:55:20 DAGMAN_MAX_JOBS_IDLE setting: 0
06/14/11 07:55:20 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
06/14/11 07:55:20 DAGMAN_MAX_PRE_SCRIPTS setting: 0
06/14/11 07:55:20 DAGMAN_MAX_POST_SCRIPTS setting: 0
06/14/11 07:55:20 DAGMAN_ALLOW_LOG_ERROR setting: False
06/14/11 07:55:20 DAGMAN_MUNGE_NODE_NAMES setting: True
06/14/11 07:55:20 DAGMAN_PROHIBIT_MULTI_JOBS setting: False
06/14/11 07:55:20 DAGMAN_SUBMIT_DEPTH_FIRST setting: False
06/14/11 07:55:20 DAGMAN_ABORT_DUPLICATES setting: True
06/14/11 07:55:20 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True
06/14/11 07:55:20 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
06/14/11 07:55:20 DAGMAN_AUTO_RESCUE setting: True
06/14/11 07:55:20 DAGMAN_MAX_RESCUE_NUM setting: 100
06/14/11 07:55:20 DAGMAN_DEFAULT_NODE_LOG setting: null
06/14/11 07:55:20 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True
06/14/11 07:55:20 ALL_DEBUG setting: D_COMMAND D_NETWORK
06/14/11 07:55:20 DAGMAN_DEBUG setting:
06/14/11 07:55:20 argv[0] == "condor_scheduniv_exec.55529.0"
06/14/11 07:55:20 argv[1] == "-Lockfile"
06/14/11 07:55:20 argv[2] == "GSLIB_DAG.dag.lock"
06/14/11 07:55:20 argv[3] == "-AutoRescue"
06/14/11 07:55:20 argv[4] == "1"
06/14/11 07:55:20 argv[5] == "-DoRescueFrom"
06/14/11 07:55:20 argv[6] == "0"
06/14/11 07:55:20 argv[7] == "-Dag"
06/14/11 07:55:20 argv[8] == "GSLIB_DAG.dag"
06/14/11 07:55:20 argv[9] == "-CsdVersion"
06/14/11 07:55:20 argv[10] == "$CondorVersion: 7.6.0 Apr 16 2011 BuildID: 327460 $"
06/14/11 07:55:20 argv[11] == "-Dagman"
06/14/11 07:55:20 argv[12] == "C:\Condor\bin\condor_dagman.exe"
06/14/11 07:55:20 Default node log file is: <\\igskbacbfssim\gissim$\PrjRas\CondorFiles\Submits\Simulations_Step4\GSLIB_DAG.dag.nodes.log>
06/14/11 07:55:20 DAG Lockfile will be written to GSLIB_DAG.dag.lock
06/14/11 07:55:20 DAG Input file is GSLIB_DAG.dag
06/14/11 07:55:20 Parsing 1 dagfiles
06/14/11 07:55:20 Parsing GSLIB_DAG.dag ...
06/14/11 07:55:20 Dag contains 1080 total jobs
06/14/11 07:55:20 Lock file GSLIB_DAG.dag.lock detected,
06/14/11 07:55:20 Duplicate DAGMan PID 796 is no longer alive; this DAGMan should continue.
06/14/11 07:55:20 Sleeping for 12 seconds to ensure ProcessId uniqueness
06/14/11 07:55:32 WARNING: ProcessId not confirmed unique
06/14/11 07:55:32 Bootstrapping...
06/14/11 07:55:32 Number of pre-completed nodes: 0
06/14/11 07:55:32 Running in RECOVERY mode...
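
The excerpt above shows DAGMan starting up, finding a lock file left by a
PID that is no longer alive, and entering recovery mode. One way to see
whether this has happened repeatedly over the life of the DAG is to count
those messages across the whole dagman.out file. The following is only a
minimal sketch in Python, assuming the log format shown in the excerpt; the
function name summarize_dagman_log and the regular expressions are
illustrative and are not part of Condor.

```python
import re
from collections import Counter

# Timestamp prefix as it appears in the excerpt, e.g. "06/14/11 07:55:20".
STAMP = r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}"

# Messages copied from the excerpt; other Condor versions may word them differently.
EVENTS = {
    "startup": re.compile(r"\(CONDOR_DAGMAN\)"),
    "stale_lock": re.compile(r"Duplicate DAGMan PID (\d+) is no longer alive"),
    "recovery": re.compile(r"Running in RECOVERY mode"),
}

# Simple "NAME setting: value" lines. The multi-name allow_events line
# does not fit this shape and is skipped.
SETTING = re.compile(rf"^{STAMP} (\w+) setting: ?(.*)$")


def summarize_dagman_log(lines):
    """Count startup/recovery events and collect settings from log lines."""
    counts = Counter()
    stale_pids = []
    settings = {}
    for line in lines:
        line = line.rstrip("\n")
        for name, pattern in EVENTS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "stale_lock":
                    stale_pids.append(int(match.group(1)))
        match = SETTING.match(line)
        if match:
            # A later startup overwrites the value recorded by an earlier one.
            settings[match.group(1)] = match.group(2).strip()
    return {
        "counts": dict(counts),
        "stale_pids": stale_pids,
        "settings": settings,
    }


# Example use (the file name is an assumption; substitute the real path):
#
#     with open("dagman.out", errors="replace") as handle:
#         print(summarize_dagman_log(handle))
```

A startup count well above one, each followed by a stale-lock and a
recovery message, would suggest that DAGMan itself is being restarted
rather than simply waiting on jobs. The sketch only counts messages; it
does not explain why a restart occurred.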