Hi all,

I've got an issue where, with a sufficient number of jobs in a DAG, DAGMan continually crashes and restarts. There are 1900 jobs in the DAG, and about 500 complete successfully. In the end, the only thing left in my queue is the DAG itself.

10/27 10:17:20 Parsing C:\temp\condor\condor_68353.dag ...
10/27 10:17:21 Dag contains 1903 total jobs
10/27 10:17:21 Lock file C:\temp\condor\condor_68353.dag.lock detected,
10/27 10:17:21 Duplicate DAGMan PID 5708 is no longer alive; this DAGMan should continue.
10/27 10:17:21 Sleeping for 12 seconds to ensure ProcessId uniqueness
10/27 10:17:33 WARNING: ProcessId not confirmed unique
10/27 10:17:33 Bootstrapping...
10/27 10:17:33 Number of pre-completed nodes: 0
10/27 10:17:33 Running in RECOVERY mode...
10/27 10:17:36 ******************************************************
10/27 10:17:36 ** condor_scheduniv_exec.4250.0 (CONDOR_DAGMAN) STARTING UP
10/27 10:17:36 ** C:\condor\bin\condor_dagman.exe
10/27 10:17:36 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
10/27 10:17:36 ** $CondorPlatform: INTEL-WINNT50 $
10/27 10:17:36 ** PID = 1948
10/27 10:17:37 ** Log last touched 10/27 09:17:34
10/27 10:17:37 ******************************************************
10/27 10:17:37 Using config source: C:\condor\condor_config
10/27 10:17:37 Using local config sources:
10/27 10:17:37    C:\condor\condor_config.local
10/27 10:17:37 DaemonCore: Command Socket at <10.10.242.54:1795>
10/27 10:17:37 DAGMAN_SUBMIT_DELAY setting: 0
10/27 10:17:37 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
10/27 10:17:37 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
10/27 10:17:37 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
10/27 10:17:37 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
10/27 10:17:37 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
10/27 10:17:37 DAGMAN_RETRY_NODE_FIRST setting: 0
10/27 10:17:37 DAGMAN_MAX_JOBS_IDLE setting: 0
10/27 10:17:37 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
10/27 10:17:37 DAGMAN_MUNGE_NODE_NAMES setting: 1
10/27 10:17:37 DAGMAN_DELETE_OLD_LOGS setting: 1
10/27 10:17:37 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
10/27 10:17:37 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
10/27 10:17:37 DAGMAN_ABORT_DUPLICATES setting: 1
10/27 10:17:37 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
10/27 10:17:37 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
10/27 10:17:37 argv[0] == "condor_scheduniv_exec.4250.0"
10/27 10:17:37 argv[1] == "-Debug"
10/27 10:17:37 argv[2] == "3"
10/27 10:17:37 argv[3] == "-Lockfile"
10/27 10:17:37 argv[4] == "C:\temp\condor\condor_68353.dag.lock"
10/27 10:17:37 argv[5] == "-Condorlog"
10/27 10:17:37 argv[6] == "C:\temp\condor\condor_49152.log"
10/27 10:17:37 argv[7] == "-Dag"
10/27 10:17:37 argv[8] == "C:\temp\condor\condor_68353.dag"
10/27 10:17:37 argv[9] == "-Rescue"
10/27 10:17:37 argv[10] == "C:\temp\condor\condor_68353.dag.rescue"
10/27 10:17:37 DAG Lockfile will be written to C:\temp\condor\condor_68353.dag.lock
10/27 10:17:37 DAG Input file is C:\temp\condor\condor_68353.dag
10/27 10:17:37 Rescue DAG will be written to C:\temp\condor\condor_68353.dag.rescue

... then it lists all of the log files:

10/27 10:17:38 C:\temp\condor\condor_49152.log (Condor)
10/27 10:17:38 C:\temp\condor\condor_81924.log (Condor)
...

Then it repeats all of this a few seconds later ... this log grew huge! :)

Should I increase the maxjobs setting in the DAG submission to get this rolling? Sorry to ask such a general question, but I'm hoping somebody can explain what's going on in this case, or in cases like this. (This is with Condor 7.0.4, so I'm a few minor releases behind -- maybe it's time to upgrade.)

Appreciate the help as always :).

Steve
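
P.S. For clarity, this is what I mean by increasing maxjobs -- just a sketch of what I'd try; the value 100 is a guess on my part, not something I've tested:

# Throttle how many node jobs DAGMan has in the queue at once,
# via the condor_submit_dag command line (100 is only an example value):
condor_submit_dag -maxjobs 100 C:\temp\condor\condor_68353.dag

# Or the equivalent knob in condor_config -- my log above shows
# DAGMAN_MAX_JOBS_SUBMITTED is currently 0, i.e. unlimited:
# DAGMAN_MAX_JOBS_SUBMITTED = 100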