Thanks for the quick response, Kent. I tried the 7.2.4 release and sent 1000 Python jobs in a single no-dependency DAG; each job just created a file and exited. After running condor_submit_dag on the DAG, I got 509 files back, and then my DAG job appeared to get stuck and started idling (with the 7.0.4 build, I could swear that it remained 'running' but showed the same behavior).

Looking at the lib.err file for the DAG, it had this error:

dprintf() had a fatal error in pid 8620
Can't open "bigjob.dag.dagman.out"
errno: 24 (Too many open files)

and the bigjob.dag.dagman.out file it mentions keeps growing with output like the following:

10/27 21:13:44 Parsing 1 dagfiles
10/27 21:13:44 Parsing bigjob.dag ...
10/27 21:13:44 Dag contains 1000 total jobs
10/27 21:13:44 Lock file bigjob.dag.lock detected,
10/27 21:13:44 Duplicate DAGMan PID 10964 is no longer alive; this DAGMan should continue.
10/27 21:13:44 Sleeping for 12 seconds to ensure ProcessId uniqueness
10/27 21:13:56 WARNING: ProcessId not confirmed unique
10/27 21:13:56 Bootstrapping...
10/27 21:13:56 Number of pre-completed nodes: 0
10/27 21:13:56 Running in RECOVERY mode...
10/27 21:18:33 ******************************************************
10/27 21:18:33 ** condor_scheduniv_exec.21.0 (CONDOR_DAGMAN) STARTING UP
10/27 21:18:33 ** C:\condor\bin\condor_dagman.exe
10/27 21:18:33 ** SubsystemInfo: name=DAGMAN type=DAEMON(10) class=DAEMON(1)
10/27 21:18:33 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
10/27 21:18:33 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 21:18:33 ** $CondorPlatform: INTEL-WINNT50 $
10/27 21:18:33 ** PID = 10284
10/27 21:18:33 ** Log last touched 10/27 20:13:56
10/27 21:18:33 ******************************************************
10/27 21:18:33 Using config source: C:\condor\condor_config
10/27 21:18:33 Using local config sources:
10/27 21:18:33    C:\condor\condor_config.local
10/27 21:18:33 DaemonCore: Command Socket at <10.10.242.111:4214>
10/27 21:18:33 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
10/27 21:18:33 DAGMAN_DEBUG_CACHE_ENABLE setting: False
10/27 21:18:33 DAGMAN_SUBMIT_DELAY setting: 0
10/27 21:18:33 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
10/27 21:18:34 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
10/27 21:18:34 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
10/27 21:18:34 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
10/27 21:18:34 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
10/27 21:18:34 DAGMAN_RETRY_NODE_FIRST setting: 0
10/27 21:18:34 DAGMAN_MAX_JOBS_IDLE setting: 0
10/27 21:18:34 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
10/27 21:18:34 DAGMAN_MUNGE_NODE_NAMES setting: 1
10/27 21:18:34 DAGMAN_DELETE_OLD_LOGS setting: 1
10/27 21:18:34 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
10/27 21:18:34 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
10/27 21:18:34 DAGMAN_ABORT_DUPLICATES setting: 1
10/27 21:18:34 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
10/27 21:18:34 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
10/27 21:18:34 DAGMAN_AUTO_RESCUE setting: 1
10/27 21:18:34 DAGMAN_MAX_RESCUE_NUM setting: 100
10/27 21:18:34 argv[0] == "condor_scheduniv_exec.21.0"
10/27 21:18:34 argv[1] == "-Debug"
10/27 21:18:34 argv[2] == "3"
10/27 21:18:34 argv[3] == "-Lockfile"
10/27 21:18:34 argv[4] == "bigjob.dag.lock"
10/27 21:18:34 argv[5] == "-AutoRescue"
10/27 21:18:34 argv[6] == "1"
10/27 21:18:34 argv[7] == "-DoRescueFrom"
10/27 21:18:34 argv[8] == "0"
10/27 21:18:34 argv[9] == "-Dag"
10/27 21:18:34 argv[10] == "bigjob.dag"
10/27 21:18:34 argv[11] == "-CsdVersion"
10/27 21:18:34 argv[12] == "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
10/27 21:18:34 DAG Lockfile will be written to bigjob.dag.lock
10/27 21:18:34 DAG Input file is bigjob.dag
10/27 21:18:34 All DAG node user log files:
10/27 21:18:34    C:\condor\jobs\bigjob\bigjob1.log (Condor)
10/27 21:18:34    C:\condor\jobs\bigjob\bigjob2.log (Condor)
etc...
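In case it helps, here is roughly how I generate the DAG and the per-node submit files. This is a stripped-down sketch, not my literal code -- the Python path, the make_file.py worker script, and the node names are stand-ins -- but the one-user-log-file-per-node layout matches the listing above:

# Stripped-down sketch of the DAG generator (names are stand-ins):
# 1000 independent nodes, each a tiny Python job that creates one
# file and exits, and each with its OWN user log file.

N_JOBS = 1000

with open("bigjob.dag", "w") as dag:
    for i in range(1, N_JOBS + 1):
        subfile = "bigjob%d.sub" % i
        with open(subfile, "w") as sub:
            sub.write(
                "universe   = vanilla\n"
                "executable = C:\\Python25\\python.exe\n"   # stand-in path
                "arguments  = make_file.py out%d.txt\n"     # hypothetical worker script
                "log        = C:\\condor\\jobs\\bigjob\\bigjob%d.log\n"
                "queue\n" % (i, i)
            )
        dag.write("JOB job%d %s\n" % (i, subfile))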
I figure I must be doing something wrong with some configuration setting on my DAG submission, or that there is some limit on how big a DAG can or should be. Should I just split the DAG into smaller groups of jobs in the future?

Appreciate any suggestions, again, as always :),
Steve
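P.S. One thing I noticed while writing this up: dagman.out lists a separate user log file for every node, and 509 finished jobs plus DAGMan's own open files is suspiciously close to 512, which I believe is the Windows C runtime's default limit on open stdio streams. So I'm wondering whether pointing all of the nodes at a single shared log file would keep DAGMan under that limit. A sketch of what I mean, using the same stand-in names as the generator above (only the log line changes):

# Same generator, but every node shares ONE user log file, so DAGMan
# holds a single log open instead of 1000.

N_JOBS = 1000
SHARED_LOG = "C:\\condor\\jobs\\bigjob\\bigjob.log"   # hypothetical shared path

with open("bigjob.dag", "w") as dag:
    for i in range(1, N_JOBS + 1):
        subfile = "bigjob%d.sub" % i
        with open(subfile, "w") as sub:
            sub.write(
                "universe   = vanilla\n"
                "executable = C:\\Python25\\python.exe\n"   # stand-in path
                "arguments  = make_file.py out%d.txt\n"     # hypothetical worker script
                "log        = %s\n"                         # the only change: shared log
                "queue\n" % (i, SHARED_LOG)
            )
        dag.write("JOB job%d %s\n" % (i, subfile))

If that idea is misguided, splitting the DAG up as I asked above is my fallback.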
"C:\temp\condor\condor_68353.dag.lock" > > 10/27 10:17:37 argv[5] == "-Condorlog" > > 10/27 10:17:37 argv[6] == "C:\temp\condor\condor_49152.log" > > 10/27 10:17:37 argv[7] == "-Dag" > > 10/27 10:17:37 argv[8] == "C:\temp\condor\condor_68353.dag" > > 10/27 10:17:37 argv[9] == "-Rescue" > > 10/27 10:17:37 argv[10] == "C:\temp\condor\condor_68353.dag.rescue" > > 10/27 10:17:37 DAG Lockfile will be written to C:\temp\condor\condor_68353.dag.lock > > 10/27 10:17:37 DAG Input file is C:\temp\condor\condor_68353.dag > > 10/27 10:17:37 Rescue DAG will be written to C:\temp\condor\condor_68353.dag.rescue > > > > ... then it lists all of the log files: > > 10/27 10:17:38 C:\temp\condor\condor_49152.log (Condor) > > 10/27 10:17:38 C:\temp\condor\condor_81924.log (Condor) > > ... > > > > Then repeat all this seconds later ... this log grew huge ! :) > > > > Should I increase the maxjobs in the condor dag submission to get this > > rolling? Sorry to ask such a general question, but I'm hoping somebody > > can explain to me what's going on in this case or cases like this? > > > > (This is with condor 7.0.4, so I'm back a few minor releases -- maybe > > its time to upgrade). > > Hmm -- 7.0.4 *is* pretty old. I'd say the first thing to try is > installing newer condor_dagman and condor_submit_dag binaries. You can > just upgrade those two binaries without upgrading the rest of your Condor > installation if you want to. > > I'd recommend going to either 7.2.4 (if you want to stay with a stable > release) or 7.3.1. (7.3.2 has problem with rescue DAGs, which has been > fixed for the upcoming 7.4.0.) > > If you still get the problem with a newer DAGMan version, please let us > know and we'll look inth things further. > > Kent Wenger > Condor Team > _______________________________________________ > Condor-users mailing list > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a > subject: Unsubscribe > You can also unsubscribe by visiting > https://lists.cs.wisc.edu/mailman/listinfo/condor-users > > The archives can be found at: > https://lists.cs.wisc.edu/archive/condor-users/ CDN College or University student? Get Windows 7 for only $39.99 before Jan 3! Buy it now! |