Mailing List Archives
[Condor-users] dagman does not submit ready jobs, how to debug?
- Date: Tue, 17 Mar 2009 22:56:06 +0100
- From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
- Subject: [Condor-users] dagman does not submit ready jobs, how to debug?
Hi all,
I thought by now I understood some part of Condor, but it nevertheless
manages to surprise me every now and then ;)
Our machines are using the following DAGMAN settings:
$ condor_config_val -dump |grep DAGMAN
DAGMAN_ABORT_DUPLICATES = TRUE
DAGMAN_COPY_TO_SPOOL = TRUE
DAGMAN_MAX_JOBS_IDLE = 500
DAGMAN_MAX_JOBS_SUBMITTED = 2000
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 200
DAGMAN_PROHIBIT_MULTI_JOBS = TRUE
DAGMAN_SUBMIT_DELAY = 0
DAGMAN_SUBMIT_DEPTH_FIRST = TRUE
Right now I have about 860 test jobs running, and the combined status of
all DAGs (including the überdag) shows that quite a number of jobs are
ready for submission but have not been submitted:
Date  Time      Done  Pre  Queued  Post  Ready  Un-Ready  Failed
3/16  23:11:36   142    .       .     .      .         .       .
3/17  22:41:46  2479    .      31     .    284      1854       .
3/17  22:41:47  2442    .     226     .    485      9110       .
3/17  22:42:23  3487    .     196     .    495      5670       .
3/17  22:42:23  3545    .     135     .    555      5613       .
3/17  22:42:23  3522    .     135     .    592      5596       3
3/17  22:41:46  3620    .      98     .    453      5677       .
(. means 0)
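As a quick sanity check on the table, a small Python sketch can sum each column across all the DAGs. It assumes the whitespace-separated layout shown (date, time, then seven counts, with "." standing for 0); the function names are illustrative only.

```python
# Status lines copied from the table; "." stands for 0.
STATUS_LINES = """\
3/16 23:11:36 142 . . . . . .
3/17 22:41:46 2479 . 31 . 284 1854 .
3/17 22:41:47 2442 . 226 . 485 9110 .
3/17 22:42:23 3487 . 196 . 495 5670 .
3/17 22:42:23 3545 . 135 . 555 5613 .
3/17 22:42:23 3522 . 135 . 592 5596 3
3/17 22:41:46 3620 . 98 . 453 5677 .
"""

COLUMNS = ["Done", "Pre", "Queued", "Post", "Ready", "Un-Ready", "Failed"]


def parse_status(text: str) -> list[dict[str, int]]:
    """Turn each 'date time n n n ...' line into a column -> count dict."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 + len(COLUMNS):
            continue  # skip blank or malformed lines
        counts = [0 if f == "." else int(f) for f in fields[2:]]
        rows.append(dict(zip(COLUMNS, counts)))
    return rows


def totals(rows: list[dict[str, int]]) -> dict[str, int]:
    """Sum every column across all DAGs."""
    return {col: sum(r[col] for r in rows) for col in COLUMNS}


if __name__ == "__main__":
    t = totals(parse_status(STATUS_LINES))
    for col in COLUMNS:
        print(f"{col:>8}: {t[col]}")
```

With these rows, Queued sums to 821 while Ready sums to 2864, so roughly three times as many nodes are waiting in Ready as are currently queued.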
No jobs are idle, I have not yet reached 2000 submitted jobs, and
whatever daemon cycle DAGMAN_MAX_SUBMITS_PER_INTERVAL refers to has
probably passed many times over (I have been watching this for more than
an hour now). So I am running out of ideas as to why only very few jobs
are being submitted.
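The reasoning above can be made concrete by turning the throttles from the config dump into a rough upper bound on submissions per cycle. This is a hypothetical, simplified model written for illustration, not DAGMan's actual logic: it assumes each DAGMan instance applies the three limits to its own counts, that a newly submitted job starts out idle, and that the idle count is 0 for every DAG, as stated in the post. `Throttles` and `submits_allowed` are made-up names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Throttles:
    # Values copied from the condor_config_val dump in this post.
    max_jobs_idle: int = 500             # DAGMAN_MAX_JOBS_IDLE
    max_jobs_submitted: int = 2000       # DAGMAN_MAX_JOBS_SUBMITTED
    max_submits_per_interval: int = 200  # DAGMAN_MAX_SUBMITS_PER_INTERVAL


def submits_allowed(ready: int, queued: int, idle: int, t: Throttles) -> int:
    """Hypothetical upper bound on submissions in one DAGMan cycle.

    Simplified model (an assumption, not DAGMan's source code):
    - each DAGMan instance checks only its own counts,
    - new submissions start idle, so they use up max_jobs_idle headroom,
    - queued jobs never exceed max_jobs_submitted,
    - at most max_submits_per_interval submissions happen per cycle.
    """
    idle_room = max(0, t.max_jobs_idle - idle)
    queue_room = max(0, t.max_jobs_submitted - queued)
    return min(ready, idle_room, queue_room, t.max_submits_per_interval)


if __name__ == "__main__":
    limits = Throttles()
    # (Ready, Queued) pairs from the status table; idle taken as 0.
    for ready, queued in [(284, 31), (485, 226), (495, 196),
                          (555, 135), (592, 135), (453, 98)]:
        print(ready, queued, "->", submits_allowed(ready, queued, 0, limits))
```

Under this model every DAG in the table could submit 200 jobs per cycle, which suggests the configured limits alone, if they behave as modeled, would not explain the backlog of Ready nodes.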
The cluster itself has more than enough slots that are unclaimed or
running backfill.
Any idea where I should continue digging?
Cheers
Carsten