[HTCondor-users] schedd getting more than max_jobs_running running
- Date: Wed, 25 Mar 2015 12:42:34 -0500
- From: Joe Boyd <boyd@xxxxxxxx>
- Subject: [HTCondor-users] schedd getting more than max_jobs_running running
Hi condor-users,
We have MAX_JOBS_RUNNING set to:
[root@fifebatch1 condor]# condor_config_val -v MAX_JOBS_RUNNING
MAX_JOBS_RUNNING = 10000
# at: /etc/condor/config.d/02_gwms_schedds.config, line 9
# raw: MAX_JOBS_RUNNING = 10000
[root@fifebatch1 condor]# ls -al /etc/condor/config.d/02_gwms_schedds.config
-rw-r--r-- 1 root root 3528 Feb 26 14:43 /etc/condor/config.d/02_gwms_schedds.config
[root@fifebatch1 condor]# grep MAX_JOBS_RUNNING /etc/condor/config.d/02_gwms_schedds.config
MAX_JOBS_RUNNING = 10000
Grepping in our schedd log there are lines like:
SchedLog.20150324T161949:03/24/15 08:17:11 (pid:2002) Preempting 66 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:27:13 (pid:2002) Preempting 88 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:32:10 (pid:2002) Preempting 10 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:10 (pid:2002) Preempting 38 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:36 (pid:2002) Preempting 13 jobs due to MAX_JOBS_RUNNING change
The manual at:
http://research.cs.wisc.edu/htcondor/manual/v8.3/3_3Configuration.html#21897
says:
Changing this setting to be below the current number of jobs that are
running will cause running jobs to be aborted until the number running
is within the limit.
My problem is that we are NOT changing the value (see the config file
timestamp above). We manage the config with Puppet, but we are certainly
not running Puppet roughly every 25 seconds, which is the spacing of the
last two log lines above, so it can't even be some craziness there.
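
For anyone who wants to check the timing, the gaps between the preemption
events and the total number of jobs preempted can be pulled out of the grep
output with a short script. This is only a sketch: it assumes the log line
format quoted above (and tolerates the line wrapping added by email), and
the name parse_preemptions is made up for illustration, not part of
HTCondor.

```python
import re
from datetime import datetime

# Assumes the SchedLog format quoted in this message. The \s+ between
# "jobs" and "due" lets the pattern match lines that were wrapped.
PREEMPT_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) "
    r"Preempting (\d+) jobs\s+due to MAX_JOBS_RUNNING change"
)


def parse_preemptions(text):
    """Return a list of (timestamp, job_count), one per preemption line."""
    events = []
    for match in PREEMPT_RE.finditer(text):
        when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        events.append((when, int(match.group(2))))
    return events


# The five log lines from this message, as grep printed them.
SAMPLE = """\
SchedLog.20150324T161949:03/24/15 08:17:11 (pid:2002) Preempting 66 jobs
due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:27:13 (pid:2002) Preempting 88 jobs
due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:32:10 (pid:2002) Preempting 10 jobs
due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:10 (pid:2002) Preempting 38 jobs
due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:36 (pid:2002) Preempting 13 jobs
due to MAX_JOBS_RUNNING change
"""

events = parse_preemptions(SAMPLE)

# Total jobs preempted across all matched lines.
total = sum(count for _, count in events)

# Seconds between consecutive preemption events.
gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]

print("events:", len(events))
print("total preempted:", total)
print("gaps (s):", gaps)
```

Running it against the real SchedLog files instead of SAMPLE would show
whether the preemptions line up with any periodic activity on the host.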
I thought I remembered reading somewhere that the schedd may in fact get
more than MAX_JOBS_RUNNING jobs started because of the way it works,
which is fine with me. But I thought it then simply didn't start any
more until the number dropped back below the limit. Instead, it seems to
be running more than 10k jobs and then proceeding to kill them.
Am I wrong?
joe