[HTCondor-users] Trying to overcommit CPUs
- Date: Wed, 04 Dec 2013 21:39:57 +0000
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: [HTCondor-users] Trying to overcommit CPUs
[Using HTCondor 8.0.4 from the Debian Wheezy package, running under Ubuntu
12.04]
The types of jobs I run tend to wait on I/O quite a lot, and therefore
the CPUs are idle part of the time. So I'd like to allow more concurrent
jobs than there are CPUs (or threads), in proportion to the CPUs available.
I am using dynamic slots. If I set "cpus=150%", like this:
COUNT_HYPERTHREAD_CPUS = True
SLOT_TYPE_1 = cpus=150%, ram=90%, swap=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1
then startd repeatedly crashes. MasterLog shows:
12/04/13 21:10:32 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1382218101)
12/04/13 21:10:33 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2924
12/04/13 21:10:33 DefaultReaper unexpectedly called on pid 2924, status 1024.
12/04/13 21:10:33 The STARTD (pid 2924) exited with status 4
12/04/13 21:10:33 Sending obituary for "/usr/sbin/condor_startd"
12/04/13 21:10:33 restarting /usr/sbin/condor_startd in 10 seconds
12/04/13 21:10:43 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2951
12/04/13 21:10:43 DefaultReaper unexpectedly called on pid 2951, status 1024.
12/04/13 21:10:43 The STARTD (pid 2951) exited with status 4
12/04/13 21:10:43 Sending obituary for "/usr/sbin/condor_startd"
12/04/13 21:10:43 restarting /usr/sbin/condor_startd in 11 seconds
^C
And StartLog shows:
12/04/13 21:10:33 ERROR: Can't allocate 1st slot of type 1
Requesting: slot type 1: Cpus: 48, Memory: 58083, Swap: 100.00%, Disk: 100.00%
Available: Slot #1: Cpus: 32, Memory: 64537, Swap: 100.00%, Disk: 100.00%
12/04/13 21:10:33 ERROR "Ran out of system resources" at line 122 in file /slots/04/dir_478/userdir/src/condor_startd.V6/slot_builder.cpp
So then I tried setting
request_cpus = 0.66
in the job submit file. But this doesn't work; each dynamic slot still
gets 1 CPU. Indeed, even "request_cpus = 0" gives the same result.
After some more digging, setting
NUM_CPUS = $(DETECTED_CPUS)*1.5
does seem to work.
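Put together, that workaround might look like the fragment below. This is only a sketch: it assumes the slot type is put back to cpus=100% (so the single partitionable slot claims all of the inflated count rather than 150% of it), and it takes on trust that this startd version evaluates the arithmetic in NUM_CPUS. With 32 detected CPUs it should advertise 48, which is the number the failed cpus=150% request was asking for.

```
# Sketch only; the cpus=100% line is an assumption, not taken from the tested setup.
COUNT_HYPERTHREAD_CPUS = True
# Advertise 1.5x the detected CPU count (32 detected -> 48 advertised).
NUM_CPUS = $(DETECTED_CPUS)*1.5
# One partitionable slot that owns everything the startd now advertises.
SLOT_TYPE_1 = cpus=100%, ram=90%, swap=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1
```

Jobs would then leave request_cpus at its default of 1, since fractional requests did not take effect in the test above.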
Is there a better way, or something I've overlooked?
Thanks,
Brian.