Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] Trying to overcommit CPUs

Date: Wed, 04 Dec 2013 21:39:57 +0000
From: Brian Candler <b.candler@xxxxxxxxx>
Subject: [HTCondor-users] Trying to overcommit CPUs

[Using HTCondor 8.0.4 from the Debian Wheezy package, running under Ubuntu 12.04]

The types of jobs I run tend to wait on I/O quite a lot, and therefore the CPUs are idle part of the time. So I'd like to allow more concurrent jobs than there are CPUs (or threads), in proportion to the CPUs available.


I am using dynamic slots. If I set "cpus=150%", like this:

COUNT_HYPERTHREAD_CPUS = True
SLOT_TYPE_1 = cpus=150%, ram=90%, swap=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1

then startd repeatedly crashes. MasterLog shows:

12/04/13 21:10:32 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1382218101)
12/04/13 21:10:33 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2924
12/04/13 21:10:33 DefaultReaper unexpectedly called on pid 2924, status 1024.
12/04/13 21:10:33 The STARTD (pid 2924) exited with status 4
12/04/13 21:10:33 Sending obituary for "/usr/sbin/condor_startd"
12/04/13 21:10:33 restarting /usr/sbin/condor_startd in 10 seconds

12/04/13 21:10:43 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2951
12/04/13 21:10:43 DefaultReaper unexpectedly called on pid 2951, status 1024.
12/04/13 21:10:43 The STARTD (pid 2951) exited with status 4
12/04/13 21:10:43 Sending obituary for "/usr/sbin/condor_startd"
12/04/13 21:10:43 restarting /usr/sbin/condor_startd in 11 seconds
^C

And StartLog shows:

12/04/13 21:10:33 ERROR: Can't allocate 1st slot of type 1
Requesting: slot type 1: Cpus: 48, Memory: 58083, Swap: 100.00%, Disk: 100.00%
Available: Slot #1: Cpus: 32, Memory: 64537, Swap: 100.00%, Disk: 100.00%
12/04/13 21:10:33 ERROR "Ran out of system resources" at line 122 in file /slots/04/dir_478/userdir/src/condor_startd.V6/slot_builder.cpp
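The numbers in that StartLog excerpt are consistent with the percentages being applied to the detected totals: 150% of 32 CPUs is 48, and 90% of 64537 MB is 58083 MB. The following Python sketch only reproduces that arithmetic. It is not taken from the HTCondor source; the function names and the truncate-to-integer rule are assumptions made for illustration.

```python
def requested_amount(spec: str, detected: int) -> int:
    """Resolve a slot resource spec such as '150%' or '48' to an absolute amount.

    Hypothetical helper: truncation to an integer is an assumption.
    """
    spec = spec.strip()
    if spec.endswith("%"):
        return int(detected * float(spec[:-1]) / 100)
    return int(spec)


def can_allocate(requested: int, available: int) -> bool:
    """A slot can only be built if the request fits in what the machine has."""
    return requested <= available


# Values taken from the log excerpt above.
detected_cpus = 32
detected_memory = 64537

cpus = requested_amount("150%", detected_cpus)
memory = requested_amount("90%", detected_memory)

print("Requesting: Cpus:", cpus, "Memory:", memory)
print("CPU request fits:", can_allocate(cpus, detected_cpus))
print("Memory request fits:", can_allocate(memory, detected_memory))
```

Under these assumptions the memory request fits but the CPU request does not, which matches the "Can't allocate 1st slot of type 1" error.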


So then I tried setting
request_cpus = 0.66

in the job submit file. But this doesn't work; each slot uses 1 CPU. Indeed, even "request_cpus = 0" gives this.


After some more digging, setting
NUM_CPUS = $(DETECTED_CPUS)*1.5
does seem to work.
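To make the difference between the two attempts concrete, here is a small Python model of the behaviour described above. It is a sketch of the observations only, not of HTCondor's implementation: the function names are hypothetical, the "at least 1 CPU per job" rule is inferred from request_cpus = 0.66 and request_cpus = 0 both consuming 1 CPU, and truncating the scaled CPU count to an integer is an assumption.

```python
def advertised_cpus(detected: int, factor: float) -> int:
    """CPU count the machine would advertise with NUM_CPUS = detected * factor.

    Assumption: the product is truncated to an integer.
    """
    return int(detected * factor)


def effective_request(request_cpus: float) -> int:
    """Model of the observed behaviour: every job consumed at least 1 CPU."""
    return max(1, int(request_cpus))


def concurrent_jobs(total_cpus: int, request_cpus: float) -> int:
    """How many jobs fit in a partitionable slot with total_cpus CPUs."""
    return total_cpus // effective_request(request_cpus)


detected = 32  # CPU count from the StartLog excerpt

# Fractional requests do not help: each job still takes a whole CPU.
print(concurrent_jobs(detected, 0.66))

# Inflating the advertised CPU count does raise the job limit.
print(concurrent_jobs(advertised_cpus(detected, 1.5), 1))
```

In this model the fractional request leaves the limit at the detected CPU count, while scaling the advertised count by 1.5 raises it in proportion, which is the overcommit effect being sought.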

Is there a better way, or something I've overlooked?

Thanks,

Brian.

Follow-Ups:
- Re: [HTCondor-users] Trying to overcommit CPUs
  - From: Matthew Farrellee
