On Sun, Jan 16, 2011 at 8:40 AM, Matthew Farrellee
<matt@xxxxxxxxxx> wrote:
On 01/15/2011 08:52 PM, Erik Aronesty wrote:
I've confirmed on many tests that jobs with RequestCpus > 1 don't seem
to be compatible with dynamic slots.
Is this a condor version issue? I'm running 7.4.4 on x86_64
Our system has many, many jobs that consume between 1-8 cpus and many
SMP machines with 4 and 32 cores.
(I can use condor_qedit and get a job to run on a dynamic slot just by
switching its Cpus to 1. It will not run otherwise ... even if Start=TRUE)
The message from condor_q -analyze is "2 reject your job because of their own
requirements" ... (or however many slots are partitionable).
It would be nice to be able to take a job id and a node, and then ask
for an explanation of why it's not running on that node.
If I run a bunch of jobs with 1 cpu... the dynamic slot works as
advertised... forking off new slots and reclaiming them later... quite
nicely. I even like to leave some slots this way - since they are so
much better about resource utilization... in every other respect.
I've noticed one other thread posting about this, but have never seen a
final solution.
https://www-auth.cs.wisc.edu/lists/condor-users/2009-June/msg00065.shtml
Has anyone gotten dynamic slots to work with RequestCpus > 1... where it
actually decrements the number of cpus from those remaining?
> condor_version
$CondorVersion: 7.2.4 Apr 11 2010 $
$CondorPlatform: X86_64-LINUX_DEBIAN_UNKNOWN $
>condor_status ea-morpheus -l | grep Cpu
CpuIsBusy = false
Cpus = 1
CpuBusyTime = 0
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
TotalCpus = 4
Machine doing nothing:
>cat /srv/condor/ea-morpheus
DAEMON_LIST = MASTER, STARTD
NUM_SLOTS=1
SLOT_TYPE_1=Cpu=4,auto
SLOT_TYPE_1_PARTITIONABLE=TRUE
NUM_SLOTS_TYPE_1=1
START=TRUE
JOB not running:
> condor_q 6490.0 -l | grep Req
AutoClusterAttrs =
"JobUniverse,LastCheckpointPlatform,NumCkpts,RequestCpus,RequestDisk,RequestMemory,FileSystemDomain,DiskUsage,ImageSize,Requirements,NiceUser,ConcurrencyLimits"
RequestDisk = DiskUsage
RequestMemory = 500
RequestCpus = 2
Requirements = ( Memory >= 500 ) && ( TARGET.Arch == "X86_64" ) && (
TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( (
RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.FileSystemDomain ==
MY.FileSystemDomain )
- Erik
Seems to work fine here...
$ rpm -q condor
condor-7.4.2-1.fc13.x86_64
$ condor_version
$CondorVersion: 7.4.2 May 20 2010 BuildID: Fedora-7.4.2-1.fc13 $
$CondorPlatform: X86_64-LINUX_F13 $
$ tail -n5 ~condor/condor_config.local
NUM_SLOTS=1
SLOT_TYPE_1=Cpu=4,auto
SLOT_TYPE_1_PARTITIONABLE=TRUE
NUM_SLOTS_TYPE_1=1
START=TRUE
08:29:45am$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.320 3760 0+00:00:04
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 1 0 0 1 0 0 0
Total 1 0 0 1 0 0 0
08:29:47am$ echo 'cmd=/bin/sleep\nargs=1d\nrequestcpus=2\nqueue 2' | condor_submit
Submitting job(s)..
2 job(s) submitted to cluster 10377.
08:30:27am$ condor_q
-- Submitter: eeyore.local : <192.168.1.100:37322> : eeyore.local
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
10377.0 matt 1/16 08:30 0+00:00:02 R 0 0.0 sleep 1d
10377.1 matt 1/16 08:30 0+00:00:00 I 0 0.0 sleep 1d
2 jobs; 1 idle, 1 running, 0 held
08:30:30am$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.140 3759 0+00:00:56
slot1_1@xxxxxxxxxx LINUX X86_64 Claimed Busy 0.000 1 0+00:00:04
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 2 0 1 1 0 0 0
Total 2 0 1 1 0 0 0
08:30:34am$ condor_status -format "%s\t" Name -format "%d\n" Cpus
slot1@xxxxxxxxxxxx 0
slot1_1@xxxxxxxxxxxx 2
slot1_2@xxxxxxxxxxxx 2
08:31:00am$ recho You read to the end
dne eht ot daer uoY
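
For anyone following along, here is a rough sketch of the bookkeeping the output above suggests: a 4-CPU partitionable slot hands out CPUs to dynamic slots and keeps the remainder on the parent. This is only a toy model, not Condor's actual implementation. The class and method names (PartitionableSlot, claim, release) are made up for illustration, and the assumption that a request larger than the remaining Cpus is simply refused is mine.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionableSlot:
    """Toy model of a partitionable slot (hypothetical, not Condor code)."""
    name: str
    cpus: int                                    # CPUs still unallocated on the parent
    dynamic: dict = field(default_factory=dict)  # dynamic slot name -> CPUs held
    created: int = 0                             # number of dynamic slots ever carved

    def claim(self, request_cpus: int):
        """Carve a dynamic slot if enough CPUs remain; return its name or None."""
        if request_cpus > self.cpus:
            return None                          # assumption: the job just stays idle
        self.cpus -= request_cpus
        self.created += 1
        child = f"{self.name}_{self.created}"
        self.dynamic[child] = request_cpus
        return child

    def release(self, child: str) -> None:
        """Give a dynamic slot's CPUs back to the parent."""
        self.cpus += self.dynamic.pop(child)


# Mirrors the test above: one 4-CPU slot, jobs asking for 2 CPUs each.
slot = PartitionableSlot("slot1", cpus=4)
first = slot.claim(2)
second = slot.claim(2)
third = slot.claim(2)      # nothing left on the parent, so this one is refused

print(slot.name, slot.cpus)
for child, cpus in slot.dynamic.items():
    print(child, cpus)
```

In this model a job asking for 2 CPUs can never be placed on a parent slot that advertises Cpus = 1, whatever TotalCpus says, so the Cpus value on the partitionable slot is the number to compare against RequestCpus.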