Re: [HTCondor-users] Slots are in use even when no job is running

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Hello Condor Experts,

We are facing a strange issue with MPI jobs.

"condor_q -all" on the scheduler shows no jobs in the queue, and we use only a single scheduler node to submit MPI jobs. Even so, any new job we submit stays idle, and running condor_q -better-analyze against that job shows the relevant output below. The count of 172 is expected, since another user has 172 vanilla jobs running. This user has no jobs running, however, so the count of 10 is completely unexpected.

1580.000: Run analysis summary ignoring user priority. Of 194 machines,
172 are rejected by your job's requirements
0 reject your job because of their own requirements
2 are exhausted partitionable slots
10 match and are already running your jobs
0 match but are serving other users
0 are available to run your job

I then checked the used slots on every node in the cluster and found slots of 10 CPUs in use, which I believe correspond to earlier attempts to run the MPI job. We request 10 cores and 5 nodes when running the MPI job. When I log in to any node that shows 10 CPUs in use and run condor_who, pstree -p condor, htop, or top, no user process shows up on that node. Again, the total of 172 single-CPU slots is expected, as another user is running jobs.

# condor_status -compact -af:h machine cpus totalcpus childcpus 'int(totalcpus-cpus)'
machine cpus totalcpus childcpus int(totalcpus-cpus)
testnode0001.test.com 10 36.0 { 1,1,1,1,1,1,10,10 } 26
testnode0002.test.com 11 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 } 25
testnode0003.test.com 21 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 15
testnode0004.test.com 15 36.0 { 1,1,1,1,1,10,1,1,1,1,1,1 } 21
testnode0005.test.com 13 36.0 { 10,1,1,1,1,1,1,1,1,1,1,1,1,1 } 23
testnode0006.test.com 24 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1 } 12
testnode0007.test.com 8 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,1 } 28
testnode0008.test.com 12 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 } 24
testnode0009.test.com 12 36.0 { 1,1,1,1,1,1,1,1,10,1,1,1,1,1,1 } 24
testnode0010.test.com 1 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,10 } 35
testnode0011.test.com 15 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 21
testnode0012.test.com 17 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 19
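As a cross-check, the childcpus column above can be tallied with a short script. This is only a sketch: it works on the pasted text rather than querying the pool, and the names parse_rows and slots_of_size are made up for illustration.

```python
import re

# Rows copied verbatim from the condor_status output above.
STATUS_OUTPUT = """\
testnode0001.test.com 10 36.0 { 1,1,1,1,1,1,10,10 } 26
testnode0002.test.com 11 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 } 25
testnode0003.test.com 21 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 15
testnode0004.test.com 15 36.0 { 1,1,1,1,1,10,1,1,1,1,1,1 } 21
testnode0005.test.com 13 36.0 { 10,1,1,1,1,1,1,1,1,1,1,1,1,1 } 23
testnode0006.test.com 24 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1 } 12
testnode0007.test.com 8 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,1 } 28
testnode0008.test.com 12 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 } 24
testnode0009.test.com 12 36.0 { 1,1,1,1,1,1,1,1,10,1,1,1,1,1,1 } 24
testnode0010.test.com 1 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,10 } 35
testnode0011.test.com 15 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 21
testnode0012.test.com 17 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 19
"""

# machine, cpus, totalcpus, { childcpus list }, used cpus
ROW = re.compile(r"^(\S+)\s+(\d+)\s+([\d.]+)\s+\{([^}]*)\}\s+(\d+)$")


def parse_rows(text):
    """Map each machine name to its list of child (dynamic) slot CPU counts."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # skip anything that is not a data row
        rows[m.group(1)] = [int(x) for x in m.group(4).split(",") if x.strip()]
    return rows


def slots_of_size(rows, size):
    """Per machine, count the child slots that hold exactly `size` CPUs."""
    return {name: child.count(size) for name, child in rows.items() if size in child}


rows = parse_rows(STATUS_OUTPUT)
ten_core = slots_of_size(rows, 10)
for name, count in sorted(ten_core.items()):
    print(name, count)
print("total 10-CPU child slots:", sum(ten_core.values()))
```

On the rows above this finds ten 10-CPU child slots spread over eight machines, which is the same number as the "10 match and are already running your jobs" line in the analysis.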

The condor_userprio output doesn't agree with the output above:

# condor_userprio -all
Last Priority Update: 11/19 03:25
Effective Real Priority Res Total Usage Usage Last Time Since
User Name Priority Priority Factor In Use (wghted-hrs) Start Time Usage Time Last Usage
------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
testuser1@xxxxxxxx 3166.63 31.67 100.00 172 11421.95 11/01/2019 00:05 11/19/2019 03:25 <now>
------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
Number of users: 1 172 11421.95 11/18/2019 03:25 0+23:59

To add to the confusion, I found that some old cgroup directories have not been removed. The output below is from testnode0001, where only 6 vanilla jobs are actually running, each in a 1-CPU slot, yet 14 directories are present under the cgroup.

# ls -ld /cgroup/cpu/htcondor/condor_spare_condor_slot1_*@testnode0001.test.com/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_11@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_22@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_23@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_26@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:07 /cgroup/cpu/htcondor/condor_spare_condor_slot1_28@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_29@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Jul 9 13:08 /cgroup/cpu/htcondor/condor_spare_condor_slot1_33@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:07 /cgroup/cpu/htcondor/condor_spare_condor_slot1_36@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:30 /cgroup/cpu/htcondor/condor_spare_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_5@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_6@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_7@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_9@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
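The listing above can be summarised the same way. The sketch below groups the cgroup entries by the date shown in the ls output and collects the dynamic slot numbers. It again works only on the pasted text, and parse_entries is a made-up helper name.

```python
import re
from collections import Counter

# Lines copied verbatim from the ls output above (hostnames are redacted there).
LS_OUTPUT = """\
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_11@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_22@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_23@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_26@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:07 /cgroup/cpu/htcondor/condor_spare_condor_slot1_28@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_29@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Jul 9 13:08 /cgroup/cpu/htcondor/condor_spare_condor_slot1_33@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:07 /cgroup/cpu/htcondor/condor_spare_condor_slot1_36@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:30 /cgroup/cpu/htcondor/condor_spare_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_5@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_6@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_7@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_9@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
"""

# Capture the month and day of the timestamp, then the dynamic slot number.
ENTRY = re.compile(r"(\w{3})\s+(\d{1,2})\s+\d{2}:\d{2}\s+\S*slot1_(\d+)@")


def parse_entries(text):
    """Return (date, slot_number) pairs, one per cgroup entry in the listing."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.search(line)
        if m:
            entries.append((f"{m.group(1)} {m.group(2)}", int(m.group(3))))
    return entries


entries = parse_entries(LS_OUTPUT)
by_date = Counter(date for date, _ in entries)
slot_ids = sorted(slot for _, slot in entries)

print("cgroup entries:", len(entries))
print("by date:", dict(by_date))
print("slot numbers:", slot_ids)
```

The tally is 14 entries: 8 dated Nov 19, 5 dated Nov 18, and 1 dated Jul 9. The condor_status row for testnode0001 lists only 8 child slots (six of 1 CPU and two of 10 CPUs), so at least 6 of these directories have no matching slot, and the Jul 9 entry (slot1_33) is months older than the rest.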

Setup Information:

$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $
$CondorPlatform: x86_64_RedHat6 $

Thanks & Regards,

Vikrant Aggarwal
