Hi all,
I'm having an issue with HTCondor using the parallel universe and dynamic slots, and I'm hoping someone here can point me in the right direction. On a small two-node cluster (each identical node has 24 virtual processor cores), we are trying to run an MPI program, but the error occurs even with programs that do not use MPI or any other library/protocol to communicate.
When attempting to run a parallel job, it seems that, in general, the more processors I tell the job to use (i.e., the greater the machine_count), the less often the job actually runs (machine_count is always less than the total number of processors). When machine_count is 2, the job always runs. If it's 4, it usually runs. If it's 10, it sometimes runs. If it's 35, it rarely runs. If it's 40, it never runs. When it doesn't run, the job just sits idle and never does anything, even after a day. Also note that no one else is using the machines.
The strange thing is that condor_q says that slots were matched, but it also says that there are no matches. It seems that Condor is not always partitioning the dynamic slots the way it's supposed to, although sometimes it works perfectly. There are no errors; the job just stays idle.
I'm using the Linux sleep command in the following example to keep things as simple as possible; the problem occurs regardless of the program. If I submit the job to the vanilla universe with "queue 40", everything runs fine (but that is not what I want, since I need the parallel universe).
For example, I'm trying to run the following job (just an example):
universe = parallel
executable = /bin/sleep
arguments = 20
machine_count = 30
request_memory = 500
request_disk = 500
log = output/test.log
queue
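For comparison, this is the vanilla-universe variant mentioned above that always runs fine; it is essentially the same file with the universe changed and machine_count replaced by a queue count:

universe = vanilla
executable = /bin/sleep
arguments = 20
request_memory = 500
request_disk = 500
log = output/test.log
queue 40

Everything below refers to the parallel-universe version.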
This is what condor_status reports before the job is submitted:
Name                 OpSys   Arch    State      Activity  LoadAv  Mem    ActvtyTime
slot1@xxxxxxxxxxxxx  LINUX   X86_64  Unclaimed  Idle      0.000   80533  0+00:13:00
slot1@xxxxxxxxxxxxx  LINUX   X86_64  Unclaimed  Idle      0.000   80533  0+00:12:53

                    Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
       X86_64/LINUX     2      0        0          2        0           0         0
              Total     2      0        0          2        0           0         0
This is what condor_status reports after the job is submitted:
Name                   OpSys   Arch    State      Activity  LoadAv  Mem    ActvtyTime
slot1@xxxxxxxxxxxxx    LINUX   X86_64  Unclaimed  Idle      0.000   80021  0+00:13:00
slot1_1@xxxxxxxxxxxxx  LINUX   X86_64  Claimed    Idle      0.000   512    0+00:13:00
slot1@xxxxxxxxxxxxx    LINUX   X86_64  Unclaimed  Idle      0.000   80021  0+00:12:53
slot1_1@xxxxxxxxxxxxx  LINUX   X86_64  Claimed    Idle      0.000   512    0+00:12:53

                    Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
       X86_64/LINUX     4      0        2          2        0           0         0
              Total     4      0        2          2        0           0         0
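If more detail on the slots would help, I can also pull the raw machine-ad attributes; I've been using condor_status's autoformat mode for that (SlotType shows whether a slot is the partitionable parent or a dynamic child):

condor_status -af:h Name SlotType State Activity Cpus Memory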
This is what condor_q -better-analyze reports:
-- Submitter: comp1.site.ca : <ip:port> : comp1.site.ca
User priority for englers@xxxxxxx is not available, attempting to analyze without it.
---
107.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
        No successful match recorded.
        Last failed match: Tue Aug 5 13:09:59 2014
        Reason for last match failure: no match found
The Requirements expression for your job is:
    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
Your job defines the following attributes:
    FileSystemDomain = "comp1.site.ca"
    RequestDisk = 500
    RequestMemory = 500
The Requirements expression for your job reduces to these conditions:
       Slots
Step  Matched  Condition
----  -------  ---------
[0]         4  TARGET.Arch == "X86_64"
[1]         4  TARGET.OpSys == "LINUX"
[3]         4  TARGET.Disk >= RequestDisk
[5]         4  TARGET.Memory >= RequestMemory
[7]         4  TARGET.HasFileTransfer
Suggestions:

    Condition                                          Machines Matched  Suggestion
    ---------                                          ----------------  ----------
1   ( TARGET.Arch == "X86_64" )                                       4
2   ( TARGET.OpSys == "LINUX" )                                       4
3   ( TARGET.Disk >= 500 )                                            4
4   ( TARGET.Memory >= 500 )                                          4
5   ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == "comp1.site.ca" ) )                4
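Since the "Reason for last match failure: no match found" line suggests the negotiator itself is rejecting the job, I've also been looking at the negotiator's log on the central manager. I locate and search it with something like this (condor_config_val reports the configured NEGOTIATOR_LOG path):

grep -i "rejected" "$(condor_config_val NEGOTIATOR_LOG)"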
The Condor configuration on each node contains:
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
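In case it's useful, one way I can double-check that the running startds actually picked these settings up is to query the daemon's configuration directly and to look for the PartitionableSlot attribute in the machine ad (slot name anonymized as described below):

condor_config_val -startd SLOT_TYPE_1_PARTITIONABLE
condor_status -long slot1@comp1.site.ca | grep -i partitionable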
I have also attached a file with the relevant sections of the log files that were updated during the run. Some of the lines look like something went wrong, but I don't understand what they mean.
I have also tried changing the user priorities (the only two users are englers and DedicatedScheduler), with no change.
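For reference, I adjusted the priorities with condor_userprio; the exact values I tried varied, but it was along these lines (user/domain anonymized as described below):

condor_userprio -all
condor_userprio -setprio englers@site.ca 0.5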
Note: all instances of the hostnames and IP addresses/ports were replaced with "comp1.site.ca"/"comp2.site.ca" and "ip:port".
It would be a great help if someone could provide some insight into what is going on; I'm not really sure where to start.
Thanks for your time!
Steve