Hi all,
I'm having an issue with HTCondor while using the parallel universe with dynamic (partitionable) slots, and I'm hoping someone here can point me in the right direction. On a small two-node cluster (the nodes are identical, each with 24 virtual processor cores), we are trying to run an MPI program, but the problem occurs even with programs that do not use MPI or any other library/protocol to communicate.
When attempting to run a parallel job, the general pattern is that the more processors I tell the job to use (the greater machine_count is), the less often the job actually runs; machine_count is always less than the total number of processors. When machine_count is 2, the job always runs. At 4, it usually runs. At 10, it sometimes runs. At 35, it rarely runs. At 40, it never runs. When it doesn't run, the job just sits idle and never does anything, even after a day. Also note that no one else is using the machines.
The strange thing is that condor_q reports that slots were matched, but also that there are no matches. It seems like HTCondor is not always partitioning the dynamic slots the way it is supposed to, yet sometimes it works perfectly. There are no errors; the job just stays idle.
I'm using the Linux sleep command in the example below to keep things as simple as possible; the problem occurs regardless of the program. If I submit the job to the vanilla universe with "queue 40", everything runs fine (but that is not what I want, since I need the parallel universe).
For example, here is the job I'm trying to run:
universe = parallel
executable = /bin/sleep
arguments = 20
machine_count = 30
request_memory = 500
request_disk = 500
log = output/test.log
queue
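For reference, the same submit file with the per-node CPU request written out explicitly; as far as I know, request_cpus defaults to 1 when omitted, so this should be equivalent:

```
# Same job as above, with request_cpus made explicit (1 is the
# default when omitted) to rule out ambiguity about how much each
# dynamic slot should request from the partitionable slot.
universe       = parallel
executable     = /bin/sleep
arguments      = 20
machine_count  = 30
request_cpus   = 1
request_memory = 500
request_disk   = 500
log            = output/test.log
queue
```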
This is what condor_status reports before the job is submitted:
Name               OpSys      Arch   State     Activity  LoadAv  Mem  ActvtyTime

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
 X86_64/LINUX     2      0        0          2        0           0         0
        Total     2      0        0          2        0           0         0
This is what condor_status reports after the job is submitted:
Name               OpSys      Arch   State     Activity  LoadAv  Mem  ActvtyTime

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
 X86_64/LINUX     4      0        2          2        0           0         0
        Total     4      0        2          2        0           0         0
This is what condor_q -better-analyze reports:
User priority for englers@xxxxxxx is not available, attempting to analyze without it.
---
107.000: Run analysis summary. Of 4 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
2 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
No successful match recorded.
Last failed match: Tue Aug 5 13:09:59 2014
Reason for last match failure: no match found
The Requirements expression for your job is:
( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( ( TARGET.HasFileTransfer ) ||
( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
Your job defines the following attributes:
RequestDisk = 500
RequestMemory = 500
The Requirements expression for your job reduces to these conditions:
         Slots
Step   Matched   Condition
-----  --------  ---------
[0]        4     TARGET.Arch == "X86_64"
[1]        4     TARGET.OpSys == "LINUX"
[3]        4     TARGET.Disk >= RequestDisk
[5]        4     TARGET.Memory >= RequestMemory
[7]        4     TARGET.HasFileTransfer
Suggestions:
    Condition                                          Machines Matched  Suggestion
    ---------                                          ----------------  ----------
1   ( TARGET.Arch == "X86_64" )                               4
2   ( TARGET.OpSys == "LINUX" )                               4
3   ( TARGET.Disk >= 500 )                                    4
4   ( TARGET.Memory >= 500 )                                  4
5   ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == "comp1.site.ca" ) )        4
The HTCondor configuration for each node contains:
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true
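The dedicated scheduler is advertised on each execute node in the usual way (a sketch of the standard setup from the manual; I'm assuming here that comp1.site.ca, matching the FileSystemDomain above, is the submit host):

```
# Dedicated-scheduler advertisement needed for the parallel universe
# (sketch; "comp1.site.ca" is assumed to be the submit host running
# the schedd that owns the dedicated resources).
DedicatedScheduler = "DedicatedScheduler@comp1.site.ca"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
```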
I have also attached a file with the relevant sections of the log files that were updated during these runs. Some of the lines look like something went wrong, but I don't understand what they mean.
I have also tried changing user priorities (the only two users are englers and DedicatedScheduler), with no change.
It would be a great help if someone could provide some insight into what is going on; I'm not really sure where to start.
Thanks for your time!
Steve