Are the jobs parallel universe jobs? The purpose of ParallelSchedulingGroup is to ensure that all of the nodes of a parallel universe job run in the same "scheduling group" (usually used to indicate that the machines have fast network access to each other).

I think you just want to add Opsys == "WINDOWS" to your job's requirements expression.

As for your question about -better-analyze: it is not saying that all 4 machines match.
This line
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
indicates that only two machines match that clause, whereas these lines

1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
(incorrectly) indicate that 0 machines match. There is a known problem with the "Suggestions:" clause of -better-analyze: it does not correctly analyze complex sub-clauses and almost never makes useful suggestions. The suggestions clause has been removed from HTCondor 8.6 and later for that reason.
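
To illustrate how the per-clause counts in the "Slots Matched" column relate to each other, here is a minimal Python sketch. It is a toy model, not HTCondor code: the ParallelSchedulingGroup values come from the condor_status output quoted below, while the OpSys values and the helper names (psg_clause, opsys_clause) are assumptions made up for the example.

```python
# Toy model of the per-clause counts reported by -better-analyze.
# ParallelSchedulingGroup values are taken from the condor_status output
# quoted in this thread; the OpSys values are ASSUMED for illustration.
slots = [
    {"Machine": "centos7-node01.cst.de", "OpSys": "LINUX",
     "ParallelSchedulingGroup": "linux-cluster"},
    {"Machine": "centos7-node02.cst.de", "OpSys": "LINUX",
     "ParallelSchedulingGroup": "linux-cluster"},
    {"Machine": "win2012-master.cst.de", "OpSys": "WINDOWS",
     "ParallelSchedulingGroup": "windows-cluster"},
    {"Machine": "win2012-node01.cst.de", "OpSys": "WINDOWS",
     "ParallelSchedulingGroup": "windows-cluster"},
]

matched_psg = "windows-cluster"  # the job's Matched_PSG attribute


def psg_clause(slot):
    # ClassAd "is" is an exact comparison, like == on Python strings.
    return slot["ParallelSchedulingGroup"] == matched_psg


def opsys_clause(slot):
    # ClassAd == ignores case for strings; casefold() mimics that here.
    return slot["OpSys"].casefold() in ("linux", "windows")


# Each clause is counted on its own, independently of the others.
psg_count = sum(psg_clause(s) for s in slots)
opsys_count = sum(opsys_clause(s) for s in slots)

# Only slots that satisfy every clause can actually run the job.
both = [s["Machine"] for s in slots if psg_clause(s) and opsys_clause(s)]

print(psg_count, opsys_count, both)
```

Because each clause is counted separately, a 4 on one row does not mean that four machines satisfy the whole expression; in this sketch only the two slots in the matching scheduling group do.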

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Monday, July 24, 2017 3:02 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Parallel scheduling group problem

We have a mixed Windows/Linux setup managed by HTCondor. I configured parallel scheduling groups for all systems. In a test setup where I can reproduce the issues that I experience in the production pool, I have four execution hosts (2x Windows, 2x Linux). The execution hosts have parallel scheduling groups as follows:

# on both Linux machines
ParallelSchedulingGroup = "linux-cluster"

# on the Windows machines
ParallelSchedulingGroup = "windows-cluster"

After a while, jobs submitted to the parallel universe are no longer started, and condor_q -better-analyze for such a job gives the following somewhat inconsistent information:
---------------------------------------------------------------------------------------------------------
The Requirements expression for your job is:

    ( ParallelSchedulingGroup is my.Matched_PSG ) &&
    ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) &&
      ( Arch == "X86_64" ) &&
      ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS, ",") ) &&
      ( CST_CLUSTER_HAS_DC is true ) ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    DiskUsage = 75
    Matched_PSG = "windows-cluster"
    RequestDisk = 75

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
[1]           2  Opsys == "Linux"
[2]           2  Opsys == "Windows"
[3]           4  [1] || [2]
[4]           4  Arch == "X86_64"
[6]           4  stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS, ",")
[8]           4  CST_CLUSTER_HAS_DC is true

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
3   ( TARGET.Disk >= 75 )             4
4   ( TARGET.HasFileTransfer )        4
---------------------------------------------------------------------------------------------------------

It's strange that on the one hand condor_q tells me that basically all four machines match my requirements expression, but on the other hand it tells me that no machine matches the condition
ParallelSchedulingGroup is "windows-cluster"
which is certainly not true, as I have also checked with condor_status:
condor_status -pool centos7-master.cst.de -af Machine ParallelSchedulingGroup
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster

Does anyone have an idea what may cause this strange behavior?
I don't know whether this is relevant, but I've set NUM_CPUS=1 for all machines, as a job is supposed to have exclusive access to all resources on a compute node.
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/