Re: [HTCondor-users] Parallel scheduling group problem

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

2017-07-25 20:01 GMT+02:00 John M Knoeller <johnkn@xxxxxxxxxxx>:

Are the jobs parallel universe jobs? The purpose of ParallelSchedulingGroup is to ensure that all of the nodes of a parallel universe job run in the same "scheduling group" (usually used to indicate that the machines have fast network access to each other).

I think you just want to add Opsys=="WINDOWS" to your job's requirements expression.
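A minimal sketch, in Python, of what that extra clause does on the four-machine test pool described in the original post. The slot dictionaries and the OpSys values are assumptions made for illustration (only the machine names and group names appear in the thread), and the eq() helper merely imitates the case-insensitive string comparison of the ClassAd == operator:

```python
# Hypothetical slot ads for the test pool in this thread.
# Machine names and group names come from the original post;
# the OpSys values are assumed for illustration.
slots = [
    {"Machine": "centos7-node01.cst.de", "OpSys": "LINUX",
     "ParallelSchedulingGroup": "linux-cluster"},
    {"Machine": "centos7-node02.cst.de", "OpSys": "LINUX",
     "ParallelSchedulingGroup": "linux-cluster"},
    {"Machine": "win2012-master.cst.de", "OpSys": "WINDOWS",
     "ParallelSchedulingGroup": "windows-cluster"},
    {"Machine": "win2012-node01.cst.de", "OpSys": "WINDOWS",
     "ParallelSchedulingGroup": "windows-cluster"},
]


def eq(a, b):
    """Imitate ClassAd '==' on strings, which ignores case."""
    return a.lower() == b.lower()


def original_requirements(slot):
    # ( Opsys == "Linux" ) || ( Opsys == "Windows" )
    return eq(slot["OpSys"], "Linux") or eq(slot["OpSys"], "Windows")


def windows_only_requirements(slot):
    # Opsys == "WINDOWS"
    return eq(slot["OpSys"], "WINDOWS")


before = [s["Machine"] for s in slots if original_requirements(s)]
after = [s["Machine"] for s in slots if windows_only_requirements(s)]
groups_after = {s["ParallelSchedulingGroup"]
                for s in slots if windows_only_requirements(s)}

print("OS clause as submitted:", before)
print("with Opsys == \"WINDOWS\":", after)
print("scheduling groups left:", groups_after)
```

Under these assumptions the original OS clause admits all four machines, while the narrower clause leaves only the two Windows hosts, which share one scheduling group.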

As for your question about -better-analyze: it is not saying that all 4 machines match.

This line

[0]           2  ParallelSchedulingGroup is my.Matched_PSG

indicates that only two machines match that clause, whereas these lines

1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"

2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE

(incorrectly) indicate that 0 machines match. There is a known problem with the "Suggestions:" clause of -better-analyze: it does not correctly analyze complex sub-clauses and almost never makes useful suggestions. The suggestions clause has been removed from HTCondor 8.6 and later for that reason.
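To read the step table rather than the Suggestions block, a small Python sketch such as the following can pull out the per-clause match counts. The table text is copied from the output quoted in this thread; treating the largest count as the total number of slots is an assumption of this sketch, not something condor_q reports:

```python
import re

# Step table copied from the condor_q -better-analyze output in this thread.
table = """\
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
[1]           2  Opsys == "Linux"
[2]           2  Opsys == "Windows"
[3]           4  [1] || [2]
[4]           4  Arch == "X86_64"
[6]           4  stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",")
[8]           4  CST_CLUSTER_HAS_DC is true
"""

# One row: "[step]  <slots matched>  <condition>"
row = re.compile(r"^\[(\d+)\]\s+(\d+)\s+(.*)$")


def parse_steps(text):
    """Return a list of (step, slots_matched, condition) tuples."""
    steps = []
    for line in text.splitlines():
        m = row.match(line.strip())
        if m:
            steps.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return steps


steps = parse_steps(table)

# Assumption: the largest count in the table equals the number of slots.
total_slots = max(count for _, count, _ in steps)

# Clauses that match fewer slots than the total narrow the candidate set.
narrowing = [(step, count, cond) for step, count, cond in steps
             if count < total_slots]

for step, count, cond in narrowing:
    print(f"[{step}] matches {count} of {total_slots} slots: {cond}")
```

Steps [1] and [2] also appear in the result, but they are the two halves of the OR in step [3], which matches all four slots; step [0] is the only top-level clause that restricts the job.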

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Monday, July 24, 2017 3:02 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Parallel scheduling group problem

We have a mixed Windows/Linux setup managed by HTCondor. I configured parallel scheduling groups for all systems. In a test setup where I can reproduce the issues that I experience in the production pool, I have four execution hosts (2x Windows, 2x Linux). The execution hosts have parallel scheduling groups as follows:

# on both Linux machines

ParallelSchedulingGroup = "linux-cluster"

# on the Windows machines
ParallelSchedulingGroup = "windows-cluster"

After a while, jobs submitted to the parallel universe are no longer started, and condor_q -better-analyze for such a job gives the following somewhat inconsistent information:

---------------------------------------------------------------------------------------------------------

The Requirements expression for your job is:

    ( ParallelSchedulingGroup is my.Matched_PSG ) &&
    ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) &&
      ( Arch == "X86_64" ) &&
      ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) &&
      ( CST_CLUSTER_HAS_DC is true ) ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    DiskUsage = 75
    Matched_PSG = "windows-cluster"
    RequestDisk = 75

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
[1]           2  Opsys == "Linux"
[2]           2  Opsys == "Windows"
[3]           4  [1] || [2]
[4]           4  Arch == "X86_64"
[6]           4  stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",")
[8]           4  CST_CLUSTER_HAS_DC is true

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
3   ( TARGET.Disk >= 75 )             4
4   ( TARGET.HasFileTransfer )        4

---------------------------------------------------------------------------------------------------------

It's strange that, on the one hand, condor_q tells me that basically all four machines match my requirements expression, but on the other hand it tells me that no machine matches the condition

ParallelSchedulingGroup is "windows-cluster"

which is certainly not true, as I have also checked with condor_status:

condor_status -pool centos7-master.cst.de -af Machine ParallelSchedulingGroup
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster
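As a cross-check, the condor_status output above can be grouped by scheduling group with a few lines of Python. This is only a sketch: it works on the text pasted above, assumes exactly two whitespace-separated columns per line, and uses an exact string comparison as a stand-in for the "is" operator in the job's requirements:

```python
from collections import defaultdict

# Output of "condor_status -af Machine ParallelSchedulingGroup",
# copied from this post.
output = """\
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster
"""

# Map each scheduling group to the machines that advertise it.
groups = defaultdict(list)
for line in output.splitlines():
    machine, group = line.split()
    groups[group].append(machine)

for group, machines in sorted(groups.items()):
    print(f"{group}: {len(machines)} machine(s): {', '.join(machines)}")

# Machines whose group is exactly "windows-cluster".
windows_matches = groups["windows-cluster"]
print("machines in windows-cluster:", len(windows_matches))
```

For this pool the count is two, which agrees with step [0] of the analysis and not with the 0 shown in the Suggestions block.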

Does anyone have an idea what may cause this strange behavior?

I don't know whether this is relevant, but I've set NUM_CPUS=1 on all machines, since a job is supposed to have exclusive access to all resources on a compute node.

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
