[HTCondor-users] All jobs stay idle in HTCondor-CE because of GPU requirement - but host of jobs have no GPUs
- Date: Wed, 7 Aug 2024 23:05:47 +0000
- From: Marco Mambelli <marcom@xxxxxxxx>
- Subject: [HTCondor-users] All jobs stay idle in HTCondor-CE because of GPU requirement - but host of jobs have no GPUs
Greetings,
HTCondor-CE 23.0.8 (HTCondor 23.8.1) seems to add a GPU requirement that causes all jobs to stay idle.
There is no GPU on the host, and no GPU is mentioned in the job ClassAd on the submit host.
On the HTCondor-CE side, however, the job carries a GPU requirement that appears to go unmatched.
Below are the outputs of:
- condor_q -better -reverse
- condor_q -better
- condor_q -l | grep -i gpu
All submitted jobs stay idle.
Any suggestions on how to troubleshoot this?
Have there been any recent changes involving GPU requirements?
Thank you,
Marco
Command outputs:
[root@ce-workspace /]# condor_q -all -better -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx 4
-- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
-- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters
The Requirements expression for this slot is
START &&
(WithinResourceLimits)
START is
true
WithinResourceLimits is
(MY.Cpus > 0 &&
TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs is undefined ||
MY.GPUs >= TARGET.RequestGPUs))
This slot defines the following attributes:
Cpus = 1
Disk = 17978392
GPUs = 0
Memory = 1763
Job 4.0 has the following attributes:
TARGET.RequestCpus = 1
TARGET.RequestDisk = 100
TARGET.RequestGPUs = undefined
TARGET.RequestMemory = 2000
The Requirements expression for this slot reduces to these conditions:
Clusters
Step Matched Condition
----- -------- ---------
[6] 0 TARGET.RequestMemory <= MY.Memory
[12] 1 TARGET.RequestGPUs is undefined
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
0 (0.00 %) match both slot and job requirements.
0 match the requirements of this slot.
1 have job requirements that match this slot.
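For readers following the analysis, the slot's WithinResourceLimits expression can be re-checked by hand against the attribute values printed above. The following is a minimal Python sketch, not HTCondor code: it assumes that Python's None can stand in for the ClassAd value undefined, and the function name within_resource_limits is hypothetical. With the values shown, the only clause that comes out false is the memory comparison, which agrees with step [6] in the table above.

```python
# Values copied from the "condor_q -better -reverse" output above.
# None is used here as a stand-in for the ClassAd value "undefined".
slot = {"Cpus": 1, "Disk": 17978392, "GPUs": 0, "Memory": 1763}
job = {"RequestCpus": 1, "RequestDisk": 100, "RequestGPUs": None, "RequestMemory": 2000}


def within_resource_limits(slot, job):
    """Evaluate each clause of WithinResourceLimits separately (hypothetical helper)."""
    request_gpus = job["RequestGPUs"]
    return {
        "cpus": slot["Cpus"] > 0 and job["RequestCpus"] <= slot["Cpus"],
        "memory": slot["Memory"] > 0 and job["RequestMemory"] <= slot["Memory"],
        "disk": slot["Disk"] > 0 and job["RequestDisk"] <= slot["Disk"],
        # "TARGET.RequestGPUs is undefined || MY.GPUs >= TARGET.RequestGPUs"
        "gpus": request_gpus is None or slot["GPUs"] >= request_gpus,
    }


clauses = within_resource_limits(slot, job)
failed = [name for name, ok in clauses.items() if not ok]
print(failed)  # ['memory']
```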
[root@ce-workspace /]# condor_q -all -better 4
-- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
The Requirements expression for job 4.000 is
(RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
Job 4.000 defines the following attributes:
GlideinCpusIsGood = !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) =!= error)
JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
JobIsRunning = (JobStatus =!= 1) && (JobStatus =!= 5) && GlideinCpusIsGood
JobStatus = 1
OriginalGPUs = undefined
RequestGpus = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx has the following attributes:
TARGET.Gpus = 0
TARGET.TotalGPUs = 0
The Requirements expression for job 4.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 0 TARGET.Gpus ?: 0
[1] 1 (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
No successful match recorded.
Last failed match: Wed Aug 7 22:45:27 2024
Reason for last match failure: no match found
004.000: Run analysis summary ignoring user priority. Of 1 machines,
0 are rejected by your job's requirements
1 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are able to run your job
WARNING: Be advised:
Job did not match any machines's constraints
To see why, pick a machine that you think should match and add
-reverse -machine <name>
to your query.
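The job-side Requirements expression, (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0), can be checked the same way. This is a small Python sketch under the same assumption that None stands in for undefined; the helper name elvis is hypothetical and only mimics the ClassAd "?:" operator, which yields its left operand when that operand is defined and the right operand otherwise. With the values reported above, the expression evaluates to true, consistent with "0 are rejected by your job's requirements".

```python
def elvis(value, default):
    """Mimic the ClassAd 'x ?: y' operator, with None standing in for undefined."""
    return default if value is None else value


# Values from the "condor_q -better" output above.
request_gpus = None  # RequestGpus evaluates to undefined for job 4.000
target_gpus = 0      # TARGET.Gpus = 0 on the slot

# (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
job_requirements = elvis(request_gpus, 0) >= elvis(target_gpus, 0)
print(job_requirements)  # True
```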
[root@ce-workspace /]# condor_q -all -l 4 | grep -i gpu
AutoClusterAttrs = "MachineLastMatchTime,Offline,RemoteOwner,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,TotalJobRuntime,ConcurrencyLimits,FlockTo,Rank,Requirements,DiskUsage,GlideinCpusIsGood,JobCpus,JobGPUs,JobIsRunning,JobMemory,JobStatus,MATCH_EXP_JOB_GLIDEIN_Cpus,MATCH_EXP_JOB_GLIDEIN_GPUs,MATCH_EXP_JOB_GLIDEIN_Memory,OriginalCpus,OriginalGPUs,OriginalMemory,TotalCpus,TotalGPUs,TotalMemory,WantWholeNode"
GlideinGPUsIsGood = !isUndefined(MATCH_EXP_JOB_GLIDEIN_GPUs) && (int(MATCH_EXP_JOB_GLIDEIN_GPUs) =!= error)
JOB_GLIDEIN_GPUs = "$$(ifThenElse(WantWholeNode is true, !isUndefined(TotalGPUs) ? TotalGPUs : JobGPUs, OriginalGPUs))"
JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
OriginalGPUs = undefined
RequestGPUs = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
Requirements = (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
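The RequestGPUs expression above can also be traced by hand to see why it shows up as undefined for this job. The sketch below is a rough Python transcription, assuming None stands in for undefined; the function name request_gpus and its parameter names are hypothetical. The value of WantWholeNode is not visible in the filtered output, but because OriginalGPUs is undefined the ifThenElse condition is false either way, so the result is OriginalGPUs itself.

```python
UNDEFINED = None  # stand-in for the ClassAd value "undefined"


def request_gpus(want_whole_node, original_gpus, total_gpus, job_gpus):
    """Rough transcription of the RequestGPUs expression (hypothetical helper)."""
    # (WantWholeNode =?= true && OriginalGPUs =!= undefined)
    whole_node_with_gpus = (want_whole_node is True) and (original_gpus is not UNDEFINED)
    if whole_node_with_gpus:
        # (!isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs
        if total_gpus is not UNDEFINED and total_gpus > 0:
            return total_gpus
        return job_gpus
    return original_gpus


# OriginalGPUs is undefined for job 4.000, so WantWholeNode does not matter here.
for want_whole_node in (True, False, UNDEFINED):
    result = request_gpus(want_whole_node, UNDEFINED, UNDEFINED, UNDEFINED)
    print(want_whole_node, result is UNDEFINED)
```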
[root@ce-workspace /]# condor_ce_version
$HTCondorCEVersion: 23.0.8 $
$CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 GitSHA: 8cf018d1 $
$CondorPlatform: x86_64_AlmaLinux9 $
[root@ce-workspace /]#