Re: [HTCondor-users] All jobs stay idle in HTCondor-CE because of GPU requirement - but host of jobs have no GPUs
- Date: Thu, 8 Aug 2024 00:14:28 +0000
- From: Marco Mambelli <marcom@xxxxxxxx>
- Subject: Re: [HTCondor-users] All jobs stay idle in HTCondor-CE because of GPU requirement - but host of jobs have no GPUs
I think the GPU is a red herring.
The problem is the HTCondor-CE bug/feature where even a sleep job that asks for basically no memory is bumped by the CE to request 2 GB of memory.
I was setting maxMemory in the glidein (<submit_attr name="maxMemory" value="RequestMemory"/>), but something is going wrong.
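
As a rough illustration, the numbers from the quoted condor_q output below can be plugged into the two expressions by hand. This is a plain-Python sketch, not real ClassAd evaluation: `None` stands in for `undefined`, ClassAd three-valued logic is ignored, and the function names are made up for this example.

```python
# Values copied from the quoted `condor_q -better -reverse` output.
# None stands in for the ClassAd value "undefined" (an assumption of this sketch).
slot = {"Cpus": 1, "Disk": 17978392, "GPUs": 0, "Memory": 1763}
job = {"RequestCpus": 1, "RequestDisk": 100, "RequestGPUs": None, "RequestMemory": 2000}


def within_resource_limits(slot, job):
    """Evaluate each clause of the slot's WithinResourceLimits separately."""
    gpus = job["RequestGPUs"]
    return {
        "Cpus": slot["Cpus"] > 0 and job["RequestCpus"] <= slot["Cpus"],
        "Memory": slot["Memory"] > 0 and job["RequestMemory"] <= slot["Memory"],
        "Disk": slot["Disk"] > 0 and job["RequestDisk"] <= slot["Disk"],
        # "TARGET.RequestGPUs is undefined || MY.GPUs >= TARGET.RequestGPUs"
        "GPUs": gpus is None or slot["GPUs"] >= gpus,
    }


def job_requirements(slot, job):
    """Mimic (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0): undefined falls back to 0."""
    request = job["RequestGPUs"] if job["RequestGPUs"] is not None else 0
    offered = slot["GPUs"] if slot["GPUs"] is not None else 0
    return request >= offered


clauses = within_resource_limits(slot, job)
failing = [name for name, ok in clauses.items() if not ok]
print("slot clauses:", clauses)
print("failing clauses:", failing)
print("job requirements satisfied:", job_requirements(slot, job))
```

With these values only the memory clause fails (2000 requested against 1763 available), while both GPU checks pass, which agrees with the analyzer reporting 0 matches at the `TARGET.RequestMemory <= MY.Memory` step.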
> On Aug 7, 2024, at 6:05 PM, Marco Mambelli via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> Greetings,
> HTCondor-CE 23.0.8 (HTCondor 23.8.1) seems to add a GPU requirement that causes all jobs to stay idle.
> There is no GPU on the host and no GPU is mentioned in the job ClassAd on the submit host.
> On the HTCondor-CE there seems to be a GPU requirement that is not matched and causes the jobs to stay idle.
>
> Below are the outputs of:
> - condor_q -better -reverse
> - condor_q -better
> - condor_q -l | grep -i gpu
>
> All submitted jobs stay idle.
> Any suggestions on how to troubleshoot this?
> Has there been any recent change involving GPU requirements?
>
> Thank you,
> Marco
>
>
> Command outputs:
>
>
> [root@ce-workspace /]# condor_q -all -better -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx 4
>
>
> -- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
>
> -- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters
>
> The Requirements expression for this slot is
>
> START &&
> (WithinResourceLimits)
>
> START is
> true
>
> WithinResourceLimits is
> (MY.Cpus > 0 &&
> TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
> TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
> TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs is undefined ||
> MY.GPUs >= TARGET.RequestGPUs))
>
> This slot defines the following attributes:
>
> Cpus = 1
> Disk = 17978392
> GPUs = 0
> Memory = 1763
>
> Job 4.0 has the following attributes:
>
> TARGET.RequestCpus = 1
> TARGET.RequestDisk = 100
> TARGET.RequestGPUs = undefined
> TARGET.RequestMemory = 2000
>
> The Requirements expression for this slot reduces to these conditions:
>
> Clusters
> Step Matched Condition
> ----- -------- ---------
> [6] 0 TARGET.RequestMemory <= MY.Memory
> [12] 1 TARGET.RequestGPUs is undefined
>
> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
> 0 (0.00 %) match both slot and job requirements.
> 0 match the requirements of this slot.
> 1 have job requirements that match this slot.
> [root@ce-workspace /]# condor_q -all -better 4
>
>
> -- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
> The Requirements expression for job 4.000 is
>
> (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
>
> Job 4.000 defines the following attributes:
>
> GlideinCpusIsGood = !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) =!= error)
> JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
> JobIsRunning = (JobStatus =!= 1) && (JobStatus =!= 5) && GlideinCpusIsGood
> JobStatus = 1
> OriginalGPUs = undefined
> RequestGpus = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
>
> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx has the following attributes:
>
> TARGET.Gpus = 0
> TARGET.TotalGPUs = 0
>
> The Requirements expression for job 4.000 reduces to these conditions:
>
> Slots
> Step Matched Condition
> ----- -------- ---------
> [0] 0 TARGET.Gpus ?: 0
> [1] 1 (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
>
> No successful match recorded.
> Last failed match: Wed Aug 7 22:45:27 2024
>
> Reason for last match failure: no match found
>
> 004.000: Run analysis summary ignoring user priority. Of 1 machines,
> 0 are rejected by your job's requirements
> 1 reject your job because of their own requirements
> 0 match and are already running your jobs
> 0 match but are serving other users
> 0 are able to run your job
>
> WARNING: Be advised:
> Job did not match any machines's constraints
> To see why, pick a machine that you think should match and add
> -reverse -machine <name>
> to your query.
>
> [root@ce-workspace /]# condor_q -all -l 4 | grep -i gpu
> AutoClusterAttrs = "MachineLastMatchTime,Offline,RemoteOwner,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,TotalJobRuntime,ConcurrencyLimits,FlockTo,Rank,Requirements,DiskUsage,GlideinCpusIsGood,JobCpus,JobGPUs,JobIsRunning,JobMemory,JobStatus,MATCH_EXP_JOB_GLIDEIN_Cpus,MATCH_EXP_JOB_GLIDEIN_GPUs,MATCH_EXP_JOB_GLIDEIN_Memory,OriginalCpus,OriginalGPUs,OriginalMemory,TotalCpus,TotalGPUs,TotalMemory,WantWholeNode"
> GlideinGPUsIsGood = !isUndefined(MATCH_EXP_JOB_GLIDEIN_GPUs) && (int(MATCH_EXP_JOB_GLIDEIN_GPUs) =!= error)
> JOB_GLIDEIN_GPUs = "$$(ifThenElse(WantWholeNode is true, !isUndefined(TotalGPUs) ? TotalGPUs : JobGPUs, OriginalGPUs))"
> JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
> OriginalGPUs = undefined
> RequestGPUs = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
> Requirements = (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
> [root@ce-workspace /]# condor_ce_version
> $HTCondorCEVersion: 23.0.8 $
> $CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 GitSHA: 8cf018d1 $
> $CondorPlatform: x86_64_AlmaLinux9 $
> [root@ce-workspace /]#