
Re: [HTCondor-users] All jobs stay idle in HTCondor-CE because of GPU requirement - but host of jobs have no GPUs



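I think the GPU is a red herring.
The real problem is the HTCondor-CE bug/feature where even a sleep job that asks for essentially no memory is bumped by the CE to request 2 GB of memory. In the -reverse analysis below, the only condition that matches 0 clusters is TARGET.RequestMemory <= MY.Memory: the job requests 2000 and the slot has Memory = 1763, while the GPU clause passes because RequestGPUs is undefined.
I was setting maxMemory in the glidein (<submit_attr name="maxMemory" value="RequestMemory"/>), but something is going wrong.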
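
To make the memory mismatch concrete, here is a minimal Python sketch that replays the slot's WithinResourceLimits expression using the values from the -reverse output quoted below. It is not HTCondor code: the function name and the dict layout are my own, and None stands in for the ClassAd value undefined.

```python
def within_resource_limits(slot, job):
    """Evaluate each clause of the slot's WithinResourceLimits on its own.

    Hypothetical helper for illustration only; it mirrors the expression
    printed by condor_q -better -reverse, one clause per resource.
    """
    request_gpus = job.get("RequestGPUs")  # None models ClassAd 'undefined'
    return {
        "cpus": slot["Cpus"] > 0 and job["RequestCpus"] <= slot["Cpus"],
        "memory": slot["Memory"] > 0 and job["RequestMemory"] <= slot["Memory"],
        "disk": slot["Disk"] > 0 and job["RequestDisk"] <= slot["Disk"],
        "gpus": request_gpus is None or slot["GPUs"] >= request_gpus,
    }


# Values copied from the analysis of job 4.0 against slot1.
slot = {"Cpus": 1, "Disk": 17978392, "GPUs": 0, "Memory": 1763}
job = {"RequestCpus": 1, "RequestDisk": 100, "RequestGPUs": None, "RequestMemory": 2000}

checks = within_resource_limits(slot, job)
failed = [name for name, ok in checks.items() if not ok]
print(failed)  # ['memory']
```

Only the memory clause fails, which is consistent with step [6] in the analysis matching 0 clusters while the GPU step matches 1.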

> On Aug 7, 2024, at 6:05 PM, Marco Mambelli via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> 
> Greetings,
> HTCondor-CE 23.0.8 (HTCondor 23.8.1) seems to add a GPU requirement that causes all jobs to stay idle.
> There is no GPU on the host, and no GPU is mentioned in the job ClassAd on the submit host.
> On the HTCondor-CE, however, there seems to be a GPU requirement that is not matched and keeps the jobs idle.
> 
> Below are the outputs of:
> - condor_q -better -reverse
> - condor_q -better
> - condor_q -l | grep -i gpu
> 
> All submitted jobs stay idle.
> Any suggestions on how to troubleshoot this?
> Have there been any recent changes involving GPU requirements?
> 
> Thank you,
> Marco
> 
> 
> Command outputs:
> 
> 
> [root@ce-workspace /]# condor_q -all -better -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx 4
> 
> 
> -- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
> 
> -- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters
> 
> The Requirements expression for this slot is
> 
>    START &&
>    (WithinResourceLimits)
> 
>  START is
>    true
> 
>  WithinResourceLimits is
>    (MY.Cpus > 0 &&
>      TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
>      TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
>      TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs is undefined ||
>        MY.GPUs >= TARGET.RequestGPUs))
> 
> This slot defines the following attributes:
> 
>    Cpus = 1
>    Disk = 17978392
>    GPUs = 0
>    Memory = 1763
> 
> Job 4.0 has the following attributes:
> 
>    TARGET.RequestCpus = 1
>    TARGET.RequestDisk = 100
>    TARGET.RequestGPUs = undefined
>    TARGET.RequestMemory = 2000
> 
> The Requirements expression for this slot reduces to these conditions:
> 
>       Clusters
> Step    Matched  Condition
> -----  --------  ---------
> [6]           0  TARGET.RequestMemory <= MY.Memory
> [12]          1  TARGET.RequestGPUs is undefined
> 
> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
>    0 (0.00 %) match both slot and job requirements.
>    0 match the requirements of this slot.
>    1 have job requirements that match this slot.
> [root@ce-workspace /]# condor_q -all -better 4
> 
> 
> -- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
> The Requirements expression for job 4.000 is
> 
>    (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
> 
> Job 4.000 defines the following attributes:
> 
>    GlideinCpusIsGood =  !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) =!= error)
>    JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
>    JobIsRunning = (JobStatus =!= 1) && (JobStatus =!= 5) && GlideinCpusIsGood
>    JobStatus = 1
>    OriginalGPUs = undefined
>    RequestGpus = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
> 
> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx has the following attributes:
> 
>    TARGET.Gpus = 0
>    TARGET.TotalGPUs = 0
> 
> The Requirements expression for job 4.000 reduces to these conditions:
> 
>         Slots
> Step    Matched  Condition
> -----  --------  ---------
> [0]           0  TARGET.Gpus ?: 0
> [1]           1  (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
> 
> No successful match recorded.
> Last failed match: Wed Aug  7 22:45:27 2024
> 
> Reason for last match failure: no match found
> 
> 004.000:  Run analysis summary ignoring user priority.  Of 1 machines,
>      0 are rejected by your job's requirements
>      1 reject your job because of their own requirements
>      0 match and are already running your jobs
>      0 match but are serving other users
>      0 are able to run your job
> 
> WARNING:  Be advised:
>   Job did not match any machines's constraints
>   To see why, pick a machine that you think should match and add
>     -reverse -machine <name>
>   to your query.
> 
> [root@ce-workspace /]# condor_q -all -l 4 | grep -i gpu
> AutoClusterAttrs = "MachineLastMatchTime,Offline,RemoteOwner,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,TotalJobRuntime,ConcurrencyLimits,FlockTo,Rank,Requirements,DiskUsage,GlideinCpusIsGood,JobCpus,JobGPUs,JobIsRunning,JobMemory,JobStatus,MATCH_EXP_JOB_GLIDEIN_Cpus,MATCH_EXP_JOB_GLIDEIN_GPUs,MATCH_EXP_JOB_GLIDEIN_Memory,OriginalCpus,OriginalGPUs,OriginalMemory,TotalCpus,TotalGPUs,TotalMemory,WantWholeNode"
> GlideinGPUsIsGood =  !isUndefined(MATCH_EXP_JOB_GLIDEIN_GPUs) && (int(MATCH_EXP_JOB_GLIDEIN_GPUs) =!= error)
> JOB_GLIDEIN_GPUs = "$$(ifThenElse(WantWholeNode is true, !isUndefined(TotalGPUs) ? TotalGPUs : JobGPUs, OriginalGPUs))"
> JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
> OriginalGPUs = undefined
> RequestGPUs = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
> Requirements = (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
> [root@ce-workspace /]# condor_ce_version
> $HTCondorCEVersion: 23.0.8 $
> $CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 GitSHA: 8cf018d1 $
> $CondorPlatform: x86_64_AlmaLinux9 $
> [root@ce-workspace /]#