
Re: [HTCondor-users] All jobs stay idle in HTCondor-CE because of GPU requirement - but host of jobs have no GPUs



The CE will look at maxMemory in the incoming job ad and default_maxMemory of the route in order to set RequestMemory of the routed job. If neither is set, then the value 2000 is used. Does that match your expectation? Note that the RequestMemory attribute of the incoming job is ignored (though if maxMemory references it, that is respected).
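As a rough illustration of the fallback described above, here is a minimal Python sketch. It is not the CE's actual code: the function name `routed_request_memory` is hypothetical, and the precedence (the job's maxMemory first, then the route's default_maxMemory, then 2000) is an assumption read from the order in which the attributes are mentioned.

```python
from typing import Optional

# Fallback value quoted in the reply above.
CE_FALLBACK_MEMORY = 2000


def routed_request_memory(max_memory: Optional[int],
                          default_max_memory: Optional[int]) -> int:
    """Hypothetical model of how the routed job's RequestMemory is chosen.

    None stands for "attribute not set". The incoming job's own
    RequestMemory is deliberately not a parameter, because the CE ignores
    it unless maxMemory refers to it.
    """
    if max_memory is not None:          # maxMemory from the incoming job ad
        return max_memory
    if default_max_memory is not None:  # default_maxMemory from the route
        return default_max_memory
    return CE_FALLBACK_MEMORY           # neither is set


if __name__ == "__main__":
    print(routed_request_memory(None, None))   # neither set
    print(routed_request_memory(None, 1024))   # route default only
    print(routed_request_memory(512, 1024))    # job's maxMemory wins
```

Under this reading, a routed job showing RequestMemory = 2000 suggests that neither maxMemory nor default_maxMemory took effect for that job.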

If you send me the incoming and routed job ads and the route configuration, I can take a look to see what's happening.

 - Jaime

On Aug 7, 2024, at 7:14 PM, Marco Mambelli via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

I think the GPU is a red herring.
The problem is the HTCondor-CE bug/feature where even a sleep job asking for basically no memory is bumped by the CE to request 2 GB of memory.
I was setting maxMemory in the glidein ( <submit_attr name="maxMemory" value="RequestMemory"/>) but something is going wrong.

On Aug 7, 2024, at 6:05 PM, Marco Mambelli via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:


Greetings,
HTCondor-CE 23.0.8 (HTCondor 23.8.1) seems to add a GPU requirement that causes all jobs to stay idle.
There is no GPU on the host and no GPU is mentioned in the job ClassAd on the submit host.
On the HTCondor-CE there seems to be a GPU requirement that is not matched and causes the jobs to stay idle.

Below are the outputs of:
- condor_q -better -reverse
- condor_q -better
- condor_q -l | grep -i gpu

All submitted jobs stay idle.
Any suggestions on how to troubleshoot this?
Have there been any recent changes involving GPU requirements?

Thank you,
Marco


Command outputs:


[root@ce-workspace /]# condor_q -all -better -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx 4


-- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...

-- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

  START &&
  (WithinResourceLimits)

START is
  true

WithinResourceLimits is
  (MY.Cpus > 0 &&
    TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
    TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
    TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs is undefined ||
      MY.GPUs >= TARGET.RequestGPUs))

This slot defines the following attributes:

  Cpus = 1
  Disk = 17978392
  GPUs = 0
  Memory = 1763

Job 4.0 has the following attributes:

  TARGET.RequestCpus = 1
  TARGET.RequestDisk = 100
  TARGET.RequestGPUs = undefined
  TARGET.RequestMemory = 2000

The Requirements expression for this slot reduces to these conditions:

     Clusters
Step    Matched  Condition
-----  --------  ---------
[6]           0  TARGET.RequestMemory <= MY.Memory
[12]          1  TARGET.RequestGPUs is undefined

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
  0 (0.00 %) match both slot and job requirements.
  0 match the requirements of this slot.
  1 have job requirements that match this slot.
[root@ce-workspace /]# condor_q -all -better 4


-- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
The Requirements expression for job 4.000 is

  (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)

Job 4.000 defines the following attributes:

  GlideinCpusIsGood =  !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) =!= error)
  JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
  JobIsRunning = (JobStatus =!= 1) && (JobStatus =!= 5) && GlideinCpusIsGood
  JobStatus = 1
  OriginalGPUs = undefined
  RequestGpus = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx has the following attributes:

  TARGET.Gpus = 0
  TARGET.TotalGPUs = 0

The Requirements expression for job 4.000 reduces to these conditions:

       Slots
Step    Matched  Condition
-----  --------  ---------
[0]           0  TARGET.Gpus ?: 0
[1]           1  (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)

No successful match recorded.
Last failed match: Wed Aug  7 22:45:27 2024

Reason for last match failure: no match found

004.000:  Run analysis summary ignoring user priority.  Of 1 machines,
    0 are rejected by your job's requirements
    1 reject your job because of their own requirements
    0 match and are already running your jobs
    0 match but are serving other users
    0 are able to run your job

WARNING:  Be advised:
 Job did not match any machines's constraints
 To see why, pick a machine that you think should match and add
   -reverse -machine <name>
 to your query.

[root@ce-workspace /]# condor_q -all -l 4 | grep -i gpu
AutoClusterAttrs = "MachineLastMatchTime,Offline,RemoteOwner,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,TotalJobRuntime,ConcurrencyLimits,FlockTo,Rank,Requirements,DiskUsage,GlideinCpusIsGood,JobCpus,JobGPUs,JobIsRunning,JobMemory,JobStatus,MATCH_EXP_JOB_GLIDEIN_Cpus,MATCH_EXP_JOB_GLIDEIN_GPUs,MATCH_EXP_JOB_GLIDEIN_Memory,OriginalCpus,OriginalGPUs,OriginalMemory,TotalCpus,TotalGPUs,TotalMemory,WantWholeNode"
GlideinGPUsIsGood =  !isUndefined(MATCH_EXP_JOB_GLIDEIN_GPUs) && (int(MATCH_EXP_JOB_GLIDEIN_GPUs) =!= error)
JOB_GLIDEIN_GPUs = "$$(ifThenElse(WantWholeNode is true, !isUndefined(TotalGPUs) ? TotalGPUs : JobGPUs, OriginalGPUs))"
JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
OriginalGPUs = undefined
RequestGPUs = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
Requirements = (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
[root@ce-workspace /]# condor_ce_version
$HTCondorCEVersion: 23.0.8 $
$CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 GitSHA: 8cf018d1 $
$CondorPlatform: x86_64_AlmaLinux9 $
[root@ce-workspace /]#
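To make the analysis output above easier to read, here is a small Python sketch that re-evaluates the slot's WithinResourceLimits expression using the slot and job values shown in the output. It only mirrors the logic of the printed expression and is not how HTCondor evaluates ClassAds; `None` is used here as a stand-in for ClassAd `undefined`, and the 1024 in the second call is a hypothetical value chosen for illustration.

```python
def within_resource_limits(slot: dict, job: dict) -> bool:
    """Mirror of the WithinResourceLimits expression printed by condor_q."""
    request_gpus = job.get("RequestGPUs")  # None models ClassAd undefined
    return (
        slot["Cpus"] > 0 and job["RequestCpus"] <= slot["Cpus"]
        and slot["Memory"] > 0 and job["RequestMemory"] <= slot["Memory"]
        and slot["Disk"] > 0 and job["RequestDisk"] <= slot["Disk"]
        and (request_gpus is None or slot["GPUs"] >= request_gpus)
    )


# Values copied from the condor_q -better -reverse output above.
slot = {"Cpus": 1, "Disk": 17978392, "GPUs": 0, "Memory": 1763}
job = {"RequestCpus": 1, "RequestDisk": 100,
       "RequestGPUs": None, "RequestMemory": 2000}

print(within_resource_limits(slot, job))  # False: 2000 > 1763

# Hypothetical smaller request, to show that memory is the only blocker.
smaller_job = dict(job, RequestMemory=1024)
print(within_resource_limits(slot, smaller_job))  # True
```

With the values from the output, the GPU clause is satisfied because RequestGPUs is undefined, and the only clause that fails is the memory comparison, which is consistent with step [6] matching 0 clusters.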
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

