Normally, there's no value for RequestMemory that's too small. As the expression shows, the default for RequestMemory is the size of the executable, until the job runs for the first time, at which point it becomes the max amount of memory used on previous executions. Partitionable EPs will round up the memory allocated to the slot to some reasonable minimum amount.
You have something that works, so I don't think we need to dig further.
- Jaime
On Aug 9, 2024, at 10:27 AM, Marco Mambelli <marcom@xxxxxxxx> wrote:
I think I found the problem
maxMemory is 1 and somehow it is too small for the CE to be considered:
maxMemory = ((RequestMemory ?: 1) > 100 ? RequestMemory : 100)
MemoryProvisioned = 128
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
[root@ce-workspace /]# condor_ce_q -af maxMemory RequestMemory ImageSize MemoryUsage
1 1 100 undefined
1 1 100 undefined
1 1 100 undefined
Replacing maxMemory with:
maxMemory = ((RequestMemory ?: 1) > 100 ? RequestMemory : 100)
Gets maxMemory evaluated to 100 and allows jobs to run
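
To make the effect of that expression concrete, here is a small Python sketch of `((RequestMemory ?: 1) > 100 ? RequestMemory : 100)`. The function name is hypothetical and `None` stands in for an undefined ClassAd attribute; this only mirrors the expression's logic and is not how the CE evaluates it.

```python
from typing import Optional


def max_memory(request_memory: Optional[int]) -> int:
    """Sketch of ((RequestMemory ?: 1) > 100 ? RequestMemory : 100)."""
    # (RequestMemory ?: 1): use 1 when RequestMemory is undefined
    probe = request_memory if request_memory is not None else 1
    # Keep RequestMemory only when it exceeds 100; otherwise use 100
    if probe > 100 and request_memory is not None:
        return request_memory
    return 100


print(max_memory(1))     # the RequestMemory value reported by condor_ce_q above
print(max_memory(2000))  # a larger request passes through unchanged
```

For RequestMemory = 1 the sketch returns 100, matching the evaluated value reported above.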
Marco
On Aug 9, 2024, at 10:11 AM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
[EXTERNAL] - This message is from an external sender
You are looking at the unrouted jobs in the CE queue. You need to look at the routed jobs in the site queue (i.e. run condor_q).
- Jaime
On Aug 8, 2024, at 11:49 PM, Marco Mambelli <marcom@xxxxxxxx> wrote:
Initially I had a problem because I was omitting the plus (maxMemory instead of +maxMemory)
But even after setting +maxMemory=RequestMemory, jobs are still not matching
[root@ce-workspace /]# condor_ce_q
-- Schedd: ce-workspace.glideinwms.org : <10.89.0.2:41921?... @ 08/09/24 04:22:29
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
fermilab ID: 1 8/9 03:55 _ _ 1 1 1.0
fermilab ID: 2 8/9 03:57 _ _ 1 1 2.0
fermilab ID: 3 8/9 04:05 _ _ 1 1 3.0
Total for query: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for all users: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
[root@ce-workspace /]# condor_ce_q -l | grep -i mem
maxMemory = RequestMemory
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
maxMemory = RequestMemory
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
maxMemory = RequestMemory
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
[root@ce-workspace /]# condor_ce_q -l | grep -i size
ExecutableSize = 100
ExecutableSize_RAW = 87
ImageSize = 100
ImageSize_RAW = 87
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
TransferInputSizeMB = 0
ExecutableSize = 100
ExecutableSize_RAW = 87
ImageSize = 100
ImageSize_RAW = 87
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
TransferInputSizeMB = 0
ExecutableSize = 100
ExecutableSize_RAW = 87
ImageSize = 100
ImageSize_RAW = 87
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
TransferInputSizeMB = 0
[root@ce-workspace /]# condor_ce_q
-- Schedd: ce-workspace.glideinwms.org : <10.89.0.2:41921?... @ 08/09/24 04:23:53
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
fermilab ID: 1 8/9 03:55 _ _ 1 1 1.0
fermilab ID: 2 8/9 03:57 _ _ 1 1 2.0
fermilab ID: 3 8/9 04:05 _ _ 1 1 3.0
Total for query: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for all users: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
[root@ce-workspace /]# condor_ce_q -better 1.0
-- Schedd: ce-workspace.glideinwms.org : <10.89.0.2:41921?...
The Requirements expression for job 1.000 is
(TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
Job 1.000 defines the following attributes:
DiskUsage = 100
ImageSize = 100
RequestDisk = DiskUsage (kb)
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024) (mb)
The Requirements expression for job 1.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 0 TARGET.Arch == "X86_64"
[1] 0 TARGET.OpSys == "LINUX"
[3] 0 TARGET.Disk >= RequestDisk
[5] 0 TARGET.Memory >= RequestMemory
[root@ce-workspace /]# condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1763 0+00:29:22
Total Owner Claimed Unclaimed Matched Preempting Drain Backfill BkIdle
X86_64/LINUX 1 0 0 1 0 0 0 0 0
Total 1 0 0 1 0 0 0 0 0
[root@ce-workspace /]# condor_status -l grep Memory
condor_status: unknown host grep
[root@ce-workspace /]# condor_status -l | grep Memory
ChildMemory = { }
DetectedMemory = 1763
MachineResources = "Cpus Memory Disk Swap GPUs"
Memory = 1763
TotalMemory = 1763
TotalSlotMemory = 1763
TotalVirtualMemory = 6690348
VirtualMemory = 0
WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs =?= undefined || MY.GPUs >= TARGET.RequestGPUs))
[root@ce-workspace /]# condor_status -l | grep Arch
Arch = "X86_64"
[root@ce-workspace /]# condor_status -l | grep OpSys
OpSys = "LINUX"
OpSysAndVer = "AlmaLinux9"
OpSysLegacy = "LINUX"
OpSysLongName = "AlmaLinux release 9.4 (Seafoam Ocelot)"
OpSysMajorVer = 9
OpSysName = "AlmaLinux"
OpSysShortName = "AlmaLinux"
OpSysVer = 904
[root@ce-workspace /]# condor_status -l | grep Disk
ChildDisk = { }
Disk = 19881824
MachineResources = "Cpus Memory Disk Swap GPUs"
TotalDisk = 19881824
TotalSlotDisk = 19881824.0
WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs =?= undefined || MY.GPUs >= TARGET.RequestGPUs))
[root@ce-workspace /]# condor_ce_q -l | grep -i disk
DiskUsage = 100
DiskUsage_RAW = 89
RequestDisk = DiskUsage
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
DiskUsage = 100
DiskUsage_RAW = 89
RequestDisk = DiskUsage
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
DiskUsage = 100
DiskUsage_RAW = 89
RequestDisk = DiskUsage
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
On Aug 8, 2024, at 1:37 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
The CE will look at maxMemory in the incoming job ad and default_maxMemory of the route in order to set RequestMemory of the routed job. If neither is set, then the value 2000 is used. Does that match your expectation? Note that the RequestMemory attribute of the incoming job is ignored (though if maxMemory references it, that is respected).
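
The fallback described above can be sketched in Python as follows. This is only an illustration: the function and constant names are hypothetical, `None` stands in for an unset attribute, and the precedence of the job's maxMemory over the route's default_maxMemory is an assumption read from the wording of the message, not taken from the CE's code.

```python
from typing import Optional

FALLBACK_REQUEST_MEMORY_MB = 2000  # value quoted in the message above


def routed_request_memory(job_max_memory: Optional[int],
                          route_default_max_memory: Optional[int]) -> int:
    """Pick RequestMemory for the routed job (hypothetical helper)."""
    # Assumed order: the job's maxMemory first, then the route default
    for candidate in (job_max_memory, route_default_max_memory):
        if candidate is not None:
            return candidate
    # Neither is set: fall back to the fixed value
    return FALLBACK_REQUEST_MEMORY_MB


print(routed_request_memory(None, None))  # neither attribute set
print(routed_request_memory(100, None))   # job sets maxMemory
```

With neither attribute set, the sketch returns 2000, the TARGET.RequestMemory value that appears in the original report.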
If you send me the incoming and routed job ads and the route configuration, I can take a look to see what's happening.
- Jaime
On Aug 7, 2024, at 7:14 PM, Marco Mambelli via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
I think the GPU is a red herring.
The problem is the HTCondor-CE bug/feature where even a sleep job asking for basically no memory is bumped by the CE to request 2 GB of memory.
I was setting maxMemory in the glidein (<submit_attr name="maxMemory" value="RequestMemory"/>) but something is going wrong
On Aug 7, 2024, at 6:05 PM, Marco Mambelli via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Greetings,
HTCondor-CE 23.0.8 (htcondor 23.8.1) seems to add a GPU requirement that causes all jobs to stay idle.
There is no GPU on the host and no GPU is mentioned in the job ClassAd on the submit host.
On the HTCondor-CE there seems to be a GPU requirement that is not matched and causes the jobs to stay idle.
Below are the outputs of:
- condor_q -better -reverse
- condor_q -better
- condor_q -l | grep -i gpu
All jobs submitted stay idle
Any suggestion on how to troubleshoot this?
Any recent change involving GPU requirements?
Thank you,
Marco
Command outputs:
[root@ce-workspace /]# condor_q -all -better -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx 4
-- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
-- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters
The Requirements expression for this slot is
START &&
(WithinResourceLimits)
START is
true
WithinResourceLimits is
(MY.Cpus > 0 &&
TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs is undefined ||
MY.GPUs >= TARGET.RequestGPUs))
This slot defines the following attributes:
Cpus = 1
Disk = 17978392
GPUs = 0
Memory = 1763
Job 4.0 has the following attributes:
TARGET.RequestCpus = 1
TARGET.RequestDisk = 100
TARGET.RequestGPUs = undefined
TARGET.RequestMemory = 2000
The Requirements expression for this slot reduces to these conditions:
Clusters
Step Matched Condition
----- -------- ---------
[6] 0 TARGET.RequestMemory <= MY.Memory
[12] 1 TARGET.RequestGPUs is undefined
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
0 (0.00 %) match both slot and job requirements.
0 match the requirements of this slot.
1 have job requirements that match this slot.
[root@ce-workspace /]# condor_q -all -better 4
-- Schedd: ce-workspace.glideinwms.org : <10.89.0.35:46367?...
The Requirements expression for job 4.000 is
(RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
Job 4.000 defines the following attributes:
GlideinCpusIsGood = !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) =!= error)
JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
JobIsRunning = (JobStatus =!= 1) && (JobStatus =!= 5) && GlideinCpusIsGood
JobStatus = 1
OriginalGPUs = undefined
RequestGpus = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx has the following attributes:
TARGET.Gpus = 0
TARGET.TotalGPUs = 0
The Requirements expression for job 4.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 0 TARGET.Gpus ?: 0
[1] 1 (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
No successful match recorded.
Last failed match: Wed Aug 7 22:45:27 2024
Reason for last match failure: no match found
004.000: Run analysis summary ignoring user priority. Of 1 machines,
0 are rejected by your job's requirements
1 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are able to run your job
WARNING: Be advised:
Job did not match any machines's constraints
To see why, pick a machine that you think should match and add
-reverse -machine <name>
to your query.
[root@ce-workspace /]# condor_q -all -l 4 | grep -i gpu
AutoClusterAttrs = "MachineLastMatchTime,Offline,RemoteOwner,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,TotalJobRuntime,ConcurrencyLimits,FlockTo,Rank,Requirements,DiskUsage,GlideinCpusIsGood,JobCpus,JobGPUs,JobIsRunning,JobMemory,JobStatus,MATCH_EXP_JOB_GLIDEIN_Cpus,MATCH_EXP_JOB_GLIDEIN_GPUs,MATCH_EXP_JOB_GLIDEIN_Memory,OriginalCpus,OriginalGPUs,OriginalMemory,TotalCpus,TotalGPUs,TotalMemory,WantWholeNode"
GlideinGPUsIsGood = !isUndefined(MATCH_EXP_JOB_GLIDEIN_GPUs) && (int(MATCH_EXP_JOB_GLIDEIN_GPUs) =!= error)
JOB_GLIDEIN_GPUs = "$$(ifThenElse(WantWholeNode is true, !isUndefined(TotalGPUs) ? TotalGPUs : JobGPUs, OriginalGPUs))"
JobGPUs = JobIsRunning ? int(MATCH_EXP_JOB_GLIDEIN_GPUs) : OriginalGPUs
OriginalGPUs = undefined
RequestGPUs = ifThenElse((WantWholeNode =?= true && OriginalGPUs =!= undefined),( !isUndefined(TotalGPUs) && TotalGPUs > 0) ? TotalGPUs : JobGPUs,OriginalGPUs)
Requirements = (RequestGpus ?: 0) >= (TARGET.Gpus ?: 0)
[root@ce-workspace /]# condor_ce_version
$HTCondorCEVersion: 23.0.8 $
$CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 GitSHA: 8cf018d1 $
$CondorPlatform: x86_64_AlmaLinux9 $
[root@ce-workspace /]#
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.cs.wisc.edu_mailman_listinfo_htcondor-2Dusers&d=DwIFaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=EF06-Wh4L9CNLgD8bnIjNQ&m=p0P_yFwa6Gf99JwL4Sgp1E33007ZsGmps9icaOG7j1wLA1LIjyqSILgof0vfB1hE&s=4z99ZPvqqUGaL9Mu02EwmlMB1GXnpnc90qd7dQOsyug&e=
The archives can be found at:
https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.cs.wisc.edu_archive_htcondor-2Dusers_&d=DwIFaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=EF06-Wh4L9CNLgD8bnIjNQ&m=p0P_yFwa6Gf99JwL4Sgp1E33007ZsGmps9icaOG7j1wLA1LIjyqSILgof0vfB1hE&s=qRvMMDCB5KF-T04yiJiOjBnBTKVfMk1ZVckSSzjZDKY&e=