There are a lot of different approaches depending on your overall pool setup.
If your pool is never really full, you could configure the negotiator to fill up a node completely before using a 'new' one.
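A minimal sketch of that first option, assuming the pool uses partitionable slots. The knob belongs to the negotiator, so it would go into the central manager's configuration:

```
# Central manager configuration (sketch, assumes partitionable slots).
# Ask the negotiator to pack as many jobs as possible onto one machine
# before moving on to the next, instead of spreading them across nodes.
NEGOTIATOR_DEPTH_FIRST = True
```

This only keeps nodes empty while the pool has spare capacity. Once every node is partly busy, it does not free up cores by itself.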
If your workload is very predictable, you could provide some static slots for multicore usage or tag some worker nodes to run only multicore jobs.
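As a rough illustration of both variants, assuming a worker node that is dedicated to multicore work. The slot layout and the 8-core threshold are assumptions for the example, not values taken from your pool:

```
# Worker node configuration (sketch). Pick ONE of the two variants.

# Variant A: a single static slot that owns the whole machine.
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = False

# Variant B: keep a partitionable slot, but only start multicore jobs.
# The threshold of 8 cores is an arbitrary example value.
# START = $(START) && (TARGET.RequestCpus >= 8)
```

Either way, single-core jobs no longer land on that node, so the price is idle cores whenever no multicore job is waiting.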
The defrag daemon can be used to drain a configurable number of nodes until they meet a 'whole-machine' definition, which in your case would be '32 cores == whole-machine'. Multicore jobs can then jump on these slots.
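A sketch of what such a defrag setup could look like on the central manager, restricted to the one node from your example. All rates and limits are illustrative assumptions and would need tuning for a real pool:

```
# Central manager configuration (sketch).
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# Only consider draining the node that should offer whole-machine slots.
DEFRAG_REQUIREMENTS = PartitionableSlot && Machine == "wn-sate-078.nikhef.nl"

# A machine counts as "whole" once 32 cores are free in its partitionable slot.
DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus >= 32

# Stop draining as soon as one whole machine is available.
DEFRAG_MAX_WHOLE_MACHINES = 1
DEFRAG_MAX_CONCURRENT_DRAINING = 1
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0

# Let running jobs retire gracefully instead of evicting them at once.
DEFRAG_SCHEDULE = graceful
```

Whether a waiting 32-core job actually claims the drained node before single-core jobs refill it depends on the negotiation order, so this is usually combined with one of the other approaches.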
There are more subtle approaches too: you can fine-tune the ranking and put multicore jobs at the top of the list, but that only works if suitably sized slots are available, of course.
The startd policy section in the docs is a good read, and the part on the defrag daemon is useful too!
Hi,
So we have frequent requests for multicore or even whole-node jobs. Here's an example:
The Requirements expression for job 27504.000 is
(TARGET.Machine == "wn-sate-078.nikhef.nl") && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
(TARGET.Cpus >= RequestCpus) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
Job 27504.000 defines the following attributes:
DiskUsage = 1
FileSystemDomain = "stoomboot.nikhef.nl"
RequestCpus = 32
RequestDisk = DiskUsage (kb)
RequestMemory = 4000 (mb)
The Requirements expression for job 27504.000 reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          33  TARGET.Machine == "wn-sate-078.nikhef.nl"
[7]          35  TARGET.Memory >= RequestMemory
[8]           1  [0] && [7]
[9]          17  TARGET.Cpus >= RequestCpus
[10]          0  [8] && [9]
[11]       1130  TARGET.FileSystemDomain == MY.FileSystemDomain
No successful match recorded.
Last failed match: Tue Jun 18 12:09:24 2024
Reason for last match failure: no match found
27504.000: Run analysis summary ignoring user priority. Of 30 machines,
29 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
1 are able to run your job
Background: this node has a special capability in addition to being a normal pool node. The bottom line says "1 are able to run your job", but that's not true, as HTCondor keeps scheduling single-core jobs onto that machine, so a 32-core slot can never be collected. Where do I look for documentation on how to do this with HTCondor?
Thanks a lot,
JT