
Re: [HTCondor-users] How is multicore supposed to work in HTCondor? How to get started



Hi Jeff,

That problem, as you may know, is as old as batch scheduling ;)

There are a lot of different approaches depending on your overall pool setup.

If your pool is never really full, you could configure the negotiator to fill up a node completely before using a 'new' one.
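
As a rough sketch of that idea (untested against your pool, and assuming the worker nodes use partitionable slots), depth-first filling is switched on in the negotiator configuration on the central manager:

```
# Central manager: pack jobs onto one machine before moving to the next.
# Only has an effect with partitionable slots; the default is False.
NEGOTIATOR_DEPTH_FIRST = True
```

This only helps while the pool has spare capacity; once every node holds a few single-core jobs, nothing frees up a whole node by itself.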

If your workload is very predictable, you could provide some static slots for multicore usage or tag some worker nodes to run only multicore jobs.
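
A minimal sketch of a dedicated node, assuming a 32-core machine that should offer one static whole-node slot and accept only whole-node requests. The threshold of 32 is an assumption taken from your example job; adjust it and merge the START clause with whatever policy you already have:

```
# Worker node: one static slot covering the whole machine.
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = False

# Only start jobs that ask for the full node (assumed threshold).
START = (TARGET.RequestCpus >= 32)
```

The price is that the node sits idle whenever no multicore job is queued, which is why this fits predictable workloads best.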

The defrag daemon can be used to drain a configurable number of slots down to a 'whole-machine' definition, which in your case would be '32 cores == whole machine'. Multicore jobs would then jump onto these slots.
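
For illustration, a defrag setup along these lines might look as follows. It assumes partitionable slots and that condor_defrag runs on one machine in the pool (typically the central manager). The limits are placeholder values, not recommendations:

```
# Run the defrag daemon alongside the existing daemons.
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# A machine counts as "whole" once 32 cores are free in its partitionable slot.
DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus >= 32

# Only consider draining machines that are not already whole.
# To restrict draining to the special node, one could add:
#   && Machine == "wn-sate-078.nikhef.nl"
DEFRAG_REQUIREMENTS = PartitionableSlot && Cpus < 32

# Placeholder limits: how many whole machines to aim for and how fast to drain.
DEFRAG_MAX_WHOLE_MACHINES = 1
DEFRAG_MAX_CONCURRENT_DRAINING = 1
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
```

Draining leaves cores idle while the last single-core jobs finish, so the draining rate is a trade-off between multicore turnaround and wasted capacity.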

There are more subtle approaches too: you can fine-tune the ranking and put multicore jobs at the top of the list, but of course that only works if there are suitably sized slots.
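
One possible reading of "ranking" here is the startd RANK expression, which lets a machine prefer some jobs over others. The line below is only a sketch of that interpretation; other knobs, such as the negotiator's job ranking or user priorities, could serve the same purpose:

```
# Worker node: prefer jobs that request more cores.
# The job is TARGET from the machine's point of view.
RANK = TARGET.RequestCpus
```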

The startd policy section in the docs is a good read, and the defrag daemon part is useful too!

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Jeff Templon" <templon@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, 18 June 2024 12:13:59
Subject: [HTCondor-users] How is multicore supposed to work in HTCondor? How to get started

Hi,
So we have frequent requests for multicore or even whole-node jobs. Here's an example:

The Requirements expression for job 27504.000 is

    (TARGET.Machine == "wn-sate-078.nikhef.nl") && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.Cpus >= RequestCpus) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))

Job 27504.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "stoomboot.nikhef.nl"
    RequestCpus = 32
    RequestDisk = DiskUsage (kb)
    RequestMemory = 4000 (mb)

The Requirements expression for job 27504.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          33  TARGET.Machine == "wn-sate-078.nikhef.nl"
[7]          35  TARGET.Memory >= RequestMemory
[8]           1  [0] && [7]
[9]          17  TARGET.Cpus >= RequestCpus
[10]          0  [8] && [9]
[11]       1130  TARGET.FileSystemDomain == MY.FileSystemDomain

No successful match recorded.
Last failed match: Tue Jun 18 12:09:24 2024

Reason for last match failure: no match found

27504.000:  Run analysis summary ignoring user priority.  Of 30 machines,
     29 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      1 are able to run your job

Background: this node has a special capability alongside also being a normal pool node. The bottom line says "1 are able to run your job", but that's not true, as HTCondor keeps scheduling single-core jobs onto that machine, so a 32-core slot can never be collected. Where do I look for documentation on how to do this with HTCondor?

Thanks a lot,

JT

