Re: [HTCondor-users] Running non-gpu job on gpu machine referring Whats_New_condor_week

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

It was difficult for me to do the testing on a GPU machine for a long time, so I created a custom resource named GOLD.

# Define a new custom resource type (GOLD, standing in for GPUs)
MACHINE_RESOURCE_NAMES = GOLD
# Specify the number of GOLD resources available on this machine
MACHINE_RESOURCE_GOLD = 4
BackfillSlot = true
ResourceConflict = "GOLD"
use FEATURE : PartitionableSlot(1, 100%)
SLOT_TYPE_1_START = TARGET.RequestGOLD > 0
SLOT_TYPE_2_BACKFILL = true
use FEATURE : PartitionableSlot(2, 90%, GOLD=0)
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
SLOT_TYPE_2_START = TARGET.BackfillJob
# When it's time to go, it's time to go.
MAXJOBRETIREMENTTIME = 0
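
For readers less familiar with ClassAd syntax, the SLOT_TYPE_2_PREEMPT line above can be read roughly as follows. This is only a Python sketch of the expression's logic, not HTCondor code: the function name backfill_should_preempt is made up for illustration, and None stands in for an undefined ClassAd attribute.

```python
from typing import Optional


def backfill_should_preempt(resource_conflict: Optional[str]) -> bool:
    """Mirror of size(ResourceConflict ?: "") > 0, with None meaning undefined."""
    # ClassAd's ?: operator substitutes the right-hand side when the left is undefined.
    value = resource_conflict if resource_conflict is not None else ""
    # size() of a string is its length, so any non-empty value asks for preemption.
    return len(value) > 0


print(backfill_should_preempt(None))    # ResourceConflict undefined
print(backfill_should_preempt("GOLD"))  # a conflict on GOLD is reported
```

Under this reading, the backfill slot only preempts its job when ResourceConflict holds a non-empty string; an undefined or empty value leaves the job alone.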

I am able to run a job that requests GOLD resources. But when I submit a job that does not request GOLD, using the following submit file, it stays in idle status.

executable = sleep.sh
transfer_executable = false
arguments = 600
getenv = True
requirements = machine == "testnode.example.com"
should_transfer_files = NO
+BackfillJob = True
queue 1

The following is the output for slot2 of the destination machine from "condor_q --better-analyze <jobid> -reverse -machine <machinename>":

The Requirements expression for this slot is

START &&
(WithinResourceLimits)

START is
TARGET.BackfillJob

WithinResourceLimits is
(MY.Cpus > 0 &&
TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGOLD is undefined ||
MY.GOLD >= TARGET.RequestGOLD))

This slot defines the following attributes:

Cpus = 149
Disk = 2945875554
GOLD = 0
Memory = 1382398

Job 1173.0 has the following attributes:

TARGET.BackfillJob = true
TARGET.RequestCpus = 1
TARGET.RequestDisk = 3
TARGET.RequestMemory = 2000

The Requirements expression for this slot reduces to these conditions:

Clusters
Step Matched Condition
----- -------- ---------
[0] 1 START
[1] 1 WithinResourceLimits
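
As a cross-check of the analysis output above, the slot-side expression can be evaluated by hand. The following is a minimal Python sketch, not HTCondor code: the function names and dictionaries are made up for illustration, the values are copied from the output above, and None models an undefined attribute such as RequestGOLD.

```python
# Values copied from the analysis output above; None models an undefined attribute.
slot = {"Cpus": 149, "Disk": 2945875554, "GOLD": 0, "Memory": 1382398}
job = {"BackfillJob": True, "RequestCpus": 1, "RequestDisk": 3,
       "RequestMemory": 2000, "RequestGOLD": None}


def within_resource_limits(slot: dict, job: dict) -> bool:
    """Hand translation of the WithinResourceLimits expression shown above."""
    return (
        slot["Cpus"] > 0 and job["RequestCpus"] <= slot["Cpus"]
        and slot["Memory"] > 0 and job["RequestMemory"] <= slot["Memory"]
        and slot["Disk"] > 0 and job["RequestDisk"] <= slot["Disk"]
        and (job.get("RequestGOLD") is None
             or slot["GOLD"] >= job["RequestGOLD"])
    )


def slot_requirements(slot: dict, job: dict) -> bool:
    """START && WithinResourceLimits, where START is TARGET.BackfillJob."""
    return bool(job.get("BackfillJob")) and within_resource_limits(slot, job)


print(slot_requirements(slot, job))
```

If this models the expression correctly, the slot side accepts job 1173.0, which agrees with both steps matching one job in the table. The reason the job stays idle would then have to be looked for somewhere other than this slot's Requirements.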

Thanks & Regards,

Vikrant Aggarwal

On Fri, Feb 7, 2025 at 8:52 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello,

Yes, I do have that in the submit file of the non-GPU job.

On Fri, Feb 7, 2025, 7:33 PM John M Knoeller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

SLOT_TYPE_2_START = TARGET.BackfillJob

says that in order to match with the backfill slot, a job must have

BackfillJob=true

in the job ClassAd. Do your non-GPU jobs have that?
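
One way to check this on the submit side is to look for custom attributes in the submit description. The following is a minimal Python sketch under stated assumptions: the helper name custom_job_attrs and the regular expression are made up for illustration, custom attributes are taken to be lines written as "+Name = value" or "MY.Name = value", and the submit text is the one from the original message in this thread.

```python
import re

# Matches lines such as "+BackfillJob = True" or "MY.BackfillJob = True".
CUSTOM_ATTR = re.compile(r"^\s*(?:\+|MY\.)(\w+)\s*=\s*(.+?)\s*$", re.IGNORECASE)


def custom_job_attrs(submit_text: str) -> dict:
    """Return the custom job ClassAd attributes set in a submit description."""
    attrs = {}
    for line in submit_text.splitlines():
        match = CUSTOM_ATTR.match(line)
        if match:
            attrs[match.group(1)] = match.group(2)
    return attrs


submit_text = """\
executable = sleep.sh
transfer_executable = false
arguments = 600
should_transfer_files = NO
+BackfillJob = True
queue 1
"""

attrs = custom_job_attrs(submit_text)
print(attrs.get("BackfillJob", "").lower() == "true")
```

This only inspects the submit file. For a job that is already queued, the job ad itself is the authority, and it can be printed in full with condor_q -l <jobid>.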

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, February 7, 2025 3:01 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Running non-gpu job on gpu machine referring Whats_New_condor_week_2023

Hello Experts,

I was reading the presentation Whats_New_condor_week_2023 and came across an interesting backfill feature that I wanted to use on a GPU machine.

Based on the presentation, I made this configuration. My GPU job runs on the machine without any trouble.

START = $(START)
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra

PreemptMaxRuntime = 4 * 24 * 60
ExemptMaxRuntime = 4 * 24 * 60

BackfillSlot = true
ResourceConflict = "GPUs"
use FEATURE : PartitionableSlot(1, 100%)
SLOT_TYPE_1_START = TARGET.RequestGpus > 0
SLOT_TYPE_2_BACKFILL = true
use FEATURE : PartitionableSlot(2, 90%, GPUs=0)
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
SLOT_TYPE_2_START = TARGET.BackfillJob

However, a non-GPU job stays in idle status, and --better-analyze doesn't reveal why.

executable = sleep.sh
transfer_executable = false
arguments = 600
should_transfer_files = NO
+BackfillJob = True
queue 1

This is what I see in the better-analyze output for the second slot:

The Requirements expression for this slot reduces to these conditions:

Clusters
Step Matched Condition
----- -------- ---------
[0] 1 START
[1] 1 WithinResourceLimits

Am I missing anything in the configuration needed to make non-GPU jobs run on a GPU machine?

For clarity: at the time of testing, no GPU job was running on that machine; it was completely idle.

Also, has the PreferGPUJobs feature mentioned in the presentation been introduced yet? I couldn't find anything about it in the release notes.

Thanks & Regards,

Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
