Does the job actually sit idle, or is it bouncing quickly between idle and running?
condor_q <jobid> -af NumShadowStarts
will show 0 if the job has never been matched, and > 0 if the job is trying to start and failing for some reason.
If the job is never getting matched, we need to look on the CentralManager; the NegotiatorLog or MatchLog may show why the job is not being matched.
If the job is bouncing between idle and running, then you should look in the StartLog on the execute node with the GOLD resource to see why the job is not able to start successfully.
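The triage above can be sketched in a few lines of Python. This is only an illustration: the function name `triage` and its messages are hypothetical, and treating an empty or "undefined" value as 0 is an assumption. It takes the text printed by `condor_q <jobid> -af NumShadowStarts` and says which log to read next.

```python
def triage(num_shadow_starts: str) -> str:
    """Map the NumShadowStarts value printed by condor_q to the next log to check."""
    text = num_shadow_starts.strip()
    # Assumption: an attribute that was never set prints as "undefined" (or nothing);
    # treat that the same as a job that has never started.
    starts = 0 if text in ("", "undefined") else int(text)
    if starts == 0:
        return "never matched: check NegotiatorLog or MatchLog on the central manager"
    return "matched but failing to start: check StartLog on the execute node"


print(triage("0"))
print(triage("3"))
```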
-tj
From: Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, February 21, 2025 4:14 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running non-gpu job on gpu machine referring Whats_New_condor_week_2023
It was difficult for me to test on a GPU machine for an extended period, so I created a custom resource named GOLD instead.
# Define a new custom resource type (e.g., GPUs)
MACHINE_RESOURCE_NAMES = GOLD
# Specify the number of GPUs available on this machine
MACHINE_RESOURCE_GOLD = 4
BackfillSlot = true
ResourceConflict = "GOLD"
use FEATURE : PartitionableSlot(1, 100%)
SLOT_TYPE_1_START = TARGET.RequestGOLD > 0
SLOT_TYPE_2_BACKFILL = true
use FEATURE : PartitionableSlot(2, 90%, GOLD=0)
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
SLOT_TYPE_2_START = TARGET.BackfillJob
# When it's time to go, it's time to go.
MAXJOBRETIREMENTTIME = 0
I am able to run a job that requests GOLD resources. But when I submit a job without a GOLD request, using the following submit file, it stays idle.
executable = sleep.sh
transfer_executable = false
arguments = 600
getenv = True
requirements = machine == "testnode.example.com"
should_transfer_files = NO
+BackfillJob = True
queue 1
Following is the output from slot2 of the destination machine with "condor_q --better-analyze <jobid> -reverse -machine <machinename>"
The Requirements expression for this slot is
  START &&
  (WithinResourceLimits)
 START is
  TARGET.BackfillJob
 WithinResourceLimits is
  (MY.Cpus > 0 &&
   TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
   TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
   TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGOLD is undefined ||
    MY.GOLD >= TARGET.RequestGOLD))
This slot defines the following attributes:
  Cpus = 149
  Disk = 2945875554
  GOLD = 0
  Memory = 1382398
Job 1173.0 has the following attributes:
  TARGET.BackfillJob = true
  TARGET.RequestCpus = 1
  TARGET.RequestDisk = 3
  TARGET.RequestMemory = 2000
The Requirements expression for this slot reduces to these conditions:
       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  START
[1]           1  WithinResourceLimits
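To make the analysis output concrete, the following Python sketch evaluates the same START and WithinResourceLimits conditions with the slot and job values shown. The function and dictionary names are hypothetical, and `None` stands in for a ClassAd attribute that is undefined. It is only the arithmetic, not how HTCondor itself evaluates ClassAds.

```python
def within_resource_limits(slot: dict, job: dict) -> bool:
    """Mirror the WithinResourceLimits expression printed by condor_q."""
    request_gold = job.get("RequestGOLD")  # None models "undefined"
    return (
        slot["Cpus"] > 0 and job["RequestCpus"] <= slot["Cpus"]
        and slot["Memory"] > 0 and job["RequestMemory"] <= slot["Memory"]
        and slot["Disk"] > 0 and job["RequestDisk"] <= slot["Disk"]
        and (request_gold is None or slot["GOLD"] >= request_gold)
    )


# Values copied from the analysis output for slot2 and job 1173.0.
slot = {"Cpus": 149, "Disk": 2945875554, "GOLD": 0, "Memory": 1382398}
job = {"BackfillJob": True, "RequestCpus": 1, "RequestDisk": 3, "RequestMemory": 2000}

# START is TARGET.BackfillJob; Requirements is START && WithinResourceLimits.
matches = bool(job.get("BackfillJob")) and within_resource_limits(slot, job)
print(matches)  # prints True
```

Both conditions hold, which agrees with the Matched column above, so the slot's Requirements expression does not appear to be what keeps the job idle.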
Thanks & Regards,
Vikrant Aggarwal
On Fri, Feb 7, 2025 at 8:52 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello,
Yes, I do have that in the submit file of the non-gpu job.
On Fri, Feb 7, 2025, 7:33 PM John M Knoeller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
SLOT_TYPE_2_START = TARGET.BackfillJob
says that in order to match with the backfill slot, a job must have
  BackfillJob=true
in the job classad. Do your non-gpu jobs have that?
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, February 7, 2025 3:01 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Running non-gpu job on gpu machine referring Whats_New_condor_week_2023
Hello Experts,
I was reading the presentation Whats_New_condor_week_2023 and came across an interesting backfill feature that I wanted to use on a gpu machine.
Based on the presentation, I made this configuration; my GPU job runs on the machine without any trouble.
START = $(START)
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra
PreemptMaxRuntime = 4 * 24 * 60
ExemptMaxRuntime = 4 * 24 * 60
BackfillSlot = true
ResourceConflict = "GPUs"
use FEATURE : PartitionableSlot(1, 100%)
SLOT_TYPE_1_START = TARGET.RequestGpus > 0
SLOT_TYPE_2_BACKFILL = true
use FEATURE : PartitionableSlot(2, 90%, GPUs=0)
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
SLOT_TYPE_2_START = TARGET.BackfillJob
However, a non-gpu job stays in idle status. --better-analyze doesn't reveal why it's idle.
executable = sleep.sh
transfer_executable = false
arguments = 600
should_transfer_files = NO
+BackfillJob = True
queue 1
I see the following in better-analyze for the second slot.
The Requirements expression for this slot reduces to these conditions:
       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  START
[1]           1  WithinResourceLimits
Am I missing anything in the configuration to make non-gpu jobs run on a gpu machine?
For clarity: at the time of testing, no GPU job was running on that machine; it was completely idle.
Also, has the PreferGPUJobs feature mentioned in the presentation been introduced yet? I couldn't find anything about it in the release notes.
Thanks & Regards,
Vikrant Aggarwal
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/