
Re: [HTCondor-users] Running non-gpu job on gpu machine referring Whats_New_condor_week_2023



Does the job actually sit idle, or is it bouncing quickly between idle and running? 

condor_q <jobid> -af NumShadowStarts

will show 0 if the job has never been matched, and a value greater than 0 if the job is trying to start and failing for some reason.

If the job is never getting matched, we need to look on the central manager; the NegotiatorLog or MatchLog may show
why the job is not being matched.

If the job is bouncing between idle and running, then you should look in the StartLog on the execute node with the GOLD resource to see why the job is not able to start successfully.
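
The two cases above can be sketched as a small triage helper. This is only an illustration: the function names are hypothetical, the only HTCondor command used is the condor_q invocation quoted above, and treating non-numeric output as 0 is an assumption.

```python
import subprocess


def read_num_shadow_starts(job_id: str) -> int:
    """Run `condor_q <jobid> -af NumShadowStarts` and parse the result."""
    out = subprocess.run(
        ["condor_q", job_id, "-af", "NumShadowStarts"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Assumption: anything that is not a plain integer (for example
    # "undefined") is treated as "never started", i.e. 0.
    return int(out) if out.isdigit() else 0


def next_log_to_check(num_shadow_starts: int) -> str:
    """Map the NumShadowStarts value to the log suggested in this message."""
    if num_shadow_starts > 0:
        # The job was matched and tried to start, so look at the execute node.
        return "StartLog on the execute node"
    # The job was never matched, so look at the central manager.
    return "NegotiatorLog or MatchLog on the central manager"
```

For example, `next_log_to_check(read_num_shadow_starts("1173.0"))` would name the log to read first for the job discussed in this thread.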

-tj


From: Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, February 21, 2025 4:14 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running non-gpu job on gpu machine referring Whats_New_condor_week_2023
 
It was difficult for me to test on a GPU machine for an extended period, so I created a custom resource named GOLD.

# Define a new custom resource type (GOLD stands in for GPUs here)
MACHINE_RESOURCE_NAMES = GOLD
# Specify the number of GOLD units available on this machine
MACHINE_RESOURCE_GOLD = 4
BackfillSlot = true
ResourceConflict = "GOLD"
use FEATURE : PartitionableSlot(1, 100%)
SLOT_TYPE_1_START = TARGET.RequestGOLD > 0
SLOT_TYPE_2_BACKFILL = true
use FEATURE : PartitionableSlot(2, 90%, GOLD=0)
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
SLOT_TYPE_2_START = TARGET.BackfillJob
# When it's time to go, it's time to go.
MAXJOBRETIREMENTTIME = 0


I am able to run a job that requests GOLD resources. But when I submit a job that does not request GOLD, using the following submit file, it stays idle.

executable = sleep.sh
transfer_executable = false
arguments = 600
getenv = True
requirements = machine == "testnode.example.com"
should_transfer_files = NO
+BackfillJob = True
queue 1


The following is the output for slot2 of the destination machine from "condor_q -better-analyze <jobid> -reverse -machine <machinename>":

The Requirements expression for this slot is

    START &&
    (WithinResourceLimits)

  START is
    TARGET.BackfillJob

  WithinResourceLimits is
    (MY.Cpus > 0 &&
      TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
      TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
      TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGOLD is undefined ||
        MY.GOLD >= TARGET.RequestGOLD))

This slot defines the following attributes:

    Cpus = 149
    Disk = 2945875554
    GOLD = 0
    Memory = 1382398

Job 1173.0 has the following attributes:

    TARGET.BackfillJob = true
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 3
    TARGET.RequestMemory = 2000

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  START
[1]           1  WithinResourceLimits
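
As a cross-check of the analysis above, the slot's Requirements can be re-evaluated by hand with the attribute values it reports. The sketch below is plain Python, not ClassAd evaluation: None stands in for an undefined attribute, and the function name is hypothetical.

```python
def within_resource_limits(slot: dict, job: dict) -> bool:
    """Mirror the WithinResourceLimits expression printed by the analysis."""
    request_gold = job.get("RequestGOLD")  # None models "is undefined"
    return (
        slot["Cpus"] > 0 and job["RequestCpus"] <= slot["Cpus"]
        and slot["Memory"] > 0 and job["RequestMemory"] <= slot["Memory"]
        and slot["Disk"] > 0 and job["RequestDisk"] <= slot["Disk"]
        and (request_gold is None or slot["GOLD"] >= request_gold)
    )


# Attribute values copied from the analysis output for slot2 and job 1173.0.
slot = {"Cpus": 149, "Disk": 2945875554, "GOLD": 0, "Memory": 1382398}
job = {"BackfillJob": True, "RequestCpus": 1, "RequestDisk": 3, "RequestMemory": 2000}

# Requirements is START && WithinResourceLimits, and START is TARGET.BackfillJob.
slot_accepts_job = job.get("BackfillJob") is True and within_resource_limits(slot, job)
```

With these values both conditions hold, which agrees with the "Matched" column: the backfill slot's own requirements do not reject the job, so the cause of the idle state has to be looked for elsewhere.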





Thanks & Regards,
Vikrant Aggarwal


On Fri, Feb 7, 2025 at 8:52 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello, 

Yes, I do have that in the submit file of the non-GPU job.

On Fri, Feb 7, 2025, 7:33 PM John M Knoeller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
SLOT_TYPE_2_START = TARGET.BackfillJob

says that in order to match with the backfill slot, a job must have 

   BackfillJob=true

in the job ClassAd.  Do your non-GPU jobs have that?
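
The reason the attribute matters can be sketched in a few lines. In ClassAd evaluation, a reference to an attribute the job does not define yields undefined, and a START expression that is undefined does not match. The Python below only models that behaviour: None plays the role of undefined, and both function names are hypothetical.

```python
from typing import Optional


def backfill_start(job_ad: dict) -> Optional[bool]:
    """Model START = TARGET.BackfillJob; None stands in for undefined."""
    return job_ad.get("BackfillJob")


def backfill_slot_matches(job_ad: dict) -> bool:
    """Only a literal True counts as a match; False and undefined do not."""
    return backfill_start(job_ad) is True
```

A job submitted with `+BackfillJob = True` corresponds to `backfill_slot_matches({"BackfillJob": True})`, while a job without that line corresponds to `backfill_slot_matches({})` and is not matched by the backfill slot.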



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, February 7, 2025 3:01 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Running non-gpu job on gpu machine referring Whats_New_condor_week_2023
 
Hello Experts,

I was reading the presentation Whats_New_condor_week_2023 and came across an interesting backfill feature that I wanted to use on a GPU machine.

Based on the presentation, I made this configuration. My GPU job runs on the machine without any trouble.

START = $(START)
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra
PreemptMaxRuntime = 4 * 24 * 60
ExemptMaxRuntime = 4 * 24 * 60

BackfillSlot = true
ResourceConflict = "GPUs"
use FEATURE : PartitionableSlot(1, 100%)
SLOT_TYPE_1_START = TARGET.RequestGpus > 0
SLOT_TYPE_2_BACKFILL = true
use FEATURE : PartitionableSlot(2, 90%, GPUs=0)
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
SLOT_TYPE_2_START = TARGET.BackfillJob

However, a non-GPU job stays idle, and -better-analyze doesn't reveal why.

executable = sleep.sh
transfer_executable = false
arguments = 600
should_transfer_files = NO
+BackfillJob = True
queue 1

This is what I see in the -better-analyze output for the second slot:

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  START
[1]           1  WithinResourceLimits


Am I missing anything in the configuration to make non-GPU jobs run on a GPU machine?

For clarity: at the time of testing no GPU job was running on that machine; it was completely idle.

Also, has the PreferGPUJobs feature mentioned in the presentation been released yet? I couldn't find anything about it in the release notes.




Thanks & Regards,
Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/