Does the job actually sit idle, or is it bouncing quickly between idle and running?
condor_q <jobid> -af NumShadowStarts
will show 0 if the job has never been matched, and a value greater than 0 if the job is trying to start and failing for some reason.
If the job is never getting matched, look on the central manager: the NegotiatorLog or MatchLog may show why the job is not being matched.
If the job is bouncing between idle and running, then you should look in the StartLog on the execute node with the GOLD resource to see why the job is not able to start successfully.
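The check above can be sketched as a small shell helper. This is only an illustration: the function name classify_idle_job and its output labels are made up here, and it assumes the value passed in is whatever condor_q printed for NumShadowStarts.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): classify an idle job
# from the NumShadowStarts value reported by condor_q.
#   never-matched -> check NegotiatorLog / MatchLog on the central manager
#   start-failing -> check StartLog on the execute node
#   unknown       -> value was empty or not a number
classify_idle_job() {
    starts="$1"
    case "$starts" in
        ''|*[!0-9]*)
            # Empty or non-numeric output, e.g. an undefined attribute.
            echo "unknown"
            ;;
        0)
            echo "never-matched"
            ;;
        *)
            echo "start-failing"
            ;;
    esac
}

# Usage against a real schedd (needs condor_q on PATH; <jobid> is a placeholder):
#   classify_idle_job "$(condor_q <jobid> -af NumShadowStarts)"
```

The function only interprets the number; which log to read next still follows the two cases described above.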
-tj
From: Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, February 21, 2025 4:14 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running non-gpu job on gpu machine referring Whats_New_condor_week_2023

It was difficult for me to test on a GPU machine for a long time, so I created a custom resource named GOLD.
# Define a new custom resource type (e.g., GPUs)
MACHINE_RESOURCE_NAMES = GOLD
# Specify the number of GPUs available on this machine
MACHINE_RESOURCE_GOLD = 4

BackfillSlot = true
ResourceConflict = "GOLD"
use FEATURE : PartitionableSlot(1, 100%)
SLOT_TYPE_1_START = TARGET.RequestGOLD > 0
SLOT_TYPE_2_BACKFILL = true
use FEATURE : PartitionableSlot(2, 90%, GOLD=0)
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
SLOT_TYPE_2_START = TARGET.BackfillJob
# When it's time to go, it's time to go.
MAXJOBRETIREMENTTIME = 0

I am able to run a job requesting GOLD resources. But when I tried to run a job without the GOLD resource, using the following submit file, it stays in idle status.
executable = sleep.sh
transfer_executable = false
arguments = 600
getenv = True
requirements = machine == "testnode.example.com"
should_transfer_files = NO
+BackfillJob = True
queue 1

Following is the output from slot2 of the destination machine with "condor_q --better-analyze <jobid> -reverse -machine <machinename>":
The Requirements expression for this slot is

  START && (WithinResourceLimits)

  START is
    TARGET.BackfillJob

  WithinResourceLimits is
    (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus &&
     MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory &&
     MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk &&
     (TARGET.RequestGOLD is undefined || MY.GOLD >= TARGET.RequestGOLD))

This slot defines the following attributes:

  Cpus = 149
  Disk = 2945875554
  GOLD = 0
  Memory = 1382398

Job 1173.0 has the following attributes:

  TARGET.BackfillJob = true
  TARGET.RequestCpus = 1
  TARGET.RequestDisk = 3
  TARGET.RequestMemory = 2000

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  START
[1]           1  WithinResourceLimits

Thanks & Regards,
Vikrant Aggarwal
On Fri, Feb 7, 2025 at 8:52 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote: