
Re: [HTCondor-users] Troubleshooting slots configuration



The StartLog will probably tell you what the startd is unhappy about.  I would guess that the startd is failing to start up because it cannot provision the slot configuration.

 

These lines

 

   use feature : GPUs

   GPU_DISCOVERY_EXTRA = -extra

 

conflict with these lines

 

   MACHINE_RESOURCE_GPUs = GPU_0, GPU_1, GPU_2, GPU_3

   ENVIRONMENT_FOR_AssignedGPUs = GPU_NAME GPU_ID=/CUDA//

 

You should definitely get rid of one of those two sets; I would recommend getting rid of the second set of configuration lines.
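
As a rough sketch only (untested, and assuming every other line of 20-local-hardware.config stays exactly as posted), the file with the second set removed would look something like this:

   # GPU detection is left to the GPUs feature
   use feature : GPUs
   GPU_DISCOVERY_EXTRA = -extra

   NUM_CPUS = 20

   # MACHINE_RESOURCE_GPUs and ENVIRONMENT_FOR_AssignedGPUs removed here

   # One partitionable slot, as in the original file
   NUM_SLOTS = 1
   NUM_SLOTS_TYPE_1 = 1
   SLOT_TYPE_1 = cpus=100%
   SLOT_TYPE_1_PARTITIONABLE = true

After changing the file, restart or reconfigure HTCondor and check the StartLog again to see whether the startd still complains.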

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Andrea Borsic
Sent: Friday, November 11, 2022 7:59 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Troubleshooting slots configuration

 

Hi All,

 

I have installed condor 9.12 on an Ubuntu 20.04 server using the rpm packages.

 

The directory /etc/condor/config.d contains:

 

/etc/condor/config.d/etc/condor/config.d/10-nes-cm-submit-execute-node.config (default file)

use security:recommended_v9_0

 

/etc/condor/config.d/10-nes-cm-submit-execute-node.config (created by me)

use ROLE : centralmanager

use ROLE : submit

use ROLE : execute

CONDOR_HOST = 192.168.10.160

CONDOR_COLLECTOR = $(CONDOR_HOST)

 

/etc/condor/config.d/20-local-hardware.config (created by me)

use feature : GPUs

GPU_DISCOVERY_EXTRA = -extra

NUM_CPUS = 20

MACHINE_RESOURCE_GPUs = GPU_0, GPU_1, GPU_2, GPU_3

ENVIRONMENT_FOR_AssignedGPUs = GPU_NAME GPU_ID=/CUDA//

NUM_SLOTS = 1

NUM_SLOTS_TYPE_1 = 1

SLOT_TYPE_1 = cpus=100%

SLOT_TYPE_1_PARTITIONABLE = true

 

/var/log/condor/MasterLog indicates that the three files above are considered to determine the overall configuration.

 

The file 20-local-hardware.config was used on a previous condor 8.8 configuration.

 

At this time, if I type “condor_status” I get no output on screen. All the expected processes are running.

 

Does anyone have any tips on why no slot / node information is appearing with condor_status? Is there a particular log file that might indicate problems with the slot and GPU resource definitions? I have looked at the files under /var/log/condor, but I wasn’t able to find any clue as to why the system does not seem to be configured properly.

 

Thanks for any advice,

 

Best Regards,

 

Andrea