
Re: [HTCondor-users] Getting GPU utilization in condor machine slots



Yes, the script hangs on purpose when it detects that there are no GPUs it can monitor.
If it exited instead of hanging, the Startd would just restart it constantly.

Regarding these two outputs:

# /usr/libexec/condor/condor_gpu_utilization
Hanging to prevent process churn.

# /usr/libexec/condor/condor_gpu_utilization
cuInit(0) failed, aborting.
Hanging to prevent process churn.

The first seems to show that the machine has no GPUs. I would expect condor_gpu_discovery, run on that machine, to show GPUs=0.

The second indicates that something is broken with the Nvidia libraries and that nvcuda cannot be initialized. This is often a sign of a mismatch between the Nvidia driver library and the CUDA runtime library.
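
To tell the two cases apart on many machines, something like the following might help. It is only a rough sketch and not part of HTCondor: the function name classify_gpu_util_output and the labels it prints are made up for illustration, and it recognizes only the two messages quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with HTCondor): map the diagnostic text
# printed by condor_gpu_utilization to one of the two cases described above.
classify_gpu_util_output() {
    case "$1" in
        # Check cuInit first: the failing-CUDA output also contains the
        # "Hanging ..." line, so the order of the patterns matters.
        *"cuInit(0) failed"*)
            echo "cuda-init-failed" ;;   # Nvidia/CUDA libraries could not be initialized
        *"Hanging to prevent process churn"*)
            echo "no-gpus" ;;            # no GPUs found that can be monitored
        *)
            echo "unrecognized" ;;
    esac
}

# Possible use on an execute node (path taken from this thread). timeout(1)
# from coreutils is used because the tool sleeps on purpose instead of exiting:
#   out=$(timeout 5 /usr/libexec/condor/condor_gpu_utilization 2>&1)
#   classify_gpu_util_output "$out"
```

A "no-gpus" result is worth cross-checking against condor_gpu_discovery on the same machine; a "cuda-init-failed" result points at the driver and CUDA library installation rather than at HTCondor.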

As for how often it updates: if the machine has working GPUs, condor_gpu_utilization will send an update to the STARTD every 10 seconds.
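
Once updates are flowing, the values should be visible in the slot ads. The sketch below is one way to look at them. The attribute names DeviceGPUsAverageUsage and DeviceGPUsMemoryPeakUsage are an assumption based on the current HTCondor documentation and should be verified against the version in use; the function name show_gpu_usage and the CONDOR_STATUS variable are hypothetical.

```shell
#!/bin/sh
# Assumption: the slot-ad attribute names below match your HTCondor version.
# CONDOR_STATUS is a hypothetical override so the command can be swapped out.
CONDOR_STATUS=${CONDOR_STATUS:-condor_status}

# Print slot name, average GPU usage, and peak GPU memory usage for every
# slot that currently advertises a GPU usage value.
show_gpu_usage() {
    "$CONDOR_STATUS" \
        -constraint 'DeviceGPUsAverageUsage =!= undefined' \
        -af Name DeviceGPUsAverageUsage DeviceGPUsMemoryPeakUsage
}

# Usage on a machine that can reach the collector:
#   show_gpu_usage
```

If the query returns nothing, checking the condor_gpu_utilization output directly on the execute node is probably the quicker next step.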

-tj



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, July 22, 2024 5:00 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Getting GPU utilization in condor machine slots
 
Hello Experts,

Using condor version 9.0.17

As per the link [1], this script is supposed to advertise GPU utilization information in slots, but it is hanging.

# ldd /usr/libexec/condor/condor_gpu_utilization
        linux-vdso.so.1 =>  (0x00007ffd9a33d000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007fc30868e000)
        libresolv.so.2 => /usr/lib64/libresolv.so.2 (0x00007fc308474000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007fc30816c000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00007fc307e6a000)
        libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007fc307c44000)
        libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007fc307a2e000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007fc307812000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007fc307444000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc308892000)

On some GPU machines it works, on others it does not (no pattern in the type of GPUs). On the failing machines it prints:

# /usr/libexec/condor/condor_gpu_utilization
Hanging to prevent process churn.

# /usr/libexec/condor/condor_gpu_utilization
cuInit(0) failed, aborting.
Hanging to prevent process churn.

Running strace shows nothing after the nanosleep call:

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1658970, si_uid=0, si_status=1, si_utime=0, si_stime=0} ---
openat(AT_FDCWD, "/proc/modules", O_RDONLY) = 8
fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(8, "overlay 139264 0 - Live 0xffffff"..., 1024) = 1024
read(8, "1 ast, Live 0xffffffffc16ce000\ne"..., 1024) = 1024
read(8, "dc0000\ni2c_piix4 24576 0 - Live "..., 1024) = 1024
read(8, "ffffffffc084f000\nib_ipoib 147456"..., 1024) = 1024
read(8, "0xffffffffc0735000\nasync_tx 1638"..., 1024) = 1024
read(8, " 0 - Live 0xffffffffc0772000\npsa"..., 1024) = 1024
read(8, " - Live 0xffffffffc05d0000\ncnic "..., 1024) = 901
read(8, "", 1024)                       = 0
close(8)                                = 0
stat("/usr/bin/nvidia-modprobe", {st_mode=S_IFREG|S_ISUID|0755, st_size=43392, ...}) = 0
geteuid()                               = 0
ioctl(-1, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffe55e09710) = -1 EBADF (Bad file descriptor)
write(2, "cuInit(0) failed, aborting.\n", 28cuInit(0) failed, aborting.
) = 28
write(2, "Hanging to prevent process churn"..., 34Hanging to prevent process churn.
) = 34
nanosleep({tv_sec=1024, tv_nsec=0},



- How often does it update the slot ClassAd?
- Is it possible to advertise this info in the job ClassAd as well, like CPUsUsage?
 

[1] https://htcondor.readthedocs.io/en/lts/admin-manual/monitoring.html#gpus


Thanks & Regards,
Vikrant Aggarwal