
[HTCondor-users] Getting GPU utilization in condor machine slots



Hello Experts,

We are using HTCondor version 9.0.17.

As per the link [1], the condor_gpu_utilization script is supposed to advertise GPU utilization information in the slots, but the script is hanging.

# ldd /usr/libexec/condor/condor_gpu_utilization
    linux-vdso.so.1 =>  (0x00007ffd9a33d000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007fc30868e000)
    libresolv.so.2 => /usr/lib64/libresolv.so.2 (0x00007fc308474000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007fc30816c000)
    libm.so.6 => /usr/lib64/libm.so.6 (0x00007fc307e6a000)
    libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007fc307c44000)
    libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007fc307a2e000)
    libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007fc307812000)
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007fc307444000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fc308892000)

On some GPU machines it works (there is no pattern in the type of GPU).

# /usr/libexec/condor/condor_gpu_utilization
Hanging to prevent process churn.

# /usr/libexec/condor/condor_gpu_utilization
cuInit(0) failed, aborting.
Hanging to prevent process churn.

Running strace shows nothing after nanosleep:

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1658970, si_uid=0, si_status=1, si_utime=0, si_stime=0} ---
openat(AT_FDCWD, "/proc/modules", O_RDONLY) = 8
fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(8, "overlay 139264 0 - Live 0xffffff"..., 1024) = 1024
read(8, "1 ast, Live 0xffffffffc16ce000\ne"..., 1024) = 1024
read(8, "dc0000\ni2c_piix4 24576 0 - Live "..., 1024) = 1024
read(8, "ffffffffc084f000\nib_ipoib 147456"..., 1024) = 1024
read(8, "0xffffffffc0735000\nasync_tx 1638"..., 1024) = 1024
read(8, " 0 - Live 0xffffffffc0772000\npsa"..., 1024) = 1024
read(8, " - Live 0xffffffffc05d0000\ncnic "..., 1024) = 901
read(8, "", 1024)                       = 0
close(8)                                = 0
stat("/usr/bin/nvidia-modprobe", {st_mode=S_IFREG|S_ISUID|0755, st_size=43392, ...}) = 0
geteuid()                               = 0
ioctl(-1, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffe55e09710) = -1 EBADF (Bad file descriptor)
write(2, "cuInit(0) failed, aborting.\n", 28cuInit(0) failed, aborting.
) = 28
write(2, "Hanging to prevent process churn"..., 34Hanging to prevent process churn.
) = 34
nanosleep({tv_sec=1024, tv_nsec=0},
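
The trace above shows the tool reading /proc/modules, checking for nvidia-modprobe, and then issuing an ioctl on file descriptor -1 before cuInit(0) fails. That may indicate the NVIDIA device node could not be opened, though the trace alone does not prove it. A minimal sketch for checking the same things outside of HTCondor follows. The helper names loaded_nvidia_modules, try_cuinit and report are hypothetical, and the library name libcuda.so.1 and the device path /dev/nvidiactl are assumptions about a typical NVIDIA driver install:

```python
import ctypes
import os


def loaded_nvidia_modules(proc_modules_text):
    """Return names of loaded kernel modules whose name starts with 'nvidia'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return names


def try_cuinit(library="libcuda.so.1"):
    """Call cuInit(0) directly and return its CUresult code.

    Returns None if the driver library cannot be loaded at all.
    The default library name is an assumption about the install.
    """
    try:
        cuda = ctypes.CDLL(library)
    except OSError:
        return None
    cuda.cuInit.argtypes = [ctypes.c_uint]
    cuda.cuInit.restype = ctypes.c_int
    return cuda.cuInit(0)


def report():
    """Print the three checks; intended to be run by hand on a GPU node."""
    with open("/proc/modules") as f:
        print("nvidia modules:", loaded_nvidia_modules(f.read()))
    # /dev/nvidiactl is assumed to be the control device node.
    print("/dev/nvidiactl exists:", os.path.exists("/dev/nvidiactl"))
    print("cuInit(0) returned:", try_cuinit())
```

Calling report() on a working and a failing machine and comparing the output may help narrow down the difference. A return value of 0 from cuInit means CUDA_SUCCESS; any other value is a CUDA driver error code.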



- How often does it update the slot classad?
- Is it possible to advertise this information in the job ClassAd as well, like CPUsUsage?

[1] https://htcondor.readthedocs.io/en/lts/admin-manual/monitoring.html#gpus


Thanks & Regards,
Vikrant Aggarwal