Re: [HTCondor-users] GPU monitoring vanished in my pool :(
- Date: Tue, 26 May 2020 14:31:25 +0200 (CEST)
- From: "Beyer, Christoph" <christoph.beyer@xxxxxxx>
- Subject: Re: [HTCondor-users] GPU monitoring vanished in my pool :(
Hi,
I did some more digging into this but did not find any hint as to what is actually going wrong. I suspect the gpu_monitoring process should write out a bit more information?
Could the version mismatch between the worker nodes and the schedd be critical?
I really need the GPU usage and memory usage information for statistics etc. :(
best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
----- Original Message -----
From: "Christoph Beyer" <christoph.beyer@xxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, 5 May 2020 12:32:18
Subject: Re: [HTCondor-users] GPU monitoring vanished in my pool :(
Just a follow-up, condor_gpu_utilization seems to work fine:
[root@batchg003 ~]# /usr/libexec/condor/condor_gpu_utilization
SlotMergeConstraint = StringListMember( "CUDA0", AssignedGPUs )
UptimeGPUsSeconds = 0.000000
UptimeGPUsMemoryPeakUsage = 11
- GPUsSlot0
SlotMergeConstraint = StringListMember( "CUDA0", AssignedGPUs )
UptimeGPUsSeconds = 0.000000
UptimeGPUsMemoryPeakUsage = 11
- GPUsSlot0
----- Original Message -----
From: "Christoph Beyer" <christoph.beyer@xxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, 5 May 2020 11:37:57
Subject: [HTCondor-users] GPU monitoring vanished in my pool :(
Hi,
I use the two GPU features on my GPU nodes:
[root@batchg003 ~]# condor_config_val use feature:GPUs
use FEATURE:GPUs is
MACHINE_RESOURCE_INVENTORY_GPUs=$(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
ENVIRONMENT_FOR_AssignedGPUs=GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs=10000
use feature : GPUsMonitor
[root@batchg003 ~]# condor_config_val use feature:GPUsMonitor
use FEATURE:GPUsMonitor is
use feature : Monitor( GPUs, WaitForExit, 1, $(LIBEXEC)/condor_gpu_utilization, SUM:GPUs, PEAK:GPUsMemory )
In the past I was able, for a while, to check the results in the history record of a job, like this:
condor_history 11262904 -af:l GPUsMemoryUsage GPUsProvisioned GPUsUsage
> GPUsMemoryUsage = 29261.0 GPUsProvisioned = 4 GPUsUsage = 3.688929331491713
(provided the job used any GPUs, of course). Unfortunately, these attributes have vanished from my history without any changes having been made (at least no intentional ones).
I run 8.9.3 on the GPU nodes and 8.9.1 on the schedd, but that should not explain it, right?
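To narrow down when the attributes disappeared, something like the following might help. It is only a rough sketch: `missing_gpu_usage` is a hypothetical helper name, and it assumes the `-af:l` output keeps the `Attr = value` layout shown above and prints `undefined` for an attribute that is absent from the job ad. The `RequestGPUs` constraint in the commented example is likewise an assumption about how the jobs were submitted.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): reads lines in the
# "Attr = value Attr = value ..." layout produced by `condor_history -af:l`
# on stdin and prints the ClusterId of every record whose GPUsUsage
# is missing or undefined.
missing_gpu_usage() {
    awk '
    {
        id = ""; usage = "undefined"
        # Walk over "name = value" triples on the line.
        for (i = 1; i + 2 <= NF; i++) {
            if ($(i + 1) != "=") continue
            if ($i == "ClusterId") id = $(i + 2)
            if ($i == "GPUsUsage") usage = $(i + 2)
        }
        if (id != "" && usage == "undefined") print id
    }'
}

# Possible use against a live pool (assumes GPU jobs set RequestGPUs):
#   condor_history -constraint 'RequestGPUs > 0' -limit 50 \
#       -af:l ClusterId GPUsProvisioned GPUsUsage GPUsMemoryUsage | missing_gpu_usage
```

Comparing the completion dates of the first jobs it reports with the dates of any upgrades or config pushes could show whether the loss coincides with a change on the nodes.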
Best
christoph
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/