Hello,
Going back to this topic: as Cristoph commented, the GpusAverageUsage value was undefined for many GPU jobs for unknown reasons, and not just for short jobs. I do not know whether the following explanation makes sense, but this is what we found...
We started to suspect that the 8 GPUs of this machine were not always responding when the condor_gpu_utilization script continually queries the usage in WaitForExit mode. We switched the condor_gpu_utilization script to run in Periodic mode with a timeout of 20 seconds, and it worked better for a while. Although some jobs still had an undefined GpusAverageUsage, most of them reported the usage correctly. After several days, we switched back to WaitForExit mode to check it again, and we observed that GpusAverageUsage was correct again and not undefined...
It seems that after a "systemctl reload condor" on the WN, the GpusAverageUsage value becomes undefined for most of the jobs, and a restart that recreates the CronJob is needed to obtain GpusAverageUsage values again. The same result is achieved by changing the monitor mode (for example, from WaitForExit to Periodic) and then using reload: the old CronJob GPUs_MONITOR is removed, a new one is created, and GpusAverageUsage is then no longer undefined. Does this make sense?
Right now, the machine is reporting GpusAverageUsage for most of the jobs, and our expression that puts on hold the jobs that have not used the GPU for the last 4 hours is working fine.
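
To illustrate why an undefined GpusAverageUsage matters for such a hold policy, here is a minimal Python sketch. It is not our actual ClassAd expression: the function name, the usage threshold, and the use of total runtime as a stand-in for "the last 4 hours" are all assumptions made for illustration. An undefined value is modelled as None and, as I understand a ClassAd policy expression would behave, it never triggers the hold.

```python
from typing import Optional

FOUR_HOURS = 4 * 60 * 60  # seconds; hypothetical window taken from the text


def should_hold(avg_gpu_usage: Optional[float],
                seconds_running: int,
                min_usage: float = 0.0,
                window: int = FOUR_HOURS) -> bool:
    """Hypothetical hold decision for a GPU job.

    avg_gpu_usage stands in for GpusAverageUsage; None models 'undefined'.
    Returns True only when the usage is defined, the job has run for at
    least `window` seconds, and the usage is at or below `min_usage`.
    """
    if avg_gpu_usage is None:
        # Undefined usage: we cannot tell whether the GPU is idle,
        # so the job is never held on this basis.
        return False
    return seconds_running >= window and avg_gpu_usage <= min_usage
```

With this logic, an idle job whose usage is undefined is never put on hold, which is why the missing values were a problem for us.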
Thank you very much and excuse me if I have not explained myself well.
Cheers,
Carles