Hi Steffen,

HTCondor sets the _CONDOR_JOB_AD and _CONDOR_MACHINE_AD environment variables, which point to files containing dumps of the job ClassAd and the machine ClassAd, respectively.

Regarding the GPU-related question: HTCondor sets the environment variables CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL and _CONDOR_AssignedGPUs for the job. Each contains a comma-separated list of device identifiers; these are the GPUs that were assigned to the job. At least with Nvidia GPUs, these are strings, not actual ordinals, e.g. "GPU-24a2cfec".

For some multi-GPU jobs and some frameworks that use CUDA under the hood, we observed that they weren't happy with CUDA_VISIBLE_DEVICES being set to the string IDs. When users report this to us, we provide them with a script that translates the string IDs into integer ones. The script is attached. If there are issues, we recommend sourcing this script at the start of the job.

Usage:

    $ echo ${CUDA_VISIBLE_DEVICES}
    GPU-9515f130,GPU-fc201e55,GPU-515d339b,GPU-77ac93e6
    $ source translate_gpu_ids.sh
    now: CUDA_VISIBLE_DEVICES=0,1,2,3
    $ echo ${CUDA_VISIBLE_DEVICES}
    0,1,2,3

Yes, you can alter these environment variables in the job, and that can be abused. But I'd really expect users not to change that variable: if everybody did, nobody would be able to do anything useful.

Hope this helped!
- Joachim

On Monday, 15 May 2023, 12:24:07 CEST, Steffen Grunewald wrote:
> Good morning/afternoon/...,
>
> we're facing a problem with GPU-bound jobs, and while investigating the best
> approach to use a multi-GPU machine (I couldn't find an equivalent to CPU
> sharing - as that is done by the kernel), I was wondering
>
> - Does a job running in its slot have a means to read its own Job ClassAd,
>   and the Machine ClassAd of the slot it's running in?
> - If the answer is yes, how to do it without Python bindings?
>
> (The background is: If the OSG gets access to some of our GPUs, how do we
> and how do the users make sure there are no collisions? If there's already
> a canonical way to assign and use GPUs known to, and used by, everyone -
> I'd like to join in... If there isn't, how to set up a standard?)
>
> Thanks,
> Steffen
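
For the "without Python bindings" part of the question, here is a minimal shell sketch of reading a single attribute from the files that _CONDOR_JOB_AD and _CONDOR_MACHINE_AD point to. It assumes the dump contains one "Name = value" pair per line; get_ad_attr is a hypothetical helper name, not an HTCondor tool, and the attribute names in the usage comments are only examples.

```shell
#!/bin/sh
# Sketch only: print the value of one attribute from a ClassAd dump file.
# Assumption: the file holds one "Name = value" pair per line.
# get_ad_attr is a hypothetical helper, not part of HTCondor.
get_ad_attr() {
    # $1 = path to the ad file, $2 = attribute name
    awk -v want="$2" '
        {
            pos = index($0, " = ")          # split at the first " = "
            if (pos == 0) next              # skip lines without a pair
            name = substr($0, 1, pos - 1)
            # ClassAd attribute names are case-insensitive.
            if (tolower(name) == tolower(want)) {
                print substr($0, pos + 3)   # raw value, quotes included
                exit
            }
        }' "$1"
}

# Inside a running job one might call, for example:
#   get_ad_attr "$_CONDOR_MACHINE_AD" AssignedGPUs
#   get_ad_attr "$_CONDOR_JOB_AD" RequestGPUs
```

String values are printed as they appear in the file, so they keep their surrounding double quotes; a missing attribute yields empty output.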
Attachment:
translate_gpu_ids.sh
Description: application/shellscript
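
A rough sketch of the kind of translation described in the reply, which is not the attached script itself: it maps each assigned identifier string to an integer index using a table of "index, uuid" lines read from stdin. translate_ids is a hypothetical helper name, and prefix matching is an assumption made because the identifiers shown in the reply look like shortened UUIDs.

```shell
#!/bin/sh
# Sketch only: map GPU identifier strings to integer indices.
# translate_ids is a hypothetical helper, not the attached script.
# stdin: lines of the form "index, uuid"
# $1:    comma-separated list of identifiers (possibly shortened prefixes)
translate_ids() {
    awk -v ids="$1" '
        BEGIN { FS = ", *" }
        { idx[NR] = $1; uuid[NR] = $2; n = NR }   # remember the table
        END {
            count = split(ids, want, ",")
            out = ""
            for (i = 1; i <= count; i++) {
                found = ""
                for (j = 1; j <= n; j++) {
                    # Assumption: an assigned id is a prefix of the full UUID.
                    if (index(uuid[j], want[i]) == 1) { found = idx[j]; break }
                }
                if (found == "") exit 1           # unknown identifier: fail
                out = out (i > 1 ? "," : "") found
            }
            print out
        }'
}

# In a job with nvidia-smi available, one might then run:
#   new=$(nvidia-smi --query-gpu=index,uuid --format=csv,noheader \
#           | translate_ids "$CUDA_VISIBLE_DEVICES") \
#     && export CUDA_VISIBLE_DEVICES="$new"
```

Whether the indices reported by nvidia-smi match the order in which CUDA enumerates devices depends on the setup (for example the CUDA_DEVICE_ORDER setting), which this sketch does not address.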