I understand that cuInit is not called by Condor. What I was trying to say is that I do not see the error if the same jobs are not run under Condor. If I change my job to a script that prints out the CUDA_VISIBLE_DEVICES environment variable and then sleeps for over a minute, then all the jobs print "0", but still only one job runs at a time. As to the comment by Michael Pelletier: we are also able to keep our P100 "nice and toasty" (loaded to 100%) by training about 15 machine learning jobs simultaneously on it, but it requires starting them manually, which is of course suboptimal. However, I have just run the first tests of configuring Condor as described here: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpusInSeriesSeven and it seems to be working: I am able to run multiple jobs using the same GPU.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of John M Knoeller

Condor never calls cuInit, so this message can't be coming from HTCondor: "failed call to cuInit: CUDA_ERROR_NO_DEVICE". If you change your job to a script that prints out the CUDA_VISIBLE_DEVICES environment variable and then sleeps for a while, do multiple jobs start? Do they all print "0"?

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Vaurynovich, Siarhei

John,

Thank you for your reply! The solution does not seem to work, unfortunately. This is what I did:

use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra
MACHINE_RESOURCE_GPUS = CUDA0, CUDA0, CUDA0, CUDA0, CUDA0
CUDA_VISIBLE_DEVICES = 0
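As a concrete form of the probe job tj suggests above, the following sketch (the script and the `probe` name are mine, not from the thread) prints the device each slot was handed and then lingers so that several jobs are running at once:

```python
# Hypothetical probe job: report which GPU HTCondor exported to this slot.
import os
import time

def probe(sleep_secs):
    """Print the assigned device, then sleep so concurrent jobs overlap."""
    # HTCondor derives CUDA_VISIBLE_DEVICES from the GPU resource it assigns.
    device = os.environ.get("CUDA_VISIBLE_DEVICES", "unset")
    print("CUDA_VISIBLE_DEVICES=%s" % device)
    time.sleep(sleep_secs)

# In the real job, linger long enough that several probes are alive at once:
# probe(90)
```

Submitted with, say, `queue 5`, every instance should print the same device number if the over-advertising configuration is being honored.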
In the submit file I tried to set:

request_GPUs = 0.199   # only one GPU process starts
request_GPUs = 1       # only one GPU process starts
SlotID>=0 && SlotID<6  # all processes start, but only one gets the GPU and the rest "failed call to cuInit: CUDA_ERROR_NO_DEVICE"

I am guessing that only one slot gets assigned a GPU, since if I set a range of SlotIDs which does not contain 0, then all jobs "fail to call cuInit". If I run several of my jobs interactively, they are all able to use the GPU simultaneously, so it is an HTCondor issue. I am using Condor 8.6.9-1. If you have any other ideas I could try, or if I did something wrong, please do let me know.

Thank you,
Siarhei.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of John M Knoeller

I think if you double-advertised CUDA devices, you *would* get multiple jobs running on the same GPU. If

MACHINE_RESOURCE_GPUS = CUDA0, CUDA0, CUDA1, CUDA1

then the startd could hand out resource CUDA0 twice, and would set CUDA_VISIBLE_DEVICES = 0 both times, because it sets that just by stripping off "CUDA" and keeping the number. If that is not working, then it's a bug and we should fix it.

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Michael Pelletier

From what I can tell, this isn't possible in a straightforward way. With CPU cores, they're fungible, so if you want to assign half a core to a job you can just set the machine's total CPU count to 2x what it actually is, and then have a job request one CPU, which means it will get half of one. However, because $CUDA_VISIBLE_DEVICES is used to inform the job which GPU to use, the GPUs are not fungible, so if you double-advertised the GPUs you wouldn't get CUDA0, CUDA0, CUDA1, CUDA1, but 0,1,2,3 instead.

Perhaps you could do something with a user job wrapper script to remap the visible devices on machines with double-advertised GPUs? Transform CUDA1 to CUDA0, and CUDA0,CUDA1 to CUDA0, etc.?

NVIDIA's CUDA 9.1 package introduces a new service that partitions GPUs in the driver, so I think we're starting to get to the point where we'll need to see GPUs as partitionable resources. I've been meaning to experiment with that feature to see how one would go about advertising it to the collector.

-Michael Pelletier

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Vaurynovich, Siarhei

Hello,

Could you please help me figure out how to configure HTCondor to run multiple processes using the same GPU? Is it possible at all? Each process is rather light, using <=20% of the GPU, but there are many of them, so I can certainly run more than one of them in parallel. I restricted my processes to use only 1/3 of the GPU memory and provided in my submit file:

request_GPUs = 0.333

But HTCondor still only runs one GPU-using process at a time. Of course, I could restrict the slot numbers and not tell HTCondor that I will be using the GPU, but I was wondering if there is a better solution.

Thank you for your help,
Siarhei.
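For reference, the device-assignment behavior John Knoeller describes earlier in the thread (the startd hands one MACHINE_RESOURCE_GPUS entry to each job and sets CUDA_VISIBLE_DEVICES by stripping the "CUDA" prefix) can be sketched as follows. This is illustrative Python, not HTCondor source, and `assign_devices` is a made-up name:

```python
# Illustrative sketch (not HTCondor source) of the assignment behavior
# described above: each job receives one advertised MACHINE_RESOURCE_GPUS
# entry, and CUDA_VISIBLE_DEVICES is that entry with "CUDA" stripped off.

def assign_devices(machine_resource_gpus, num_jobs):
    """Return the CUDA_VISIBLE_DEVICES value each of num_jobs jobs would see."""
    entries = [e.strip() for e in machine_resource_gpus.split(",")]
    if num_jobs > len(entries):
        raise ValueError("more jobs than advertised GPU entries")
    # Duplicate entries mean several jobs share the same physical GPU.
    return [e[len("CUDA"):] for e in entries[:num_jobs]]

# Advertising CUDA0 five times lets up to five jobs share physical GPU 0:
print(assign_devices("CUDA0, CUDA0, CUDA0, CUDA0, CUDA0", 3))
# → ['0', '0', '0']
```

This is why double-advertising works around the one-job-per-GPU limit: the startd tracks the advertised entries, not the physical devices, so duplicated entries map several slots onto one GPU.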