Tim Blattner and I worked on discovering and advertising GPUs using Hawkeye. For Linux, we have an automated script that detects NVIDIA GPUs and whether CUDA is available, along with its version. That is what is available on the Sourceforge site. Feel free to use and improve it. (Tim has since graduated and moved on to a PhD program, so development is sitting idle right now.)
Not much of our work has gone into policy decisions or into identifying when jobs have control of a GPU (or GPUs).
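For anyone who wants a starting point, here is a minimal sketch of the detect-and-advertise idea in Python. It is not the script from the Sourceforge project, and it leaves out CUDA version detection. It assumes `nvidia-smi` is on the PATH and supports the `--query-gpu` CSV output, and that the Hawkeye (startd cron) job publishes attributes by printing `Name = Value` lines to standard output. The attribute names (`HAS_GPU`, `GPU_COUNT`, `GPU0_NAME`, ...) are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory' CSV rows (one GPU per line) into dicts."""
    gpus = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a comma inside a device name survives.
        name, _, memory = line.rpartition(",")
        gpus.append({
            "name": name.strip(),
            # Accept both "896" and "896 MiB".
            "memory_mib": int(memory.strip().split()[0]),
        })
    return gpus


def to_classad_lines(gpus):
    """Format the GPU list as 'Name = Value' lines (hypothetical names)."""
    lines = [
        "HAS_GPU = %s" % ("True" if gpus else "False"),
        "GPU_COUNT = %d" % len(gpus),
    ]
    for i, gpu in enumerate(gpus):
        lines.append('GPU%d_NAME = "%s"' % (i, gpu["name"]))
        lines.append("GPU%d_MEMORY_MIB = %d" % (i, gpu["memory_mib"]))
    return lines


def query_gpus():
    """Ask nvidia-smi for GPUs; return an empty list if anything fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=30,
        )
        return parse_gpu_csv(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        # Treat a broken or unexpected nvidia-smi as "no usable GPU".
        return []


if __name__ == "__main__":
    print("\n".join(to_classad_lines(query_gpus())))
```

On a machine without `nvidia-smi` this prints `HAS_GPU = False` and `GPU_COUNT = 0`, so the same script can be deployed on every node. Jobs could then match on the advertised attributes in their requirements expressions, assuming the startd is configured to run the script and merge its output into the machine ad.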
A general solution will be tricky, given the current variety of libraries (e.g. CUDA, OpenCL, or OpenGL for older GPGPU codes) and different kinds of GPU processors (or in general other co-processors) available.
A few other comments. I had Tim investigate what happens to memory on the card when jobs complete or are preempted. It appears that GPU memory is not cleared, and since it's not controlled by the operating system the same way host memory is, it's possible to read leftover data on the card. That bothered us.
Also, it was possible to write a GPU kernel with a simple infinite loop that would prevent Condor from preempting the job, so we weren't entirely convinced GPUs were robust enough to handle an environment with ill-behaved jobs. Has anyone else run into this problem? (Current GPUs may be better than the older ones we used for testing.)
Craig
On Jan 7, 2010, at 8:59 AM, Michael O'Donnell wrote:

Our group has been considering this technology as well. I suggest taking a look at these URLs (if you have not seen them already):

http://sourceforge.net/projects/condorgpu/
www.cs.wisc.edu/condor/CondorWeek2009/condor.../blattner_integrating_gpus.ppt

The work that has been done so far has been on Linux. Our group has only just begun using Condor and we have not spent any significant time looking into this technology, but I think there could be great potential. We are using Condor to support a wide array of research that utilizes statistical packages, Geographic Information Science (GIS) applications, Java and others, and therefore we have not determined how restricted we would be using this technology.

A lot of our most demanding work is in GIS, and a majority of our applications work with proprietary software, so some of our initial concerns require investigating what types of applications and programming languages will work in this environment, as well as what is involved in getting Condor and GPUs to work in a Windows environment.

If you do make any progress, it would be great to hear back from you.

Mike
Dear Condor Folks,

is there someone in the Condor user community who has built a GPU cluster based on Condor? I mean someone who has worker node hardware with GPU graphics cards, with job management done by Condor on top.

We are very interested in this topic and would like to build such an infrastructure (Condor + GPU worker nodes) for the researchers in our organization. In the first phase of this project we'd like to develop a standalone cluster:

- master Condor head node
- 5 GPU worker nodes (each worker node 2x nVIDIA GTX295)
- storage element for data

I know there is a lot to see on Google about such experiments, but I wanted to ask Condor users directly for their opinions/suggestions/recommendations, since we are serious about building a Condor GPU cluster and using it in production for our research activities.

If there is someone who has done a similar setup and is willing to share the knowledge, I would appreciate talking about it! Any URL hints are welcome too...

Thanks and regards,
Marian
--
Craig A. Struble, Ph.D. | 369 Cudahy Hall | Marquette University
Associate Professor of Computer Science | (414)288-3783
Director, Master of Bioinformatics Program | (414)288-5472 (fax)