
[Condor-users] Job scheduling



Hi all,

I run a 20-node cluster (160 CPUs, 2 GB of RAM per CPU) and am having an issue with the way Condor distributes jobs across it.

A user is launching simulations that grow to over 6 GB in memory, which Condor reports as 15 GB (I assume this is memory plus swap). If three of these jobs land on the same node, at some point the node becomes completely unresponsive: Ganglia shows it as down and ssh hangs. A couple of hours later the condor_startd crashes and restarts, and the node becomes responsive again. I assume this is the memory being saturated.
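
Since the jobs only misbehave when stacked up, I wondered whether I could keep them apart from the submit side with a memory requirement, roughly like the sketch below (untested, and the executable name is made up). If I read the manual right, Memory in the machine ad is the per-slot figure in MB, so with ~2 GB slots I suspect this would never match anything:

    universe     = vanilla
    executable   = run_sim                # made-up name, just for illustration
    # only match machines advertising at least 6 GB; Memory is per slot,
    # so on this pool it probably never matches
    requirements = (Memory >= 6000)
    queue

So I'm guessing any fix has to live on the scheduling side instead.
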
While the jobs are running well outside the nodes' operating parameters (6 GB >> 2 GB per CPU), they still have to be run, and they run fine if there is only one per node. The problem is that all of the jobs are being packed onto a single node (compute-1-0 or compute-2-0). Is this an intended behaviour of Condor, or is there a way I can configure it to scatter the jobs across the cluster whenever possible?
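
From skimming the manual I came across NEGOTIATOR_POST_JOB_RANK, which I gather breaks ties between machines that a job ranks equally. Would something like this on the central manager fill the pool breadth-first (untested on my end, adapted from the manual's example, so apologies if I have misread it):

    # prefer unclaimed slots, then faster machines, then lower slot numbers
    NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (KFlops - SlotID)

or is there a cleaner knob for spreading jobs out?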
-Patrick