[Condor-users] Job scheduling
- Date: Wed, 6 Aug 2008 14:43:53 -0400
- From: Patrick Ford <patrick.ford88@xxxxxxxxx>
- Subject: [Condor-users] Job scheduling
Hi all,
I run a 20-node cluster (160 CPUs, 2 GB of RAM per CPU) and am having an
issue with the way Condor distributes jobs across the cluster.
A user is launching simulations that grow to over 6 GB in memory;
Condor reports them as 15 GB (I assume this is memory plus swap). If 3
jobs run on one node, at some point the node becomes completely
unresponsive: Ganglia shows it as down and ssh hangs. A couple of hours
later the condor_startd crashes and restarts, and the node becomes
responsive again. I assume this is due to memory being saturated.
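As a rough back-of-the-envelope check on the memory assumption, here is a
minimal sketch. It assumes the 160 CPUs are spread evenly across the 20
nodes (8 per node, so about 16 GB of RAM per node) and ignores OS overhead;
the names below are only illustrative.

```python
# Back-of-the-envelope memory check for one node.
# Assumption: CPUs (and therefore RAM) are spread evenly across nodes.
NODES = 20
TOTAL_CPUS = 160
RAM_PER_CPU_GB = 2
JOB_PEAK_GB = 6  # jobs grow to "over 6 GB", so this is a lower bound

cpus_per_node = TOTAL_CPUS // NODES
ram_per_node_gb = cpus_per_node * RAM_PER_CPU_GB


def fits_in_ram(jobs_on_node):
    """Return True if this many jobs at peak size fit in physical RAM."""
    return jobs_on_node * JOB_PEAK_GB <= ram_per_node_gb


for n in (1, 2, 3):
    verdict = "fits" if fits_in_ram(n) else "exceeds RAM"
    print(f"{n} job(s): {n * JOB_PEAK_GB} GB of {ram_per_node_gb} GB -> {verdict}")
```

Under those assumptions, three jobs at peak size need more than the physical
RAM on a node, which would be consistent with the node swapping heavily and
becoming unresponsive.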
I know the jobs are running outside operating parameters (6 GB >> 2 GB),
but they still have to be run, and they run fine if there is only one
per node. The problem is that all of the jobs are being packed onto a
single node (compute-1-0 or compute-2-0). Is this intended behavior in
Condor, or is there a way I can configure Condor to scatter the jobs
across the cluster whenever possible?
-Patrick