[Condor-users] Strange scheduling behavior in 6.8.0
- Date: Wed, 16 Aug 2006 10:40:45 -0700
- From: "Michael S. Root" <mike@xxxxxxxxxxxxxx>
- Subject: [Condor-users] Strange scheduling behavior in 6.8.0
Hi all. I'm having an intermittent problem since upgrading to 6.8.0
from 6.6.10 a few weeks ago. Here's the scenario:
We have a pool of about 40 machines running Linux (some FC4, some are
still RH8), all running 6.8.0.
I submit a DAG with about 25 jobs. There are no inter-job dependencies,
all the machines match the job criteria, and there are no other jobs
running in the pool.
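
For anyone who wants to try reproducing this, a DAG of the same shape can be generated with a few lines of Python. This is only a rough sketch: the job names and the submit file name "job.sub" are made up for illustration and are not the real files. It emits 25 JOB entries and no PARENT/CHILD lines, so nothing depends on anything else.

```python
def make_flat_dag(num_jobs, submit_file="job.sub"):
    """Return DAG file text describing num_jobs independent jobs."""
    lines = []
    for i in range(num_jobs):
        # DAGMan syntax: JOB <job name> <submit description file>
        lines.append(f"JOB job{i:02d} {submit_file}")
    # No PARENT ... CHILD lines are written, so there are no
    # inter-job dependencies and every job is eligible to run at once.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the DAG text; redirect it to a file to use it.
    print(make_flat_dag(25), end="")
```

The output can be saved to a file and handed to condor_submit_dag.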
Most of the time, all the jobs will be appropriately scheduled and run
simultaneously. However, sometimes, only about 10 of the jobs will get
started (the exact number varies). DAGMan has submitted them all into
the queue, but they aren't matched for some reason. As the first batch
of jobs finishes, more are submitted, but never more than the initial
count run at once.
When this behavior is occurring, if I run "condor_status", it properly
lists all the machines in our pool, including the idle ones that should
have been matched to jobs. If I run "condor_reschedule -all", it will
send the "Reschedule" command to only those 10 or so machines that are
actually running jobs. If I run "condor_restart -all", it will send the
"Restart" command to all machines in the pool, at which point everything
will return to normal--all the 'stuck' jobs get properly matched to
machines.
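
To spot the stuck state without eyeballing the listing, the machine states can be tallied with a small script. This is a sketch under an assumption: it expects text with one machine state per line, which I believe something like condor_status -format "%s\n" State produces, but check that against your version. The function names are hypothetical.

```python
from collections import Counter


def count_states(status_output):
    """Count machine states, given text with one state name per line."""
    states = [line.strip() for line in status_output.splitlines() if line.strip()]
    return Counter(states)


def runnable_but_idle(status_output, idle_jobs):
    """Estimate how many idle jobs could be running on Unclaimed machines.

    idle_jobs is the number of jobs sitting idle in the queue,
    supplied by the caller.
    """
    unclaimed = count_states(status_output)["Unclaimed"]
    # Each Unclaimed machine could take at most one of the idle jobs.
    return min(unclaimed, idle_jobs)
```

A nonzero result that persists across several negotiation cycles would match the behavior described above.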
Anyone else see something like this?
-Mike