[Condor-users] Strange scheduling behavior in 6.8.0
- Date: Wed, 16 Aug 2006 10:40:45 -0700
 
- From: "Michael S. Root" <mike@xxxxxxxxxxxxxx>
 
- Subject: [Condor-users] Strange scheduling behavior in 6.8.0
 
Hi all.  I'm having an intermittent problem since upgrading to 6.8.0 
from 6.6.10 a few weeks ago.  Here's the scenario:
We have a pool of about 40 machines running Linux (some FC4, some are 
still RH8), all running 6.8.0.
I submit a DAG with about 25 jobs.  There are no inter-job dependencies, 
all the machines match the job criteria, and there are no other jobs 
running in the pool.
Most of the time, all the jobs will be appropriately scheduled and run 
simultaneously.  However, sometimes, only about 10 of the jobs will get 
started (the exact number varies).  DAGMan has submitted them all into 
the queue, but they aren't matched for some reason.  As the first batch 
of jobs finishes, more are started, but never more than the initial 
count run at once.
When this behavior is occurring, if I run "condor_status", it properly 
lists all the machines in our pool, including the idle ones that should 
have been matched to jobs.  If I run "condor_reschedule -all", it will 
send the "Reschedule" command to only those 10 or so machines that are 
actually running jobs.  If I run "condor_restart -all", it will send the 
"Restart" command to all machines in the pool, at which point everything 
will return to normal--all the 'stuck' jobs get properly matched to 
machines.
Anyone else see something like this?
-Mike