Joel Hernandez wrote:
We have two clusters, louie and duey. Users submit their jobs on the
louie cluster. When all the nodes on louie are busy, the jobs flock
to the duey cluster. This works fine for three or four hours and
then stops altogether for several hours even though many runnable
jobs are still in the queue.
The jobs start flocking again after several hours or immediately
after a condor_restart is performed on louie. However, after several
hours all the jobs stop migrating again. Has anyone had this problem?

Very odd. When you say that you do a condor_restart on louie, what
daemons are running on the machine in question? Are you restarting
the schedd, or is it just the collector and negotiator?
In the schedd logs, you should see statements about the "flock
level". Can you please check what this is doing during the time when
flocking is not working?
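
One way to pull those statements out of the log is sketched below in Python. It assumes only that the relevant lines contain the phrase "flock level" in some capitalization; the exact message wording and the location of the schedd log vary by installation, so the path is passed in as an argument rather than hard-coded. The function and argument names are made up for this example.

```python
import sys


def flock_level_lines(lines):
    """Return the log lines that mention the flock level (case-insensitive)."""
    return [line.rstrip("\n") for line in lines if "flock level" in line.lower()]


def main(argv):
    # argv[1] is assumed to be the path to the schedd log on the submit machine.
    if len(argv) < 2:
        print("usage: flock_level.py <path-to-schedd-log>")
        return 1
    with open(argv[1], errors="replace") as log:
        for line in flock_level_lines(log):
            print(line)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Comparing the timestamps on the matching lines with the hours when flocking stops should show whether the flock level is dropping, or never rising, during those periods.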
Dan Bradley
University of Wisconsin, Condor Project