Re: [Condor-users] Flocking 'twixt Condor pools
- Date: Fri, 30 Mar 2007 16:05:25 +0100
- From: Mark Calleja <M.Calleja@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Flocking 'twixt Condor pools
Hi Ian,
Ian Cottam wrote:
> Can anyone help with debugging why flocking 'twixt two Condor pools
> isn't working please. (Condor 6.6.11 on all machines.)
> We have a successful pool - mibpool1 - and we want to create similar on
> student clusters around the University. I have started with a new test
> pool of a couple of PCs in another building; all is well with it as an
> independent pool. FLOCK_TO and FLOCK_FROM variables are set correctly on
> both pool masters.
FLOCK_FROM is a property of a central manager (or "pool master", as you
call it). However, FLOCK_TO is a property of a schedd, i.e. a submit
machine. Hence, different submit nodes within the same pool can be
configured to flock to different external pools, or the same ones in
different order (flocking is attempted in the order listed in the
FLOCK_TO field). Do your submit hosts have this set correctly?
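To make the split concrete, here is a minimal sketch of where each setting would live. The hostnames are made up for illustration and are not taken from your pools, so substitute your own:

```
# On each submit machine in the home pool (read by the schedd).
# Pools are tried in the order listed. Hostname is hypothetical.
FLOCK_TO = cm.mibpooltest.example.ac.uk

# On the central manager of the pool being flocked to.
# Lists the submit machines allowed to flock in. Hostname is hypothetical.
FLOCK_FROM = submit1.mibpool1.example.ac.uk
```

After changing either file, the daemons on that machine need to re-read their configuration (for example with condor_reconfig) before the new values take effect.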
> On my main pool we always have a 100 to 200 jobs (mainly Java) nearly
> always queued up ready to run (Idle status in their queues); they never
> flock over. I can do condor_status -pool <the other pool master> -java
> and it says they are free and unclaimed.
> I've checked with our network experts and there is no firewall or router
> settings causing problems.
> I have taken one of our PCs out of the main pool and put it in its own -
> mibpooltest - to see if I can flock to that, so far no luck.
What do you see in the SchedLog of the submit host? After the job fails
to be serviced by the local pool you should see something like:
<date> <time> (pid:<number>) Increasing flock level for <user>@<submit host> to 1.
Do you have anything like it? If not, what does the following return when
run on the submit host:
condor_config_val FLOCK_TO
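Both checks can be run together on the submit host. This is only a rough sketch: the function name show_flock_attempts is made up here, and it assumes the SCHEDD_LOG configuration macro points at the SchedLog, as it does in a default setup.

```shell
# show_flock_attempts FILE: print the most recent flock-level changes logged in FILE.
# (Hypothetical helper name; it just filters the log for the message quoted above.)
show_flock_attempts() {
    grep "Increasing flock level" "$1" | tail -n 5
}

# Typical use on the submit host, assuming SCHEDD_LOG expands to the SchedLog path:
#   show_flock_attempts "$(condor_config_val SCHEDD_LOG)"
#   condor_config_val FLOCK_TO
```

If the first command prints nothing while jobs sit idle, the schedd is probably not attempting to flock at all, which points back at FLOCK_TO on that machine.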
Cheers,
Mark