condor-users-bounces@xxxxxxxxxxx wrote on 06/16/2005 08:32:47 PM:
> On Thu June 16 2005 5:13 am, Alexandre Badez wrote:
> > Good Morning !
> Hello,
>
> > I'm running a little test cluster of 6 machines, with Red Hat 3.
> > They are named node1 to node6 (IP 10.2.4.11 to 10.2.4.16), and my
> > domain name is *.mop.ibm.com
> > I've set up the 6 machines with the RPM available on the download
> > pages (Condor 6.6.9).
> > My central manager is node1, all others are execution hosts.
> >
> > My problem seems to be node1, where no negotiator is running:
> >
> > [root@node1 root]# condor_master
> > [root@node1 root]# ps ax | grep condor
> >  5137 ?      S  0:00 condor_master
> >  5138 ?      S  0:00 condor_collector -f
> >  5139 ?      R  0:03 condor_startd -f
> >  5142 ?      S  0:00 condor_schedd -f
> >  5149 pts/0  S  0:00 grep condor
> > [root@node1 root]#
>
> I don't know much about how our RPMs configure Condor, but I can see
> that something is wrong here... Your central manager (node1) should be
> running both the collector and the negotiator. Look at the DAEMON_LIST
> setting in the condor_config (or condor_config.local), and make sure
> that both COLLECTOR and NEGOTIATOR are in the list.
COLLECTOR and NEGOTIATOR were both in the list.
>
> Also, if you don't want to be running jobs on this machine, remove
> STARTD from the list. Similarly, if you aren't going to be submitting
> jobs from this host, remove SCHEDD from the list.
Thanks for this information, but this setup is only for running some
tests, not for real use.
>
> > Moreover, there is a negotiator on each execution node:
> >
> > [root@node2 root]# condor_master
> > [root@node2 root]# ps ax | grep condor
> > 29704 ?      S  0:00 condor_master
> > 29705 ?      S  0:00 condor_collector -f
> > 29706 ?      S  0:00 condor_negotiator -f
> > 29707 ?      S  0:06 condor_startd -f
> > 29708 ?      S  0:00 condor_schedd -f
> > 29717 pts/0  R  0:00 grep condor
> > [root@node2 root]#
>
> Again, edit your condor_config on the execution node(s), and remove
> COLLECTOR and NEGOTIATOR from the DAEMON_LIST.
My mistake...
>
> As above, I'll note that you're running the schedd here, which allows
> you to submit jobs from this host. If this is not what you intended,
> then remove SCHEDD from the list.
>
> You'll need to restart Condor on the affected nodes for these changes
> to take effect. "condor_restart -master node1", or
> "/etc/init.d/condor restart" (or similar).
>
> > Is this normal? After re-reading the installation manual, it doesn't
> > seem so...
>
> Nope. See above. I don't know _why_ they're set as they are, but it's
> obviously wrong.
>
> > I can also send the config and config local files if you need them.
>
> Try the above first -- it'll probably solve the problems that you're
> seeing.
> If not, we can pursue it further.
>
> > Thanks for your help.
>
> Glad to help!
>
> -Nick
Thanks Nick, but node1 actually refuses to run the negotiator, and I
don't know why. The master's log file only says that the negotiator
failed to execute and that it will retry later.
By contrast, there is no problem on my other nodes, so I am now using
node2 as the central manager, and it seems to work well. But I still
wonder why I can't run the negotiator on node1. My nodes are almost
exactly the same (same hardware, same OS, same configuration); the only
difference is that node1 shares a folder over NFS with the other nodes.
Maybe a bug?
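
The by-hand "ps ax | grep condor" check used in this thread can be
sketched as a small Python script. This is only an illustration under
stated assumptions: the role names, the REQUIRED and FORBIDDEN tables,
and the functions running_daemons and check_role are hypothetical names,
not part of Condor. The expected daemon sets simply follow Nick's advice
(collector and negotiator on the central manager only); startd and
schedd are treated as optional on the central manager because this test
cluster runs them there.

```python
import re

# Hypothetical tables: which daemons each role must run, following the
# advice in this thread. Assumption: startd/schedd on the central manager
# are optional, so they appear in neither table for that role.
REQUIRED = {
    "central_manager": {"condor_master", "condor_collector", "condor_negotiator"},
    "execute": {"condor_master", "condor_startd"},
}
# Daemons that should not be running for a given role.
FORBIDDEN = {
    "central_manager": set(),
    "execute": {"condor_collector", "condor_negotiator"},
}


def running_daemons(ps_output):
    """Return the set of condor_* process names found in `ps ax` text."""
    # "grep condor" has no underscore suffix, so it is not matched.
    return set(re.findall(r"\bcondor_[a-z_]+\b", ps_output))


def check_role(ps_output, role):
    """Return (missing, unexpected) daemon sets for the given role."""
    found = running_daemons(ps_output)
    missing = REQUIRED[role] - found
    unexpected = FORBIDDEN[role] & found
    return missing, unexpected


if __name__ == "__main__":
    # Process list copied from the node1 output quoted above.
    node1_ps = """\
 5137 ?      S  0:00 condor_master
 5138 ?      S  0:00 condor_collector -f
 5139 ?      R  0:03 condor_startd -f
 5142 ?      S  0:00 condor_schedd -f
 5149 pts/0  S  0:00 grep condor
"""
    missing, unexpected = check_role(node1_ps, "central_manager")
    print("missing:", sorted(missing))
    print("unexpected:", sorted(unexpected))
```

Run against the node1 listing, the check reports the negotiator as
missing; run against the node2 listing with the role "execute", it flags
the collector and negotiator as unexpected. It says nothing about why
the negotiator fails to start on node1, which still has to be read from
the master's log.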