Re: [Condor-users] jobs are only running at condor_master machine
- Date: Mon, 29 Aug 2005 13:47:37 -0500
- From: Nick LeRoy <nleroy@xxxxxxxxxxx>
- Subject: Re: [Condor-users] jobs are only running at condor_master machine
On Monday 29 August 2005 1:27 pm, Narunjan Kumar wrote:
> Hello
Hello,
> i have setup a condor pool of two machines.
> 1st is condor master
> 2nd is slave node.
FYI, in Condor land we don't use the term "master" to refer to a machine; it
causes confusion with the condor_master. I believe that the terms you are
looking for are "central manager" (which you do use below), "execute machine"
and "submit machine".
> when i submit the jobs through condor master it runs but at
> condor master machine.
> jobs do not go any other machine even the machine are idle.
In other words, jobs submitted from the Central Manager (which is apparently
also functioning as a submit host, and, possibly, execute host) do run. Is
that correct?
> when i submit the jobs with the 2nd machine they remains idle in the
> Queue and never runs even on the same machine .
Jobs submitted from the other submit host do not run. Is that correct?
> in either case i have found same error message in
> ---------- Started Negotiation Cycle ----------
> 8/29 20:16:45 Phase 1: Obtaining ads from collector ...
> 8/29 20:16:45 Getting all public ads ...
> 8/29 20:16:45 Sorting 7 ads ...
> 8/29 20:16:45 Getting startd private ads ...
> 8/29 20:16:45 Got ads: 7 public and 2 private
> 8/29 20:16:45 Public ads include 1 submitter, 2 startd
> 8/29 20:16:45 Phase 2: Performing accounting ...
> 8/29 20:16:45 Phase 3: Sorting submitter ads by priority ...
> 8/29 20:16:45 Phase 4.1: Negotiating with schedds ...
> 8/29 20:16:45 Negotiating with condor@xxxxxxxxxxxxxxxxxxxxxxx at
> <**.26.146.226:1173>
> 8/29 20:17:15 select returns 0, connect failed
> 8/29 20:17:15 Will keep trying for 30 seconds...
> 8/29 20:17:16 Connect failed for 30 seconds; returning FALSE
> 8/29 20:17:16 Failed to connect to <**.26.146.226:1173>
> 8/29 20:17:16 Error: Ignoring schedd for this cycle
> 8/29 20:17:16 ---------- Finished Negotiation Cycle ----------
It would be very useful to see what's in the SchedLog on **.26.146.226 (which
I assume to be the second host).
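
The negotiator log quoted above reports a failed connection to the schedd's address. As a rough way to reproduce that check by hand, the sketch below pulls the `<host:port>` pairs out of "Failed to connect to" lines and tries a plain TCP connection to each. The function names (`failed_schedd_addresses`, `can_connect`) are made up for this illustration and are not part of Condor. A successful TCP connection only shows that the port is reachable; it says nothing about whether Condor's own authorization would accept the negotiator.

```python
import re
import socket

# Matches the "<host:port>" address printed after "Failed to connect to"
# in the negotiator log excerpt. The host part is kept as an opaque string.
FAILED_RE = re.compile(r"Failed to connect to <([^:>]+):(\d+)>")


def failed_schedd_addresses(log_lines):
    """Return the unique (host, port) pairs the negotiator could not reach."""
    seen = []
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match:
            addr = (match.group(1), int(match.group(2)))
            if addr not in seen:
                seen.append(addr)
    return seen


def can_connect(host, port, timeout=5.0):
    """Try a plain TCP connection; True if it opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Run from the central manager, one would feed the log lines to `failed_schedd_addresses` and call `can_connect(host, port)` for each result. The address in the archived excerpt is masked, so the real IP has to be substituted first. If the connection also fails here, a firewall or a wrong network interface on the second host is a plausible suspect, though that is a guess until the SchedLog has been checked.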
> what is the problem here
> why the central manger is unable to connect with other machine nodes
> in the pool.
> if I see the condor_status then it shows both computer in the list
Do jobs *run* on the second host? When you submit 2 jobs from the CM and run
'condor_status', do both machines switch to "claimed/busy"? Is there anything in
any of the logs on the second host about "permission denied" or similar? If
so, you should review "3.7 Security In Condor" of the Condor manual.
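
Searching the logs for "permission denied" can be done with a few lines of Python. This is only a sketch: the helper name `find_denials` is invented here, and the match is case-insensitive because the exact wording and capitalization of the message in a given Condor version is an assumption.

```python
import re

# Case-insensitive, since the exact capitalization in the logs is assumed.
DENIED_RE = re.compile(r"permission denied", re.IGNORECASE)


def find_denials(log_lines):
    """Return (line_number, text) for every line mentioning a denial."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if DENIED_RE.search(line)
    ]
```

To use it, open each log file on the second host (for example the SchedLog and StartLog; `condor_config_val LOG` should print the directory they live in) and pass the file object to `find_denials`.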
I think that we'll need answers to some of these questions before we can
proceed much further...
Hope this helps,
-Nick
--
<<< Why, oh, why, didn't I take the blue pill? >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy@xxxxxxxxxxx The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences