Re: [Condor-users] Bit of a problem with HAD
- Date: Tue, 10 Jan 2006 15:05:21 -0800
- From: "Finch, Ralph" <rfinch@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Bit of a problem with HAD
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Nick LeRoy
> Sent: Tuesday, January 10, 2006 1:06 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Bit of a problem with HAD
>
> On Tue January 10 2006 2:37 pm, Finch, Ralph wrote:
> > condor -version
> > $CondorVersion: 6.7.13 Nov 7 2005 $
> > $CondorPlatform: INTEL-WINNT50 $
> >
> > My desktop machine and another machine are the HAD machines, and also
> > serve as condor executors.
>
> By "are the HAD machines", I assume that you mean "are the two machines that
> the negotiator can run on" (and, thus, are setup with condor_had). Is that
> correct?
Yes. Sorry for my poor terminology.
> > When I installed this a few weeks ago things were working OK, though I
> > don't think I tested dagman then. Now I have these symptoms: when I
> > submit a dagman job, the jobs wait in the queue several minutes. Then
> > on my machine (MERRIT) a condor_exec.exe starts and runs full CPU speed,
> > but no other jobs start to run.
BTW, I'm fairly sure that condor_exec.exe is the desired job,
but I don't recall seeing that .exe name before; I was expecting
hydro.exe.
It's also puzzling why it would run on my machine, 'cause
mine is one of the slowest of the pool.
> Is condor_had running on both machines?
Yes, I just checked to be sure.
> Is condor_negotiator running on (exactly) one of the machines? Which one?
> Is one of the machines setup as the primary (HAD_USE_PRIMARY)? Which one?
Yes; delta-mod. Yes; delta-mod. I checked these again to be sure.
> On your own, you can look in the HadLogs to see which machine thinks it's the
> leader, then look in the MasterLog to verify that it tried to start the
> Negotiator properly, and the NegotiatorLog to verify that it actually started
> properly.
There's something odd in the negotiator log...
delta-mod HADLog:
1/10 14:21:07 ******************************************************
1/10 14:21:07 Using config file: Z:\Condor\condor_config
1/10 14:21:07 Using local config files: Z:/Condor/condor_config.local
1/10 14:21:07 DaemonCore: Command Socket at <136.200.32.102:9450>
1/10 14:21:07 Starting HAD ....
1/10 14:21:07 ** Register on stateMachineTimerID , interval = 21
1/10 14:21:07 ** HAD_ID 1
1/10 14:21:07 ** HAD_CYCLE_INTERVAL 42
1/10 14:21:07 ** HAD_CONNECTION_TIMEOUT 5
1/10 14:21:07 ** HAD_USE_PRIMARY(true/false) 1
1/10 14:21:07 ** AM I PRIMARY ?(true/false) 1
1/10 14:21:07 ** HAD_LIST(others only)
1/10 14:21:07 ** <136.200.32.182:9450>
1/10 14:21:07 ** HAD_STAND_ALONE_DEBUG(true/false) 0
1/10 14:21:53 DaemonCore: Command received via TCP from host
<136.200.32.182:4623>
1/10 14:21:53 DaemonCore: received command 701 (SEND ID command),
calling handler (commandHandler)
1/10 14:22:14 DaemonCore: Command received via TCP from host
<136.200.32.182:4636>
1/10 14:22:14 DaemonCore: received command 701 (SEND ID command),
calling handler (commandHandler)
merrit HADLog:
1/10 14:44:43 DaemonCore: Command received via TCP from host
<136.200.32.102:4029>
1/10 14:44:43 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:45:04 DaemonCore: Command received via TCP from host
<136.200.32.102:4044>
1/10 14:45:04 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:45:25 DaemonCore: Command received via TCP from host
<136.200.32.102:4059>
1/10 14:45:25 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:45:46 DaemonCore: Command received via TCP from host
<136.200.32.102:4076>
1/10 14:45:46 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:46:07 DaemonCore: Command received via TCP from host
<136.200.32.102:4091>
1/10 14:46:07 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:46:28 DaemonCore: Command received via TCP from host
<136.200.32.102:4106>
1/10 14:46:28 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:46:49 DaemonCore: Command received via TCP from host
<136.200.32.102:4123>
1/10 14:46:49 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
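A note on the timing in the two HADLog excerpts: delta-mod reports a state
machine timer interval of 21 (with HAD_CYCLE_INTERVAL 42), and the ALIVE
commands that merrit receives from delta-mod appear to arrive at that same
spacing. The short Python sketch below only does the arithmetic on the
timestamps copied from the merrit excerpt; the helper name gaps_in_seconds is
made up for illustration, and reading the spacing as "one ALIVE per timer tick"
is an inference from these logs, not something the logs state.

```python
from datetime import datetime

# Timestamps of the ALIVE commands, copied from the merrit HADLog excerpt.
alive_times = [
    "14:44:43", "14:45:04", "14:45:25", "14:45:46",
    "14:46:07", "14:46:28", "14:46:49",
]

def gaps_in_seconds(stamps):
    """Return the spacing in seconds between consecutive HH:MM:SS stamps."""
    parsed = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]

gaps = gaps_in_seconds(alive_times)
print(gaps)
```

If every gap equals the timer interval, the primary's heartbeats are at least
reaching the other HAD on schedule, which would point away from the HAD link
itself as the source of the trouble.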
delta-mod MasterLog:
1/10 14:21:07 WinFirewall: get_CurrentProfile failed: 0x800706d9
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_collector.exe", pid and pgroup = 2788
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_startd.exe", pid and pgroup = 4072
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_schedd.exe", pid and pgroup = 2916
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 3628
1/10 14:21:07 Started DaemonCore process "Z:/Condor/bin/condor_had.exe",
pid and pgroup = 2288
1/10 14:21:07 DaemonCore: Command received via TCP from host
<136.200.32.102:2907>
1/10 14:21:07 DaemonCore: received command 468 (DAEMON_OFF_FAST),
calling handler (admin_command_handler)
1/10 14:21:07 Handling daemon-specific command for "negotiator"
1/10 14:21:08 Sent signal 3 to NEGOTIATOR (pid 3628)
1/10 14:21:11 DaemonCore: Command received via UDP from host
<136.200.32.102:2936>
1/10 14:21:11 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
1/10 14:21:11 The NEGOTIATOR (pid 3628) exited with status 0
1/10 14:22:31 DaemonCore: Command received via TCP from host
<136.200.32.102:3009>
1/10 14:22:31 DaemonCore: received command 469 (DAEMON_ON), calling
handler (admin_command_handler)
1/10 14:22:31 Handling daemon-specific command for "negotiator"
1/10 14:22:31 Started DaemonCore process
"Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 3560
merrit MasterLog:
1/10 14:21:40 WinFirewall: get_CurrentProfile failed: 0x800706d9
1/10 14:21:40 Started DaemonCore process
"Z:/Condor/bin/condor_collector.exe", pid and pgroup = 524
1/10 14:21:40 Started DaemonCore process
"Z:/Condor/bin/condor_startd.exe", pid and pgroup = 3352
1/10 14:21:40 Started DaemonCore process
"Z:/Condor/bin/condor_schedd.exe", pid and pgroup = 3876
1/10 14:21:40 Started DaemonCore process
"Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 2360
1/10 14:21:40 Started DaemonCore process "Z:/Condor/bin/condor_had.exe",
pid and pgroup = 2456
1/10 14:21:40 DaemonCore: Command received via TCP from host
<136.200.32.182:4559>
1/10 14:21:40 DaemonCore: received command 468 (DAEMON_OFF_FAST),
calling handler (admin_command_handler)
1/10 14:21:40 Handling daemon-specific command for "negotiator"
1/10 14:21:40 Sent signal 3 to NEGOTIATOR (pid 2360)
1/10 14:21:40 DaemonCore: Command received via UDP from host
<136.200.32.182:4569>
1/10 14:21:40 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
1/10 14:21:40 The NEGOTIATOR (pid 2360) exited with status 0
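Reading the two MasterLog excerpts side by side: on delta-mod the negotiator
is shut off at startup and then started again at 14:22:31 after a DAEMON_ON
command, while the merrit excerpt shows the negotiator exiting and no later
restart. To pull just those events out of a MasterLog, something like the
following Python sketch could be used. It is only an illustration: the function
names are made up, the patterns are modeled on the lines pasted above, and it
first re-joins entries that the mail client wrapped onto a second line.

```python
import re

# An entry starts with a "month/day hour:minute:second" stamp, as in the logs above.
TIMESTAMP = re.compile(r"^\d+/\d+ \d+:\d+:\d+\b")
STARTED = re.compile(
    r'Started DaemonCore process "[^"]*condor_negotiator\.exe", '
    r"pid and pgroup = (\d+)"
)
EXITED = re.compile(r"The NEGOTIATOR \(pid (\d+)\) exited with status \d+")

def join_wrapped(lines):
    """Re-join log entries that were wrapped onto continuation lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries

def negotiator_events(lines):
    """Return (timestamp, event, pid) tuples for negotiator starts and exits."""
    events = []
    for entry in join_wrapped(lines):
        stamp = " ".join(entry.split()[:2])
        started = STARTED.search(entry)
        exited = EXITED.search(entry)
        if started:
            events.append((stamp, "started", int(started.group(1))))
        elif exited:
            events.append((stamp, "exited", int(exited.group(1))))
    return events

# A few lines copied from the delta-mod MasterLog excerpt.
sample = """\
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 3628
1/10 14:21:11 The NEGOTIATOR (pid 3628) exited with status 0
1/10 14:22:31 Started DaemonCore process
"Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 3560
""".splitlines()

events = negotiator_events(sample)
for event in events:
    print(event)
```

Run against both MasterLogs, the machine whose last negotiator event is
"started" should be the one where the negotiator is currently running.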
delta-mod NegotiatorLog (hmmmm, something awry):
1/10 14:32:32 Phase 1: Obtaining ads from collector ...
1/10 14:32:32 Getting all public ads ...
1/10 14:32:32 Sorting 58 ads ...
1/10 14:32:32 Getting startd private ads ...
1/10 14:32:32 Got ads: 58 public and 28 private
1/10 14:32:32 Public ads include 1 submitter, 28 startd
1/10 14:32:32 Phase 2: Performing accounting ...
1/10 14:32:32 Phase 3: Sorting submitter ads by priority ...
1/10 14:32:32 Phase 4.1: Negotiating with schedds ...
1/10 14:32:32 Negotiating with rfinch@xxxxxxxxxxxx at
<136.200.32.182:4553>
1/10 14:32:32 0 seconds so far
1/10 14:32:32 condor_read(): recv() returned -1, errno = 10054, assuming
failure.
1/10 14:32:32 IO: Failed to read packet header
1/10 14:32:32 Failed to get reply from schedd
1/10 14:32:32 Error: Ignoring schedd for this cycle
1/10 14:32:32 ---------- Finished Negotiation Cycle ----------
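About the errno in that excerpt: on Windows, 10054 is the Winsock code
WSAECONNRESET (connection reset by peer). The address being negotiated with,
136.200.32.182, is the one the merrit logs report as their own, so it looks as
if the connection from the negotiator on delta-mod to the schedd on merrit was
reset from the merrit side. The Python sketch below just extracts the code from
such a line and looks it up in a small hand-written table; the table holds only
three well-known Winsock codes, and the function name is made up for
illustration.

```python
import re

# A few standard Winsock error codes; this table is deliberately incomplete.
WINSOCK_ERRORS = {
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

def describe_recv_failure(entry):
    """Return (code, description) for a log entry containing 'errno = N'."""
    match = re.search(r"errno = (\d+)", entry)
    if match is None:
        return None
    code = int(match.group(1))
    return code, WINSOCK_ERRORS.get(code, "not in this table")

# The condor_read() entry from the delta-mod NegotiatorLog, re-joined.
line = ("1/10 14:32:32 condor_read(): recv() returned -1, "
        "errno = 10054, assuming failure.")
print(describe_recv_failure(line))
```

A reset like this says the remote end closed the connection; the SchedLog on
the schedd's machine for the same timestamp may show why.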
merrit NegotiatorLog:
1/10 14:21:40 ******************************************************
1/10 14:21:40 Using config file: z:\Condor\condor_config
1/10 14:21:40 Using local config files: Z:/Condor/condor_config.local
1/10 14:21:40 DaemonCore: Command Socket at <136.200.32.182:4554>
1/10 14:21:40 ACCOUNTANT_HOST = None (local)
1/10 14:21:40 NEGOTIATOR_INTERVAL = 300 sec
1/10 14:21:40 NEGOTIATOR_TIMEOUT = 30 sec
1/10 14:21:40 MAX_TIME_PER_SUBMITTER = 31536000 sec
1/10 14:21:40 MAX_TIME_PER_PIESPIN = 31536000 sec
1/10 14:21:40 PREEMPTION_REQUIREMENTS = FALSE
1/10 14:21:40 PREEMPTION_RANK = None
1/10 14:21:40 NEGOTIATOR_PRE_JOB_RANK = None
1/10 14:21:40 NEGOTIATOR_POST_JOB_RANK = None
1/10 14:21:40 ---------- Started Negotiation Cycle ----------
1/10 14:21:40 Phase 1: Obtaining ads from collector ...
1/10 14:21:40 Getting all public ads ...
1/10 14:21:40 Sorting 0 ads ...
1/10 14:21:40 Getting startd private ads ...
1/10 14:21:40 Got ads: 0 public and 0 private
1/10 14:21:40 Public ads include 0 submitter, 0 startd
1/10 14:21:40 Phase 2: Performing accounting ...
1/10 14:21:40 Phase 3: Sorting submitter ads by priority ...
1/10 14:21:40 Phase 4.1: Negotiating with schedds ...
1/10 14:21:40 ---------- Finished Negotiation Cycle ----------
1/10 14:21:40 Got SIGQUIT. Performing fast shutdown.
1/10 14:21:40 **** condor_negotiator.exe (condor_NEGOTIATOR) EXITING
WITH STATUS 0
> Finally, I'd like to note that the 6.7.14 master and HAD can better handle
> cases in which the HAD tells the master "start the negotiator", but the
> master is unable to do so for whatever reason. If you are upgrading to
> 6.7.14, however, make sure that you upgrade both the master and the HAD
> together; *bad* things will happen if you don't...
OK; I can do the upgrade if you think it a good idea. Thanks much.
Ralph Finch, P.E.
Dept. of Water Resources
Bay-Delta Office, Room 215-13
Sacramento, CA 95814
916-653-7552
rfinch@xxxxxxxxxxxx