[Condor-users] Did something change in the 7.2.2 Windows release?- I can't see startds any more
- Date: Thu, 16 Apr 2009 11:22:21 -0400
- From: Ian Chesal <ICHESAL@xxxxxxxxxx>
- Subject: [Condor-users] Did something change in the 7.2.2 Windows release?- I can't see startds any more
I just upgraded my binaries from the 7.2.2 pre-release binaries to the
7.2.2 officially released binaries and my two Windows machines in my
test pool are no longer working correctly. They're failing to send the
startd ads to my collector. The daemons start up fine and I see the
correct processes:
D:\arc\condor\log>ps -ef |grep condor_
SYSTEM 852 696 0 07:59:47 con 0:00 d:\arc\condor\bin\condor_master.exe
SYSTEM 1256 852 0 07:59:48 con 0:00 condor_procd.exe -A //./pipe/procd_pipe -L d:/arc/condor/log/ProcdLog -K d:/arc/condor/bin\condor_softkill.exe
SYSTEM 204 852 0 07:59:48 con 0:08 condor_startd.exe -f
And the StartLog is sane:
4/16 07:59:48 ******************************************************
4/16 07:59:48 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
4/16 07:59:48 ** d:\arc\condor\bin\condor_startd.exe
4/16 07:59:48 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
4/16 07:59:48 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
4/16 07:59:48 ** $CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $
4/16 07:59:48 ** $CondorPlatform: INTEL-WINNT50 $
4/16 07:59:48 ** PID = 204
4/16 07:59:48 ** Log last touched time unavailable (No such file or directory)
4/16 07:59:48 ******************************************************
4/16 07:59:48 Using config source: \\sv129\arc\condor\condor_config
4/16 07:59:48 Using local config sources:
4/16 07:59:48 \\sv129\arc\condor/condor_config.basic
4/16 07:59:48 \\sv129\arc\condor/os/condor_config.WINNT51
4/16 07:59:48 \\sv129\arc\condor/site/condor_config.SJDEV
4/16 07:59:48 \\sv129\arc\condor/machine/condor_config.sj-bs3400-272
4/16 07:59:48 \\sv129\arc\condor/machine/condor_config.sj-bs3400-272.WINNT51
4/16 07:59:48 \\sv129\arc\condor/patch/condor_config.sj-bs3400-272
4/16 07:59:48 \\sv129\arc\condor/patch/condor_config.sj-bs3400-272.WINNT51
4/16 07:59:48 \\sv129\arc\condor/cycleserver/sj-bs3400-272.config
4/16 07:59:48 DaemonCore: Command Socket at <137.57.203.81:4807>
4/16 07:59:48 slot1: New machine resource of type 2 allocated
4/16 07:59:48 slot2: New machine resource of type 3 allocated
4/16 07:59:53 About to run initial benchmarks.
4/16 08:00:01 Completed initial benchmarks.
4/16 08:00:01 Cron: Initializing job 'update' (d:/arc/scripts/hooks/update_hooks_and_modules.bat)
4/16 08:00:01 Executable is a batch script, so executing C:\WINDOWS\system32\cmd.exe /Q /C "d:/arc/scripts/hooks/update_hooks_and_modules.bat" update
4/16 08:00:01 slot2: State change: IS_OWNER is false
4/16 08:00:01 slot2: Changing state: Owner -> Unclaimed
4/16 08:00:01 Executable is a batch script, so executing C:\WINDOWS\system32\cmd.exe /Q /C "d:/arc/scripts/hooks/arc_job_fetch.bat"
4/16 08:00:01 slot1: State change: IS_OWNER is false
4/16 08:00:01 slot1: Changing state: Owner -> Unclaimed
4/16 08:00:01 Executable is a batch script, so executing C:\WINDOWS\system32\cmd.exe /Q /C "d:/arc/scripts/hooks/arc_job_fetch.bat"
4/16 08:00:02 Calling pipe Handler <Guarantee all data written to pipe> for Pipe end=65539 <DC stdin pipe>
4/16 08:00:02 Return from pipe Handler
4/16 08:00:02 Calling pipe Handler <Guarantee all data written to pipe> for Pipe end=65541 <DC stdin pipe>
4/16 08:00:02 Return from pipe Handler
4/16 08:00:02 Received UDP command 60011 (DC_NOP) from <137.57.203.81:4810>, access level READ
4/16 08:00:02 Calling HandleReq <handle_nop()> (0)
4/16 08:00:02 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.031s)
I can talk to the collector *from* the machine:
D:\arc\condor\log>condor_status
Name               OpSys  Arch   State     Activity LoadAv Mem   ActvtyTime
sqal08.altera.com  LINUX  INTEL  Unclaimed Idle     0.020  2026  0+03:42:04
sv129.altera.com   LINUX  INTEL  Owner     Idle     0.290  2024  1+17:34:16
slot1@sj-bs3400-31 LINUX  X86_64 Unclaimed Idle     0.230  1224  0+03:11:05
slot2@sj-bs3400-31 LINUX  X86_64 Unclaimed Idle     0.000  1224  0+19:11:52
slot3@sj-bs3400-31 LINUX  X86_64 Unclaimed Idle     0.000   750  0+19:11:53
slot4@sj-bs3400-31 LINUX  X86_64 Unclaimed Idle     0.000   750  0+19:11:54
slot1@sj-bs3400-31 LINUX  X86_64 Unclaimed Idle     0.370  1224  0+03:34:07
slot2@sj-bs3400-31 LINUX  X86_64 Unclaimed Idle     0.000  1224  0+19:34:54
slot3@sj-bs3400-31 LINUX  X86_64 Unclaimed Idle     0.000   750  0+19:34:55
slot4@sj-bs3400-31 LINUX  X86_64 Unclaimed Idle     0.000   750  0+19:34:56
slot1@sqal64-36-te LINUX  X86_64 Unclaimed Idle     0.160  1264  0+03:32:08
slot2@sqal64-36-te LINUX  X86_64 Unclaimed Idle     0.000   742  0+19:32:36
slot1@sqal64-37-te LINUX  X86_64 Unclaimed Idle     0.520  1224  0+03:25:10
slot2@sqal64-37-te LINUX  X86_64 Unclaimed Idle     0.000  1224  0+19:25:52
slot3@sqal64-37-te LINUX  X86_64 Unclaimed Idle     0.000   750  0+19:25:52
slot4@sqal64-37-te LINUX  X86_64 Unclaimed Idle     0.000   750  0+19:25:53
             Total Owner Claimed Unclaimed Matched Preempting Backfill
 INTEL/LINUX     2     1       0         1       0          0        0
X86_64/LINUX    14     0       0        14       0          0        0
       Total    16     1       0        15       0          0        0
But you see no startd entries from my Windows machines in the
collector's view of the world:
D:\arc\condor\log>condor_status -any
MyType TargetType Name
DaemonMaster None sj-bs3400-272.altera.com
DaemonMaster None sj-bs3400-279.altera.com
DaemonMaster None sj-bs3400-311.altera.com
Machine Job sqal08.altera.com
Machine Job sv129.altera.com
Machine Job slot1@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot2@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot3@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot4@xxxxxxxxxxxxxxxxxxxxxxxx
DaemonMaster None sj-bs3400-312.altera.com
Machine Job slot1@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot2@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot3@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot4@xxxxxxxxxxxxxxxxxxxxxxxx
DaemonMaster None sqal08.altera.com
Machine Job slot1@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot2@xxxxxxxxxxxxxxxxxxxxxxxx
DaemonMaster None sqal64-36-test.altera.com
Machine Job slot1@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot2@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot3@xxxxxxxxxxxxxxxxxxxxxxxx
Machine Job slot4@xxxxxxxxxxxxxxxxxxxxxxxx
DaemonMaster None sqal64-37-test.altera.com
Negotiator None sv129.altera.com
DaemonMaster None sv129.altera.com
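The same comparison can be done mechanically. The following is only a rough sketch, not anything shipped with Condor: it assumes the three-column "MyType TargetType Name" layout shown above, and the function name hosts_missing_startd is made up for illustration. It lists hosts that advertise a DaemonMaster ad but no Machine ad. Note that the archive obscures the slot names in the listing above, so it would need the unredacted output to be useful.

```python
from collections import defaultdict


def hosts_missing_startd(status_any_output):
    """Return hosts that have a DaemonMaster ad but no Machine (startd) ad.

    Assumes each ad line has exactly three whitespace-separated fields:
    MyType, TargetType, Name (as printed by `condor_status -any` above).
    """
    ad_types_by_host = defaultdict(set)
    for line in status_any_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 3-column ad line.
        if len(fields) != 3 or fields[0] == "MyType":
            continue
        my_type, _target_type, name = fields
        # Slot ads are named "slotN@host"; other ads are just "host".
        host = name.split("@", 1)[-1].lower()
        ad_types_by_host[host].add(my_type)
    return sorted(
        host
        for host, types in ad_types_by_host.items()
        if "DaemonMaster" in types and "Machine" not in types
    )
```

Fed the unredacted listing, a host whose master reports in but whose startd ads never arrive would show up in the returned list.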
And -direct returns nothing, the command times out:
D:\arc\condor\log>condor_status -direct localhost -debug
4/16 08:11:31 condor_read(): timeout reading 5 bytes from <137.57.203.81:4807>.
4/16 08:11:31 IO: Failed to read packet header
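One way to narrow a timeout like this down is to check whether the startd's command port accepts TCP connections at all. The sketch below is an illustration only: the helper names parse_command_socket and port_accepts_tcp are hypothetical, and it assumes the address appears in the "<ip:port>" form seen in the StartLog line "DaemonCore: Command Socket at <137.57.203.81:4807>". A successful connect only shows that something is listening on the port; it says nothing about whether the daemon is answering Condor commands.

```python
import re
import socket

# Matches the "<a.b.c.d:port>" address form seen in the log lines above.
ADDRESS_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def parse_command_socket(log_line):
    """Extract (ip, port) from a log line, or return None if absent."""
    match = ADDRESS_RE.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def port_accepts_tcp(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, passing the "Command Socket" line from the StartLog to parse_command_socket and the result to port_accepts_tcp would distinguish "nothing is listening" from "listening but not answering", which is what the read timeout above hints at.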
If I swap the 7.2.2 pre-release binaries I had back in, in place of the
official release binaries I just downloaded (the .zip bundle, BTW),
everything functions perfectly:
D:\tmp>condor_status -direct localhost
Name               OpSys   Arch  State     Activity LoadAv Mem   ActvtyTime
slot1@sj-bs3400-27 WINNT51 INTEL Unclaimed Idle     0.470  2257  0+00:01:54
slot2@sj-bs3400-27 WINNT51 INTEL Unclaimed Idle     0.000  1325  0+00:01:54
              Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/WINNT51     2     0       0         2       0          0        0
        Total     2     0       0         2       0          0        0
The other odd thing I noticed is that running 'net stop condor' fails to
shut Condor down on the machine. I have to kill the condor_* processes
manually.
The pre-release binaries I was testing were:
D:\tmp>condor_version
$CondorVersion: 7.2.2 Mar 20 2009 BuildID: none PRE-RELEASE-UWCS $
$CondorPlatform: INTEL-WINNT50 $
So something after March 20th? I've reverted to the pre-release binaries
on my Windows machines for now.
- Ian