On 8/25/2011 3:53 PM, Seth Bardash wrote:
First, thanks for spending the time to read this.
I upgraded to Condor 7.6.2 from
condor-7.6.2-x86_rhap_5-unstripped.tar.gz on a dual Xeon 32-bit
machine running the i686 version of CentOS 5.6, stock kernel,
with 12 GB DRAM.
The original installation was 7.5.3, 32-bit, controlling only two
Windows XP Pro 32-bit machines, and this worked.
I upgraded to 7.6.2 so that we could add Windows 7 64-bit slaves
and Windows 2008 Server 64-bit slaves.
I uninstalled all the old code on the Linux master and the Windows
slaves.
Our test code runs fine, standalone, on all the machines (both 32-
and 64-bit).
I unpacked condor-7.6.2-x86_rhap_5-unstripped.tar.gz on the
Linux machine and downloaded the MSI files and the redistributable
2008 and 2011 C executables for the Windows machines.
I then ran the installer on the Linux machine named sched1.am1.mnet:
./condor_install --type=manager,submit --central-manager=sched1.am1.mnet --verbose
I then installed the MSI files on the old XP Pro 32-bit machines,
the Windows 7 64-bit machine, and the Windows 2008 Server machine.
Question 1: How do I get the Windows 2008 Server machine to start
the Condor service as a local service? If I give the service a user
name and login password, it starts. If I choose a local service, it
won't start. This machine has a full install of W2K8 STD SVR 64-bit
but is used for nothing else. Condor starts correctly on the XP and
W7 machines.
Question 2: Once the Condor service on the W2K8 STD SVR machine is
started via a user and password in the services screen, I see:
[root@sched1 condor]# condor_status
Name              OpSys    Arch    State      Activity  LoadAv  Mem   ActvtyTime
slot1@condor-xp1  WINNT51  INTEL   Unclaimed  Idle      0.000   1019  0+01:07:25
slot2@condor-xp1  WINNT51  INTEL   Unclaimed  Idle      0.000   1019  0+01:06:26
slot1@HP2         WINNT61  X86_64  Unclaimed  Idle      0.000   2047  0+03:55:04
slot2@HP2         WINNT61  X86_64  Unclaimed  Idle      0.000   2047  0+03:54:45
slot3@HP2         WINNT61  X86_64  Unclaimed  Idle      0.000   2047  0+03:55:06
slot4@HP2         WINNT61  X86_64  Unclaimed  Idle      0.000   2047  0+03:55:07
slot1@HP3         WINNT61  X86_64  Unclaimed  Idle      0.010   1023  0+02:11:56
slot2@HP3         WINNT61  X86_64  Unclaimed  Idle      0.000   1023  0+02:13:57
slot3@HP3         WINNT61  X86_64  Unclaimed  Idle      0.000   1023  0+02:12:58
slot4@HP3         WINNT61  X86_64  Unclaimed  Idle      0.000   1023  0+03:15:07

                Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/WINNT51       2      0        0          2        0           0         0
X86_64/WINNT61      8      0        0          8        0           0         0
Total              10      0        0         10        0           0         0
condor-xp1: Windows XP Pro 32-bit
HP2: Windows 7 64-bit
HP3: Windows 2008 STD Server 64-bit
But when I try to submit a job using condor_submit file_name, I get
these errors:
08/25/11 13:18:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:18:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:18:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:18:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot2@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:18:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:18:34 (pid:3451) Response problem from startd when requesting claim slot2@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:18:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot2@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:18:34 (pid:3451) Match record (slot2@HP3 <172.28.96.120:49215> for rita, 7.0) deleted
08/25/11 13:19:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:19:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:19:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:19:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:19:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:19:34 (pid:3451) Response problem from startd when requesting claim slot3@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:19:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot3@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:19:34 (pid:3451) Match record (slot3@HP3 <172.28.96.120:49215> for rita, 7.0) deleted
The job gets matched to a free core (or a set of free cores) on a
slave node machine, but I get a "condor_read() failed: recv()
returned -1, errno = 104" error after the match.
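To see whether the failures are confined to one machine, the claim
errors can be tallied per slot and address. The following is only a
minimal sketch: the function name count_claim_failures is hypothetical,
and the pattern assumes the "Failed to send REQUEST_CLAIM" wording
shown in the log above.

```python
import re
from collections import Counter

# Pattern for the "Failed to send REQUEST_CLAIM" entries quoted above.
# Assumption: other Condor versions may word this message differently.
CLAIM_FAILURE = re.compile(
    r"Failed to send REQUEST_CLAIM to startd "
    r"(?P<slot>\S+) <(?P<addr>[\d.]+:\d+)>"
)


def count_claim_failures(log_text):
    """Count REQUEST_CLAIM failures per (slot, address) in schedd log text."""
    # Collapse all whitespace so that entries wrapped across several
    # lines (for example by a mail client) still match the pattern.
    flattened = " ".join(log_text.split())
    return Counter(
        (m.group("slot"), m.group("addr"))
        for m in CLAIM_FAILURE.finditer(flattened)
    )
```

Passing the contents of the schedd log to count_claim_failures()
returns a Counter keyed by (slot, address), so a single address
collecting all the failures would point at that one machine.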
The Condor processes running on the master are:
[root@sched1 condor]# ps -eflc | grep condor_
5 S condor  3448    1 TS 21 - 2094 -      11:26 ?     00:00:09 condor_master
4 S condor  3449 3448 TS 21 - 2290 -      11:26 ?     00:00:01 condor_collector -f
4 S condor  3450 3448 TS 20 - 2182 -      11:26 ?     00:00:04 condor_negotiator -f
4 S condor  3451 3448 TS 21 - 2585 -      11:26 ?     00:00:00 condor_schedd -f
4 S root    3452 3451 TS 21 -  978 -      11:26 ?     00:00:03 condor_procd -A /tmp/condor-lock.sched10.974037967463122/procd_pipe.SCHEDD -R 10000000 -S 60 -C 1016
0 S root    4786 3866 TS 21 - 1005 pipe_w 15:44 pts/2 00:00:00 grep condor_
The scheduler (sched1) is 172.28.96.118; the slave nodes are .119
and .120 for the 64-bit machines and .79 for the WinXP Pro 32-bit
machine.
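Given these addresses, one quick check from the scheduler is whether
the startd port seen in the log accepts TCP connections at all. The
sketch below uses only the Python standard library; the helper name
can_connect and the 3-second timeout are arbitrary choices, not
anything Condor provides.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout, or
        # an unreachable host; the socket is closed on leaving the block.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, can_connect("172.28.96.120", 49215) probes the startd
address that appears in the errors. A True result only shows that the
port is reachable; it does not rule out the peer resetting the
connection afterwards, which is what errno 104 reports.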
The users on all machines are two people (rita and seth), a "condor"
user, and root (on Linux).
Any help in looking for where to troubleshoot this would be
greatly appreciated.
After reading many forum entries that had only marginal
applicability, I explicitly added the name of the master (sched1)
to both of the lines listed below in the config files on all the
Windows machines:
## Negotiator access. Machines listed here are trusted central
## managers. You should normally not have to change this.
## (ADDED "sched1")
ALLOW_NEGOTIATOR = $(CONDOR_HOST), sched1
## Now, with flocking we need to let the SCHEDD trust the other
## negotiators we are flocking with as well. You should normally
## not have to change this either.
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS), sched1
I also installed Java on all the machines and rebooted, but the
Condor service on the Windows 2008 Server machine would not come up
automatically. I made it a manual-start service, waited until the
network had started, then started it manually from the services
screen, and now it runs correctly.
If someone has an idea of how to get the service to wait for the
network to come up, and how to make it run as a local service,
I would be appreciative.
--
Seth Bardash
Integrated Solutions and Systems LLC
1510 Old North Gate Road
Colorado Springs, CO 80921
719-495-5866 Shop Phone
719-337-4779 Cell
719-386-0218 Metso Phone
seth@xxxxxxxxxxxxxxxxxxxxxxx
Failure cannot survive knowledge and perseverance!