[Condor-users] Condor 7.6.2 error sending a program to slave nodes from a master
- Date: Thu, 25 Aug 2011 15:53:32 -0600
- From: Seth Bardash <seth@xxxxxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Condor 7.6.2 error sending a program to slave nodes from a master
First, thanks for taking the time to read this.
I upgraded to Condor 7.6.2, using
condor-7.6.2-x86_rhap_5-unstripped.tar.gz, on a dual Xeon 32-bit machine
running the i686 version of CentOS 5.6 (stock kernel) with 12 GB DRAM.
The original installation was Condor 7.5.3, 32-bit, controlling only two
Windows XP Pro 32-bit machines, and that worked.
I upgraded to 7.6.2 so that we could add Windows 7 64-bit slaves and
Windows 2008 Server 64-bit slaves.
I uninstalled all the old code on the Linux master and the Windows slaves.
Our test code runs fine standalone on all the machines (both 32-bit and
64-bit).
I unpacked condor-7.6.2-x86_rhap_5-unstripped.tar.gz on the Linux
machine and downloaded the MSI files and the redistributable 2008 and
2011 C executables for the Windows machines.
I then ran the installer on the Linux machine, which is named
sched1.am1.mnet:
./condor_install --type=manager,submit --central-manager=sched1.am1.mnet --verbose
I then installed the MSI files on the old XP Pro 32-bit machines, the
Windows 7 64-bit machine and the Windows 2008 Server machine.
Question 1: How do I get the Windows 2008 Server machine to start the
Condor service as a local service? If I give it a user and login
password, it starts. If I choose a local service, it won't start. This
machine has a full install of W2K8 STD SVR 64-bit but is used for
nothing else. Condor starts correctly on the XP and W7 machines.
Question 2: Once the W2K8 STD SVR machine is started via a user and
password in the services screen, I see:
[root@sched1 condor]# condor_status
Name              OpSys    Arch    State      Activity  LoadAv  Mem   ActvtyTime
slot1@condor-xp1  WINNT51  INTEL   Unclaimed  Idle      0.000   1019  0+01:07:25
slot2@condor-xp1  WINNT51  INTEL   Unclaimed  Idle      0.000   1019  0+01:06:26
slot1@HP2         WINNT61  X86_64  Unclaimed  Idle      0.000   2047  0+03:55:04
slot2@HP2         WINNT61  X86_64  Unclaimed  Idle      0.000   2047  0+03:54:45
slot3@HP2         WINNT61  X86_64  Unclaimed  Idle      0.000   2047  0+03:55:06
slot4@HP2         WINNT61  X86_64  Unclaimed  Idle      0.000   2047  0+03:55:07
slot1@HP3         WINNT61  X86_64  Unclaimed  Idle      0.010   1023  0+02:11:56
slot2@HP3         WINNT61  X86_64  Unclaimed  Idle      0.000   1023  0+02:13:57
slot3@HP3         WINNT61  X86_64  Unclaimed  Idle      0.000   1023  0+02:12:58
slot4@HP3         WINNT61  X86_64  Unclaimed  Idle      0.000   1023  0+03:15:07

                Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/WINNT51       2      0        0          2        0           0         0
X86_64/WINNT61      8      0        0          8        0           0         0
Total              10      0        0         10        0           0         0
condor-xp1 - Windows XP Pro 32 Bit
HP2 - Windows 7 64 Bit
HP3 - Windows 2008 STD Server 64 Bit
But when I try to submit a job using condor_submit file_name, I get
these errors:
08/25/11 13:18:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:18:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:18:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:18:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot2@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:18:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:18:34 (pid:3451) Response problem from startd when requesting claim slot2@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:18:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot2@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:18:34 (pid:3451) Match record (slot2@HP3 <172.28.96.120:49215> for rita, 7.0) deleted
08/25/11 13:19:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:19:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:19:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:19:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:19:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:19:34 (pid:3451) Response problem from startd when requesting claim slot3@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:19:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot3@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:19:34 (pid:3451) Match record (slot3@HP3 <172.28.96.120:49215> for rita, 7.0) deleted
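The job gets matched to a free core, or a set of free cores, on a slave
node, but right after the match the schedd (pid 3451) logs a
"condor_read() failed: recv() returned -1, errno = 104" error, and the
match record is deleted. In the excerpt above, both failed claims are
against slots on HP3 (172.28.96.120).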
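To check whether the failures are confined to one machine over a longer
stretch of the log, the REQUEST_CLAIM failures can be tallied per host.
The following is only a rough sketch: the function name, the regular
expression and the SAMPLE list are illustrative assumptions, and it
expects each log entry on a single line (that is, before the mail client
wrapped it).

```python
import re
from collections import Counter

# Pattern for the "Failed to send REQUEST_CLAIM" entries quoted in this post.
# The exact wording may differ in other Condor versions (assumption).
CLAIM_FAILURE = re.compile(
    r"Failed to send REQUEST_CLAIM to startd (?P<slot>\S+) <(?P<addr>[\d.]+:\d+)>"
)


def summarize_claim_failures(lines):
    """Count failed claim requests per (host, startd address)."""
    counts = Counter()
    for line in lines:
        match = CLAIM_FAILURE.search(line)
        if match:
            # "slot2@HP3" -> "HP3"
            host = match["slot"].split("@", 1)[-1]
            counts[(host, match["addr"])] += 1
    return counts


# Two entries copied from the log excerpt above, unwrapped onto one line each.
SAMPLE = [
    "08/25/11 13:18:34 (pid:3451) Failed to send REQUEST_CLAIM to startd "
    "slot2@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket",
    "08/25/11 13:19:34 (pid:3451) Failed to send REQUEST_CLAIM to startd "
    "slot3@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket",
]

if __name__ == "__main__":
    for (host, addr), n in summarize_claim_failures(SAMPLE).items():
        print(f"{host} {addr}: {n} failed claim(s)")
```

Feeding it the full schedd log instead of SAMPLE would show whether HP2
or condor-xp1 ever appear in the failures, or only HP3.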
The Condor processes running on the master are:
[root@sched1 condor]# ps -eflc | grep condor_
5 S condor  3448     1 TS  21 -  2094 -      11:26 ?     00:00:09 condor_master
4 S condor  3449  3448 TS  21 -  2290 -      11:26 ?     00:00:01 condor_collector -f
4 S condor  3450  3448 TS  20 -  2182 -      11:26 ?     00:00:04 condor_negotiator -f
4 S condor  3451  3448 TS  21 -  2585 -      11:26 ?     00:00:00 condor_schedd -f
4 S root    3452  3451 TS  21 -   978 -      11:26 ?     00:00:03 condor_procd -A /tmp/condor-lock.sched10.974037967463122/procd_pipe.SCHEDD -R 10000000 -S 60 -C 1016
0 S root    4786  3866 TS  21 -  1005 pipe_w 15:44 pts/2 00:00:00 grep condor_
The scheduler (sched1) is 172.28.96.118. The slave nodes are
172.28.96.119 and 172.28.96.120 for the 64-bit machines and 172.28.96.79
for the WinXP Pro 32-bit machine.
The users on all machines are two people (rita and seth), a "condor"
user, and root (on Linux).
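One basic check from sched1 is whether a plain TCP connection to the
startd address shown in the log can be opened at all. The helper below
is a minimal sketch, not a Condor tool: probe_tcp is a hypothetical
name, and it only classifies the outcome of a single connect attempt.
An "open" result would not rule out a problem above the TCP layer,
since the reset in the log happened while reading a reply.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connect and classify the result.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

For example, probe_tcp("172.28.96.120", 49215) run on sched1 would test
the HP3 startd port from the log above. That port may change whenever
the startd restarts, so it should be taken from a current log entry.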
Any help in looking for where to troubleshoot this would be greatly
appreciated.
--
Seth Bardash
Integrated Solutions and Systems LLC
1510 Old North Gate Road
Colorado Springs, CO 80921
719-495-5866 Shop Phone
719-337-4779 Cell
seth@xxxxxxxxxxxxxxxxxxxxxxx
Failure cannot survive knowledge and perseverance!