[Condor-users] Condor 7.6.2 error sending a program to slave nodes from a master
- Date: Thu, 25 Aug 2011 15:53:32 -0600
 
- From: Seth Bardash <seth@xxxxxxxxxxxxxxxxxxxxxxx>
 
- Subject: [Condor-users] Condor 7.6.2 error sending a program to slave nodes from a master
 
First, thanks for taking the time to read this.
I upgraded to Condor 7.6.2, using
condor-7.6.2-x86_rhap_5-unstripped.tar.gz, on a dual Xeon 32-bit machine
running the i686 version of CentOS 5.6 (stock kernel) with 12 GB of DRAM.
The original installation was 7.5.3, 32-bit, controlling only two
Windows XP Pro 32-bit machines, and it worked.
I upgraded to 7.6.2 so that we could add Windows 7 64-bit slaves and
Windows 2008 Server 64-bit slaves.
I uninstalled all the old code on the Linux master and the Windows slaves.
Our test code runs fine, standalone, on all the machines (both 32- and
64-bit).
I extracted condor-7.6.2-x86_rhap_5-unstripped.tar.gz on the Linux
machine and downloaded the msi files and the redistributable 2008 and
2011 C executables for the Windows machines.
I then installed (./condor_install --type=manager,submit
--central-manager=sched1.am1.mnet --verbose) on the Linux machine named
sched1.am1.mnet.
I then installed the msi files on the old XP Pro 32-bit machines, the
Windows 7 64-bit machine and the Windows 2008 Server machine.
Question 1: How do I get the Windows 2008 Server machine to start the
condor service as a local service? If I give it a user and login
password, it starts. If I choose a local service, it won't start. This
machine has a full install of W2K8 STD SVR 64-bit but is used for
nothing else. Condor starts correctly on the XP and W7 machines.
Question 2: Once the W2K8 STD SVR machine is started via a user and
password in the services screen, I see:
[root@sched1 condor]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@condor-xp1   WINNT51    INTEL  Unclaimed Idle     0.000  1019  0+01:07:25
slot2@condor-xp1   WINNT51    INTEL  Unclaimed Idle     0.000  1019  0+01:06:26
slot1@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:04
slot2@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:54:45
slot3@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:06
slot4@HP2          WINNT61    X86_64 Unclaimed Idle     0.000  2047  0+03:55:07
slot1@HP3          WINNT61    X86_64 Unclaimed Idle     0.010  1023  0+02:11:56
slot2@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+02:13:57
slot3@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+02:12:58
slot4@HP3          WINNT61    X86_64 Unclaimed Idle     0.000  1023  0+03:15:07

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
       INTEL/WINNT51     2     0       0         2       0          0        0
      X86_64/WINNT61     8     0       0         8       0          0        0
               Total    10     0       0        10       0          0        0
condor-xp1 - Windows XP Pro 32-bit
HP2        - Windows 7 64-bit
HP3        - Windows 2008 STD Server 64-bit
But when I try to submit a job using condor_submit file_name, I get these
errors from the schedd (pid 3451):
08/25/11 13:18:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:18:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:18:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:18:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:18:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot2@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:18:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:18:34 (pid:3451) Response problem from startd when requesting claim slot2@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:18:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot2@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:18:34 (pid:3451) Match record (slot2@HP3 <172.28.96.120:49215> for rita, 7.0) deleted
08/25/11 13:19:34 (pid:3451) Activity on stashed negotiator socket: <172.28.96.118:43953>
08/25/11 13:19:34 (pid:3451) Negotiating for owner: rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Finished negotiating for rita in local pool: 1 matched, 0 rejected
08/25/11 13:19:34 (pid:3451) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/25/11 13:19:34 (pid:3451) Sent ad to central manager for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) Sent ad to 1 collectors for rita@localdomain localhost
08/25/11 13:19:34 (pid:3451) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@HP3 <172.28.96.120:49215> for rita.
08/25/11 13:19:34 (pid:3451) IO: Failed to read packet header
08/25/11 13:19:34 (pid:3451) Response problem from startd when requesting claim slot3@HP3 <172.28.96.120:49215> for rita 7.0.
08/25/11 13:19:34 (pid:3451) Failed to send REQUEST_CLAIM to startd slot3@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from socket
08/25/11 13:19:34 (pid:3451) Match record (slot3@HP3 <172.28.96.120:49215> for rita, 7.0) deleted
The job gets matched to a free core (or a set of free cores) on a slave
node, but right after the match the schedd reports "condor_read()
failed: recv() returned -1, errno = 104" (Connection reset by peer).
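To see at a glance which slots and hosts the failed claims hit, the log can be summarized with a short script. The following is only a rough Python sketch: `failed_claims` and `failures_by_host` are hypothetical helper names, not part of Condor, and the pattern assumes the "Failed to send REQUEST_CLAIM" wording shown in the log above.

```python
import re
from collections import Counter

# Pattern for the schedd's failed-claim line. The slot name and the
# <ip:port> address of the startd follow the fixed text.
CLAIM_FAILURE = re.compile(
    r"Failed to send REQUEST_CLAIM to startd\s+(?P<slot>\S+)\s+<(?P<addr>[^>]+)>"
)


def failed_claims(log_text):
    """Return a list of (slot, address) pairs, one per failed claim request."""
    # Log lines may be wrapped across several lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return [(m.group("slot"), m.group("addr")) for m in CLAIM_FAILURE.finditer(flat)]


def failures_by_host(pairs):
    """Count failed claims per execute host (the part of the slot name after '@')."""
    return Counter(slot.split("@", 1)[-1] for slot, _addr in pairs)


# Two entries copied from the log above, wrapped to show that wrapping is tolerated.
SAMPLE_LOG = """\
08/25/11 13:18:34 (pid:3451) Failed to send REQUEST_CLAIM to startd
slot2@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from
socket
08/25/11 13:19:34 (pid:3451) Failed to send REQUEST_CLAIM to startd
slot3@HP3 <172.28.96.120:49215> for rita: CEDAR:6004:failed reading from
socket
"""

pairs = failed_claims(SAMPLE_LOG)
print(pairs)
print(failures_by_host(pairs))
```

In the excerpt above, both failures are against HP3 (the Windows 2008 Server machine) at the same startd address.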
The Condor processes running on the master are:
[root@sched1 condor]# ps -eflc | grep condor_
5 S condor    3448     1 TS   21 -  2094 -      11:26 ?        00:00:09 condor_master
4 S condor    3449  3448 TS   21 -  2290 -      11:26 ?        00:00:01 condor_collector -f
4 S condor    3450  3448 TS   20 -  2182 -      11:26 ?        00:00:04 condor_negotiator -f
4 S condor    3451  3448 TS   21 -  2585 -      11:26 ?        00:00:00 condor_schedd -f
4 S root      3452  3451 TS   21 -   978 -      11:26 ?        00:00:03 condor_procd -A /tmp/condor-lock.sched10.974037967463122/procd_pipe.SCHEDD -R 10000000 -S 60 -C 1016
0 S root      4786  3866 TS   21 -  1005 pipe_w 15:44 pts/2    00:00:00 grep condor_
The scheduler (sched1) is 172.28.96.118; the slave nodes are .119 and
.120 for the 64-bit machines and .79 for the WinXP Pro 32-bit machine.
The users on all machines are two people (rita and seth), a "condor"
user and root (on Linux).
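One basic check from sched1 would be whether the startd port on HP3 accepts a TCP connection at all. The following is a minimal Python sketch, assuming Python is available on the Linux master; `can_connect` is a hypothetical helper, not a Condor tool.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the address in the log, the call would be `can_connect("172.28.96.120", 49215)`. A True result would only show that the port is reachable; it would not by itself explain why the connection is reset during the read.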
Any help in looking for where to troubleshoot this would be greatly 
appreciated.
--
Seth Bardash
Integrated Solutions and Systems LLC
1510 Old North Gate Road
Colorado Springs, CO  80921
719-495-5866   Shop Phone
719-337-4779   Cell
seth@xxxxxxxxxxxxxxxxxxxxxxx
Failure cannot survive knowledge and perseverance!