[Condor-users] Windows Xp Service Pack 2
- Date: Wed, 25 May 2005 17:04:20 -0400
- From: Joshua Juen <jj9867@xxxxxxxxx>
- Subject: [Condor-users] Windows Xp Service Pack 2
This question was asked back in November.
I am just now attempting to set up a pool of Windows XP Service Pack 2
machines and am having the same problems listed here, with version 6.6.9.
I was wondering if anyone has made any headway towards solving these problems?
The system worked flawlessly until the machines were rebooted, and now
nothing seems to work.
Thanks
JJ
Old Post:
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of
Geraint.Lloyd@xxxxxxxxxxxx
Sent: 02 November 2004 15:32
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Windows XP firewall problems with 6.6.7
We have an all-Windows (mixture of Win2K and XP) Condor pool, with most of
the nodes acting as execute-only machines, along with one central pool
manager / submitter. We have recently updated to Condor 6.6.7
following installation of Windows XP SP2 on some of the execute nodes. We
are now having problems getting jobs to run on all nodes. I have traced
this to a combination of two problems:
1) On some of the machines with XP SP2 installed, the firewall is still
blocking some connections. This happens when the machine is initially
booted and Condor starts automatically. The Condor master log on these
nodes displays lines similar to the following:
11/2 14:19:42 ******************************************************
11/2 14:19:42 ** Condor (CONDOR_MASTER) STARTING UP
11/2 14:19:42 ** C:\Condor\bin\condor_master.exe
11/2 14:19:42 ** $CondorVersion: 6.6.7 Oct 14 2004 $
11/2 14:19:42 ** $CondorPlatform: INTEL-WINNT40 $
11/2 14:19:42 ** PID = 432
11/2 14:19:42 ******************************************************
11/2 14:19:42 Using config file: C:\Condor\condor_config
11/2 14:19:42 Using local config files: C:\Condor/condor_config.local
11/2 14:19:42 DaemonCore: Command Socket at <10.1.16.136:1043>
11/2 14:19:42 WinFirewall: get_CurrentProfile failed: 0x800706d9
11/2 14:19:42 Started DaemonCore process "C:\Condor/bin/condor_startd.exe", pid and pgroup = 496
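For what it's worth, the HRESULT in that WinFirewall line can be decoded: a 0x8007xxxx HRESULT wraps a plain Win32 error code in its low 16 bits, and here that code is 1753 (EPT_S_NOT_REGISTERED, "There are no more endpoints available from the endpoint mapper"). That is consistent with the firewall's COM/RPC interface not being up yet when condor_master starts at boot. A minimal sketch of the decoding:

```python
# Decode the HRESULT from the master log. An HRESULT of the form
# 0x8007xxxx wraps a Win32 error code in its low word (FACILITY_WIN32 = 7).
hresult = 0x800706D9

facility = (hresult >> 16) & 0x1FFF   # 7 == FACILITY_WIN32
win32_code = hresult & 0xFFFF         # low word holds the original error

print(hex(hresult), "-> facility", facility, "-> Win32 error", win32_code)
# Win32 error 1753 is EPT_S_NOT_REGISTERED, i.e. the RPC endpoint the
# firewall API talks to was not registered yet when Condor queried it.
```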
The node still appears in the pool but won't run any jobs and the
negotiator log on the central pool manager displays errors connecting to
this machine whenever jobs are submitted.
If I stop and restart the Condor service manually at a later stage all
works fine - the master log on the node now displays
11/2 14:21:17 Authorized application C:\Condor/bin/condor_startd.exe is now enabled in the firewall.
- and does not give the WinFirewall error. Jobs now run on the node
without problems - no firewall blocking.
All the firewall settings are correct - exceptions allowed etc. I've
tried various changes, including making the Condor service dependent on
the firewall service to ensure that Condor starts after it, but this hasn't
fixed the problem. Any ideas?
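Another way to attack the race, beyond a service dependency, is to delay starting Condor until the firewall is actually up. The sketch below is a generic poll-with-timeout loop, not anything Condor ships; the probe function is hypothetical, and on a real XP SP2 node it would check that the Windows Firewall/ICS service reports RUNNING before launching condor_master:

```python
import time

def wait_for(probe, timeout=120.0, interval=5.0):
    """Poll probe() until it returns True or the timeout expires.

    probe is a hypothetical check, e.g. "is the Windows Firewall/ICS
    service in the RUNNING state?" on a real node.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# Illustration with a fake probe that succeeds on the third poll,
# standing in for a service that takes a while to come up at boot.
state = {"polls": 0}

def fake_firewall_ready():
    state["polls"] += 1
    return state["polls"] >= 3

ready = wait_for(fake_firewall_ready, timeout=10.0, interval=0.01)
print(ready, state["polls"])  # True 3
```

The same idea could be wired into a small wrapper service or startup script that only launches condor_master once the probe succeeds.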
2) Running jobs on all the nodes is made far worse by a second problem. If
the negotiator fails to talk correctly to one of the nodes (e.g. because
of the firewall problem) then it gives up on that negotiation cycle. The
negotiator log displays lines such as:
11/2 14:12:12 Request 00347.00008:
11/2 14:12:12 Matched 347.8 persephone@xxxxxxxxxxxxxx <10.1.16.132:4990> preempting none <10.1.16.77:1039>
11/2 14:12:12 Successfully matched with pergola.tessella.co.uk
11/2 14:12:12 Request 00347.00009:
11/2 14:12:33 Can't connect to <10.1.16.136:1044>:0, errno = 10060
11/2 14:12:33 Will keep trying for 10 seconds...
11/2 14:12:34 Connect failed for 10 seconds; returning FALSE
11/2 14:12:34 ERROR: SECMAN:2003:TCP connection to <10.1.16.136:1044> failed
11/2 14:12:34 condor_write(): Socket closed when trying to write buffer
11/2 14:12:34 Buf::write(): condor_write() failed
11/2 14:12:34 Could not send PERMISSION
11/2 14:12:34 Error: Ignoring schedd for this cycle
11/2 14:12:34 ---------- Finished Negotiation Cycle ----------
and the scheduler log shows something like
11/2 14:12:11 Negotiating for owner: persephone@xxxxxxxxxxxxxx
11/2 14:12:11 Checking consistency running and runnable jobs
11/2 14:12:11 Tables are consistent
11/2 14:12:32 condor_read(): timeout reading buffer.
11/2 14:12:32 Can't receive request from manager
11/2 14:12:32 DaemonCore: Command received via UDP from host <10.1.16.102:1655>
11/2 14:12:32 DaemonCore: received command 60014 (DC_INVALIDATE_KEY), calling handler (handle_invalidate_key())
11/2 14:12:32 condor_read(): recv() returned -1, errno = 10054, assuming failure.
11/2 14:12:32 Response problem from startd.
11/2 14:12:32 Sent RELEASE_CLAIM to startd on <10.1.16.102:1040>
11/2 14:12:32 Match record (<10.1.16.102:1040>, 347, 3) deleted
This means that all the other nodes in the pool (mostly without the
Windows firewall) that come after this error in the negotiation cycle are
ignored and don't run any jobs.
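The behaviour being asked for here can be sketched abstractly. This is not Condor's actual negotiator code, and try_match is a hypothetical stand-in for the matchmaking RPC; the point is simply that a per-node connection failure is caught and that node is skipped, rather than the whole cycle being abandoned:

```python
def negotiate_cycle(requests, startds, try_match):
    """Sketch of a negotiation cycle that skips unreachable startds
    instead of aborting. try_match(request, startd) is a hypothetical
    matcher that raises ConnectionError when the node is unreachable
    (the firewall case above).
    """
    matched, skipped = [], []
    for req in requests:
        for node in startds:
            if node in skipped:
                continue
            try:
                if try_match(req, node):
                    matched.append((req, node))
                    break
            except ConnectionError:
                # Ignore this startd for the rest of the cycle, but
                # keep negotiating the remaining jobs and nodes.
                skipped.append(node)
    return matched, skipped

# Illustration: node "b" is behind a blocking firewall.
def try_match(req, node):
    if node == "b":
        raise ConnectionError("errno 10060")
    return True

matched, skipped = negotiate_cycle(["347.8", "347.9"], ["b", "a"], try_match)
print(matched, skipped)  # [('347.8', 'a'), ('347.9', 'a')] ['b']
```

With an abort-on-first-failure policy, by contrast, neither job would have been matched once "b" failed, which is exactly what the negotiator log above shows.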
Is there any way of getting the scheduler / negotiator to ignore a machine
which it can't connect to and carry on assigning jobs to the rest of the
pool? I've tried setting NEGOTIATE_ALL_JOBS_IN_CLUSTER to True, but this
doesn't help. I noticed another posting to the users list mentioning this
problem, but there were no responses. It was also using a Windows central
manager, so has anyone seen this outside of Windows?
Any suggestions would be appreciated,
Thanks
Geraint Lloyd