
[Condor-users] SchedLog: job submission timed out....port problem?



Hi,

I'm baffled!

A job has not run for days, although the negotiator matches
the job to the specified machine (which is in the Unclaimed
state). The apparent reason is broken communication, and I
suspected a firewall problem (see my earlier message below).

Then suddenly, days later, the job does start running. SchedLog:

09/10 10:31:15 (pid:2109) attempt to connect to <115.145.228.20:1048> failed: 
Connection timed out (connect errno = 110).
09/10 10:31:15 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@2-4-1 
<115.145.228.20:1048> for user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to 
startd slot1@2-4-1 <115.145.228.20:1048> for user@xxxxxxxxxxxxxx failed.
09/10 10:31:15 (pid:2109) Match record (slot1@2-4-1 <115.145.228.20:1048> for 
user@xxxxxxxxxxxxxx, 250.0) deleted
09/10 10:31:40 (pid:2109) Completed REQUEST_CLAIM to startd slot1@2-4-1 
<115.145.228.20:4961> for user@xxxxxxxxxxxxxx
09/10 10:31:40 (pid:2109) Started shadow for job 250.0 on slot1@2-4-1 
<115.145.228.20:4961> for user@xxxxxxxxxxxxxx, (shadow pid = 3739)
09/10 15:19:01 (pid:2109) match (slot1@2-4-1 <115.145.228.20:4961> for 
user@xxxxxxxxxxxxxx) out of jobs; relinquishing
09/10 15:19:01 (pid:2109) Completed RELEASE_CLAIM to startd at 
<115.145.228.20:4961>
09/10 15:19:01 (pid:2109) Match record (slot1@2-4-1 <115.145.228.20:4961> for 
user@xxxxxxxxxxxxxx, 250.-1) deleted


Why is communication with this Unclaimed machine blocked for days,
and then the job submission suddenly works?

The "Failed to send REQUEST_CLAIM" happened with ports 1053 and 1048:

Failed to send REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:1053> for
   user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd slot1@2-4-1
   <115.145.228.20:1053> for user@xxxxxxxxxxxxxx failed.
Failed to send REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:1048> for
   user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd slot1@2-4-1
   <115.145.228.20:1048> for user@xxxxxxxxxxxxxx failed.


The "Completed REQUEST_CLAIM" happened with port 4961:

Completed REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:4961> for 
user@xxxxxxxxxxxxxx
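
To tally which startd ports failed and which succeeded, something like the
following could be run over the SchedLog text. This is only a minimal sketch:
the function name summarize_claims and the regular expression are my own
assumptions, based solely on the message formats shown in this mail, and
other Condor versions may word these lines differently.

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the SchedLog lines quoted in this mail.
# \s+ is used between fields so that wrapped lines still match.
CLAIM_RE = re.compile(
    r"(Failed to send|Completed) REQUEST_CLAIM to startd\s+(\S+)\s+<([\d.]+):(\d+)>"
)

def summarize_claims(log_text):
    """Map (slot, ip) -> {"failed": set of ports, "completed": set of ports}."""
    summary = defaultdict(lambda: {"failed": set(), "completed": set()})
    for outcome, slot, ip, port in CLAIM_RE.findall(log_text):
        key = "failed" if outcome.startswith("Failed") else "completed"
        summary[(slot, ip)][key].add(int(port))
    return dict(summary)
```

Calling summarize_claims(open("SchedLog").read()) (the path is a placeholder
for wherever the log lives) returns one entry per slot, which makes it easy
to see whether failures and successes involve different ports.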


What conclusion should I draw from this?
Any suggestions?


Thanks,
Rob.


----------------------------------------------------------------
On Wed, 8 Sep 2010 Rob wrote:

Hi,

I use a Linux master PC.
I have a Windows pool PC (ip = 115.145.228.26 or name = "3-4")
which is in the Unclaimed state.
All are running Condor 7.4.3.

When I submit a Vanilla job, then NegotiatorLog tells me that the match is OK.

The SchedLog then has the following entries:

09/09 12:54:25 (pid:2109) attempt to connect to <115.145.228.26:1042> failed: 
Connection timed out (connect errno = 110).  Will keep trying for 45 total 
seconds (24 to go).
09/09 12:54:50 (pid:2109) attempt to connect to <115.145.228.26:1042> failed: 
Connection timed out (connect errno = 110).
09/09 12:54:50 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@3-4 
<115.145.228.26:1042> for user@xxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to 
startd slot1@3-4 <115.145.228.26:1042> for user@xxxxxxxxxxxxxxx failed.
09/09 12:54:50 (pid:2109) Match record (slot1@3-4 <115.145.228.26:1042> for 
user@xxxxxxxxxxxxxxx, 247.0) deleted

Apparently the network communication is not working.
Can somebody tell me, judging from these SchedLog messages,
which communication or firewall rule is actually missing?
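
One way to check this independently of Condor would be to attempt a plain TCP
connection from the submit machine to the address and port that SchedLog
reports. The sketch below is a generic probe, not a Condor tool; the name
probe_tcp and the default timeout are my own assumptions. A "timeout" result
(as opposed to "refused") would be consistent with packets being silently
dropped somewhere, for example by a firewall, though it does not prove it.

```python
import socket

def probe_tcp(host, port, timeout=5.0):
    """Try a TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout (compare errno 110 in SchedLog).
        return "timeout"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host or name lookup failure.
        return f"error: {exc}"
```

For example, probe_tcp("115.145.228.26", 1042) tests the port from the log
above. The port has to be taken from a recent log entry, since the startd
port is not necessarily the same each time.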


The (Linux) master does get the status info, and it can
fetch the Windows log files with condor_fetchlog.

The firewall on the Windows PC is a commercial Korean product
(V3 from Ahnlab). I have allowed as firewall exceptions:
  condor_dagman.exe
  condor_kbdd.exe
  condor_master.exe
  condor_startd.exe
  condor_starter.exe
  condor_vm-gahp.exe
  condor_preen.exe

It seems that this is not enough to allow full Condor communication.

Thanks.
Rob.