I had set up a Condor pool of ~140 slots, consisting of my desktop (call it pchead; 4 slots, one of them listening to keyboard activity) acting as Master, Scheduler, Submission AND execution node, and a set of other desktops and VMs as execution nodes. It had been working excellently until this morning. I had, however, been away for a few days, so I can't pinpoint exactly when the problem appeared.
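For reference, the role split is roughly as follows (a sketch of the relevant configuration lines, not a verbatim copy of my files):

  # on pchead (central manager + submit + execute)
  CONDOR_HOST = pchead
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

  # on the execution-only desktops and VMs
  CONDOR_HOST = pchead
  DAEMON_LIST = MASTER, STARTD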
There was a network glitch this morning and I restarted the master node. Since then, when I submit my jobs, only the ones matched to the 3 slots on pchead run (the 4th slot being reserved for the owner). All other slots refuse to start jobs and stay Idle. The job resource requirements are not likely to be the problem, since I am running the exact same jobs on the same machines. Furthermore, when I do condor_q -better-analyze I get:
212899.000:  Run analysis summary. Of 138 machines,
      0 are rejected by your job's requirements
      1 reject your job because of their own requirements
      3 match and are already running your jobs
      0 match but are serving other users
    134 are available to run your job
The Requirements expression for your job is:

  ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
  ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
  ( TARGET.HasFileTransfer )
Your job defines the following attributes:
  DiskUsage = 75000
  ImageSize = 35
  RequestDisk = 75000
  RequestMemory = 1
The Requirements expression for your job reduces to these conditions:

          Slots
Step    Matched  Condition
-----  --------  ---------
[0]         138  TARGET.Arch == "X86_64"
[1]         138  TARGET.OpSys == "LINUX"
[3]         138  TARGET.Disk >= RequestDisk
[5]         138  TARGET.Memory >= RequestMemory
[7]         138  TARGET.HasFileTransfer
Therefore I would conclude that the job resource requirements are not the issue.
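To cross-check the machine side of the match as well, the slots' own Start/Requirements expressions can be inspected from the head node with something along these lines (the hostname is a placeholder for one of the idle execution nodes):

  condor_status -long someexecnode | egrep '^(Start|Requirements) ='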
From my investigations, everything points to a network or misconfiguration issue; however, I have been unable to pinpoint where the problem is. For completeness, I should mention that all machines are within a private network, inaccessible from the outside world, so it is safe enough to disable the firewall to allow free communication between the nodes. Indeed, I have done just that, but it does not seem to fix the problem: nodes with the firewall up and nodes with the firewall down exhibit the same behavior, even with the head node's (pchead) firewall down. It could be that the admins have modified the network somehow to restrict traffic, but I have not seen any relevant announcement. In addition, some quick scans I performed show that some ports, including 9618, are accessible on the head node.
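The "scans" were nothing more sophisticated than checks of this kind, run from an execution node (netcat here purely as an illustration):

  # is the collector port on the head node reachable? (9618 is the default)
  nc -zv pchead 9618

  # which central manager / collector is this node actually configured to use?
  condor_config_val CONDOR_HOST COLLECTOR_HOST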
For simplicity, I removed all but one execution node (call it "pcnodevm00", with two slots) by stopping all daemons on all other machines and removing their hostnames from ALLOW_WRITE in the head node configuration. For the head node, I also set the START flag to FALSE to avoid starting any jobs there.
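Concretely, the relevant lines on pchead now look roughly like this (hostnames abbreviated), followed by a condor_reconfig:

  # pchead local configuration (sketch)
  ALLOW_WRITE = pchead, pcnodevm00
  START = False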
Looking at the log files, I see messages like the following in pchead's CollectorLog:

10/15/14 16:39:58 StartdPvtAd     : Inserting ** "< slot1@pcnodevm00 , XXX.XXX.XXX.XXX >"
10/15/14 16:39:58 stats: Inserting new hashent for 'StartdPvt':'slot1@pcnodevm00':'XXX.XXX.XXX.XXX'
which look fine. The daemons on pcnodevm00 appear to be running (according to pstree), and the StartLog on that node also looks fine, ending with:

10/15/14 16:40:18 slot1: Changing activity: Benchmarking -> Idle
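From pchead, the two slots also show up in the collector, e.g. via a query like:

  condor_status pcnodevm00

(both slots listed, Unclaimed/Idle).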
I then proceeded to submit 6 jobs, and all of them are now stuck in the Idle state, as expected. Looking at the StartLog on pcnodevm00 again, I see messages like the following (where XXX.XXX.XXX.XXX is pcnodevm00's IP address):
10/15/14 16:52:03 slot1: Can't receive eom from schedd
10/15/14 16:52:03 slot1: State change: claiming protocol failed
10/15/14 16:52:03 slot1: Changing state: Unclaimed -> Owner
10/15/14 16:52:03 slot1: State change: IS_OWNER is false
10/15/14 16:52:03 slot1: Changing state: Owner -> Unclaimed
10/15/14 16:52:04 slot2: Can't receive eom from schedd
10/15/14 16:52:04 slot2: State change: claiming protocol failed
10/15/14 16:52:04 slot2: Changing state: Unclaimed -> Owner
10/15/14 16:52:04 slot2: State change: IS_OWNER is false
10/15/14 16:52:04 slot2: Changing state: Owner -> Unclaimed
10/15/14 16:52:04 Error: can't find resource with ClaimId (<XXX.XXX.XXX.XXX:43045>#1413383989#2#...)
10/15/14 16:52:04 Error: can't find resource with ClaimId (<XXX.XXX.XXX.XXX:43045>#1413383989#1#...)
10/15/14 16:53:03 slot1: match_info called
10/15/14 16:53:03 slot1: Received match <XXX.XXX.XXX.XXX:43045>#1413383989#3#...
10/15/14 16:53:03 slot1: State change: match notification protocol successful
10/15/14 16:53:03 slot1: Changing state: Unclaimed -> Matched
10/15/14 16:53:03 slot2: match_info called
10/15/14 16:53:03 slot2: Received match <XXX.XXX.XXX.XXX:43045>#1413383989#4#...
10/15/14 16:53:03 slot2: State change: match notification protocol successful
10/15/14 16:53:03 slot2: Changing state: Unclaimed -> Matched
10/15/14 16:53:03 slot1: Can't receive eom from schedd
10/15/14 16:53:03 slot1: State change: claiming protocol failed
10/15/14 16:53:03 slot1: Changing state: Matched -> Owner
10/15/14 16:53:03 slot1: State change: IS_OWNER is false
10/15/14 16:53:03 slot1: Changing state: Owner -> Unclaimed
10/15/14 16:53:03 slot2: Can't receive eom from schedd
10/15/14 16:53:03 slot2: State change: claiming protocol failed
10/15/14 16:53:03 slot2: Changing state: Matched -> Owner
Can anybody deduce from the "Can't receive eom from schedd" error where exactly I am messing up the configuration?
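If more detail would help, I can of course turn up the logging on both ends with something along these lines, reconfigure, and resubmit:

  # on pcnodevm00
  STARTD_DEBUG = D_FULLDEBUG D_NETWORK

  # on pchead
  SCHEDD_DEBUG = D_FULLDEBUG D_NETWORK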
Regards,
Nikiforos