[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Match timed out - jobs don't run?



On Wed, Apr 16, 2014 at 05:21:20PM +0200, Steffen Grunewald wrote:
> I'm running into problems with a freshly setup pool - as I rewrote
> all the config, there might be an ugly typo hiding somewhere.

With the pool shrunk to a single node, after releasing a single job from
the held jobcluster, scheduler/master node:

14/04/17 11:06:28(pid:28779) Calling Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> (5)
14/04/17 11:06:28(pid:28779) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxxx <10.150.96.2:43820> for stas 2.0
14/04/17 11:06:28(pid:28779) Match record (slot1@xxxxxxxxxxxxxxxxxx <10.150.96.2:43820> for stas, 2.0) deleted
14/04/17 11:06:28(pid:28779) Return from Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> 0.0001s

==> /var/log/condor/CollectorLog <==
14/04/17 11:07:07Accumulating data: Time=1397725627

==> /var/log/condor/NegotiatorLog <==
14/04/17 11:07:08---------- Started Negotiation Cycle ----------
14/04/17 11:07:08Phase 1:  Obtaining ads from collector ...
14/04/17 11:07:08Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements
SlotWeight ConcurrencyLimits","") 
14/04/17 11:07:08  Getting startd private ads ...

==> /var/log/condor/CollectorLog <==
14/04/17 11:07:08Got QUERY_STARTD_PVT_ADS
14/04/17 11:07:08Number of Active Workers 1
14/04/17 11:07:08Number of Active Workers 0
14/04/17 11:07:08(Sending 1 ads in response to query)
14/04/17 11:07:08Query info: matched=1; skipped=0; query_time=0.000374; send_time=0.000172; type=MachinePrivate; requirements={true}; peer=<10.150.94.40:47751>; projection={}

==> /var/log/condor/NegotiatorLog <==
14/04/17 11:07:08  Getting Scheduler, Submitter and Machine ads ...

==> /var/log/condor/CollectorLog <==
14/04/17 11:07:08Number of Active Workers 2
14/04/17 11:07:08Number of Active Workers 1
14/04/17 11:07:08(Sending 3 ads in response to query)
14/04/17 11:07:08Query info: matched=3; skipped=4; query_time=0.000365; send_time=0.000355; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType ==
"Machine" ) ) )}; peer=<10.150.94.40:43621>; projection={}

==> /var/log/condor/NegotiatorLog <==
14/04/17 11:07:08  Sorting 3 ads ...
14/04/17 11:07:08Got ads: 3 public and 1 private
14/04/17 11:07:08Public ads include 1 submitter, 1 startd
14/04/17 11:07:08Phase 2:  Performing accounting ...
14/04/17 11:07:08 negotiateWithGroup resources used scheddAds length 1 
14/04/17 11:07:08---------- Finished Negotiation Cycle ----------

worker node:

==> /var/log/condor/StartdLog <==
14/04/17 11:06:08Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
14/04/17 11:06:08Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0011s
14/04/17 11:06:08Received TCP command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:43575>, access level DAEMON
14/04/17 11:06:08Calling HandleReq <command_request_claim> (0) for command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:43575>
14/04/17 11:06:28condor_read(): timeout reading 5 bytes from <10.150.94.40:43575>.
14/04/17 11:06:28IO: Failed to read packet header
14/04/17 11:06:28Can't read ClaimId
14/04/17 11:06:28Return from HandleReq <command_request_claim> (handler: 20.020s, sec: 0.001s, payload: 0.000s)
14/04/17 11:06:28Received UDP command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:41744>, access level NEGOTIATOR
14/04/17 11:06:28Calling HandleReq <command_match_info> (0) for command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:41744>
14/04/17 11:06:28slot1: match_info called
14/04/17 11:06:28slot1: Received match <10.150.96.2:43820>#1395390192#3763#...
14/04/17 11:06:28slot1: State change: match notification protocol successful
14/04/17 11:06:28slot1: Changing state: Unclaimed -> Matched
14/04/17 11:06:28Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.000s, payload: 0.000s)


Um, the culprit seems to be the "Request was NOT accepted for claim..." message.
Asking the Big Search Engine for help, I got pointed to ALLOW_* settings, so I 
checked these:

root@v9602:~# condor_config_val -dump | grep NETW
NETWORK_INTERFACE = 10.150.96.2
root@v9602:~# condor_config_val -dump | grep ALLOW
ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
ALLOW_CLIENT = *
ALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_READ = 10.150.0.0/16 #172.16.0.0/12
ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_VM_CRUFT = False
ALLOW_WRITE = 10.150.0.0/16 #172.16.0.0/12
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
STARTER_ALLOW_RUNAS_OWNER = True

(similar/same on master node)

All *_DOMAIN settings are identical as well.

Running out of ideas right now...

> Any suggestions where to look next?

- S