
[HTCondor-users] Match timed out - jobs don't run?



I'm running into problems with a freshly set-up pool. Since I rewrote
all the config, there might be an ugly typo hiding somewhere.

A user submitted a job cluster, but the jobs won't run. On the central
manager machine, I can see the jobs being matched (actually,
condor_status -t shows more machines in "Matched" state than there
are jobs in the queue).
On a randomly selected worker node, the StartdLog consists of
lines like the following:

...
14/04/16 16:47:17 Can't read ClaimId
14/04/16 16:47:17 Return from HandleReq <command_request_claim> (handler: 20.020s, sec: 0.000s, payload: 0.000s)
14/04/16 16:56:57 slot1: State change: match timed out
14/04/16 16:56:57 slot1: Changing state: Matched -> Owner
14/04/16 16:56:57 slot1: State change: IS_OWNER is false
14/04/16 16:56:57 slot1: Changing state: Owner -> Unclaimed
14/04/16 16:57:57 Received UDP command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:51160>, access level NEGOTIATOR
14/04/16 16:57:57 Calling HandleReq <command_match_info> (0) for command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:51160>
14/04/16 16:57:57 slot1: match_info called
14/04/16 16:57:57 slot1: Received match <10.150.97.34:51117>#1395391320#3565#...
14/04/16 16:57:57 slot1: State change: match notification protocol successful
14/04/16 16:57:57 slot1: Changing state: Unclaimed -> Matched
14/04/16 16:57:57 Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
14/04/16 16:57:57 Received TCP command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:40959>, access level DAEMON
14/04/16 16:57:57 Calling HandleReq <command_request_claim> (0) for command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:40959>
14/04/16 16:58:17 condor_read(): timeout reading 5 bytes from <10.150.94.40:40959>.
14/04/16 16:58:17 IO: Failed to read packet header
14/04/16 16:58:17 Can't read ClaimId
14/04/16 16:58:17 Return from HandleReq <command_request_claim> (handler: 20.020s, sec: 0.000s, payload: 0.000s)
14/04/16 17:07:57 slot1: State change: match timed out
14/04/16 17:07:57 slot1: Changing state: Matched -> Owner
14/04/16 17:07:57 slot1: State change: IS_OWNER is false
14/04/16 17:07:57 slot1: Changing state: Owner -> Unclaimed
14/04/16 17:08:59 Received UDP command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:53640>, access level NEGOTIATOR
14/04/16 17:08:59 Calling HandleReq <command_match_info> (0) for command 440 (MATCH_INFO) from unauthenticated@unmapped <10.150.94.40:53640>
14/04/16 17:08:59 slot1: match_info called
14/04/16 17:08:59 slot1: Received match <10.150.97.34:51117>#1395391320#3567#...
14/04/16 17:08:59 slot1: State change: match notification protocol successful
14/04/16 17:08:59 slot1: Changing state: Unclaimed -> Matched
14/04/16 17:08:59 Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
14/04/16 17:08:59 Received TCP command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:44271>, access level DAEMON
14/04/16 17:08:59 Calling HandleReq <command_request_claim> (0) for command 442 (REQUEST_CLAIM) from unauthenticated@unmapped <10.150.94.40:44271>
14/04/16 17:09:19 condor_read(): timeout reading 5 bytes from <10.150.94.40:44271>.
14/04/16 17:09:19 IO: Failed to read packet header
14/04/16 17:09:19 Can't read ClaimId
14/04/16 17:09:19 Return from HandleReq <command_request_claim> (handler: 20.020s, sec: 0.000s, payload: 0.000s)
...
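
To check whether every match ends the same way across the whole log, something
like the following could tally the slot state transitions and the "Can't read
ClaimId" failures. This is only a rough sketch: the timestamp layout and the
message wording are assumed from the excerpt above, and summarize() is a
hypothetical helper, not part of HTCondor.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt: "DD/MM/YY HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
# Assumed wording: "slot1: Changing state: Matched -> Owner"
TRANSITION_RE = re.compile(r"^(slot[\w.]+): Changing state: (\w+) -> (\w+)$")


def summarize(lines):
    """Count state transitions and failed claim reads in StartdLog lines."""
    transitions = Counter()
    claim_failures = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a log line
        message = m.group(2)
        t = TRANSITION_RE.match(message)
        if t:
            transitions[(t.group(2), t.group(3))] += 1
        elif message == "Can't read ClaimId":
            claim_failures += 1
    return transitions, claim_failures
```

It can be fed an open file, e.g. summarize(open(path)) with path pointing at
the StartdLog of the worker node. If the number of Unclaimed -> Matched
transitions roughly equals the number of claim failures, every match is
failing at the REQUEST_CLAIM step.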

Any suggestions on where to look next?

- S