Hi HTCondor community,

I’m running into a problem where a bunch of Windows 7 execute nodes are not accepting jobs. I have a pool of about 50 machines (with 4 or 8 nodes each) and about half of them merrily accept jobs submitted from a Linux submit node. The other half do not.

Condor version is

CondorVersion: 8.2.1 Jun 27 2014 BuildID: 256063

These machines were all built from the same image and in all cases are running on their D: drives.

Any advice would be welcome! I tried restarting the service, restarting the machines, and restarting HTCondor on the central manager, all to no avail.

Thanks!
Mike Fienen
USGS Wisconsin Water Science Center
Middleton, WI USA

The cluster log on the submit machine is full of messages like this:

024 (1823.046.000) 10/09 08:07:50 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxx, rescheduling job

Then going to BLHBLAH34.gs.doi.net, I find this in the StartLog:

10/09/14 08:09:07 slot1: match_info called
10/09/14 08:09:07 slot1: Received match <xxx.xxx.xxx.134:64106>#1412826697#975#...
10/09/14 08:09:07 slot1: State change: match notification protocol successful
10/09/14 08:09:07 slot1: Changing state: Unclaimed -> Matched
10/09/14 08:09:07 slot1_1: New machine resource of type -1 allocated
10/09/14 08:09:07 slot1: Changing state: Matched -> Unclaimed
10/09/14 08:09:07 Setting up slot pairings
10/09/14 08:09:07 slot1_1: Request accepted.
10/09/14 08:09:07 slot1_1: Remote owner is mnfienen@xxxxxxxxxxxxxxxxxxxxxxxx
10/09/14 08:09:07 slot1_1: State change: claiming protocol successful
10/09/14 08:09:07 slot1_1: Changing state: Owner -> Claimed
10/09/14 08:09:07 slot1_1: Got activate_claim request from shadow (xxx.xxx.xxx.72)
10/09/14 08:09:07 slot1_1: Remote job ID is 1823.40
10/09/14 08:09:07 slot1_1: Got universe "VANILLA" (5) from request classad
10/09/14 08:09:07 slot1_1: State change: claim-activation protocol successful
10/09/14 08:09:07 slot1_1: Changing activity: Idle -> Busy
10/09/14 08:09:08 condor_read() failed: recv(fd=712) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:49722>.
10/09/14 08:09:08 IO: Failed to read packet header
10/09/14 08:09:08 Starter pid 3732 exited with status -1073740940
10/09/14 08:09:08 slot1_1: State change: starter exited
10/09/14 08:09:08 slot1_1: Changing activity: Busy -> Idle
10/09/14 08:09:08 Aborting CA_LOCATE_STARTER
10/09/14 08:09:08 ClaimId (<xxx.xxx.xxx.134:64106>#1412826697#975#40425bd1402e06a6391f4fdec6b771e1e7daa7b2) and GlobalJobId (BLAHBLAHM000.er.usgs.gov#1823.40#1412854596 ) not found
10/09/14 08:09:08 slot1_1: State change: received RELEASE_CLAIM command
10/09/14 08:09:08 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
10/09/14 08:09:08 slot1_1: State change: No preempting claim, returning to owner
10/09/14 08:09:08 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
10/09/14 08:09:08 slot1_1: State change: IS_OWNER is false
10/09/14 08:09:08 slot1_1: Changing state: Owner -> Unclaimed
10/09/14 08:09:08 slot1_1: Changing state: Unclaimed -> Delete
10/09/14 08:09:08 slot1_1: Resource no longer needed, deleting
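
A note on reading the starter exit status in the StartLog above: Windows process exit codes are often NTSTATUS values, and the log prints them as signed 32-bit integers, which makes them hard to look up. The following is only a minimal sketch of the conversion to the usual unsigned hex form. The helper name `to_ntstatus_hex` is made up for illustration and is not part of HTCondor.

```python
def to_ntstatus_hex(exit_status: int) -> str:
    """Reinterpret a signed 32-bit exit status as unsigned and format it as hex.

    Hypothetical helper for illustration; not an HTCondor function.
    """
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit representation.
    return f"0x{exit_status & 0xFFFFFFFF:08X}"


# Exit status reported for the starter in the StartLog excerpt
print(to_ntstatus_hex(-1073740940))  # 0xC0000374
```

Microsoft's NTSTATUS table lists 0xC0000374 as STATUS_HEAP_CORRUPTION, and errno 10054 is the Winsock code WSAECONNRESET. Together these may suggest that the starter process is crashing right after activation rather than rejecting the job, though the log alone does not show why.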