Hi everyone, We are currently running condor 8.4.9. on a Windows server with Windows 7 clients. It occasionally happens that some jobs may not be correctly started due to timeout problems. We really don’t understand why and when this happens and we hope that
you could help us finding the problem. It seems to happen less often when setting STARTD_DEBUG to FULLDEBUG level on the client. Does this hint to some race condition? So what happens if it doesn’t work is the following: -
The job is correctly matched, the client is set to claimed. The job is listed as running. -
After 20 seconds a timeout happens to a port, which I did not find in any of the server log files. -
Job is preempted/vacated
LogFile of Client StartLog: 03/09/17 11:23:11 slot1: Schedd addr = <10.10.11.116:49578?addrs=10.10.11.116-49578> 03/09/17 11:23:11 slot1: Alive interval = 300 03/09/17 11:23:11 slot1: Received ClaimId from schedd (<10.10.40.23:49295>#1489054871#1#...) 03/09/17 11:23:11 slot1: Rank of this claim is: 0.000000 03/09/17 11:23:11 slot1: Request accepted. 03/09/17 11:23:11 IPVERIFY: checking server.mydomain.net against 10.10.11.116 03/09/17 11:23:11 IPVERIFY: matched 10.10.11.116 to 10.10.11.116 03/09/17 11:23:11 IPVERIFY: ip found is 1 03/09/17 11:23:11 slot1: Remote owner is michael.name@xxxxxxxxxxxx 03/09/17 11:23:11 slot1: State change: claiming protocol successful 03/09/17 11:23:11 slot1: Changing state: Owner -> Claimed 03/09/17 11:23:11 slot1: Started ClaimLease timer (12) w/ 1800 second lease duration 03/09/17 11:23:31 condor_read(): timeout reading 5 bytes from <10.10.11.116:49768>. 03/09/17 11:23:31 IO: Failed to read packet header 03/09/17 11:23:31 Can't read ClaimId 03/09/17 11:23:31 Trying to update collector <10.10.11.116:9618> 03/09/17 11:23:31 Attempting to send update via TCP to collector <10.10.11.116:9618> 03/09/17 11:23:31 slot1: Sent update to 1 collector(s) 03/09/17 11:23:53 slot1: Called deactivate_claim() 03/09/17 11:23:53 slot1: State change: received RELEASE_CLAIM command 03/09/17 11:23:53 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating 03/09/17 11:23:53 slot1: Canceled ClaimLease timer (12) 03/09/17 11:23:53 slot1: State change: No preempting claim, returning to owner 03/09/17 11:23:53 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle 03/09/17 11:23:57 Trying to update collector <10.10.11.116:9618> 03/09/17 11:23:57 Attempting to send update via TCP to collector <10.10.11.116:9618> 03/09/17 11:23:57 slot1: Sent update to 1 collector(s) ShadowLog of server: 03/09/17 11:23:12 ****************************************************** 03/09/17 11:23:12 ** condor_shadow (CONDOR_SHADOW) STARTING UP 03/09/17 11:23:12 ** C:\condor\bin\condor_shadow.exe 03/09/17 11:23:12 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1) 03/09/17 11:23:12 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON 03/09/17 11:23:12 ** $CondorVersion: 8.4.9 Sep 29 2016 BuildID: 382747 $ 03/09/17 11:23:12 ** $CondorPlatform: x86_64_Windows8 $ 03/09/17 11:23:12 ** PID = 3612 03/09/17 11:23:12 ** Log last touched 3/9 10:36:14 03/09/17 11:23:12 ****************************************************** 03/09/17 11:23:12 Using config source: \\server\Condor\config\condor_config 03/09/17 11:23:12 Using local config sources:
03/09/17 11:23:12 C:\QueueConfig\local\condor_config.local 03/09/17 11:23:12 config Macros = 109, Sorted = 109, StringBytes = 10628, TablesBytes = 896 03/09/17 11:23:12 CLASSAD_CACHING is OFF 03/09/17 11:23:12 Daemon Log is logging: D_ALWAYS D_ERROR 03/09/17 11:23:12 Daemoncore: Listening at <0.0.0.0:49761> on TCP (ReliSock) and UDP (SafeSock). 03/09/17 11:23:12 DaemonCore: command socket at <10.10.11.116:49761?addrs=10.10.11.116-49761> 03/09/17 11:23:12 DaemonCore: private command socket at <10.10.11.116:49761?addrs=10.10.11.116-49761> 03/09/17 11:23:12 Initializing a VANILLA shadow for job 900.0 03/09/17 11:23:53 (900.0) (3612): condor_write(): Socket closed when trying to write 4096 bytes to startd client01.mydomain.net, fd is 24 03/09/17 11:23:53 (900.0) (3612): Buf::write(): condor_write() failed 03/09/17 11:23:53 (900.0) (3612): client01.mydomain.net: DCStartd::activateClaim: Failed to send job ClassAd to the startd 03/09/17 11:23:53 (900.0) (3612): Job 900.0 is being evicted from client01.mydomain.net 03/09/17 11:23:53 (900.0) (3612): logEvictEvent with unknown reason (108), not logging. 03/09/17 11:23:53 (900.0) (3612): **** condor_shadow (condor_SHADOW) pid 3612 EXITING WITH STATUS 108 I am not sure where this port 49768 is heading to? I could not find any log on the server listing this port number. I am not even sure if I should search in ShadowLog
or somewhere else? Should I increase log level of any of the server daemons? In the old mails of your list I found two similar problems all hinting to some network problems. However, the fact that sometimes everything just works smoothly makes
me wonder if this could really be a problem of our network? Furthermore, I am not sure how to address this? This problem happens on several clients similarly.
We are thankful for any comments or ideas! Thanks,
Best, Michael |