Yes, this is a sort of race, but one that is relatively harmless: it slows things down a bit but doesn't result in any long-term harm.

What's happening is that the SCHEDD gets a match and contacts the STARTD to claim the slot, then it goes back to doing other processing (perhaps handling a condor_q request, or doing some garbage collection in the job queue). If it doesn't get back around to continuing the conversation with the STARTD to complete the claim within 20 seconds, then that claim is lost, the slot goes back into the idle state, and it has to be matched again in the next negotiation cycle.
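To make the timing concrete, here is a toy sketch (plain Python, not HTCondor's actual code) of a slot that holds a claim only until a deadline; the 20-second constant simply mirrors the window described above:

    # Toy model of the claim race: if the schedd never comes back to activate
    # the claim within the deadline, the slot drops it and returns to idle.
    import threading

    CLAIM_DEADLINE = 20  # seconds, mirroring the window described above

    class ToySlot:
        def __init__(self):
            self.state = "Unclaimed"
            self._timer = None

        def request_claim(self):
            # Step 1: the schedd asks for the slot; the slot reserves itself.
            self.state = "Claimed"
            self._timer = threading.Timer(CLAIM_DEADLINE, self._abandon_claim)
            self._timer.start()

        def activate_claim(self):
            # Step 2: the schedd returns with the job in time; the claim sticks.
            if self.state == "Claimed":
                self._timer.cancel()
                self.state = "Claimed/Busy"

        def _abandon_claim(self):
            # The schedd was too busy to come back; the slot returns to the pool
            # and has to be matched again in a later negotiation cycle.
            self.state = "Unclaimed"

In the failing case below, the claim request arrives at 11:23:11 but the activation never does, so 20 seconds later the slot gives up and the match has to be redone.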
Adding D_FULLDEBUG to the STARTD's log will slow the STARTD down somewhat, giving the SCHEDD a little more time to make the claim, but the real problem here is that the SCHEDD is heavily loaded and not able to process all of the matches it gets the first time it gets them. You could treat this as just part of doing business, or you can look into what is slowing down the SCHEDD and try to address that. Try running

    condor_status -schedd -direct <schedd-name> -long -statistics ALL:2 > schedd.ad

Have a look at the RecentDaemonCoreDutyCycle attribute in schedd.ad. If it is more than about 0.95, the SCHEDD is overloaded. Also make sure that the attributes that start with SelfMonitor are relatively sane; in particular, is the SCHEDD using more memory than the machine has? You can also look at the statistics in schedd.ad for operations that are taking more than a few seconds: look for attributes matching *RuntimeMax that are > 5. These would be operations that sometimes take more than 5 seconds, and during that time the SCHEDD will not be doing anything else (like finishing a claim).

-tj
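A small script can make those checks on schedd.ad easier to repeat; here is a rough sketch in plain Python (the 0.95 and 5-second thresholds are just the rules of thumb above, and the exact attribute prefixes may vary slightly by version):

    # check_schedd_ad.py - scan the "attr = value" lines produced by:
    #   condor_status -schedd -direct <schedd-name> -long -statistics ALL:2 > schedd.ad
    import re
    import sys

    path = sys.argv[1] if len(sys.argv) > 1 else "schedd.ad"
    attrs = {}
    with open(path) as f:
        for line in f:
            m = re.match(r"\s*(\S+)\s*=\s*(.+)", line)
            if m:
                attrs[m.group(1)] = m.group(2).strip()

    # Duty cycle above ~0.95 suggests the SCHEDD's event loop is saturated.
    print("RecentDaemonCoreDutyCycle =", attrs.get("RecentDaemonCoreDutyCycle"))

    # Self-monitoring attributes (memory use, etc.); some versions publish
    # these with a MonitorSelf* prefix rather than SelfMonitor*.
    for name in sorted(attrs):
        if name.startswith(("SelfMonitor", "MonitorSelf")):
            print(name, "=", attrs[name])

    # Operations whose worst-case runtime exceeded 5 seconds.
    for name in sorted(attrs):
        if name.endswith("RuntimeMax"):
            try:
                if float(attrs[name]) > 5:
                    print("slow operation:", name, "=", attrs[name])
            except ValueError:
                pass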
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Schwarzfischer

Hi everyone,

We are currently running Condor 8.4.9 on a Windows server with Windows 7 clients. It occasionally happens that some jobs are not started correctly due to timeout problems. We really don't understand why and when this happens, and we hope you can help us find the problem. It seems to happen less often when setting STARTD_DEBUG to FULLDEBUG level on the client. Does this hint at some race condition?

What happens when it doesn't work is the following:
- The job is correctly matched, the client is set to Claimed, and the job is listed as running.
- After 20 seconds a timeout occurs on a port that I did not find in any of the server log files.
- The job is preempted/vacated.
StartLog of the client:

03/09/17 11:23:11 slot1: Schedd addr = <10.10.11.116:49578?addrs=10.10.11.116-49578>
03/09/17 11:23:11 slot1: Alive interval = 300
03/09/17 11:23:11 slot1: Received ClaimId from schedd (<10.10.40.23:49295>#1489054871#1#...)
03/09/17 11:23:11 slot1: Rank of this claim is: 0.000000
03/09/17 11:23:11 slot1: Request accepted.
03/09/17 11:23:11 IPVERIFY: checking server.mydomain.net against 10.10.11.116
03/09/17 11:23:11 IPVERIFY: matched 10.10.11.116 to 10.10.11.116
03/09/17 11:23:11 IPVERIFY: ip found is 1
03/09/17 11:23:11 slot1: Remote owner is michael.name@xxxxxxxxxxxx
03/09/17 11:23:11 slot1: State change: claiming protocol successful
03/09/17 11:23:11 slot1: Changing state: Owner -> Claimed
03/09/17 11:23:11 slot1: Started ClaimLease timer (12) w/ 1800 second lease duration
03/09/17 11:23:31 condor_read(): timeout reading 5 bytes from <10.10.11.116:49768>.
03/09/17 11:23:31 IO: Failed to read packet header
03/09/17 11:23:31 Can't read ClaimId
03/09/17 11:23:31 Trying to update collector <10.10.11.116:9618>
03/09/17 11:23:31 Attempting to send update via TCP to collector <10.10.11.116:9618>
03/09/17 11:23:31 slot1: Sent update to 1 collector(s)
03/09/17 11:23:53 slot1: Called deactivate_claim()
03/09/17 11:23:53 slot1: State change: received RELEASE_CLAIM command
03/09/17 11:23:53 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/09/17 11:23:53 slot1: Canceled ClaimLease timer (12)
03/09/17 11:23:53 slot1: State change: No preempting claim, returning to owner
03/09/17 11:23:53 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/09/17 11:23:57 Trying to update collector <10.10.11.116:9618>
03/09/17 11:23:57 Attempting to send update via TCP to collector <10.10.11.116:9618>
03/09/17 11:23:57 slot1: Sent update to 1 collector(s)

ShadowLog of the server:

03/09/17 11:23:12 ******************************************************
03/09/17 11:23:12 ** condor_shadow (CONDOR_SHADOW) STARTING UP
03/09/17 11:23:12 ** C:\condor\bin\condor_shadow.exe
03/09/17 11:23:12 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
03/09/17 11:23:12 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
03/09/17 11:23:12 ** $CondorVersion: 8.4.9 Sep 29 2016 BuildID: 382747 $
03/09/17 11:23:12 ** $CondorPlatform: x86_64_Windows8 $
03/09/17 11:23:12 ** PID = 3612
03/09/17 11:23:12 ** Log last touched 3/9 10:36:14
03/09/17 11:23:12 ******************************************************
03/09/17 11:23:12 Using config source: \\server\Condor\config\condor_config
03/09/17 11:23:12 Using local config sources:
03/09/17 11:23:12    C:\QueueConfig\local\condor_config.local
03/09/17 11:23:12 config Macros = 109, Sorted = 109, StringBytes = 10628, TablesBytes = 896
03/09/17 11:23:12 CLASSAD_CACHING is OFF
03/09/17 11:23:12 Daemon Log is logging: D_ALWAYS D_ERROR
03/09/17 11:23:12 Daemoncore: Listening at <0.0.0.0:49761> on TCP (ReliSock) and UDP (SafeSock).
03/09/17 11:23:12 DaemonCore: command socket at <10.10.11.116:49761?addrs=10.10.11.116-49761>
03/09/17 11:23:12 DaemonCore: private command socket at <10.10.11.116:49761?addrs=10.10.11.116-49761>
03/09/17 11:23:12 Initializing a VANILLA shadow for job 900.0
03/09/17 11:23:53 (900.0) (3612): condor_write(): Socket closed when trying to write 4096 bytes to startd client01.mydomain.net, fd is 24
03/09/17 11:23:53 (900.0) (3612): Buf::write(): condor_write() failed
03/09/17 11:23:53 (900.0) (3612): client01.mydomain.net: DCStartd::activateClaim: Failed to send job ClassAd to the startd
03/09/17 11:23:53 (900.0) (3612): Job 900.0 is being evicted from client01.mydomain.net
03/09/17 11:23:53 (900.0) (3612): logEvictEvent with unknown reason (108), not logging.
03/09/17 11:23:53 (900.0) (3612): **** condor_shadow (condor_SHADOW) pid 3612 EXITING WITH STATUS 108

I am not sure where this port 49768 points to; I could not find any log on the server listing this port number. I am not even sure whether I should search in the ShadowLog
or somewhere else. Should I increase the log level of any of the server daemons?

In the old mails of your list I found two similar problems, both hinting at network problems. However, the fact that sometimes everything just works smoothly makes me wonder whether this could really be a problem with our network, and I am not sure how to address it. The problem happens in the same way on several clients.
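For completeness, one way to search every daemon log on the server for that port number is a small script along these lines (the log directory below is just an example path; adjust it to the local LOG setting):

    # find_port.py - scan every file in the server's LOG directory for a port number.
    import os
    import sys

    log_dir = sys.argv[1] if len(sys.argv) > 1 else r"C:\condor\log"
    needle = sys.argv[2] if len(sys.argv) > 2 else "49768"

    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, errors="replace") as f:
                    for lineno, line in enumerate(f, 1):
                        if needle in line:
                            print(f"{path}:{lineno}: {line.rstrip()}")
            except OSError:
                pass

In our case no server-side log mentions that port, which is why we are unsure where to look next.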
We are thankful for any comments or ideas!

Thanks and best,
Michael