I just figured out that something is going wrong on the Windows node. The StarterLog file shows the following; this text is appended to the log every single minute:

03/06/19 17:57:02 (pid:5900) ******************************************************
03/06/19 17:57:02 (pid:5900) ** condor_starter (CONDOR_STARTER) STARTING UP
03/06/19 17:57:02 (pid:5900) ** C:\condor\bin\condor_starter.exe
03/06/19 17:57:02 (pid:5900) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
03/06/19 17:57:02 (pid:5900) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/06/19 17:57:02 (pid:5900) ** $CondorVersion: 8.8.1 Feb 18 2019 BuildID: 461773 $
03/06/19 17:57:02 (pid:5900) ** $CondorPlatform: x86_64_Windows10 $
03/06/19 17:57:02 (pid:5900) ** PID = 5900
03/06/19 17:57:02 (pid:5900) ** Log last touched 3/6 17:56:02
03/06/19 17:57:02 (pid:5900) ******************************************************
03/06/19 17:57:02 (pid:5900) Using config source: C:\condor\condor_config
03/06/19 17:57:02 (pid:5900) Using local config sources:
03/06/19 17:57:02 (pid:5900) C:\condor\condor_config.local
03/06/19 17:57:02 (pid:5900) config Macros = 48, Sorted = 48, StringBytes = 1055, TablesBytes = 1776
03/06/19 17:57:02 (pid:5900) CLASSAD_CACHING is OFF
03/06/19 17:57:02 (pid:5900) Daemon Log is logging: D_ALWAYS D_ERROR
03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: listener already created.
03/06/19 17:57:02 (pid:5900) DaemonCore: command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>
03/06/19 17:57:02 (pid:5900) DaemonCore: private command socket at <10.211.55.11:9618?addrs=10.211.55.11-9618&noUDP&sock=5316_87ec_125>
03/06/19 17:57:02 (pid:5900) GLEXEC_JOB not supported on this platform; ignoring
03/06/19 17:57:02 (pid:5900) Communicating with shadow <127.0.0.1:9618?addrs=127.0.0.1-9618+[fdb2-2c26-f4e4-0-21c-42ff-fe97-e2c1]-9618&noUDP&sock=1327422_a214_245>
03/06/19 17:57:02 (pid:5900) Submitting machine is "htcondor.shared"
03/06/19 17:57:02 (pid:5900) setting the orig job name in starter
03/06/19 17:57:02 (pid:5900) setting the orig job iwd in starter
03/06/19 17:57:02 (pid:5900) Chirp config summary: IO false, Updates false, Delayed updates true.
03/06/19 17:57:02 (pid:5900) Initialized IO Proxy.
03/06/19 17:57:02 (pid:5900) Setting resource limits not implemented!
03/06/19 17:57:02 (pid:5900) condor_write(): Socket closed when trying to write 39 bytes to daemon at <127.0.0.1:9618>, fd is 596, errno=10054
03/06/19 17:57:02 (pid:5900) Buf::write(): condor_write() failed
03/06/19 17:57:02 (pid:5900) ERROR "Could not initiate file transfer" at line 2412 in file C:\condor\execute\dir_9076\sources\src\condor_starter.V6.1\jic_shadow.cpp
03/06/19 17:57:02 (pid:5900) ShutdownFast all jobs.
03/06/19 17:57:02 (pid:5900) Failed to open '.update.ad' to read update ad: No such file or directory (2).
03/06/19 17:57:02 (pid:5900) condor_read() failed: recv(fd=620) returned -1, errno = 10054 , reading 5 bytes from <10.211.55.10:26601>.
03/06/19 17:57:02 (pid:5900) IO: Failed to read packet header
03/06/19 17:57:02 (pid:5900) Lost connection to shadow, waiting 2400 secs for reconnect
03/06/19 17:57:02 (pid:5900) All jobs have exited... starter exiting
03/06/19 17:57:02 (pid:5900) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0
03/06/19 17:57:02 (pid:5900) **** condor_starter (condor_STARTER) pid 5900 EXITING WITH STATUS 0

Quick googling led me to this bug report, which was closed in 2016 as WONTFIX. I am not sure whether it is related to the malfunction I observe, but the log looks similar. Any ideas?
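
Since the same block is written every minute, it may help to reduce the log to just the lines that look like failures. The following is a minimal sketch in Python, not an HTCondor tool: it assumes every entry has the "MM/DD/YY HH:MM:SS (pid:N) message" shape seen in the excerpt above, and the function name `failure_lines` and the keyword list `FAILURE_HINTS` are hypothetical choices made for illustration.

```python
import re

# Assumed entry shape, taken from the excerpt: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")

# Hypothetical keyword list, based only on the messages in the excerpt.
# 'ERROR "' (with the quote) avoids matching the "D_ALWAYS D_ERROR" banner line.
FAILURE_HINTS = ('ERROR "', "failed", "Failed", "Lost connection", "Socket closed")


def failure_lines(lines):
    """Return (timestamp, pid, message) tuples for entries that look like failures."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match is None:
            continue  # skip anything that is not a regular log entry
        timestamp, pid, message = match.groups()
        if any(hint in message for hint in FAILURE_HINTS):
            hits.append((timestamp, int(pid), message))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python filter_starterlog.py <path to StarterLog>
    with open(sys.argv[1], errors="replace") as log:
        for timestamp, pid, message in failure_lines(log):
            print(timestamp, pid, message)
```

Run against the excerpt, this would keep the `condor_write()` / `condor_read()` entries with errno 10054 (the Winsock code for a connection reset by the peer), the "Could not initiate file transfer" error, and the "Lost connection to shadow" entry, while dropping the startup banner.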