Hi all,

I just stumbled over an issue with a freshly spawned test cluster, where the collector died regularly with status 4.

It seems the cause was a missing '/var/ViewHist' directory. After creating it (and also chowning it to the condor user...), the collector seems to be stable.

I guess this directory is the pool history directory (the viewhist files look like they keep some kind of pool statistics). While KEEP_POOL_HISTORY is enabled, POOL_HISTORY_DIR is not set. From the documentation I would have expected it to default to somewhere under /var/spool/.

Long story short: is it maybe a bug that the pool history default ends up directly under /var instead of /var/spool, or have I screwed up my config? ;)

Cheers,
Thomas

[MasterLog]
12/12/18 04:58:07 attempt to connect to <131.169.168.39:9618> failed: Connection refused (connect errno = 111).
12/12/18 04:58:07 ERROR: SECMAN:2003:TCP connection to collector dcache-dot1.desy.de:9618 failed.
12/12/18 04:58:07 Failed to start non-blocking update to <131.169.168.39:9618>.
12/12/18 04:58:18 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 114569
12/12/18 05:06:18 DefaultReaper unexpectedly called on pid 114569, status 1024.
12/12/18 05:06:18 The COLLECTOR (pid 114569) exited with status 4
12/12/18 05:06:18 Sending obituary for "/usr/sbin/condor_collector"
12/12/18 05:06:18 restarting /usr/sbin/condor_collector in 10 seconds
12/12/18 05:06:18 condor_write(): Socket closed when trying to write 1904 bytes to collector dcache-dot1.desy.de:9618, fd is 12
12/12/18 05:06:18 Buf::write(): condor_write() failed
12/12/18 05:06:18 attempt to connect to <131.169.168.39:9618> failed: Connection refused (connect errno = 111).

[CollectorLog]
12/12/18 04:57:51 StartdAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxx , 131.169.98.92 >"
12/12/18 04:57:51 StartdPvtAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxx , 131.169.98.92 >"
12/12/18 04:58:07 Accumulating data: Time=1544587087
12/12/18 04:58:07 Could not open data file /var/ViewHist/viewhist0.0.new for appending!!! errno=13
12/12/18 04:58:07 ERROR "Could not open data file appending!!!" at line 739 in file /slots/11/dir_3021763/userdir/.tmpyoMELi/BUILD/condor-8.7.10/src/condor_collector.V6/view_server.cpp
12/12/18 04:58:18 Setting maximum file descriptors to 10240.
12/12/18 04:58:18 ******************************************************
12/12/18 04:58:18 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
12/12/18 04:58:18 ** /usr/sbin/condor_collector

[ProcLog]
12/12/18 04:58:07 : PROC_FAMILY_KILL_FAMILY
12/12/18 04:58:07 : taking a snapshot...
12/12/18 04:58:07 : process 113437 (of family 113437) has exited
12/12/18 04:58:07 : ...snapshot complete
12/12/18 04:58:07 : sending signal 9 to family with root 113437
12/12/18 04:58:07 : PROC_FAMILY_UNREGISTER_FAMILY
12/12/18 04:58:07 : unregistering family with root pid 113437
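
For anyone reading the numbers in the logs above: a small Python sketch that decodes the raw wait status (1024) from the MasterLog and the errno (13) from the CollectorLog. It assumes a Linux host, and the helper names describe_wait_status and describe_errno are made up for this illustration; nothing here is part of HTCondor.

```python
import errno
import os

# Values copied from the log excerpts (assumption: Linux errno numbering).
RAW_WAIT_STATUS = 1024  # MasterLog: "DefaultReaper ... pid 114569, status 1024"
OPEN_ERRNO = 13         # CollectorLog: "Could not open data file ... errno=13"


def describe_wait_status(status):
    """Turn a raw POSIX wait status into a readable string (hypothetical helper)."""
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    return "unrecognised wait status %d" % status


def describe_errno(code):
    """Return the symbolic errno name plus the system's message (hypothetical helper)."""
    name = errno.errorcode.get(code, "unknown")
    return "%s (%s)" % (name, os.strerror(code))


if __name__ == "__main__":
    print(describe_wait_status(RAW_WAIT_STATUS))
    print(describe_errno(OPEN_ERRNO))
```

On Linux, 1024 decodes to a normal exit with code 4, which matches the "exited with status 4" line, and errno 13 is EACCES. A directory that does not exist at all would normally give ENOENT (2) instead, so this particular log line looks like it comes from a state where the directory existed but the condor user could not write to it.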