Hello, I am trying to update an 8.8.x flock running on Ubuntu 20 to 23.0. I am not sure whether this is a new or a well-known issue, but there appears to be a problem with the HTCondor install for 23.0, and possibly for 10.x and 10.0 too.
I haven't tried 9.0, 9.x or 23.x. My 8.8.x installs are old, but I do not remember experiencing this issue when doing fresh 8.8.x installs.

As part of the testing, I tried upgrading directly from 8.8.x, but I also tried fresh installs using both the get_condor script and a manual install. In both of the latter cases, the previous install was stripped using the command line from the get_condor script:

apt-get -y remove --purge htcondor && apt-get -y autoremove --purge && rm -fr /etc/condor

(basically what get_condor suggests to do to remove older installs).

On Ubuntu 20, this issue seems to appear every time when installing with get_condor (minicondor or other roles) or manually with apt-get install htcondor or apt-get install minicondor. When installing on a condor-free system, /etc/condor is recreated, but it seems that /var/lib/condor, the parent of the execute directory, is not recreated by the installation process.
This issue does not occur when updating from 8.8.x, of course, as the /var/lib/condor directory already exists. This path is defined in the default /etc/condor/condor_config that is deployed by the installation:

LOCAL_DIR = /var
[…]
EXECUTE = $(LOCAL_DIR)/lib/condor/execute

This issue makes startd fail in a loop on execute machines (see logs below) when starting condor, probably because the condor user cannot create a directory in /var/lib.

The fix for this is for root to create the missing directory /var/lib/condor; startd will recreate the execute sub-directory, belonging to the condor user, the next time condor is restarted.

$ cd /var/lib
$ mkdir condor
$ chmod 755 condor

In /var/log/condor/MasterLog:

10/05/23 10:09:11 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 69386
10/05/23 10:09:11 Daemons::StartAllDaemons all daemons were started
10/05/23 10:09:14 The STARTD (pid 69386) exited with status 4
10/05/23 10:09:14 Sending obituary for "/usr/sbin/condor_startd"
10/05/23 10:09:14 restarting /usr/sbin/condor_startd in 10 seconds
[…] loops from here

In /var/log/condor/StartLog:

10/05/23 10:10:42 ******************************************************
10/05/23 10:10:42 ** condor_startd (CONDOR_STARTD) STARTING UP
10/05/23 10:10:42 ** /usr/sbin/condor_startd
10/05/23 10:10:42 ** SubsystemInfo: name=STARTD type=STARTD(6) class=DAEMON(1)
10/05/23 10:10:42 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
10/05/23 10:10:42 ** $CondorVersion: 23.0.0 2023-09-29 BuildID: 678686 PackageID: 23.0.0-1 $
10/05/23 10:10:42 ** $CondorPlatform: X86_64-Ubuntu_20.04 $
10/05/23 10:10:42 ** PID = 70350
10/05/23 10:10:42 ** Log last touched 10/5 10:10:17
10/05/23 10:10:42 ******************************************************
10/05/23 10:10:42 Using config source: /etc/condor/condor_config
10/05/23 10:10:42 Using local config sources:
10/05/23 10:10:42 /etc/condor/config.d/00-htcondor-9.0.config
10/05/23 10:10:42 /etc/condor/condor_config.local
10/05/23 10:10:42 config Macros = 93, Sorted = 93, StringBytes = 2628, TablesBytes = 3404
10/05/23 10:10:42 CLASSAD_CACHING is ENABLED
10/05/23 10:10:42 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
10/05/23 10:10:42 SharedPortEndpoint: waiting for connections to named socket startd_69340_0d0f
10/05/23 10:10:42 DaemonCore: command socket at <192.168.8.246:9618?addrs=192.168.8.246-9618&alias=suvofpcand20.corp.spc.int&noUDP&sock=startd_69340_0d0f>
10/05/23 10:10:42 DaemonCore: private command socket at <192.168.8.246:9618?addrs=192.168.8.246-9618&alias=suvofpcand20.corp.spc.int&noUDP&sock=startd_69340_0d0f>
10/05/23 10:10:45 VM universe will be tested to check if it is available
10/05/23 10:10:45 History file rotation is enabled.
10/05/23 10:10:45 Maximum history file size is: 20971520 bytes
10/05/23 10:10:45 Number of rotated history files is: 2
10/05/23 10:10:45 Startd will not enforce disk limits via logical volume management.
10/05/23 10:10:45 Failed to stat /var/lib/condor/execute: (errno 2) No such file or directory
10/05/23 10:10:45 ERROR "Error accessing execute directory /var/lib/condor/execute specified in the configuration setting SLOT1_EXECUTE: (errno=2) No such file or directory" at line 78 in file /var/lib/condor/execute/slot1/dir_3182108/userdir/build-I2xw6a/condor-23.0.0/src/condor_startd.V6/slot_builder.cpp
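
For anyone hitting the same problem, the workaround described above (root creating /var/lib/condor with mode 755) could be wrapped in a small shell function. This is only a sketch under a few assumptions: the function name ensure_condor_dir is made up for illustration, the default path assumes the stock LOCAL_DIR = /var from the default condor_config, and it has to be run as root on a real system.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): make sure the parent of the
# execute directory exists so that condor_startd can create "execute" in it.
# The default path assumes LOCAL_DIR = /var as in the stock condor_config.
ensure_condor_dir() {
    dir="${1:-/var/lib/condor}"

    # Create the directory only if it is missing (the fresh-install case).
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir" || return 1
    fi

    # Same permissions as in the manual fix: rwxr-xr-x.
    chmod 755 "$dir"
}

# Example call, as root (left commented out so sourcing this file is harmless):
# ensure_condor_dir /var/lib/condor
```

After running it, condor still needs to be restarted for startd to recreate the execute sub-directory. On a systemd-based install I would expect something like systemctl restart condor to do that, but the unit name is an assumption on my side.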