Hi,

I hope everyone has fully recovered from HTCondor Week :)

Unfortunately, this problem only came up after yesterday's Q&A session had finished, so I am asking it in the usual place.

In our new configuration (config summary at the end) we try to enforce preemption based on priority (and maybe other factors in the future). Right now, however, jobs are killed earlier than MaxJobRetirementTime allows, and we currently have no idea what is causing this.

The lines copied below seem to indicate networking problems, but I do not see any hints of that in the switches or on the nodes. The submit machine in particular is quite busy, but the load should not overwhelm it (the system is mostly idle, with about 5% wait, 10% system, and mostly pegasus monitor processes in D state).

The service should have enough file handles:

condor.service - Condor Distributed High-Throughput-Computing
   Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system.control/condor.service.d
           └─50-TasksMax.conf
           /etc/systemd/system/condor.service.d
           └─ensure-run-dir.conf, limits.conf
   Active: active (running) since Thu 2020-05-21 08:30:44 UTC; 8h ago
  Process: 2440231 ExecStartPre=/bin/mkdir --parent --mode=0755 /run/condor (code=exited, status=0/SUCCESS)
  Process: 2440232 ExecStartPre=/bin/chown condor:condor /run/condor (code=exited, status=0/SUCCESS)
 Main PID: 2440233 (condor_master)
   Status: "All daemons are responding"
    Tasks: 34255 (limit: 131071)
   Memory: 76.7G
   CGroup: /system.slice/condor.service

Thus, I would be grateful for any hint where to look.

Cheers

Carsten

In the SchedLog we see these worrisome lines:

05/21/20 16:43:26 (2061018.0) (1646287): **** condor_shadow (condor_SHADOW) pid 1646287 EXITING WITH STATUS 108
05/21/20 16:43:26 (2061021.0) (1646309): **** condor_shadow (condor_SHADOW) pid 1646309 EXITING WITH STATUS 108
05/21/20 16:43:26 (2006521.0) (1624977): File transfer completed successfully.
05/21/20 16:43:26 (2001479.0) (1610449): File transfer completed successfully.
05/21/20 16:43:26 (2013966.0) (1637250): **** condor_shadow (condor_SHADOW) pid 1637250 EXITING WITH STATUS 108
05/21/20 16:43:26 (2054171.0) (1339388): condor_read(): Socket closed abnormally when trying to read 21 bytes from schedd at <10.20.30.17:14867>, errno=104 Connection reset by peer
05/21/20 16:43:26 (2054171.0) (1339388): SetEffectiveOwner(ahnitz) failed with errno=110: Connection timed out.
05/21/20 16:43:26 (2052187.0) (1313342): condor_read(): Socket closed abnormally when trying to read 21 bytes from schedd at <10.20.30.17:14867>, errno=104 Connection reset by peer
05/21/20 16:43:26 (2052187.0) (1313342): SetEffectiveOwner(ahnitz) failed with errno=110: Connection timed out.
05/21/20 16:43:26 (2031965.0) (1283878): condor_read(): Socket closed abnormally when trying to read 21 bytes from schedd at <10.20.30.17:14867>, errno=104 Connection reset by peer
05/21/20 16:43:26 (2031965.0) (1283878): SetEffectiveOwner(ahnitz) failed with errno=110: Connection timed out.
05/21/20 16:43:26 (2051766.0) (1308655): condor_read(): Socket closed abnormally when trying to read 21 bytes from schedd at <10.20.30.17:14867>, errno=104 Connection reset by peer
05/21/20 16:43:26 (2039452.0) (1293958): condor_read(): Socket closed abnormally when trying to read 21 bytes from schedd at <10.20.30.17:14867>, errno=104 Connection reset by peer

In the StarterLogs, these lines:

05/21/20 15:24:32 (pid:48660) Create_Process succeeded, pid=48814
05/21/20 15:24:40 (pid:48660) Failed to open '.update.ad' to read update ad: No such file or directory (2).
05/21/20 15:24:40 (pid:48660) Failed to open '.update.ad' to read update ad: No such file or directory (2).
05/21/20 16:45:03 (pid:48660) Connection to shadow may be lost, will test by sending whoami request.
05/21/20 16:45:03 (pid:48660) condor_write(): Socket closed when trying to write 37 bytes to <10.20.30.17:12065>, fd is 9
05/21/20 16:45:03 (pid:48660) Buf::write(): condor_write() failed
05/21/20 16:45:03 (pid:48660) i/o error result is 0, errno is 0
05/21/20 16:45:03 (pid:48660) Lost connection to shadow, waiting 2400 secs for reconnect
05/21/20 16:45:27 (pid:48660) Got SIGTERM. Performing graceful shutdown.
05/21/20 16:45:27 (pid:48660) ShutdownGraceful all jobs.
05/21/20 16:45:28 (pid:48660) Process exited, pid=48814, status=143
05/21/20 16:45:28 (pid:48660) Returning from CStarter::JobReaper()

Matching lines from the StartLog:

05/21/20 15:11:00 slot1_64: New machine resource of type -1 allocated
05/21/20 15:11:01 slot1_64: Request accepted.
05/21/20 15:11:01 slot1_64: Remote owner is ahnitz@xxxxxxxxxxx
05/21/20 15:11:01 slot1_64: State change: claiming protocol successful
05/21/20 15:11:01 slot1_64: Changing state: Owner -> Claimed
05/21/20 15:21:01 slot1_64: Response problem from schedd <10.20.30.17:9618?addrs=10.20.30.17-9618&noUDP&sock=2440233_7254_3> on ALIVE job -1.-1.
05/21/20 15:21:06 slot1_64: Response problem from schedd <10.20.30.17:9618?addrs=10.20.30.17-9618&noUDP&sock=2440233_7254_3> on ALIVE job -1.-1.
05/21/20 15:21:11 slot1_64: Couldn't send ALIVE to schedd at <10.20.30.17:9618?addrs=10.20.30.17-9618&noUDP&sock=2440233_7254_3>
05/21/20 15:23:16 slot1_64: Got activate_claim request from shadow (10.20.30.17)
05/21/20 15:23:16 slot1_64: Remote job ID is 2049848.0
05/21/20 15:23:16 slot1_64: Got universe "VANILLA" (5) from request classad
05/21/20 15:23:16 slot1_64: State change: claim-activation protocol successful
05/21/20 15:23:16 slot1_64: Changing activity: Idle -> Busy
05/21/20 15:23:19 slot1_64: Failed to open '/local/condor/execute/dir_48660/.update.ad.tmp' for writing update ad: No such file or directory (2).
05/21/20 15:24:01 slot1_64: Failed to open '/local/condor/execute/dir_48660/.update.ad.tmp' for writing update ad: No such file or directory (2).
05/21/20 15:24:04 slot1_64: Failed to open '/local/condor/execute/dir_48660/.update.ad.tmp' for writing update ad: No such file or directory (2).
05/21/20 15:24:04 slot1_64: Failed to open '/local/condor/execute/dir_48660/.update.ad.tmp' for writing update ad: No such file or directory (2).
05/21/20 16:45:27 slot1_64: Got ACTIVATE_CLAIM while in Claimed/Busy state, ignoring.
05/21/20 16:45:27 slot1_64: Called deactivate_claim()
05/21/20 16:45:27 slot1_64: Changing state and activity: Claimed/Busy -> Preempting/Vacating
05/21/20 16:45:38 slot1_64: Got RELEASE_CLAIM while in Preempting state, ignoring.

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
# condor_config_val $CondorVersion: 8.8.9 May 06 2020 BuildID: Debian-8.8.9-1 PackageID: 8.8.9-1 Debian-8.8.9-1 $

#
# from /etc/condor/condor_config
#
RELEASE_DIR = /usr
LOCAL_DIR = /local/condor
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
REQUIRE_LOCAL_CONFIG_FILE = false
LOCAL_CONFIG_DIR = /etc/condor/config.d
LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR) $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_READ_COLLECTOR = $(ALLOW_READ) $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ) $(FLOCK_FROM)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE) $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE) $(FLOCK_FROM)
RUN = /run/condor
LIB = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
LIBEXEC = $(RELEASE_DIR)/lib/condor/libexec
SHARE = $(RELEASE_DIR)/share/condor
PROCD_ADDRESS = $(RUN)/procd_pipe

#
# from /etc/condor/config.d/01_generic
#
LOG = /var/log/condor
MAX_MASTER_LOG = 100000000
MAX_CKPT_SERVER_LOG = 100000000
MAX_STARTD_LOG = 500000000
MAX_COLLECTOR_LOG = 500000000
MAX_NEGOTIATOR_LOG = 500000000
MAX_HAD_LOG = 100000000
MAX_SCHEDD_LOG = 500000000
MAX_SHADOW_LOG = 500000000
MAX_MATCH_LOG = 500000000
MAX_HISTORY_LOG = 1000000000
MAX_HISTORY_ROTATIONS = 7
ROTATE_HISTORY_DAILY = True
HISTORY_HELPER_MAX_HISTORY = 999999999
SEC_PASSWORD_FILE = /var/lib/condor/passwd
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ADMINISTRATOR_AUTHENTICATION = REQUIRED
SEC_ADMINISTRATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD, FS
ADMINISTRATORS = condor@xxxxxxxxxxx/*.atlas.local
CENTRAL_MANAGER = condorhub.atlas.local
ALL_SCHEDD = condor*.atlas.local
ALL_STARTD = a*.atlas.local
ALL_NODES = $(CENTRAL_MANAGER), $(ALL_SCHEDD), $(ALL_STARTD)
COLLECTOR_NAME = MPI-GRAPHY-AEI-Hannover
CONDOR_IDS = 666.666
COLLECTOR_HOST = $(CENTRAL_MANAGER)
NEGOTIATOR_HOST = $(CENTRAL_MANAGER)
DEFAULT_DOMAIN_NAME = atlas.local
UID_DOMAIN = atlas.local
FILESYSTEM_DOMAIN = atlas.local
LOCK = /var/lock/
CONDOR_ADMIN = atlas_admin@xxxxxxxxxx
MAIL = /usr/bin/mail
LOCAL_CONDOR_SCRIPTS = /usr/share/condor

#
# from /etc/condor/config.d/10_SCHEDD
#
NETWORK_INTERFACE = 10.20.30.17
DAEMON_LIST = MASTER SCHEDD
ALLOW_ADMINISTRATOR = $(ADMINISTRATORS)
ALLOW_ADVERTISE_MASTER = $(ADMINISTRATORS)
ALLOW_ADVERTISE_SCHEDD = $(ADMINISTRATORS)
ALLOW_CLIENT = $(CENTRAL_MANAGER), $(ALL_SCHEDD), $(ALL_STARTD)
ALLOW_CONFIG = $(ADMINISTRATORS)
ALLOW_DAEMON = $(ADMINISTRATORS)
ALLOW_NEGOTIATOR = $(ADMINISTRATORS), $(CENTRAL_MANAGER)
ALLOW_OWNER = $(ADMINISTRATORS)
ALLOW_READ = $(ADMINISTRATORS), $(FULL_HOSTNAME)
ALLOW_WRITE = *
MAX_JOBS_RUNNING = 100000

#
# from /etc/condor/config.d/20_FLOCKING
#
FLOCK_TO = condorhubinteractive.atlas.local

#
# from /etc/condor/config.d/99_TRANSFORM
#
JOB_TRANSFORM_NAMES = TagJob
JOB_TRANSFORM_TagJob = [ Eval_set_AccountingGroup = join(".", split(toLower(AcctGroup), ".")[1], AcctGroupUser); Eval_set_AcctGroup = toLower(AcctGroup); ]
SCHEDD_CLASSAD_USER_MAP_NAMES = ValidSearchTags ValidSearchUsers
CLASSAD_USER_MAPFILE_ValidSearchTags = /etc/condor/accounting/valid_tags
CLASSAD_USER_MAPFILE_ValidSearchUsers = /etc/condor/accounting/valid_users
SUBMIT_REQUIREMENT_NAMES = ValidateSearchTag ValidateSearchUser
SUBMIT_REQUIREMENT_ValidateSearchTag = JobUniverse == 7 || userMap("ValidSearchTags", AcctGroup) isnt undefined
SUBMIT_REQUIREMENT_ValidateSearchTag_REASON = strcat("Invalid value for search tag: ", AcctGroup ?: "<undefined>")
SUBMIT_REQUIREMENT_ValidateSearchUser = Debug(JobUniverse == 7 || userMap("ValidSearchUsers", Owner, AcctGroupUser) is AcctGroupUser || userMap("ValidSearchUsers", Owner) is undefined && Owner =?= AcctGroupUser)
SUBMIT_REQUIREMENT_ValidateSearchUser_REASON = strcat("Invalid value for search user: ", AcctGroupUser ?: "<undefined>", "\n", " Valid values are: ",userMap("ValidSearchUsers", Owner))
# condor_config_val $CondorVersion: 8.8.7 Dec 26 2019 BuildID: Debian-8.8.7-1 PackageID: 8.8.7-1 Debian-8.8.7-1 $

#
# from /etc/condor/condor_config
#
RELEASE_DIR = /usr
LOCAL_DIR = /local/condor
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
REQUIRE_LOCAL_CONFIG_FILE = false
LOCAL_CONFIG_DIR = /etc/condor/config.d
LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR) $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_READ_COLLECTOR = $(ALLOW_READ) $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ) $(FLOCK_FROM)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE) $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE) $(FLOCK_FROM)
RUN = /run/condor
LIB = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
LIBEXEC = $(RELEASE_DIR)/lib/condor/libexec
SHARE = $(RELEASE_DIR)/share/condor
PROCD_ADDRESS = $(RUN)/procd_pipe

#
# from /etc/condor/config.d/01_generic
#
LOG = /var/log/condor
MAX_MASTER_LOG = 100000000
MAX_CKPT_SERVER_LOG = 100000000
MAX_STARTD_LOG = 500000000
MAX_COLLECTOR_LOG = 500000000
MAX_NEGOTIATOR_LOG = 500000000
MAX_HAD_LOG = 100000000
MAX_SCHEDD_LOG = 500000000
MAX_SHADOW_LOG = 500000000
MAX_MATCH_LOG = 500000000
MAX_HISTORY_LOG = 1000000000
MAX_HISTORY_ROTATIONS = 7
ROTATE_HISTORY_DAILY = True
HISTORY_HELPER_MAX_HISTORY = 999999999
SEC_PASSWORD_FILE = /var/lib/condor/passwd
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ADMINISTRATOR_AUTHENTICATION = REQUIRED
SEC_ADMINISTRATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD, FS
ADMINISTRATORS = condor@xxxxxxxxxxx/*.atlas.local
CENTRAL_MANAGER = condorhub.atlas.local
ALL_SCHEDD = condor*.atlas.local
ALL_STARTD = a*.atlas.local
ALL_NODES = $(CENTRAL_MANAGER), $(ALL_SCHEDD), $(ALL_STARTD)
COLLECTOR_NAME = MPI-GRAPHY-AEI-Hannover
CONDOR_IDS = 666.666
COLLECTOR_HOST = $(CENTRAL_MANAGER)
NEGOTIATOR_HOST = $(CENTRAL_MANAGER)
DEFAULT_DOMAIN_NAME = atlas.local
UID_DOMAIN = atlas.local
FILESYSTEM_DOMAIN = atlas.local
LOCK = /var/lock/
CONDOR_ADMIN = atlas_admin@xxxxxxxxxx
MAIL = /usr/bin/mail
LOCAL_CONDOR_SCRIPTS = /usr/share/condor

#
# from /etc/condor/config.d/10_CENTRALMANAGER
#
NETWORK_INTERFACE = 10.20.40.190
DAEMON_LIST = MASTER COLLECTOR NEGOTIATOR
ALLOW_ADMINISTRATOR = $(ADMINISTRATORS), $(FULL_HOSTNAME)
ALLOW_ADVERTISE_MASTER = $(ADMINISTRATORS), $(ALL_NODES)
ALLOW_CLIENT = *
ALLOW_CONFIG = $(ADMINISTRATORS)
ALLOW_DAEMON = $(ADMINISTRATORS), $(ALL_NODES)
ALLOW_NEGOTIATOR = $(ADMINISTRATORS), $(CENTRAL_MANAGER)
ALLOW_OWNER = $(ADMINISTRATORS), $(FULL_HOSTNAME)
ALLOW_READ = *
ALLOW_WRITE = *
NewUserBetterPrio = ( RemoteUserPrio > SubmitterUserPrio * 1.2 )
IsGPUJob = ( Target.RequestGPUs > 0 && My.RequestGPUs =?= 0 )
PREEMPTION_REQUIREMENTS = $(NewUserBetterPrio)
ALLOW_PSLOT_PREEMPTION = True
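
As a reading aid for the negotiator policy above, here is a minimal Python sketch of what NewUserBetterPrio (and therefore PREEMPTION_REQUIREMENTS) evaluates to. It assumes the usual HTCondor convention that a numerically lower user priority value is the better one. The function name and the sample priority values are hypothetical and chosen only for illustration; they are not taken from the pool.

```python
def new_user_better_prio(remote_user_prio: float, submitter_user_prio: float) -> bool:
    """Mirror of the config expression: RemoteUserPrio > SubmitterUserPrio * 1.2.

    remote_user_prio:    priority value of the user whose job currently runs on the slot
    submitter_user_prio: priority value of the user whose job is a candidate for the slot
    """
    return remote_user_prio > submitter_user_prio * 1.2


# Hypothetical values: the running user's priority value is twice the
# candidate's, so the expression is satisfied.
print(new_user_better_prio(10.0, 5.0))

# Hypothetical values: only 10% apart, which is inside the 20% margin,
# so the expression is not satisfied.
print(new_user_better_prio(5.5, 5.0))
```

The factor of 1.2 means the expression only becomes true once the two priority values differ by more than 20%, so two users with nearly equal priorities do not satisfy it against each other.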
# condor_config_val $CondorVersion: 8.8.9 May 06 2020 BuildID: Debian-8.8.9-1 PackageID: 8.8.9-1 Debian-8.8.9-1 $

#
# from /etc/condor/condor_config
#
RELEASE_DIR = /usr
LOCAL_DIR = /local/condor
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
REQUIRE_LOCAL_CONFIG_FILE = false
LOCAL_CONFIG_DIR = /etc/condor/config.d
LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR) $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_OWNER = $(FULL_HOSTNAME) $(IPV4_ADDRESS) $(IPV6_ADDRESS)
ALLOW_READ = *
ALLOW_READ_COLLECTOR = $(ALLOW_READ) $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ) $(FLOCK_FROM)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE) $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE) $(FLOCK_FROM)
ALLOW_WRITE = 10.0.0.0/9
RUN = /run/condor
LIB = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
LIBEXEC = $(RELEASE_DIR)/lib/condor/libexec
SHARE = $(RELEASE_DIR)/share/condor
PROCD_ADDRESS = $(RUN)/procd_pipe

#
# from /etc/condor/config.d/01_generic
#
LOG = /var/log/condor
MAX_MASTER_LOG = 100000000
MAX_CKPT_SERVER_LOG = 100000000
MAX_STARTD_LOG = 500000000
MAX_COLLECTOR_LOG = 500000000
MAX_NEGOTIATOR_LOG = 500000000
MAX_HAD_LOG = 100000000
MAX_SCHEDD_LOG = 500000000
MAX_SHADOW_LOG = 500000000
MAX_MATCH_LOG = 500000000
MAX_HISTORY_LOG = 1000000000
MAX_HISTORY_ROTATIONS = 7
ROTATE_HISTORY_DAILY = True
HISTORY_HELPER_MAX_HISTORY = 999999999
SEC_PASSWORD_FILE = /var/lib/condor/passwd
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ADMINISTRATOR_AUTHENTICATION = REQUIRED
SEC_ADMINISTRATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD, FS
ADMINISTRATORS = condor@xxxxxxxxxxx/*.atlas.local
CENTRAL_MANAGER = condorhub.atlas.local
ALL_SCHEDD = condor*.atlas.local
ALL_STARTD = a*.atlas.local
ALL_NODES = $(CENTRAL_MANAGER), $(ALL_SCHEDD), $(ALL_STARTD)
COLLECTOR_NAME = MPI-GRAPHY-AEI-Hannover
CONDOR_IDS = 666.666
COLLECTOR_HOST = $(CENTRAL_MANAGER)
NEGOTIATOR_HOST = $(CENTRAL_MANAGER)
DEFAULT_DOMAIN_NAME = atlas.local
UID_DOMAIN = atlas.local
FILESYSTEM_DOMAIN = atlas.local
LOCK = /var/lock/
CONDOR_ADMIN = atlas_admin@xxxxxxxxxx
MAIL = /usr/bin/mail
LOCAL_CONDOR_SCRIPTS = /usr/share/condor

#
# from /etc/condor/config.d/10_EXECUTE
#
NETWORK_INTERFACE = 10.10.71.1
DAEMON_LIST = MASTER STARTD
MAXJOBRETIREMENTTIME = (10 * $(HOUR))
START = True && (MaxRunTimeHours * $(HOUR) <= MAXJOBRETIREMENTTIME)
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, ram=501913, swap=0%
SLOT_TYPE_1_PARTITIONABLE = True
LOCALINFO = $(LOCAL_CONDOR_SCRIPTS)/lscpu_localuser_info
STARTD_CRON_JOBLIST = LOCALINFO
STARTD_CRON_LOCALINFO_EXECUTABLE = $(LOCALINFO)
STARTD_CRON_LOCALINFO_PERIOD = 300
MASTER_NEW_BINARY_RESTART = peaceful
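
To put a number on "killed earlier than MaxJobRetirementTime", the following small Python sketch compares two StarterLog timestamps quoted in the message with the MAXJOBRETIREMENTTIME value from the execute-node configuration above. It rests on two assumptions: that the Create_Process line marks the start of the job and that the "Got SIGTERM" line marks the moment the starter was told to stop it. The helper name parse_log_time is made up for this illustration.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"  # e.g. "05/21/20 15:24:32"
HOUR = 3600  # seconds, the value of HTCondor's $(HOUR) macro


def parse_log_time(line: str) -> datetime:
    """Parse the 17-character timestamp at the start of a daemon log line."""
    return datetime.strptime(line[:17], LOG_TIME_FORMAT)


# Two StarterLog lines copied from the message.
job_started = parse_log_time(
    "05/21/20 15:24:32 (pid:48660) Create_Process succeeded, pid=48814"
)
got_sigterm = parse_log_time(
    "05/21/20 16:45:27 (pid:48660) Got SIGTERM. Performing graceful shutdown."
)

elapsed_seconds = int((got_sigterm - job_started).total_seconds())
retirement_seconds = 10 * HOUR  # MAXJOBRETIREMENTTIME = (10 * $(HOUR))

print(f"job ran for {elapsed_seconds} s; retirement time is {retirement_seconds} s")
print("stopped before the retirement time elapsed:", elapsed_seconds < retirement_seconds)
```

Under these assumptions the job ran for roughly 1 h 21 min before the SIGTERM, far short of the configured 10 hours.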