Hi All.
Thanks for the feedback. I removed condor from all 3 machines and installed version 9.4 from the script. Installing onto Oracle 8 was a bit fiddly, but I got it to work eventually.
I still have nothing coming up when I do condor_status.
I tried using Google for an example of a multiple machine condor installation on Linux for V9, but nothing much turns up.
Thank you for your time, and once again, any hints are gratefully accepted.
--
kind regards,
Justin Fisher.
My worker config now looks like this on both machines:
----------------------------------------------------------------------------------------------------
LOCAL_CONFIG_DIR = /etc/condor/config.d
##DAEMON_LIST = MASTER, COLLECTOR, STARTD, SCHEDD, SHARED_PORT
DAEMON_LIST = MASTER,STARTD, SCHEDD, SHARED_PORT
DEFAULT_DOMAIN_NAME = ingenazure.com
CONDOR_HOST = or8.ingenazure.com
UID_DOMAIN = ingenazure.com
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
## ALLOW_WRITE = *.$(UID_DOMAIN)
## ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST), 192.168.178.*
## ALLOW_WRITE = 192.168.178.*
ALLOW_READ = *.$(UID_DOMAIN), 192.168.178.*
CONDOR_ADMIN = jfisher@xxxxxxxxxxxxxx
USE_NFS = FALSE
StartJobs = true
STARTD_ATTRS = StartJobs, $(STARTD_ATTRS)
# When is this node willing to run jobs?
## START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True)
START = true
# Permanent way of stopping jobs from starting
HOSTALLOW_CONFIG = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST)
ENABLE_RUNTIME_CONFIG = True
RUNTIME_CONFIG_ADMIN = $(CONDOR_HOST)
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/persistent
# use one shared port
USE_SHARED_PORT = TRUE
SHARED_PORT_ARGS = -p 9618
COLLECTOR_USES_SHARED_PORT=TRUE
COLLECTOR_HOST = $(CONDOR_HOST):9618
# Enable CGROUP control
BASE_CGROUP = htcondor
# hard: job can't access more physical memory than allocated
# soft: job can access more physical memory than allocated when there is free memory
CGROUP_MEMORY_LIMIT_POLICY = soft
# slots
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 24
SLOT_TYPE_1 = cpus=1, ram=4%, swap=4%, disk=4%
SLOT_TYPE_1_PARTITIONABLE = true
COUNT_HYPERTHREAD_CPUS = true
----------------------------------------------------------------------------------------------------
I'm not really sure what I should be doing with ALLOW_WRITE on the worker.
My master config looks like this:
----------------------------------------------------------------------------------------------------
CONDOR_HOST = or8.ingenazure.com
# For details, run condor_config_val use role:get_htcondor_central_manager
use role:get_htcondor_central_manager
use security:recommended_v9_0
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, SHARED_PORT
START = true
ALLOW_ADMINISTRATOR = jfisher@xxxxxxxxxxxxxx
DEFAULT_DOMAIN_NAME = ingenazure.com
UID_DOMAIN = ingenazure.com
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
## ALLOW_DAEMON = $(ALLOW_WRITE)
ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE)
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE)
ALLOW_ADVERTISE_SCHEDD = $(ALLOW_WRITE)
ALLOW_READ = */*.ingenazure.com, or8.ingenazure.com
ALLOW_NEGOTIATOR = or8.ingenazure.com
CONDOR_ADMIN = jfisher@xxxxxxxxxxxxxx
USE_NFS = FALSE
HOSTNAME = or8
## use shared ports
USE_SHARED_PORT=TRUE
SHARED_PORT_ARGS = -p 9618
COLLECTOR_USES_SHARED_PORT=TRUE
COLLECTOR_HOST = $(CONDOR_HOST):9618
StartJobs = TRUE
### Logging & debugging
MASTER_INSTANCE_LOCK = /var/lock/condor/InstanceLock
MAX_DEFAULT_LOG = 1000000
EVENT_LOG = $(LOG)/EventLog
EVENT_LOG_JOB_AD_INFORMATION_ATTRS=Owner,CurrentHosts,x509userproxysubject,x509UserProxyVOName,AccountingGroup,GlobalJobId,QDate,JobStartDate,JobCurrentStartDate,JobFinishedHookDone
EVENT_LOG_MAX_SIZE = 10000000
EVENT_LOG_MAX_ROTATIONS = 5
POOL_HISTORY_DIR = /var/log/condor
KEEP_POOL_HISTORY = True
###
# QUOTAS
###
GROUP_NAMES = group_ANALOG, group_DIGITAL, group_OTHER
# set the shares for your users
GROUP_QUOTA_DYNAMIC_group_ANALOG = 1
GROUP_QUOTA_DYNAMIC_group_DIGITAL = 1
GROUP_QUOTA_DYNAMIC_group_OTHER = 0.5
GROUP_ACCEPT_SURPLUS = TRUE
----------------------------------------------------------------------------------------------------
tail -n15 CollectorLog
12/29/21 17:23:46 Got QUERY_STARTD_PVT_ADS
12/29/21 17:23:46 QueryWorker: forked new high priority worker with id 12249 ( max 4 active 1 pending 0 )
12/29/21 17:23:46 (Sending 0 ads in response to query)
12/29/21 17:23:46 Query info: matched=0; skipped=0; query_time=0.000179; send_time=0.000125; type=MachinePrivate; requirements={true}; locate=0; limit=0; from=COLLECTOR; peer=<192.168.178.63:17021>; projection={}; filter_private_ads=0
12/29/21 17:23:46 QueryWorker: forked new high priority worker with id 12250 ( max 4 active 2 pending 0 )
12/29/21 17:23:46 (Sending 0 ads in response to query)
12/29/21 17:23:46 Query info: matched=0; skipped=14; query_time=0.000197; send_time=0.000089; type=Any; requirements={(((MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; limit=0; from=COLLECTOR; peer=<192.168.178.63:16169>; projection={}; filter_private_ads=0
12/29/21 17:24:45 Accumulating data: Time=1640795085
12/29/21 17:24:46 Got QUERY_STARTD_PVT_ADS
12/29/21 17:24:46 QueryWorker: forked new high priority worker with id 12268 ( max 4 active 1 pending 0 )
12/29/21 17:24:46 (Sending 0 ads in response to query)
12/29/21 17:24:46 Query info: matched=0; skipped=0; query_time=0.000172; send_time=0.000088; type=MachinePrivate; requirements={true}; locate=0; limit=0; from=COLLECTOR; peer=<192.168.178.63:6541>; projection={}; filter_private_ads=0
12/29/21 17:24:46 QueryWorker: forked new high priority worker with id 12269 ( max 4 active 2 pending 0 )
12/29/21 17:24:46 (Sending 0 ads in response to query)
12/29/21 17:24:46 Query info: matched=0; skipped=14; query_time=0.000173; send_time=0.000085; type=Any; requirements={(((MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; limit=0; from=COLLECTOR; peer=<192.168.178.63:2175>; projection={}; filter_private_ads=0
----------------------------------------------------------------------------------------------------
Master schedd log
tail -n15 SchedLog
12/29/21 17:10:45 (pid:9696)    /etc/condor/condor_config.local
12/29/21 17:10:45 (pid:9696) config Macros = 94, Sorted = 94, StringBytes = 2705, TablesBytes = 3432
12/29/21 17:10:45 (pid:9696) CLASSAD_CACHING is ENABLED
12/29/21 17:10:45 (pid:9696) Daemon Log is logging: D_ALWAYS D_ERROR
12/29/21 17:10:45 (pid:9696) SharedPortEndpoint: waiting for connections to named socket schedd_9642_3040
12/29/21 17:10:45 (pid:9696) DaemonCore: command socket at <192.168.178.63:9618?addrs=192.168.178.63-9618+[2001-871-262-b1ea-20c-29ff-feff-a619]-9618&alias=or8.ingenazure.com&noUDP&sock=schedd_9642_3040>
12/29/21 17:10:45 (pid:9696) DaemonCore: private command socket at <192.168.178.63:9618?addrs=192.168.178.63-9618+[2001-871-262-b1ea-20c-29ff-feff-a619]-9618&alias=or8.ingenazure.com&noUDP&sock=schedd_9642_3040>
12/29/21 17:10:45 (pid:9696) History file rotation is enabled.
12/29/21 17:10:45 (pid:9696)   Maximum history file size is: 20971520 bytes
12/29/21 17:10:45 (pid:9696)   Number of rotated history files is: 2
12/29/21 17:10:46 (pid:9696) Reloading job factories
12/29/21 17:10:46 (pid:9696) Loaded 0 job factories, 0 were paused, 0 failed to load
12/29/21 17:10:46 (pid:9696) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
12/29/21 17:10:46 (pid:9696) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
12/29/21 17:10:46 (pid:9696) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
----------------------------------------------------------------------------------------------------
Worker schedd log
tail -n15 /var/log/condor/SchedLog
12/29/21 17:17:20 (pid:26109) Failed to start non-blocking update to <192.168.178.63:9618>.
12/29/21 17:17:20 (pid:26109) SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command DC_START_TOKEN_REQUEST.
12/29/21 17:17:20 (pid:26109) Failed to request a new token: DAEMON:1:failed to start command for token request with remote daemon at '<192.168.178.63:9618?alias=or8.ingenazure.com>'.|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:22:20 (pid:26109) SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command UPDATE_SCHEDD_AD.
12/29/21 17:22:20 (pid:26109) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:22:20 (pid:26109) Collector update failed; will try to get a token request for trust domain or8.ingenazure.com, identity (default).
12/29/21 17:22:20 (pid:26109) Failed to start non-blocking update to <192.168.178.63:9618>.
12/29/21 17:22:20 (pid:26109) SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command DC_START_TOKEN_REQUEST.
12/29/21 17:22:20 (pid:26109) Failed to request a new token: DAEMON:1:failed to start command for token request with remote daemon at '<192.168.178.63:9618?alias=or8.ingenazure.com>'.|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:27:21 (pid:26109) SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command UPDATE_SCHEDD_AD.
12/29/21 17:27:21 (pid:26109) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:27:21 (pid:26109) Collector update failed; will try to get a token request for trust domain or8.ingenazure.com, identity (default).
12/29/21 17:27:21 (pid:26109) Failed to start non-blocking update to <192.168.178.63:9618>.
12/29/21 17:27:21 (pid:26109) SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command DC_START_TOKEN_REQUEST.
12/29/21 17:27:21 (pid:26109) Failed to request a new token: DAEMON:1:failed to start command for token request with remote daemon at '<192.168.178.63:9618?alias=or8.ingenazure.com>'.|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
----------------------------------------------------------------------------------------------------
Worker Master Log
tail -n15 /var/log/condor/MasterLog
12/29/21 17:17:25 Failed to start non-blocking update to <192.168.178.63:9618>.
12/29/21 17:17:25 SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command DC_START_TOKEN_REQUEST.
12/29/21 17:17:25 Failed to request a new token: DAEMON:1:failed to start command for token request with remote daemon at '<192.168.178.63:9618?alias=or8.ingenazure.com>'.|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:22:25 SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command UPDATE_MASTER_AD.
12/29/21 17:22:25 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:22:25 Collector update failed; will try to get a token request for trust domain or8.ingenazure.com, identity (default).
12/29/21 17:22:25 Failed to start non-blocking update to <192.168.178.63:9618>.
12/29/21 17:22:25 SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command DC_START_TOKEN_REQUEST.
12/29/21 17:22:25 Failed to request a new token: DAEMON:1:failed to start command for token request with remote daemon at '<192.168.178.63:9618?alias=or8.ingenazure.com>'.|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:27:25 SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command UPDATE_MASTER_AD.
12/29/21 17:27:25 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:27:25 Collector update failed; will try to get a token request for trust domain or8.ingenazure.com, identity (default).
12/29/21 17:27:25 Failed to start non-blocking update to <192.168.178.63:9618>.
12/29/21 17:27:25 SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command DC_START_TOKEN_REQUEST.
12/29/21 17:27:25 Failed to request a new token: DAEMON:1:failed to start command for token request with remote daemon at '<192.168.178.63:9618?alias=or8.ingenazure.com>'.|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
----------------------------------------------------------------------------------------------------
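Both worker logs repeat the same failure every five minutes, so it may help to reduce them to a tally before reading further. The following is only a minimal Python sketch: the sample lines are copied from the worker SchedLog and MasterLog above, while the function name summarize and the two regular expressions are my own illustration and not part of HTCondor.

```python
import re
from collections import Counter

# Sample lines copied from the worker SchedLog and MasterLog above.
SAMPLE_LOG = """\
12/29/21 17:22:20 (pid:26109) SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command UPDATE_SCHEDD_AD.
12/29/21 17:22:20 (pid:26109) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
12/29/21 17:22:25 SECMAN: required authentication with collector or8.ingenazure.com:9618 failed, so aborting command UPDATE_MASTER_AD.
12/29/21 17:22:25 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
"""

# Illustrative patterns, written against the log text shown in this post.
ABORTED = re.compile(r"so aborting command (\w+)\.")
AUTH_ERROR = re.compile(r"AUTHENTICATE:(\d+):([^|\n]+)")


def summarize(log_text):
    """Count aborted commands and AUTHENTICATE error codes in log text."""
    commands = Counter(ABORTED.findall(log_text))
    errors = Counter(AUTH_ERROR.findall(log_text))
    return commands, errors


if __name__ == "__main__":
    commands, errors = summarize(SAMPLE_LOG)
    for name, count in sorted(commands.items()):
        print(f"aborted {name}: {count}")
    for (code, message), count in sorted(errors.items()):
        print(f"AUTHENTICATE:{code} x{count}: {message.strip()}")
```

On the full logs this should show that every aborted command (the ad updates and the token requests alike) fails with the same two AUTHENTICATE codes, and that FS is the only method named.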
Master master log
tail -n15 MasterLog
12/29/21 17:10:45 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
12/29/21 17:10:45 DaemonCore: private command socket at <192.168.178.63:0?alias=or8.ingenazure.com&sock=master_9642_3040>
12/29/21 17:10:45 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
12/29/21 17:10:45 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1638477680)
12/29/21 17:10:45 Cannot remove wait-for-startup file /var/lock/condor/shared_port_ad
12/29/21 17:10:45 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 9693
12/29/21 17:10:45 Waiting for /var/lock/condor/shared_port_ad to appear.
12/29/21 17:10:45 Found /var/lock/condor/shared_port_ad.
12/29/21 17:10:45 Cannot remove wait-for-startup file /var/log/condor/.collector_address
12/29/21 17:10:45 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 9694
12/29/21 17:10:45 Waiting for /var/log/condor/.collector_address to appear.
12/29/21 17:10:45 Found /var/log/condor/.collector_address.
12/29/21 17:10:45 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 9695
12/29/21 17:10:45 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 9696
12/29/21 17:10:45 Daemons::StartAllDaemons all daemons were started
----------------------------------------------------------------------------------------------------