Fellow condor users,
I have an all-Windows condor pool consisting of 1 central manager, 2-3 schedulers, a dedicated credd server and about 30 execute machines. All machines, except for the credd server, have a startd running on them and can therefore accept jobs in some capacity. I have a mix of versions from 7.6.8 up to 8.0.3. This configuration worked seamlessly for about 3 years, until last week, when my credd server died and I had to migrate to a new machine. To do so I copied the config.local of the old credd server (fortunately I had a backup) to a nearly identical machine (same OS, hardware, etc.), and have been unable to bring the pool up since. I have pasted what I think are the relevant configuration settings and log messages below. In a nutshell, jobs start and then crash because they cannot find a password for my account. However, when I run
condor_store_cred add
it completes successfully, and indeed
condor_store_cred query
reports that credentials have been stored and are valid. I cannot find the disconnect between the credd server/scheduler and starters. I have tried changing credd servers (the configuration below actually has the CM, 10.1.216.182, as the credd server), different users, different schedulers, and always end up with the same result. Furthermore, the log messages are not leading me to an answer as they had in the past. Has anyone managed to work through this issue? If so, I would greatly appreciate some guidance.
Eric
Condor_config.local (CM):
CREDD_HOST = $(FULL_HOSTNAME)
STARTER_ALLOW_RUNAS_OWNER = True
CREDD_CACHE_LOCALLY = True
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
ALLOW_CONFIG = Administrator@*,$(CONDOR_HOST)
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED

CREDD_LOG = $(LOG)/CreddLog
CREDD_DEBUG = D_COMMAND
MAX_CREDD_LOG = 50000000

ALLOW_CONFIG = $(IP_ADDRESS),$(CONDOR_HOST),Administrator@*
ALLOW_WRITE = 10.*,*.$(UID_DoMAIN)
ALLOW_READ = *
---------------------------------------------------------------------------------------------
Condor_config.local (schedd)
CREDD_HOST = $(CONDOR_HOST).$(UID_DOMAIN)
STARTER_ALLOW_RUNAS_OWNER = True
CREDD_CACHE_LOCALLY = True
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED

ALLOW_CONFIG = $(IP_ADDRESS),$(CONDOR_HOST),Administrator@*
ALLOW_WRITE = $(FULL_HOSTNAME),$(IP_ADDRESS),*.vms.ad.varian.com,10.1.*

WorkHours = ( (ClockMin >= 450 && ClockMin < 1080) && \
              (ClockDay > 0 && ClockDay < 6) )
AfterHours = ( (ClockMin < 450 || ClockMin >= 1080) || \
               (ClockDay == 0 || ClockDay == 6) )
#START = $(AfterHours) && $(UWCS_START)
#SUSPEND = $(WorkHours) || $(UWCS_SUSPEND)
#PREEMPT = $(WorkHours)
#START = TRUE
#SUSPEND = FALSE
#KILL = FALSE
#PREEMPT = FALSE
#STARTD_DEBUG=D_ALL
#MAX_NUM_CPUS = 3
DAEMON_LIST = MASTER, KBDD, SCHEDD
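One thing I notice while pasting these: the two files define CREDD_HOST differently, $(FULL_HOSTNAME) on the CM and $(CONDOR_HOST).$(UID_DOMAIN) on the schedd, and both need to end up naming the same machine. The Python sketch below is only a simplified stand-in for condor's macro expansion, not its actual implementation, and every host and domain value in it is a made-up placeholder rather than a value from my pool. It shows how the schedd-side form can come out wrong if CONDOR_HOST is already fully qualified. On the real machines, condor_config_val CREDD_HOST prints the value each one actually uses.

```python
import re

# Matches $(NAME) references as they appear in the config files above.
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")

def expand(value, macros):
    """Substitute $(NAME) references until nothing changes.

    Simplified stand-in for condor's macro expansion (assumption):
    names are looked up case-insensitively, unknown names are left as-is.
    """
    lookup = {name.upper(): text for name, text in macros.items()}
    for _ in range(10):  # bounded, so a self-reference cannot loop forever
        expanded = MACRO.sub(
            lambda m: lookup.get(m.group(1).upper(), m.group(0)), value
        )
        if expanded == value:
            break
        value = expanded
    return value

# Hypothetical values, for illustration only.
cm_macros = {"FULL_HOSTNAME": "cm.example.com"}
schedd_short = {"CONDOR_HOST": "cm", "UID_DOMAIN": "example.com"}
schedd_fqdn = {"CONDOR_HOST": "cm.example.com", "UID_DOMAIN": "example.com"}

cm_credd = expand("$(FULL_HOSTNAME)", cm_macros)
schedd_credd_short = expand("$(CONDOR_HOST).$(UID_DOMAIN)", schedd_short)
schedd_credd_fqdn = expand("$(CONDOR_HOST).$(UID_DOMAIN)", schedd_fqdn)

print("CM:                  ", cm_credd)
print("schedd (short host): ", schedd_credd_short)
print("schedd (FQDN host):  ", schedd_credd_fqdn)
```

With a short CONDOR_HOST the two settings agree; with a fully qualified CONDOR_HOST the domain is appended twice on the schedd side. I do not know whether this is what is happening in my pool, it is just something the differing definitions make possible.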
-------------------------------------------------------------------------------------------------
Log excerpt from Starter.slot1
08/05/14 07:11:35 Using config source: C:\condor\condor_config
08/05/14 07:11:35 Using local config sources:
08/05/14 07:11:35 C:\condor/condor_config.local
08/05/14 07:11:35 DaemonCore: command socket at <10.1.216.198:58156>
08/05/14 07:11:35 DaemonCore: private command socket at <10.1.216.198:58156>
08/05/14 07:11:35 GLEXEC_JOB not supported on this platform; ignoring
08/05/14 07:11:35 Communicating with shadow <10.1.216.182:3690>
08/05/14 07:11:35 Submitting machine is "mv6d8xfmnb1.vms.ad.varian.com"
08/05/14 07:11:35 setting the orig job name in starter
08/05/14 07:11:35 setting the orig job iwd in starter
08/05/14 07:11:39 ERROR: Could not locate valid credential for user 'eabel@VMS'
08/05/14 07:11:39 Could not initialize user_priv as "VMS\eabel". Make sure this account's password is securely stored with condor_store_cred.
08/05/14 07:11:39 ERROR: Failed to determine what user to run this job as, aborting
08/05/14 07:11:39 Failed to initialize JobInfoCommunicator, aborting
08/05/14 07:11:39 Unable to start job.
Log excerpt from Credd.log
NewSession = "YES"
ParentUniqueID = "MV6D8XFMNB1:6996:1407247268"
AuthMethods = "NTSSPI, PASSWORD"
Enact = "NO"
CryptoMethods = "3DES,BLOWFISH"
OutgoingNegotiation = "PREFERRED"
CurrentTime = time()
RemoteVersion = "$CondorVersion: 7.8.7 Dec 12 2012 BuildID: 86173 $"
ServerCommandSock = "<10.1.216.182:1450>"
Integrity = "OPTIONAL"
ServerPid = 7984
Encryption = "OPTIONAL"
Authentication = "OPTIONAL"
SessionLease = 3600
SessionDuration = "86400"
Subsystem = "SHADOW"
Command = 81099
08/05/14 10:02:37 condor_write(): Socket closed when trying to write 291 bytes to <10.1.216.182:1750>, fd is 464
08/05/14 10:02:37 Buf::write(): condor_write() failed
08/05/14 10:02:37 SECMAN: Error sending response classad to <10.1.216.182:1750>!