I’m struggling with HTCondor-C. It originally worked on our system, but during the 2 years I have been away something failed and users reverted to using it as a single pool. We are still running 7.8.8 across a number of dedicated Linux processors, with Windows user submit machines. I don’t want to upgrade until I find the answer to this issue.

If a grid job is submitted, it sits locally Idle with:

    Request has not been considered by the Matchmaker

The Gridmanager process keeps starting up, repeatedly failing to set permissions for something, and then exiting. The SchedLog shows something similar. I have Googled my heart out to no avail, and have re-installed on the Windows submit machine. What is it about the UIDs/permissions?

Jobs submitted as vanilla rather than grid to the same remote central manager run as normal. The Gahp_worker never fires, so I think the problem is local.

Can anyone please be of assistance?

Troy

GridmanagerLog:
...
02/11/14 14:23:40 [7608] TokenCache contents: troy@domain
02/11/14 14:23:40 [7608] DaemonCore: in SendAliveToParent()
02/11/14 14:23:40 [7608] DaemonCore::IsPidAlive(): OpenProcess failed
02/11/14 14:23:40 [7608] DaemonCore: in SendAliveToParent() - ppid 4740l disappeared!
02/11/14 14:23:40 [7608] Checking proxies
02/11/14 14:23:43 [7608] Initialized the following authorization table:
02/11/14 14:23:43 [7608] Authorizations yet to be resolved:
02/11/14 14:23:43 [7608] allow READ: */* */*
02/11/14 14:23:43 [7608] allow WRITE: */* */local@xxxxx */147.66.85.62 */147.66.85.62
02/11/14 14:23:43 [7608] allow NEGOTIATOR: */ local@xxxxx */147.66.85.62 */147.66.85.62
02/11/14 14:23:43 [7608] allow ADMINISTRATOR: */ local@xxxxx */147.66.85.62 */147.66.85.62
02/11/14 14:23:43 [7608] allow OWNER: */ local@xxxxx */NEW-50985.aad.gov.au */147.66.85.62 */147.66.85.62 */147.66.85.62
02/11/14 14:23:43 [7608] allow DAEMON: */* */ local@xxxxx */147.66.85.62 */147.66.85.62
02/11/14 14:23:43 [7608] allow ADVERTISE_STARTD: */* */ local@xxxxx */147.66.85.62 */147.66.85.62
02/11/14 14:23:43 [7608] allow ADVERTISE_SCHEDD: */* */ local@xxxxx */147.66.85.62 */147.66.85.62
02/11/14 14:23:43 [7608] allow ADVERTISE_MASTER: */* */ local@xxxxx */147.66.85.62 */147.66.85.62
02/11/14 14:23:43 [7608] Received ADD_JOBS signal
02/11/14 14:23:43 [7608] in doContactSchedd()
02/11/14 14:23:43 [7608] TokenCache contents: troy@domain
02/11/14 14:23:43 [7608] SetEffectiveOwner(troy@domain) failed with errno=13: Permission denied.
02/11/14 14:23:43 [7608] Failed to connect to schedd! Will retry
02/11/14 14:23:45 [7608] Evaluating staleness of remote job statuses.
02/11/14 14:23:48 [7608] in doContactSchedd()
02/11/14 14:23:48 [7608] TokenCache contents: troy@domain
...[SNIP]...
02/11/14 14:24:23 [7608] SetEffectiveOwner(troy@domain) failed with errno=13: Permission denied.
02/11/14 14:24:23 [7608] Failed to connect to schedd! Will retry
02/11/14 14:24:28 [7608] in doContactSchedd()
02/11/14 14:24:28 [7608] TokenCache contents: troy@domain
02/11/14 14:24:28 [7608] SetEffectiveOwner(troy@domain) failed with errno=13: Permission denied.
02/11/14 14:24:28 [7608] Failed to connect to schedd!
02/11/14 14:24:28 [7608] ERROR "Too many failures connecting to schedd!"
    at line 1246 in file c:\condor\execute\dir_11160\userdir\src\condor_gridmanager\gridmanager.cpp
02/11/14 14:28:40 init_user_ids: want user ‘troy@domain’, current is '(null)@(null)'
02/11/14 14:28:40 Found credential for user troy@domain’
02/11/14 14:28:40 LogonUser completed.

SchedLog:
...
02/11/14 13:36:11 (pid:4740) SetEffectiveOwner security violation: setting owner to troy@domain when active owner is "SYSTEM"
02/11/14 13:36:12 (pid:4740) Number of Active Workers 0
02/11/14 13:36:14 (pid:4740) Number of Active Workers 0
02/11/14 13:36:15 (pid:4740) Number of Active Workers 0
02/11/14 13:36:16 (pid:4740) SetEffectiveOwner security violation: setting owner to troy@domain when active owner is "SYSTEM"
02/11/14 13:36:17 (pid:4740) Number of Active Workers 0
02/11/14 13:36:18 (pid:4740) Number of Active Workers 0
02/11/14 13:36:20 (pid:4740) Number of Active Workers 0
02/11/14 13:36:21 (pid:4740) SetEffectiveOwner security violation: setting owner to troy@domain when active owner is "SYSTEM"
02/11/14 13:36:21 (pid:4740) condor_gridmanager (PID 7652, owner troy) exited with return code 4.
02/11/14 13:36:21 (pid:4740) Number of Active Workers 0

Condor_config.local:
...
UID_DOMAIN = $(FULL_HOSTNAME)
#TRUST_UID_DOMAIN = TRUE
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *

## Daemons
DAEMON_LIST=MASTER SCHEDD COLLECTOR NEGOTIATOR

## GRID PARAMS
CONDOR_GAHP = $(SBIN)/condor_c-gahp
GRIDMANAGER_LOG = $(LOG)/GridLogs/GridmanagerLog.$(USERNAME)
C_GAHP_LOG = $(LOG)/GridLogs/CGAHPLog.$(USERNAME)
C_GAHP_WORKER_THREAD_LOG = $(LOG)/GridLogs/CGAHPWorkerLog.$(USERNAME)

## DEBUGGING
GRIDMANAGER_DEBUG = D_FULLDEBUG
C_GAHP_DEBUG = D_FULLDEBUG
C_GAHP_WORKER_THREAD_DEBUG = D_FULLDEBUG

## Security
SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE

Submit file:
Universe = grid
Executable = R
transfer_executable = False
Arguments = --version
Error = Error_$(Cluster).$(Process).txt
Output = Output_$(Cluster).$(Process).txt
Log = Condor_log.txt
should_transfer_files = True
when_to_transfer_output = ON_EXIT
grid_resource = condor server1.a.b.c server1.a.b.c
+remote_requirements = Arch == "X86_64" && OpSys == "LINUX"
+remote_universe = vanilla
+remote_shouldtransferfiles = "YES"
+remote_whentotransferoutput = "ON_EXIT"
Queue

___________________________________________________________________________
Australian Antarctic Division - Commonwealth of Australia