
Re: [HTCondor-users] problems with htcondor-ce 3.2.1-1 + condor 8.8.1



Stefano,

On the CE host, is the local condor running and configured as a submit 
host? This error in the JobRouterLog leads me to believe that there's a 
communication error between the CE job router and the local HTCondor's 
schedd:

03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618) 
Can't find address of schedd
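
To look for the same kind of failure in other logs, a small filter over the log text can help. This is only a minimal sketch, not part of HTCondor: the name `find_errors` and the keyword list are made up for illustration. It assumes each entry starts with the `MM/DD/YY HH:MM:SS (LEVEL)` prefix seen in the line above and sits on a single line (the lines quoted in this thread are wrapped by the mail client).

```python
import re

# Prefix as seen in the excerpts in this thread, e.g.
# "03/08/19 18:17:32 (D_ALWAYS) ERROR ..."
# An optional "(cid:N)" field may precede the level.
ENTRY = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:\(cid:\d+\) )?"
    r"\((?P<level>[^)]+)\) (?P<msg>.*)$"
)

# Illustrative keywords only (an assumption); extend as needed.
KEYWORDS = ("ERROR", "SECMAN", "AUTHENTICATE", "Can't find address")


def find_errors(log_text):
    """Return (timestamp, level, message) for entries mentioning a keyword."""
    hits = []
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match and any(k in match["msg"] for k in KEYWORDS):
            hits.append((match["stamp"], match["level"], match["msg"]))
    return hits


if __name__ == "__main__":
    sample = (
        "03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618) "
        "Can't find address of schedd\n"
        "03/08/19 18:11:34 (D_ALWAYS:2) Sending RESCHEDULE command to negotiator(s)\n"
    )
    for stamp, level, msg in find_errors(sample):
        print(stamp, level, msg)
```

Entries without a parenthesised level, such as the DC_AUTHENTICATE line from the central manager, are skipped by this pattern and would need a looser expression.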

You may find relevant errors in /var/log/condor/SchedLog if there are 
incompatibilities in the SEC_ configuration between HTCondor-CE and the 
local HTCondor.
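
One way to spot such incompatibilities is to compare the SEC_ settings on both sides. The sketch below assumes you have saved the output of `condor_ce_config_val -dump SEC_` and `condor_config_val -dump SEC_` as text; `parse_dump` and `diff_settings` are hypothetical helper names, not HTCondor tools.

```python
def parse_dump(dump_text):
    """Parse 'NAME = value' lines from a config dump, skipping comments."""
    settings = {}
    for line in dump_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def diff_settings(ce, local):
    """Return {name: (ce_value, local_value)} for settings that differ.

    A value of None means the setting is absent on that side.
    """
    diffs = {}
    for name in sorted(set(ce) | set(local)):
        if ce.get(name) != local.get(name):
            diffs[name] = (ce.get(name), local.get(name))
    return diffs
```

A difference is not necessarily an error, since the CE and the local HTCondor are configured separately on purpose. The authentication method lists are the first thing to look at.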

We don't set COLLECTOR_PORT explicitly but instead set COLLECTOR_HOST 
(https://github.com/opensciencegrid/htcondor-ce/blob/master/config/condor_config#L13-L15) 
so I believe that's fine.
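
As a rough illustration of why an unset COLLECTOR_PORT need not matter, here is a sketch that splits a COLLECTOR_HOST value into host and port. It assumes the port is given as a `:port` suffix and that 9618 (the `<Default>` COLLECTOR_PORT in your output) applies otherwise. `collector_endpoint` is a hypothetical name, and the function only handles a single plain host name.

```python
def collector_endpoint(collector_host, default_port=9618):
    """Split a 'host[:port]' value into (host, port).

    Falls back to default_port when no numeric ':port' suffix is present.
    """
    value = collector_host.strip()
    host, sep, port = value.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return value, default_port
```

With the host names from this thread, `collector_endpoint("ce02-htc.cr.cnaf.infn.it:9619")` yields port 9619, while a bare `"htc-2.cr.cnaf.infn.it"` falls back to 9618.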

If you're getting HTCondor from the CHTC repositories, the blahp is 
built-in. It's curious that you have the blahp RPM on your "old CE" but 
you shouldn't need it.

- Brian

On 3/8/19 2:43 PM, Stefano Dal Pra wrote:
> Hello,
>
> I need some help getting a new HTCondor-CE instance working.
> So far I have a working test cluster with HTCondor-CE 3.1.0-1.el7 / 
> condor 8.6.13.
>
> I'm working to set up a second instance with the latest stable releases:
>
> - ce02-htc.cr.cnaf.infn.it:9619, HTCondor-CE 3.2.1-1.el7 / condor 8.6.13
> - htc-2.cr.cnaf.infn.it, Central Manager / Collector 8.6.13
>
> However, submitted jobs (condor_ce_trace from a user interface) are 
> being held:
>
> [root@ce02-htc condor]# condor_ce_q
>
> -- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416> @ 
> 03/08/19 17:58:51
> OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
> dteam039 ID: 26       3/7  17:35      _      _      _      1      1 26.0
> dteam039 ID: 27       3/7  17:42      _      _      _      1      1 27.0
> dteam039 ID: 28       3/8  13:15      _      _      _      1      1 28.0
> dteam039 ID: 31       3/8  17:21      _      _      _      1      1 31.0
>
>
> The match against the job router entries looks like it should be OK:
>
> [root@ce02-htc condor]# condor_ce_config_val -dump JOB_ROUTER_ENTRIES
> # Configuration from machine: ce02-htc.cr.cnaf.infn.it
>
> # Parameters with names that match JOB_ROUTER_ENTRIES:
> JOB_ROUTER_ENTRIES = [
> name = "condor_pool_dteam";
> TargetUniverse = 5;
> Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
> set_requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == 
> "LINUX");
> MaxJobs = 100;
> MaxIdleJobs = 100;
> ]
> [SNIP]
>
> However:
> [root@ce02-htc condor]# condor_ce_q -analyze 31
>
> -- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416>
>
> 031.000:  Job is held.
>
> Hold reason: HTCondor-CE held job due to no matching routes, route job 
> limit, or route failure threshold; see 'HTCondor-CE Troubleshooting 
> Guide'
>
> Looking into the condor-ce logs I see these errors:
>
> JobRouterLog:
>
> 03/08/19 18:17:32 (D_ALWAYS) SECMAN: required authentication with 
> collector at <131.154.195.32:9618> failed, so aborting command 
> QUERY_SCHEDD_ADS.
> 03/08/19 18:17:32 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to 
> authenticate with any method|AUTHENTICATE:1004:Failed to authenticate 
> using FS
> 03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618) 
> Can't find address of schedd
>
>
> CollectorLog:
>
> 03/08/19 18:11:34 (D_ALWAYS:2) Trying to update collector 
> <131.154.192.41:9619>
> 03/08/19 18:11:34 (D_ALWAYS:2) Attempting to send update via TCP to 
> collector ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
> 03/08/19 18:11:34 (D_ALWAYS:2) Sent ad to 1 collectors for 
> dteam039@htc_tier1 Hit=4 Tot=4 Idle=0 Run=0
> 03/08/19 18:11:34 (D_ALWAYS:2) ============ Begin clean_shadow_recs 
> =============
> 03/08/19 18:11:34 (D_ALWAYS:2) ============ End clean_shadow_recs 
> =============
> 03/08/19 18:11:34 (D_ALWAYS:2) Job 32.0 held for spooling. Checking 
> how long...
> 03/08/19 18:11:34 (D_ALWAYS:2) Attribute StageInStart not set in 32.0. 
> Set it.
> 03/08/19 18:11:34 (D_ALWAYS:2) Sending RESCHEDULE command to 
> negotiator(s)
> 03/08/19 18:11:34 (D_ALWAYS:2) Will use TCP to update collector 
> ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
> 03/08/19 18:11:34 (D_ALWAYS:2) Trying to query collector 
> <131.154.192.41:9619>
> 03/08/19 18:11:35 (D_ALWAYS) Can't find address for negotiator
> 03/08/19 18:11:35 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to 
> unknown daemon:
> 03/08/19 18:11:35 (cid:19) (D_AUDIT) 
> Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<131.154.192.239:24028>
> 03/08/19 18:11:35 (cid:19) (D_AUDIT) AuthMethod=GSI, 
> AuthId=/C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Stefano Dal 
> Pra,/dteam/Role=NULL/Capability=NULL, CondorId=dteam039@htc_tier1
> 03/08/19 18:11:35 (D_ALWAYS:2) spoolJobFiles(): read JobAdsArrayLen - 1
> [...]
> 03/08/19 18:11:40 (D_ALWAYS:2) Sent ad to 1 collectors for 
> dteam039@htc_tier1 Hit=4 Tot=4 Idle=1 Run=0
> 03/08/19 18:11:40 (D_ALWAYS:2) ============ Begin clean_shadow_recs 
> =============
> 03/08/19 18:11:40 (D_ALWAYS:2) ============ End clean_shadow_recs 
> =============
> 03/08/19 18:11:40 (D_ALWAYS:2) Sending RESCHEDULE command to 
> negotiator(s)
> 03/08/19 18:11:40 (D_ALWAYS:2) Will use TCP to update collector 
> ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
> 03/08/19 18:11:40 (D_ALWAYS:2) Trying to query collector 
> <131.154.192.41:9619>
> 03/08/19 18:11:40 (D_ALWAYS) Can't find address for negotiator
> 03/08/19 18:11:40 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to 
> unknown daemon:
> 03/08/19 18:11:40 (D_ALWAYS:2) ForkWorker::Fork: New child of 1132279 
> = 1132521
>
> ###########
>
> CollectorLog (in the Central Manager/Collector, ce02-htc)
>
> 03/08/19 21:34:50 DC_AUTHENTICATE: required authentication of 
> 131.154.192.41 failed: AUTHENTICATE:1003:Failed to authenticate with 
> any method|AUTHENTICATE:1004:Failed to authenticate using 
> PASSWORD|AUTHENTICATE:1004:Failed to authenticate using 
> FS|FS:1004:Unable to lstat(/tmp/FS_XXX49WRX5)
>
> ##############
>
> I've been comparing the configuration with that of the working 
> htcondor-ce (ce01-htc.cr.cnaf.infn.it), but I haven't found a solution.
>
> These are the SEC_* settings
> [root@ce02-htc ~]# condor_ce_config_val -dump SEC_ | egrep -v '^#'
>
> CEVIEW.SEC_CLIENT_AUTHENTICATION_METHODS = FS
> CEVIEW.SEC_CLIENT_NEGOTIATION = PREFERRED
> MASTER.SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
> SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = FS,GSI
> SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,GSI
> SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
> SEC_CLAIMTOBE_USER =
> SEC_CLIENT_AUTHENTICATION = OPTIONAL
> SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
> SEC_CLIENT_ENCRYPTION = OPTIONAL
> SEC_CLIENT_INTEGRITY = OPTIONAL
> SEC_CREDENTIAL_REFRESH_INTERVAL = -1
> SEC_DEBUG_PRINT_KEYS = false
> SEC_DEFAULT_AUTHENTICATION = REQUIRED
> SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
> SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
> SEC_DEFAULT_ENCRYPTION = OPTIONAL
> SEC_DEFAULT_INTEGRITY = REQUIRED
> SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
> SEC_INVALIDATE_SESSIONS_VIA_TCP = true
> SEC_PASSWORD_DOMAIN =
> SEC_PASSWORD_FILE =
> SEC_READ_AUTHENTICATION = OPTIONAL
> SEC_READ_ENCRYPTION = OPTIONAL
> SEC_READ_INTEGRITY = OPTIONAL
> SEC_SESSION_DURATION_SLOP = 20
> SEC_TCP_SESSION_TIMEOUT = 20
>
>
> I tried adding password authentication:
> [root@ce02-htc ~]# grep SEC_CLIENT_AUTHENTICATION_METHODS 
> /etc/condor-ce/config.d/01-common-auth.conf
> SEC_CLIENT_AUTHENTICATION_METHODS=FS,GSI,PASSWORD
>
> but then it seems to be overridden:
> [root@ce02-htc ~]# condor_ce_config_val -v 
> SEC_CLIENT_AUTHENTICATION_METHODS
> SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
>  # at: <Environment>
>  # raw: SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
>
> by the condor_ce_* wrapper commands.
>
>
> A few things puzzle me:
>
> [root@ce02-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
> COLLECTOR_PORT = 9618
>  # at: <Default>
>  # raw: COLLECTOR_PORT = 9618
>
> [root@ce02-htc ~]# condor_config_val -v COLLECTOR_PORT
> COLLECTOR_PORT = 9618
>  # at: <Default>
>  # raw: COLLECTOR_PORT = 9618
>
>
> But the other machine has:
>
> [root@ce01-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
> COLLECTOR_PORT = 9619
>  # at: 
> /usr/share/condor-ce/config.d/01-common-collector-defaults.conf, line 11
>  # raw: COLLECTOR_PORT = 9619
>
> [root@ce01-htc ~]# condor_config_val -v COLLECTOR_PORT
> COLLECTOR_PORT = 9618
>  # at: <Default>
>  # raw: COLLECTOR_PORT = 9618
>
>
> The "older" CE has a blah rpm:
> [root@ce01-htc ~]# rpm -qa | grep blah
> condor-classads-blah-patch-0.0.1-1.el7.centos.x86_64
> blahp-1.18.35.bosco-1.osg34.el7.x86_64
>
> But I have not found a blahp rpm in the repo for 8.8.1.
> How does a 3.2.1 HTCondor-CE work on top of a non-HTCondor batch system?
>
>
> Thank you for any help,
> Stefano