Hello,
I need some help getting a new HTCondor-CE instance working.
So far I have a working test cluster with HTCondor-CE 3.1.0-1.el7 /
condor 8.6.13.
I'm working to set up a second instance with the latest stable releases:
- ce02-htc.cr.cnaf.infn.it:9619, HTCondor-CE 3.2.1-1.el7 / condor 8.6.13
- htc-2.cr.cnaf.infn.it, Central Manager / Collector 8.6.13
However, jobs submitted with condor_ce_trace from a user interface
are going on hold:
[root@ce02-htc condor]# condor_ce_q
-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416> @ 03/08/19 17:58:51
OWNER    BATCH_NAME  SUBMITTED   DONE   RUN   IDLE   HOLD  TOTAL JOB_IDS
dteam039 ID: 26      3/7  17:35     _     _      _      1      1 26.0
dteam039 ID: 27      3/7  17:42     _     _      _      1      1 27.0
dteam039 ID: 28      3/8  13:15     _     _      _      1      1 28.0
dteam039 ID: 31      3/8  17:21     _     _      _      1      1 31.0
The match with the job router looks like it should be OK:
[root@ce02-htc condor]# condor_ce_config_val -dump JOB_ROUTER_ENTRIES
# Configuration from machine: ce02-htc.cr.cnaf.infn.it
# Parameters with names that match JOB_ROUTER_ENTRIES:
JOB_ROUTER_ENTRIES = [
name = "condor_pool_dteam";
TargetUniverse = 5;
Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
set_requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX");
MaxJobs = 100;
MaxIdleJobs = 100;
]
[SNIP]
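For reference, a minimal Python sketch of what the route's Requirements expression is expected to do. It assumes that the ClassAd regexp() function performs an unanchored match (similar to Python's re.search) and that a job without the x509UserProxyVoName attribute does not satisfy the route. The helper name route_matches and the sample VO names are hypothetical; this only approximates the expression and is not how the JobRouter is implemented.

```python
import re


def route_matches(vo_name, pattern="dteam"):
    """Approximate: Requirements = regexp("dteam", TARGET.x509UserProxyVoName)."""
    # ASSUMPTION: a missing attribute means the expression does not
    # evaluate to true, so the job does not match the route.
    if vo_name is None:
        return False
    # ASSUMPTION: unanchored match, like re.search.
    return re.search(pattern, vo_name) is not None


# Hypothetical sample values: a dteam job, a job from another VO,
# and a job whose ad carries no VO name at all.
for vo in ("dteam", "atlas", None):
    print(vo, route_matches(vo))
```

If the held job's ad does carry x509UserProxyVoName = "dteam", the route definition itself is probably not the reason for the hold, and the authentication errors in the logs are the more likely cause.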
However:
[root@ce02-htc condor]# condor_ce_q -analyze 31
-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416>
031.000:  Job is held.
Hold reason: HTCondor-CE held job due to no matching routes, route job
limit, or route failure threshold; see 'HTCondor-CE Troubleshooting
Guide'
Looking into the condor-ce logs I see these errors:
JobRouterLog:
03/08/19 18:17:32 (D_ALWAYS) SECMAN: required authentication with
collector at <131.154.195.32:9618> failed, so aborting command
QUERY_SCHEDD_ADS.
03/08/19 18:17:32 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to
authenticate with any method|AUTHENTICATE:1004:Failed to authenticate
using FS
03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618)
Can't find address of schedd
CollectorLog:
03/08/19 18:11:34 (D_ALWAYS:2) Trying to update collector
<131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Attempting to send update via TCP to
collector ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Sent ad to 1 collectors for
dteam039@htc_tier1 Hit=4 Tot=4 Idle=0 Run=0
03/08/19 18:11:34 (D_ALWAYS:2) ============ Begin clean_shadow_recs
=============
03/08/19 18:11:34 (D_ALWAYS:2) ============ End clean_shadow_recs
=============
03/08/19 18:11:34 (D_ALWAYS:2) Job 32.0 held for spooling. Checking
how long...
03/08/19 18:11:34 (D_ALWAYS:2) Attribute StageInStart not set in 32.0.
Set it.
03/08/19 18:11:34 (D_ALWAYS:2) Sending RESCHEDULE command to
negotiator(s)
03/08/19 18:11:34 (D_ALWAYS:2) Will use TCP to update collector
ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Trying to query collector
<131.154.192.41:9619>
03/08/19 18:11:35 (D_ALWAYS) Can't find address for negotiator
03/08/19 18:11:35 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to
unknown daemon:
03/08/19 18:11:35 (cid:19) (D_AUDIT)
Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<131.154.192.239:24028>
03/08/19 18:11:35 (cid:19) (D_AUDIT) AuthMethod=GSI,
AuthId=/C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Stefano Dal
Pra,/dteam/Role=NULL/Capability=NULL, CondorId=dteam039@htc_tier1
03/08/19 18:11:35 (D_ALWAYS:2) spoolJobFiles(): read JobAdsArrayLen - 1
[...]
03/08/19 18:11:40 (D_ALWAYS:2) Sent ad to 1 collectors for
dteam039@htc_tier1 Hit=4 Tot=4 Idle=1 Run=0
03/08/19 18:11:40 (D_ALWAYS:2) ============ Begin clean_shadow_recs
=============
03/08/19 18:11:40 (D_ALWAYS:2) ============ End clean_shadow_recs
=============
03/08/19 18:11:40 (D_ALWAYS:2) Sending RESCHEDULE command to
negotiator(s)
03/08/19 18:11:40 (D_ALWAYS:2) Will use TCP to update collector
ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:40 (D_ALWAYS:2) Trying to query collector
<131.154.192.41:9619>
03/08/19 18:11:40 (D_ALWAYS) Can't find address for negotiator
03/08/19 18:11:40 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to
unknown daemon:
03/08/19 18:11:40 (D_ALWAYS:2) ForkWorker::Fork: New child of 1132279
= 1132521
###########
CollectorLog (on the Central Manager/Collector, htc-2; 131.154.192.41 is ce02-htc)
03/08/19 21:34:50 DC_AUTHENTICATE: required authentication of
131.154.192.41 failed: AUTHENTICATE:1003:Failed to authenticate with
any method|AUTHENTICATE:1004:Failed to authenticate using
PASSWORD|AUTHENTICATE:1004:Failed to authenticate using
FS|FS:1004:Unable to lstat(/tmp/FS_XXX49WRX5)
##############
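The error chains in these logs are '|'-separated entries of the form SOURCE:CODE:message. A small Python sketch, under that assumption about the format, that lists which authentication methods were actually attempted. The function name methods_tried is made up for illustration; the input strings are copied from the log excerpts above.

```python
import re


def methods_tried(error_text):
    """Return the methods named in 'Failed to authenticate using X' entries."""
    methods = []
    for entry in error_text.split("|"):
        match = re.search(r"Failed to authenticate using (\w+)", entry)
        if match and match.group(1) not in methods:
            methods.append(match.group(1))
    return methods


# Error reported by the Central Manager's CollectorLog.
central_manager_error = (
    "AUTHENTICATE:1003:Failed to authenticate with any method|"
    "AUTHENTICATE:1004:Failed to authenticate using PASSWORD|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXX49WRX5)"
)
# Error reported by the CE's JobRouterLog.
job_router_error = (
    "AUTHENTICATE:1003:Failed to authenticate with any method|"
    "AUTHENTICATE:1004:Failed to authenticate using FS"
)

print(methods_tried(central_manager_error))
print(methods_tried(job_router_error))
```

Read this way, the Central Manager attempted PASSWORD and FS, while the JobRouter side only reports FS. GSI does not appear in either chain.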
I've been comparing the configuration with that of the working
HTCondor-CE (ce01-htc.cr.cnaf.infn.it), but I haven't found a solution.
These are the SEC_* settings:
[root@ce02-htc ~]# condor_ce_config_val -dump SEC_ | egrep -v '^#'
CEVIEW.SEC_CLIENT_AUTHENTICATION_METHODS = FS
CEVIEW.SEC_CLIENT_NEGOTIATION = PREFERRED
MASTER.SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = FS,GSI
SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,GSI
SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
SEC_CLAIMTOBE_USER =
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
SEC_CLIENT_ENCRYPTION = OPTIONAL
SEC_CLIENT_INTEGRITY = OPTIONAL
SEC_CREDENTIAL_REFRESH_INTERVAL = -1
SEC_DEBUG_PRINT_KEYS = false
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
SEC_DEFAULT_ENCRYPTION = OPTIONAL
SEC_DEFAULT_INTEGRITY = REQUIRED
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
SEC_INVALIDATE_SESSIONS_VIA_TCP = true
SEC_PASSWORD_DOMAIN =
SEC_PASSWORD_FILE =
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_READ_ENCRYPTION = OPTIONAL
SEC_READ_INTEGRITY = OPTIONAL
SEC_SESSION_DURATION_SLOP = 20
SEC_TCP_SESSION_TIMEOUT = 20
I tried adding password authentication:
[root@ce02-htc ~]# grep SEC_CLIENT_AUTHENTICATION_METHODS
/etc/condor-ce/config.d/01-common-auth.conf
SEC_CLIENT_AUTHENTICATION_METHODS=FS,GSI,PASSWORD
but then it seems to be overridden by the condor_ce_* wrapper commands
(note the <Environment> source):
[root@ce02-htc ~]# condor_ce_config_val -v SEC_CLIENT_AUTHENTICATION_METHODS
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
 # at: <Environment>
 # raw: SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
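A rough Python sketch of why the effective client list matters. It assumes that authentication can only succeed with a method present on both sides, and it takes the Central Manager's accepted methods to be PASSWORD and FS purely because those are the ones named in its CollectorLog error; the real setting on htc-2 is not shown here. The function name common_methods is hypothetical.

```python
def common_methods(client_methods, server_methods):
    """Return the methods offered by the client that the server also accepts."""
    server_set = {m.strip().upper() for m in server_methods.split(",")}
    return [
        m.strip().upper()
        for m in client_methods.split(",")
        if m.strip().upper() in server_set
    ]


# Effective value on ce02-htc (taken from <Environment>).
effective_client = "GSI,FS"
# Value written in 01-common-auth.conf, which is currently overridden.
configured_client = "FS,GSI,PASSWORD"
# ASSUMPTION: inferred from the methods named in the Central Manager's error.
central_manager = "PASSWORD,FS"

print(common_methods(effective_client, central_manager))
print(common_methods(configured_client, central_manager))
```

With the effective list, FS is the only shared method. FS relies on a file in a filesystem visible to both processes, which fits the "Unable to lstat(/tmp/FS_XXX49WRX5)" error when the two daemons run on different hosts. PASSWORD only becomes a shared method if the override from the environment is removed.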
A few things puzzle me:
[root@ce02-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
 # at: <Default>
 # raw: COLLECTOR_PORT = 9618
[root@ce02-htc ~]# condor_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
 # at: <Default>
 # raw: COLLECTOR_PORT = 9618
But the other machine has:
[root@ce01-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9619
 # at: /usr/share/condor-ce/config.d/01-common-collector-defaults.conf, line 11
 # raw: COLLECTOR_PORT = 9619
[root@ce01-htc ~]# condor_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
 # at: <Default>
 # raw: COLLECTOR_PORT = 9618
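Differences like this one are easier to spot by diffing the full dumps from both CEs. A minimal Python sketch, assuming the output of "condor_ce_config_val -dump" has been saved as text on each host and that every setting of interest fits on one "NAME = value" line (multi-line values such as JOB_ROUTER_ENTRIES are not handled). The function names are made up, and the SEC_DEFAULT_AUTHENTICATION value shown for ce01-htc is an assumption added only to have one identical setting in the sample.

```python
def parse_dump(text):
    """Parse single-line 'NAME = value' settings, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def diff_settings(first, second):
    """Return {name: (value_in_first, value_in_second)} for differing settings."""
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


# Sample input; in practice these strings would be read from the saved dumps.
ce01 = parse_dump(
    "# Configuration from machine: ce01-htc.cr.cnaf.infn.it\n"
    "COLLECTOR_PORT = 9619\n"
    "SEC_DEFAULT_AUTHENTICATION = REQUIRED\n"  # ASSUMPTION: not shown for ce01
)
ce02 = parse_dump(
    "# Configuration from machine: ce02-htc.cr.cnaf.infn.it\n"
    "COLLECTOR_PORT = 9618\n"
    "SEC_DEFAULT_AUTHENTICATION = REQUIRED\n"
)

print(diff_settings(ce01, ce02))
```

A setting that is missing on one side shows up with None for that host, which would also reveal a config.d file that is installed on one CE but not the other.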
The "older" CE has a blah rpm:
[root@ce01-htc ~]# rpm -qa | grep blah
condor-classads-blah-patch-0.0.1-1.el7.centos.x86_64
blahp-1.18.35.bosco-1.osg34.el7.x86_64
But I have not found a blahp rpm in the repo for 8.8.1.
How does HTCondor-CE 3.2.1 work on top of a non-HTCondor batch system?
Thank you for any help,
Stefano
_______________________________________________
HTCondor-users mailing list
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/