
Re: [HTCondor-users] token jobs not being routed by HTCondor-CE



Hello all, an update on this:

I replicated the non-working rules on a lightly loaded HTCondor-CE (it serves only one VO), and there they work as expected.
This confirms that the rule syntax is correct.

Then I noticed that on the other CEs there were several non-routed jobs from a VO that recently started using token credentials and
whose job router rule was not yet token-aware. After I fixed that rule, the pending jobs were routed and my rule also started working,
but only for a while. This morning I found several non-routed jobs (QDate around midnight, routing rule correct, i.e. those jobs SHOULD have been routed).
I manually removed those stuck jobs, and the fresh ones that followed were routed flawlessly. The route I'm adding, however, still does not work.
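
A minimal sketch, in Python, of how such stuck jobs could be spotted automatically. It parses text in the format printed by a query like the `condor_ce_q ... -af:j owner jobstatus routedtojobid qdate` quoted further down in this thread, and lists idle jobs whose RoutedToJobId is still undefined after a grace period. The function name, the 30 minute threshold, the "current time" and the second sample line are assumptions made for illustration; none of this is part of HTCondor-CE.

```python
from typing import List, Tuple

IDLE = 1  # JobStatus value of an idle job


def find_unrouted(lines: List[str], now: int, grace_seconds: int = 1800) -> List[Tuple[str, str, int]]:
    """Return (job_id, owner, age_seconds) for idle jobs that have no routed copy."""
    stuck = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        job_id, owner, status, routed_to, qdate = fields[:5]
        if not status.isdigit() or not qdate.isdigit():
            continue  # skip lines that are not in the expected format
        age = now - int(qdate)
        if int(status) == IDLE and routed_to == "undefined" and age > grace_seconds:
            stuck.append((job_id, owner, age))
    return stuck


sample = [
    # Routed job, taken from the condor_ce_q output quoted in this thread.
    "5394875.0 atlassgm006 1 8837395.0 1680039294",
    # Unrouted test job; its status and qdate are hypothetical.
    "5394871.0 herd006 1 undefined 1680035000",
]
now = 1680040000  # hypothetical "current time" in epoch seconds
stuck_jobs = find_unrouted(sample, now)
for job_id, owner, age in stuck_jobs:
    print(f"{job_id} owner={owner} idle and unrouted for {age} s")
```

Feeding it real `condor_ce_q` output at regular intervals would show whether jobs stop being routed at a particular time, which is what the midnight QDate above hints at.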

Questions:
- Is there a maximum length for the list of active routes in JOB_ROUTER_ROUTE_NAMES?
- Is there a "cache effect", so that fixing an error in a JOB_ROUTER_ROUTE_<name> entry does not take effect until <some_cache> expires?
- Is there a (short) timeout for a SciToken job to be routed, after which it has no further chance of being routed?
- If I rename an existing route, does that help with "caching" problems? (Spoiler: no, I just verified that.)

Stefano





On 28/03/23 23:55, Stefano Dal Pra wrote:
Hi Todd, thanks for the advice;
yes, I issued condor_ce_reconfig. The suggested command prints:

[root@ce07-htc ~]# condor_ce_history -l 3250138.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.

And the same for other test jobs in the queue:

[root@ce03-htc ~]# condor_ce_q 6384655.0 -l | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.

Since the REQUIREMENTS expression evaluates to True, my guess is that routing is attempted but fails, possibly because
of some residual problem with that specific token issuer. In fact, token-only jobs are flowing regularly; for example, these ones from atlas:

[root@ce06-htc ~]# cccv JOB_ROUTER_ROUTE_atlas_sam
REQUIREMENTS (x509UserProxyVoName =?= "atlas" && x509UserProxyFirstFQAN =?= "/atlas/Role=lcgadmin/Capability=NULL") || (AuthTokenIssuer =?= "https://atlas-auth.web.cern.ch/" && AuthTokenSubject =?= "5c5d2a4d-9177-3efa-912f-1b4e5c9fb660")
UNIVERSE VANILLA
SET Requirements (TARGET.t1_allow_sam =?= true) && (!StringListMember("gpfs_atlas",t1_GPFS_CHECK ?: "",":"))

[root@ce06-htc ~]# condor_ce_q -cons 'x509userproxyvoname =?= undefined && AuthTokenSubject == "5c5d2a4d-9177-3efa-912f-1b4e5c9fb660"' -af:j owner jobstatus routedtojobid qdate 'formattime(qdate)'
5394875.0 atlassgm006 1 8837395.0 1680039294 Tue Mar 28 23:34:54 2023

The above is a "token only job". However this other one remains idle:

[sdalpra@ui-htc CE5]$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS ; condor_submit -pool ce06-htc.cr.cnaf.infn.it:9619 -remote ce06-htc.cr.cnaf.infn.it -append '+WantRoute = "herd_cloud"' ce_scitok308.sub
Submitting job(s).
1 job(s) submitted to cluster 5394871.

[root@ce06-htc ~]# cccv JOB_ROUTER_ROUTE_herdcloud
REQUIREMENTS (AuthTokenIssuer =?= "https://iam-herd.cloud.cnaf.infn.it/" && AuthTokenSubject =?= "6f925657-f9aa-4cb6-b264-a3b1ee78df57")
UNIVERSE VANILLA
SET Requirements (TARGET.t1_group =?= "herd_cloud")
SET RequestMemory 400
SET MaxJobs 35
SET MaxIdleJobs 12

[root@ce06-htc ~]# condor_ce_q 5394871.0 -af:j owner routedtojobid '(AuthTokenIssuer =?= "https://iam-herd.cloud.cnaf.infn.it/" && AuthTokenSubject =?= "6f925657-f9aa-4cb6-b264-a3b1ee78df57")'
5394871.0 herd006 undefined true

[root@ce06-htc ~]# condor_ce_q 5394871.0 -l | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.


Stefano




On 28/03/23 21:36, Todd Tannenbaum wrote:
On 3/28/2023 5:42 AM, Stefano Dal Pra wrote:

When using (only) x509 and no token, the job is mapped (by Argus) to dteam026.
StringListMember should work the same with dteam007 or dteam026;
however, it only matches with dteam026 (i.e. GSI) and not with dteam007.
I normally check for issuer and subject in the job router; I tried StringListMember to
restrict the check to Owner only.


Hi Stefano -

After changing the route to try StringListMember, did you remember to issue a "condor_ce_reconfig" command?

For job 3250138.0 below, it sure looks like the owner mapping from the token worked fine... perhaps this command will give a clue:
root@host # condor_ce_history -l 3250138.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -
Also see the CE Manual for troubleshooting tips when a job does not route at URL:
 https://htcondor.com/htcondor-ce/v4/troubleshooting/troubleshooting/#jobs-stay-idle-on-the-ce

Hope the above helps, let us know how it goes, feel free to ask for more help if you continue to be stuck.

regards,
Todd




Adding a detail on the submit file used for GSI and SCITOKENS
#submit file for GSI
[sdalpra@ui-htc CE5]$ cat ce_gsi308.sub
universe = vanilla
use_x509userproxy = true
+Owner = undefined
[...]

[sdalpra@ui-htc CE5]$ cat ce_scitok308.sub
universe = vanilla
use_scitokens = true
+Owner = undefined


Stefano



On 28/03/23 11:56, Thomas Hartmann wrote:
Hi Stefano,

what does your token mapping look like?

Just a suspicion, but maybe the token subject is mapped to another user than the X509 mapped user and the requirement
 REQUIREMENTS StringListMember(Owner, "dteam007|dteam026|cmssgm017","|")
does not get triggered?

Cheers,
 Thomas

On 27/03/2023 22.50, Stefano Dal Pra wrote:
Hello to all,

htcondor-ce-5.1.6 + condor-9.0.17 here.

I'm having problems with HTCondor-CE not routing jobs submitted with an IAM token [1]. The same routing rule ([2] or [3]) that works with GSI does not work with tokens.
More notes in [4].

USING GSI
#This works
[sdalpra@ui-htc CE5]$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI ; condor_submit -pool ce07-htc.cr.cnaf.infn.it:9619 -remote ce07-htc.cr.cnaf.infn.it ce_gsi308.sub
Submitting job(s).
1 job(s) submitted to cluster 3250129.

#the job is routed and submitted to condor; note the local user (dteam026), that is mapped by argus
[root@ce07-htc ~]# condor_ce_q 3250129. -af:j owner routedtojobid
3250129.0 dteam026 4991835.0

USING SCITOKENS
#This does not work
[sdalpra@ui-htc CE5]$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS ; condor_submit -pool ce07-htc.cr.cnaf.infn.it:9619 -remote ce07-htc.cr.cnaf.infn.it ce_scitok308.sub
Submitting job(s).
1 job(s) submitted to cluster 3250138.

#the job is never routed. Note that the REQUIREMENTS expression evaluates to true.
[root@ce07-htc ~]# condor_ce_q 3250138. -af:j owner routedtojobid 'StringListMember(Owner, "dteam007|dteam026|cmssgm017","|")'
3250138.0 dteam007 undefined true
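
To make the evaluation above concrete, here is a simplified Python stand-in for the ClassAd StringListMember function. It is only a sketch, not HTCondor code: it assumes case-sensitive matching with whitespace trimmed around each entry, and it ignores corner cases such as undefined arguments. The owner dteam099 is a hypothetical non-member added for contrast.

```python
def string_list_member(item: str, items: str, delimiters: str = " ,") -> bool:
    """Simplified stand-in for ClassAd StringListMember (case-sensitive)."""
    entries, current = [], ""
    # Split the list on any single delimiter character.
    for ch in items:
        if ch in delimiters:
            entries.append(current)
            current = ""
        else:
            current += ch
    entries.append(current)
    # Trim whitespace and drop empty entries before comparing.
    members = [e.strip() for e in entries if e.strip()]
    return item in members


allowed = "dteam007|dteam026|cmssgm017"
for owner in ("dteam026", "dteam007", "dteam099"):
    print(owner, string_list_member(owner, allowed, "|"))
```

Under these assumptions both the GSI-mapped owner (dteam026) and the token-mapped owner (dteam007) are members of the list, which agrees with the `true` printed by condor_ce_q and suggests that the owner check itself is not what distinguishes the two cases.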


[1] The token being used
[sdalpra@ui-htc CE5]$ cat $BEARER_TOKEN_FILE|jwt.py -v
{
  "alg": "RS256",
  "kid": "rsa1"
}
{
  "sub": "9662c0b5-31a1-4478-963e-bdf3783232ed",
  "iss": "https://wlcg.cloud.cnaf.infn.it/",
  "wlcg.groups": [
    "/wlcg",
    "/wlcg/pilots",
    "/wlcg/xfers"
  ],
  "wlcg.ver": "1.0",
  "jti": "4270f069-81d9-48fb-88ef-817a83b98c6a",
  "exp": 1679943559,
  "iat": 1679939959,
  "client_id": "ad852b22-e517-44a4-99e8-7c0660f878a1",
  "scope": "openid compute.create profile compute.read storage.read:/ compute.modify eduperson_entitlement wlcg storage.create:/ offline_access compute.cancel eduperson_scoped_affiliation storage.modify:/ email wlcg.groups",
  "nbf": 1679939959,
  "aud": "https://wlcg.cern.ch/jwt/v1/any"
}
exp: Mon Mar 27 20:59:19 2023
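
For readers without a jwt.py helper, the following Python sketch shows one way to read the same claims using only the standard library. It decodes the payload WITHOUT verifying the signature, so it is suitable for inspection only. The token it decodes is an unsigned stand-in built from the claims above with a dummy signature, not the real token, and the helper names are hypothetical. The assumption that HTCondor exposes `iss` and `sub` as AuthTokenIssuer and AuthTokenSubject is based on how those attributes are used in the routes in this thread.

```python
import base64
import json
import time


def b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(obj: dict) -> str:
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def token_claims(token: str) -> dict:
    """Return the payload claims of a JWT without verifying its signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


# Unsigned stand-in token built from the claims shown above (dummy signature).
claims_in = {
    "sub": "9662c0b5-31a1-4478-963e-bdf3783232ed",
    "iss": "https://wlcg.cloud.cnaf.infn.it/",
    "exp": 1679943559,
}
token = ".".join([
    b64url_encode({"alg": "RS256", "kid": "rsa1"}),
    b64url_encode(claims_in),
    "dummy-signature",
])

claims = token_claims(token)
print("issuer  (expected AuthTokenIssuer): ", claims["iss"])
print("subject (expected AuthTokenSubject):", claims["sub"])
print("already expired:", claims["exp"] < time.time())
```

To inspect a real token, the string read from the file named by $BEARER_TOKEN_FILE would replace the stand-in `token`.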

[2],[3] Jobrouter rules

JOB_ROUTER_ROUTE_routestsci @=jrt
   REQUIREMENTS StringListMember(Owner, "dteam007|dteam026|cmssgm017","|")
   UNIVERSE VANILLA
   SET Requirements (TARGET.t1_group =?= "myfancygroup")
   SET RequestMemory 400
   SET MaxJobs 5
   SET MaxIdleJobs 10
@jrt

JOB_ROUTER_ROUTE_routestgsi @=jrt
   REQUIREMENTS (x509UserProxyVOName == "dteam") || (AuthTokenIssuer =?= "https://wlcg.cloud.cnaf.infn.it/" && AuthTokenSubject =?= "9662c0b5-31a1-4478-963e-bdf3783232ed")
   UNIVERSE VANILLA
   SET Requirements (TARGET.t1_group =?= "testgroup")
@jrt

JOB_ROUTER_ROUTE_NAMES = routestsci routestgsi $(JOB_ROUTER_ROUTE_NAMES)
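
Rule [3] mixes `==` and `=?=`, and a token-only job has no x509UserProxyVOName attribute. The Python sketch below models the ClassAd three-valued logic as I understand it (`==` yields UNDEFINED when an operand is undefined, `=?=` never does, and `||` returns True as soon as one side is True) to show how that REQUIREMENTS expression would evaluate. It is an illustration under those assumptions, not HTCondor code; the helper names and the sample job ads are hypothetical.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def eq(a, b):
    """Model of '==': UNDEFINED if either side is undefined; strings ignore case."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    if isinstance(a, str) and isinstance(b, str):
        return a.lower() == b.lower()
    return a == b


def meta_eq(a, b):
    """Model of '=?=': always True or False; same type and same value required."""
    return type(a) is type(b) and a == b


def and_(a, b):
    """Model of '&&': False wins over UNDEFINED."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def or_(a, b):
    """Model of '||': True wins over UNDEFINED."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False


def routestgsi_requirements(job: dict):
    """REQUIREMENTS of route [3], evaluated against a job ad given as a dict."""
    return or_(
        eq(job.get("x509UserProxyVOName"), "dteam"),
        and_(
            meta_eq(job.get("AuthTokenIssuer"), "https://wlcg.cloud.cnaf.infn.it/"),
            meta_eq(job.get("AuthTokenSubject"), "9662c0b5-31a1-4478-963e-bdf3783232ed"),
        ),
    )


token_job = {
    "AuthTokenIssuer": "https://wlcg.cloud.cnaf.infn.it/",
    "AuthTokenSubject": "9662c0b5-31a1-4478-963e-bdf3783232ed",
}
gsi_job = {"x509UserProxyVOName": "dteam"}
other_job = {}  # neither proxy nor token attributes

for name, job in (("token", token_job), ("gsi", gsi_job), ("other", other_job)):
    print(name, routestgsi_requirements(job))
```

In this model the token-only job still satisfies the expression, because the UNDEFINED produced by `==` is absorbed by the True token clause. Only a job with neither set of attributes ends up UNDEFINED, which a route would not treat as a match.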

[4] Notes

- The SciToken is at least "partially" valid, as the mapping to the local user succeeds.
- The REQUIREMENTS expression matches the condor-ce job, i.e.
     condor_ce_q <jobid> -af 'StringListMember(Owner, "dteam007|dteam026|cmssgm017","|")'
   returns True.
- These rules used to work, as far as I know. More complex REQUIREMENTS expressions were successfully used with tokens.
- I checked rule [2] against a condor-ce at another site, where a colleague agreed to test it; the result is the same: using GSI the job is routed, using SCITOKENS it is not.
- I find nothing useful in the condor-ce logs:

[root@ce07-htc ~]# grep 3250492. /var/log/condor-ce/*Log
/var/log/condor-ce/AuditLog:03/27/23 21:54:54 (cid:18395186) (D_AUDIT) Submitting new job 3250492.0
/var/log/condor-ce/AuditLog:03/27/23 21:54:54 (cid:18395188) (D_AUDIT) Transferring files for jobs 3250492.0
/var/log/condor-ce/SchedLog:03/27/23 21:54:55 (D_ALWAYS) Job 3250492.0 released from hold: Data files spooled

Even at maximum verbosity, nothing shows up in the JobRouterLog.
I'm out of ideas now. Any hints on how to find out what's wrong?
Thanks
Stefano



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685 

