
[HTCondor-users] We can not make JobRouter work



Dear experts,

We (CMS-CRAB) have used the JobRouter for years to edit queued jobs in the CMS global pool, redirecting them from busy sites to sites with possibly available slots, something that we call "overflow". The details of why, and of what happens to those jobs when they run, do not matter here; this is simply to introduce the name.

Alas, we had been using the deprecated JOB_ROUTER_ENTRIES_CMD macro, where a custom script of ours was called to create the routes on the fly. So last month we rewrote our setup using a static routing table in the configuration file that reproduces the same desiderata. But once it was put in production things started to go nuts and we had to disable it.

Unfortunately we have no experience with the new configuration, and nobody else in CMS or glideinWms is using it.

We are running $CondorVersion: 23.9.6 2024-08-08 BuildID: 748275 PackageID: 23.9.6-1 GitSHA: dfdd9eaa $ and followed the example in https://htcondor.readthedocs.io/en/latest/grid-computing/job-router.html, defining a list of mutually exclusive routes as in [1].

But once the daemon starts, the first route is matched to all possible jobs, even when its Requirements = (DESIRED_SITES=="sitename") is not satisfied.

Some things work as expected, i.e. the edit in place and the attribute setting, and the route listed first in JOB_ROUTER_ROUTE_NAMES is clearly the one used first. But that first route is also applied to jobs where DESIRED_SITES is a different string.

Basically all my idle jobs get the same value of NEW_SITES, and I can change that value by changing the order of the routes in JOB_ROUTER_ROUTE_NAMES [2].
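
To make the intended behaviour concrete, here is a minimal Python sketch of what we expect the routes in [1] to do. It is not HTCondor code: the helper name expected_route and the dict-based job ad are made up for illustration, and the site mapping is simply copied from the routes in [1].

```python
# Intended T1 -> overflow mapping, copied from the routes in [1].
OVERFLOW = {
    "T1_DE_KIT": "T2_DE_DESY",
    "T1_ES_PIC": "T2_ES_CIEMAT",
    "T1_FR_CCIN2P3": "T2_FR_GRIF,T2_FR_IPHC",
    "T1_IT_CNAF": "T2_IT_Pisa,T2_IT_Rome",
    "T1_UK_RAL": "T2_UK_London_IC,T2_UK_SGrid_RALPP",
}


def expected_route(job):
    """Return a copy of the job ad as we expect it after routing.

    Hypothetical helper: `job` is a plain dict standing in for a job ad.
    """
    routed = dict(job)
    # Mirrors JOB_ROUTER_SOURCE_JOB_CONSTRAINT: only idle vanilla jobs.
    if job.get("JobUniverse") != 5 or job.get("JobStatus") != 1:
        return routed
    # Routes are mutually exclusive: at most one key can match.
    new_sites = OVERFLOW.get(job.get("DESIRED_SITES"))
    if new_sites is None:
        return routed  # no route matches, the job must stay untouched
    routed["NEW_SITES"] = new_sites
    routed["HasBeenOverflowRouted"] = True
    return routed
```

With this model a job with DESIRED_SITES "T1_ES_PIC" would get NEW_SITES "T2_ES_CIEMAT", and a job with DESIRED_SITES "T1_US_FNAL" would not be edited at all, which is not what we observe in [2].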

Can you spot something that we are doing wrong here?

I also noted that while jobs are routed, the RoutedBy attribute is not set (ref https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#JOB_ROUTER_SOURCE_JOB_CONSTRAINT and https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#JOB_ROUTER_NAME ).

Let us know if there's any more information which I can send.

Thanks!!!

Stefano


[1]

[root@vocms059 config.d]# pwd
/etc/condor/config.d
[root@vocms059 config.d]# cat 90_jobrouter.config
# Configuration file for the JobRouter
#
JOB_ROUTER_NAME = OverflowRouter

JOB_ROUTER_SOURCE_JOB_CONSTRAINT = ((JobUniverse==5) && (jobstatus==1))

# Static route names for each T1
JOB_ROUTER_ROUTE_NAMES = T1_DE_KIT T1_IT_CNAF T1_UK_RAL T1_ES_PIC T1_FR_CCIN2P3

JOB_ROUTER_ROUTE_T1_DE_KIT @=rtkit
 Name = "Overflow:T1_DE_KIT"
 EditJobInPlace = True
 Requirements = (DESIRED_SITES=="T1_DE_KIT")
 SET NEW_SITES "T2_DE_DESY"
 SET HasBeenOverflowRouted True
@rtkit

JOB_ROUTER_ROUTE_T1_ES_PIC @=rtpic
 Name = "Overflow:T1_ES_PIC"
 EditJobInPlace = True
 Requirements = (DESIRED_SITES=="T1_ES_PIC")
 SET NEW_SITES "T2_ES_CIEMAT"
 SET HasBeenOverflowRouted True
@rtpic

JOB_ROUTER_ROUTE_T1_FR_CCIN2P3 @=rtin2p3
 Name = "Overflow:T1_FR_CCIN2P3"
 EditJobInPlace = True
 Requirements = (DESIRED_SITES=="T1_FR_CCIN2P3")
 SET NEW_SITES "T2_FR_GRIF,T2_FR_IPHC"
 SET HasBeenOverflowRouted True
@rtin2p3

JOB_ROUTER_ROUTE_T1_IT_CNAF @=rtcnaf
 Name = "Overflow:T1_IT_CNAF"
 EditJobInPlace = True
 Requirements = (DESIRED_SITES=="T1_IT_CNAF")
 SET NEW_SITES "T2_IT_Pisa,T2_IT_Rome"
 SET HasBeenOverflowRouted True
@rtcnaf

JOB_ROUTER_ROUTE_T1_UK_RAL @=rtral
 Name = "Overflow:T1_UK_RAL"
 EditJobInPlace = True
 Requirements = (DESIRED_SITES=="T1_UK_RAL")
 SET NEW_SITES "T2_UK_London_IC,T2_UK_SGrid_RALPP"
 SET HasBeenOverflowRouted True
@rtral

# How often to poll the job queue to route jobs
JOB_ROUTER_POLLING_PERIOD = 5*60

# Start the Job Router
DAEMON_LIST = $(DAEMON_LIST) JOB_ROUTER
[root@vocms059 config.d]#

[2]

belforte@vocms059/HTCondor> condor_q -con HasBeenOverflowRouted -af:h jobstatus desired_sites new_sites
jobstatus desired_sites             new_sites
1         T1_DE_KIT                 T2_DE_DESY
1         T1_ES_PIC                 T2_DE_DESY
1         T1_FR_CCIN2P3             T2_DE_DESY
1         T1_IT_CNAF                T2_DE_DESY
1         T1_UK_RAL                 T2_DE_DESY
1         T1_US_FNAL                T2_DE_DESY
1         T2_CH_CERN_HLT            T2_DE_DESY
1         T2_CH_CERN_P5             T2_DE_DESY
1         T2_IN_TIFR                T2_DE_DESY
1         T2_IT_Rome                T2_DE_DESY
1         T2_LB_HPC4L               T2_DE_DESY
1         T2_PK_NCP                 T2_DE_DESY
1         T2_TR_METU                T2_DE_DESY
1         T2_UK_SGrid_Bristol       T2_DE_DESY
1         T3_BG_UNI_SOFIA           T2_DE_DESY
1         T3_IN_TIFRCloud           T2_DE_DESY
1         T3_IT_Opportunistic_dodas T2_DE_DESY
1         T3_MX_Cinvestav           T2_DE_DESY
1         T3_TW_TIDC                T2_DE_DESY
1         T3_US_FNALLPC             T2_DE_DESY
1         T3_US_Ookami              T2_DE_DESY
1         T3_US_Test                T2_DE_DESY
1         T3_US_UMD                 T2_DE_DESY
1         T1_IT_CNAF                T2_DE_DESY
1         T1_IT_CNAF                T2_DE_DESY
1         T1_ES_PIC                 T2_DE_DESY
1         T1_FR_CCIN2P3             T2_DE_DESY
1         T1_UK_RAL                 T2_DE_DESY
1         T1_UK_RAL                 T2_DE_DESY
belforte@vocms059/HTCondor>