
Re: [HTCondor-users] CondorCE: jobs taking different routes albeit the same route condition are fulfilled



Hi Tj,

many thanks for the pointer!

Round robin is false, so routing should just follow the configured routes as fixed paths.

The router logs are a bit odd - AFAIS the routes are near their limits. There were cycles on the CE where all jobs got sent over the default "Condor_Pool" route [1] - but in the next cycle the jobs went via the "DESYNAF" route again [2]. I.e., the two cycles [1] and [2] are 10 seconds apart.

For the moment, I have dropped the `Condor_Pool` default route to force everything onto the `DESYNAF` route, but tbh I have no idea why route decisions like [1] happened occasionally.

Cheers,
  Thomas

[1]
12/03/25 16:15:45 === Current Probing Information ===
12/03/25 16:15:45 fsize: 45718426               mtime: 1764774943
12/03/25 16:15:45 first log entry: 102 CreationTimestamp 1756886501
12/03/25 16:15:45 JobRouter: Checking for candidate jobs. routing table is:
Route Name Source Submitted/Max Idle/Max Throttle Recent: Started Succeeded Failed
DESYNAF DESYNAF 94/ 10000 35/ 2000 10.2976 jobs/sec 4 2 0
Condor_Pool Condor_Pool 26/ 10000 9/ 2000 0.312857 jobs/sec 0 0 4
12/03/25 16:15:45 JobRouter: Umbrella constraint: (target.JobUniverse =?= 5 || target.JobUniverse =?= 1) && ( (True) || (Owner == "atlasnf1") ) && (target.ProcId >= 0 && target.JobStatus == 1 && (target.StageInStart is undefined || target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone" && target.Managed isnt "External" && target.Owner isnt Undefined && target.RoutedBy isnt "htcondor-ce")
12/03/25 16:15:45 JobRouter (src=114271.0,dest=164743.0): job mirror synchronized; removing job from internal 'retirement' status
12/03/25 16:15:45 JobRouter (src=114226.0,dest=164755.0): job mirror synchronized; removing job from internal 'retirement' status
12/03/25 16:15:45 JobRouter (src=114313.0,dest=164725.0): job mirror synchronized; removing job from internal 'retirement' status
12/03/25 16:15:45 JobRouter (src=114272.0,dest=164756.0): job mirror synchronized; removing job from internal 'retirement' status
12/03/25 16:15:45 SharedPortClient: sent connection request to local schedd for shared port id schedd_993_b1a2
12/03/25 16:15:45 Setting CpusUsage = 0.0
12/03/25 16:15:45 Setting DiskUsage = 2750
12/03/25 16:15:45 Setting ImageSize = 3500
12/03/25 16:15:45 Setting MemoryUsage = ((ResidentSetSize + 1023) / 1024)
12/03/25 16:15:45 Setting NumJobStarts = 1
12/03/25 16:15:45 Setting RemoteSysCpu = 6.0
12/03/25 16:15:45 Setting RemoteUserCpu = 2.0
12/03/25 16:15:45 Setting ResidentSetSize = 3500
12/03/25 16:15:45 Setting ScratchDirFileCount = 335
12/03/25 16:15:45 JobRouter (src=114306.0,dest=164726.0,route=Condor_Pool): updated job status
12/03/25 16:15:45 SharedPortClient: sent connection request to local schedd for shared port id schedd_993_b1a2
12/03/25 16:15:45 Setting CpusUsage = 0.0
...
### default Condor_Pool route taken from here until next cycle

[2]
12/03/25 16:15:55 === Current Probing Information ===
12/03/25 16:15:55 fsize: 45727554               mtime: 1764774955
12/03/25 16:15:55 first log entry: 102 CreationTimestamp 1756886501
12/03/25 16:15:55 JobRouter: Checking for candidate jobs. routing table is:
Route Name Source Submitted/Max Idle/Max Throttle Recent: Started Succeeded Failed
DESYNAF DESYNAF 94/ 10000 35/ 2000 10.2976 jobs/sec 4 2 0
Condor_Pool Condor_Pool 26/ 10000 9/ 2000 0.312857 jobs/sec 0 0 4
12/03/25 16:15:55 JobRouter: Umbrella constraint: (target.JobUniverse =?= 5 || target.JobUniverse =?= 1) && ( (True) || (Owner == "atlasnf1") ) && (target.ProcId >= 0 && target.JobStatus == 1 && (target.StageInStart is undefined || target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone" && target.Managed isnt "External" && target.Owner isnt Undefined && target.RoutedBy isnt "htcondor-ce")
12/03/25 16:15:55 JobRouter: Found candidate job src=114226.0,route=DESYNAF
12/03/25 16:15:55 JobRouter: Found candidate job src=114271.0,route=DESYNAF
12/03/25 16:15:55 JobRouter: Found candidate job src=114272.0,route=DESYNAF
12/03/25 16:15:55 JobRouter: Found candidate job src=114313.0,route=DESYNAF
12/03/25 16:15:55 SharedPortClient: sent connection request to local schedd for shared port id schedd_993_b1a2
12/03/25 16:15:55 JobRouter (src=114271.0,route=DESYNAF): claimed job
12/03/25 16:15:55 JobRouter post-route transform OnExitHold: does not match job. skippping it.
12/03/25 16:15:55 Will use TCP to update collector bird-htc-master21.desy.de <131.169.56.132:9618?alias=bird-htc-master21.desy.de>
12/03/25 16:15:55 Trying to query collector <131.169.56.132:9618?alias=bird-htc-master21.desy.de>
12/03/25 16:15:55 SharedPortClient: sent connection request to schedd naf-htc-preprod-ce01.desy.de for shared port id schedd_1391_b053
12/03/25 16:15:55 SharedPortClient: sent connection request to schedd naf-htc-preprod-ce01.desy.de for shared port id schedd_1391_b053
12/03/25 16:15:55 JobRouter (src=114271.0,dest=164836.0,route=DESYNAF): submitted job
12/03/25 16:15:55 JobRouter (src=114271.0,dest=164836.0,route=DESYNAF): submitted job has not yet appeared in job queue mirror or was removed (submitted 0 seconds ago)
12/03/25 16:15:55 SharedPortClient: sent connection request to local schedd for shared port id schedd_993_b1a2
12/03/25 16:15:55 JobRouter (src=114226.0,route=DESYNAF): claimed job
12/03/25 16:15:55 JobRouter post-route transform OnExitHold: does not match job. skippping it.
12/03/25 16:15:55 Will use TCP to update collector bird-htc-master21.desy.de <131.169.56.132:9618?alias=bird-htc-master21.desy.de>
12/03/25 16:15:55 Trying to query collector <131.169.56.132:9618?alias=bird-htc-master21.desy.de>
12/03/25 16:15:55 SharedPortClient: sent connection request to schedd naf-htc-preprod-ce01.desy.de for shared port id schedd_1391_b053
12/03/25 16:15:55 SharedPortClient: sent connection request to schedd naf-htc-preprod-ce01.desy.de for shared port id schedd_1391_b053
...
### DESYNAF taken from here ###
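
To see which route each candidate job was assigned in every cycle, the "Found candidate job" lines can be tallied per timestamp. The following is only a minimal sketch: the helper name `routes_per_cycle` is made up for illustration, and treating all lines with the same timestamp as one routing cycle is an assumption that holds only while cycles are at least a second apart.

```python
import re
from collections import Counter, defaultdict

# Matches JobRouter log lines such as:
# 12/03/25 16:15:55 JobRouter: Found candidate job src=114226.0,route=DESYNAF
CANDIDATE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) JobRouter: "
    r"Found candidate job src=(?P<src>[\d.]+),route=(?P<route>\S+)"
)


def routes_per_cycle(lines):
    """Count route decisions per log timestamp (assumed: one timestamp = one cycle)."""
    cycles = defaultdict(Counter)
    for line in lines:
        m = CANDIDATE.match(line.strip())
        if m:
            cycles[m.group("ts")][m.group("route")] += 1
    return {ts: dict(counts) for ts, counts in cycles.items()}


# Sample lines copied from the log excerpts in this thread.
sample = [
    "12/03/25 16:15:55 JobRouter: Found candidate job src=114226.0,route=DESYNAF",
    "12/03/25 16:15:55 JobRouter: Found candidate job src=114271.0,route=DESYNAF",
    "12/03/25 15:38:32 JobRouter: Found candidate job src=114316.0,route=Condor_Pool",
]

for ts, counts in routes_per_cycle(sample).items():
    print(ts, counts)
```

Fed with the full JobRouter log (for example via `routes_per_cycle(open(path))`, with `path` pointing at the log file), cycles in which only `Condor_Pool` shows up should stand out directly.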


On 2025-12-04 00:09, John M Knoeller via HTCondor-users wrote:
This sounds like

    JOB_ROUTER_ROUND_ROBIN_SELECTION = true

is configured in the CE. The default is false.

    condor_ce_config_val -v JOB_ROUTER_ROUND_ROBIN_SELECTION

Will show if that is the case.

Alternatively, there is a limit on the total number of jobs that can be routed to DESYNAF and that limit
is being hit.

    condor_ce_job_router_info -config

should show any limits on the routes.

-tj
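
The routing table that the JobRouter prints at the start of each cycle also carries the Submitted/Max and Idle/Max counts per route, so it can serve as a quick check of how close a route is to those limits. Below is a rough sketch that parses one table row; the function name `route_usage` and the field names in the regular expression are illustrative assumptions, and the row layout is taken from the log excerpts in this thread rather than from any documented format.

```python
import re

# One routing-table row as it appears in the JobRouter log, e.g.:
# DESYNAF DESYNAF 94/ 10000 35/ 2000 10.2976 jobs/sec 4 2 0
# Field names are assumptions based on the table header in the log.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<source>\S+)\s+"
    r"(?P<submitted>\d+)/\s*(?P<max_jobs>\d+)\s+"
    r"(?P<idle>\d+)/\s*(?P<max_idle>\d+)\s+"
    r"(?P<throttle>[\d.]+)\s+jobs/sec\s+"
    r"(?P<started>\d+)\s+(?P<succeeded>\d+)\s+(?P<failed>\d+)\s*$"
)


def route_usage(row):
    """Return the fraction of each limit in use for one table row, or None if it does not parse."""
    m = ROW.match(row.strip())
    if m is None:
        return None
    submitted, max_jobs = int(m["submitted"]), int(m["max_jobs"])
    idle, max_idle = int(m["idle"]), int(m["max_idle"])
    return {
        "route": m["name"],
        "jobs_used": submitted / max_jobs if max_jobs else None,
        "idle_used": idle / max_idle if max_idle else None,
        "recent_failed": int(m["failed"]),
    }


# Rows copied from the routing table in the log excerpts of this thread.
rows = [
    "DESYNAF DESYNAF 94/ 10000 35/ 2000 10.2976 jobs/sec 4 2 0",
    "Condor_Pool Condor_Pool 26/ 10000 9/ 2000 0.312857 jobs/sec 0 0 4",
]

for row in rows:
    print(route_usage(row))
```

The header line of the table does not match the pattern and yields `None`, so whole log files can be passed through line by line.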

------------------------------------------------------------------------
*From:* HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Hartmann <thomas.hartmann@xxxxxxx>
*Sent:* Wednesday, December 3, 2025 9:26 AM
*To:* HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
*Subject:* [HTCondor-users] CondorCE: jobs taking different routes albeit the same route condition are fulfilled

Hi all,

I am sometimes observing that jobs on a test CE take different
routes although the conditions for taking a specific route should be
the same. I.e., some jobs go via the `DESYNAF` route as intended, but
some go via the default `Condor_Pool` route [1].
The two example CE jobs in [1] both have the same `atlasnf1` owner [2]
mapped - which is actually the sole requirement for the `DESYNAF` route
[3], so I would expect both jobs to match this route, right?

Cheers,
  Thomas

[1]
12/03/25 16:05:53 JobRouter: Found candidate job src=114376.0,route=DESYNAF
vs
12/03/25 15:38:32 JobRouter: Found candidate job src=114316.0,route=Condor_Pool

[2]
[root@naf-htc-preprod-ce01 config.d]# sdiff /tmp/114376.DESYNAF.ads /tmp/114316.CondorPool.ads | grep -i owner
Owner = "atlasnf1"                                              Owner = "atlasnf1"

[3]
[root@naf-htc-preprod-ce01 config.d]# cat 99_nafroute.conf
JOB_ROUTER_ROUTE_DESYNAF @=end
   UNIVERSE VANILLA
   REQUIREMENTS Owner == "atlasnf1"
   ...
@end
