Hi Jaime,

I dug a bit into the issue, and the message about a failed daemon RESCHEDULE command really seems to be a red herring; at least, it appears in fulldebug in all cases.

The issue is somewhat difficult to reproduce, as trace jobs are apparently not affected at all, and debugging on the production machines is somewhat awkward.

On one of our production CEs, I have now added both LRMS Condor heads, comma-separated, to the CE's JOB_ROUTER_SCHEDD2_POOL setting, e.g.,

  JOB_ROUTER_SCHEDD2_POOL=mainhead.fqdn.fo:9618,fallbackhead.fqdn.fo:9618

So far it seems to work, and production jobs are propagating to the LRMS Condor. The CE schedd also sends updates to both manager nodes [2].

However, I am not sure whether this is the proper way to attach the CE to the pool's manager(s). For example, I see messages about failing job removals(?) in the router log [3], which does not look healthy. (Maybe the manager list is parsed as a single name here?)

Cheers,
  Thomas

(The active negotiator is running on our condor01 node, and collectors are running on both condor01 and grid-htc-master02.)

[1]
12/15/21 14:58:08 Sending RESCHEDULE command to negotiator(s)
12/15/21 14:58:08 Will use TCP to update collector grid-htcondorce-dev.desy.de <131.169.223.131:9619?alias=grid-htcondorce-dev.desy.de>
12/15/21 14:58:08 Trying to query collector <131.169.223.131:9619?alias=grid-htcondorce-dev.desy.de>
12/15/21 14:58:08 Can't find address for negotiator
12/15/21 14:58:08 Failed to send RESCHEDULE to unknown daemon:
12/15/21 14:58:08 ForkWorker::Fork: New child of 3252024 = 3252206
12/15/21 14:58:08 Number of Active Workers 0

[2] JobRouterLog
12/15/21 15:11:23 HOOK_JOB_FINALIZE not configured.
12/15/21 15:11:23 Will use TCP to update collector condor01.desy.de <131.169.56.33:9618?alias=condor01.desy.de>
12/15/21 15:11:23 Will use TCP to update collector grid-htc-master02.desy.de <131.169.223.100:9618?alias=grid-htc-master02.desy.de>
12/15/21 15:11:23 Trying to query collector <131.169.223.100:9618?alias=grid-htc-master02.desy.de>
12/15/21 15:11:23 SharedPortClient: sent connection request to schedd at <131.169.223.131:9620> for shared port id schedd_3253687_3afb
12/15/21 15:11:23 (6.0) Writing terminate record to user logfile
...

[3] JobRouterLog
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=996335.5,dest=1739316.0,route=Local_Condor): failed to remove dest job: Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter failure (src=993489.0,dest=1737377.0,route=DESYGRID): giving up, because submitted job is still not in job queue mirror (submitted 614 seconds ago). Perhaps it has been removed?
12/15/21 15:42:26 Can't find address for schedd grid-htcondorce1.desy.de
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=993489.0,dest=1737377.0,route=DESYGRID): failed to remove dest job: Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=992300.0,dest=1739178.0,route=Local_Condor): dest job was removed!
12/15/21 15:42:26 Can't find address for schedd grid-htcondorce1.desy.de
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=992300.0,dest=1739178.0,route=Local_Condor): failed to remove dest job: Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=991185.0,dest=1739240.0,route=Local_Condor): dest job was removed!
12/15/21 15:42:26 Can't find address for schedd grid-htcondorce1.desy.de
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=991185.0,dest=1739240.0,route=Local_Condor): failed to remove dest job: Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=997105.0,dest=1739180.0,route=Local_Condor): dest job was removed!
12/15/21 15:42:26 Can't find address for schedd grid-htcondorce1.desy.de
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=997105.0,dest=1739180.0,route=Local_Condor): failed to remove dest job: Unable to find address of grid-htcondorce1.desy.de at condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter (src=993563.0,dest=1739181.0,route=Local_Condor): dest job was removed!
12/15/21 15:42:26 DCSchedd:actOnJobs: Action failed
...
12/15/21 15:42:41 Routing jobs to schedd grid-htcondorce1.desy.de in pool condor01.desy.de:9618,grid-htc-master02.desy.de:9618
...
12/15/21 15:53:35 JobRouter (src=995193.1,dest=1739584.0,route=DESYGRID): failed to remove dest job: Job 1739584.0 not found
12/15/21 15:53:35 JobRouter failure (src=995193.4,dest=1739585.0,route=DESYGRID): giving up, because submitted job is still not in job queue mirror (submitted 606 seconds ago). Perhaps it has been removed?
12/15/21 15:53:35 DCSchedd:actOnJobs: Action failed
12/15/21 15:53:35 JobRouter (src=995193.4,dest=1739585.0,route=DESYGRID): failed to remove dest job: Job 1739585.0 not found
12/15/21 15:53:35 JobRouter (src=993667.0,dest=1741438.0,route=Local_Condor): dest job was removed!
12/15/21 15:53:35 DCSchedd:actOnJobs: Action failed

On 14/12/2021 04.12, Jaime Frey wrote:
> An HA configuration for the Condor LRMS should not be an issue for the CE. I also wouldn't expect real jobs to fail when trace jobs succeeded. I assume excerpt [1] is from the CE SchedLog and [3] from the LRMS Condor configuration?
>
> The messages in [1] don't look like a problem. I expect to see them, since the CE doesn't have a startd or negotiator. Do you see anything else in the logs that's indicative of a problem? Is the Job Router failing to contact the LRMS schedd?
>
> - Jaime
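
Excerpt [3] mixes two different reasons for "failed to remove dest job": the schedd address lookup against the comma-separated pool string, and "Job ... not found". To see how often each occurs per route, a small log tally may help. The following is only a rough sketch in Python using the standard library; the function name tally_failed_removals and the SAMPLE list are hypothetical, and the regular expression assumes the JobRouterLog line layout shown in [3].

```python
import re
from collections import Counter

# Assumed line layout, taken from excerpt [3]:
#   <date> <time> JobRouter (src=A.B,dest=C.D,route=NAME): failed to remove dest job: <reason>
FAILED_REMOVE = re.compile(
    r"JobRouter \(src=(?P<src>[\d.]+),dest=(?P<dest>[\d.]+),route=(?P<route>[^)]+)\): "
    r"failed to remove dest job: (?P<reason>.*)$"
)


def tally_failed_removals(lines):
    """Count failed destination-job removals per (route, reason)."""
    counts = Counter()
    for line in lines:
        match = FAILED_REMOVE.search(line.rstrip("\n"))
        if match is None:
            continue  # not a failed-removal line
        # Collapse individual job ids so identical reasons group together.
        reason = re.sub(r"Job \d+\.\d+", "Job <id>", match.group("reason").strip())
        counts[(match.group("route"), reason)] += 1
    return counts


# A few lines copied from excerpt [3], used here as sample input.
SAMPLE = [
    "12/15/21 15:42:26 JobRouter (src=996335.5,dest=1739316.0,route=Local_Condor): "
    "failed to remove dest job: Unable to find address of grid-htcondorce1.desy.de "
    "at condor01.desy.de:9618,grid-htc-master02.desy.de:9618",
    "12/15/21 15:42:26 JobRouter (src=992300.0,dest=1739178.0,route=Local_Condor): "
    "dest job was removed!",
    "12/15/21 15:53:35 JobRouter (src=995193.1,dest=1739584.0,route=DESYGRID): "
    "failed to remove dest job: Job 1739584.0 not found",
    "12/15/21 15:53:35 JobRouter (src=995193.4,dest=1739585.0,route=DESYGRID): "
    "failed to remove dest job: Job 1739585.0 not found",
]

for (route, reason), n in sorted(tally_failed_removals(SAMPLE).items()):
    print(f"{n:4d}  {route}: {reason}")
```

To run it on a real log, pass an open file object instead of SAMPLE, e.g. tally_failed_removals(open(path)), where path is wherever the CE writes its JobRouterLog.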