
Re: [HTCondor-users] Consistency problems between schedd(s) view and CM?



See?  I told you that these states are fruitful for finding issues...

In the snippet below, I think it's the Startd that may be shutting things down; you may find luck looking in that log.

That said, I feel like I've seen GitHub traffic in the last few months looking at a bug in this neighborhood.  I'll defer to Todd or Jaime to see if they recognize this particular set of logging lines.

Brian

On Dec 3, 2025, at 9:34 AM, Jeff Templon <templon@xxxxxxxxx> wrote:

And the corresponding lines from the ShadowLog:

ShadowLog:12/03/25 13:47:02 Initializing a VANILLA shadow for job 715003.0
ShadowLog:12/03/25 13:47:02 (715003.0) (3737347): LIMIT_DIRECTORY_ACCESS = <unset>
ShadowLog:12/03/25 13:47:02 (715003.0) (3737347): Request to run on slot1_19@xxxxxxxxxxxxxxxxxxxxxxxxx <194.171.101.44:9618?addrs=194.171.101.44-9618+[2a07-8500-120-e080-f0ef-0-65-2c]-9618&alias=wn-lot-032.farm.nikhef.nl&noUDP&sock=startd_3110_c7a0> was REFUSED
ShadowLog:12/03/25 13:47:02 (715003.0) (3737347): Job 715003.0 is being evicted from slot1_19@xxxxxxxxxxxxxxxxxxxxxxxxx
ShadowLog:12/03/25 13:47:02 (715003.0) (3737347): RemoteResource::killStarter(): Could not send command to startd, will retry in 5 seconds
ShadowLog:12/03/25 13:47:07 (715003.0) (3737347): RemoteResource::killStarter(): Could not send command to startd, will retry in 5 seconds
ShadowLog:12/03/25 13:47:12 (715003.0) (3737347): RemoteResource::killStarter(): Could not send command to startd
ShadowLog:12/03/25 13:47:12 (715003.0) (3737347): logEvictEvent with unknown reason (108), not logging.
ShadowLog:12/03/25 13:47:12 (715003.0) (3737347): **** condor_shadow (condor_SHADOW) pid 3737347 EXITING WITH STATUS 108

And on the node, the starter log for that slot:

12/03/25 13:47:01 (pid:3988239) Got vacate code=1009 subcode=0 reason=Claim deactivated
12/03/25 13:47:01 (pid:3988239) ShutdownGraceful all jobs.
12/03/25 13:47:01 (pid:3988239) Process exited, pid=3988456, status=143
12/03/25 13:47:01 (pid:3988239) All jobs have exited... starter exiting
12/03/25 13:47:02 (pid:3988239) **** condor_starter (condor_STARTER) pid 3988239 EXITING WITH STATUS 0
12/03/25 13:47:21 (pid:345302) ******************************************************




On 3 Dec 2025, at 14:44, Jeff Templon <templon@xxxxxxxxx> wrote:

Hi Brian,

Could be - at the times when the discrepancy appears, I see jobs with a RunCount of 4 or more, and this kind of thing in the SchedLog:

SchedLog:12/03/25 13:47:02 (pid:1291) match (slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <194.171.101.44:9618?addrs=194.171.101.44-9618+[2a07-8500-120-e080-f0ef-0-65-2c]-9618&alias=wn-lot-032.farm.nikhef.nl&noUDP&sock=startd_3110_c7a0> for group_atlas.atlb009) switching to job 715003.0
SchedLog:12/03/25 13:47:02 (pid:1291) Starting add_shadow_birthdate(715003.0)
SchedLog:12/03/25 13:47:02 (pid:1291) Started shadow for job 715003.0 on slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <194.171.101.44:9618?addrs=194.171.101.44-9618+[2a07-8500-120-e080-f0ef-0-65-2c]-9618&alias=wn-lot-032.farm.nikhef.nl&noUDP&sock=startd_3110_c7a0> for group_atlas.atlb009, (shadow pid = 3737347)
SchedLog:12/03/25 13:47:12 (pid:1291) Shadow pid 3737347 for job 715003.0 exited with status 108
SchedLog:12/03/25 13:47:12 (pid:1291) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <194.171.101.44:9618?addrs=194.171.101.44-9618+[2a07-8500-120-e080-f0ef-0-65-2c]-9618&alias=wn-lot-032.farm.nikhef.nl&noUDP&sock=startd_3110_c7a0> for group_atlas.atlb009, 715003.0) deleted

I also wonder about the race condition that was mentioned, something about giving the slot back versus keeping the claim open and putting another job into it.
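
As an aside, an untested sketch of listing such jobs across the APs with the htcondor Python bindings, assuming NumJobStarts is the attribute behind the RunCount mentioned above:

#!/usr/bin/env python3
# Untested sketch: list running jobs that have been started several times,
# together with the slot they currently claim.  Assumes the htcondor Python
# bindings and that NumJobStarts is what "RunCount" refers to.
import htcondor

coll = htcondor.Collector()
for sched_ad in coll.locateAll(htcondor.DaemonTypes.Schedd):
    schedd = htcondor.Schedd(sched_ad)
    # JobStatus == 2 means "running"; positional arguments keep this working
    # across binding versions.
    for job in schedd.query("JobStatus == 2 && NumJobStarts >= 4",
                            ["ClusterId", "ProcId", "NumJobStarts", "RemoteHost"]):
        print(sched_ad["Name"],
              "%s.%s" % (job["ClusterId"], job["ProcId"]),
              "starts=%s" % job.get("NumJobStarts"),
              "slot=%s" % job.get("RemoteHost", "?"))

Seeing the same slot name show up under two different APs would support the double-occupancy hypothesis further down in the thread.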

JT


On 3 Dec 2025, at 14:32, Bockelman, Brian <BBockelman@xxxxxxxxxxxxx> wrote:

Hi Jeff,

You might be hitting issues with "claimed / idle" as a missing state in your sum.

After the negotiator hands a match to a schedd, the schedd will "claim" the slot.  It now belongs to the schedd.

Next, the schedd selects a job to run and "activates" the slot (putting the job into the run state).

Under typical conditions, the time between claimed and activated is less than a second - it can be hard to catch in a busy pool.

Under busy conditions - or if there are persistent failures in activation - the numbers diverge: slots end up claimed, but with no jobs running on them.  In the past, I've found a large discrepancy to be a fruitful place to dig for bugs or misconfiguration.
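
For reference, an untested sketch of counting those claimed-but-idle slots with the htcondor Python bindings (standard startd attributes State, Activity and Cpus assumed):

#!/usr/bin/env python3
# Untested sketch: count slots that are Claimed but not running anything.
# Assumes the htcondor Python bindings and the pool's default collector.
import htcondor

coll = htcondor.Collector()
ads = coll.query(htcondor.AdTypes.Startd,
                 'State == "Claimed" && Activity == "Idle"',
                 ["Name", "Cpus"])
print("claimed/idle slots:", len(ads))
print("claimed/idle cores:", sum(int(ad.get("Cpus", 0)) for ad in ads))

The same constraint can be passed to condor_status -constraint on the command line.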

Is this a possible explanation for what you're seeing?

Brian

Sent from my iPhone

On Dec 3, 2025, at 6:05 AM, Jeff Templon <templon@xxxxxxxxx> wrote:

Hi

Thanks for your answer, Christoph.  I can follow your words and agree with them generally, except that "a little bit" is not the right description - see the attached plot.  There are moments where the discrepancy (the amount by which the sum rule is violated) approaches 15%.

Maybe others will chime in; I'm inclined to view the CM as holy, albeit possibly not entirely up to date, as it seems to always be self-consistent.


<ndpf-running-day-top-large.png>



In the plot, the black line should be flat at the actual number of cores, and the colors should sum up to the black line (so no white gaps).
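
For concreteness, an untested sketch of how the two "running" views can be gathered with the htcondor Python bindings (standard attribute names assumed; this is not the actual VOViews code):

#!/usr/bin/env python3
# Untested sketch of the two views being compared: running-job cores according
# to the schedds vs dynamic-slot cores according to the CM.
# Assumes the htcondor Python bindings and standard attribute names.
import htcondor

coll = htcondor.Collector()

# View 1: sum RequestCpus over running jobs on every schedd (AP).
schedd_cores = 0
for sched_ad in coll.locateAll(htcondor.DaemonTypes.Schedd):
    for job in htcondor.Schedd(sched_ad).query("JobStatus == 2", ["RequestCpus"]):
        schedd_cores += int(job.get("RequestCpus", 1))

# View 2: sum Cpus over the dynamic slots the collector (CM) knows about.
slots = coll.query(htcondor.AdTypes.Startd, 'SlotType == "Dynamic"', ["Cpus"])
cm_cores = sum(int(ad.get("Cpus", 0)) for ad in slots)

print("running cores according to the schedds:", schedd_cores)
print("running cores according to the CM:", cm_cores)

When the two numbers differ, dumping the slot names (RemoteHost on the job side, Name on the slot side) from both views shows which slots are claimed by more than one AP.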

JT


> On 3 Dec 2025, at 10:59, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
>
> Hi Jeff,
>
> I have not deliberately checked that in our pool, but I have noticed that, especially in busy situations, HTC acts like most distributed systems: there is a bit of difference between the state databases depending on which entity you query.
>
> In my experience it evens out pretty soon, and the most relevant information is either the AP's or the CM's, depending on what you are aiming for.
>
> You can probably tune the collector and negotiator a bit towards very frequent, short cycles and minimize the claim lifetime, etc.  We did some of that to speed up the interactive part of the NAF, but for classic batch I would rather focus on reliable scale-out and live with some moments of relative uncertainty about a very small fraction of the pool's resources.
>
> Best
> christoph
>
> --
> Christoph Beyer
> DESY Hamburg
> IT-Department
>
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
>
> phone:+49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx
>
> ----- Original Message -----
> From: "Jeff Templon" <templon@xxxxxxxxx>
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Sent: Wednesday, 3 December 2025 10:28:14
> Subject: Re: [HTCondor-users] Consistency problems between schedd(s) view and CM?
>
> Hi,
>
> I see I left some material hanging after an edit and apparently cut out some other info without putting it back in:
>
> The "sum rule" is:
>  - collect information from the schedds about running jobs and add up all the cores from those;
>  - collect information from the CM about free slots, offline slots, and slots blocked by the defrag daemon;
>  - add all of the above: it should equal the total capacity.
>
> The "alternate view" is to collect the running-job information from the CM (dynamic slots) instead of from the schedds.
>
> Maybe that was clear, but I wasn't happy with how it was described when I re-read it.
>
> JT
>
>
>> On 3 Dec 2025, at 10:16, Jeff Templon <templon@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> We have ported our "VOViews" package from Torque to HTCondor.  I've noticed three times now that there are consistency issues.  Basically it goes like this: what the CM says about how many dynamic slots there are, and with how many CPUs, should agree with what a survey of the schedds says about the number of running jobs and the cores occupied by them.  If we then add in the number of offline slots, free slots (hanging around in the Partitionable slot waiting to be occupied) and slots held back from running by the defrag daemon, the result should equal the total capacity.
>>
>> This "sum rule" is what has been violated a few times.  All three times have been moments with a lot of activity on the cluster; in the most recent case, for example, the cluster was opened back up after a downtime, so thousands of queued jobs suddenly found thousands of empty cores to try to occupy.
>>
>> I realized I could get a different "running job" view by collecting information on all the dynamic slots via condor_status.  I did so and compared that view to a simultaneous collection from the schedds.  In the best case they agreed; otherwise it has always (always meaning the ten times I checked) been the case that the schedds have jobs that the CM does not, but never the other way around.
>>
>> Hypothesis: this must mean that some slots are "doubly occupied".  I checked, and indeed, in a few cases one of our CEs (Access Points) had a job from ATLAS in a particular dynamic slot while a different Access Point claimed to have a job from LHCb in that exact same slot.
>>
>> Question: how fresh is the information in the schedds and in the CM supposed to be, or guaranteed to be?  I tried to find that out, as well as a description of whether one of those views is "holier" than the other, but I came up empty-handed - does anybody here know the answers?
>>
>> We have four access points btw.
>>
>> JT

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/