Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] Repeated condor_startd Crashes on HTCondor 24.0.6 (condor_startd Segfault on EPs (Signal 11))

Date: Fri, 11 Apr 2025 15:02:02 +0000
From: Arshad Ahmad <aahmad@xxxxxxxx>
Subject: [HTCondor-users] Repeated condor_startd Crashes on HTCondor 24.0.6 (condor_startd Segfault on EPs (Signal 11))

Hi HTCondor Team,

We recently upgraded to HTCondor 24.0.6, and since then we’ve observed repeated crashes of the condor_startd daemon across multiple worker nodes in our cluster.

The crashes leave core dumps. On initial inspection the stack traces look substantially similar, which suggests a common root cause.

Error Message:

/usr/sbin/condor_startd on workernode died due to signal 11 (Segmentation fault).

Condor will automatically restart this process in 10 seconds.

These failures occur on random workers and are not tied to any specific job or hardware. We did not observe them before the upgrade. The following log message does appear, but we do not believe it is the trigger, because it also appeared in earlier, stable versions:

ERROR “Failed to write ToE tag to .job.ad file”

Excerpt from /var/log/condor/StartLog just before the crash:

04/10/25 09:54:39 Failed to write ToE tag to .job.ad file (13): Permission denied

04/10/25 09:54:41 slot1_21: Got activate claim while starter is still alive.

…

04/10/25 09:54:42 slot1_27: Changing state: Unclaimed -> Delete

Other context:

• Numerous “Error: can’t find resource with ClaimId (…)” messages before the crash

• Multiple slots rapidly transitioning: Preempting/Vacating -> Owner -> Unclaimed

• Crashes seem to occur during periods of high job churn

• So far, the issue has only been observed on nodes running kernel: 5.14.0-503.31.1.el9_5.x86_64
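To check whether these messages really cluster around the crash times, the StartLog can be bucketed per minute. Below is a minimal sketch, assuming the timestamp layout shown in the excerpt above (`MM/DD/YY HH:MM:SS`) and matching only on substrings quoted in this message. The function name, the labels, and the sample list are illustrative and not part of HTCondor.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp layout assumed from the StartLog excerpt: "04/10/25 09:54:39 message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Substrings copied from the log messages quoted in this post.
# The labels on the left are made up for this sketch.
PATTERNS = {
    "toe_tag": "Failed to write ToE tag",
    "claim_id": "find resource with ClaimId",
    "activate": "Got activate claim while starter is still alive",
    "state_change": "Changing state:",
}


def count_per_minute(lines):
    """Count each message type per (minute, label) bucket."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip continuation lines or lines without a timestamp
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        minute = stamp.strftime("%m/%d/%y %H:%M")
        for label, needle in PATTERNS.items():
            if needle in match.group(2):
                counts[(minute, label)] += 1
    return counts


# The three lines from the excerpt above, used as sample input.
SAMPLE = [
    "04/10/25 09:54:39 Failed to write ToE tag to .job.ad file (13): Permission denied",
    "04/10/25 09:54:41 slot1_21: Got activate claim while starter is still alive.",
    "04/10/25 09:54:42 slot1_27: Changing state: Unclaimed -> Delete",
]

if __name__ == "__main__":
    for (minute, label), n in sorted(count_per_minute(SAMPLE).items()):
        print(minute, label, n)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `count_per_minute(open("/var/log/condor/StartLog"))`, and compare the busiest minutes with the times of the segfault notifications.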

We would appreciate your guidance on:

• Whether this is a known issue in 24.0.6

• What further debugging steps we can take

Thank you

Best regards,

Arshad

Follow-Ups:
- Re: [HTCondor-users] Repeated condor_startd Crashes on HTCondor 24.0.6 (condor_startd Segfault on EPs (Signal 11))
  - From: Greg Thain
