
[HTCondor-users] Repeated condor_startd Crashes on HTCondor 24.0.6 (condor_startd Segfault on EPs (Signal 11))



Hi HTCondor Team,
We recently upgraded to HTCondor 24.0.6, and since then the condor_startd daemon has crashed repeatedly, leaving core dumps, on multiple worker nodes in our cluster.
From an initial inspection, the stack traces look substantially similar, which suggests a common root cause.

Error Message:

/usr/sbin/condor_startd on workernode died due to signal 11 (Segmentation fault).

Condor will automatically restart this process in 10 seconds.

The crashes appear to occur at random across workers, are not tied to any specific job or hardware, and were not observed before the upgrade. The following log message does appear, but we do not believe it is the trigger, as it also appeared in earlier, stable versions:
ERROR “Failed to write ToE tag to .job.ad file”

Excerpt from /var/log/condor/StartLog just before the crash:

04/10/25 09:54:39 Failed to write ToE tag to .job.ad file (13): Permission denied

04/10/25 09:54:41 slot1_21: Got activate claim while starter is still alive.

04/10/25 09:54:42 slot1_27: Changing state: Unclaimed -> Delete


Other context:

• Numerous “Error: can’t find resource with ClaimId (…)” messages before the crash

• Multiple slots rapidly transitioning: Preempting/Vacating -> Owner -> Unclaimed

• Crashes seem to occur during periods of high job churn

• So far, the issue has only been observed on nodes running kernel: 5.14.0-503.31.1.el9_5.x86_64
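
To put numbers on the churn described above, the messages quoted in this post could be tallied per minute from the StartLog. The following is only a rough sketch: the label names and the function tally_by_minute are hypothetical, the timestamp is assumed to be month/day/year as in the excerpt, and the fragments are matched as plain substrings of the lines shown here.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp prefix as it appears in the excerpt: "04/10/25 09:54:39 <message>".
# Assumption: the date is month/day/two-digit-year.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Message fragments quoted in this post. The labels are hypothetical names
# chosen for this sketch; they are not HTCondor identifiers.
PATTERNS = {
    "toe_tag_failure": "Failed to write ToE tag",
    "activate_while_alive": "Got activate claim while starter is still alive",
    "unknown_claim_id": "find resource with ClaimId",
    "state_change": "Changing state:",
}


def tally_by_minute(lines):
    """Count how often each known fragment appears, grouped by minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or continuation lines without a timestamp
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        minute = stamp.replace(second=0)
        for label, fragment in PATTERNS.items():
            if fragment in match.group(2):
                counts[(minute, label)] += 1
    return counts


# Lines copied from the StartLog excerpt in this post.
sample = [
    "04/10/25 09:54:39 Failed to write ToE tag to .job.ad file (13): Permission denied",
    "04/10/25 09:54:41 slot1_21: Got activate claim while starter is still alive.",
    "04/10/25 09:54:42 slot1_27: Changing state: Unclaimed -> Delete",
]

for (minute, label), n in sorted(tally_by_minute(sample).items()):
    print(minute.strftime("%m/%d/%y %H:%M"), label, n)
```

On a worker node, the same function could be given an open file handle for /var/log/condor/StartLog instead of the sample list, and the per-minute counts compared with the times at which the segfaults were reported.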

We would appreciate your guidance on:

• Whether this is a known issue in 24.0.6

• What further debugging steps we can take

Thank you

Best regards,

Arshad