Hi Condor Team,
We are testing HTCondor 25 on a subset of our cluster for production evaluation and have noticed random startd crashes on the upgraded nodes. The upgrade is from 24.0.14 to 25.0.7.
Nodes running HTCondor 25 advertise DockerCachedImageSizeMb in the machine ClassAd, and the crashes appear to be related to this attribute, which is new in HTCondor 25. The relevant snippet from the StartLog:
Caught signal 6: si_code=4294967290, si_pid=2079, si_uid=0, si_addr=0x81F
Stack trace highlights:
  DockerAPI::imageCacheUsed()
  condor_startd(MachAttributes::compute_for_update)
StartLog is attached for reference.
We are running Docker 28.5.1, which is consistent across the cluster. Jobs are launched using our custom condor-docker.py wrapper.
Please let us know if any additional information or logs would help diagnose and fix this issue.
Thank you,
Arshad
"/usr/sbin/condor_startd" on "xxx" died due to signal 6 (Aborted).
Condor will automatically restart this process in 10 seconds.
*** /var/log/condor/StartLog:
03/12/26 16:32:50 slot1_17: Got universe "VANILLA" (5) from request classad
03/12/26 16:32:50 slot1_17: State change: claim-activation protocol successful, Starter 1877181
03/12/26 16:32:50 slot1_17: Changing activity: Idle -> Busy
03/12/26 16:37:33 slot1_14[2206144.70]: Received final job ClassAd update from starter. 0 unread bytes
03/12/26 16:37:33 slot1_14: Called deactivate_claim_forcibly()
03/12/26 16:37:33 slot1_14[2206144.70]: vacateJob() ignored because starter is already doing final cleanup (starter pid 993466).
03/12/26 16:37:33 slot1_14: Changing state and activity: Claimed/Busy -> Preempting/Vacating
03/12/26 16:37:33 Starter pid 993466 exited with status 0
03/12/26 16:37:33 slot1_14: State change: starter exited : /var/lib/condor/execute/slot1/dir_48562/userdir/build-HhqYor/BUILD/condor-25.0.7/src/condor_startd.V6/Resource.cpp(1013) Preempting/Vacating
03/12/26 16:37:33 slot1_14: State change: No preempting claim, returning to owner
03/12/26 16:37:33 slot1_14: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/12/26 16:37:33 slot1_14: State change: IS_OWNER is false
03/12/26 16:37:33 slot1_14: Changing state: Owner -> Unclaimed
03/12/26 16:37:33 slot1_14: Slot slot1_14 no longer needed, deleting
03/12/26 16:37:34 STARTD_CRON job healthcheck started. pid=1885680
03/12/26 16:37:52 bind slot DevIds tag=GPUs contraint=
03/12/26 16:37:52 slot1_14: New dSlot of type 1 allocated
03/12/26 16:37:52 slot1_14: Cpus: 1.00, Memory: 2048, Swap: 0.00%, Disk: 0.02%, GPUs: 0
03/12/26 16:37:52 slot1: Request accepted.
03/12/26 16:37:52 slot1_14: Remote owner is ….
03/12/26 16:37:52 slot1_14: Changing state: Owner -> Claimed
03/12/26 16:37:52 slot1_14: State change: claiming protocol successful
03/12/26 16:37:54 slot1_14: Got activate_claim request from shadow (….)
03/12/26 16:54:01 slot1_7: Remote job ID is 58443112.8
03/12/26 16:54:01 slot1_7[58443112.8]: Setting affinity env to 28
03/12/26 16:54:01 slot1_7: Got universe "VANILLA" (5) from request classad
03/12/26 16:54:01 slot1_7: State change: claim-activation protocol successful, Starter 1918308
03/12/26 16:54:01 slot1_7: Changing activity: Idle -> Busy
03/12/26 16:54:03 OfflineUniverses = {}
03/12/26 16:54:17 OfflineUniverses = {}
03/12/26 16:55:27 slot1_17[2206421.34]: Received final job ClassAd update from starter. 0 unread bytes
03/12/26 16:55:27 slot1_17: Called deactivate_claim_forcibly()
03/12/26 16:55:27 slot1_17[2206421.34]: vacateJob() ignored because starter is already doing final cleanup (starter pid 1877181).
03/12/26 16:55:27 slot1_17: Changing state and activity: Claimed/Busy -> Preempting/Vacating
03/12/26 16:55:27 Starter pid 1877181 exited with status 0
03/12/26 16:55:27 slot1_17: State change: starter exited : /var/lib/condor/execute/slot1/dir_48562/userdir/build-HhqYor/BUILD/condor-25.0.7/src/condor_startd.V6/Resource.cpp(1013) Preempting/Vacating
03/12/26 16:55:27 slot1_17: State change: No preempting claim, returning to owner
03/12/26 16:55:27 slot1_17: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/12/26 16:55:27 slot1_17: State change: IS_OWNER is false
03/12/26 16:55:27 slot1_17: Changing state: Owner -> Unclaimed
03/12/26 16:55:27 slot1_17: Slot slot1_17 no longer needed, deleting
Caught signal 6: si_code=4294967290, si_pid=2079, si_uid=0, si_addr=0x81F
Stack dump for process 2079 at timestamp 1773352530 (20 frames)
/lib64/libcondor_utils_25_0_7.so(_Z18dprintf_dump_stackv+0x28)[0x7f661c45b598]
/lib64/libcondor_utils_25_0_7.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6f)[0x7f661c60339f]
/lib64/libc.so.6(+0x3fc30)[0x7f661ba3fc30]
/lib64/libc.so.6(+0x8d02c)[0x7f661ba8d02c]
/lib64/libc.so.6(raise+0x16)[0x7f661ba3fb86]
/lib64/libc.so.6(abort+0xd3)[0x7f661ba29873]
/lib64/libc.so.6(+0x2a1b2)[0x7f661ba2a1b2]
/lib64/libc.so.6(+0x970d7)[0x7f661ba970d7]
/lib64/libc.so.6(+0x98faf)[0x7f661ba98faf]
/lib64/libc.so.6(free+0x55)[0x7f661ba9b405]
/lib64/libcondor_utils_25_0_7.so(_ZN9DockerAPI14imageCacheUsedEv+0x97a)[0x7f661c4562aa]
condor_startd(_ZN14MachAttributes18compute_for_updateEv+0x61)[0x555dae4e9fb1]
condor_startd(+0x75691)[0x555dae514691]
condor_startd(_ZN6ResMgr19eval_and_update_allEi+0x43)[0x555dae5062c3]
/lib64/libcondor_utils_25_0_7.so(_ZN12TimerManager7TimeoutEPiPd+0xef)[0x7f661c61c9ef]
/lib64/libcondor_utils_25_0_7.so(_ZN10DaemonCore6DriverEv+0x193)[0x7f661c5eda33]
/lib64/libcondor_utils_25_0_7.so(_Z7dc_mainiPPc+0x140e)[0x7f661c612ebe]
/lib64/libc.so.6(+0x2a610)[0x7f661ba2a610]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f661ba2a6c0]
condor_startd(_start+0x25)[0x555dae4d09b5]
*** End of file StartLog
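Two fields in the signal header are easier to read once decoded. The sketch below is illustrative only and is not part of HTCondor; the helper names are hypothetical. It assumes si_code was printed as an unsigned 32-bit value and that the stack dump timestamp is Unix epoch seconds. On Linux, an si_code of -6 is SI_TKILL, meaning the signal was raised from within the process via tkill()/tgkill(), which is how glibc's abort() delivers SIGABRT. Together with the free() -> abort() frames above, that is consistent with glibc detecting an invalid or double free inside DockerAPI::imageCacheUsed(), though the trace alone does not prove it.

```python
from datetime import datetime, timezone

# Linux constant: signal was sent with tkill()/tgkill() (used by glibc abort()).
SI_TKILL = -6


def decode_si_code(raw: int) -> int:
    """Reinterpret an unsigned 32-bit si_code as a signed 32-bit integer.

    Assumption: the log printed the signed si_code field as unsigned.
    """
    raw &= 0xFFFFFFFF
    return raw - (1 << 32) if raw >= (1 << 31) else raw


def decode_timestamp(epoch_seconds: int) -> str:
    """Render the stack dump timestamp (assumed Unix epoch seconds) in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    code = decode_si_code(4294967290)      # value from the "Caught signal 6" line
    print(code, code == SI_TKILL)          # -6 True
    print(decode_timestamp(1773352530))    # 2026-03-12T21:55:30+00:00
```

The decoded time falls three seconds after the last StartLog entry (16:55:27) if the node's local time is UTC-5, so the abort occurred immediately after slot1_17 was deleted, during the next eval_and_update_all timer run.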