[HTCondor-users] [Bug Report] Docker Universe Jobs Crash Startd on Condor 24 / Ubuntu 24



Hello,
Since upgrading our execute nodes from Ubuntu 22.04 to Ubuntu 24.04, we've been seeing the condor startd process crash when it tries to run a Docker universe job. Specifically, it was a Docker universe job that used host networking and the chirp server (I mention these in case they are uncommon settings). I've pasted the logs below, but I'm not sure how helpful they'll be.

If you need more information, please let me know.

Thanks,
Jordan

$CondorVersion: 24.0.3 2025-01-03 BuildID: 777902 PackageID: 24.0.3-1+ubu24 GitSHA: ef02b46e $
$CondorPlatform: X86_64-Ubuntu_24.04 $

02/03/25 22:23:43 (pid:1423137) CLASSAD_CACHING is OFF
02/03/25 22:23:43 (pid:1423137) Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
02/03/25 22:23:43 (pid:1423137) SharedPortEndpoint: waiting for connections to named socket slot1_1416438_7538_36
02/03/25 22:23:43 (pid:1423137) DaemonCore: command socket at <10.112.38.113:9618?addrs=10.112.38.113-9618&alias=elucid-dev-dev-1-proc-render-i-030e870216f5f7e3e&noUDP&sock=slot1_1416438_7538_36>
02/03/25 22:23:43 (pid:1423137) DaemonCore: private command socket at <10.112.38.113:9618?addrs=10.112.38.113-9618&alias=elucid-dev-dev-1-proc-render-i-030e870216f5f7e3e&noUDP&sock=slot1_1416438_7538_36>
02/03/25 22:23:43 (pid:1423137) Communicating with shadow <10.112.35.14:9618?addrs=10.112.35.14-9618&alias=elucid-dev-dev-1-manager.dev.cloud.elucid.biz&noUDP&sock=shadow_1256883_86c0_109>
02/03/25 22:23:43 (pid:1423137) Submitting machine is "ip-10-112-35-14.ec2.internal"
02/03/25 22:23:43 (pid:1423137) setting the orig job name in starter
02/03/25 22:23:43 (pid:1423137) setting the orig job iwd in starter
02/03/25 22:23:43 (pid:1423137) Job has WantIOProxy=true
02/03/25 22:23:43 (pid:1423137) Chirp config summary: IO true, Updates true, Delayed updates true.
02/03/25 22:23:43 (pid:1423137) Initialized IO Proxy.
02/03/25 22:23:43 (pid:1423137) Done setting resource limits
02/03/25 22:23:43 (pid:1423137) Job 5.0 set to execute immediately
02/03/25 22:23:43 (pid:1423137) Starting a VANILLA universe job with ID: 5.0
02/03/25 22:23:43 (pid:1423137) Adding /var/lib/condor/execute/dir_1423137/tmp/:/tmp as a docker volume to mount under scratch
02/03/25 22:23:43 (pid:1423137) Adding /var/lib/condor/execute/dir_1423137/var/tmp/:/var/tmp as a docker volume to mount under scratch
Caught signal 6: si_code=4294967290, si_pid=1423137, si_uid=0, si_addr=0x15B721
Stack dump for process 1423137 at timestamp 1738621423 (19 frames)
/lib/libcondor_utils_24_0_3.so(_Z18dprintf_dump_stackv+0x2f)[0x72b7ff3f1acf]
/lib/libcondor_utils_24_0_3.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x80)[0x72b7ff58b570]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x72b7fe445320]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x72b7fe49eb1c]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x72b7fe44526e]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x72b7fe4288ff]
/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x72b7fe4297b6]
/lib/x86_64-linux-gnu/libc.so.6(+0x136c19)[0x72b7fe536c19]
/lib/x86_64-linux-gnu/libc.so.6(+0x1365d4)[0x72b7fe5365d4]
/lib/x86_64-linux-gnu/libc.so.6(+0x137fd5)[0x72b7fe537fd5]
condor_starter(_ZN10DockerProc8StartJobEv+0x20c5)[0x63b47e0d1895]
condor_starter(_ZN7Starter8SpawnJobEv+0x407)[0x63b47e0c65f7]
condor_starter(_ZN7Starter14SpawnPreScriptEi+0x306)[0x63b47e0c6126]
/lib/libcondor_utils_24_0_3.so(_ZN12TimerManager7TimeoutEPiPd+0x153)[0x72b7ff5a0ed3]
/lib/libcondor_utils_24_0_3.so(_ZN10DaemonCore6DriverEv+0x2ed)[0x72b7ff5758ed]
/lib/libcondor_utils_24_0_3.so(_Z7dc_mainiPPc+0x1655)[0x72b7ff598205]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x72b7fe42a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x72b7fe42a28b]
condor_starter(_start+0x25)[0x63b47e0bae85]
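
In case it helps when reading the dump: the frames above appear to follow a `module(symbol+offset)[address]` layout, so the HTCondor frames can be picked out mechanically. The following is only a rough sketch under that assumption; `parse_frame` and `frames_in_module` are hypothetical helper names, not part of HTCondor, and the sample lines are copied from the trace above.

```python
import re

# Assumed frame layout, inferred from the dump above:
#   module(symbol+0xOFFSET)[0xADDRESS]
# The symbol may be empty when no name is available for the address.
FRAME_RE = re.compile(
    r"^(?P<module>[^(\s]+)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Split one stack-dump line into its parts, or return None if it does not match."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return {
        "module": m.group("module"),
        "symbol": m.group("symbol") or None,  # None for unnamed frames
        "offset": int(m.group("offset"), 16),
        "address": int(m.group("address"), 16),
    }


def frames_in_module(lines, module_suffix):
    """Keep only the frames whose module path ends with module_suffix."""
    frames = (parse_frame(line) for line in lines)
    return [f for f in frames if f and f["module"].endswith(module_suffix)]


# A few lines copied verbatim from the trace above.
trace = [
    "/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x72b7fe4288ff]",
    "/lib/x86_64-linux-gnu/libc.so.6(+0x137fd5)[0x72b7fe537fd5]",
    "condor_starter(_ZN10DockerProc8StartJobEv+0x20c5)[0x63b47e0d1895]",
    "condor_starter(_ZN7Starter8SpawnJobEv+0x407)[0x63b47e0c65f7]",
]

for frame in frames_in_module(trace, "condor_starter"):
    print(frame["symbol"], hex(frame["offset"]))
```

The first `condor_starter` frame, `_ZN10DockerProc8StartJobEv`, is the mangled form of `DockerProc::StartJob()`; a demangler such as `c++filt` can translate the remaining symbols. The abort is raised inside libc from a call made in that function.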