Hello,
While upgrading our execute nodes from Ubuntu 22.04 to Ubuntu 24.04, we're seeing the condor startd process crash when it tries to run a Docker universe job. Specifically, it was a Docker universe job that used host networking and the chirp server (I mention these because they may be uncommon settings).
I've pasted the logs below, but I'm not sure how helpful they'll be.
If you need more information, please let me know.
Thanks,
Jordan
$CondorVersion: 24.0.3 2025-01-03 BuildID: 777902 PackageID: 24.0.3-1+ubu24 GitSHA: ef02b46e $ $CondorPlatform: X86_64-Ubuntu_24.04 $
02/03/25 22:23:43 (pid:1423137) CLASSAD_CACHING is OFF
02/03/25 22:23:43 (pid:1423137) Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
02/03/25 22:23:43 (pid:1423137) SharedPortEndpoint: waiting for connections to named socket slot1_1416438_7538_36
02/03/25 22:23:43 (pid:1423137) DaemonCore: command socket at <10.112.38.113:9618?addrs=10.112.38.113-9618&alias=elucid-dev-dev-1-proc-render-i-030e870216f5f7e3e&noUDP&sock=slot1_1416438_7538_36>
02/03/25 22:23:43 (pid:1423137) DaemonCore: private command socket at <10.112.38.113:9618?addrs=10.112.38.113-9618&alias=elucid-dev-dev-1-proc-render-i-030e870216f5f7e3e&noUDP&sock=slot1_1416438_7538_36>
02/03/25 22:23:43 (pid:1423137) Communicating with shadow <10.112.35.14:9618?addrs=10.112.35.14-9618&alias=elucid-dev-dev-1-manager.dev.cloud.elucid.biz&noUDP&sock=shadow_1256883_86c0_109>
02/03/25 22:23:43 (pid:1423137) Submitting machine is "ip-10-112-35-14.ec2.internal"
02/03/25 22:23:43 (pid:1423137) setting the orig job name in starter
02/03/25 22:23:43 (pid:1423137) setting the orig job iwd in starter
02/03/25 22:23:43 (pid:1423137) Job has WantIOProxy=true
02/03/25 22:23:43 (pid:1423137) Chirp config summary: IO true, Updates true, Delayed updates true.
02/03/25 22:23:43 (pid:1423137) Initialized IO Proxy.
02/03/25 22:23:43 (pid:1423137) Done setting resource limits
02/03/25 22:23:43 (pid:1423137) Job 5.0 set to execute immediately
02/03/25 22:23:43 (pid:1423137) Starting a VANILLA universe job with ID: 5.0
02/03/25 22:23:43 (pid:1423137) Adding /var/lib/condor/execute/dir_1423137/tmp/:/tmp as a docker volume to mount under scratch
02/03/25 22:23:43 (pid:1423137) Adding /var/lib/condor/execute/dir_1423137/var/tmp/:/var/tmp as a docker volume to mount under scratch
Caught signal 6: si_code=4294967290, si_pid=1423137, si_uid=0, si_addr=0x15B721
Stack dump for process 1423137 at timestamp 1738621423 (19 frames)
/lib/libcondor_utils_24_0_3.so(_Z18dprintf_dump_stackv+0x2f)[0x72b7ff3f1acf]
/lib/libcondor_utils_24_0_3.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x80)[0x72b7ff58b570]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x72b7fe445320]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x72b7fe49eb1c]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x72b7fe44526e]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x72b7fe4288ff]
/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x72b7fe4297b6]
/lib/x86_64-linux-gnu/libc.so.6(+0x136c19)[0x72b7fe536c19]
/lib/x86_64-linux-gnu/libc.so.6(+0x1365d4)[0x72b7fe5365d4]
/lib/x86_64-linux-gnu/libc.so.6(+0x137fd5)[0x72b7fe537fd5]
condor_starter(_ZN10DockerProc8StartJobEv+0x20c5)[0x63b47e0d1895]
condor_starter(_ZN7Starter8SpawnJobEv+0x407)[0x63b47e0c65f7]
condor_starter(_ZN7Starter14SpawnPreScriptEi+0x306)[0x63b47e0c6126]
/lib/libcondor_utils_24_0_3.so(_ZN12TimerManager7TimeoutEPiPd+0x153)[0x72b7ff5a0ed3]
/lib/libcondor_utils_24_0_3.so(_ZN10DaemonCore6DriverEv+0x2ed)[0x72b7ff5758ed]
/lib/libcondor_utils_24_0_3.so(_Z7dc_mainiPPc+0x1655)[0x72b7ff598205]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x72b7fe42a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x72b7fe42a28b]
condor_starter(_start+0x25)[0x63b47e0bae85]