Hello, everyone.
My name is Geonmo Ryu, and I am an admon of KISTI ALICE Tier-1 center.
For almost a year now, we've been experiencing a phenomenon where roughly 10% of the jobs performed on our site are suddenly terminated and labeled as expired.
We've been debugging this phenomenon diligently with the ALICE Grid Middleware team, and we're currently experiencing the following odd behavior, which I wanted to pass on to the HTCondor team.
Here is a rough process tree of the ALICE jobs currently running on our site.
condor_starter -> agent.startup.N -> JobRunner -> apptainer(PID isolation)-> JobWrapper -> o2-sim -> o2-sim-jobs
When I collected signal information using bcc tools' kill_snoop on WN, I found the following phenomenon.
```
~~~
05:06:04 3897305 o2-sim 9 -28814098 -3
05:06:04 2216143 JobAgent_1145 9 292479 0
05:06:04 3897305 o2-sim 9 0 0
05:06:04 2216125 condor_starter 9 277679 0
05:06:04 2216125 condor_starter 9 224932 0
~~~
```
We found that condor_starter kills all of its children when o2-sim sent a sigkill to PID 0.
I've seen this at least 14 times, so I don't think it's a coincidence.
Also, the condor_starter logs indicate that "agent.startup.N" died because it received signal 9.
-----
05/23/25 18:16:58 (pid:2216125) About to exec /var/lib/condor/execute/dir_2216125/agent.startup.247827
05/23/25 18:16:58 (pid:2216125) Create_Process succeeded, pid=2216127
05/23/25 18:17:06 (pid:2216125) Failed to open '.update.ad' to read update ad: No such file or directory (2).
05/23/25 18:17:06 (pid:2216125) Failed to open '.update.ad' to read update ad: No such file or directory (2).
05/24/25 05:06:04 (pid:2216125) Process exited, pid=2216127, signal=9
05/24/25 05:06:04 (pid:2216125) Failed to write ToE tag to .job.ad file (13): Permission denied
-----
Of course, I think to send a sigkill to PID 0 is a bug, but I'm contacting you because I don't understand how that signal is being sent to "agent.startup" up the pstree, which is separated by the PID namespace.
The ALICE middleware team has tested what happens when JobRunner continues to run without HTCondor to process jobs, and found that it doesn't cause any issues.
So, our guess is that if we can solve the problem of how HTCondor allows to receive sigkills from subprocesses that are separated by PID namespaces and clean up the jobs, we can minimize the damage so that when kill -9 0 occurs.
How did the signal from inside the apptainer container, which isolated the PID namespace inside the slot, sent to the "executable"?
If you know anything about the above issue, please share us.
Regards,
-- Geonmo
For FYI,
``` condor_starter log ```
05/23/25 18:16:58 (pid:2216125) ******************************************************
05/23/25 18:16:58 (pid:2216125) ** condor_starter (CONDOR_STARTER) STARTING UP
05/23/25 18:16:58 (pid:2216125) ** /usr/sbin/condor_starter
05/23/25 18:16:58 (pid:2216125) ** SubsystemInfo: name=STARTER type=STARTER(7) class=DAEMON(1)
05/23/25 18:16:58 (pid:2216125) ** Configuration: subsystem:STARTER local:slot_type_1 class:DAEMON
05/23/25 18:16:58 (pid:2216125) ** $CondorVersion: 23.0.16 2024-10-09 BuildID: 760880 PackageID: 23.0.16-1 $
05/23/25 18:16:58 (pid:2216125) ** $CondorPlatform: x86_64_AlmaLinux9 $
05/23/25 18:16:58 (pid:2216125) ** PID = 2216125
05/23/25 18:16:58 (pid:2216125) ** Log last touched 5/23 18:16:12
05/23/25 18:16:58 (pid:2216125) ******************************************************
05/23/25 18:16:58 (pid:2216125) Using config source: /etc/condor/condor_config
05/23/25 18:16:58 (pid:2216125) Using local config sources:
05/23/25 18:16:58 (pid:2216125) /etc/condor/config.d/00-htcondor-9.0.config
05/23/25 18:16:58 (pid:2216125) /etc/condor/config.d/01-cluster.conf
05/23/25 18:16:58 (pid:2216125) /etc/condor/config.d/02-local.conf
05/23/25 18:16:58 (pid:2216125) /etc/condor/config.d/03-cgroup.conf
05/23/25 18:16:58 (pid:2216125) /etc/condor/config.d/10-stash-plugin.conf
05/23/25 18:16:58 (pid:2216125) /etc/condor/condor_config.local
05/23/25 18:16:58 (pid:2216125) config Macros = 98, Sorted = 96, StringBytes = 3348, TablesBytes = 3608
05/23/25 18:16:58 (pid:2216125) CLASSAD_CACHING is OFF
05/23/25 18:16:58 (pid:2216125) Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
05/23/25 18:16:58 (pid:2216125) SharedPortEndpoint: waiting for connections to named socket slot1_3341883_a4f2_50
05/23/25 18:16:58 (pid:2216125) DaemonCore: command socket at <134.75.125.213:9618?addrs=134.75.125.213-9618+[2001-320-15-125-a1-be-ce-5206]-9618&alias=alice-t1-b03-wn06.sdfarm.kr&noUDP&sock=slot1_3341883_a4f2_50>
05/23/25 18:16:58 (pid:2216125) DaemonCore: private command socket at <134.75.125.213:9618?addrs=134.75.125.213-9618+[2001-320-15-125-a1-be-ce-5206]-9618&alias=alice-t1-b03-wn06.sdfarm.kr&noUDP&sock=slot1_3341883_a4f2_50>
05/23/25 18:16:58 (pid:2216125) Communicating with shadow <134.75.125.147:9618?addrs=134.75.125.147-9618+[2001-320-15-125-a1-be-ce-4003]-9618&alias=alice-t1-ce06.sdfarm.kr&noUDP&sock=shadow_1309471_879a_4907>
05/23/25 18:16:58 (pid:2216125) Submitting machine is "alice-t1-ce06.sdfarm.kr"
05/23/25 18:16:58 (pid:2216125) setting the orig job name in starter
05/23/25 18:16:58 (pid:2216125) setting the orig job iwd in starter
05/23/25 18:16:58 (pid:2216125) Chirp config summary: IO false, Updates false, Delayed updates true.
05/23/25 18:16:58 (pid:2216125) Initialized IO Proxy.
05/23/25 18:16:58 (pid:2216125) Done setting resource limits
05/23/25 18:16:58 (pid:2216125) Set filetransfer runtime ads to /var/lib/condor/execute/dir_2216125/.job.ad and /var/lib/condor/execute/dir_2216125/.machine.ad.
05/23/25 18:16:58 (pid:2216125) File transfer completed successfully.
05/23/25 18:16:58 (pid:2216125) Job 69037.0 set to execute immediately
05/23/25 18:16:58 (pid:2216125) Starting a VANILLA universe job with ID: 69037.0
05/23/25 18:16:58 (pid:2216125) Checking to see if htcondor is a writeable cgroup
05/23/25 18:16:58 (pid:2216125) Cgroup /htcondor is useable
05/23/25 18:16:58 (pid:2216125) Current mount, /tmp, is shared.
05/23/25 18:16:58 (pid:2216125) Current mount, /var/tmp, is shared.
05/23/25 18:16:58 (pid:2216125) IWD: /var/lib/condor/execute/dir_2216125
05/23/25 18:16:58 (pid:2216125) Output file: /var/lib/condor/execute/dir_2216125/_condor_stdout
05/23/25 18:16:58 (pid:2216125) Error file: /var/lib/condor/execute/dir_2216125/_condor_stderr
05/23/25 18:16:58 (pid:2216125) Renice expr "0" evaluated to 0
05/23/25 18:16:58 (pid:2216125) Running job as user sgmalit1_01
05/23/25 18:16:58 (pid:2216125) About to exec /var/lib/condor/execute/dir_2216125/agent.startup.247827
05/23/25 18:16:58 (pid:2216125) Create_Process succeeded, pid=2216127
05/23/25 18:17:06 (pid:2216125) Failed to open '.update.ad' to read update ad: No such file or directory (2).
05/23/25 18:17:06 (pid:2216125) Failed to open '.update.ad' to read update ad: No such file or directory (2).
05/24/25 05:06:04 (pid:2216125) Process exited, pid=2216127, signal=9
05/24/25 05:06:04 (pid:2216125) Failed to write ToE tag to .job.ad file (13): Permission denied
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
05/24/25 05:06:08 (pid:2216125) All jobs have exited... starter exiting
05/24/25 05:06:29 (pid:2216125) **** condor_starter (condor_STARTER) pid 2216125 EXITING WITH STATUS 0
```
-- Geonmo
Attachment:
expired_jobs.png
Description: PNG image