[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Possibility of sigkill signals being passed to parent process from child processes separated by PID namespace



Hello, everyone.


My name is Geonmo Ryu, and I am an admon of KISTI ALICE Tier-1 center. 


For almost a year now, we've been experiencing a phenomenon where roughly 10% of the jobs performed on our site are suddenly terminated and labeled as expired.


We've been debugging this phenomenon diligently with the ALICE Grid Middleware team, and we're currently experiencing the following odd behavior, which I wanted to pass on to the HTCondor team.


Here is a rough process tree of the ALICE jobs currently running on our site.


condor_starter -> agent.startup.N -> JobRunner -> apptainer(PID isolation)-> JobWrapper -> o2-sim -> o2-sim-jobs


When I collected signal information using bcc tools' kill_snoop on WN, I found the following phenomenon.


```

~~~

05:06:04 3897305 o2-sim 9 -28814098 -3

05:06:04 2216143 JobAgent_1145 9 292479 0

05:06:04 3897305 o2-sim 9 0 0

05:06:04 2216125 condor_starter 9 277679 0

05:06:04 2216125 condor_starter 9 224932 0

~~~

```


We found that condor_starter kills all of its children when o2-sim sent a sigkill to PID 0.


I've seen this at least 14 times, so I don't think it's a coincidence.


Also, the condor_starter logs indicate that "agent.startup.N" died because it received signal 9.


-----

05/23/25 18:16:58 (pid:2216125) About to exec /var/lib/condor/execute/dir_2216125/agent.startup.247827 

05/23/25 18:16:58 (pid:2216125) Create_Process succeeded, pid=2216127

05/23/25 18:17:06 (pid:2216125) Failed to open '.update.ad' to read update ad: No such file or directory (2).

05/23/25 18:17:06 (pid:2216125) Failed to open '.update.ad' to read update ad: No such file or directory (2).

05/24/25 05:06:04 (pid:2216125) Process exited, pid=2216127, signal=9

05/24/25 05:06:04 (pid:2216125) Failed to write ToE tag to .job.ad file (13): Permission denied

-----


Of course, I think to send a sigkill to PID 0 is a bug, but I'm contacting you because I don't understand how that signal is being sent to "agent.startup" up the pstree, which is separated by the PID namespace.



The ALICE middleware team has tested what happens when JobRunner continues to run without HTCondor to process jobs, and found that it doesn't cause any issues. 


So, our guess is that if we can solve the problem of how HTCondor allows to receive sigkills from subprocesses that are separated by PID namespaces and clean up the jobs, we can minimize the damage so that when kill -9 0 occurs.



How did the signal from inside the apptainer container, which isolated the PID namespace inside the slot, sent to the "executable"?



If you know anything about the above issue, please share us. 


Regards,


-- Geonmo


For FYI,


``` condor_starter log ```

05/23/25 18:16:58 (pid:2216125) ******************************************************

05/23/25 18:16:58 (pid:2216125) ** condor_starter (CONDOR_STARTER) STARTING UP

05/23/25 18:16:58 (pid:2216125) ** /usr/sbin/condor_starter

05/23/25 18:16:58 (pid:2216125) ** SubsystemInfo: name=STARTER type=STARTER(7) class=DAEMON(1)

05/23/25 18:16:58 (pid:2216125) ** Configuration: subsystem:STARTER local:slot_type_1 class:DAEMON

05/23/25 18:16:58 (pid:2216125) ** $CondorVersion: 23.0.16 2024-10-09 BuildID: 760880 PackageID: 23.0.16-1 $

05/23/25 18:16:58 (pid:2216125) ** $CondorPlatform: x86_64_AlmaLinux9 $

05/23/25 18:16:58 (pid:2216125) ** PID = 2216125

05/23/25 18:16:58 (pid:2216125) ** Log last touched 5/23 18:16:12

05/23/25 18:16:58 (pid:2216125) ******************************************************

05/23/25 18:16:58 (pid:2216125) Using config source: /etc/condor/condor_config

05/23/25 18:16:58 (pid:2216125) Using local config sources: 

05/23/25 18:16:58 (pid:2216125)    /etc/condor/config.d/00-htcondor-9.0.config

05/23/25 18:16:58 (pid:2216125)    /etc/condor/config.d/01-cluster.conf

05/23/25 18:16:58 (pid:2216125)    /etc/condor/config.d/02-local.conf

05/23/25 18:16:58 (pid:2216125)    /etc/condor/config.d/03-cgroup.conf

05/23/25 18:16:58 (pid:2216125)    /etc/condor/config.d/10-stash-plugin.conf

05/23/25 18:16:58 (pid:2216125)    /etc/condor/condor_config.local

05/23/25 18:16:58 (pid:2216125) config Macros = 98, Sorted = 96, StringBytes = 3348, TablesBytes = 3608

05/23/25 18:16:58 (pid:2216125) CLASSAD_CACHING is OFF

05/23/25 18:16:58 (pid:2216125) Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS

05/23/25 18:16:58 (pid:2216125) SharedPortEndpoint: waiting for connections to named socket slot1_3341883_a4f2_50

05/23/25 18:16:58 (pid:2216125) DaemonCore: command socket at <134.75.125.213:9618?addrs=134.75.125.213-9618+[2001-320-15-125-a1-be-ce-5206]-9618&alias=alice-t1-b03-wn06.sdfarm.kr&noUDP&sock=slot1_3341883_a4f2_50>

05/23/25 18:16:58 (pid:2216125) DaemonCore: private command socket at <134.75.125.213:9618?addrs=134.75.125.213-9618+[2001-320-15-125-a1-be-ce-5206]-9618&alias=alice-t1-b03-wn06.sdfarm.kr&noUDP&sock=slot1_3341883_a4f2_50>

05/23/25 18:16:58 (pid:2216125) Communicating with shadow <134.75.125.147:9618?addrs=134.75.125.147-9618+[2001-320-15-125-a1-be-ce-4003]-9618&alias=alice-t1-ce06.sdfarm.kr&noUDP&sock=shadow_1309471_879a_4907>

05/23/25 18:16:58 (pid:2216125) Submitting machine is "alice-t1-ce06.sdfarm.kr"

05/23/25 18:16:58 (pid:2216125) setting the orig job name in starter

05/23/25 18:16:58 (pid:2216125) setting the orig job iwd in starter

05/23/25 18:16:58 (pid:2216125) Chirp config summary: IO false, Updates false, Delayed updates true.

05/23/25 18:16:58 (pid:2216125) Initialized IO Proxy.

05/23/25 18:16:58 (pid:2216125) Done setting resource limits

05/23/25 18:16:58 (pid:2216125) Set filetransfer runtime ads to /var/lib/condor/execute/dir_2216125/.job.ad and /var/lib/condor/execute/dir_2216125/.machine.ad.

05/23/25 18:16:58 (pid:2216125) File transfer completed successfully.

05/23/25 18:16:58 (pid:2216125) Job 69037.0 set to execute immediately

05/23/25 18:16:58 (pid:2216125) Starting a VANILLA universe job with ID: 69037.0

05/23/25 18:16:58 (pid:2216125) Checking to see if htcondor is a writeable cgroup

05/23/25 18:16:58 (pid:2216125)     Cgroup /htcondor is useable

05/23/25 18:16:58 (pid:2216125) Current mount, /tmp, is shared.

05/23/25 18:16:58 (pid:2216125) Current mount, /var/tmp, is shared.

05/23/25 18:16:58 (pid:2216125) IWD: /var/lib/condor/execute/dir_2216125

05/23/25 18:16:58 (pid:2216125) Output file: /var/lib/condor/execute/dir_2216125/_condor_stdout

05/23/25 18:16:58 (pid:2216125) Error file: /var/lib/condor/execute/dir_2216125/_condor_stderr

05/23/25 18:16:58 (pid:2216125) Renice expr "0" evaluated to 0

05/23/25 18:16:58 (pid:2216125) Running job as user sgmalit1_01

05/23/25 18:16:58 (pid:2216125) About to exec /var/lib/condor/execute/dir_2216125/agent.startup.247827 

05/23/25 18:16:58 (pid:2216125) Create_Process succeeded, pid=2216127

05/23/25 18:17:06 (pid:2216125) Failed to open '.update.ad' to read update ad: No such file or directory (2).

05/23/25 18:17:06 (pid:2216125) Failed to open '.update.ad' to read update ad: No such file or directory (2).

05/24/25 05:06:04 (pid:2216125) Process exited, pid=2216127, signal=9

05/24/25 05:06:04 (pid:2216125) Failed to write ToE tag to .job.ad file (13): Permission denied

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:04 (pid:2216125) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_var_lib_condor_execute_slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy

05/24/25 05:06:08 (pid:2216125) All jobs have exited... starter exiting

05/24/25 05:06:29 (pid:2216125) **** condor_starter (condor_STARTER) pid 2216125 EXITING WITH STATUS 0

```








-- Geonmo




Attachment: expired_jobs.png
Description: PNG image