Hello, Greg.
Using the commands below, I checked the server where a job ended with "signal 9" two hours ago, and found no kernel messages about the kill.
-------
[root@alice-t1-b03-wn06 condor]# cat /var/log/messages* | egrep "kill|Kill"
2025-05-09T17:19:42.714208+09:00 alice-t1-b03-wn06.sdfarm.kr kernel: audit: type=1300 audit(1746778782.693:484644038): arch=c000003e syscall=3 success=yes exit=0 a0=6 a1=0 a2=0 a3=7f2e8ade41d0 items=0 ppid=1007979 pid=1010579 auid=556951380 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=1295 comm="killsnoop" exe="/usr/bin/python3.9" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
2025-05-09T17:39:47.561601+09:00 alice-t1-b03-wn06.sdfarm.kr kernel: audit: type=1300 audit(1746779987.529:484645785): arch=c000003e syscall=321 success=yes exit=3 a0=5 a1=7ffc4daffdf0 a2=74 a3=7ffc4daffdf0 items=0 ppid=1007979 pid=3907191 auid=556951380 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=1295 comm="killsnoop" exe="/usr/bin/python3.9" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
2025-05-09T17:47:59.521770+09:00 alice-t1-b03-wn06.sdfarm.kr kernel: audit: type=1300 audit(1746780479.519:484647423): arch=c000003e syscall=3 success=yes exit=0 a0=6 a1=0 a2=0 a3=7ffafc8671d0 items=0 ppid=1007979 pid=3907191 auid=556951380 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=1295 comm="killsnoop" exe="/usr/bin/python3.9" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
[root@alice-t1-b03-wn06 condor]# dmesg | egrep "kill|Kill"
[root@alice-t1-b03-wn06 condor]#
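
The "kill|Kill" pattern above also matches audit records for the killsnoop command, which makes the output harder to read. As a rough cross-check, here is a minimal Python sketch that filters log lines for strings the Linux kernel typically prints on an OOM kill. The function name find_oom_lines, the pattern list, and the sample lines are illustrative assumptions, not taken from this server; exact kernel wording varies by version.

```python
import re

# Substrings commonly seen in kernel OOM reports (assumed, version dependent):
#   "... invoked oom-killer: ..."
#   "Out of memory: Killed process <pid> (<name>) ..."
#   "Memory cgroup out of memory: Killed process <pid> (<name>) ..."
OOM_PATTERN = re.compile(r"out of memory|oom-kill|killed process", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines that look like kernel OOM-killer messages."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    # Invented sample input for illustration only.
    sample = [
        'kernel: audit: type=1300 comm="killsnoop" exe="/usr/bin/python3.9"',
        "kernel: Out of memory: Killed process 1234 (example) total-vm:1000kB",
    ]
    for hit in find_oom_lines(sample):
        print(hit)
```

To run it against real data, the output of dmesg or the contents of /var/log/messages can be passed in as a list of lines. An empty result would be consistent with the kernel OOM killer not being involved, although it does not prove it, since old messages may have been rotated out.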
Also, looking at the o2-sim code, I noticed that it initializes the PID to 0 by default (1) and does not check whether the PID is 0 before it sends a signal (2).
So my guess is that, for some reason, o2-sim fails to obtain the real PID of the process during initialization, keeps the default value of 0, and then sends SIGKILL to PID 0 when it tries to kill that process.
It does check whether the active variable in DeviceInfo is true, but because the structure initializes it to true(1) rather than false, the check probably passes.
Reviewing the above, I think it's fair to say that the o2-sim process sends SIGKILL with PID 0.
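
For context, on POSIX systems kill() with PID 0 does not target a single process: it signals every process in the sender's process group. The sketch below shows, in Python rather than the C++ of o2-sim, the kind of guard that appears to be missing. The helper name safe_kill is hypothetical and this is not the o2-sim implementation, only an illustration of the check.

```python
import os
import signal


def safe_kill(pid: int, sig: int = signal.SIGKILL) -> bool:
    """Send sig to pid, refusing any pid <= 0.

    Hypothetical helper for illustration; not part of o2-sim or HTCondor.
    Returns True if the signal was sent, False if it was refused or the
    target no longer exists.
    """
    # kill(2) treats pid 0 as "every process in the caller's process group"
    # and negative values as process groups (or, for -1, every process the
    # caller may signal), so an uninitialized PID must never get this far.
    if pid <= 0:
        return False
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # The target has already exited.
        return False
    # PermissionError is deliberately left to propagate to the caller.
    return True


if __name__ == "__main__":
    print(safe_kill(0))              # refused, nothing is sent
    print(safe_kill(os.getpid(), 0)) # signal 0 only checks that the PID exists
```

With such a guard, a DeviceInfo entry that still holds the default PID of 0 would be skipped instead of turning into a group-wide SIGKILL.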
Of course, I know that in theory a SIGKILL from o2-sim should not be able to cross PID isolation and reach another executable.
However, I am deeply concerned that some part of HTCondor may have put in place a shortcut that makes this possible.
Regards,
-- Geonmo
Hi Geonmo:
This is, indeed, surprising. Do you know whether o2-sim is explicitly sending SIGKILL to pid 0, or is this perhaps the OOM killer signal coming from the kernel (the dmesg command can show this)?
-greg