Hi,
One more piece of information : on a different node that had the same problem later on
Jul 8 01:46:57 wn-lot-047 systemd: condor.service watchdog timeout (limit 20min)!
Jul 8 01:46:57 wn-lot-047 systemd: condor.service: main process exited, code=dumped, status=6/ABRT
JT
On 7 Jul 2023, at 21:00, Jeff Templon <templon@xxxxxxxxx> wrote:
Hi Folks,
Thanks for the suggestions. We tried RESERVED_MEMORY (8192) and DISABLE_SWAP_FOR_JOB - to no avail, the problem persists. The condor_master on the execute node is choking:
07/07/23 20:27:00 The STARTD (pid 63951) was killed because it was no longer responding
Caught signal 6: si_code=0, si_pid=1, si_uid=0, si_addr=0x1
Stack dump for process 63917 at timestamp 1688755615 (18 frames)
/lib64/libcondor_utils_9_8_0.so(_Z18dprintf_dump_stackv+0x57)[0x7f1e6440b0e7]
/lib64/libcondor_utils_9_8_0.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x68)[0x7f1e64653058]
/lib64/libpthread.so.0(+0xf630)[0x7f1e62963630]
/lib64/libc.so.6(__select+0x13)[0x7f1e6267bb23]
/lib64/libcondor_utils_9_8_0.so(_ZN8Selector7executeEv+0x119)[0x7f1e64527379]
/lib64/libcondor_utils_9_8_0.so(_ZN15NamedPipeReader9read_dataEPvi+0x57)[0x7f1e64676b17]
/lib64/libcondor_utils_9_8_0.so(_ZN16ProcFamilyClient13signal_familyEi21proc_family_command_tRb+0x60)[0x7f1e64675d70]
/lib64/libcondor_utils_9_8_0.so(_ZN15ProcFamilyProxy11kill_familyEi+0x40)[0x7f1e6451fee0]
/lib64/libcondor_utils_9_8_0.so(_ZN10DaemonCore11Kill_FamilyEi+0x16)[0x7f1e6462fd06]
/usr/sbin/condor_master(_ZN6daemon6ExitedEi+0x15d)[0x55978b3920bd]
/usr/sbin/condor_master(_ZN7Daemons13DefaultReaperEii+0x122)[0x55978b395fd2]
/lib64/libcondor_utils_9_8_0.so(_ZN10DaemonCore10CallReaperEiPKcii+0x1e9)[0x7f1e6463e319]
/lib64/libcondor_utils_9_8_0.so(_ZN10DaemonCore17HandleProcessExitEii+0x339)[0x7f1e646425e9]
/lib64/libcondor_utils_9_8_0.so(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x36)[0x7f1e64642746]
/lib64/libcondor_utils_9_8_0.so(_ZN10DaemonCore6DriverEv+0x551)[0x7f1e64643101]
/lib64/libcondor_utils_9_8_0.so(_Z7dc_mainiPPc+0x1838)[0x7f1e64658718]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1e625a8555]
/usr/sbin/condor_master(+0xb759)[0x55978b38b759]
07/07/23 20:47:56 ******************************************************
07/07/23 20:47:56 ** condor_master (CONDOR_MASTER) STARTING UP
07/07/23 20:47:56 ** /usr/sbin/condor_master
There is a core file left behind:

file core.63917
core.63917: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/condor_master -f', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/condor_master', platform: 'x86_64'
Maybe this info helps you help us. Have a good weekend,
JT
On 6 Jul 2023, at 19:21, Todd Tannenbaum via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi Mary,
A couple comments in addition to the wisdom from Greg below:
1. How HTCondor on the Execution Point (EP) reacts to the OOM killer was changed/improved starting with HTCondor ver 10.3.0 to deal with issues like yours below - from the version history in the Manual:

When HTCondor is configured to use cgroups, if the system as a whole is out of memory, and the kernel kills a job with the out of memory killer, HTCondor now checks to see if the job is below the provisioned memory. If so, HTCondor now evicts the job, and marks it as idle, not held, so that it might start again on a machine with sufficient resources. Previously, HTCondor would let this job attempt to run, hoping the next time the OOM killer fired it would pick a different process. (HTCONDOR-1512)
2. Perhaps you want to set a value for RESERVED_MEMORY in your HTCondor config? From the Manual:

RESERVED_MEMORY
How much memory would you like reserved from HTCondor? By default, HTCondor considers all the physical memory of your machine as available to be used by HTCondor jobs. If RESERVED_MEMORY is defined, HTCondor subtracts it from the amount of memory it advertises as available.
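A minimal sketch of what that looks like in a local config file on the EP (the value is illustrative; the macro takes MiB):

```
# condor_config.local on the execute node: hold 8 GiB (value in MiB)
# back for the OS, so HTCondor advertises Memory minus this amount.
RESERVED_MEMORY = 8192
```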
Hope the above plus Greg's ideas below help,
Todd
On 7/6/2023 10:42 AM, Greg Thain via HTCondor-users wrote:
On 7/6/23 10:01, Mary Hester wrote:
Hello HTCondor experts,
We're seeing some interesting behaviour with user jobs on our
local HTCondor cluster, running version 9.8.
Basically, if a job in the cgroup manages to go sufficiently over memory that the container cannot allocate accountable memory needed for basic functioning of the system as a whole (e.g. to hold its cmdline), then the container has an impact on the whole system and will bring it down. This is a worse condition than condor not being able to fully get the status/failure reason for any single specific container. And since oom_kill_disable is set to 1, the kernel will now not intervene, and hence the entire system grinds to a halt. It is preferable to lose state for a single job, have the kernel do its thing, and have the system survive. Right now, the only workaround is to run

for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do echo 0 > $i ; done

in a loop to ensure the sysadmin-intended settings are applied to the condor-managed cgroups.
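The workaround loop above can be wrapped into a small script (a sketch; the path is the cgroup v1 layout described here, and on a real node it must run as root):

```shell
#!/bin/sh
# Re-apply the sysadmin-intended OOM policy to every condor-managed
# cgroup, as in the one-liner from the thread.
# reset_oom_control ROOT - ROOT is the memory-cgroup directory to walk;
# on a real node that would be /sys/fs/cgroup/memory/htcondor.
reset_oom_control() {
    root="$1"
    for f in "$root"/condor*/memory.oom_control; do
        [ -e "$f" ] || continue   # glob matched nothing; skip
        echo 0 > "$f"             # 0 = let the kernel OOM killer intervene
    done
}

# On a real node (as root), e.g. from cron or a systemd timer:
#   reset_oom_control /sys/fs/cgroup/memory/htcondor
```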
Hi Mary:
I'm sorry your system is having problems. Perhaps what is
happening is that there is swap enabled on the system, and cgroups
are limiting the amount of physical memory used by the job, and
the system is paging itself to death before the starter can read
the OOM message. Can you try setting
DISABLE_SWAP_FOR_JOB = true
and see if the problem persists?
The reason condor sets oom_kill_disable to true is that the starter registers to get notified of the OOM event, so that it can know that the reason the job exited was due to OOM kill. It sounds like perhaps the system is so overloaded that this event isn't getting delivered or processed.
Newer versions of HTCondor, and those with cgroup v2, don't set oom_kill_disable; they wait for the cgroup to die, and there is first-class support in the cgroup for querying whether the OOM killer fired. We hope this will be a more reliable method in the future.
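To illustrate that cgroup v2 query: the kernel keeps a per-cgroup oom_kill counter in the memory.events file, so whether the killer fired can be read directly. A sketch (the cgroup path on a real node varies and is only illustrative here):

```shell
#!/bin/sh
# Under cgroup v2, memory.events contains lines like "oom_kill N",
# counting OOM kills inside that cgroup since it was created.
# oom_kill_count FILE -> prints the counter (0 = killer never fired).
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2 }' "$1"
}

# On a real node, FILE would be something like (path illustrative):
#   /sys/fs/cgroup/system.slice/condor.service/.../memory.events
```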
Let us know how this goes,
-greg
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
University of Wisconsin-Madison
Center for High Throughput Computing
Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd
1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
Phone: (608) 263-7132