Hi all
I am struggling to understand how the cgroup mechanism affects my jobs. I have added a fresh node to our cluster and started a lot of jobs on it, but all of a sudden it starts killing my jobs. I have traced it back to the OOM killer. However, the execute machine has 250GB of memory and my jobs are not using anywhere close to that.
I wanted to try tuning the OOM killer, but I can't seem to find the relevant service (systemd-oomd; the OS is Ubuntu 22.04). I also haven't found out how to disable it.
Right now I am able to run about 40 jobs (the node has 48 cores). Each uses about 0.5% of total memory. When I submit more jobs, the OOM killer steps in and kills them.
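To put rough numbers on that, here is a quick back-of-the-envelope sketch. It assumes the 0.5% figure is a fraction of the full 250GB and that all jobs use about the same amount; both are approximations, and the variable names are only for illustration.

```python
# Rough estimate of aggregate job memory, using the approximate figures above.
TOTAL_MEMORY_GB = 250      # physical memory on the execute machine
PER_JOB_FRACTION = 0.005   # each job uses about 0.5% of total memory (approximate)
RUNNING_JOBS = 40          # jobs that run fine before the OOM killer steps in


def aggregate_memory_gb(jobs, per_job_fraction, total_gb):
    """Return the estimated total memory used by all jobs, in GB."""
    return jobs * per_job_fraction * total_gb


per_job_gb = PER_JOB_FRACTION * TOTAL_MEMORY_GB
used_gb = aggregate_memory_gb(RUNNING_JOBS, PER_JOB_FRACTION, TOTAL_MEMORY_GB)

print(f"per job:  ~{per_job_gb:.2f} GB")
print(f"all jobs: ~{used_gb:.0f} GB of {TOTAL_MEMORY_GB} GB "
      f"({100 * used_gb / TOTAL_MEMORY_GB:.0f}%)")
```

So by this estimate the jobs together should sit at roughly a fifth of physical memory when the kills begin.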
I am also noticing that the OS seems to be using a lot of swap even when there is a lot of physical memory available.
Are there any knobs in condor I can tune to aid with this?
P