[HTCondor-users] cgroups question/problem
- Date: Tue, 15 Mar 2016 09:37:01 -0400
- From: Bob Ball <ball@xxxxxxxxx>
- Subject: [HTCondor-users] cgroups question/problem
We recently upgraded to HTCondor 8.4.4 on our worker nodes and are
encountering problems with cgroups. The end of StarterLog.slot10 on one
affected worker node (WN) shows:
03/14/16 22:07:10 (pid:1722453) Running job as user usatlas1
03/14/16 22:07:10 (pid:1722453) Create_Process succeeded, pid=1722478
03/14/16 22:07:10 (pid:1722453) Limiting (soft) memory usage to 4294967296 bytes
03/14/16 22:07:10 (pid:1722453) Limiting (hard) memory usage to 4174556160 bytes
03/14/16 22:07:10 (pid:1722453) Limiting memsw usage to 4174560256 bytes
03/15/16 02:18:26 (pid:1722453) Hold all jobs
03/15/16 02:18:26 (pid:1722453) Job was held due to OOM event: Job has gone over memory limit of 4096 megabytes.
03/15/16 02:18:26 (pid:1722453) ShutdownFast all jobs.
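For reference, the three limits in the log can be converted to MiB with a short script. This is only a sketch for reading the numbers, not a diagnosis: the regular expression and the helper name `parse_limits` are illustrative, and the input is the three "Limiting" lines from the excerpt with timestamps stripped. The soft limit comes out at exactly 4096 MiB, which matches the figure in the OOM message, while the hard limit is a bit lower (about 3981 MiB) and the memsw limit is 4096 bytes above the hard limit.

```python
import re

# The three "Limiting" lines from the StarterLog excerpt (timestamps stripped).
LOG_LINES = [
    "Limiting (soft) memory usage to 4294967296 bytes",
    "Limiting (hard) memory usage to 4174556160 bytes",
    "Limiting memsw usage to 4174560256 bytes",
]

# Matches either "(soft) memory" / "(hard) memory" or the bare "memsw" form.
LIMIT_RE = re.compile(r"Limiting (?:\((\w+)\) memory|(memsw)) usage to (\d+) bytes")


def parse_limits(lines):
    """Return {limit kind: bytes} for each 'Limiting ...' line found."""
    limits = {}
    for line in lines:
        m = LIMIT_RE.search(line)
        if m:
            kind = m.group(1) or m.group(2)
            limits[kind] = int(m.group(3))
    return limits


limits = parse_limits(LOG_LINES)
for kind, nbytes in limits.items():
    # 2**20 bytes per MiB
    print(f"{kind:5s} {nbytes:>12d} bytes = {nbytes / 2**20:8.2f} MiB")
```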
The thing is, the top of that process tree is still around, along with
most (if not all) of the subprocesses.
condor 1722453 1 0 Mar14 ? 00:00:00 condor_starter -f -a slot10 gate04.local
[root@c-117-13 ~]# pstree -h -p -l 1722453
condor_starter(1722453)---bash(1722478)-+-bash(2453448)
                                        `-python(1722744)-+-python(1724660)-+-sh(1725396)---python(1725397)---sh(1738769)---python(1748382)---top-xaod(1748422)
                                                          |                 `-sh(1725882)---MemoryMonitor(1764264)---sh(2453440)---sh(2453441)
                                                          `-{python}(1724539)
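One thing that may be worth checking for processes that survive a ShutdownFast is their scheduler state, since a process in uninterruptible sleep (state `D`) does not act on SIGKILL until it leaves that state. Whether that is what is happening here is not known; the following is just a minimal sketch for reading the state field from `/proc/<pid>/stat`. The helper names `parse_stat` and `process_state` are illustrative, and the pids are the ones from the pstree output above.

```python
from pathlib import Path


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def process_state(pid, proc_root="/proc"):
    """Read (pid, comm, state) for one pid, or None if it no longer exists."""
    try:
        text = (Path(proc_root) / str(pid) / "stat").read_text()
    except (FileNotFoundError, ProcessLookupError):
        return None
    return parse_stat(text)


if __name__ == "__main__":
    # Pids taken from the pstree output; on any other machine they will
    # most likely not exist and are reported as None.
    for pid in (1722453, 1722478, 1722744, 1748422):
        print(pid, process_state(pid))
```

State `S` or `R` would point away from a stuck kernel call, while `D` or `Z` (zombie) would suggest looking at what the process is blocked on or why it has not been reaped.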
We have set the following:
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft
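To compare what the kernel actually has against these settings, the per-job cgroups under BASE_CGROUP can be read directly. The sketch below assumes the cgroup v1 memory controller and uses its standard file names (`memory.soft_limit_in_bytes`, `memory.limit_in_bytes`, `memory.memsw.limit_in_bytes`, `cgroup.procs`). The mount point and the one-directory-per-job layout under `htcondor` are assumptions and may differ on a given node; the helper names are illustrative.

```python
from pathlib import Path

# cgroup v1 memory-controller files corresponding to the three limits
# that the starter logs.
LIMIT_FILES = {
    "soft": "memory.soft_limit_in_bytes",
    "hard": "memory.limit_in_bytes",
    "memsw": "memory.memsw.limit_in_bytes",
}


def read_cgroup(cgroup_dir):
    """Return (limits, pids) for one cgroup v1 memory cgroup directory."""
    cgroup_dir = Path(cgroup_dir)
    limits = {}
    for kind, name in LIMIT_FILES.items():
        path = cgroup_dir / name
        if path.exists():  # memsw is absent when swap accounting is off
            limits[kind] = int(path.read_text().strip())
    procs = cgroup_dir / "cgroup.procs"
    pids = [int(p) for p in procs.read_text().split()] if procs.exists() else []
    return limits, pids


def job_cgroups(memory_root, base_cgroup="htcondor"):
    """List per-job cgroup directories under BASE_CGROUP (layout assumed)."""
    base = Path(memory_root) / base_cgroup
    return sorted(p for p in base.iterdir() if p.is_dir()) if base.is_dir() else []
```

A possible use, assuming the memory controller is mounted at `/sys/fs/cgroup/memory` (on some systems it is `/cgroup/memory` instead): `for cg in job_cgroups("/sys/fs/cgroup/memory"): print(cg.name, *read_cgroup(cg))`. Pids still listed in a held job's `cgroup.procs` are the ones that were not cleaned up.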
So, why are jobs like this still hanging around? There is no progress
at all on this or similar jobs, and they end up wedging the WN.
condor_status no longer shows this WN.
The condor_startd is no longer running, with nothing in the StartLog to
indicate why. The MasterLog contains this:
03/15/16 03:05:01 ERROR: Child pid 178946 appears hung! Killing it hard.
03/15/16 03:05:01 DefaultReaper unexpectedly called on pid 178946, status 9.
03/15/16 03:05:01 The STARTD (pid 178946) was killed because it was no longer responding
The last entry in the ProcLog is simultaneous with the hold on the
slot10 job above, i.e.:
03/15/16 02:18:19 : PROC_FAMILY_GET_USAGE
03/15/16 02:18:19 : gathering usage data for family with root pid 1651340
03/15/16 02:18:26 : PROC_FAMILY_GET_USAGE
03/15/16 02:18:26 : gathering usage data for family with root pid 1722478
03/15/16 02:18:26 : taking a snapshot...
03/15/16 02:18:26 : ProcAPI: new boottime = 1456240746; old_boottime = 1456240745; /proc/stat boottime = 1456240746; /proc/uptime boottime = 1456240746
Is this a bug in 8.4.4? We upgraded from 8.2.10, where we never saw
anything like this.
Thanks,
bob