Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] cgroups question/problem

Date: Tue, 15 Mar 2016 09:37:01 -0400
From: Bob Ball <ball@xxxxxxxxx>
Subject: [HTCondor-users] cgroups question/problem

We have recently upgraded to HTCondor 8.4.4 on our worker nodes, and are encountering problems with cgroups. The issue is this. From the end of StarterLog.slot10 on this particular WN, I find:


03/14/16 22:07:10 (pid:1722453) Running job as user usatlas1
03/14/16 22:07:10 (pid:1722453) Create_Process succeeded, pid=1722478
03/14/16 22:07:10 (pid:1722453) Limiting (soft) memory usage to 4294967296 bytes
03/14/16 22:07:10 (pid:1722453) Limiting (hard) memory usage to 4174556160 bytes
03/14/16 22:07:10 (pid:1722453) Limiting memsw usage to 4174560256 bytes
03/15/16 02:18:26 (pid:1722453) Hold all jobs
03/15/16 02:18:26 (pid:1722453) Job was held due to OOM event: Job has gone over memory limit of 4096 megabytes.
03/15/16 02:18:26 (pid:1722453) ShutdownFast all jobs.
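
As a quick sanity check on the numbers in that log, the byte limits can be converted to MiB. This is only a small Python sketch: the values are copied from the StarterLog lines above, and the names `limits` and `to_mib` are made up for illustration.

```python
MIB = 1024 * 1024  # bytes per mebibyte

# Values copied from the StarterLog lines above.
limits = {
    "soft": 4294967296,
    "hard": 4174556160,
    "memsw": 4174560256,
}


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


for name, value in limits.items():
    print(f"{name:>5}: {value} bytes = {to_mib(value):.2f} MiB")
```

The soft limit works out to exactly 4096 MiB, which matches the limit quoted in the hold message, while the hard limit as logged is lower than the soft one.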

The thing is, the top of that process tree is still around, along with most (if not all) of the subprocesses.

condor 1722453 1 0 Mar14 ? 00:00:00 condor_starter -f -a slot10 gate04.local


[root@c-117-13 ~]# pstree -h -p -l 1722453
condor_starter(1722453)---bash(1722478)-+-bash(2453448)
                                        `-python(1722744)-+-python(1724660)-+-sh(1725396)---python(1725397)---sh(1738769)---python(1748382)---top-xaod(1748422)
                                                          |                 `-sh(1725882)---MemoryMonitor(1764264)---sh(2453440)---sh(2453441)
                                                          `-{python}(1724539)

We have set the following:
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft
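
One way to check whether a leftover process is still inside the configured base cgroup is to read its `/proc/<pid>/cgroup` file. The following is a minimal Python sketch, assuming the cgroup v1 line format `hierarchy-id:controllers:path`. The helper names and the sample cgroup path are hypothetical and are not taken from HTCondor itself.

```python
from pathlib import Path


def parse_cgroup_text(text):
    """Parse /proc/<pid>/cgroup content into a {controller: path} dict."""
    result = {}
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _, controllers, path = parts
        for controller in controllers.split(","):
            result[controller] = path
    return result


def in_base_cgroup(text, base="htcondor", controller="memory"):
    """True if the controller's path sits under the top-level cgroup `base`."""
    path = parse_cgroup_text(text).get(controller, "")
    return path.strip("/").split("/")[0] == base


def read_cgroup(pid):
    """Read the raw cgroup membership of a live process (Linux only)."""
    return Path(f"/proc/{pid}/cgroup").read_text()


# Hypothetical sample content; the slot path below is invented for illustration.
sample = (
    "4:memory:/htcondor/hypothetical_slot10\n"
    "3:cpuacct,cpu:/htcondor/hypothetical_slot10\n"
    "1:name=systemd:/\n"
)
print(in_base_cgroup(sample))
```

On the worker node itself, something like `in_base_cgroup(read_cgroup(1722478))` would apply the same check to the job's top-level bash process.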

So, why are jobs like this still hanging around? Literally, there is no progress on this or similar jobs, and they end up wedging the WN. condor_status no longer shows this WN.

The condor_startd is no longer running, with nothing in the StartLog to indicate why. The MasterLog contains this:

03/15/16 03:05:01 ERROR: Child pid 178946 appears hung! Killing it hard.
03/15/16 03:05:01 DefaultReaper unexpectedly called on pid 178946, status 9.
03/15/16 03:05:01 The STARTD (pid 178946) was killed because it was no longer responding

The last entry in the ProcLog is simultaneous with the hold on the slot10 job above, i.e.,

03/15/16 02:18:19 : PROC_FAMILY_GET_USAGE
03/15/16 02:18:19 : gathering usage data for family with root pid 1651340
03/15/16 02:18:26 : PROC_FAMILY_GET_USAGE
03/15/16 02:18:26 : gathering usage data for family with root pid 1722478
03/15/16 02:18:26 : taking a snapshot...
03/15/16 02:18:26 : ProcAPI: new boottime = 1456240746; old_boottime = 1456240745; /proc/stat boottime = 1456240746; /proc/uptime boottime = 1456240746

Is this a bug in 8.4.4? We had upgraded from 8.2.10 and never saw anything like this.


Thanks,
bob

Follow-Ups:
- Re: [HTCondor-users] cgroups question/problem
  - From: Brian Bockelman
- Re: [HTCondor-users] cgroups question/problem
  - From: Greg Thain
