Re: [HTCondor-users] cgroups question/problem
- Date: Fri, 18 Mar 2016 09:18:00 -0400
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] cgroups question/problem
Hi Bob,
Was the job accessing CVMFS at the same time? We've seen the kernel deadlock (in non-HTCondor contexts) when mixing FUSE filesystems and cgroup-based memory limits.
What happens when you SIGKILL these by hand?
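A minimal sketch of that manual check, assuming a Linux /proc filesystem and root privileges. The function names are hypothetical and are not part of HTCondor. It reads each process state before sending SIGKILL, because a process in uninterruptible sleep (state "D", typical of a task blocked in the kernel on filesystem I/O) will not act on the signal until it leaves that state.

```python
import os
import signal


def parse_proc_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The second field (the command name) is wrapped in parentheses and may
    # itself contain spaces or ')', so split on the LAST ')' in the line.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_uninterruptible(state):
    """True for state 'D': uninterruptible sleep, where SIGKILL stays pending."""
    return state == "D"


def read_state(pid):
    """Read the current state of a live process from /proc."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_proc_state(stat_file.read())


def kill_and_report(pids):
    """Send SIGKILL to each PID and print the state it was in beforehand."""
    for pid in pids:
        try:
            state = read_state(pid)
            os.kill(pid, signal.SIGKILL)
        except (FileNotFoundError, ProcessLookupError):
            print(f"{pid}: already gone")
            continue
        note = " (uninterruptible; the kill may not take effect)" if is_uninterruptible(state) else ""
        print(f"{pid}: state {state}, SIGKILL sent{note}")


# Example call, using PIDs taken from the pstree output quoted below:
# kill_and_report([1748422, 1764264, 1722478])
```

If the processes report state "D" and survive the signal, that would be consistent with a kernel-level hang rather than with HTCondor failing to deliver the kill.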
Brian
> On Mar 15, 2016, at 9:37 AM, Bob Ball <ball@xxxxxxxxx> wrote:
>
> We have recently upgraded to HTCondor 8.4.4 on our worker nodes, and are encountering problems with cgroups. The issue is this. From the end of StarterLog.slot10 on this particular WN, I find
>
> 03/14/16 22:07:10 (pid:1722453) Running job as user usatlas1
> 03/14/16 22:07:10 (pid:1722453) Create_Process succeeded, pid=1722478
> 03/14/16 22:07:10 (pid:1722453) Limiting (soft) memory usage to 4294967296 bytes
> 03/14/16 22:07:10 (pid:1722453) Limiting (hard) memory usage to 4174556160 bytes
> 03/14/16 22:07:10 (pid:1722453) Limiting memsw usage to 4174560256 bytes
> 03/15/16 02:18:26 (pid:1722453) Hold all jobs
> 03/15/16 02:18:26 (pid:1722453) Job was held due to OOM event: Job has gone over memory limit of 4096 megabytes.
> 03/15/16 02:18:26 (pid:1722453) ShutdownFast all jobs.
>
> The thing is, the top of that process tree is still around, along with most (if not all) of the subprocesses.
>
> condor 1722453 1 0 Mar14 ? 00:00:00 condor_starter -f -a slot10 gate04.local
>
> [root@c-117-13 ~]# pstree -h -p -l 1722453
> condor_starter(1722453)---bash(1722478)-+-bash(2453448)
>                                         `-python(1722744)-+-python(1724660)-+-sh(1725396)---python(1725397)---sh(1738769)---python(1748382)---top-xaod(1748422)
>                                                           |                 `-sh(1725882)---MemoryMonitor(1764264)---sh(2453440)---sh(2453441)
>                                                           `-{python}(1724539)
>
> We have set the following:
> BASE_CGROUP = htcondor
> CGROUP_MEMORY_LIMIT_POLICY = soft
>
> So, why are jobs like this still hanging around? There is literally no progress on this or similar jobs, and they end up wedging the WN. condor_status no longer shows this WN.
>
> The condor_startd is no longer running, with nothing in the StartLog to indicate why. The MasterLog contains this
> 03/15/16 03:05:01 ERROR: Child pid 178946 appears hung! Killing it hard.
> 03/15/16 03:05:01 DefaultReaper unexpectedly called on pid 178946, status 9.
> 03/15/16 03:05:01 The STARTD (pid 178946) was killed because it was no longer responding
>
> The last entry in the ProcLog is simultaneous with the hold on the slot10 job above, i.e.,
> 03/15/16 02:18:19 : PROC_FAMILY_GET_USAGE
> 03/15/16 02:18:19 : gathering usage data for family with root pid 1651340
> 03/15/16 02:18:26 : PROC_FAMILY_GET_USAGE
> 03/15/16 02:18:26 : gathering usage data for family with root pid 1722478
> 03/15/16 02:18:26 : taking a snapshot...
> 03/15/16 02:18:26 : ProcAPI: new boottime = 1456240746; old_boottime = 1456240745; /proc/stat boottime = 1456240746; /proc/uptime boottime = 1456240746
>
> Is this a bug in 8.4.4? We had upgraded from 8.2.10 and never saw anything like this.
>
> Thanks,
> bob
>