HTCondor Project List Archives




Re: [Condor-devel] Per-job PID namespaces



On Mar 15, 2011, at 7:32 AM, Matthew Farrellee wrote:

> On 03/10/2011 07:17 PM, Brian Bockelman wrote:
>> 
>> On Mar 10, 2011, at 6:06 PM, Matthew Farrellee wrote:
>> 
>>> On 03/10/2011 06:10 PM, Brian Bockelman wrote:
>>>> Hi all,
>>>> 
>>>> Last night, I took another detour into "cool things in modern Linux kernels" and came up with per-job PID namespaces for Condor:
>>>> 
>>>> https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1959
>>>> 
>>>> Basically, when a job runs, the starter requests a new PID namespace from the kernel.  The clone'd process believes it is PID 1, with all processes in the job hanging off that.  It looks something like this:
>>>> 
>>>> [bbockelm@rcf-bockelman condor]$ condor_run ps faux
>>>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>>>> bbockelm     1  0.0  0.0 106200  1132 ?        SN   14:15   0:00 /bin/bash /home/bbockelm/projects/condor/.condor_run.17238
>>>> bbockelm     2  0.0  0.0 108052  1000 ?        RN   14:15   0:00 ps faux
>>>> 
>>>> However, to the "outside world", these appear as normal processes.  The processes inside the job can't view or contact external processes - two jobs running within the same Unix account can't discover or send signals to each other.  Additionally, when "PID 1" dies, the kernel wipes out the remaining processes started by the job.  It's a fairly neat trick.  This all requires kernel 2.6.24 or later.
>>>> 
>>>> Enjoy!
>>>> 
>>>> Brian
>>> 
>>> Nice. Does the accounting for memory, cpu, io all get rolled up into one process for Condor to monitor too? Bye bye proc family tree? <- expensive to maintain
>>> 
>> 
>> Actually, it would be possible to make the proc family tree less
>> expensive by putting the condor_master into its own PID namespace -
>> greatly limiting the number of processes to look at.
> 
> That's an interesting thought. Do you have a trick for doing this within an init script? I would only expect a modest improvement though. I tend to see few processes beyond condor_master children on most execute nodes, and the primary cause of process explosion on submit nodes is condor_master children.
> 

No, but I have patches for condor_master:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1965

Add the "-n" flag at startup.  DaemonCore apparently uses a different code path to fork/exec; this patch should be usable independent of everything else.
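
To illustrate what "its own PID namespace" means from the outside, here is a minimal Python sketch that compares the namespace of two processes. It is not part of the patch: it assumes a kernel that exposes the /proc/<pid>/ns/pid symlink (Linux 3.8 and later, so newer than the 2.6.24 baseline mentioned above), and the function names and the proc_root parameter are hypothetical, added only so the logic can be exercised against a fake directory tree. Reading the link for another user's process may also require extra privileges.

```python
import os


def pid_namespace_id(pid, proc_root="/proc"):
    """Return the kernel's identifier for a process's PID namespace.

    The identifier is the target of the ns/pid symlink, a string of the
    form 'pid:[<inode>]'. proc_root is a hypothetical knob for testing.
    """
    return os.readlink(os.path.join(proc_root, str(pid), "ns", "pid"))


def same_pid_namespace(pid_a, pid_b, proc_root="/proc"):
    """True if both processes report the same PID namespace identifier."""
    return pid_namespace_id(pid_a, proc_root) == pid_namespace_id(pid_b, proc_root)
```

For example, same_pid_namespace(os.getpid(), master_pid) would be expected to return False for a master started in a separate namespace, where master_pid is the master's PID as seen from the caller's namespace.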

> Actually makes me wonder how configurable the proc polling interval is for the master.
> 

Well, with cgroups, you basically never need to poll.  However, polls are still triggered by certain operations (such as killing a family tree).
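
As a rough illustration of why polling becomes unnecessary: the kernel already maintains the membership list of each cgroup, so it can be read directly instead of being reconstructed from /proc. The Python sketch below assumes the job's cgroup directory exposes a cgroup.procs file listing one PID per line; the function name, the directory argument, and the filename default are illustrative assumptions, not Condor code.

```python
import os


def read_cgroup_pids(cgroup_dir, filename="cgroup.procs"):
    """Return the sorted, de-duplicated PIDs listed for one cgroup.

    cgroup_dir is the (assumed) directory of the job's cgroup; the
    kernel writes one PID per line into the membership file.
    """
    pids = set()
    with open(os.path.join(cgroup_dir, filename)) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                pids.add(int(line))
    return sorted(pids)
```

A single read of this file answers "which processes belong to the job" without walking every PID on the machine.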

The procd does have multiple methods for determining parentage.  After cgroups goes in, I thought we might make the set of methods it uses configurable - once you turn off the ones which do heavy parsing in /proc, you might greatly decrease the impact.
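
For a sense of what one /proc-based parentage method involves, here is a minimal Python sketch that pulls the parent PID out of the contents of /proc/<pid>/stat. It is not the procd's implementation, and parse_ppid is a hypothetical name. The command name in that file is wrapped in parentheses and may itself contain spaces or parentheses, so the sketch splits on the last closing parenthesis before reading the fields that follow.

```python
def parse_ppid(stat_line):
    """Extract the parent PID from the text of /proc/<pid>/stat.

    Layout: 'pid (comm) state ppid ...'. Everything after the last ')'
    is whitespace-separated, with state first and ppid second.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    fields = after_comm.split()
    return int(fields[1])
```

Doing this for every PID on the machine, on every poll, is the kind of cost that turning such a method off would avoid.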

> 
>> However, it's a long road to getting rid of the procd.  Right now,
>> this is all another tool in the procd's arsenal.  I'd want to see
>> these techniques "work well" in the wild, before we think about
>> disabling parts of the procd, then slowly disabling functionality if
>> the cgroups take care of it.  For example, it shouldn't be necessary
>> to scan all the PIDs anymore as the kernel keeps the hierarchies.
>> There's so much procd code that I don't see a clear way to do a "big
>> bang" replacement (not to mention that "legacy" linuxes, such as
>> basically all the deployed Condor installs, will take a long time to
>> phase out).
>> 
>> If we head down this direction successfully, maybe we can get rid of
>> it in 5 years?
> 
> I'm going to take that to mean that resources aren't aggregated for the namespace. 8o)
> 
> Good thing they are for the cgroup ammo.
> 

Correct - nothing special in the resource accounting is mixed into the namespace work.  I don't see a particular draw in improving the accounting for namespaces (Greg said it should improve getrusage) when any group with namespaces ought to have cgroups - and cgroups have more functionality.  For example, once the primary patch lands, I think someone could do a follow-up patch using the cgroups I/O accounting.

Brian
