Re: [HTCondor-users] Limit Memory
- Date: Mon, 2 Sep 2013 15:03:47 +0000 (UTC)
- From: Romain <nuelromain@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Limit Memory
Brian Bockelman <bbockelm@...> writes:
>
>
> On Aug 28, 2013, at 7:18 AM, Romain <nuelromain@...> wrote:
>
> > Romain <nuelromain <at> ...> writes:
> >
> >>
> >> Brian Bockelman <bbockelm <at> ...> writes:
> >>
> >>>
> >>>
> >>> On Aug 27, 2013, at 9:22 AM, Romain <nuelromain <at> ...> wrote:
> >>>
> >>>> Hi everybody,
> >>>>
> >>>> I have some problems with memory usage limits on my pool.
> >>>>
> >>>> I installed cgroups and set the following in my configuration file
> >>>> (condor_config):
> >>>> BASE_CGROUP = htcondor
> >>>> CGROUP_MEMORY_LIMIT_POLICY = hard
> >>>>
> >>>> I want to suspend a job if it stays at the limit for some time (1 min,
> >>>> for example) and send it back to the queue if it stays there longer
> >>>> (5 min, for example).
> >>>>
> >>>
> >>> I don't understand the question. The memory limits are per-job. If you
> >>> suspend the job, how is it going to decrease its memory usage?
> >>>
> >>> Brian
> >>>
> >>
> >> I want to suspend the job for a while, and if it can't resume I want to
> >> stop it and let it go back to the queue.
> >>
> >> If that isn't possible, I want it to go back to the queue directly.
> >>
> >> I allocate 2 CPUs and 1 GB of RAM on each user machine. Jobs must not
> >> take more than 1 GB because that can be a problem for the user.
> >>
> >> Sorry for my bad English :s
> >>
> >> Thank you and have a nice day
> >>
> >> --
> >> Romain
> >>
> >>
> >
> > To explain my problem further:
> > With htop I can see that the cgroup limit is respected (for example, a job
> > can use 500 MB at most).
> > The "RES" column stays within the limit, but the virtual memory keeps
> > growing, and so does the "progress bar" (which shows all memory used on
> > the machine). So my limit is 500 MB, but the job uses more than 1.3 GB
> > without anything stopping it, and that can crash the machine.
> >
>
> Hi Romain,
>
> I think I understand now. Is it possible that the jobs are going into swap?
>
> Options are:
> 1) Remove swap, or use the swappiness file in the /condor cgroup to remove
> condor's ability to use swap.
> 2) Set the max swap / memory usage for all of condor in the cgroup
> configuration.
>
> Brian
>
> > I just want jobs that reach the limit to be put back in the queue.
> >
> > What I need is the parameter, and the value to give it, that configures
> > condor to do this.
> >
> > The priority is to protect the user, even if the job restarts from the
> > beginning.
> >
> >
> > Thank you
> >
> > --
> > Romain
> >
>
>
Hi,
Thanks for your response.
I tried the second solution you proposed but it didn't work, so I updated
HTCondor and retried an "older" way of limiting memory usage that I had tried
before, and it works!
So I use this:
((ImageSize > Memory*1024) && (ResidentSetSize >= 800*1024))
(for SUSPEND, together with CPUBusy)
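
As a configuration fragment, this could look roughly like the sketch below. The macro name MEMORY_LIMIT_HIT is made up for the example, and joining it to $(CPUBusy) with || is an assumption about how the two are combined. ImageSize and ResidentSetSize are reported in KiB and Memory in MiB, which is why Memory and 800 are multiplied by 1024.

```
# Hypothetical helper macro (the name is invented for this sketch).
# ImageSize and ResidentSetSize are in KiB; Memory (slot memory) is in MiB.
# True when the job image exceeds the slot memory AND the resident set
# has reached 800 MiB.
MEMORY_LIMIT_HIT = ((ImageSize > Memory*1024) && (ResidentSetSize >= 800*1024))

# Assumption: suspend when the CPU is busy OR the memory test fires.
# CPUBusy is the macro from the example UWCS policy.
SUSPEND = ($(CPUBusy)) || ($(MEMORY_LIMIT_HIT))
```
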
It works fine, but now I have another problem.
I think there is a problem when jobs are running on a desktop machine (so
with the memory limit configuration etc.) while the user isn't using it.
After a while, when the user comes back, the machine has crashed :s
Do you have any idea what might cause this?
I'm trying to locate the problem but it's really difficult.
Jobs are suspended when they approach the limit, and I configured
WANT_VACATE and PREEMPT with the UWCS policy so that the job goes back to the
queue after a few minutes.
WANT_SUSPEND is set to "true"; could that cause some trouble?
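
For reference, the suspend-then-vacate timing described above might be expressed along these lines. This is only a sketch: the 5-minute value is an illustrative choice, and it assumes the MINUTE and ActivityTimer macros from the example configuration shipped with HTCondor are still defined.

```
# Illustrative value only: how long a job may stay suspended before eviction.
MaxSuspendTime = 5 * $(MINUTE)

# Suspend first when SUSPEND becomes true, rather than evicting immediately.
WANT_SUSPEND = True

# Evict a job that has been suspended for longer than MaxSuspendTime.
PREEMPT = ((Activity == "Suspended") && ($(ActivityTimer) > $(MaxSuspendTime)))

# Vacate gracefully (soft kill) instead of killing the job outright.
WANT_VACATE = True
```
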
For info:
START uses CPUIdle (only that for now; I'll try to add something)
CONTINUE is the opposite of SUSPEND (|| => && etc.)
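
One possible reading of these two settings, as a sketch rather than the exact configuration in use. Negating the whole SUSPEND expression is logically the same as negating each term and swapping || with && (De Morgan's law), so CONTINUE is written here as a negation of $(SUSPEND).

```
# Start jobs only when the non-Condor load is low.
# CPUIdle is the macro from the example UWCS policy.
START = $(CPUIdle)

# Resume a suspended job once the SUSPEND condition no longer holds.
# !(A || B) is equivalent to (!A && !B).
CONTINUE = !($(SUSPEND))
```
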
The problem appears when I run a lot of jobs on the pool, and not all
machines are affected by it.
I'll keep searching.
Thank you
Bye