Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Limit Memory

Date: Mon, 2 Sep 2013 15:03:47 +0000 (UTC)
From: Romain <nuelromain@xxxxxxxxx>
Subject: Re: [HTCondor-users] Limit Memory

Brian Bockelman <bbockelm@...> writes:

> 
> 
> On Aug 28, 2013, at 7:18 AM, Romain <nuelromain@...> wrote:
> 
> > Romain <nuelromain <at> ...> writes:
> > 
> >> 
> >> Brian Bockelman <bbockelm <at> ...> writes:
> >> 
> >>> 
> >>> 
> >>> On Aug 27, 2013, at 9:22 AM, Romain <nuelromain <at> ...> wrote:
> >>> 
> >>>> Hi everybody,
> >>>> 
> >>>> I've some problems with the limits of memory usage on my pool.
> >>>> 
> >>>> So I've install cgroup and configure like that:
> >>>> BASE_CGROUP = htcondor
> >>>> CGROUP_MEMORY_LIMIT_POLICY = hard
> >>>> On my configuration files (condor_config)
> >>>> 
> >>>> I want to suspend the jobs if it stay at the limit for a time (1 min for
> >>>> example) and go back to the queue if it stay another time more (5 min for
> >>>> example)
> >>>> 
> >>> 
> >>> I don't understand the question.  The memory limits are per-job.  If you
> >>> suspend the job, how is it going to decrease its memory usage?
> >>> 
> >>> Brian
> >>> 
> >> 
> >> I want to suspend the job for a time and if it can't restart I want to stop
> >> it and let go back to the queue
> >> 
> >> If isn't possible I want to let go back to the queue directly
> >> 
> >> I attribute 2 CPU and 1 GB RAM for each user machine, job don't have to take
> >> more than 1 GB because it can be a problem for user.
> >> 
> >> Sorry for my bad English :s
> >> 
> >> Thank you and have a nice day
> >> 
> >> --
> >> Romain
> >> 
> >> 
> > 
> > To more explain my problem:
> > With htop I see that the cgroup limit is respect (for example a job can use
> > 500MB max).
> > The "RES" column show the limit respect, but the virtual memory grow up and
> > the "progress bar" (which show all memory use on the machine) grow up too
> > so my limit is at 500MB but the job use more than 1.3GB with no problem so
> > that can crash the machine
> > 
> 
> Hi Romain,
> 
> I think I understand now.  Is it possible that the jobs are going into swap?
> 
> Options are:
> 1) Remove swap, or use the swappiness file in the /condor cgroup to remove
>    condor's ability to use swap.
> 2) Set the max swap / memory usage for all of condor in the cgroup
>    configuration.
> 
> Brian
> 
> > I just want to put back to the queue jobs which reach the limit.
> > 
> > What I need is to find the parameter and the arguments to put on to 
> > configure condor to do this
> > 
> > The priority is to save the user even if the job restart from the beginning
> > 
> > 
> > Thank you
> > 
> > --
> > Romain
> > 
> 
> 

Hi,

Thanks for your response

I tried the second solution you proposed, but it didn't work. So I updated
HTCondor and retried an "older" approach to limiting memory usage that I had
tried before, and it works!

So I use this expression:
((ImageSize > Memory*1024) && (ResidentSetSize >= 800*1024))
(for Suspend, together with CPUBusy)
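
A minimal sketch of how the expression above could be placed in condor_config,
assuming it is OR-ed with CPUBusy in SUSPEND (the post does not show the exact
combination). MemoryBusy is a hypothetical macro name introduced only for this
illustration, and CPUBusy is assumed to be defined as in the stock UWCS policy
macros. ImageSize and ResidentSetSize are reported in KiB and Memory in MiB,
which is why Memory and the 800 MB threshold are multiplied by 1024.

```
# Hypothetical helper macro (the name is not from the original post).
# ImageSize and ResidentSetSize are in KiB, Memory is in MiB.
MemoryBusy = ((ImageSize > Memory*1024) && (ResidentSetSize >= 800*1024))

WANT_SUSPEND = True
# Assumption: the memory test is combined with CPUBusy using ||.
SUSPEND      = ($(CPUBusy) || $(MemoryBusy))
```

Comments are kept on their own lines because a trailing # on a macro line may
be taken as part of the value rather than as a comment.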

It works fine, but now I have another problem.

I think there is a problem when jobs are running on a desktop machine (with
the memory limit configuration, etc.) while the user is not using it.
After a while, when the user comes back, the machine has crashed :s
Do you have any idea what might cause this?

I am trying to locate the problem, but it's really difficult.
Jobs are suspended when they approach the limit, and I configured
want_vacate and preempt with the UWCS policy so that the job goes back to the
queue after a few minutes.
Want_Suspend is set to "true"; could that cause some trouble?
For info:
Start uses CPUIdle (only; I'll try to add something)
Continue is the opposite of Suspend (|| => && etc.)
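
A rough sketch of the "go back to the queue after a few minutes" part, modelled
on the UWCS policy macros. The exact expressions in this configuration are not
shown in the post, so everything here is an assumption; the 5-minute value is
only the example figure mentioned earlier in the thread.

```
MINUTE         = 60
# Assumed value: the "5 min" example from earlier in the thread.
MaxSuspendTime = 5 * $(MINUTE)
ActivityTimer  = (time() - EnteredCurrentActivity)

# Preempt a job that has stayed suspended longer than MaxSuspendTime.
PREEMPT     = ((Activity == "Suspended") && ($(ActivityTimer) > $(MaxSuspendTime)))
# Ask for a graceful vacate; the evicted job then returns to the queue.
WANT_VACATE = True
```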

The problem appears when I run a lot of jobs on the pool, and not all
machines are affected by it.

I'll keep searching.

Thank you 

Bye
