Re: [Condor-users] DAG File descriptor panic when quota is exceeded
- Date: Tue, 22 Dec 2009 12:52:41 -0500
- From: Ian Stokes-Rees <ijstokes@xxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] DAG File descriptor panic when quota is exceeded
Kent,
R. Kent Wenger wrote:
> 12/22 02:27:12 Currently monitoring 1145 Condor log file(s)
> 12/22 02:27:12 Node 2vv
> ...
> If you are able to increase the file descriptor limit, things should
> work.
Things were working fine until I ran out of quota. For example, at
4:30pm yesterday I hit the high water mark for monitoring log files for
this DAG:
12/21 16:29:38 Currently monitoring 1769 Condor log file(s)
The first DAG restart and PANIC happened at 2:29am and 2:35am on 12/22
respectively (or thereabouts). I have the file and proc limit set
pretty high now.
/etc/security/limits.conf:
* hard nofile 40000
* soft nofile 40000
* hard nproc 20000
* soft nproc 20000
$ ulimit -H -a
open files (-n) 40000
max user processes (-u) 20000
The machine has been restarted since these changes were made, to be sure
all daemon processes inherited the setting.
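One way to double-check that the running daemons really inherited the new limits (rather than just trusting the shell's ulimit) is to read /proc/&lt;pid&gt;/limits for the condor_master or condor_schedd process. A minimal Linux sketch, using the current shell's own PID as a runnable stand-in:

```shell
# Show the limits a running process actually inherited (Linux).
# Substitute the PID of condor_master or condor_schedd for $$;
# $$ (this shell) is used here only as a stand-in.
grep -E 'Max (open files|processes)' /proc/$$/limits
```

If the numbers shown for the schedd don't match limits.conf, the daemon was started before the change took effect.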
> The best thing to do is condor_hold the DAGMan job, increase the file
> descriptor limit, and then condor_release the DAGMan job. This will
> put DAGMan into recovery mode, which will automatically read the logs
> to figure out what jobs had already been run, so you don't have to try
> to re-create a subset of your original DAG.
I'll try that next time. I've already condor_rm'ed the DAG since it was
just looping on crash and restart without actually submitting any jobs.
So condor_hold will create the rescue DAG? What happens to running
jobs? Are they suspended/aborted? This is all in a Condor-G context.
> I think you may also have to re-start the condor_schedd after changing
> the file descriptor limit (before condor_releasing the DAGMan job), so
> that the schedd gets the new limit, and therefore the new DAGMan
> process is forked with the new limit.
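Putting the steps from this thread together, the recovery sequence would look roughly like the following. The cluster ID 1234 and the "condor" service name are placeholders, not values from the thread:

```
# Sketch of the recovery procedure described above; 1234 stands in
# for the DAGMan job's cluster ID (find it with condor_q -dag).
condor_hold 1234                    # pause the DAGMan job
vi /etc/security/limits.conf        # raise nofile/nproc limits
service condor restart              # so the schedd picks up the new limit
condor_release 1234                 # DAGMan re-forks, enters recovery mode
```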
Machine was rebooted.
--
Ian Stokes-Rees W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx T: +1 617 432-5608 x75
SBGrid, Harvard Medical School F: +1 617 432-5600