Re: [Condor-users] DAG File descriptor panic when quota is exceeded
- Date: Tue, 22 Dec 2009 12:52:41 -0500
- From: Ian Stokes-Rees <ijstokes@xxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] DAG File descriptor panic when quota is exceeded
Kent,
R. Kent Wenger wrote:
> 12/22 02:27:12 Currently monitoring 1145 Condor log file(s)
> 12/22 02:27:12 Node 2vv
> ...
> If you are able to increase the file descriptor limit, things should
> work.
Things were working fine until I ran out of quota. For example, at
4:30pm yesterday I hit the high water mark for monitoring log files for
this DAG:
12/21 16:29:38 Currently monitoring 1769 Condor log file(s)
The first DAG restart and PANIC happened at 2:29am and 2:35am on 12/22
respectively (or thereabouts). I have the file and proc limit set
pretty high now.
/etc/security/limits.conf:
* hard nofile 40000
* soft nofile 40000
* hard nproc 20000
* soft nproc 20000
$ ulimit -H -a
open files (-n) 40000
max user processes (-u) 20000
The machine has been restarted since these changes were made, to be sure
all daemon processes inherited the setting.
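One way to double-check that the running daemons really inherited the new limits (rather than just trusting the shell's ulimit) is to read /proc/&lt;pid&gt;/limits for the condor_master or condor_schedd process. A minimal Linux sketch, using the current shell's own PID as a runnable stand-in:

```shell
# Show the limits a running process actually inherited (Linux).
# Substitute the PID of condor_master or condor_schedd for $$;
# $$ (this shell) is used here only as a stand-in.
grep -E 'Max (open files|processes)' /proc/$$/limits
```

If the numbers shown for the schedd don't match limits.conf, the daemon was started before the change took effect.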
> The best thing to do is condor_hold the DAGMan job, increase the file
> descriptor limit, and then condor_release the DAGMan job. This will
> put DAGMan into recovery mode, which will automatically read the logs
> to figure out what jobs had already been run, so you don't have to try
> to re-create a subset of your original DAG.
I'll try that next time. I've already condor_rm'ed the DAG since it was
just looping on crash and restart without actually submitting any jobs.
So condor_hold will create the rescue DAG? What happens to running
jobs? Are they suspended/aborted? This is all in a Condor-G context.
> I think you may also have to re-start the condor_schedd after changing
> the file descriptor limit (before condor_releasing the DAGMan job), so
> that the schedd gets the new limit, and therefore the new DAGMan
> process is forked with the new limit.
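Putting the steps from this thread together, the recovery sequence would look roughly like the following. The cluster ID 1234 and the "condor" service name are placeholders, not values from the thread:

```
# Sketch of the recovery procedure described above; 1234 stands in
# for the DAGMan job's cluster ID (find it with condor_q -dag).
condor_hold 1234                    # pause the DAGMan job
vi /etc/security/limits.conf        # raise nofile/nproc limits
service condor restart              # so the schedd picks up the new limit
condor_release 1234                 # DAGMan re-forks, enters recovery mode
```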
Machine was rebooted.
--
Ian Stokes-Rees W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx T: +1 617 432-5608 x75
SBGrid, Harvard Medical School F: +1 617 432-5600