Re: [Condor-users] DAG File descriptor panic when quota is exceeded
- Date: Tue, 22 Dec 2009 12:08:57 -0600 (CST)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAG File descriptor panic when quota is exceeded
On Tue, 22 Dec 2009, Ian Stokes-Rees wrote:
> Things were working fine until I ran out of quota. For example, at 4:30pm
> yesterday I hit the high water mark for monitoring log files for this DAG:
>
> 12/21 16:29:38 Currently monitoring 1769 Condor log file(s)
>
> The first DAG restart and PANIC happened at 2:29am and 2:35am on 12/22
> respectively (or thereabouts). I have the file and proc limit set pretty
> high now.
>
> /etc/security/limits.conf:
> * hard nofile 40000
> * soft nofile 40000
> * hard nproc 20000
> * soft nproc 20000
>
> $ ulimit -H -a:
> open files (-n) 40000
> max user processes (-u) 20000
>
> The machine has been restarted since these changes were made, to be sure all
> daemon processes inherited the setting.
>> The best thing to do is condor_hold the DAGMan job, increase the file
>> descriptor limit, and then condor_release the DAGMan job. This will put
>> DAGMan into recovery mode, which will automatically read the logs to figure
>> out what jobs had already been run, so you don't have to try to re-create a
>> subset of your original DAG.
> I'll try that next time. I've already condor_rm'ed the DAG since it was just
> looping on crash and restart without actually submitting any jobs. So
> condor_hold will create the rescue DAG? What happens to running jobs? Are
> they suspended/aborted? This is all in a Condor-G context.
There are two separate situations: recovery mode and rescue DAG (this
always gets complicated to explain). When you condor_rm a DAGMan job, it
tries to condor_rm all of the node jobs, and creates a rescue DAG. The
rescue DAG has nodes marked DONE to record the progress of the DAG. When
you re-run the DAG, it automatically runs the rescue DAG (for fairly
recent versions of DAGMan -- for older versions you have to specify the
rescue DAG file on the condor_submit_dag command line). When the rescue
DAG is run, nodes are marked DONE as the DAG file is parsed, and then
execution continues.
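As a rough sketch of both paths (my.dag is just an illustrative name, and the
exact rescue DAG file name depends on your DAGMan version):

$ condor_rm <dagman_cluster_id>      # node jobs get condor_rm'ed, rescue DAG is written
$ condor_submit_dag my.dag           # recent DAGMan: finds and runs the rescue DAG automatically
$ condor_submit_dag my.dag.rescue    # older DAGMan: give the rescue DAG file explicitly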
In recovery mode, there is no rescue DAG -- DAGMan re-reads the individual
node job log files to "catch up" to the state of the jobs. (DAGMan goes
into recovery mode after a condor_hold/condor_release has been done to it.
condor_hold suspends the DAGMan process itself, but it doesn't stop the
currently-running node jobs.)
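In other words, something like this (1234 just stands in for the cluster id of
the condor_dagman job):

$ condor_hold 1234       # DAGMan itself stops; running node jobs keep going
# ... raise the file descriptor limit ...
$ condor_release 1234    # DAGMan comes back in recovery mode and re-reads the node job logs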
Okay, one thing to do is run condor_check_userlogs on the log files of the
node jobs. That should tell you if the log files themselves are
corrupted. (Depending on how many log files you have, you may not be able
to run condor_check_userlogs on all of them at once; but it's fine to run
condor_check_userlogs a number of times on different sets of log files.)
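Something like this should let you check them in batches (assuming the node
job log files match *.log in the current directory -- adjust for your actual
layout):

$ ls *.log | xargs -n 100 condor_check_userlogs

That runs condor_check_userlogs repeatedly, on 100 log files at a time.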
I'm kind of surprised you're getting the 'out of file descriptors' problem
after changing the limits. It wouldn't surprise me that much if you got
errors reading events, but you shouldn't run out of file descriptors.
There are at least two things to check:
1) The results of running condor_check_userlogs on the log files.
2) How many log files DAGMan says it's monitoring before it runs out of
fds.
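For (2), the count shows up in the dagman.out file (named after your DAG
file); for example, assuming your DAG file is my.dag:

$ grep 'Currently monitoring' my.dag.dagman.out | tail -5

The last few lines show how many log files DAGMan was monitoring just before
it ran out.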
You probably should also set DAGMAN_DEBUG to D_FDS -- that will allow you
to see what fd DAGMan is up to if it runs out again. You can do this by
setting DAGMAN_DEBUG in your config file, or setting _CONDOR_DAGMAN_DEBUG
in your environment before you run condor_submit_dag.
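For example (my.dag is just a placeholder, and the export syntax assumes a
bash-like shell):

# in the Condor config file:
DAGMAN_DEBUG = D_FDS

# or just for one run, via the environment:
$ export _CONDOR_DAGMAN_DEBUG=D_FDS
$ condor_submit_dag my.dag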
Kent Wenger
Condor Team