Re: [Condor-users] DAG File descriptor panic when quota is exceeded
- Date: Tue, 22 Dec 2009 12:08:57 -0600 (CST)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAG File descriptor panic when quota is exceeded
On Tue, 22 Dec 2009, Ian Stokes-Rees wrote:
> Things were working fine until I ran out of quota. For example, at 4:30pm
> yesterday I hit the high water mark for monitoring log files for this DAG:
>
> 12/21 16:29:38 Currently monitoring 1769 Condor log file(s)
>
> The first DAG restart and PANIC happened at 2:29am and 2:35am on 12/22
> respectively (or thereabouts). I have the file and proc limit set pretty
> high now.
>
> /etc/security/limits.conf:
> * hard nofile 40000
> * soft nofile 40000
> * hard nproc 20000
> * soft nproc 20000
>
> $ ulimit -H -a:
> open files (-n) 40000
> max user processes (-u) 20000
>
> The machine has been restarted since these changes were made, to be sure all
> daemon processes inherited the setting.
>> The best thing to do is condor_hold the DAGMan job, increase the file
>> descriptor limit, and then condor_release the DAGMan job. This will put
>> DAGMan into recovery mode, which will automatically read the logs to figure
>> out what jobs had already been run, so you don't have to try to re-create a
>> subset of your original DAG.
> I'll try that next time. I've already condor_rm'ed the DAG since it was just
> looping on crash and restart without actually submitting any jobs. So
> condor_hold will create the rescue DAG? What happens to running jobs? Are
> they suspended/aborted? This is all in a Condor-G context.
There are two separate situations: recovery mode and rescue DAG (this
always gets complicated to explain). When you condor_rm a DAGMan job, it
tries to condor_rm all of the node jobs, and creates a rescue DAG. The
rescue DAG has nodes marked DONE to record the progress of the DAG. When
you re-run the DAG, it automatically runs the rescue DAG (for fairly
recent versions of DAGMan -- for older versions you have to specify the
rescue DAG file on the condor_submit_dag command line). When the rescue
DAG is run, nodes are marked DONE as the DAG file is parsed, and then
execution continues.
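As a rough sketch of both paths (my.dag is just an illustrative name, and the
exact rescue DAG file name depends on your DAGMan version):

$ condor_rm <dagman_cluster_id>      # node jobs get condor_rm'ed, rescue DAG is written
$ condor_submit_dag my.dag           # recent DAGMan: finds and runs the rescue DAG automatically
$ condor_submit_dag my.dag.rescue    # older DAGMan: give the rescue DAG file explicitly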
In recovery mode, there is no rescue DAG -- DAGMan re-reads the individual
node job log files to "catch up" to the state of the jobs. (DAGMan goes
into recovery mode after a condor_hold/condor_release has been done to it.
condor_hold suspends the DAGMan process itself, but it doesn't stop the
currently-running node jobs.)
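In other words, something like this (1234 just stands in for the cluster id of
the condor_dagman job):

$ condor_hold 1234       # DAGMan itself stops; running node jobs keep going
# ... raise the file descriptor limit ...
$ condor_release 1234    # DAGMan comes back in recovery mode and re-reads the node job logs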
Okay, one thing to do is run condor_check_userlogs on the log files of the
node jobs. That should tell you if the log files themselves are
corrupted. (Depending on how many log files you have, you may not be able
to run condor_check_userlogs on all of them at once; but it's fine to run
condor_check_userlogs a number of times on different sets of log files.)
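Something like this should let you check them in batches (assuming the node
job log files match *.log in the current directory -- adjust for your actual
layout):

$ ls *.log | xargs -n 100 condor_check_userlogs

That runs condor_check_userlogs repeatedly, on 100 log files at a time.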
I'm kind of surprised you're getting the 'out of file descriptors' problem
after changing the limits. It wouldn't surprise me that much if you got
errors reading events, but you shouldn't run out of file descriptors.
There are at least two things to check:
1) The results of running condor_check_userlogs on the log files.
2) How many log files DAGMan says it's monitoring before it runs out of
fds.
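For (2), the count shows up in the dagman.out file (named after your DAG
file); for example, assuming your DAG file is my.dag:

$ grep 'Currently monitoring' my.dag.dagman.out | tail -5

The last few lines show how many log files DAGMan was monitoring just before
it ran out.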
You probably should also set DAGMAN_DEBUG to D_FDS -- that will allow you
to see what fd DAGMan is up to if it runs out again. You can do this by
setting DAGMAN_DEBUG in your config file, or setting _CONDOR_DAGMAN_DEBUG
in your environment before you run condor_submit_dag.
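For example (my.dag is just a placeholder, and the export syntax assumes a
bash-like shell):

# in the Condor config file:
DAGMAN_DEBUG = D_FDS

# or just for one run, via the environment:
$ export _CONDOR_DAGMAN_DEBUG=D_FDS
$ condor_submit_dag my.dag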
Kent Wenger
Condor Team