Re: [Condor-users] DAG File descriptor panic when quota is exceeded
- Date: Tue, 22 Dec 2009 11:30:44 -0600 (CST)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAG File descriptor panic when quota is exceeded
On Tue, 22 Dec 2009, Ian Stokes-Rees wrote:
> I ran out of quota on a disk running a large DAG last night (around 2am).
> I bumped up the quota at 9am (10GB to 30GB), but the running DAG is still
> reporting file descriptor panics and DAGMan keeps crashing. Is this
> expected? I suppose some of the temporary/recovery files it is trying to
> use for the restart may be corrupted. Is there any way to test this? We
> had finished 30k of 100k nodes in the DAG. It would be nice not to have
> to restart the entire DAG (although I could write a script to re-generate
> the DAG with only the nodes that did not complete).
>
> Suggestions on the best way to recover this situation would be greatly
> appreciated.
>
> TIA.
>
> Ian
>
> Summary of dagman.out log file follows.
>
> At 2:30am it looks like DAGMan fell over. The last entry to this point is:
>
> 12/22 02:27:12 Currently monitoring 1145 Condor log file(s)
> 12/22 02:27:12 Node 2vv
> ...
The snippet above shows that DAGMan is currently trying to monitor 1145
log files, which means that it needs to have at least 1145 file
descriptors. Do you know what the limit of file descriptors for a process
is on your machine? I think 1024 is a common default on Linux, at least,
which will obviously cause problems for monitoring 1145 log files.
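For example, on a typical Linux machine you can check the limits from a
shell like this (generic commands, not Condor-specific; the /proc lookup
assumes pgrep is available and a reasonably recent kernel):

    ulimit -Sn    # soft per-process limit for the current shell
    ulimit -Hn    # hard limit (the ceiling the soft limit can be raised to)

    # Limit of the already-running schedd, since DAGMan is forked by it:
    grep 'open files' /proc/$(pgrep -o condor_schedd)/limits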
If you are able to increase the file descriptor limit, things should work.
The best thing to do is condor_hold the DAGMan job, increase the file
descriptor limit, and then condor_release the DAGMan job. This will put
DAGMan into recovery mode, which will automatically read the logs to
figure out what jobs had already been run, so you don't have to try to
re-create a subset of your original DAG.
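As a sketch of that hold/release cycle (the cluster ID 1234.0 below is
just a placeholder; condor_q will show the actual ID of your condor_dagman
job):

    condor_q | grep condor_dagman   # note the DAGMan job's cluster ID
    condor_hold 1234.0              # placeholder ID
    # ... raise the file descriptor limit here (see below) ...
    condor_release 1234.0           # DAGMan restarts in recovery mode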
I think you may also have to restart the condor_schedd after changing the
file descriptor limit (before condor_releasing the DAGMan job), so that
the schedd gets the new limit, and therefore the new DAGMan process is
forked with the new limit.
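One way to do that (this is an assumption about a typical setup, so adapt
it to however Condor is started on your machine) is to raise the limit in
the environment that launches condor_master and then restart the schedd:

    # In the init script (or shell) that starts condor_master, before
    # the master is launched:
    ulimit -n 4096

    # Then restart just the schedd so new DAGMan processes inherit
    # the higher limit (check the flags your condor_restart supports):
    condor_restart -schedd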
Kent Wenger
Condor Team