Re: [Condor-users] DAG File descriptor panic when quota is exceeded
- Date: Tue, 22 Dec 2009 11:30:44 -0600 (CST)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAG File descriptor panic when quota is exceeded
On Tue, 22 Dec 2009, Ian Stokes-Rees wrote:
> I ran out of quota on a disk running a large DAG last night (around 2am).
> I bumped up the quota at 9am (10GB to 30GB), but the running DAG is still
> reporting file descriptor panics and DAGMan keeps crashing. Is this
> expected? I suppose some of the temporary/recovery files it is trying to
> use for the restart may be corrupted. Is there any way to test this? We
> had finished 30k of 100k nodes in the DAG. It would be nice not to have
> to restart the entire DAG (although I could write a script to re-generate
> the DAG with only the nodes that did not complete).
>
> Suggestions on the best way to recover this situation would be greatly
> appreciated.
>
> TIA.
>
> Ian
>
> Summary of dagman.out log file follows.
>
> At 2:30am it looks like DAGMan fell over. The last entry to this point is:
>
> 12/22 02:27:12 Currently monitoring 1145 Condor log file(s)
> 12/22 02:27:12 Node 2vv
> ...
The snippet above shows that DAGMan is currently trying to monitor 1145
log files, which means that it needs to have at least 1145 file
descriptors. Do you know what the limit of file descriptors for a process
is on your machine? I think 1024 is a common default on Linux, at least,
which will obviously cause problems for monitoring 1145 log files.
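For example, on a typical Linux machine you can check the limits from a
shell like this (generic commands, not Condor-specific; the /proc lookup
assumes pgrep is available and a reasonably recent kernel):

    ulimit -Sn    # soft per-process limit for the current shell
    ulimit -Hn    # hard limit (the ceiling the soft limit can be raised to)

    # Limit of the already-running schedd, since DAGMan is forked by it:
    grep 'open files' /proc/$(pgrep -o condor_schedd)/limits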
If you are able to increase the file descriptor limit, things should work.
The best thing to do is condor_hold the DAGMan job, increase the file
descriptor limit, and then condor_release the DAGMan job. This will put
DAGMan into recovery mode, which will automatically read the logs to
figure out what jobs had already been run, so you don't have to try to
re-create a subset of your original DAG.
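As a sketch of that hold/release cycle (the cluster ID 1234.0 below is
just a placeholder; condor_q will show the actual ID of your condor_dagman
job):

    condor_q | grep condor_dagman   # note the DAGMan job's cluster ID
    condor_hold 1234.0              # placeholder ID
    # ... raise the file descriptor limit here (see below) ...
    condor_release 1234.0           # DAGMan restarts in recovery mode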
I think you may also have to restart the condor_schedd after changing the
file descriptor limit (before condor_releasing the DAGMan job), so that
the schedd gets the new limit, and therefore the new DAGMan process is
forked with the new limit.
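One way to do that (this is an assumption about a typical setup, so adapt
it to however Condor is started on your machine) is to raise the limit in
the environment that launches condor_master and then restart the schedd:

    # In the init script (or shell) that starts condor_master, before
    # the master is launched:
    ulimit -n 4096

    # Then restart just the schedd so new DAGMan processes inherit
    # the higher limit (check the flags your condor_restart supports):
    condor_restart -schedd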
Kent Wenger
Condor Team