[Condor-users] DAG questions
I had a few DAG-specific questions to follow up on. I have increased my
file handle and process limits to 40k and 20k respectively.
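For reference, the limits were raised roughly like this (the exact mechanism is part of what I'm unsure about -- whether setting them in the shell is enough, or whether they need to go into /etc/security/limits.conf and the daemons restarted):

    # raise per-user limits in the shell used for submission
    ulimit -n 40000    # open file handles
    ulimit -u 20000    # user processes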
Ian Stokes-Rees wrote:
In particular, I'm trying to create a 100k node DAG (flat, no
dependencies), with MAXJOBS 6000 and I'm getting the error:
...
These are in 100k separate classads in 100k directories (in a 2-tier
hierarchy groupX/nodeY, so as to avoid overloading a single
directory), with 100k log files, one in each of the node directories.
It takes about 1 hour for the DAG to be submitted. I've bumped up
ulimits to a level which should get rid of the problem, but it isn't
clear whether I need to re-submit the DAG, restart Condor, log out and
back in, or even reboot the machine for these changes to take effect.
Any advice kindly appreciated.
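To make the layout concrete, the generated DAG file looks roughly like this (names are illustrative; the real file has ~100k JOB lines and no PARENT/CHILD lines):

    JOB node000001 group000/node000001/job.sub DIR group000/node000001
    JOB node000002 group000/node000002/job.sub DIR group000/node000002
    ...

and it is submitted with something along the lines of:

    condor_submit_dag -maxjobs 6000 run.dag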
I've read and re-read some of the DAGMan documentation. I've now set:
DAGMAN_MAX_SUBMITS_PER_INTERVAL=250
DAGMAN_LOG_ON_NFS_IS_ERROR=False
The latter is surprising: I understand the default is "True", and the
docs for 7.0 say that having job logs on NFS should then cause the DAG
to fail, yet my jobs were submitted OK. All my job files are on NFS. I
don't have space on local
disk for the 20+ GB this DAG will produce on each iteration. I'm using
Condor 7.2. I should also mention that I have DOT generation turned on
and set to UPDATE. This may not be a good idea. In the short term I
can move job submission to a local disk for testing, and turn off DOT
generation.
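For completeness, the DOT setup is just one line in the DAG file, along the lines of (filename illustrative):

    DOT run.dot UPDATE

so disabling it should just be a matter of leaving that line out when the DAG is regenerated.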
My dagman.out file is huge: 200 MB. Is there some way to reduce the
logging level? I couldn't see any option to do this. I seem to get one
line per DAG node every time DAGMan re-evaluates the DAG. 100k lines
every few minutes is too much. My ideal scenario:
1. Specify the location of the DAG log, out, and err files explicitly
(rather than have them end up in the directory where condor_submit_dag
is executed).
2. Limit logging to remove per-DAG-node lines.
3. Rotate log files that could grow large.
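For (1) and (2) the closest things I've spotted so far are the -outfile_dir and -debug options to condor_submit_dag, and for (3) possibly MAX_DAGMAN_LOG in the config, but I may be misreading the manual (or these may not all exist in 7.2). What I have in mind is roughly:

    # write dagman.out somewhere other than the submit directory,
    # at lower verbosity (directory name illustrative)
    condor_submit_dag -outfile_dir /scratch/dag-logs -debug 1 run.dag

    # in condor_config: cap the size of dagman.out (value in bytes)
    MAX_DAGMAN_LOG = 100000000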
Finally, condor_submit_dag seems to be silent while it processes the
DAG. I don't want a flood of output, but it would be nice to know that
*something* is going on. Instead it outputs nothing for 60 minutes and
then dumps the status of the DAG submission.
Thanks for any advice on how to improve our use of DAGMan.
Ian
--
Ian Stokes-Rees, Research Associate
SBGrid, Harvard Medical School
http://sbgrid.org