Re: [Condor-users] DAG File descriptor panic when quota is exceeded
- Date: Tue, 22 Dec 2009 12:52:41 -0500
- From: Ian Stokes-Rees <ijstokes@xxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] DAG File descriptor panic when quota is exceeded
Kent,
R. Kent Wenger wrote:
> 12/22 02:27:12 Currently monitoring 1145 Condor log file(s)
> 12/22 02:27:12 Node 2vv
> ...
> If you are able to increase the file descriptor limit, things should
> work.
Things were working fine until I ran out of quota.   For example, at 
4:30pm yesterday I hit the high water mark for monitoring log files for 
this DAG:
12/21 16:29:38 Currently monitoring 1769 Condor log file(s)
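(A quick way to see how many descriptors the condor_dagman process itself has
open at a moment like that -- Linux-specific, and assuming a single
condor_dagman process on the box:)
$ ls /proc/$(pgrep -n condor_dagman)/fd | wc -l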
The first DAG restart and PANIC happened at 2:29am and 2:35am on 12/22 
respectively (or thereabouts).  I have the file and proc limit set 
pretty high now.
/etc/security/limits.conf:
*               hard     nofile           40000
*               soft     nofile           40000
*               hard     nproc            20000
*               soft     nproc            20000
$ ulimit -H -a:
open files                      (-n) 40000
max user processes              (-u) 20000
The machine has been restarted since these changes were made, to be sure 
all daemon processes inherited the setting.
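(To double-check what a running daemon actually inherited, the limits can be
read straight out of /proc -- a Linux-specific sketch, assuming pidof finds a
single condor_schedd pid:)
$ grep -E 'open files|processes' /proc/$(pidof condor_schedd)/limits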
> The best thing to do is condor_hold the DAGMan job, increase the file
> descriptor limit, and then condor_release the DAGMan job.  This will
> put DAGMan into recovery mode, which will automatically read the logs
> to figure out what jobs had already been run, so you don't have to try
> to re-create a subset of your original DAG.
I'll try that next time.  I've already condor_rm'ed the DAG since it was 
just looping on crash and restart without actually submitting any jobs.  
So condor_hold will create the rescue DAG?  What happens to running 
jobs?  Are they suspended/aborted?  This is all in a Condor-G context.
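(If I follow the sequence, it would be roughly this, with 1234 standing in for
whatever cluster id condor_q shows for the DAGMan job:)
$ condor_hold 1234          # park the DAGMan job (1234 = its cluster id)
  (raise the file descriptor limit here)
$ condor_release 1234       # DAGMan resumes in recovery mode and re-reads the logs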
> I think you may also have to re-start the condor_schedd after changing
> the file descriptor limit (before condor_releasing the DAGMan job), so
> that the schedd gets the new limit, and therefore the new DAGMan
> process is forked with the new limit.
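(If restarting just Condor rather than the whole machine, a plain
condor_restart on the submit host is the usual way; whether the fresh schedd
actually sees the higher limit depends on how condor_master itself was
launched, so a reboot is the surest route:)
$ condor_restart          # condor_master restarts its daemons, schedd included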
Machine was rebooted.
--
Ian Stokes-Rees                            W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 432-5608 x75
SBGrid, Harvard Medical School             F: +1 617 432-5600