[Condor-users] Large Number of small jobs


Date: Fri, 11 Feb 2005 14:41:18 -0800
From: Daniel Durand <Daniel.Durand@xxxxxx>
Subject: [Condor-users] Large Number of small jobs
Hi

I am rather new to Condor, although I went through a fair amount of the documentation and web pages before posting to
the list for help.


Here is the situation.

I have to run a large number of DAGs, about 100,000, all of which are quite simple.

I used to submit each DAG independently when the number of jobs was small (<300), but with a large number of
jobs I quickly ran out of file descriptors.


I tried a solution that puts all the independent DAGs into one master DAG, like:
Job solaris_1 job1.opus
Job linux_1 job1.linux
Script POST linux_1 remove_tar.pl job1.tar
Parent solaris_1 Child linux_1
Job solaris_2 job2.opus
Job linux_2 job2.linux
Script POST linux_2 remove_tar.pl job2.tar
Parent solaris_2 Child linux_2
.
.
.


This was repeated many times and submitted via condor_submit_dag -maxjobs 40 file.dag
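
For illustration, a master DAG file of this shape could be generated with a short script. This is only a minimal sketch: the node names, submit file names and POST script follow the pattern shown above, while the function name master_dag_lines is hypothetical and not part of any Condor tool.

```python
def master_dag_lines(n_pairs):
    """Return the lines of a master DAG holding n_pairs independent
    solaris -> linux sub-DAGs, following the naming pattern above."""
    lines = []
    for i in range(1, n_pairs + 1):
        lines.append(f"Job solaris_{i} job{i}.opus")
        lines.append(f"Job linux_{i} job{i}.linux")
        # The POST script removes the tar file once the linux job is done.
        lines.append(f"Script POST linux_{i} remove_tar.pl job{i}.tar")
        lines.append(f"Parent solaris_{i} Child linux_{i}")
    return lines


if __name__ == "__main__":
    # Print a small example; redirect to file.dag to use it.
    print("\n".join(master_dag_lines(2)))
```

The output would then be submitted with the same condor_submit_dag command as above.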

This ran much better but still ran out of file descriptors at some point. The reason seems to be that all
the parent tasks get executed first, so I end up with tons of tar files (which pass the data between
parent and child) in the submission directory, filling up precious disk space. Condor does not appear
to finish a given sub-DAG before starting a new one.
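
To make the tar file build-up concrete, here is a toy simulation. It is a sketch under stated assumptions, not a model of DAGMan internals: I assume ready nodes are submitted in the order they became ready (first in, first out), that every job takes one time step, and the function name peak_tar_files is made up for this example.

```python
from collections import deque


def peak_tar_files(n_pairs, max_jobs):
    """Simulate n_pairs parent -> child sub-DAGs under a max_jobs throttle.

    Assumption: ready nodes are submitted in FIFO order, so a child that
    becomes ready queues up behind every parent that is still waiting.
    Returns the largest number of tar files present at any one time.
    """
    ready = deque(("parent", i) for i in range(n_pairs))
    tar_files = 0
    peak = 0
    while ready:
        # Take up to max_jobs nodes from the front of the ready queue.
        batch = [ready.popleft() for _ in range(min(max_jobs, len(ready)))]
        for kind, i in batch:
            if kind == "parent":
                tar_files += 1               # parent leaves a tar file behind
                ready.append(("child", i))   # child goes to the back of the queue
            else:
                tar_files -= 1               # POST script removes the tar file
            peak = max(peak, tar_files)
    return peak


if __name__ == "__main__":
    print(peak_tar_files(1000, 40))
```

Under these assumptions the peak equals the number of sub-DAGs regardless of the -maxjobs value, which matches what I see in the submission directory.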


Is there a better way to do this?

My system manager tried to change the number of file descriptors available for my account, but
any change from the default of 1024 rendered my account unusable: any shell would exit
immediately after logging in. We tried changing /etc/security/limits.conf
without any success.
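
For reference, the soft and hard descriptor limits that a process actually inherits can also be checked from Python. This is a minimal sketch using the standard resource module on Unix; the helper name descriptor_limits is hypothetical, and it only reads the limits without changing them.

```python
import resource


def descriptor_limits():
    """Return the (soft, hard) limits on open file descriptors
    for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = descriptor_limits()
    # resource.RLIM_INFINITY means the limit is unlimited.
    print("soft descriptor limit:", soft)
    print("hard descriptor limit:", hard)
```

The soft value is the one reported as "descriptors" by the shell's limit command.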


Here is my setup:
host 31% cat /proc/sys/fs/file-max
209664

host 34% limit
cputime         unlimited
filesize        unlimited
datasize        unlimited
stacksize       unlimited
coredumpsize    1 kbytes
memoryuse       unlimited
vmemoryuse      unlimited
descriptors     1024
memorylocked    unlimited
maxproc         7168

host 37% condor_version
$CondorVersion: 6.6.6 Jul 26 2004 $
$CondorPlatform: I386-LINUX_RH9 $

Linux host 2.4.22-1.2188.nptlsmp #1 SMP Wed Apr 21 20:12:56 EDT 2004 i686 athlon i386 GNU/Linux

Many thanks

Daniel


Daniel Durand | Tel/Tél: +1 250 363 0052 | FAX: +1 250 363 0045
HST archives scientist | Responsable Archive HST
Herzberg Institute of Astrophysics | Institut Herzberg Astrophysique
National Research Council Canada | Conseil National de Recherches du Canada
5071 W. Saanich Road | 5071 W. Saanich Road
Victoria, B.C. | Victoria, C.B.
Canada V9E 2E7


