Re: [Condor-users] Condor and Big Files
On 9/7/10 1:20 PM, Edier Alberto Zapata Hernández wrote:
> Good morning, I was asked about the way to run some Bioinformatic's
> tools like: Blast, Exonerate and Mira in a Condor's cluster, but the
> input files have more than 400MB of data, and it is supposed to run
> more than 1 query at the time, with Condor's file transfer mechanism
> or NFS it will produce a lot of network traffic. Is there any way to
> access or transfer big files between Condor's nodes?
>
First, you shouldn't stage this data with each job. 
Data-staging/placement should be done in advance.  You have a few options:
NFS - simple, but your NFS server will probably sink if your cluster has
more than 20 job slots all trying to fetch data from /nfs/blast (or wherever).
Your jobs would just use absolute paths to reference data which is
available on all compute nodes.
Replicas - replicate the data to every machine in the cluster.  If you
don't have admin-rights, you can probably get away with putting it in
/tmp/username/blast (or somewhere under /tmp) for 1-7 days before
tmpwatch garbage collects it.  Otherwise put it somewhere permanent:
/local/data/blast.  Make sure the path is the same on every machine. 
Place data before you run jobs.
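
As a rough illustration of placing the data ahead of time, here is a minimal Python sketch you could run once on each node before submitting jobs. The function name stage_replica and the "same size means already staged" check are my own assumptions for the example, not anything Condor provides; a checksum comparison would be safer if files can change without changing size.

```python
import shutil
from pathlib import Path


def stage_replica(source_dir, local_dir):
    """Copy files from source_dir into local_dir, skipping files already staged.

    Returns the relative paths of the files that were actually copied.
    (Hypothetical helper for illustration; not part of Condor.)
    """
    source = Path(source_dir)
    local = Path(local_dir)
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        dst_file = local / rel
        # Assumption: a file with the same size is treated as already staged.
        if dst_file.is_file() and dst_file.stat().st_size == src_file.stat().st_size:
            continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)
        copied.append(str(rel))
    return copied
```

For example, stage_replica("/nfs/blast", "/local/data/blast") would pull the data over NFS once per machine, and the jobs would then read /local/data/blast locally. Running it a second time copies nothing if the files are already in place.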
Cluster file system: HadoopFS, Glustre, etc.  These will automatically
replicate and spread your data around the cluster depending on where
there is demand for it.  You then refer to the data via a fixed path
/clusterfs/blast, but the clusterFS then looks after replicating the
data for you.  Performance improvements on things like a BLAST search
will only come after doing lots of analysis, as the cluster FS needs
experience and time to know what data to cache/cluster where.  It also
has the overhead of setting up the cluster FS.
Map-reduce your analysis: spread the data across the cluster, and each
node then holds a slice.  The computations on that node will always
process the same slice of the fixed data set.  You could use something
like Hadoop to help you set this up, or one of the many map-reduce
frameworks you can find out there (for something very basic, look at bash-reduce).
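
To make the slicing idea concrete, here is a small Python sketch that deals FASTA records out round-robin into a fixed number of slices, one per node. The name split_fasta and the round-robin policy are assumptions made for the example; a real setup might balance slices by sequence length instead, and this is not how Hadoop itself partitions data.

```python
def split_fasta(lines, num_slices):
    """Distribute FASTA records round-robin across num_slices lists of lines.

    A record is a header line starting with '>' plus the sequence lines
    that follow it. (Hypothetical helper for illustration.)
    """
    slices = [[] for _ in range(num_slices)]
    record_index = -1
    for line in lines:
        if line.startswith(">"):
            record_index += 1
        if record_index < 0:
            continue  # ignore anything before the first header
        slices[record_index % num_slices].append(line)
    return slices
```

You would call it with the lines of the big input file, for example split_fasta(open(path), 8) for a hypothetical 8-node cluster, write slice i to the local disk of node i, and have each job on that node read only its own slice.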
HTH
Ian