Re: [HTCondor-users] HTCondor file-transfer vs networked storage
- Date: Mon, 22 Aug 2022 16:19:39 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] HTCondor file-transfer vs networked storage
On 8/22/22 15:35, Matthew T West via HTCondor-users wrote:
When working on a single homogeneous compute cluster, are there any
advantages to using HTCondor's file-transfer rather than working off
shared network storage?
Hi Matthew:
There are several advantages to using explicit file transfer. Perhaps
the biggest advantage is error handling. If a file cannot be
transferred, whether because of a typo in its name or a disk error,
HTCondor will notice and your job won't start. Should such an error
happen with a shared filesystem, it probably won't surface until after
your job starts, and it
becomes the job's responsibility not just to detect the error, but to
properly propagate the error up and out to HTCondor, so it can re-run
the job. This is often hard to do, especially if you use third-party
software. Usually what ends up happening is that the error is not
correctly propagated out, and the job leaves the queue without correct
or complete output, leading to very hard-to-debug problems (or worse,
quietly missing data).
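
To illustrate the burden this puts on the job, here is a minimal sketch of a wrapper that checks shared-filesystem inputs and turns I/O failures into a non-zero exit code. It assumes a Python job; the names `check_inputs` and `run_job` and the exit codes 1 and 2 are hypothetical, and how HTCondor should react to those codes (hold, retry, and so on) is a separate policy decision not shown here.

```python
import os
import sys

def check_inputs(paths):
    """Return the input paths that are missing or not readable."""
    bad = []
    for p in paths:
        if not (os.path.isfile(p) and os.access(p, os.R_OK)):
            bad.append(p)
    return bad

def run_job(paths, work):
    """Run work(paths) only if every input is readable; return an exit code.

    0 = success, 1 = inputs missing before start, 2 = I/O error mid-run.
    (These codes are an assumption for this sketch, not an HTCondor rule.)
    """
    bad = check_inputs(paths)
    if bad:
        print("missing or unreadable inputs: " + ", ".join(bad), file=sys.stderr)
        return 1
    try:
        work(paths)
    except OSError as err:
        # A shared filesystem can still fail after the initial check passes.
        print(f"I/O error while running: {err}", file=sys.stderr)
        return 2
    return 0
```

A job script would typically end with something like `sys.exit(run_job(sys.argv[1:], my_work))`, where `my_work` is whatever the job actually does. Note that this only helps for code you control; third-party software that swallows I/O errors still needs its own checks.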
If you are using the native file transfer mechanism (i.e. not a URL),
then file transfer is throttled by the access point. With shared
network filesystems, it is often possible for a lot of concurrent access
to crash the file server or otherwise cause severe performance problems.
HTCondor records in the job ad the number of input and output bytes
transferred, which can be useful in determining how to size and
provision network and disk capacity and bandwidth. This is harder to
measure if using a shared filesystem.
Now, there's no such thing as a free lunch. It is often difficult to know
a priori what the input file set is, in which case a shared filesystem
might make more sense. Also, in the case where a job just needs a very
small subset of a very large file, there may be performance benefits to
reading that small chunk from a shared filesystem instead of asking
HTCondor to copy the whole file over, just to access a small part of it.
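
As a small sketch of that last case, assuming the job knows the byte offset and length it needs (the function name `read_chunk` is hypothetical), a job on a shared filesystem can seek directly to the slice it wants rather than having the whole file copied first:

```python
def read_chunk(path, offset, length):
    """Read up to `length` bytes starting at byte `offset` of `path`.

    Only the requested range is read by this call; the rest of the
    file is never loaded into memory.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Whether this is actually faster depends on the filesystem and access pattern, so it is worth measuring before committing to one approach.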
-greg