Re: [HTCondor-devel] Remote IO in vanilla universe


Date: Mon, 28 Jan 2013 10:21:14 -0500
From: Matthew Farrellee <matt@xxxxxxxxxx>
Subject: Re: [HTCondor-devel] Remote IO in vanilla universe
I doubt anyone will object to doing that on-list.

Best,


matt

On 01/28/2013 10:01 AM, Douglas Thain wrote:
Brian -

The idea all along has been to have a common protocol definition, so
that the various implementations would interoperate:
http://research.cs.wisc.edu/htcondor/chirp

We last looked at this about 1.5 years ago -- and it worked -- but I
don't believe there is any regular testing of the interaction between
the cctools chirp and the condor chirp.  Without that, things may
drift apart over time.

I will be happy to put forth some effort from my group to make this
interaction work better.  How about we start by identifying the known
problems and any desired features?  (Off list, probably.)

Best -
Doug


On Mon, Jan 28, 2013 at 8:49 AM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
Do you know the history behind the split implementation of the chirp client?  Why can't there just be a common library or codebase for the client?  I know I've seen jobs bedeviled by the "no timeouts" problem when using the shipped CLI (not to mention the thread-safety issues!).

The work described below is really just gluing together the two interfaces.  Most function implementations look like this:

static int chirp_read(const char *path, char *buffer, size_t size, off_t offset, struct fuse_file_info *fi) {
    GET_CLIENT(client);
    assert(path);
    return chirp_client_pread(client, fi->fh, buffer, size, offset);
}

(GET_CLIENT is a macro that pulls the client handle from the FUSE context and locks a mutex.)  Hence, things are mostly at the mercy of the underlying client.
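
As a rough illustration of that pattern, a macro like GET_CLIENT could be built on helpers along the following lines.  This is only a sketch, not the actual code: struct mount_state, current_state(), acquire_client() and release_client() are hypothetical names, and a real FUSE module would fetch its state from fuse_get_context()->private_data rather than from a global.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-mount state; a real module would keep this in the FUSE context. */
struct mount_state {
    void *client;          /* opaque chirp client handle */
    pthread_mutex_t lock;  /* serializes all calls made through the client */
};

static struct mount_state g_state = { NULL, PTHREAD_MUTEX_INITIALIZER };

/* Stand-in for fuse_get_context()->private_data (assumption for this sketch). */
static struct mount_state *current_state(void) {
    return &g_state;
}

/* Lock the mutex and hand back the client; the caller must call release_client(). */
static void *acquire_client(void) {
    struct mount_state *s = current_state();
    pthread_mutex_lock(&s->lock);
    return s->client;
}

/* Unlock the mutex taken by acquire_client(). */
static void release_client(void) {
    pthread_mutex_unlock(&current_state()->lock);
}
```

Holding one lock around every call keeps a non-thread-safe client usable from FUSE's worker threads, at the cost of serializing all IO through the mount.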

Brian

PS - I see that chirp_fuse doesn't use the standard option parsing for fuse, meaning it can't be made compatible with /etc/fstab.  :(  However, that shouldn't be a roadblock to using it in this case.

On Jan 28, 2013, at 7:29 AM, Douglas Thain <dthain@xxxxxx> wrote:

Brian -

You might check out the existing chirp_fuse module, which should
interoperate with both the Condor Chirp I/O proxy as well as the
standalone Chirp server. When used with the latter, you also get
proper errnos, timeouts, and transparent failure recovery.

http://www.cse.nd.edu/~ccl/software/manuals/man/chirp_fuse.html

Cheers,
Doug


On Sun, Jan 27, 2013 at 9:46 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
Hi all,

Figured out a relatively simple way of providing remote IO in the vanilla
universe and am looking for someone willing to give it a spin.  It's a
surprisingly small amount of code - the heavy lifting is done by chirp.
Mostly, the new code is just gluing pre-existing components.

See the design document:

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3465

In short:  I created a FUSE filesystem that translates filesystem calls into
chirp IO (which performs remote IO against the submit host).  I use the
filesystem namespaces feature to make this filesystem only appear to the job
(and be automatically unmounted at the job's end).  This way, the job sees
the filesystem of the submit host (either as / or as /condor/submitter,
depending on the job's requested options).  The technique appears to work
well, but I haven't tried pushing it too hard.

I'm not quite sure where Chirp breaks, but I did notice that it has no error
codes implemented (calls return either 0 or -1, with no errno).  Hence, any IO
error is converted to EIO.  That will likely be problematic for some applications.
Chirp also has no timeouts or error recovery; the filesystem will likely die
if the shadow restarts.
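
To make the EIO point concrete, here is a minimal sketch of the kind of translation described above.  The helper name chirp_result_to_fuse is hypothetical and not taken from the actual code; it only assumes that the client reports failure as a negative value with no errno, and that FUSE expects a non-negative result on success and a negated errno on failure.

```c
#include <errno.h>

/* Hypothetical helper: map a chirp-style result (>= 0 on success, -1 on
 * failure, no errno available) onto the FUSE convention of returning
 * -errno on failure.  With no detail to go on, every failure becomes EIO. */
static int chirp_result_to_fuse(long rc) {
    if (rc >= 0) {
        return (int) rc;   /* success: pass the count or status through */
    }
    return -EIO;           /* failure: the original cause is lost */
}
```

An application that distinguishes, say, ENOENT from EACCES would see only EIO through such a layer, which is why the missing error codes matter.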

Enjoy!

Brian

_______________________________________________
HTCondor-devel mailing list
HTCondor-devel@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel
