Re: [HTCondor-devel] Remote IO in vanilla universe


Date: Mon, 28 Jan 2013 07:49:40 -0600
From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] Remote IO in vanilla universe
Do you know the history behind the split implementation of the chirp client?  Why can't there just be a common library or codebase for the client?  I know I've seen jobs bedeviled by the "no timeouts" problem when using the shipped CLI (not to mention the thread-safety issues!).

The work described below is really just gluing together the two interfaces.  Most function implementations look like this:

static int chirp_read(const char * path, char * buffer, size_t size, off_t offset, struct fuse_file_info * fi) {
	GET_CLIENT(client);
	assert(path);
	return chirp_client_pread(client, fi->fh, buffer, size, offset);
}

(GET_CLIENT is a macro that pulls the client handle from the FUSE context and locks a mutex.)  Hence, things are mostly at the mercy of the underlying client.
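
For readers who haven't seen the macro, here is a rough sketch of the pattern, not the actual code.  To stay self-contained it avoids the FUSE headers: demo_state, demo_client, demo_get_state, PUT_CLIENT and demo_read are all hypothetical names, and demo_get_state stands in for reading private_data from the FUSE context.  The explicit unlock at the end is also an assumption about how the lock gets released.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the chirp client handle. */
struct demo_client { int calls; };

/* Hypothetical per-mount state. In a real FUSE filesystem this would
 * typically hang off fuse_get_context()->private_data. */
struct demo_state {
    struct demo_client client;
    pthread_mutex_t lock;
};

static struct demo_state g_state = { { 0 }, PTHREAD_MUTEX_INITIALIZER };

/* Stand-in for looking the state up in the FUSE context. */
static struct demo_state *demo_get_state(void) { return &g_state; }

/* Fetch the shared client and take the lock that serializes access to it. */
#define GET_CLIENT(var) \
    struct demo_state *var##_state = demo_get_state(); \
    pthread_mutex_lock(&var##_state->lock); \
    struct demo_client *var = &var##_state->client

/* Assumed counterpart: release the lock taken by GET_CLIENT. */
#define PUT_CLIENT(var) pthread_mutex_unlock(&var##_state->lock)

/* A callback in the same shape as chirp_read: grab the client,
 * forward the request, release the lock, return the result. */
static int demo_read(const char *path, char *buffer, size_t size)
{
    assert(path);
    GET_CLIENT(client);
    client->calls++;          /* the real code would call the client here */
    memset(buffer, 'x', size);
    PUT_CLIENT(client);
    return (int)size;
}
```

Because every callback takes the same lock, requests reach the client one at a time, which sidesteps the thread-safety problem at the cost of serializing all IO.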

Brian

PS - I see that chirp_fuse doesn't use the standard option parsing for fuse, meaning it can't be made compatible with /etc/fstab.  :(  However, that shouldn't be a roadblock to using it in this case.

On Jan 28, 2013, at 7:29 AM, Douglas Thain <dthain@xxxxxx> wrote:

> Brian -
> 
> You might check out the existing chirp_fuse module, which should
> interoperate with both the Condor Chirp I/O proxy as well as the
> standalone Chirp server. When used with the latter, you also get
> proper errnos, timeouts, and transparent failure recovery.
> 
> http://www.cse.nd.edu/~ccl/software/manuals/man/chirp_fuse.html
> 
> Cheers,
> Doug
> 
> 
> On Sun, Jan 27, 2013 at 9:46 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
>> Hi all,
>> 
>> Figured out a relatively simple way of providing remote IO in the vanilla
>> universe and am looking for someone willing to give it a spin.  It's a
>> surprisingly small amount of code - the heavy lifting is done by chirp.
>> Mostly, the new code is just gluing pre-existing components.
>> 
>> See the design document:
>> 
>> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3465
>> 
>> In short:  I created a FUSE filesystem that translates filesystem calls to
>> chirp IO (which performs remote IO with the submit host).  I use the
>> filesystem namespaces feature to make this filesystem only appear to the job
>> (and be automatically unmounted at the job's end).  This way, the job sees
>> the filesystem of the submit host (either as / or as /condor/submitter,
>> depending on the job's requested options).  The technique appears to work
>> well, but I haven't tried pushing it too hard.
>> 
>> I'm not quite sure where Chirp breaks, but I did notice that it has no error
>> codes implemented (either returns 0 or -1, no errno).  Hence, any IO error
>> is converted to EIO.  That will likely be problematic for some applications.
>> Chirp also has no timeouts or error recovery; the filesystem will likely die
>> if the shadow restarts.
>> 
>> Enjoy!
>> 
>> Brian
>> 
>> _______________________________________________
>> HTCondor-devel mailing list
>> HTCondor-devel@xxxxxxxxxxx
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel
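
The error handling described in the original message (the client returns 0 or -1 with no errno, so every failure surfaces as EIO) can be sketched as below.  demo_to_fuse_result is a hypothetical helper, not a function from Condor or chirp_fuse; the only outside fact it relies on is that FUSE callbacks report failure by returning a negated errno value.

```c
#include <errno.h>

/* Hypothetical helper: translate a client result that carries no errno
 * into the negative-errno convention that FUSE callbacks return. */
static int demo_to_fuse_result(int client_rc)
{
    if (client_rc >= 0)
        return client_rc;   /* success: pass through 0 or a byte count */
    return -EIO;            /* failure: cause unknown, so report a generic IO error */
}
```

An application that distinguishes, say, ENOENT from EACCES would see EIO for both under this scheme, which is why the missing error codes are likely to be a problem for some jobs.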
