Re: [HTCondor-devel] Remote IO in vanilla universe


Date: Mon, 28 Jan 2013 10:01:43 -0500
From: Douglas Thain <dthain@xxxxxx>
Subject: Re: [HTCondor-devel] Remote IO in vanilla universe
Brian -

The idea all along has been to have a common protocol definition, so
that the various implementations would interoperate:
http://research.cs.wisc.edu/htcondor/chirp

We last looked at this about 1.5 years ago -- and it worked -- but I
don't believe there is any regular testing of the interaction between
the cctools chirp and the condor chirp.  Without that, things may
drift apart over time.

I will be happy to put forth some effort from my group to make this
interaction work better.  How about we start by identifying the known
problems and any desired features?  (Off list, probably.)

Best -
Doug


On Mon, Jan 28, 2013 at 8:49 AM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
> Do you know the history behind the split implementation of the chirp client?  Why can't there just be a common library or codebase for the client?  I know I've seen jobs bedeviled by the "no timeouts" problem when using the shipped CLI (not to mention the thread-safety issues!).
>
> The work described below is really just gluing together the two interfaces.  Most function implementations look like this:
>
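> static int chirp_read(const char * path, char * buffer, size_t size, off_t offset, struct fuse_file_info * fi) {
>         GET_CLIENT(client);  /* fetch the client handle from the FUSE context; takes the mutex */
>         assert(path);
>         /* fi->fh carries the file handle; the read is forwarded unchanged to the chirp client */
>         return chirp_client_pread(client, fi->fh, buffer, size, offset);
> }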
>
> (GET_CLIENT is a macro that pulls the client handle from the FUSE context and locks a mutex.)  Hence, most behavior is at the mercy of the underlying client.
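
The handle-plus-mutex pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual chirp_fuse or Condor code: struct mount_state, acquire_client, release_client, and locked_read are hypothetical names, and the "client" is a plain int standing in for the real handle that would live in the FUSE context's private data.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-mount state; a real FUSE filesystem would typically
 * keep something like this in the context's private data. */
struct mount_state {
    pthread_mutex_t lock;   /* serializes use of the single client connection */
    int             client; /* placeholder for the real client handle */
    int             calls;  /* number of operations issued so far */
};

/* Lock the shared state and hand back the client handle. */
static int *acquire_client(struct mount_state *st) {
    pthread_mutex_lock(&st->lock);
    return &st->client;
}

/* Release the lock taken by acquire_client(). */
static void release_client(struct mount_state *st) {
    pthread_mutex_unlock(&st->lock);
}

/* Hypothetical operation: a stand-in for forwarding a read to the client.
 * It fills the buffer with 'x' instead of performing any remote IO. */
static int locked_read(struct mount_state *st, char *buf, size_t size) {
    int *client = acquire_client(st);
    assert(client != NULL);
    st->calls++;
    for (size_t i = 0; i < size; i++) {
        buf[i] = 'x';
    }
    release_client(st);
    return (int)size;
}
```

Because every operation holds the same lock for its full duration, a single stalled call blocks all others, which is one reason the lack of timeouts in the underlying client matters.
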
>
> Brian
>
> PS - I see that chirp_fuse doesn't use the standard FUSE option parsing, so it can't be made compatible with /etc/fstab.  :(  However, that shouldn't be a roadblock to using it in this case.
>
> On Jan 28, 2013, at 7:29 AM, Douglas Thain <dthain@xxxxxx> wrote:
>
>> Brian -
>>
>> You might check out the existing chirp_fuse module, which should
>> interoperate with both the Condor Chirp I/O proxy as well as the
>> standalone Chirp server. When used with the latter, you also get
>> proper errnos, timeouts, and transparent failure recovery.
>>
>> http://www.cse.nd.edu/~ccl/software/manuals/man/chirp_fuse.html
>>
>> Cheers,
>> Doug
>>
>>
>> On Sun, Jan 27, 2013 at 9:46 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
>>> Hi all,
>>>
>>> Figured out a relatively simple way of providing remote IO in the vanilla
>>> universe and am looking for someone willing to give it a spin.  It's a
>>> surprisingly small amount of code - the heavy lifting is done by chirp.
>>> Mostly, the new code is just gluing pre-existing components.
>>>
>>> See the design document:
>>>
>>> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3465
>>>
>>> In short:  I created a FUSE filesystem that translates filesystem calls to
>>> chirp IO (which performs remote IO against the submit host).  I use the
>>> filesystem namespaces feature to make this filesystem only appear to the job
>>> (and be automatically unmounted at the job's end).  This way, the job sees
>>> the filesystem of the submit host (either as / or as /condor/submitter,
>>> depending on the job's requested options).  The technique appears to work
>>> well, but I haven't tried pushing it too hard.
>>>
>>> I'm not quite sure where Chirp breaks, but I did notice that it has no error
>>> codes implemented (calls return either 0 or -1, with no errno).  Hence, any IO
>>> error is converted to EIO.  That will likely be problematic for some
>>> applications.  Chirp also has no timeouts or error recovery; the filesystem
>>> will likely die if the shadow restarts.
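
The EIO collapse described above can be sketched as follows. FUSE's high-level operations report failure by returning a negated errno value, so a client that only says "-1" leaves the glue layer nothing better to return than -EIO. The function names here are hypothetical and exist only for illustration; the second function shows what the glue could do if the client did supply an errno.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical translation for a client that reports only 0/-1:
 * success passes through, every failure becomes -EIO. */
static int to_fuse_result(int client_rc) {
    if (client_rc >= 0) {
        return client_rc;
    }
    return -EIO;
}

/* Hypothetical translation for a client that also supplies an errno:
 * the specific cause is preserved, with -EIO as the fallback. */
static int to_fuse_result_with_errno(int client_rc, int saved_errno) {
    assert(saved_errno >= 0);
    if (client_rc >= 0) {
        return client_rc;
    }
    return saved_errno != 0 ? -saved_errno : -EIO;
}
```

With the first form, an application cannot tell a missing file from a permission problem or a dropped connection, which is why programs that branch on ENOENT or EACCES are likely to misbehave.
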
>>>
>>> Enjoy!
>>>
>>> Brian
>>>
>