Re: [HTCondor-devel] Remote IO in vanilla universe


Date: Mon, 28 Jan 2013 11:59:36 -0600
From: Erik Paulson <epaulson@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] Remote IO in vanilla universe
I would welcome it on-list, in fact. 

Not that this should be a deal-breaker, but is FUSE enabled by default 
in most modern Linux distributions? 

I've always wanted to turn the standard universe libraries into a straight
C implementation that turns each remote I/O operation into a Chirp call over
a pipe to a process on the local machine, which would be responsible for
actually carrying out the I/O. That way, none of the Condor I/O code would
have to be loaded into the address space of the user job, and there would be
no nastiness with C++ exceptions and runtime libraries, which would make the
build way simpler too.
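
A minimal sketch of the pipe idea, under loud assumptions: this is not
Condor code and does not speak the real Chirp protocol. The request header
(struct io_request), the helper loop, and every function name below are
hypothetical, and an in-memory string stands in for the file the helper
would actually read. It only shows the shape: a small C stub in the job
writes a request to a pipe, and a separate local process carries out the
I/O and sends the bytes back.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical request header; NOT the Chirp wire format. */
struct io_request { long offset; size_t size; };

/* Write exactly n bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const void *buf, size_t n) {
    const char *p = buf;
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w <= 0) return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Read exactly n bytes. Returns 0, or -1 on EOF or error. */
static int read_all(int fd, void *buf, size_t n) {
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0) return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Helper side: answer pread-style requests until the request pipe closes.
 * The in-memory buffer stands in for whatever does the real remote I/O. */
static void helper_loop(int req_fd, int resp_fd, const char *data, size_t len) {
    struct io_request rq;
    while (read_all(req_fd, &rq, sizeof rq) == 0) {
        long n = 0;
        if (rq.offset >= 0 && (size_t)rq.offset < len) {
            size_t avail = len - (size_t)rq.offset;
            n = (long)(avail < rq.size ? avail : rq.size);
        }
        /* Reply: byte count first, then the bytes themselves. */
        if (write_all(resp_fd, &n, sizeof n) != 0) return;
        if (n > 0 && write_all(resp_fd, data + rq.offset, (size_t)n) != 0) return;
    }
}

/* Job side: the only I/O code that lives in the job's address space. */
static long stub_pread(int req_fd, int resp_fd, char *buf, size_t size, long offset) {
    struct io_request rq = { offset, size };
    long n;
    if (write_all(req_fd, &rq, sizeof rq) != 0) return -1;
    if (read_all(resp_fd, &n, sizeof n) != 0) return -1;
    if (n > 0 && read_all(resp_fd, buf, (size_t)n) != 0) return -1;
    return n;
}

/* Fork a helper, push one read through the stub, then shut the helper down.
 * Returns the number of bytes read, or -1 on failure. */
long demo_remote_read(const char *data, char *out, size_t size, long offset) {
    int req[2], resp[2];
    assert(data != NULL && out != NULL);
    if (pipe(req) != 0) return -1;
    if (pipe(resp) != 0) { close(req[0]); close(req[1]); return -1; }
    pid_t pid = fork();
    if (pid < 0) {
        close(req[0]); close(req[1]); close(resp[0]); close(resp[1]);
        return -1;
    }
    if (pid == 0) {                       /* child: the local I/O helper */
        close(req[1]);
        close(resp[0]);
        helper_loop(req[0], resp[1], data, strlen(data));
        _exit(0);
    }
    close(req[0]);                        /* parent: the "job" */
    close(resp[1]);
    long n = stub_pread(req[1], resp[0], out, size, offset);
    close(req[1]);                        /* EOF on the pipe ends helper_loop */
    close(resp[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

Calling demo_remote_read("hello, submit host", buf, 5, 7) asks the helper
for 5 bytes at offset 7. The job side needs nothing beyond read(), write()
and a fixed-size header, which is the property that would keep the C++
runtime out of the job.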

The checkpointing code could be done in C too, with the checkpoint dumped
over the pipe to that same process and managed externally.

-Erik

On Mon, Jan 28, 2013 at 10:21:14AM -0500, Matthew Farrellee wrote:
> I doubt anyone will object to doing that on-list.
> 
> Best,
> 
> 
> matt
> 
> On 01/28/2013 10:01 AM, Douglas Thain wrote:
> >Brian -
> >
> >The idea all along has been to have a common protocol definition, so
> >that the various implementations would interoperate:
> >http://research.cs.wisc.edu/htcondor/chirp
> >
> >We last looked at this about 1.5 years ago -- and it worked -- but I
> >don't believe there is any regular testing of the interaction between
> >the cctools chirp and the condor chirp.  Without that, things may
> >drift apart over time.
> >
> >I will be happy to put forth some effort from my group to make this
> >interaction work better.  How about we start by identifying the known
> >problems and any desired features?  (Off list, probably.)
> >
> >Best -
> >Doug
> >
> >
> >On Mon, Jan 28, 2013 at 8:49 AM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
> >>Do you know the history behind the split implementation of the chirp 
> >>client?  Why can't there just be a common library or codebase for the 
> >>client?  I know I've seen jobs bedeviled by the "no timeouts" problem 
> >>when using the CLI shipped (not to mention the issues of thread safety!).
> >>
> >>The work described below is really just gluing together the two 
> >>interfaces.  Most function implementations look like this:
> >>
> >>/* FUSE read callback: forward the request to the Chirp client's pread. */
> >>static int chirp_read(const char *path, char *buffer, size_t size,
> >>                      off_t offset, struct fuse_file_info *fi) {
> >>    GET_CLIENT(client);
> >>    assert(path);
> >>    return chirp_client_pread(client, fi->fh, buffer, size, offset);
> >>}
> >>
> >>(GET_CLIENT is a macro to pull the client handle from the FUSE context 
> >>and lock a mutex).  Hence, things are mostly at the mercy of the 
> >>underlying client.
> >>
> >>Brian
> >>
> >>PS - I see that chirp_fuse doesn't use the standard option parsing for 
> >>fuse, meaning it can't be made to be compatible with /etc/fstab.  :(  
> >>However, that shouldn't be a roadblock to using it in this case.
> >>
> >>On Jan 28, 2013, at 7:29 AM, Douglas Thain <dthain@xxxxxx> wrote:
> >>
> >>>Brian -
> >>>
> >>>You might check out the existing chirp_fuse module, which should
> >>>interoperate with both the Condor Chirp I/O proxy as well as the
> >>>standalone Chirp server. When used with the latter, you also get
> >>>proper errnos, timeouts, and transparent failure recovery.
> >>>
> >>>http://www.cse.nd.edu/~ccl/software/manuals/man/chirp_fuse.html
> >>>
> >>>Cheers,
> >>>Doug
> >>>
> >>>
> >>>On Sun, Jan 27, 2013 at 9:46 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
> >>>>Hi all,
> >>>>
> >>>>Figured out a relatively simple way of providing remote IO in the
> >>>>vanilla universe and am looking for someone willing to give it a
> >>>>spin.  It's a surprisingly small amount of code - the heavy lifting is
> >>>>done by chirp.  Mostly, the new code is just gluing pre-existing
> >>>>components.
> >>>>
> >>>>See the design document:
> >>>>
> >>>>https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3465
> >>>>
> >>>>In short:  I created a FUSE filesystem that translates filesystem
> >>>>calls to chirp IO (which does a remote IO with the submit host).  I
> >>>>use the filesystem namespaces feature to make this filesystem only
> >>>>appear to the job (and be automatically unmounted at the job's end).
> >>>>This way, the job sees the filesystem of the submit host (either as /
> >>>>or as /condor/submitter, depending on the job's requested options).
> >>>>The technique appears to work well, but I haven't tried pushing it
> >>>>too hard.
> >>>>
> >>>>I'm not quite sure where Chirp breaks, but I did notice that it has no
> >>>>error codes implemented (either returns 0 or -1, no errno).  Hence,
> >>>>any IO error is converted to EIO.  That will likely be problematic
> >>>>for some applications.  Chirp also has no timeouts or error recovery;
> >>>>the filesystem will likely die if the shadow restarts.
> >>>>
> >>>>Enjoy!
> >>>>
> >>>>Brian
> >>>>
> >>>>_______________________________________________
> >>>>HTCondor-devel mailing list
> >>>>HTCondor-devel@xxxxxxxxxxx
> >>>>https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel
> >>