I would welcome it on-list, in fact.
Not that this should be a deal-breaker, but is FUSE enabled by default
in most modern Linux distributions?
I've always wanted to turn the standard universe libraries into a straight
C implementation that turns remote I/O into a Chirp call over a pipe to a
process on the local machine, which would be responsible for actually
carrying out the I/O. That way, none of the Condor I/O code would have to
be loaded into the address space of the user job, and there wouldn't be any
nastiness with C++ exceptions and runtime libraries, which would make the
build much simpler too.
The checkpointing code could be done in C as well, with the checkpoint
dumped over the pipe to that same process and managed externally.
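A minimal sketch of the stub/helper split described above, assuming POSIX
pipes and fork(). The request/reply structs and the names spawn_io_helper()
and proxy_pread() are made up for illustration; this is not the Chirp wire
protocol, and the helper simply calls open()/pread() locally as a stand-in
for whatever would actually carry out the remote I/O.

```c
/* Sketch only: a made-up request/reply format, NOT the Chirp wire protocol. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_MAX 4096                 /* hypothetical per-request read limit */

struct io_request { char path[PATH_MAX]; off_t offset; size_t size; };
struct io_reply   { ssize_t result; int err; };  /* err: helper's errno */

/* Transfer exactly len bytes, retrying on EINTR and short reads/writes. */
static int xfer(int fd, void *buf, size_t len, int writing) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = writing ? write(fd, p, len) : read(fd, p, len);
        if (n < 0 && errno == EINTR) continue;
        if (n <= 0) return -1;      /* error, or the peer closed the pipe */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Helper process: performs the real I/O on behalf of the job. */
static void helper_loop(int in, int out) {
    static char data[IO_MAX];
    struct io_request req;
    while (xfer(in, &req, sizeof req, 0) == 0) {
        struct io_reply rep = { -1, EINVAL };
        req.path[sizeof req.path - 1] = '\0';   /* force NUL termination */
        int fd = req.size <= IO_MAX ? open(req.path, O_RDONLY) : -1;
        if (fd >= 0) {
            rep.result = pread(fd, data, req.size, req.offset);
            rep.err = rep.result < 0 ? errno : 0;
            close(fd);
        } else if (req.size <= IO_MAX) {
            rep.err = errno;                    /* open() failed */
        }
        if (xfer(out, &rep, sizeof rep, 1) != 0) break;
        if (rep.result > 0 && xfer(out, data, (size_t)rep.result, 1) != 0) break;
    }
}

/* Fork the helper; the job then talks to it through *to and *from. */
pid_t spawn_io_helper(int *to, int *from) {
    int req_pipe[2], rep_pipe[2];
    if (pipe(req_pipe) != 0) return -1;
    if (pipe(rep_pipe) != 0) { close(req_pipe[0]); close(req_pipe[1]); return -1; }
    pid_t pid = fork();
    if (pid == 0) {                             /* child: the I/O helper */
        close(req_pipe[1]);
        close(rep_pipe[0]);
        helper_loop(req_pipe[0], rep_pipe[1]);
        _exit(0);
    }
    close(req_pipe[0]);
    close(rep_pipe[1]);
    if (pid < 0) { close(req_pipe[1]); close(rep_pipe[0]); return -1; }
    *to = req_pipe[1];
    *from = rep_pipe[0];
    return pid;
}

/* Job-side stub: plain C, no I/O library in the job's address space. */
ssize_t proxy_pread(int to, int from, const char *path,
                    void *buf, size_t size, off_t offset) {
    struct io_request req;
    struct io_reply rep;
    assert(path && buf);
    if (size > IO_MAX || strlen(path) >= sizeof req.path) { errno = EINVAL; return -1; }
    memset(&req, 0, sizeof req);
    strcpy(req.path, path);
    req.offset = offset;
    req.size = size;
    if (xfer(to, &req, sizeof req, 1) != 0 || xfer(from, &rep, sizeof rep, 0) != 0) {
        errno = EIO;                            /* helper went away */
        return -1;
    }
    if (rep.result < 0) { errno = rep.err; return -1; }
    if (rep.result > 0 && xfer(from, buf, (size_t)rep.result, 0) != 0) {
        errno = EIO;
        return -1;
    }
    return rep.result;
}
```

The job calls spawn_io_helper() once and then proxy_pread() wherever it
would have called pread(). Because the helper's errno travels back in the
reply, the job sees a real error code rather than a blanket failure.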
-Erik
On Mon, Jan 28, 2013 at 10:21:14AM -0500, Matthew Farrellee wrote:
> I doubt anyone will object to doing that on-list.
>
> Best,
>
>
> matt
>
> On 01/28/2013 10:01 AM, Douglas Thain wrote:
> >Brian -
> >
> >The idea all along has been to have a common protocol definition, so
> >that the various implementations would interoperate:
> >http://research.cs.wisc.edu/htcondor/chirp
> >
> >We last looked at this about 1.5 years ago -- and it worked -- but I
> >don't believe there is any regular testing of the interaction between
> >the cctools chirp and the condor chirp. Without that, things may
> >drift apart over time.
> >
> >I will be happy to put forth some effort from my group to make this
> >interaction work better. How about we start by identifying the known
> >problems and any desired features? (Off list, probably.)
> >
> >Best -
> >Doug
> >
> >
> >On Mon, Jan 28, 2013 at 8:49 AM, Brian Bockelman <bbockelm@xxxxxxxxxxx>
> >wrote:
> >>Do you know the history behind the split implementation of the chirp
> >>client? Why can't there just be a common library or codebase for the
> >>client? I know I've seen jobs bedeviled by the "no timeouts" problem
> >>when using the CLI shipped (not to mention the issues of thread safety!).
> >>
> >>The work described below is really just gluing together the two
> >>interfaces. Most function implementations look like this:
> >>
> >>static int chirp_read(const char * path, char * buffer, size_t size,
> >>                      off_t offset, struct fuse_file_info * fi) {
> >>    GET_CLIENT(client);
> >>    assert(path);
> >>    /* Forward the read to the chirp client; fi->fh is the FUSE file handle. */
> >>    return chirp_client_pread(client, fi->fh, buffer, size, offset);
> >>}
> >>
> >>(GET_CLIENT is a macro to pull the client handle from the FUSE context
> >>and lock a mutex). Hence, things are mostly at the mercy of the
> >>underlying client.
> >>
> >>Brian
> >>
> >>PS - I see that chirp_fuse doesn't use the standard option parsing for
> >>fuse, meaning it can't be made to be compatible with /etc/fstab. :(
> >>However, that shouldn't be a roadblock to using it in this case.
> >>
> >>On Jan 28, 2013, at 7:29 AM, Douglas Thain <dthain@xxxxxx> wrote:
> >>
> >>>Brian -
> >>>
> >>>You might check out the existing chirp_fuse module, which should
> >>>interoperate with both the Condor Chirp I/O proxy as well as the
> >>>standalone Chirp server. When used with the latter, you also get
> >>>proper errnos, timeouts, and transparent failure recovery.
> >>>
> >>>http://www.cse.nd.edu/~ccl/software/manuals/man/chirp_fuse.html
> >>>
> >>>Cheers,
> >>>Doug
> >>>
> >>>
> >>>On Sun, Jan 27, 2013 at 9:46 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx>
> >>>wrote:
> >>>>Hi all,
> >>>>
> >>>>Figured out a relatively simple way of providing remote IO in the vanilla
> >>>>universe and am looking for someone willing to give it a spin. It's a
> >>>>surprisingly small amount of code - the heavy lifting is done by chirp.
> >>>>Mostly, the new code is just gluing pre-existing components.
> >>>>
> >>>>See the design document:
> >>>>
> >>>>https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3465
> >>>>
> >>>>In short: I created a FUSE filesystem that translates filesystem calls to
> >>>>chirp IO (which does a remote IO with the submit host). I use the
> >>>>filesystem namespaces feature to make this filesystem only appear to the
> >>>>job (and be automatically unmounted at the job's end). This way, the job
> >>>>sees the filesystem of the submit host (either as / or as
> >>>>/condor/submitter, depending on the job's requested options). The
> >>>>technique appears to work well, but I haven't tried pushing it too hard.
> >>>>
> >>>>I'm not quite sure where Chirp breaks, but I did notice that it has no
> >>>>error codes implemented (either returns 0 or -1, no errno). Hence, any IO
> >>>>error is converted to EIO. That will likely be problematic for some
> >>>>applications. Chirp also has no timeouts or error recovery; the filesystem
> >>>>will likely die if the shadow restarts.
> >>>>
> >>>>Enjoy!
> >>>>
> >>>>Brian
> >>>>
> >>>>_______________________________________________
> >>>>HTCondor-devel mailing list
> >>>>HTCondor-devel@xxxxxxxxxxx
> >>>>https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel