Hi all,
I figured out a relatively simple way to provide remote IO in the vanilla universe and am looking for someone willing to give it a spin. It's a surprisingly small amount of code; chirp does the heavy lifting, and the new code mostly just glues pre-existing components together.
See the design document:
In short: I created a FUSE filesystem that translates filesystem calls into chirp IO (which performs the IO remotely against the submit host). I use the filesystem namespaces feature so that this filesystem is visible only to the job (and is automatically unmounted when the job ends). This way, the job sees the filesystem of the submit host (either as / or as /condor/submitter, depending on the options the job requests). The technique appears to work well, but I haven't tried pushing it too hard.
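To illustrate the two mount layouts, here is a minimal Python sketch of how a path seen by the job could map back to a path on the submit host. This is not the actual implementation; the function name to_submit_path and its arguments are made up for illustration, and the real filesystem may resolve paths differently.

```python
import posixpath

def to_submit_path(job_path, mount_point="/condor/submitter"):
    """Hypothetical helper: map a path inside the job to a submit-host path.

    mount_point is where the remote filesystem appears to the job,
    either "/" or "/condor/submitter" in the setup described above.
    Returns None if job_path is not under the mount point.
    """
    job_path = posixpath.normpath(job_path)
    mount_point = posixpath.normpath(mount_point)
    if mount_point == "/":
        # Whole namespace is remote: paths map one-to-one.
        return job_path
    if job_path == mount_point:
        # The mount point itself is the submit host's root.
        return "/"
    prefix = mount_point + "/"
    if job_path.startswith(prefix):
        # Strip the mount point, keep the rest as an absolute path.
        return "/" + job_path[len(prefix):]
    return None
```

With the default mount point, /condor/submitter/home/u/data.txt would be served from /home/u/data.txt on the submit host, while a path outside the mount stays local to the execute node.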
I'm not quite sure where Chirp breaks, but I did notice that it implements no error codes (calls return either 0 or -1, with no errno). Hence, every IO error is converted to EIO. That will likely be problematic for applications that depend on specific error codes. Chirp also has no timeouts or error recovery; the filesystem will likely die if the shadow restarts.
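The error handling described above boils down to something like the following Python sketch. It assumes the usual FUSE convention of returning a negative errno on failure; chirp_rc_to_fuse is a hypothetical name, not a function from chirp or from my code.

```python
import errno

def chirp_rc_to_fuse(rc):
    """Hypothetical helper: turn a chirp-style return code into a FUSE result.

    The chirp call reports only success (0 or more) or failure (-1), with no
    errno, so there is nothing more specific to report than EIO.
    """
    if rc >= 0:
        # Success: pass the value through (e.g. 0, or a byte count).
        return rc
    # Failure: the real cause (ENOENT, EACCES, ...) is unknown at this point.
    return -errno.EIO
```

The practical consequence is that a job opening a nonexistent file sees EIO rather than ENOENT, so code that tests for a missing file by checking errno will misbehave.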
Enjoy!
Brian