
Re: [Condor-devel] Avoiding redundant executables in the SPOOL



On Wed, 30 Apr 2008, Alan De Smet wrote:

Question: what happens with when_to_transfer_output=ON_EXIT_OR_EVICT?  I
          don't know enough about what Condor does here to know whether
          this would be a problem under this scheme.  If the current
          behaviour on eviction is to transfer the executable (along with
          everything else) back to the SPOOL directory, and then send the
          executable (copied from the execute host that evicted it) out
          again to the new execute host, then you might be in trouble.

          Reason: if that's the case, then jobs that modify their
          executable will currently work fine with
          when_to_transfer_output=ON_EXIT_OR_EVICT even if evicted
          at some point before completion.  Once they've modified their
          executable, though, its hash value changes - under the new
          scheme, is this going to be a problem for the schedd (or any of
          the other daemons, e.g. the startd or starter)?

Other comments inline below - hope they are useful.

	-- Bruce

<snip>
1. condor_submit: Calculate a hash of the executable.

	Discussion:
	- We calculate on the submit side because:
		- It allows the schedd to potentially never look at the
		  file, saving network and schedd processor time for
		  remote submits.
		- submit can be aware that it's submitting the same
		  binary 50 times in different clusters, the schedd
		  can't.  This means we can do the hash once on the
		  client instead of 50 times on the server.
	- Arguably there is a risk of calculating it on the client
	  side: the client can lie.  Later steps will reveal that we
	  don't link between jobs of different Owners, so if a client
	  lies about the hash, they only hurt themselves.
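
(For concreteness, the submit-side hashing in step 1 might look something
like the Python sketch below; SHA-256, the chunk size, and the function name
are illustrative assumptions, not anything the proposal specifies.)

import hashlib

def hash_executable(path, chunk_size=65536):
    """Hash the executable in fixed-size chunks so large binaries need not
    be read into memory at once.  SHA-256 is an assumption; the proposal
    does not fix the algorithm."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# condor_submit would compute this once per distinct binary, even when the
# same binary is queued 50 times in different clusters, and send only the
# resulting hash to the schedd.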

Maybe you should think a little more about this? Consider the following situations:

- I've installed a front-end that allows my users to submit jobs to my
  compute resources.  At the back end it uses Condor.  Since my users
  never had direct access to Condor I submitted all jobs as the same user
  (just ensuring that the input and output for each job were in separate
  subdirectories), because that way the Condor submit hosts don't need to
  know about all my different users.

  Now I MUST make copy_to_spool=false, or hash collisions mean that
  the wrong executable may be run (either maliciously or by accident).
  You better make sure I know this in BIG, BRIGHT, GLOWING letters before
  I upgrade my Condor installation to a version of Condor with this
  feature.


- I use "pool accounts" for my users, i.e. there is a collection of
  accounts that get shared out between users depending on who is using my
  resources at the time.  User B submits an executable with the same hash
  as user A, but which is _not_ the same executable as that submitted by
  user A (hash collision).  User B is/was using pool account ACC01.  User
  B leaves (temporarily or permanently), but doesn't bother to make sure
  they've removed all their existing jobs, and user A now gets assigned
  pool account ACC01.  When user A now submits their executable, the schedd
  finds the colliding hash already spooled under that same Owner (user B's
  file) and reuses it, so user A's jobs run the wrong executable.


- Multiple submitters to the same schedd (e.g. via Condor-C, etc).  If a
  bad person can masquerade as me on ANY of those submitters they can
  submit a bad executable that will be run instead of the one I actually
  submitted.

  You might say: "but in that case you've got a problem anyway".  Yes, but
  previously the bad person could submit jobs and possibly (depending on
  the set-up) affect my jobs currently in the queue.  Now, they can
  DEFINITELY affect my jobs by causing them to run a bad executable.
  That bad executable might steal the input data of my real jobs (which
  might be valuable) and send it to the bad person.


<snip>
6. condor_schedd: Does the hash path find an existing file?

	6a: No: (The path doesn't exist, or it's invalid) Tell
	condor_submit: "Send it along."  Write the incoming file to
	the ickpt.  Is hash path valid, just not currently existing?

		6aa: No, the hash path is the special value "invalid":
		Skip to step 7.

		6ab: Yes: Hard link the ickpt to the hash path.

	6b: Yes: Great, we'll reuse that.  Tell condor_submit: "No
	thanks."  Hard link from hash path to the ickpt
	("$SPOOL/cluster7524.ickpt.subproc0").
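
(A rough Python sketch of the branching in step 6; the "exe-<CmdHash>" naming
of the hash path and the receive_file() helper are purely illustrative
assumptions about how the schedd might do this.)

import os

def spool_executable(spool, ickpt_name, cmd_hash, receive_file):
    """Sketch of step 6.  receive_file(path) stands in for however the
    schedd actually pulls the file from condor_submit; the "exe-<hash>"
    naming of the hash path is an illustrative assumption."""
    ickpt_path = os.path.join(spool, ickpt_name)  # e.g. cluster7524.ickpt.subproc0
    hash_path = (None if cmd_hash == "invalid"
                 else os.path.join(spool, "exe-" + cmd_hash))

    if hash_path is not None and os.path.exists(hash_path):
        # 6b: an identical executable is already spooled -- tell
        # condor_submit "No thanks" and reuse it via a hard link.
        os.link(hash_path, ickpt_path)
    else:
        # 6a: tell condor_submit "Send it along" and write the ickpt.
        receive_file(ickpt_path)
        if hash_path is not None:
            # 6ab: the hash is valid, just not spooled yet.
            os.link(ickpt_path, hash_path)
        # 6aa: the hash is the special value "invalid" -> nothing to link;
        # proceed to step 7.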

Obviously you know that using hard links makes this filesystem-specific. In particular, where you get filesystems masquerading as "normal" UNIX-type filesystems, you may find that the filesystem's "support" for hard links is actually implemented by making duplicate copies of the file's _data_. In such situations you are no better off than you were before. In fact, you are slightly worse off: you've done lots of work to get here, you have a potential security hole, AND you still have multiple copies of the executable. Since the filesystem "lies" that it supports hard links, the schedd presumably doesn't know that this is the situation.
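
(One possible, purely speculative, defensive check: after making the link,
stat both names and confirm they really do share an inode.  A filesystem that
quietly copies the data instead of linking would normally be caught by this,
though a determined one could fake these fields too.)

import os

def link_shares_storage(path_a, path_b):
    """Heuristic check after os.link(): do both names refer to the same
    device and inode, with a link count of at least two?"""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino) and a.st_nlink >= 2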

Also, there are situations where not all users are allowed to make hard links (presumably the schedd is running as root (if it can) when it does all this?). I guess, from your previous e-mail, that here you fall back to the old non-hashed behaviour?

One more thing: in a previous e-mail, you said that "this won't work in Windows". Why not? NTFS has supported hard links since at least NTFS 3.0 (Windows 2000).


8. condor_schedd: Time to remove the job from the queue.  Check
the hard link count for the ickpt file.  Is it 2?  We're the last
user, unlink the hash path.

	- If it's 3 or above, there are other users of this file.
	  Leave it be.
	- If it's 1, this is a job that didn't have a hash.

9. condor_schedd: unlink the ickpt file, same as before.
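
(A sketch of the link-count logic in steps 8 and 9, again assuming the
illustrative hash-path naming from the step 6 sketch above.)

import os

def cleanup_job_executable(ickpt_path, hash_path):
    """Steps 8-9: unlink the hash path only when this job is its last
    user, then unlink the ickpt file as before."""
    nlink = os.stat(ickpt_path).st_nlink
    if nlink == 2:
        # Only this ickpt and the hash path remain: we are the last user.
        os.unlink(hash_path)
    # nlink >= 3: other jobs (or anything else that made a link) still
    # reference the file, so leave the hash path alone.
    # nlink == 1: the job was submitted without a hash; nothing extra to do.
    os.unlink(ickpt_path)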

...so if a user can create a hard link to the hash path then the hash path (and hence the spooled copy of the executable) will never be removed. What are the circumstances in which a user can do this? In particular, I guess, if the only circumstances in which a user can do this are where they also have the ability to do other equally bad things, or to do even worse things, then you don't really need to worry about it. :)


<snip>
Escaping algorithm:

The goals are:
- We need a character to separate fields in the CmdHash.  This
 character must be escaped.
- All resulting characters, including the field character, must
 be valid as a file name on all operating systems Condor
 supports.

Field separator: - (a comma might be better, but I don't know if
Windows and MacOSX like it)
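
(To make the goals concrete, one possible escaping scheme is sketched below,
assuming for illustration that the CmdHash combines the Owner and the digest;
the escape character, the hex encoding, and the exact "safe" character set
are assumptions, not the proposal's final choices.)

def escape_field(field, escape_char="_"):
    """Escape a field so that neither the separator ("-") nor anything
    else that is unsafe in a filename on the platforms Condor supports
    appears literally.  The escape character itself is also escaped, so
    the encoding is unambiguous."""
    safe = set("abcdefghijklmnopqrstuvwxyz"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "0123456789.")
    out = []
    for ch in field:
        if ch in safe:
            out.append(ch)
        else:
            # Two hex digits per UTF-8 byte keeps decoding unambiguous.
            out.extend("%s%02x" % (escape_char, b) for b in ch.encode("utf-8"))
    return "".join(out)

def make_cmd_hash(owner, digest):
    # e.g. "jsmith-9f86d081..."; "-" can never appear unescaped in a field.
    return "-".join(escape_field(f) for f in (owner, digest))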

Windows (at least from Windows 2000 onward) is perfectly happy with a comma (,) in the filename. (Of course, as with spaces in filenames, users need to surround the filename in double quotes.)

My MacOS 10.4.10 system was also happy with a comma in the filename.


<snip>
General notes:

This is insecure in the face of CLAIMTOBE, ANONYMOUS, or any
other situation in which multiple users map to the same Owner.
Don't do that.  Even without this behavior, that's very insecure
and users can mess with each other using condor_qedit, condor_rm
and more.

...except that, as I mention above, you can have situations where multiple users map to the same Owner, but users can't use condor_qedit, etc. (e.g. if I use some sort of portal or front-end for job submission, monitoring, etc.). However, in such situations they are now vulnerable to hash collision attacks on executables whilst previously they were fine.

--
Bruce Beckles,
e-Science Specialist,
University of Cambridge Computing Service.