Re: [HTCondor-devel] more on IP addresses


Date: Fri, 22 Mar 2013 12:52:44 -0500
From: Erik Paulson <epaulson@xxxxxxxxxxxx>
Subject: Re: [HTCondor-devel] more on IP addresses
Igor - 

Migrating jobs between nodes is one way to deal with upgrades. There are cases where it doesn't help, but it also helps with other upgrade problems that are outside the scope of HTCondor and wouldn't be helped by the ability to reattach the startd to an existing process tree/sandbox (rebooting to patch a kernel, for example).

But to get back to the original point, and to make it a bit more concrete - the CHTC pool has 4000-ish cores. We don't have 4000 extra publicly-addressable IPv4 addresses at the UW to give each core an IP address - and most jobs running on those cores don't need one anyway. However, they could probably scrape together, say, 500 extra IPs that we could give to jobs as they run on the pool.

It'd be nice to be able to give jobs that wanted an IP address an IP address - either something in the 10.x reserved/private IPs or in one of the 500 extra routed/public IPs that the UW controls. HTCondor should think about how to manage those IP addresses. There are at least three things to think about:
- How do you manage the IPs and actually put them in job classads?
- How do you actually set up the routing infrastructure and host configuration so that, when you give a job an IP address and it starts running on a core, packets actually get there?
- If HTCondor wants to migrate a job and its network connection, how can it do that without delegating the whole process to some VMware/Microsoft/Citrix all-singing/all-dancing solution? (Or do you just say screw it and run the whole thing in vSphere or whatever else is out there?) I don't have any idea how much stock VM monitors help you with live migrations, how much you can roll yourself versus how much you can only get as commercial add-ons, or how programmable/extensible those add-ons are.
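
To make the first question a bit more concrete, here is a rough Python sketch of the bookkeeping side only: a pool that leases addresses to jobs that ask for one and takes them back when the job exits. Everything in it is an assumption for illustration. The IPPool class, the assign() helper, and the WantIPAddress / JobIPAddress attribute names are hypothetical and are not existing HTCondor features, and a plain dict stands in for a job classad. It says nothing about routing or migration.

```python
import ipaddress


class IPPool:
    """Hypothetical pool that leases one address per job from a fixed block."""

    def __init__(self, cidr):
        # hosts() skips the network and broadcast addresses of the block.
        self._free = list(ipaddress.ip_network(cidr).hosts())
        self._leases = {}  # job id -> leased address

    def acquire(self, job_id):
        """Return the job's address, leasing one if needed; None if exhausted."""
        if job_id in self._leases:
            return self._leases[job_id]
        if not self._free:
            return None
        addr = self._free.pop(0)
        self._leases[job_id] = addr
        return addr

    def release(self, job_id):
        """Give the job's address back to the pool, e.g. when the job exits."""
        addr = self._leases.pop(job_id, None)
        if addr is not None:
            self._free.append(addr)


def assign(job_ad, public_pool, private_pool):
    """Fill in a (hypothetical) JobIPAddress attribute on a dict-based job ad."""
    want = job_ad.get("WantIPAddress")  # hypothetical: "public", "private", or absent
    if want is None:
        return job_ad  # most jobs don't need an address at all
    pool = public_pool if want == "public" else private_pool
    addr = pool.acquire(job_ad["JobId"])
    if addr is not None:
        job_ad["JobIPAddress"] = str(addr)
    return job_ad
```

A job that declares WantIPAddress = "public" would draw from the small routed block, one that declares "private" from a 10.x block, and a job that declares nothing consumes no address. What to do when the public block runs out (leave the job idle, or fall back to a private address) is a policy question the sketch leaves open.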

-Erik  
 
On Thu, Mar 21, 2013 at 11:55 PM, Igor Sfiligoi <sfiligoi@xxxxxxxx> wrote:
Hi Erik.

I think we are mixing two things here:
migrating jobs between nodes and killing jobs due to upgrades.

Personally, I would really like the second solved first!
As you said, "It's nuts that you have to restart so many jobs when you upgrade HTCondor"!
The HTCondor startd should be able to restart (and possibly change version in between) without killing any jobs!!!

Not rocket science, it just does not fit the current architecture.

My 2c,
  Igor


On 03/21/2013 07:03 PM, Erik Paulson wrote:
HTCondor really should be doing something to manage IP addresses and treat them as a resource. Not every job will need a public-facing IP address, but some will, and HTCondor should think harder about them and possibly require that dependency be declared ahead of time.

Where you really should be is in a place where you can migrate any job, including its network connections, to other hosts. Just here at the UW, the rigmarole with scheduling and upgrades would be dramatically simpler if you could shuffle things around and had a better sense of what each job was doing with its network. (It's nuts that you have to restart so many jobs when you upgrade HTCondor.)

-Erik


_______________________________________________
HTCondor-devel mailing list
HTCondor-devel@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel


