Subject: [HTCondor-users] Looking for a good HPC vs. HTC soundbite
Hi folks,
I'm on my way to a company-internal
symposium in Tucson where I'll be giving a presentation about some interesting
recent work with HTCondor, in the High *Performance* Computing track. I'll
be talking to an audience which, by and large, is unfamiliar with HTCondor's
capabilities; those who have seen it before may know only v6.x or earlier.
This, of course, got me to thinking about something that I recall Prof.
Livny talking about at HTCondor Week last year.
The emphasis in "High Performance"
computing has been to orchestrate as many cooperating CPU cores as possible,
running as fast as possible - you see things like the SGI UV 3000 series
where you have up to 256 sockets (not cores, *sockets*... zomg...)
on a cache-coherent memory image, the proliferation of InfiniBand fabrics,
10Gb, 40Gb, and 100Gb Ethernet, RDMA and RoCE, etc., all working
to build out the most fierce Lamborghini of a computing system the world
has ever seen.
But Prof. Livny's observation was that
the paradigm of large-scale computing is shifting around us, and it will
have the same kind of revolutionary impact on computing as the introduction
of the PC. We are entering a world where for an absurdly modest price,
you can harness the power of tens or hundreds of thousands of CPU cores
for only as long as you need it. Even with the densest Xeon chips in
the biggest UV 3000 available, to the tune of millions upon millions of
dollars, you can't even remotely come close to the power and scale that's
available in Amazon EC2, Azure, and the rest for however many pennies per
hour you want to spend.
Simple math dictates that a hundred
machines which each take ten minutes to run a given task will complete more
of those tasks in a given time than a tricked-out muscle-machine which
can complete the same task in ten seconds: ten tasks per minute versus six.
That's what "High Throughput" is all about, and that was what I saw as the
crux of Prof. Livny's observation: the most important work in large-scale
computing in the coming years is going to be figuring out how to adapt the
design of algorithms to this new reality - figuring out how to run your
four-week 20-core MPI job in a few hours on 20,000 intermittently-available
EC2 spot instances instead.
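
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. It assumes the tasks are fully independent, that there is no
scheduling or data-transfer overhead, and (my assumption, not something
stated above) that each spot instance contributes one core. The function
names are made up for illustration.

```python
def tasks_completed(machines: int, seconds_per_task: int, window_seconds: int) -> int:
    """Whole tasks finished in the window, assuming independent tasks
    and zero scheduling overhead (an idealization)."""
    return machines * (window_seconds // seconds_per_task)


def ideal_wall_hours(core_hours: float, cores: int) -> float:
    """Best-case wall-clock hours if the work split perfectly across cores.
    Real MPI jobs rarely split this cleanly; that is the hard part."""
    return core_hours / cores


HOUR = 3600  # seconds

# A hundred machines at ten minutes per task vs. one machine at ten seconds.
pool = tasks_completed(machines=100, seconds_per_task=600, window_seconds=HOUR)
single = tasks_completed(machines=1, seconds_per_task=10, window_seconds=HOUR)
print(f"pool of 100: {pool} tasks/hour, single fast machine: {single} tasks/hour")

# Four weeks on 20 cores, expressed as core-hours, then spread over
# 20,000 cores (assuming one core per instance).
mpi_core_hours = 4 * 7 * 24 * 20
print(f"{mpi_core_hours} core-hours -> "
      f"{ideal_wall_hours(mpi_core_hours, 20_000):.2f} h on 20,000 cores (ideal)")
```

The ideal figure comes out under an hour, so "a few hours" leaves room for
instances coming and going and for whatever part of the job refuses to
parallelize.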
Since I only have about twenty minutes
in my time slot, I'd be delighted if someone who has thought through this
issue could offer a pithy, memorable, and succinct way to express this
idea to a potentially skeptical audience. Or a link to one.
Thanks for any suggestions!
Hope to see some of you at HTCondor
Week two weeks hence!
Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681)
339.293.9149 cell
michael.v.pelletier@xxxxxxxxxxxx