Friday, December 4, 2009

Intel’s Single-Chip Clus… (sorry) Cloud

Intel's recent announcement of a 48-core "single-chip cloud" (SCC) is now rattling around several news sources, with varying degrees of boneheaded-ness and/or willful suspension of disbelief in the hype. Gotta set this record straight, and also raise a few questions I didn't find answered in the sources now available (the presentation, the software paper, the developers' video).

Let me emphasize from the start that I do not think this is a bad idea. In fact, it's an idea good enough that I've led or been associated with rather similar architectures twice, although not on a single chip (RP3, POWER 4 (no, not the POWER4 chip; this one was four original POWER processors in one box, apparently lost on the internet) (but I've still got a button…)). Neither was, in hindsight, a good idea at the time.

So, some facts:

SCC is not a product. It's an experimental implementation of which about a hundred will be made, given to various labs for software research. It is not like Larrabee, which will be shipped in full-bore product-scale quantities Some Day Real Soon Now. Think concept car. That software research will surely be necessary, since:

SCC is neither a multiprocessor nor a multicore system in the usual sense. They call it a "single chip cloud" because the term "cluster" is déclassé. Those 48 cores have caches (two levels), but cache coherence is not implemented in hardware. So, it's best thought of as a 48-node cluster of uniprocessors on a single chip. Except that those 48 cluster nodes all access the same memory. And if one processor changes what's in memory, the others… don't find out. Until the cache at random happens to kick the changed line out. Sometime. Who knows when. (Unless something else is done; but what? See below.)

Doing this certainly does save hardware complexity, but one might note that quite adequately scalable cache coherence does exist. It's sold today by Rackable (the part that was SGI); Fujitsu made big cache-coherent multi-x86 systems for quite a while; and there are ex-Sequent folks out there who remember it well. There's even an IEEE standard for it (SCI, Scalable Coherent Interface). So let's not pretend the idea is impossible and assume your audience will be ignorant. Mostly they will be, but that just makes misleading them more reprehensible.

To leave cache coherence out of an experimental chip like this is quite reasonable; I've no objection there. I do object to things like the presentation's calling this "New Data-Sharing Options." That's some serious lipstick being applied.

It also leads to several questions that are so far unanswered:

How do you keep the processors out of each other's pants? Ungodly uncontrolled race-like uglies must happen unless… what? Someone says "software," but what hardware does that software exercise? Do they, perhaps, keep the 48 separate from one another by virtualization techniques? (Among other things, virtualization hardware has to keep virtual machine A out of virtual machine B's memory.) That would actually be kind of cool, in my opinion, but I doubt it; hypervisor implementations require cache coherence, among other issues. Do you just rely on instructions that push individual cache lines out to memory? Ugh. Is there a way to decree whole swaths of memory to be non-cacheable? Sounds kind of inefficient, but whatever. There must be something, since they have demoed some real applications and so this problem must have been solved somehow. How?

What's going on with the operating system? Is there a separate kernel for each core? My guess: Yes. That's part of being a clus… sorry, a cloud. One news article said it ran "Rock Creek Linux." Never heard of it? Hint: The chip was called Rock Creek prior to PR.

One iteration of non-coherent hardware I dealt with used cluster single system image to make it look like one machine for management and some other purposes. I'll bet that becomes one of the software experiments. (If you don't know what SSI is, I've got five posts for you to read, starting here.)

Appropriately, there's mention of message-passing as the means of communicating among the cores. That's potentially fast message passing, since you're using memory-to-memory transfers in the same machine. (Until you saturate the memory interface – only four ports shared by all 48.) (Or until you start counting usual software layers. Not everybody loves MPI.) Is there any hardware included to support that, like DMA engines? Or protocol offload engines?

Finally, why does every Intel announcement of gee-whiz hardware always imply it will solve the same set of problems? I'm really tired of those flying cars. No, I don't expect to ever see an answer to that one.

I'll end by mentioning something in Intel's SCC (née Rock Creek) that I think is really good and useful: multiple separate power regions. Voltage and frequency can be varied separately in different areas of the chip, so if you aren't using a bunch of cores, they can go slower and/or draw less power. That's something that will be "jacks or better" in future multicore designs, and spending the effort to figure out how to build and use it is very worthwhile.

Heck, the whole thing is worthwhile, as an experiment. On its own. Without inflated hype about solving all the world's problems.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

(This has been an unusually topical post, brought to you courtesy of the author's head-banging-wall level of annoyance at boneheaded news stories. Soon we will resume our more usual programming. Not instantly. First I have to close on and move into a house. In Holiday and snow season.)

Thursday, November 26, 2009

Oh, for the Good Old Days to Come

I recently had a glorious flashback to 2004.

Remember how, back then, when you got a new computer you would be just slightly grinning for a few weeks because all your programs were suddenly so crisp and responsive? You hadn't realized your old machine had a rubbery-feeling delay responding to your clicks and key-presses until zip! You booted the new machine for the first time, and wow. It just felt good.

I hadn't realized how much I'd missed that. My last couple of upgrades have been OK. I've gotten a brighter screen, better graphics, lighter weight, and so on. They were worth it, intellectually at least. But the new system zip, the new system crispness of response – it just wasn't there.

I have to say I hadn't consciously noticed that lack because, basically, I mostly didn't need it. How much faster do you want a word processor to be, anyway? So I muddled along like everyone else, all our lives just a tad more drab than they used to be.

Of course, the culprit denying us this small pleasure has been the flattening of single-thread performance wrought by the half-death of Moore's Law. Used to be, after a couple of years delay you would naturally get a system that ran 150% or 200% faster, so everything just went faster. All your programs were rejuvenated, and you noticed, instantly. A few weeks or so later you were of course used to it. But for a while, life was just a little bit better.

That hasn't happened for nigh unto five years now. Sure, we have more cores; I personally didn't get much use out of them, and my regular programs didn't perk up. But as I said, I really didn't notice, consciously.

So what happened to make me realize how deprived I – and everybody else – have been? The Second Life client.

I'd always been less than totally satisfied with how well SL ran on my system. It was usable. But it was rubbery. Click to walk or turn and it took just a little … time before responding. It wasn't enough to make things truly unpleasant (except when lots of folks were together, but that's another issue). But it was enough to be noticeably less than great. I just told myself, what the heck, it's not Quake but who cares, that's not what SL is about.

Then for reasons I'll explain in another post, I was motivated to reanimate my SL avatar. It hadn't seen any use for at least six months, so I was not at all surprised to find a new SL client required when I connected. I downloaded, installed, and cranked it up.

Ho. Ly. Crap.

The rubber was gone.

There were immediate, direct responses to everything I told it to do. I proceeded to spend much more time in SL than I originally intended, wandering around and visiting old haunts just because it was so pleasant. It was a major difference, on the order of the difference I used to encounter when using a brand-new system. It was like those good old days of CPU clock-cranking madness. The grin was back.

So was this "just" a new, better, software release? Well, of course it was that. But I wouldn't have bothered writing this post if I hadn't noticed two other things:

First, my CPU utilization meter was often pegged. Pegged, as in 100% utilization, where flooring just one of my two CPUs reads only 50%. When I looked a little deeper, I saw the one, single SL process was regularly over 50%. I've not looked at any of the SL documentation on this, but from that data I can pretty confidently say that this release of the SL client can make effective use of both cores simultaneously. It's the only program I've got with that property.

Second, my thighs started burning. Not literally. But that heat tells me when my discrete GPU gets cranking. So, this client was also exercising the GPU, to good effect.

Apparently, this SL client actually does exploit the theoretical performance improvements from graphics units and multiple cores that had been lying around unused in my system. I was, in effect, pole-vaulted about two system generations down the road – that's how long it's been since there was a discernible difference. The SL client is my first post-Moore client program.

All of this resonates for me with the recent SC09 (Supercomputing Conference 2009) keynote of Intel's Justin Rattner. Unfortunately it wasn't recorded by conference rules (boo!), but reports are that he told the crowd they were in a stagnant, let us not say decaying, business unless they got their butts behind pushing the 3D web. (UPDATE: Intel has posted video of Rattner's talk.)

Say What? No. For me, particularly following the SL experience above, this is not a "Say What?" moment. It makes perfect sense. Without a killer application, the chip volumes won't be there to keep down the costs of the higher-end chips used in non-boutique supercomputers. Asking that audience for a killer app, though, is like asking an industrial assembly-line designer for next year's toy fashion trends. Killer apps have to be client-side and used by the masses, or the volumes aren't there.

Hence, the 3D Web. This would take the kind of processing in the SL client, which can take advantage of multicore and great graphics processing, and put it in something that everybody uses every day: the browser. Get a new system, crank up the browser, and bang! you feel the difference immediately.

Only problem: Why does anybody need the web to be 3D? This is the same basic problem with virtual worlds: OK, here's a virtual world. You can run around and bump into people. What, exactly, do you do there? Chat? Bogus. That's more easily done, with more easily achieved breadth of interaction, on regular (2D) social networking sites. (Hence Google's virtual world failure.)

There are things that virtual worlds and a "3D web" can, potentially, excel at; but that's a topic for a later post.

In the meantime, I'll note that in a great crawl-first development, there are real plans to use graphics accelerators to speed up the regular old 2D web, by speeding up page rendering. Both Microsoft and Mozilla (IE & Firefox) are saying they'll bring accelerator-based speedups to browsers (see CNET and Bas Schouten's Mozilla blog) using Direct2D and DirectWrite to exploit specialized graphics hardware.

One could ask what good it is to render a Twitter page twice as fast. (That really was one of the quoted results.) What's the point? Asking that, however, would only prove that One doesn't Get It. You boot your new system, crank up the browser and bam! Everything you do there, and you do more and more there, has more zip. The web itself – the plain, old 2D web – feels more directly connected to your inputs, to your wishes; it feels more alive. Result?

The grin will be back. That's the point.

Sunday, November 8, 2009

Multicore vs. Cloud Computing

Multicore is the wave of the future. Cloud Computing is the wave of the future. Do they get along? My take: Eh. Sorta. There are the usual problems with parallel programming support, despite hubbub about parallel languages and runtimes and ever bigger multicores.

Multicore announcements in particular have been rampant recently. Not just the usual drumbeat from Intel and AMD; that continues, out to 8-way, 12-way, and onward to the future. Now more extreme systems are showing their heads, such as ScaleMP announcing vSMP for the Cloud, (and also for SMB), a way of gluing together X86 multicore systems into even larger shared-memory (NUMA) systems. 3Leaf is also doing essentially the same thing. Tilera just announced a 100-core chip product, beating Intel and others to the punch. Windows 7 has replaced the locking in XP, allowing it to "scale to 256 processors" – a statement that tells me (a) they probably did fix a bunch of stuff; and (b) they reserved one whole byte for the processor/thread ID. (Hope that's not cast in concrete for the future, or you'll have problems with your friendly neighborhood ScaleMP'd Tilera.)

So there's a collection of folks outside the Big Two processor vendors who see a whole lot of cores as good. Non-"commodity" systems – by the likes of IBM, Sun, and Fujitsu – have of course been seeing that for a long time, but the low-priced spread is arguably the most important part of the market, and certainly is the only hardware basis I've seen for clouds.

What's in the clouds for multicore?

Amazon's instance types and pricing do take multicore into account: at the low end, a small Linux instance is $0.085/hour for 1 core, nominally 1 GHz; and at the high end, still Linux, you can get "Extra-large High CPU" at $0.68/hour for 8 cores, 2.5 GHz each. So, assuming perfect parallel scaling, that's about 20X performance for 8X the price, a good deal. (I simplified 1 Amazon compute unit to 1 GHz. Amazon says it's 1.0-1.2 GHz 2007 Opteron or Xeon.)

Google App Engine (GAE) just charges per equivalent 1.2 GHz single CPU. What happens when you create a new thread in your Java code (now supported) is… well, you can't. Starting a new thread isn't supported. So GAE basically takes the same approach as Microsoft Azure, which treats multicore as an opportunity to exercise virtualization (not necessarily hardware-level virtualization, by my read), dividing multicores down into single core systems.

The difference between AWS, on the one hand, and GAE or Azure on the other, of course makes quite a bit of sense.

GAE and Azure are both PaaS (Platform as a Service) systems, providing an entire application platform, and the application is web serving, and the dominant computing models for web-serving are all based on throughput, not multicore turnaround.

AWS, in contrast, is IaaS (Infrastructure as a Service): You got the code? It's got the hardware. Just run it. That's any code you can fit into a virtual machine, including shared-memory parallel code, all the way up to big hulking database systems.

Do all God's chillun have to be writing their web code in Erlang or Haskell or Clojure (which turns into Java at runtime) or Ct or whatever before PaaS clouds start supporting shared-memory parallelism? But if PaaS is significant, doesn't it have to support E/H/C/Ct/etc. before those chillun will use them? Do we have a chicken-and-egg problem here? I think this just adds to the already long list of good reasons why parallel languages haven't taken off: PaaS clouds won't support them.

And in the meantime, there is of course a formidable barrier to others hosting their own specialty code on PaaS, like databases and at least some HPC codes. Hence the huge number of Amazon Solution Providers, including large-way SMP users such as Oracle, IBM DB2, and Pervasive, while Google has just a few third-party Python libraries so far.

PaaS is a good thing, but I think that sooner or later it will be forced, like everyone else, to stop ignoring the other wave of the future.


Postscript / Addendum / Meta-note: My apologies for the lack of blog updates recently; I've been busy on things that make money or get me housed. I've something like six posts stacked up to be written, so you can expect more soon.

Wednesday, September 23, 2009

HPC – The Next Twenty Years

The Coalition for Academic Scientific Computation had its 20-year anniversary celebration symposium recently, and I was invited to participate on a panel with the topic HPC – The Next 20 Years. I thought it would be interesting to write down here what I said in my short position presentation. Eventually all the slides of the talks will be available; I’ll update this post when I know where.

First, my part.

Thank you for inviting me here today. I accepted with misgivings, since futurists give me hives. So I stand here now with a kind of self-induced autoimmune disorder.

I have no clue about what high-performance computing will look like 20 years from now.

(Later note: I was rather surprised that the other panelists did not say that; they all did agree.)

So, I asked a few of my colleagues. The answers can be summarized simply, since there were only three, really:

A blank stare. This was the most common reaction. Like “Look, I have a deadline tomorrow.”

Laughter. I understand that response completely.

And, finally, someone said: What an incredible opportunity! You get to make totally outrageous statements that you’ll never be held accountable for! How about offshore data centers, powered by wave motion, continuously serviced by autonomous robots with salamander-level consciousness, spidering around replacing chicklet-sized compute units, all made by the world’s largest computer vendor – Haier! [They make refrigerators.] And lots of graphs, all going up to the right!

There’s a man after my own heart. I clearly owe him a beer.

And he’s got a lot more imagination than I have. I went out and boringly looked for some data.

What I found was the chart below, from the ITRS, the International Technology Roadmap for Semiconductors, a consortium sponsored by other semiconductor consortia for the purpose of creating and publishing roadmaps. It’s their 2008 update, the latest published, since they meet in December to do the deed. Here it is:

Oooooo. Complicated. Lots of details! Even the details have details, and lesser details upon ‘em. Anything with that much detail obviously must be correct, right?

My immediate reaction to this chart, having created thousands of technical presentations in my non-retired life, is that this is actually a transparent application of technical presentation rule #34: Overwhelm the audience with detail.

The implied message this creates is: This stuff is very, very complicated. I understand it. You do not. Therefore, obviously, I am smarter than you. So what I say must be correct, even if your feeble brain cannot understand why.

It doesn’t go out a full 20 years, but does go to 2020, and it says that by then we’ll be at 10 nm feature sizes, roughly a quarter of what’s shipping today. Elsewhere it elaborates on how this will mean many hundreds of processors per chip, multi-terabit flash chips, and other wonders.

But you don’t have to understand all of that detail. You want to know what that chart really means? I’ll tell you. It means this, in a bright, happy, green:

Everything’s Fine!

We’ll just keep rolling down the road, with progress all the way. No worries, mate!

Why does it mean that? Because it’s a future roadmap. Any company publishing a roadmap that does not say “Everything’s Fine!” is clearly inviting everybody to short their stock. The enormous compromise that must be performed within a consortium of consortia clearly must say that, or agreement could not conceivably be reached.

That said, I note two things on this graph:

First, the historical points on the left really don’t say to me that the linear extrapolation will hold at that slope. They look like they’re flattening out. Another year or so of data would make that more clear one way or another, but for now, it doesn’t look too supportive of the extrapolated future predictions.

Second, a significant update for 2008 is noted as changing the slope from a 2.5-year cycle to a 3-year cycle of making improvements. In conjunction with the first observation, I’d expect future updates to increase the cycle length even more, gradually flattening out the slope, extending the period over which the improvements will be made.

The implication: Moore’s Law won’t end with a bang; it will end with a whimper. It will gradually fade out in a period stretching over at least two decades.

I lack the imagination to say what happens when things really flatten out; that will depend on a lot of things other than hardware or software technology. But in the period leading up to this, there are some things I think will happen.

First, computing will in general become cheaper – but not necessarily that much faster. Certainly it won’t be much faster per processor. Whether it will be faster taking parallelism into account we’ll talk about in a bit.

Second, there will be a democratization of at least some HPC: Everybody will be able to do it. Well before 20 years are out, high-end graphics engines will be integrated into traditional high-end personal PC CPUs (see my post A Larrabee in Every PC and Mac). That means there will be TeraFLOPS on everybody’s lap, at least for some values of “lap”; lap may really be pocket or purse.

Third, computing will be done either on one’s laptop / cellphone / whatever; or out in a bloody huge mist/fog/cloud -like thing somewhere. There may be a hierarchy of such cloud resources, but I don’t think anybody will get charged up about what level they happen to be using at the moment.

Those resources will not be the high-quality compute cycles most of the people in the room – huge HPC center managers and users – are usually concerned with. They’ll be garbage computing; the leftovers when Amazon or Google or Microsoft or IBM are finished doing what they want to do.

Now, there's nothing wrong with dumpster-diving for computing. That, after all, is what many of the original clusters were all about. In fact, the first email I got after publishing the second edition of my book said, roughly, “Hey love the book, but you forgot my favorite cluster – Beowulf.” True enough. Tom Sterling’s first press release on Beowulf came out two weeks after my camera-ready copy was shipped. “I use that,” he continued. “I rescued five PCs from the trash, hooked them up with cheap Ethernet, put Linux on them, and am doing [some complicated technical computing thing or other, I forget] on them. My boss was so impressed he gave me a budget of $600 to expand it!”

So, garbage cycles. But really cheap. In lots of cases, they’ll get the job done.

Fourth, as we get further out, you won’t get billed by how many processors or memory or racks you use – but by how much power your computation takes. And possibly by how much bandwidth it consumes.

Then there’s parallelism.

I’m personally convinced that there will be no savior architecture or savior language that makes parallel processing simple or easy. I’ve lived through a good four decades of trying to find such a thing, with significant funding available, and nothing’s emerged. For languages in particular, take a look at my much earlier post series about there being 101 Parallel Languages, none of which are in real use. We’ve got MPI – a package for doing message-passing – and that’s about it. Sometimes OpenMP (for shared memory) gets used, but it’s a distant second.

That’s the bad news. The good news is that it doesn’t matter in many cases, because the data sets involved will be absolutely humongous. Genomes, sensor networks, multimedia streams, the entire corpus of human literature will all be out there. This will offer enormous amounts of the kinds of parallelism traditionally derided as “embarrassingly parallel” because it didn’t pose any kind of computer science challenge: There was nothing interesting to say about it because it was too easy to exploit. So despite the lack of saviors in architecture and languages, there will be a lot of parallel computing. There are people now trying to call this kind of computation “pleasantly parallel.”

Probably the biggest challenges will arise in getting access to the highest-quality, most extensive exemplars of such huge data sets.

The traditional kind of “hard” computer-science-y parallel problems may well still be an area of interest, however, because of a curious physical fact: The amount of power consumed by a collection of processing elements goes up linearly with the number of processors; but it also goes up as the square of the clock frequency. So if you can do the same computation, in the same time, with more processors that run more slowly, you use less power. This is much less macho than traditional massive parallelism. “I get twice as much battery life as you” just doesn’t compete with “I have the biggest badass computer on the planet!” But it will be a significant economic issue. From this perspective, parallel office applications – such as parallel browsers, and even the oft-derided Parallel PowerPoint – actually make sense, as a way to extend the life of your cell phone battery charge.

Finally, I’d like to remind everybody of something that I think was quite well expressed by Tim Bray, when he tweeted this:

Here it is, 2009, and I'm typing SQL statements into my telephone. This is not quite the future I'd imagined.

The future will be different – strangely different – from anything we now imagine.

* * * *

That was my presentation. I’ll put some notes about some interesting things I heard from others in a separate post.

Friday, September 11, 2009

Of Muffins and Megahertz

Some readers have indicated, offline, that they liked the car salesman dialog about multicore systems that appeared in What Multicore Really Means. So I thought it might be interesting to relate the actual incident that prompted the miles- to muffins-per-hour performance switch I used.

If you haven't read that post, be warned that what follows will be more meaningful if you do.

It was inspired by a presentation to upper-level management of a 6-month-plus study of what was going on in the silicon concerning clock rate, and what if anything could be done about it. This occurred several years ago. I was involved, but not charged with providing any of the key slides. Well, OK, not one of my slides ended up being used.

It of course began with the usual "here's the team, here's how hard we worked" introduction.

First content chart: a cloud of data points collected from all over the industry that showed performance – specifically SPECINT, an integer benchmark – keeling over. It showed a big, obvious switch from the usual rise in performance, with a rough curve fit, breaking to a much lower predicted performance increase from now on. Pretty impressive. The obvious conclusion: Something major has happened. Things are different. There is big trouble.

Now, there's a rule for executive presentations: Never show a problem without proposing a solution. (Kind of like never letting a crisis go to waste.) So,

Second chart: a very similar-looking cloud of data points, sailing on at the usual growth rate for many years to come – labeled as multiprocessor (MP) results, what the industry would do in response. Yay, no problem! It's all right! We just keep on going! MP is the future! Lots of the rest of the pitch was about various forms of MP, from normal to bizarre.

Small print on second chart: It graphed SPECRATE. Not SPECINT.

SPECINT is a single-processor measure of performance. SPECRATE is, basically, how many completely separate SPECINTs you can do at once. Like, say, instead of the response time of PowerPoint, you get the incredibly useful measure of how many different PowerPoint slides you can modify at the same time. Or you change from miles per hour to muffins per hour.

Nothing on any slide or in any verbal statements referred to the difference. The chart makers - mostly high-level silicon technology experts - knew the difference, at least in theory. At least some of them did. I know others definitely did not in any meaningful sense.

At any event, throughout the entire rest of the presentation they displayed no inclination to inform anybody what it really meant. They didn't even distinguish the good result: typical server tasks can in general make really good use of parallelism. (See IT Departments Should NOT Fear Multicore.)

I was aghast. I couldn't believe that would be presented, like that, no matter what political positioning was going on. But I "knew better" than to say anything. Those charts were a result not just of data mining the industry for performance data but of data mining the company politically to get something that would reflect best on everybody involved. Speak up, and you get told that you don't know the big picture.

My opinion about feeding nonsense to anybody should be obvious from this blog. I don't think I'm totally blind in the political spectrum, but hey, guys, come on. That's blatant.

One hopes that the people who were the target knew enough to know the difference. I suspect that the whole point of the exercise, from their point of view, was just to really, firmly, nail down the point that the first chart – SPECINT keeling over – was a physical fact, and not just one of the regularly-scheduled pitches from the silicon folks for more development funds because they were in trouble. The target audience probably stopped paying attention after that first slide.

I don't mean to imply above that the gents who are responsible for the physical silicon don't regularly have real problems; they do. But this situation was a problem of a whole different dimension.

It still is.

Tuesday, September 8, 2009

Japan, Inc., vs. Intel – Not Really

Fujitsu, Toshiba, Panasonic, Renesas Technology, NEC, Hitachi and Canon, with Japan's Ministry of Economy, Trade and Industry supplying 3-4 billion yen, are pooling resources to build a new "super CPU" for consumer electronics by the end of 2012, according to an article in Forbes. It's being publicized as "taking on Intel."

The design is based on the work of Hironori Kasahara, professor of computer science at Waseda University, and is allegedly extremely power-efficient. It even "runs on solar cells that will use less than 70% of the power consumed by normal ones." Man, I hate silly marketing talk, especially when subject to translation.

El Reg also picked up on this development.

Why a new CPU design? To jump to the conclusion: I don't know. I don't see it. Not clear what's really going on here.

Digging around for info runs into an almost impenetrable wall of academic publisher copyrights. I did find a downloadable paper from back in 2006, and what looks like a conference poster session exhibit, and a friend got me a copy of a more recent paper that gave a few more clues.

The main advances here appear to be in Kasahara's OSCAR compiler, which produces a hierarchical coarse-grain task graph that is statically scheduled on multiprocessors by the compiler itself. The lowest levels appear to target all the way down to an accelerator. I'm not enough of a compiler expert to judge this, but fine, I'll agree it works. A compiler doesn't require a new CPU design.

The multicore system targeted – and of course there's no guarantee this is what the funded project will ultimately produce – seems to be a conventional cache-coherent MP integrated with a Rapport Kilocore-style reconfigurable 2D array of few-bit (don't know how many, likely 2 or 4) ALUs and multipliers. Some inter-processor block data transfer and straightforward synchronization registers are there, too. Use of the accelerator can produce the usual kinds of accelerator speedups, like 24X over one core for MP3 encoding.

Except for their specific accelerator, this is fairly standard stuff for the embedded market. So far, I don't see anything justifying the huge cost of developing a new architecture and, more importantly, producing the never-ending stream of software support it requires: compilers, SDKs, development systems, simulators, etc.

One feature that does not appear standard is the power control. Apparently individual cores can have their frequency and voltage changed independently. For example, one core can run full tilt while another runs at half speed and a third at quarter speed. Embedded systems today, like IBM/Freescale PowerPC and ARM licensees, typically just provide on and off, with several variants of "off" using less power the longer it takes to turn on again.

All the scheduling, synchronization, and power control are under the control of the compiler. This is particularly useful when subtask times are knowable and you have a deadline that doesn't require flat-out performance. In those circumstances, the compiler can arrange execution to minimize power. For example, 60% less energy is needed to run a computational fluid dynamics benchmark (Applu) and 87% less for mpeg2encode. As a purely automatic result, this is pretty good. It didn't, in this case, use the accelerator.
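The deadline/power trade the compiler exploits can be seen with a back-of-envelope model (my sketch, not Kasahara's actual formulation): dynamic power scales roughly as V²f, and since voltage scales roughly with frequency, energy for a fixed amount of work goes roughly as f².

```python
# Back-of-envelope model of compiler-directed voltage/frequency scaling.
# Assumptions (illustrative only): dynamic power ~ C * V^2 * f, V scales
# roughly with f, so power ~ f^3; run time for fixed work ~ 1/f; hence
# energy ~ f^2 for a fixed amount of work.

def energy_ratio(slowdown):
    """Energy at (1/slowdown) of full frequency, relative to full speed."""
    f = 1.0 / slowdown        # normalized frequency
    return f ** 2             # energy ~ f^2 for fixed work

# A task needing 10 ms at full speed, with a 20 ms deadline, can run at
# half frequency, still make the deadline, and use a quarter of the energy:
print(energy_ratio(2.0))      # 0.25 -> roughly 75% energy saved
```

Real savings depend on leakage, memory-bound phases, and the actual V/f curve, but this is why a compiler that knows subtask times and deadlines can claim such large reductions.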

Enough for a new architecture? I wouldn't think so. I don't see why they wouldn't, for example, license ARM or PowerPC and thereby get a huge leg up on the support software. Something else is driving this, and I'm not sure what. The Intel reference is, of course, just silly; this chip would instead be competing with the very wide variety of embedded system chips. Of course, those have volumes 1000s of times larger than desktops and servers, so any perceived advantage has a huge multiplier.

Oh, and there’s no way this can be the basis of a new general-purpose desktop or server system. All the synch and power control under compiler control, which is key to the OSCAR compiler operation, has to be directly accessible in user mode for the compiler to play with. This is standard in embedded systems that run only one application, forever (like your car transmission), but necessarily anathema in a “general-purpose” system.

Sunday, August 30, 2009

A Larrabee in Every PC and Mac

There's a rumor that Intel is planning to integrate Larrabee, its forthcoming high-end graphics / HPC accelerator, into its processors in 2012. A number of things about this make a great deal of sense at a detailed level, but there's another level at which you have to ask "What will this mean?"

Some quick background, first: Larrabee is a well-publicized Intel product-of-the-future, where this particular future is late this year ('09) or early next year ('10). It's Intel's first foray into the realm of high-end computer graphics engines. Nvidia and ATI (now part of AMD) are the big leagues of that market, with CUDA and Stream products respectively. While Larrabee, CUDA and Stream differ greatly, all three use parallel processing to get massive performance. Larrabee may be in the 1,000 GFLOPS range, while today Nvidia is 518 GFLOPS and ATI is 2400 GFLOPS. In comparison, Intel's latest Core i7 processor reaches about 50 GFLOPS.

Integrating Larrabee into the processor (or at least its package) fits well with what's known of Intel's coarse roadmap, illustrated below (from a leaked roadmap slide, self-proclaimed "un scandale"):

"Ticks" are new lithography, meaning smaller chips; Intel "just" shrinks the prior design. "Tocks" keep the same lithography, but add new architecture or features. So integrating Larabee on the "Haswell" tock makes sense as the earliest point at which it could be done.

The march of the remainder of Moore's Law – more transistors, same clock rate – makes such integration possible, and, for cost purposes, inevitable.

The upcoming "Westmere" parts start this process, integrating Intel's traditional "integrated graphics" onto the same package with the processor; "integrated" here means low-end graphics integrated into the processor-supporting chipset that does IO and other functions. AMD will do the same. According to Jon Peddie Research, this will destroy the integrated graphics market. No surprise there: same function, one less chip to package on a motherboard, probably lower power, and… free. Sufficiently, anyway. Like Internet Explorer built into Windows for "free" (subsidized) destroying Netscape, this will just come with the processor at no extra charge.

We will ultimately see the same thing for high-end graphics. 2012 for Larrabee integration just puts a date on it. AMD will have to follow suit with ATI-related hardware. And now you know why Nvidia has been starting its own X86 design, a pursuit that otherwise would make little sense.

Obviously, this will destroy the add-in high-end graphics market. There might be some residual super-high-end graphics left for the super-ultimate gamer, folks who buy or build systems with multiple high-end cards now, but whether there will be enough volume for that to survive at all is arguable.

Note that once integration is a fact, "X inside" will have graphics architecture implications it never had before. You pick Intel, you get Larrabee; AMD, ATI. Will Apple go with Intel/Larrabee, AMD/ATI, or whatever Nvidia cooks up? Apple began OpenCL to abstract the hardware, but as an interface it is rather low-level and reflective of Nvidia's memory hierarchy. Apple will have to make the choice that PC users will make individually, but for their entire user base.

That's how, and "why" in a low-level technical hardware sense. It is perfectly logical that, come 2012, every new PC and Mac has what by then will probably have around 2,000 GFLOPS. This is serious computing power. On your lap.

What the heck are most customers going to do with this? Will there be a Windows GlassWax and Mac OS XII Yeti where the user interface is a full 3D virtual world, and instead of navigating a directory tree to find things, you do dungeon crawls? Unlikely, but I think more likely than verbal input, even really well done, since talking aloud isn't viable in too many situations. Video editing, yes. Image search, yes too, but that's already here for some, and there are only so many times I want to find all the photos of Aunt Bessie. 3D FaceSpace? Maybe, but if it were a big win, I think it would already exist in 2.5D. Same for simple translations of the web pages into 3D. Games? Sure, but that's targeting a comparatively narrow user base, with increasingly less relevance to gameplay. And it's a user base that may shrink substantially due to cloud gaming (see my post Twilight of the GPU?).

It strikes me that this following of one's nose on hardware technology is a prime example of what Robert Capps brought up in a recent Wired article (The Good Enough Revolution: When Cheap and Simple Is Just Fine) quoting Clay Shirky, an NYU new media studies professor, who was commenting on CDs and lossless compression compared to MP3:

"There comes a point at which improving upon the thing that was important in the past is a bad move," Shirky said in a recent interview. "It's actually feeding competitive advantage to outsiders by not recognizing the value of other qualities." In other words, companies that focus on traditional measures of quality—fidelity, resolution, features—can become myopic and fail to address other, now essential attributes like convenience and shareability. And that means someone else can come along and drink their milk shake.

It may be that Intel is making a bet that the superior programmability of Larrabee compared with strongly graphics-oriented architectures like CUDA and Stream will give it a tremendous market advantage once integration sets in: Get "Intel Inside" and you get all these wonderful applications that AMD (Nvidia?) doesn't have. That, however, presumes that there are such applications. As soon as I hear of one, I'll be the first to say they're right. In the meantime, see my admittedly sarcastic post just before this one.

My solution? I don't know of one yet. I just look at integrated Larrabee and immediately think peacock, or Irish Elk – 88 lbs. of antlers, 12 feet tip-to-tip.

Megaloceros Giganteus, the Irish Elk, as integrated Larrabee.
Based on an image that is Copyright Pavel Riha, used with implicit permission
(Wikipedia Commons, GNU Free Documentation License)

They're extinct. Will the traditional high-performance personal computer also go extinct, leaving us with a volume market occupied only by the successors of netbooks and smart phones?


The effect discussed by Shirky makes predicting the future based on current trends inherently likely to fail. That happens to apply to me at the moment. I have, with some misgivings, accepted an invitation to be on a panel at the Anniversary Celebration of the Coalition for Academic Scientific Computation.

The misgivings come from the panel topic: HPC - the next 20 years. I'm not a futurist. In fact, futurists usually give me hives. I'm collecting my ideas on this; right now I'm thinking of democratization (2TF on everybody's lap), really big data, everything bigger in the cloud, parallelism still untamed but widely used due to really big data. I'm not too happy with those, since they're mostly linear extrapolations of where we are now, and ultimately likely to be as silly as the flying car extrapolations of the 1950s. Any suggestions will be welcome, particularly suggestions that point away from linear extrapolations. They'll of course be attributed if used. I do intend to use a Tweet from Tim Bray (timbray) to illustrate the futility of linear extrapolation: “Here it is, 2009, and I'm typing SQL statements into my telephone. This is not quite the future I'd imagined.”

Parallelism Needs a Killer Application

Gee, do you think?

That's the startling conclusion of a panel in the latest Hot Chips Conference, according to Computerworld.

(Actually, it doesn't need a killer app – for servers. But servers don't have sufficient volume.)

The article also says Dave Patterson, UC Berkeley, was heard to say "There's no La-Z-Boy approach to programming." This from someone I regard as a super hardware dude. If he's finally got the software message, maybe hope is dawning.

The article has this quote from a panelist: "Threads have to synchronize correctly." Wow, do my car tires have to be inflated, too? This has to have been taken out of a more meaningful context.

Here are some links to posts in this blog that have been banging on this drum for nearly two years: here, here, here, here, here, here. Actually, it's pretty much the whole dang blog; see the history/archives.

Maybe there will be more meaningful coverage of that panel elsewhere.

Ha, my shortest post yet. It's a quickie done just before I finished another one. I just couldn't believe that article. I guess this is another "my head exploded" post.

Saturday, August 15, 2009

Today’s Graphics Hardware is Too Hard

Tim Sweeney recently gave a keynote at High Performance Graphics 2009 titled "The End of the GPU Roadmap" (slides). Tim is CEO and founder of Epic Games, producer of over 30 games including Gears of War, as well as the Unreal game engine used in 100s of games. There are lots of really interesting points in that 74-slide presentation, but my biggest keeper is slide 71:

[begin quote]

Lessons learned: Today's hardware is too hard!

  • If it costs X (time, money, pain) to develop an efficient single-threaded algorithm, then…
    • Multithreaded version costs 2X
    • PlayStation 3 Cell version costs 5X
    • Current "GPGPU" version costs: 10X or more
  • Over 2X is uneconomical for most software companies!
  • This is an argument against:
    • Hardware that requires difficult programming techniques
    • Non-unified memory architectures
    • Limited "GPGPU" programming models

[end quote]

Judging from the prior slides, by '"GPGPU"' Tim apparently means the DirectX 10 pipeline with programmable shaders.

I'm not sure what else to make of this beyond rehashing Tim's words, and I'd rather point you to his slides than start doing that. The overall tenor somewhat echoes comments I made in one of my first posts; it continues to be the most hit-on page of this blog, so I must have said something useful there.

I will note, though, that Tim's estimates of effort are based on very extensive experience – with game programming. For low-ish levels of parallelism, like 4 or 8, multithreading adds zero cost to typical commercial applications already running under a competent transaction monitor. It just works, since they're already at that level of software multithreading for other reasons (like achieving overlap with IO waits). Of course, that's not at all universally true for commercial applications, particularly for high levels of parallelism, no matter how much cloud evangelists talk about elasticity.

Once again, thanks to my friend who is expert at finding things like this slide set (it's not on the conference web site) and doesn't want his name mentioned.

Short post this time.

Monday, July 20, 2009

Why Accelerators Now?

Accelerators have always been the professional wrestlers of computing. They're ripped, trash-talking superheroes, whose special signature moves and bodybuilder physiques promise to reduce diamond-hard computing problems to soft blobs quivering in abject surrender. Wham! Nvidia "The Green Giant" CUDA body-slams a Black-Scholes equation financial model! Shriek! Intel "bong-da-Dum-da-Dum" Larrabee cobra clutches a fast Fourier transform to agonizing surrender!

And they're "green"! Many more FLOPS/OPS/whatever per watt and per square inch of floor space than your standard server! And I'm using way too many exclamation points!!

Logical Sidebar: What is an accelerator, anyway? My definition: An accelerator is a device optimized to enhance the performance or function of a computing system. An accelerator does not function on its own; it requires invocation from host programs. This is by intention and design optimization, not physics, since an accelerator may contain general purpose system parts (like a standard processor), be substantially software or firmware, and (recursively) contain other accelerators. The strategy is specialization; there is no such thing as a "general-purpose" accelerator. Claims to the contrary usually assume just one application area, usually HPC, but there are many kinds of accelerators – see the table appearing later. The big four "general purpose" GPUs – IBM Cell, Intel Larrabee, Nvidia CUDA, ATI/AMD Stream – are just the tip of the iceberg. The architecture of accelerators is a glorious zoo that is home to the most bizarre organizations imaginable, a veritable Cambrian explosion of computing evolution.

So, if they're so wonderful, why haven't accelerators already taken over the world?

Let me count the ways:

Nonstandard software that never quite works with your OS release; disappointing results when you find out you're going 200 times faster – on 5% of the whole problem; lethargic data transfer whose overhead squanders the performance; narrow applicability that might exactly hit your specific problem, or might not when you hit the details; difficult integration into system management and software development processes; and a continual need for painful upgrades to the next, greatest version with its different nonstandard software and new hardware features; etc.

When everything lines up just right, the results can be fantastic; check any accelerator company's web page for numerous examples. But getting there can be a mess. Anyone who was a gamer in the bad old days before Microsoft DirectX is personally familiar with this; every new game was a challenge to get working on your gear. Those perennial problems are also the reason for a split reaction in the finance industry to computational accelerators. The quants want them; if they can make them work (and they're always optimists), a few milliseconds advantage over a competitor can yield millions of dollars per day. But their CIOs' reaction is usually unprintable.

I think there are indicators that this worm may well be turning, though, allowing many more types of accelerators to become far more mainstream. Which implies another question: Why is this happening now?


First of all, vendors seem to be embracing actual industry software standards for programming accelerators. I'm referring here to the Khronos Group's OpenCL, which Nvidia, AMD/ATI, Intel, and IBM, among others, are supporting. This may well replace proprietary interfaces like Nvidia's CUDA API and AMD/ATI's CTM, and in doing so have an effect as good and simplifying as Microsoft's DirectX API series, which eliminated a plethora of problems for graphics accelerators (GPUs).

Another indicator is that connecting general systems to accelerators is becoming easier and faster, reducing both the CPU overhead and latency involved in transferring data and kicking off accelerator operations. This is happening on two fronts: IO, and system bus.

On the IO side, there's AMD developing and intermittently showcasing its Torrenza high-speed connection. In addition, the PCI-SIG will, presumably real soon now, publish PCIe version 3.0, which contains architectural features designed for lower-overhead accelerator attachment, like atomic operations and caching of IO data.

On the system bus side, both Intel and AMD have been licensing their internal inter-processor system busses as attachment points to selected companies. This is the lowest-overhead, fastest way to communicate that exists in any system; the latencies are in nanoseconds and the data rates in gigabytes/second. This indicates a real commitment to accelerators, because foreign attachment directly to one's system bus was heretofore unheard-of, for very good reason. The protocols used on system busses, particularly the aspects controlling cache coherence, are mind-numbingly complex. They're the kinds of things best developed and used by a team whose cubes/offices are within whispering range of each other. When they don't work, you get nasty intermittent errors that can corrupt data and crash the system. Letting a foreign company onto your system bus is like agreeing to the most intimate unprotected sex imaginable. Or doing a person-to-person mutual blood transfusion. Or swapping DNA. If the other guy is messed up, you are toast. Yet, it's happening. My mind is boggled.

Another indicator is the width of the market. The vast majority of the accelerator press has focused on GPGPUs, but there are actually a huge number of accelerators out there, spanning an oceanic range of application areas. Cryptography? Got it. Java execution? Yep. XML processing? – not just parsing, but schema validation, XSLT transformations, XPaths, etc. – Oh, yes, that too. Here's a table of some of the companies involved in some of the areas. It is nowhere near comprehensive, but it will give you a flavor (click on it to enlarge) (I hope):

Beyond companies making accelerators, there are a collection of companies who are accelerator arms dealers – they live by making technology that's particularly good for creating accelerators, like semi-custom single-chip systems with your own specified processing blocks and/or instructions. Some names: Cavium, Freescale Semiconductor, Infineon, LSI Logic, Raza Microelectronics, STMicroelectronics, Teja, Tensilica, Britestream. That's not to leave out FPGA vendors who make custom hardware simple by providing chips that are seas of gates and functions you can electrically wire up as you like.

Why Now?

That's all fine and good, but. Tech centers around the world are littered with the debris of failed accelerator companies. There have always been accelerator companies in a variety of areas, particularly floating point computing and the offloading of communications protocols (chiefly TCP/IP); efforts date back to the early 1970s. Is there some fundamental reason why the present surge won't crash and burn like it always has?

A list can certainly be made of how circumstances have changed for accelerator development. There didn't used to be silicon foundries, for example. Or Linux. Or increasingly capable building blocks like FPGAs. I think there's a more fundamental reason.

Until recently, everybody has had to run a Red Queen's race with general purpose hardware. There's no point in obtaining an accelerator if by the time you convince your IT organization to allow it, order it, receive it, get it installed, and modify your software to use it, you could have gone faster by just sitting there on your butt, doing nothing, and getting a later generation general-purpose system. When the general-purpose system has gotten twice as fast, for example, the effective value of your accelerator has halved.

How bad a problem is this? Here's a simple graph that illustrates it:

What the graph shows is this: Suppose you buy an accelerator that does something 10 times faster than the fastest general-purpose "commodity" system does, today. Now, assume GP systems increase in speed as they have over the last couple of decades, a 45% CAGR. After only two years, you're only 5 times faster. The value of your investment in that accelerator has been halved. After four years, it's nearly divided by 5. After five years, it's worthless; it's actually slower than a general purpose system.

This is devastating economics for any company trying to make a living by selling accelerators. It means they have to turn over designs continually to keep their advantage, and furthermore, they have to time their development very carefully – a schedule slip means they have effectively lost performance. They're in a race with the likes of Intel, AMD, IBM, and whoever else is out there making systems out of their own technology, and they have nowhere near the resources being applied to general purpose systems (even if they are part of Intel, AMD, and IBM).

Now look at what happens when the rate of increase slows down:

Look at that graph, keeping in mind that the best guess for single-thread performance increases over time is now in the range of 10%-15% CAGR at best. Now your hardware design can actually provide value for five years. You have some slack in your development schedule.
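The graphs themselves aren't reproduced here, but the arithmetic behind them is simple: the time until a general-purpose system catches an accelerator is log(advantage) / log(1 + CAGR). A sketch, with both a 45% CAGR and the classic "doubling every 18 months" (~59% CAGR) as the old-days rate:

```python
# How long a fixed-function accelerator's edge lasts: a 10x advantage
# today, eroded by general-purpose systems improving at a compound
# annual growth rate (CAGR). Illustrative arithmetic only.
import math

def years_to_parity(advantage, cagr):
    """Years until (1 + cagr)^t catches up with the initial advantage."""
    return math.log(advantage) / math.log(1.0 + cagr)

print(round(years_to_parity(10, 0.45), 1))           # 6.2 years at 45% CAGR
print(round(years_to_parity(10, 2 ** (2/3) - 1), 1)) # 5.0 years if doubling every 18 months
print(round(years_to_parity(10, 0.12), 1))           # 20.3 years at 12% CAGR
```

Going from roughly five or six years of shelf life to twenty is the whole story: the accelerator's development cost can now be amortized over several product generations instead of one frantic one.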

It means that the slowing of single-thread performance gains under Moore's Law makes accelerators economically viable to a degree they never have been before.

Cambrian explosion? I think it's going to be a Cambrian Fourth-of-July, except that the traditional finale won't end soon.

Objections and Rejoinders

I've heard a couple of objections raised to this line of thinking, so I may as well bring them up and try to shoot them down right away.

Objection 1: Total performance gains aren't slowing down, just single-thread gains. Parallel performance continues to rise. To use accelerators you have to parallelize anyway, so you just apply that to the general purpose systems and the accelerator advantage goes away again.

Response: This comes from the mindset that accelerator = GPGPU. GPGPUs all get their performance from explicit parallelism, and, as the "GP" part says, that parallelism is becoming more and more general purpose. But the world of accelerators isn't limited to GPGPUs; some use architectures that simply (hah! it's often not simple) embed algorithms directly in silicon. The guts of a crypto accelerator aren't anything like a general-purpose processor, for example. Conventional parallelism on conventional general processors will lose out to it. And in any event, this is comparing past apples to present oranges: Previously you did not have to do anything at all to reap the performance benefit of faster systems. This objection assumes that you do have to do something – parallelize code – and that something is far from trivial. Avoiding it may be a major benefit of accelerators.

Objection 2: Accelerator, schaccelerator, if a function is actually useful it will get embedded into the instruction set of general purpose systems, so the accelerator goes away. SIMD operations are an example of this.

Response: This will happen, and has happened, for some functions. But how did anybody get the experience to know what instruction set extensions were the most useful ones? Decades of outboard floating point processing preceded SIMD instructions. AMD says it will "fuse" graphics functions with processors – and how many years of GPU development and experience will let it pick the right functions to do that with? For other functions, well, I don't think many CPU designers will be all that happy absorbing the strange things done in, say, XML acceleration hardware.

Friday, June 19, 2009

Why Virtualize? A Primer for HPC Guys

Some folks who primarily know High-Performance Computing (HPC) think virtualization is a rather silly thing to do. Why add another layer of gorp above a perfectly good CPU? Typical question, which I happened to get in email:

I'm perplexed by the stuff going on the app layer -

first came the chip + programs

then came - chip+OS+ applications

then came - Chip+Hypervisor+OS+applications

So for a single unit of compute, the capability keeps decreasing while extra layers are added over again and again.. How does this help?

I mean Virtualization came for consolidation and this reducing the prices of the H/W it's being associated with something else?

In answering that question, I have two comments:

First Comment:

Though this really doesn't matter to the root question, you're missing a large bunch of layers. After your second should be


Where middleware expands to many things: Messaging, databases, transaction managers, Java Virtual Machines, .NET framework, etc. How you order the many layers within middleware, well, that can be argued forever; but in any given instance they obviously do have an order (often with bypasses).

So there are many more layers than you were considering.

How does this help in general? The usual way infrastructure software helps: It lets many people avoid writing everything they need from scratch. But that's not what you're really asking, which was why you would need virtualization in the first place. See below.

Second Comment:

What hypervisors -- really, virtual machines; hypervisors are one implementation of that notion – do is more than consolidation. Consolidation is, to be sure, the killer app of virtualization; it's what put virtualization on the map.

But hypervisors, in particular, do something else: They turn a whole system configuration into a bag of bits, a software abstraction decoupled from the hardware on which they are running. A whole system, ready to run, becomes a file. You can store it, copy it, send it somewhere else, publish it, and so on.

For example, you can:

  • Stop it for a while (like hibernate – a snapshot (no, not disk contents)).
  • Restart on the same machine, for example after hardware maintenance.
  • Restart on a different machine (e.g., VMware's VMotion; others have it under different names)
  • Copy it – deploy additional instances. This is a core technology of cloud computing that enables "elasticity." (That, and apps structured so this can work.)
  • By adding an additional layer, run it on a system with a different architecture from the original.
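To make "a whole system becomes a file" concrete, here's a toy model in Python. A real hypervisor snapshots CPU registers, RAM pages, and device state (this fake `machine` dict is purely illustrative), but the principle is just serialization:

```python
# Toy illustration of virtualization's "bag of bits" property: a running
# machine's state, serialized to an ordinary file, can be stored, copied,
# shipped elsewhere, and resumed. Not a real hypervisor - just the idea.
import json
import os
import tempfile

machine = {"cpu": {"pc": 4096, "regs": [0, 7, 42]},
           "ram": "gigabytes of pages, in reality",
           "devices": {"nic": "up", "disk": "vm01.img"}}

# "Hibernate": the whole system becomes a file...
path = os.path.join(tempfile.mkdtemp(), "vm01.snapshot")
with open(path, "w") as f:
    json.dump(machine, f)

# ...which can be reloaded later, on the same or a different host:
with open(path) as f:
    restored = json.load(f)
print(restored == machine)    # True: same state, different "hardware"
```

Everything in the bullet list above – stop/restart, migration, cloning for elasticity – falls out of the state being an ordinary file rather than something welded to particular hardware.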

Most of these things have their primary value in commercial computing. The classic HPC app is a batch job: Start it, run it, it's done. Commercial computing's focus nowadays tends to be: Start it, run it, run it, run it, keep it running even though nodes go down, keep it running through power outages, earthquakes, terrorist strikes, … Think web sites, or, before them, transaction systems to which bank ATMs connect. Not that commercial batch doesn't still exist, nor is it less important; payrolls are still met, although there's less physical check printing now. But continuously operating application systems have been a focus of commercial computing for quite a while now.

Of course, some HPC apps are more continuous, too. I'm thinking of analyzing continuous streams of data, e.g., environmental or astronomical data that is continuously collected. Things like SETI or Folding at home could use it, too, running beside interactive use in a separate machine, continually, unhampered by your kids following dubious links and getting virii/trojans. But those are in the minority, so far. Grossly enormous-scale HPC with 10s or 100s of thousands of nodes will soon have to think in these terms as the run time of jobs exceeds the mean time between failures (MTBF) across all their nodes, but there it's an annoyance, a problem, not an intentional positive aspect of the application like it is for commercial.
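The arithmetic forcing that is standard reliability math (the numbers below are illustrative, not from any specific machine): with N independent nodes each having MTBF m, something in the system fails about every m/N.

```python
# Why huge node counts force fault tolerance: aggregate MTBF shrinks
# linearly with node count (assuming independent failures).

def system_mtbf_hours(node_mtbf_years, n_nodes):
    """Hours between failures somewhere in an n_nodes-node system."""
    return node_mtbf_years * 365 * 24 / n_nodes

# 100,000 nodes, each individually quite reliable (5-year MTBF):
print(round(system_mtbf_hours(5, 100_000), 2))   # 0.44 hours between failures
```

So a job that takes even a few hours on such a system will almost certainly see a node die mid-run, and checkpointing or migration stops being optional.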

Grid application deployment could do it à la clouds, but except for hibernate-style checkpoint/restart, I don't see that any time soon. They effectively have a kind of virtualization already, at a higher level (like Microsoft does its cloud virtualization in Azure above and assuming .NET).

The traditional performance cost of virtualization is anathema to HPC, too. But that's trending to near zero. With appropriate hardware support (Intel VT, AMD-V, similar things from others) that's gone away for processor & memory. It's still there for IO, but can be fixed; IBM zSeries native IO has had essentially no overhead for years. The rest of the world will have to wait for PCIe to finalize its virtualization story and for IO vendors to implement it in their devices; that will come about in a patchy manner, I'd predict, with high-end communication devices (like InfiniBand adapters) leading the way.

So that's what virtualization gets you: isolation (for consolidation) and abstraction from hardware into a manipulable software object (for lots else). Also, security, which I didn't get into here.

Not much for most HPC today, as I read it, but a lot of value for commercial.

Thursday, May 14, 2009

Multiple Clouds? !? (Cloud Computing Rap)

"Let me ask one simple question. Is there one cloud, or are there multiple clouds? This is a debate that I have heard a number of times already. One side of this argument is public versus private clouds. Can a private implementation of cloud technologies be called a cloud at all, …" Jordi Gonzalez Segura, Director Applications & Integration at Avanade, LinkedIn discussion.

First comment: "There is an open discussion about this…." Alfonso Olias Sanz, M.Sc., Senior Consultant


©2009 Gregory F. Pfister


<Lightning, thunder, dark storm clouds on top of Mount Volume.>

I AM the Lord Thy INTERNET who hath brought thee out of land of SNA and DECNET, out of the house of proprietary bondage. Thou SHALT NOT have other clouds before ME.

And whosoever shalt obtain services of any species through the crossing of MY domains SHALL be deemed to be cloud computing, across the ONE CLOUD, which I AM. Thou SHALT NOT virtualize anything else in the earth beneath, or that is in the water under the earth; thou shalt not serve platforms or infrastructure or any other thing except across the ONE CLOUD which I AM, for I the Lord thy Cloud am a jealous Cloud, punishing the iniquities of Release 1 upon the children to the third and fourth generations of the software that shall hate ME…


What, who dareth break…

Yo, homies, gotta know it's the Prince a' Darkness now, don't you go freak at that old mutha's cow. He is the road, there ain't no denyin', but the other side is what I'm spyin'. Clouds a' servers, that's what I catch, so lay down your load and it's gonna match.

They go in and out
Clouds they elastic bound!
They go in and out
Got one from Am a zon.
They go in and out
You got Googles, too
They go in and out
You got a whole damn zoo.

Yo it's on the other side, that's where it's groovin', in a cloud where you deploy it out so it's movin'. They're how you build your stuff, not just how you get it, unless you're talking SaaS and then you can just fugget it, cuz SaaS, that's just a billing plan it ain't no tech, I ain't no CPA, you just get off my neck.

They go in and out
Get big just like you like it
They go in and out.
Get small you don't invite it
They go in and out
Babe they got what you need
They go in and out
It's just like smokin' good weed.

Got your privates, too, you know there's no denyin', public cloudies are cool, but you know I'm not lying when I say you watch out, might get security diseases, need protection there, and that messes up your action, gotta stay wit' cha homies to get real satisfaction.

Talkin' cloud computing, yall.
They go in and out
Elastic com pu ta shun
They go in and out
You don't get it you just get on your short bus and ride out of here, yeah.


OK, when I saw that question, asked by someone in a significant position in a company engaged in cloud computing, being seriously asked, and seriously replied to, as something actually being discussed … my brain exploded. That's the initial SPLAT.

It feels better now.

I've been putting off a ridiculously long discussion of what clouds are – not yet another definition, &deity help us all – but this has convinced me I shouldn't put it off longer. I am in the middle of a contract to sell my house and move, so this is the worst possible time, but I'm going to try.

At least some of what's above is probably not real comprehensible without that discussion. Sorry about that.

For the record, the initial biblical language was customized from this source.

Oh, yeah, and for the record, again, ©2009 Gregory F. Pfister, dammit.

Thursday, April 2, 2009

Twilight of the GPU?

If a new offering really works, it may substantially shrink the market for gaming consoles and high-end gamer PCs. Their demo runs Crysis – a game known to melt down extreme PCs – on a Dell Studio 15 with Intel's minimalist integrated graphics. Or on a Mac. Or a TV, with their little box for the controls. The market for Nvidia CUDA, Intel Larrabee, IBM Cell, AMD Fusion, are all impacted. And so much for cheap MFLOPS for HPC, riding on huge gamer GPU volumes.

This game-changer – pun absolutely intended – is Onlive. It came out of stealth mode a few days ago with an announcement and a demo presentation (more here and here) at the 2009 Game Developers' Conference. It's in private beta now, and scheduled for public beta in the summer followed by full availability in winter 2009. Here's a shot from the demo showing some Crysis player-versus-player action on that Dell:

What Onlive does seems kind of obvious: Run the game on a farm/cloud that hosts all the hairy graphics hardware, and stream compressed images back to the players' systems. Then the clients don't have to do any lifting heavier than displaying 720p streaming video, at 60 frames per second, in a browser plugin. If you're thinking "Cloud computing meets gaming," well, so are a lot of other people. It's true, for some definitions of cloud computing. (Of course, there are so many definitions that almost anything is true for some definition of cloud computing.) Also, it's the brainchild of Steve Perlman, creator of Web TV, now known as MSN TV.

Now, I'm sure some of you are saying "Been there, done that, doesn't work" because I left out something crucial: The client also must send user inputs back to the server, and the server must respond. This cannot be just a typical render cloud, like those used to produce production cinematic animation.

That's where a huge potential hitch lies: Lag. How quickly you get round-trip response to inputs. Onlive must target First Person Shooters (FPS), a.k.a. twitch games; they're collectively the largest sellers. How well a player does in such games depends on the player's reaction time – well, among other things, but reaction time is key. If there's perceived lag between your twitch and the movement of your weapon, dirt bike, or whatever, Onlive will be shunned like the plague because it will have missed the point: Lag destroys your immersion in the game. Keeping lag imperceptible, while displaying a realistic image, is the real reason for high-end graphics.

Of course Onlive claims to have solved that problem. But, while crucial, lag isn't the only issue; here's a list which I'll elaborate on below: Lag, angry ISPs, and game selection & pricing.


Lag

The golden number is 150 msec. If the displayed response to user inputs is even a few msec. longer than that, games feel rubbery. That's 150 msec. to get the input, package it for transport, ship it out through the Internet, get it back in the server, do the game simulation to figure out what objects to change, change them (often, a lot: BOOM!), update the resulting image, compress it, package that, send it back to the client, decompress it, and get it on the screen.
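To make that budget concrete, here's a back-of-envelope sketch of the round trip. Every per-stage number below is a made-up illustration, not Onlive data; only the 150 msec total comes from the discussion above.

```python
# Hypothetical input-to-photon latency budget for cloud gaming.
# All per-stage figures are illustrative guesses, not measurements.
BUDGET_MS = 150  # the "golden number" beyond which games feel rubbery

stages_ms = {
    "capture input + packetize":        2,
    "uplink (client -> server)":       25,
    "game simulation (one 60 Hz tick)": 17,
    "render frame on server GPU":      17,
    "compress frame":                  10,
    "downlink (server -> client)":     25,
    "decompress + display":            20,
}

total = sum(stages_ms.values())
print(f"round trip: {total} ms, budget: {BUDGET_MS} ms, "
      f"headroom: {BUDGET_MS - total} ms")
```

With these (optimistic) guesses the trip fits with about 34 msec to spare; note how little of the budget is left for Internet transit once simulation, rendering, and compression take their cut.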

There is robust skepticism that this is possible.

The Onlive folks are of course exquisitely aware that this is a key problem, and spent significant time talking vaguely about how they solved it. Naturally such talk involves mention of roughly a gazillion patents in areas like compression.

They also said their servers were "like nothing else." I don't doubt that in the slightest. Not very many servers have high-end GPUs attached, nor do they have the client chipsets that provide 16-lane PCIe attachment for those GPUs. Interestingly, there's a workstation chipset for Intel's just-announced Xeon 5500 (Nehalem) with four 16-lane PCIe ports shown.

I wouldn't be surprised to find some TCP/IP protocol acceleration involved, too, and who knows – maybe some FPGAs on memory busses to do the compression gruntwork?

Those must be pretty toasty servers. Check out the fan on this typical high-end GPU (Diamond Multimedia using ATI 4870):

The comparable Nvidia GeForce GTX 295 is rated as consuming 289 Watts. (I used the Diamond/ATI picture because it's sexier-looking than the Nvidia card.) Since games like Crysis can soak up two or more of these cards – there are proprietary inter-card interconnects specifically for that purpose – "toasty" may be a gross understatement. Incendiary, more likely.

In addition, the Onlive presenters made a big deal out of users having to be within a 1000-mile radius of the servers, since beyond that the speed of light starts messing things up. So if you're not that close to their initial three server farms in California, Texas, Virginia, and possibly elsewhere, you're out of luck. I think at least part of the time spent on this was just geek coolness: Wow, we get to talk about the speed of light for real!

Well, can they make it work? Adrian Covert reported on Gizmodo about his hands-on experience trying the beta, playing Bioshock at Onlive's booth at the GDC (the server farm was about 50 miles away). He saw lag "just enough to not feel natural, but hardly enough to really detract from gameplay." So you won't mind unless you're a "competitive" gamer. There were, though, a number of compression artifacts, particularly when water and fire effects dominated the screen. Indoor scenes were good, and of course whatever they did with the demo beach scene in Crysis worked wonderfully.

So it sounds like this can be made to work, if you have a few extra HVAC units around to handle the heat. It's not perfect, but it sounds adequate.

Angry ISPs

But will you be allowed to run it? The bandwidth requirements are pretty fierce.

HD 720p bandwidth, with normal ATSC MPEG-2 compression, rarely goes over 3 Mb/sec. Given that, I'm inclined to take at face value Onlive's claim to require only 1.5 Mb/s for games. But 1.5Mb/s translates into 675 Megabytes per hour. Ouch. A good RPG can clock 50-100 hours before replay, and when a player is immersed that can happen in a nearly continuous shot, interrupted only by biology. That's passing-a-kidney-stone level of bandwidth ouch. 
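The arithmetic behind that 675 MB figure is worth spelling out; a quick sketch, using the 50-hour low end of the RPG range mentioned above:

```python
mbps = 1.5                         # Onlive's claimed game stream rate, megabits/sec
mb_per_hour = mbps / 8 * 3600      # megabits -> megabytes, then per hour
print(mb_per_hour)                 # 675.0 megabytes per hour

hours = 50                         # low end for finishing a long RPG
gb_total = mb_per_hour * hours / 1000
print(gb_total)                    # 33.75 -- roughly 34 GB for one playthrough
```

Even at the low end, one game is a month's worth of many 2009-era ISP bandwidth caps.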

NOTE: See update below. It's likely even worse than that.

Some ISPs already have throttling and bandwidth limiting in place today. They won't let Onlive take over their network. Tiered pricing may become the order of the day, or, possibly, the ISP may do the gaming offering itself and bill for the bandwidth invisibly, inside the subscription (this is done today for some on-demand TV). Otherwise, they're not likely to subsidize Onlive. Just limit bandwidth a little, or add just a bit of latency…

Net Neutrality, anyone?

Game Selection and Pricing

This is an area that Onlive seems to have nailed. Atari, Codemasters, Eidos, Electronic Arts, Epic, Take Two, THQ, Ubisoft, and Warner Bros. are all confirmed publishers. That's despite this being a new platform for them to support, something done only reluctantly. How many PC games are on the Mac?

I can see them investing that much in it, though, for two reasons: It reduces piracy, and it offers a wide range of pricing and selling options they don't have now.

The piracy issue is pretty obvious. It's kind of hard to crack a game and share it around when you're playing it on a TV.

As for pricing and selling, well, I can't imagine how much drooling is going on by game publishers over those. There are lots of additional ways to make money, as well as a big reduction of barriers to sale. Obviously, there are rentals. Then rent-to-own options. Free metered demos. Spectating: You can watch other people playing – the show-offs, anyway – to get a feel for a game before jumping in yourself. All this can dramatically reduce the hump everybody has to get over before shelling out $40-$60 for a new high-end title, and could broaden the base of potential customers for high-end games substantially. It could also, of course, give a real shot in the arm to more casual games.

Of course, this assumes nobody gets greedy here and strangles themselves with prices that aren't what people are willing to pay. I expect some experimentation there before things settle down.

A final note before leaving this general topic: Massively multiplayer game and virtual world publishers (e.g., Blizzard (World of Warcraft), Linden Labs (Second Life)) were conspicuously absent from the list. This may not be a temporary situation. You could surely run, for example, the WoW client on Onlive – but the client is useless unless it is talking to a WoW server farm somewhere.

Impact on the GPU Market

According to HPCWire's Editor's blog, "GPU computing is the most compelling technology to come on the scene in recent memory" because GPUs have become general enough to do a wide variety of computationally intensive tasks, the software support has become adequate, and, oh, yes, they're dirt cheap MFLOPS. Relatively speaking. Because they're produced in much larger volumes than the HPC market would induce. Will Onlive change this situation? It's hard to imagine that it won't have at least some impact.

Obviously, people with Dell Studio 15s aren't buying GPUs. They never would have, anyway – they're in the 50% of the GPU market Intel repeatedly says it has thanks to integrated graphics built into the chipset. The other end of the market, power gamers, will still buy GPUs; they won't be satisfied with what Onlive provides. That's another 10% spoken for. What of the remaining 40%? If Onlive works, I think it will ultimately be significantly impacted.

Many of those middle-range people may simply not bother getting a higher-function GPU. Systems with a decent-performing GPU, and the memory needed to run significant games, cost substantially more than those without; I know this directly, since I've been pricing them recently. A laptop with >2GB of memory, a >200GB disk, and a good Nvidia or ATI GPU with 256MB or more of graphics memory pushes or surpasses $2000. This compares with substantially under $1000 for an otherwise perfectly adequate laptop if you don't want decent game performance. Really, it compares with a flat $0, since the lack of CPU performance increase would otherwise not even warrant a replacement.

GPUs, however, won't completely disappear. Onlive itself needs them to run the games. Will that replace the lost volume? Somewhat, but it's certain Onlive's central facilities will have higher utilization than individuals' personal systems. That utilization won't be as high as it could be with a worldwide sharing facility, since the 1000-mile radius prohibits the natural exploitation of time zones.

They did mention using virtualization to host multiple lower-intensity games on each server; that could be used to dramatically increase utilization. However, if they've managed to virtualize GPUs, my hat's off to them. It may be possible to share them among games, but not easily; no GPU I know of was designed to be virtualized, so changes to DirectX will undoubtedly be needed. If this kind of use becomes common, virtualization may be supported, but it probably won't be soon. (Larrabee may have an advantage here; at least its virtualization architecture, Intel VT, is already well-defined.)

There are other uses of GPUs. Microsoft Windows Vista's Aero interface, for example, and the "3D Web" (like this effort). These, however, generally expect to ride on the multi-billion-dollar game market's driving of GPU volumes. They're not drivers (although Microsoft thought Aero was), they're followers.

If Onlive succeeds – and as noted above, it may be possible but has some speed bumps ahead – the market for games may actually increase while the GPU market sinks dramatically.


Acknowledgement: The ISP issue was originally raised by a colleague in email discussion. He's previously asked to remain incognito, so I'll respect that.


Well, I misheard some numbers. Onlive needs 1.5 Mbps for standard-TV resolution. They say it requires 5 Mbps for 720p HDTV images. That's just achievable with current CATV facilities, and works out to 2.25 GB downloaded per hour. That really puts it beyond the pale unless the service provider itself is engaged in providing the service. (In my defense, when I tried to find out the real bandwidth for 720p video, I found contradictions. This What Exactly is HDTV? site, which I used above, seems pretty authoritative, and has the 3 Mbps number. But this site is incredibly detailed, and claims everybody needs major compression to fit into the congressionally allocated 18 Mbps television channel bandwidth; uncompressed it's about 885 Mbps.) So I can't claim to understand where this all ends up in detail, but it really doesn't matter: Clearly it is a very bad problem.

Thanks again to my anonymous colleague, who pointed this out.

By the way, welcome to all the Mac folks directed here from The Tao of Mac. I appreciate the reference, and I fully understand why this may be of interest. If I could run Fallout 3 (and Oblivion) on a Mac without going through the hassle of Boot Camp, I’d have one now. I’m just totally disgusted with Vista.

Sunday, February 15, 2009

What Multicore Really Means (and a Larrabee/Cell Example)

So, now everybody's staring in rapt attention as Intel provides a peek at its upcoming eight-core chip. When they're not speculating about Larrabee replacing Cell on PlayStation 4, that is.


I often wish the guts of computers weren't so totally divorced from everyday human experience.

Just imagine if computers could be seen, heard, or felt as easily as, for example, cars. That would make what has gone on over the last few years instantly obvious; we'd actually understand it. It would be as if a guy from the computer (car) industry and a consumer had this conversation:

"Behold! The car!" says car (computer) industry guy. "It can travel at 15 miles per hour!"

"Oh, wow," says consumer guy, "that thing is fantastic. I can move stuff around a lot faster than I could before, and I don't have to scoop horse poop. I want one!"

Time passes.

"Behold!" says the industry guy again, "the 30 mile-an-hour car!"

"Great!" says consumer guy. "I can really use that. At 15 mph, it takes all day to get down to town. This will really simplify my life enormously. Gimme, gimme!"

Time passes once more.

"Behold!" says you-know-who, "the 60 mph car!"

"Oh, I need one of those. Now we can visit Aunt Sadie over in the other county, and not have to stay overnight with her 42 cats. Useful! I'll buy it!"

Some more time.

"Behold!" he says, "Two 62-mph cars!"

"Say what?"

"It's a dual car! It does more!"

"What is that supposed to mean? Look, where's my 120 mph car?"

"This is better! It's 124 mph. 62 plus 62."

"Bu… Wha… Are you nuts? Or did you just arrive from Planet Meepzorp? That's crazy. You can't add up speeds like that."

"Sure you can. One can deliver 62 boxes of muffins per hour, so the two together can deliver 124. Simple."

"Muffins? You changed what mph means, from speed to some kind of bulk transport? Did we just drop down the rabbit hole? Since when does bulk transport have anything to do with speed?"

"Well, of course the performance doubling doesn't apply to every possible workload or use. Nothing ever really did, did it? And this does cover a huge range. For example, how about mangos? It can do 124 mph on those, too. Or manure. It applies to a huge number of cases."

"Look, even if I were delivering mangos, or muffins, or manure, or even mollusks …"

"Good example! We can do those, too."

"Yeah, sure. Anyway, even if I were doing that, and I'm not saying I am, mind you, I'd have to hire another driver, make sure both didn't try to load and unload at the same time, pay for more oil changes, and probably do ten other things I didn't have to do before. If I don't get every one of them exactly right, I'll get less than your alleged 124 whatevers. And I have to do all that instead of just stepping on the gas. This is an enormous pain."

"We have your back on those issues. We're giving Jeb here – say hello, Jeb –"

"Quite pleased to meet you, I'm sure. Be sure to do me the honor of visiting my Universal Mango Loading Lab sometime."

"…a few bucks to get that all worked out for you."

"Hey, I'm sure Jeb is a fine fellow, but right down the road over there, Advanced Research has been working on massively multiple loading for about forty years. What can Jeb add to that?"

"Oh, that was for loading special High-Protein Comestibles, not everyday mangos and muffins. HPC is a niche market. This is going to be used by everybody!"

"That is supposed to make it easier? Come on, give me my real 120 mile per hour car. That's a mile, not a munchkin, a monkey, a mattock, or anything else, just a regular, old, mile. That's what I want. In fact, that's what I really need."

"Sorry, the tires melt. That's just the way it is; there is no choice. But we'll have a Quad Car soon, and then eight, sixteen, thirty-two! We've even got a 128-car in our labs!"

"Oh, good grief. What on God's Green Earth am I going to do with a fleet of 128 cars?"

Yeah, yeah, I know, a bunch of separate computers (cars) isn't the same as a multi-processor. They're different kinds of things, like a pack of dogs is different from a single multi-headed dog. See illustrations here. The programming is very different. But parallel is still parallel, and anyway Microsoft and others will just virtualize each N-processor chip into N separate machines in servers. I'd bet the high-number multi-cores ultimately morph into a cluster-on-a-chip as time goes on, anyway, passing through NUMA-on-a-chip on the way.

But it's still true that:

  • Computers no longer go faster. We just get more of them. Yes, clock speeds still rise, but it's like watching grass grow compared to past rates of increase. Lots of software engineers really haven't yet digested this; they still expect hardware to bail them out like it used to.
  • The performance metrics got changed out from under us. SPECrate is muffins per hour.
  • Various hardware vendors are funding labs at Berkeley, UIUC, and Stanford to work on using them better, of course. Best of luck with your labs, guys, and I hope you manage to do a lot better than was achieved by 40 years of DARPA/NSF funding. Oh, but that was a niche.

My point in all of this is not to protest the rising of the tide. It's coming in. Our feet are already wet. "There is no choice" is a phrase I've heard a lot, and it's undeniably true. The tires do melt. (I sometimes wonder "Choice to do what?" but that's another issue.)

Rather, my point is this: We have to internalize the fact that the world has changed – not just casually admit it on a theoretical level, but really feel it, in our gut.

That internalization hasn't happened yet.

We should have reacted to multi-core systems like consumer guy seeing the dual car and hearing the crazy muffin discussion, instantly recoiling in horror, recognizing the marketing rationalizations as somewhere between lame and insane. Instead, we hide the change from ourselves, for example letting companies call a multi-core system "a processor" (singular) because it's packaged on one chip, when they should be laughed at so hard even their public relations people are too embarrassed to say it.

Also, we continue to casually talk in terms that suggest a two-processor system has the power of one processor running twice as fast – when they really can't be equated, except at a level of abstraction so high that miles are equated to muffins.

We need to understand that we've gone down a rabbit hole. So many standard assumptions no longer hold that we can't even enumerate them.

To ground this discussion in real GHz and performance, here's an example of what I mean by breaking standard assumptions.

In a discussion on Real World Technologies' Forums about the recent "Intel Larrabee in Sony PS4" rumors, it was suggested that Sony could, for backward compatibility, just emulate the PS3's Cell processor on Larrabee. After all, Larrabee is several processor generations after Cell, and it has much higher performance. As I mentioned elsewhere, the Cell cranks out "only" 204 GFLOPS (peak), and public information about Larrabee puts it somewhere in the range of at least 640 GFLOPS (peak), if not 1280 GFLOPS (peak) (depends on what assumptions you make, so call it an even 1TFLOP).

With that kind of performance difference, making a Larrabee act like a Cell should be a piece of cake, right? All those old games will run just as fast as before. The emulation technology (just-in-time compiling) is there, and the inefficiency introduced (not much) will be covered up by the faster processor. No problem. Standard thing to do. Anybody competent should think of it.

Not so fast. That's pre-rabbit-hole thinking. Those are all flocks of muffins flying past, not simple speed. Down in the warrens where we are now, it's possible for Larrabee to be both faster and slower than Cell.

In simple speed, the newest Cell's clock rate is actually noticeably faster than expected for Larrabee. Cell has shipped for years at 3.2 GHz; the more recent PowerXCell version uses newer fabrication technology to lower power (heat), not to increase speed. Public Larrabee estimates say that when it ships (late 2009 or 2010) it will be somewhere around 2 GHz, so in that sense Cell is about 1.6X faster than Larrabee (both are in-order, both count FLOPS double by having a multiply-add).

Larrabee is "faster" only because it contains much more stuff – many more transistors – to do more things at once than Cell does. This is true at two different levels. First, it has more processors: Cell has 8, while Larrabee has at least 16 and may go up to 48. Second, while both Cell and Larrabee gain speed by lining up several numbers and operating on all of them at the same time (SIMD), Larrabee lines up more numbers at once than Cell: The GFLOPS numbers above assume Larrabee does 16 operations at once (512-bit vector registers), but Cell does only four operations at once (128-bit vector registers). To get maximum performance on both of them, you have to line up that many numbers at once. Unless you do, performance goes down proportionally.
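Those peak numbers are just products of clock rate, core count, SIMD width, and the multiply-add double-count. A quick sketch, assuming 16 Larrabee cores at 2 GHz (one point in the publicly estimated range):

```python
def peak_gflops(ghz, cores, simd_lanes, flops_per_op=2):
    """Peak single-precision GFLOPS. flops_per_op=2 counts a fused
    multiply-add as two FLOPs, as both chips' peak numbers do."""
    return ghz * cores * simd_lanes * flops_per_op

cell = peak_gflops(3.2, 8, 4)        # 8 SPEs, 128-bit vectors = 4 SP lanes
larrabee = peak_gflops(2.0, 16, 16)  # assumed: 16 cores, 512-bit = 16 SP lanes
print(cell, larrabee)                # 204.8 1024.0
```

That reproduces Cell's "only" 204 GFLOPS and the round 1 TFLOP Larrabee estimate; doubling the assumed core count doubles the Larrabee figure, which is where the wide public range comes from.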

This means that to match Cell performance available today, and for several years past, next year's Larrabee would have to not just emulate it, but extract more parallelism than is directly expressed in the program being emulated. It has to find more things to do at once than were there to begin with.
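Here's the lane-utilization arithmetic that makes the point, sketched for a hypothetical naive emulator that maps each 4-wide Cell operation onto one 16-wide Larrabee operation (the peak figures are the ones quoted above):

```python
cell_peak = 204.8        # GFLOPS, Cell peak (quoted in the post)
larrabee_peak = 1024.0   # GFLOPS, "call it an even 1 TFLOP"

cell_lanes, larrabee_lanes = 4, 16
utilization = cell_lanes / larrabee_lanes   # 0.25: 12 of 16 lanes sit idle
naive_emulation = larrabee_peak * utilization
print(naive_emulation)   # 256.0 GFLOPS -- barely above Cell's 204.8
```

And that 256 GFLOPS is before charging anything for the emulation itself; factor in JIT overhead and the lower clock, and the "much faster" chip can lose.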

I'm not saying that's impossible; it's probably not. But it's certainly not at all as straightforward as it would have been before we went down the rabbit hole. (And I suspect that "not at all as straightforward" may be excessively delicate phrasing.)

Ah, but how many applications really use all the parallelism in Cell – get all its parts cranking at once? Some definitely do, and people figure out how to do more every day. But it's not a huge number, in part because Cell does not have the usual, nice, maximally convenient programming model exhibited by mainstream systems, and claimed for Larrabee; it traded that off for all that speed (in part). The idea was that Cell was not for "normal" programming; it was for game programming, with most of the action in intense, tight, hand-coded loops doing image creation from models. That happened, but certainly not all the time, and anecdotally not very often at all.

Question: Does that make the problem easier, or harder? Don't answer too quickly, and remember that we're talking about emulating from existing code, not rewriting from scratch.

A final thought about assumption breaking and Cell's notorious programmability issues compared with the usual simpler-to-use organizations: We may, one day, look back and say "It sure was nice back then, but we no longer have the luxury of using such nice, simple programming models." It'll be muffins all the way down. I just hope that we've merely gone down the rabbit hole, and not crossed the Mountains of Madness.