Friday, December 4, 2009

Intel’s Single-Chip Clus… (sorry) Cloud

Intel's recent announcement of a 48-core "single-chip cloud" (SCC) is now rattling around several news sources, with varying degrees of boneheaded-ness and/or willful suspension of disbelief in the hype. Gotta set this record straight, and also raise a few questions I didn't find answered in the sources now available (the presentation, the software paper, the developers' video).

Let me emphasize from the start that I do not think this is a bad idea. In fact, it's an idea good enough that I've led or been associated with rather similar architectures twice, although not on a single chip (RP3, POWER 4 (no, not the POWER4 chip; this one was four original POWER processors in one box, apparently lost on the internet) (but I've still got a button…)). Neither was, in hindsight, a good idea in its time.

So, some facts:

SCC is not a product. It's an experimental implementation of which about a hundred will be made, given to various labs for software research. It is not like Larrabee, which will be shipped in full-bore product-scale quantities Some Day Real Soon Now. Think concept car. That software research will surely be necessary, since:

SCC is neither a multiprocessor nor a multicore system in the usual sense. They call it a "single chip cloud" because the term "cluster" is déclassé. Those 48 cores have caches (two levels), but cache coherence is not implemented in hardware. So, it's best thought of as a 48-node cluster of uniprocessors on a single chip. Except that those 48 cluster nodes all access the same memory. And if one processor changes what's in memory, the others… don't find out. Until the cache at random happens to kick the changed line out. Sometime. Who knows when. (Unless something else is done; but what? See below.)
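
For the software-minded, here's the garden-variety shared-memory idiom that silently breaks in that situation. It's only a sketch, with two pthreads standing in for two SCC cores; on an ordinary cache-coherent machine it works, which is precisely the point.

```c
/*
 * Sketch only: the standard shared-memory flag idiom, with two pthreads
 * standing in for two SCC cores.  On a cache-coherent machine this works;
 * the comments note where it falls apart when nothing keeps caches in sync.
 */
#include <pthread.h>
#include <stdio.h>

static volatile int data  = 0;
static volatile int ready = 0;

static void *core_a(void *arg)      /* "producer" core */
{
    (void)arg;
    data  = 42;                     /* lands in core A's cache...              */
    ready = 1;                      /* ...and so does this; memory may not     */
    return NULL;                    /* see either for an arbitrarily long time */
}

static void *core_b(void *arg)      /* "consumer" core */
{
    (void)arg;
    while (!ready)                  /* without coherence, B keeps re-reading    */
        ;                           /* its own stale cached line; nothing ever  */
                                    /* invalidates it, so this can spin forever */
    printf("data = %d\n", data);    /* or see ready == 1 but stale data         */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, core_a, NULL);
    pthread_create(&b, NULL, core_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```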

Doing this certainly does save hardware complexity, but one might note that quite adequately scalable cache coherence does exist. It's sold today by Rackable (the part that was SGI); Fujitsu made big cache-coherent multi-x86 systems for quite a while; and there are ex-Sequent folks out there who remember it well. There's even an IEEE standard for it (SCI, Scalable Coherent Interface). So let's not pretend the idea is impossible and count on the audience being ignorant. Mostly they will be, but that just makes misleading them more reprehensible.

To leave cache coherence out of an experimental chip like this is quite reasonable; I've no objection there. I do object to things like the presentation's calling this "New Data-Sharing Options." That's some serious lipstick being applied.

It also leads to several questions that are so far unanswered:

How do you keep the processors out of each other's pants? Ungodly uncontrolled race-like uglies must happen unless… what? Someone says "software," but what hardware does that software exercise? Do they, perhaps, keep the 48 separate from one another by virtualization techniques? (Among other things, virtualization hardware has to keep virtual machine A out of virtual machine B's memory.) That would actually be kind of cool, in my opinion, but I doubt it; hypervisor implementations require cache coherence, among other issues. Do you just rely on instructions that push individual cache lines out to memory? Ugh. Is there a way to decree whole swaths of memory to be non-cacheable? Sounds kind of inefficient, but whatever. There must be something, since they have demoed some real applications, so this problem must have been solved somehow. How?
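
For concreteness, the "push individual cache lines out" option looks roughly like this on garden-variety x86, using the standard clflush and mfence intrinsics. I'm not claiming Rock Creek's cores expose exactly these instructions; take it as an illustration of the option, not documentation of the chip.

```c
/*
 * Illustration of the "push individual cache lines out" option on ordinary
 * x86, using the standard SSE2 clflush/mfence intrinsics.  Whether Rock
 * Creek's cores offer exactly these instructions, I don't know; this shows
 * the flavor of the option, not the chip's actual API.
 */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64

/* After writing a shared buffer, force every line it touches out to memory
 * so a non-coherent reader on another core has a chance of seeing the new
 * values. */
static void publish(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;

    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);   /* evict one 64-byte line           */
    _mm_mfence();                       /* wait for the flushes to complete */
}

int main(void)
{
    static int shared[1024];
    shared[0] = 42;
    publish(shared, sizeof shared);
    return 0;
}
```

Wrapping every shared update in that, by hand, is pretty much what I meant by "Ugh."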

What's going on with the operating system? Is there a separate kernel for each core? My guess: Yes. That's part of being a clus… sorry, a cloud. One news article said it ran "Rock Creek Linux." Never heard of it? Hint: The chip was called Rock Creek prior to PR.

One iteration of non-coherent hardware I dealt with used cluster single system image to make it look like one machine for management and some other purposes. I'll bet that becomes one of the software experiments. (If you don't know what SSI is, I've got five posts for you to read, starting here.)

Appropriately, there's mention of message-passing as the means of communicating among the cores. That's potentially fast message passing, since you're using memory-to-memory transfers in the same machine. (Until you saturate the memory interface – only four ports shared by all 48.) (Or until you start counting usual software layers. Not everybody loves MPI.) Is there any hardware included to support that, like DMA engines? Or protocol offload engines?
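
For reference, the sort of code those experiments will presumably run is the plain old MPI ping-pong below, with two ranks standing in for two cores. On SCC the transport underneath would be on-die memory-to-memory copies; the layers I'm grumbling about are everything between these calls and that copy.

```c
/*
 * A trivial MPI ping-pong between two ranks (here, two cores).  Nothing
 * SCC-specific; it just shows the kind of message-passing code that would
 * ride on top of whatever transport the chip provides.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* core 0 -> core 1 */
        MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                         /* ...and back      */
        printf("round trip complete, buf = %d\n", buf);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);    /* bounce it back   */
    }

    MPI_Finalize();
    return 0;
}
```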

Finally, why does every Intel announcement of gee-whiz hardware always imply it will solve the same set of problems? I'm really tired of those flying cars. No, I don't expect to ever see an answer to that one.

I'll end by mentioning something in Intel's SCC (née Rock Creek) that I think is really good and useful: multiple separate power regions. Voltage and frequency can be varied separately in different areas of the chip, so if you aren't using a bunch of cores, they can go slower and/or draw less power. That's something that will be "jacks or better" in future multicore designs, and spending the effort to figure out how to build and use it is very worthwhile.
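
For a feel of what "use it" means from the software side, and with the caveat that this is nothing SCC-specific, here's the knob an ordinary Linux box exposes for the same general idea: per-CPU cpufreq governors in sysfs. How Rock Creek actually presents its voltage and frequency regions to software is presumably different and a good deal finer-grained.

```c
/*
 * Nothing SCC-specific: just what "slow down the cores you aren't using"
 * looks like from software on an ordinary Linux box, via the standard
 * cpufreq sysfs files.
 */
#include <stdio.h>

static int set_governor(int cpu, const char *governor)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;          /* no cpufreq support, or not running as root */
    fprintf(f, "%s\n", governor);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Treat cores 4..7 as an idle "region" and drop them to powersave. */
    for (int cpu = 4; cpu < 8; cpu++)
        set_governor(cpu, "powersave");
    return 0;
}
```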

Heck, the whole thing is worthwhile, as an experiment. On its own. Without inflated hype about solving all the world's problems.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

(This has been an unusually topical post, brought to you courtesy of the author's head-banging-wall level of annoyance at boneheaded news stories. Soon we will resume our more usual programming. Not instantly. First I have to close on and move into a house. In holiday and snow season.)