Friday, December 4, 2009

Intel’s Single-Chip Clus… (sorry) Cloud

Intel's recent announcement of a 48-core "single-chip cloud" (SCC) is now rattling around several news sources, with varying degrees of boneheaded-ness and/or willful suspension of disbelief in the hype. Gotta set this record straight, and also raise a few questions I didn't find answered in the sources now available (the presentation, the software paper, the developers' video).

Let me emphasize from the start that I do not think this is a bad idea. In fact, it's an idea good enough that I've led or been associated with rather similar architectures twice, although not on a single chip (RP3, POWER 4 (no, not the POWER4 chip; this one was four original POWER processors in one box, apparently lost on the internet) (but I've still got a button…)). Neither was, in hindsight, a good idea at the time.

So, some facts:

SCC is not a product. It's an experimental implementation of which about a hundred will be made, given to various labs for software research. It is not like Larrabee, which will be shipped in full-bore product-scale quantities Some Day Real Soon Now. Think concept car. That software research will surely be necessary, since:

SCC is neither a multiprocessor nor a multicore system in the usual sense. They call it a "single chip cloud" because the term "cluster" is déclassé. Those 48 cores have caches (two levels), but cache coherence is not implemented in hardware. So, it's best thought of as a 48-node cluster of uniprocessors on a single chip. Except that those 48 cluster nodes all access the same memory. And if one processor changes what's in memory, the others… don't find out. Until the cache at random happens to kick the changed line out. Sometime. Who knows when. (Unless something else is done; but what? See below.)
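To make that concrete, here's a minimal sketch in C (mine, not anything Intel published) of the standard flag-and-data handoff that hardware coherence quietly makes work. With each core's caches private and no coherence protocol between them, the consumer below can spin on its own stale copy of the flag more or less forever:

    /* A sketch, not SCC code: 'data' and 'ready' sit in memory both
     * cores can address, but each core has its own private caches
     * and nothing keeps those caches in agreement. */

    #include <stdint.h>

    volatile uint32_t data;    /* payload                   */
    volatile uint32_t ready;   /* flag: "data is valid now" */

    void producer(void)        /* runs on core 0 */
    {
        data  = 42;            /* lands in core 0's cache...        */
        ready = 1;             /* ...and so does the flag; nothing  */
                               /* forces either line out to memory  */
    }

    uint32_t consumer(void)    /* runs on core 1 */
    {
        while (ready == 0)
            ;                  /* re-reads core 1's cached copy of
                                * 'ready', which may stay 0 until that
                                * line happens to get evicted, if ever */
        return data;           /* and may be stale even then           */
    }

On any ordinary multicore this "works" (modulo memory ordering); here it's a bug with no fixed reproduction time.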

Doing this certainly does save hardware complexity, but one might note that quite adequately scalable cache coherence does exist. It's sold today by Rackable (the part that was SGI); Fujitsu made big cache-coherent multi-x86 systems for quite a while; and there are ex-Sequent folks out there who remember it well. There's even an IEEE standard for it (SCI, Scalable Coherent Interface). So let's not pretend the idea is impossible and count on the audience being too ignorant to notice. Mostly they will be, but that just makes misleading them more reprehensible.

To leave cache coherence out of an experimental chip like this is quite reasonable; I've no objection there. I do object to things like the presentation's calling this "New Data-Sharing Options." That's some serious lipstick being applied.

It also leads to several questions that are so far unanswered:

How do you keep the processors out of each others' pants? Ungodly uncontrolled race-like uglies must happen unless… what? Someone says "software," but what hardware does that software exercise? Do they, perhaps, keep the 48 separate from one another by virtualization techniques? (Among other things, virtualization hardware has to keep virtual machine A out of virtual machine B's memory.) That would actually be kind of cool, in my opinion, but I doubt it; hypervisor implementations require cache coherence, among other issues. Do you just rely on instructions that push individual cache lines out to memory? Ugh. Is there a way to decree whole swaths of memory to be non-cacheable? Sounds kind of inefficient, but whatever. There must be something, since they have demoed some real applications and so this problem must have been solved somehow. How?
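For what it's worth, the "push individual cache lines out to memory" option would presumably look something like the sketch below. I'm using the x86 clflush/mfence intrinsics purely as stand-ins; whether the SCC's Pentium-class cores offer those exact instructions or some other flush/invalidate mechanism, I don't know.

    /* Hypothetical flush-based handoff: the producer pushes its lines
     * to memory before raising the flag; the consumer drops its cached
     * copies before reading.  The intrinsics are stand-ins for whatever
     * the real hardware provides. */

    #include <stdint.h>
    #include <emmintrin.h>     /* _mm_clflush, _mm_mfence */

    volatile uint32_t data;
    volatile uint32_t ready;

    void producer(void)                        /* core 0 */
    {
        data = 42;
        _mm_clflush((const void *)&data);      /* push the data line out... */
        _mm_mfence();                          /* ...and wait for it        */
        ready = 1;
        _mm_clflush((const void *)&ready);     /* only then push the flag   */
        _mm_mfence();
    }

    uint32_t consumer(void)                    /* core 1 */
    {
        do {
            _mm_clflush((const void *)&ready); /* discard any cached copy   */
            _mm_mfence();
        } while (ready == 0);                  /* so this re-reads memory   */

        _mm_clflush((const void *)&data);      /* same for the payload      */
        _mm_mfence();
        return data;
    }

Multiply that by every line of everything shared and you can see how much weight the word "software" is carrying.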

What's going on with the operating system? Is there a separate kernel for each core? My guess: Yes. That's part of being a clus… sorry, a cloud. One news article said it ran "Rock Creek Linux." Never heard of it? Hint: The chip was called Rock Creek prior to PR.

One iteration of non-coherent hardware I dealt with used cluster single system image to make it look like one machine for management and some other purposes. I'll bet that becomes one of the software experiments. (If you don't know what SSI is, I've got five posts for you to read, starting here.)

Appropriately, there's mention of message-passing as the means of communicating among the cores. That's potentially fast message passing, since you're using memory-to-memory transfers in the same machine. (Until you saturate the memory interface – only four ports shared by all 48.) (Or until you start counting usual software layers. Not everybody loves MPI.) Is there any hardware included to support that, like DMA engines? Or protocol offload engines?
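For concreteness: a message between two of the 48 "nodes" would presumably look like perfectly ordinary MPI, as in the little ping below; nothing in the code knows or cares that the "network" is on-chip. (Plain MPI, that is; I'm not claiming this is whatever library Intel actually runs on the thing.)

    /* An ordinary MPI ping between rank 0 and rank 1 -- exactly what
     * you'd write for a garden-variety cluster. */

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            buf = 42;
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 got %d\n", buf);
        }

        MPI_Finalize();
        return 0;
    }

The interesting question is entirely what sits under MPI_Send here: a DMA engine, a cache-to-cache transfer, or just memcpy fighting over those four memory ports.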

Finally, why does every Intel announcement of gee-whiz hardware always imply it will solve the same set of problems? I'm really tired of those flying cars. No, I don't expect to ever see an answer to that one.

I'll end by mentioning something in Intel's SCC (née Rock Creek) that I think is really good and useful: multiple separate power regions. Voltage and frequency can be varied separately in different areas of the chip, so if you aren't using a bunch of cores, they can go slower and/or draw less power. That's something that will be "jacks or better" in future multicore designs, and spending the effort to figure out how to build and use it is very worthwhile.
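To illustrate what that buys you, here's an entirely hypothetical sketch -- the API, region count, and numbers are all invented, since Intel hasn't published any of this -- of the kind of control software could exercise: slow down or drop the voltage of regions that aren't doing anything, and leave the busy ones alone.

    /* Hypothetical, illustrative only: per-region frequency/voltage
     * control.  power_region_set(), NUM_REGIONS, and the numbers are
     * made up for the sketch. */

    enum { NUM_REGIONS = 8 };                     /* assumed count */

    void power_region_set(int region, int freq_mhz, int millivolts);

    void idle_unused_regions(const int busy[NUM_REGIONS])
    {
        for (int r = 0; r < NUM_REGIONS; r++) {
            if (busy[r])
                power_region_set(r, 1000, 1100);  /* full speed (made-up) */
            else
                power_region_set(r, 125, 700);    /* crawl, sip power     */
        }
    }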

Heck, the whole thing is worthwhile, as an experiment. On its own. Without inflated hype about solving all the world's problems.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

(This has been an unusually topical post, brought to you courtesy of the author's head-banging-wall level of annoyance at boneheaded news stories. Soon we will resume our more usual programming. Not instantly. First I have to close on and move into a house. In holiday-and-snow season.)

10 comments:

Anonymous said...

Thanks for the explanation. It does help clarify things for me, even though some questions remain unanswered (i.e., see the question marks in your article).

I read what's been written online about the 48-core CPU, but it is pretty hard to cut through the marketing speak and figure out what it really is about.

E.g., I can't believe that Intel's press release actually pointed this out as a possible consequence of the 48-core: "Some researchers believe computers may even be able to read brain waves [...]".

Igor

Jeff Darcy said...

It seems to me, as an OS developer who has also worked with chip designers, that a non-cache-coherent approach does open up some unique possibilities. Cache-coherence logic really does take up a lot of space, energy, and design time. As long as you have message passing and cache flush/clear instructions (for the write and read side respectively) you can still do medium-grain coordination between tiles. It's not quite as fast or simple as full cache coherence, and perhaps it's not good enough to run a single OS instance effectively, but it's still likely to be faster and simpler than the RDMA interconnects people are used to.

The other possibility is an extension to the original benefit of SMP: more effective resource utilization. Dynamic reallocation of memory between processes allows it to be used more effectively than if it were divided into separate pools, even if multiple CPUs can't be dynamically added to speed up a single process. Similarly, dynamic reallocation of memory between virtual machines offers the same benefit even though multiple CPUs can't be dynamically added to speed up a single VM. Reallocating a page between VMs is the sort of thing that a non-cache-coherent multiprocessor can still do pretty effectively, and the page might be much more valuable in its new role.

Recession Cone said...

I like Bill Dally's perspective: Cache coherence is an example of "denial" architecture. It's denial in the sense that truly parallel programs can't possibly use the cache coherency protocol for anything important, because if they do, they end up serializing in the coherence controller. This is the cause of much poor parallel scaling. People write code that has lots of parallel program counters, but then don't pay attention to ensuring independent, parallel data accesses, and then are surprised when their program can't actually execute in parallel. Practically, any program which relies on cache coherence for more than message passing is not a parallel program, it is a collection of parallel threads which are forced to run serially, sequenced by the cache coherence controller.

Cache coherence, then, promises something it can't provide. Its presence is an attempt to paper over parallelism and make it "easier" to program parallel machines. However, it only makes programming harder, because it's difficult to tell why a parallel program which uses cache coherency actually executes serially. It represents a denial of reality: parallel processing requires parallel data accesses.

Shared memory parallel programming without cache coherence is easier than you might think. You use memory fences to ensure that all dirty cache lines are evicted, and then you use atomic memory transactions to signal to other cores that your data is globally visible. Yes, this takes a few lines of code. But you're doing something expensive - and this model forces you to think about it explicitly, rather than deny the data sharing problem exists and put your faith in a magic cache coherency controller instead.

Wes Felter said...

I suspect the message passing is cache-to-cache, FIFO-to-FIFO, or even register-to-register (RAW style?) so it won't touch the memory controllers. Knowing how moving a single cache line in a coherence protocol often involves multiple packets over the interconnect, I can easily imagine that sending one packet directly is faster... if the hardware will allow it. I always wondered why HT and QPI don't allow this.

Most of the workloads Intel showed appear to be sharing no memory at all and using the interconnect for MPI or IP so I don't agree that "there must be something". Each core could just be using a disjoint physical address range. Boring.

There was a mention of page-granularity coherence in one slide, which sounds like either old-style software distributed shared memory (Intel bought TreadMarks from Rice) or perhaps they're sharing the physical pages and using message passing for locking.

Platypus said...

With a sufficiently fast interconnect, SDSM might actually be usable. ;)

Greg Pfister said...

Someone who prefers to remain web-anonymous (even more than an anonymous comment) sent me this comment off-line:

***********

I managed to see a good portion of Justin Rattner's press briefing. There are several items not generally reported in the press that are notable, particularly the last one relative to your blog entry:

Microsoft has Visual Studio running on it...their demo was a Mandelbrot set. While the picture was pretty fuzzy on the web, they showed the performance utilization graph of each core, and I thought (through what I could make out, given the murkiness of the web feed) that performance declined significantly when the core count was increased. This of course is no shocker, since the thing is probably memory starved.

The core is a Pentium-class, in-order core.

1.3B transistors - focused on design efficiency since the team was just 40 people across 3 locations (US/Germany/India)

The two cores in the tile are cache coherent.

***********

To which I replied:

***********

Mandelbrot set demo --

Did they say it was "from" Visual Studio, and was this the only evidence the whole Visual Studio was running on it? Mandelbrot set drawing is one of those tiny-code demos that parallelizes fairly easily, and is easily written from scratch. I wrote one myself back in about 1985 on a PC AT.
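(To back up "easily written from scratch": the entire compute kernel is on the order of the sketch below -- my own from-scratch version, obviously not whatever Microsoft demoed -- and every pixel row is independent of every other, which is why it parallelizes so readily.)

    /* A from-scratch ASCII Mandelbrot: one routine per pixel, rows
     * handed out to cores however you like, no sharing between rows. */

    #include <stdio.h>

    #define WIDTH  78
    #define HEIGHT 24
    #define MAXIT  100

    static int mandel(double cr, double ci)
    {
        double zr = 0.0, zi = 0.0;
        for (int i = 0; i < MAXIT; i++) {
            double zr2 = zr * zr, zi2 = zi * zi;
            if (zr2 + zi2 > 4.0)
                return i;              /* escaped: point is outside  */
            zi = 2.0 * zr * zi + ci;
            zr = zr2 - zi2 + cr;
        }
        return MAXIT;                  /* never escaped: "inside"    */
    }

    int main(void)
    {
        for (int y = 0; y < HEIGHT; y++) {       /* rows are independent */
            for (int x = 0; x < WIDTH; x++) {
                double cr = -2.0 + 3.0 * x / WIDTH;
                double ci = -1.2 + 2.4 * y / HEIGHT;
                putchar(mandel(cr, ci) == MAXIT ? '*' : ' ');
            }
            putchar('\n');
        }
        return 0;
    }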

That it slowed down a lot indicates nasty inter-processor communication or synchronization issues, or problems with bus overloading or something. Typical result of bad parallel system / software design.

Cores in a tile are cache coherent --

I wouldn't be surprised. I stared at the tile layout for a while, but didn't see any logic labelled coherence, and the 2 cores had separate L2s (no sharing), so I wasn't sure.

vak said...

What happened to OpenSSI?

This marvelous project seems to have died :(

I used it about 5 years ago and was so happy...

Greg Pfister said...

OpenSSI is still around. See http://openssi.org/. For reasons why it may not be very active (if it's not; I don't know), see the link in the post above about why SSI in general seems like a great idea but never caught on.

Greg

vak said...

OpenSSI is inactive.

Unknown said...

>It is not like Larrabee, which will be shipped in full-bore product-scale quantities Some Day Real Soon Now.

... or not ... ;-)
