Also, there was much attempted discussion of the 2012 product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion was much attempted by me, anyway, with decidedly limited success. There were some hints, and some things can be deduced, but the real KC hasn’t stood up yet. That reticence is probably a turn for the better, since KF is the direct descendant of Intel’s Larrabee graphics engine, which was quite prematurely trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only to eventually be dropped – to become KF. A bit more circumspection is now certainly called for.
This circumspection does, however, make it difficult to separate what I learned into neat KF or KC buckets; KC is just too well hidden so far. Here is my best guesses, answering questions I received from Twitter and elsewhere as well as I can.
If you’re unfamiliar with MIC or KF or KC, you can call up a plethora of resources on the web that will tell you about it; I won’t be repeating that information here. Here’s a relatively recent one: Intel Larraabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one chip. KC has “50 or more.” In addition, and crucially for much of the discussion below, each core has an enhanced and expanded vector / SIMD unit. You can think of that as an extension of SSE or AVX, but 512 bits wide and with many more operations available.
An aside: Intel’s department of code names is fond of using place names – towns, rivers, etc. – for the externally-visible names of development projects. “Knight’s Ferry” follows that tradition; it’s a town up in the Sierra Nevada Mountains in central California. The only “Knight’s Corner” I could find, however, is a “populated area,” not even a real town, probably a hamlet or development, in central Massachusetts. This is at best an unlikely name source. I find this odd; I wish I’d remembered to ask about it.
Is It Real?
The MIC architecture is apparently as real as it can be. There are multiple generations of the MIC chip in roadmaps, and Intel has committed to supply KC (product-level) parts to the University of Texas TACC by January 2013, so at least the second generation is as guaranteed to be real as a contract makes it. I was repeatedly told by Intel execs I interviewed that it is as real as it gets, that the MIC architecture is a long-term commitment by Intel, and it is not transitional – not a step to other, different things. This is supposed to be the Intel highly-parallel technical computing accelerator architecture, period, a point emphasized to me by several people. (They still see a role for Xeon, of course, so they don't think of MIC as the only technical computing architecture.)
More importantly, Joe Curley (Intel HPC marketing) gave me a reason why MIC is real, and intended to be architecturally stable: HPC and general technical computing are about a third of Intel’s server business. Further, that business tends to be a very profitable third since those customers tend to buy high-end parts. MIC is intended to slot directly into that business, obviously taking the money that is now increasingly spent on other accelerators (chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as discussed below, Intel’s intention for MIC is to greatly widen the pool of customers for accelerators.
The Big Feature: Source Compatibility
There is absolutely no question that Intel regards source compatibility as a primary, key feature of MIC: Take your existing programs, recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag), and they run on KF. I have zero doubt that this will also be true of KC and is planned for every future release in their road map. I suspect it’s why there is a MIC – why they did it, rather than just burying Larrabee six feet deep. No binary compatibility, though; you need to recompile.
You do need to be on Linux; I heard no word about Microsoft Windows. However, Microsoft Windows 8 has a new task manager display changed to be a better visualization of many more – up to 640 – cores. So who knows; support is up to Microsoft.
Clearly, to get anywhere, you also need to be parallelized in some form; KF has support for MPI (messaging), OpenMP (shared memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s a real Linux, by the way, that runs on a few of the MIC processors; I was told “you can SSH to it.” The rest of the cores run some form of microkernel. I see no reason they would want any of that to become more restrictive on KC.
If you can pull off source compatibility, you have something that is wonderfully easy to sell to a whole lot of customers. For example, Sriram Swaminarayan of LANL has noted (really interesting video there) that over 80% of HPC codes have, like him, a very large body of legacy codes they need to carry into the future. “Just recompile” promises to bring back the good old days of clock speed increases when you just compiled for a new architecture and went faster. At least it does if you’ve already gone parallel on X86, which is far from uncommon. No messing with newfangled, brain-bending languages (like CUDA or OpenCL) unless you really want to. This collection of customers is large, well-funded, and not very well-served by existing accelerator architectures.
Right. Now, for all those readers screaming at me “OK, it runs, but does it perform?” –
Well, not necessarily.
The problem is that to get MIC – certainly KF, and it might be more so for KC – to really perform, on many applications you must get its 512-bit-wide SIMD / vector unit cranking away. Jim Reinders regaled me with a tale of a four-day port to MIC, where, surprised it took that long (he said), he found that it took one day to make it run (just recompile), and then three days to enable wider SIMD / vector execution. I would not be at all surprised to find that this is pleasantly optimistic. After all, Intel cherry-picked the recipients of KF, like CERN, which has one of the world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications in the known universe. (See my post Random Things of Interest at IDF 2011.)
Where, on this SIMD/vector issue, are the 80% of folks with monster legacy codes? Well, Sriram (see above) commented that when LANL tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes with the horsepower coming from attached IBM Cell blades – they had a problem because to perform well, the Cell SPUs needed crank up their two-way SIMD / vector units. Furthermore, they still have difficulty using earlier Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s 8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.
On the other hand, getting good performance on other accelerators, like Nvidia’s, requires much wider SIMD; they need 100s of units cranking, minimally. Full-bore SIMD may in some cases be simpler to exploit than SIMD/vector instructions. But even going through gigabytes of grotty old FORTRAN code just to insert notations saying “do this loop in parallel,” without breaking the code, can be arduous. The programming language, by the way, is not the issue. Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language.
But wait! How can these guys be choking on 2-way parallelism when they have obviously exploited thousands of cluster nodes in parallel? The answer is that we have here two different forms of parallelism; the node-level one is based on scaling the amount of data, while the SIMD-level one isn’t.
In physical simulations, which many of these codes perform, what happens in this simulated galaxy, or this airplane wing, bomb, or atmosphere column over here has a relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as perturbations, smoothed out with a few more global iterations. That’s the basis of the node-level parallelism, with communication between nodes. It can also readily be the basis of processor/core-level parallelism across the cores of a single multiprocessor. (One basis of those kinds of parallelism, anyway; other techniques are possible.)
Inside any given galaxy, wing, bomb, or atmosphere column, however, quantities tend to be much more tightly coupled to each other. (Consider, for example, R2 force laws; irrelevant when sufficiently far, dominant when close.) Changing the way those tightly-coupled calculations and done can often strongly affect the precision of the results, the mathematical properties of the solution, or even whether you ever converge to any solution. That part may not be simple at all to parallelize, even two-way, and exploiting SIMD / vector forces you to work at that level. (For example, you can get into trouble when going parallel and/or SIMD naively changes from Gauss-Seidel iteration to Gauss-Jacobi simulation. I went into this in more detail way back in my book In Search of Clusters, (Prentice-Hall), Chapter 9, “Basic Programming Models and Issues.”) To be sure, not all applications have this problem; those that don’t often can easily spin up into thousands of operations in parallel at all levels. (Also, multithreaded “real” SIMD, as opposed to vector SIMD, can in some cases avoid some of those problems. Note italicized words.)
The difficulty of exploiting parallelism in tightly-coupled local computations implies that those 80% are in deep horse puckey no matter what. You have to carefully consider everything (even, in some cases, parenthesization of expressions, forcing order of operations) when changing that code. Needing to do this to exploit MIC’s SIMD suggests an opening for rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually necessary for Intel, too, and if you do it our way you get” tons more performance / lower power / whatever.
Can compilers help here? Sure, they can always eliminate a pile of gruntwork. Automatically vectorizing compilers have been working quite well since the 80s, and progress continues to be made in disentangling the aliasing problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or semi-commercial) products from people like CAPS and The Portland Group get better results if you tell them what’s what, with annotations. Those, of course, must be very carefully applied across mountains of old codes. (They even emit CUDA and OpenCL these days.)
By the way, at least some of the parallelism often exploited by SIMD accelerators (as opposed to SIMD / vector) derives from what I called node-level parallelism above.
Returning to the main discussion, Intel’s MIC has the great advantage that you immediately get a simply ported, working program; and, in the cases that don’t require SIMD operations to hum, that may be all you need. Intel is pushing this notion hard. One IDF session presentation was titled “Program the SAME Here and Over There” (caps were in the title). This is a very big win, and can be sold easily because customers want to believe that they need do little. Furthermore, you will probably always need less SIMD / vector width with MIC than with GPGPU-style accelerators. Only experience over time will tell whether that really matters in a practical sense, but I suspect it does.
Several Other Things
Here are other MIC facts/factlets/opinions, each needing far less discussion.
How do you get from one MIC to another MIC? MIC, both KF and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does not have a PCIe root complex, so cannot source PCIe. It must be attached to a standard compute node. So all anybody was talking about was going down PCIe to node memory, then back up PCIe to a different MIC, all at least partially under host control. Maybe one could use peer-to-peer PCIe device transfers, although I didn’t hear that mentioned. I heard nothing about separate busses directly connecting MICs, like the ones that can connect dual GPUs. This PCIe use is known to be a bottleneck, um, I mean, “known to require using MIC on appropriate applications.” Will MIC be that way for ever and ever? Well, “no announcement of future plans”, but “typically what Intel has done with accelerators is eventually integrate them onto a package or chip.” They are “working with others” to better understand “the optimal arrangement” for connecting multiple MICs.
What kind of memory semantics does MIC have? All I heard was flat cache coherence across all cores, with ordering and synchronizing semantics “standard” enough (= Xeon) that multi-core Linux runs on multiple nodes. Not 32-way Linux, though, just 4-way (16, including threads). (Now that I think of it, did that count threads? I don’t know.) I asked whether the other cores ran a micro-kernel and got a nod of assent. It is not the same Linux that they run on Xeons. In some ways that’s obvious, since those microkernels on other nodes have to be managed; whether other things changed I don’t know. Each core has a private cache, and all memory is globally accessible.
Synchronization will likely change in KC. That’s how I interpret Jim Reinders’ comment that current synchronization is fine for 32-way, but over 40 will require some innovation. KC has been said to be 50 cores or more, so there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100% necessary for source code to run (as opposed to perform), I think that might be a candidate for the chopping block at some point.
Is there adequate memory bandwidth for apps that strongly stream data? The answer was that they were definitely going to be competitive, which I interpret as saying they aren’t going to break any records, but will be good enough for less stressful cases. Some quite knowledgeable people I know (non-Intel) have expressed the opinion that memory chips will be used in stacks next to (not on top of) the MIC chip in the product, KC. Certainly that would help a lot. (This kind of stacking also appears in a leaked picture of a “far future prototype” from Nvidia, as well as an Intel Labs demo at IDF.)
Power control: Each core is individually controllable, and you can run all cores flat out, in their highest power state, without melting anything. That’s definitely true for KF; I couldn’t find out whether it’s true for KC. Better power controls than used in KF are now present in Sandy Bridge, so I would imagine that at least that better level of support will be there in KC.
Clearly, I feel the biggest point here are Intel’s planned commitment over time to a stable architecture that is source code compatible with Xeon. Stability and source code compatibility are clear selling points to the large fraction of the HPC and technical computing market that needs to move forward a large body of legacy applications; this fraction is not now well-served by existing accelerators. Also important is the availability of familiar tools, and more of them, compared with popular accelerators available now. There’s also a potential win in being able to evolve existing programmer skill, rather than replacing them. Things do change with the much wider core- and SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that required by current accelerator products, and it starts in a familiar place.
Will MIC win in the marketplace? Big honking SIMD units, like Nvidia ships, will always produce more peak performance, which makes it easy to grab more press. But Intel’s architectural disadvantage in peak juice is countered by process advantage: They’re always two generations ahead of the fabs others use; KC is a 22nm part, with those famous “3D” transistors. It looks to me like there’s room for both approaches.
Finally, don’t forget that Nvidia in particular is here now, steadily increasing its already massive momentum, while a product version of MIC remains pie in the sky. What happens when the rubber meets the road with real MIC products is unknown – and the track record of Larrabee should give everybody pause until reality sets well into place, including SIMD issues, memory coherence and power (neither discussed here, but not trivial), etc.
I think a lot of people would, or should, want MIC to work. Nvidia is hard enough to deal with in reality that two best paper awards were given at the recently concluded IPDPS 2011 conference – the largest and most prestigious academic parallel computing conference – for papers that may as well have been titled “How I actually managed to do something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown here.) Granted, things like a shortest-path graph algorithm (PHAST) are not exactly what one typically expects to run well on a GPGPU. Nevertheless, this is not a good sign. People should not have to do work at the level of intellectual academic accolades to get something done – anything! – on a truly useful computer architecture.
Hope aside, a lot of very difficult hardware and software still has to come together to make MIC work. And…
Larrabee was supposed to be real, too.
Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!
Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!