At the recent Intel Developer Forum (IDF), I was given the opportunity to interview Joe Curley, Director, Technical Computing Marketing of Intel’s Datacenter & Connected Systems Group in Hillsboro.
Intel-provided information about Joe:
I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list this generated. (Thank you! to all who responded.)
This is the last in a series of three such transcripts. Hallelujah! Doing this has been a pain. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knight’s” accelerator boards using them, since some important things learned were outside the interviews. But some were in the interviews, including here.
Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.
Occurrences of [] indicate words I added for clarification or comment post-interview.
[We began by discovering we had similar deep backgrounds, both starting in graphics hardware. I designed & built a display processor (a prehistoric GPU), he built “the most efficient frame buffer controller you could possibly make”. Guess which one of us is in marketing?]
A: My experience in the [HPC] business really started relatively recently, a little under five years ago, [when] I started working on many-core processors. I won’t be able to go into history, but I can at least tell you what we’re doing and why.
Q: Why don’t we start there? At a high level, what are you doing, and why? High level for what you are doing, and as much detail on “why” as you can provide.
A: We have to narrow the question. So, at Intel, what we’re after first of all is in what we call our Technical Computing Marketing Group inside Data Center Group. That has really three major objectives. The first one is to specify the needs for high performance computing, how we can help our customers and developers build the best high performance computing systems.
Q: Let me stop you for a second right there. My impression for high performance computing is that they are people whose needs are that they want more. Just more.
A: Oh, yes, but more at what cost? What cost of power, what cost of programmability, what cost of size? How are we going to build the IO system to handle it affordably or use the fabric of the day?
Q: Yes, they want more, but they want it at two bytes/FLOPS of memory bandwidth and communication bandwidth.
A: There’s an old thing called the Dilbert Spec, which is “I want it all, and by the way, can it be free?” But that’s not really what people tell us they want. People in HPC have actually been remarkably pragmatic about what it takes to develop innovation. So they really want us to do some things, and do them really well.
By the way, to finish what we do, we also have the workstation segment, and the MIC Many Integrated Core product line. The marketing for that is also in our group.
You asked “what are you doing and why.” It would probably take forever to go across all domains, but we could go into any one of them a little bit better.
Q: Can you give me a general “why” for HPC, and a specific “why” for MIC?
A: Well, HPC’s a really good business. I get stunned, somebody must be asking really weird questions, asking “why are you doing HPC?”
Q: What I’ve heard is that HPC is traditionally 12% of the market.
A: Supercomputing is a relatively small percentage of the market. HPC and technical computing, combined, is, not exactly, but roughly, a third of our data center business. [emphasis added by me] Our data center business is a pretty robust business. And high performance computing is a business that requires very high end, high performance processors. It’s actually a very desirable business to be in, if you can do it, and if your systems work. It’s a business we spend a lot of time working on because it’s a good business.
Now, if you look at MIC, back in 2005 we made a tacit conclusion that the performance of a system will come out of parallelism. Parallelism could be expressed at Intel in a lot of different ways. You can look at it as threads, we have this concept called hyperthreading. You can look at it as cores. And we have the SSE instructions sitting around which are SIMD, that’s a form of parallelism; people argue about the definition, but yes, it is. [I agree.] So you take a look at the basic architectural constructs, ease of programming, you know, a cache-based CISC model, and then scaling on cores, threads, SIMD or vectors, these common attributes have been adopted and well-used by a lot of programmers. There are programs across the continuum of coarse- to fine-grained parallel, embarrassingly parallel, pick your taxonomy. But there are applications that developers would be willing to trade the performance of any particular task or thread for the sum of what you can do inside the power envelope at a given period of time. Lots of people have different ways of defining that, you hear throughput, whatever, but this is the class of applications, and over time they’re growing.
Q: Growing relatively, or, say, compared to commercial processing, or…? Is the segment getting larger?
A: The number of people who have tasks they want to run on that kind of hardware is clearly growing. One of the reasons we’re doing MIC, maybe I should just cut it to the easiest answer, is developers and customers asked us to.
A: And they came to us with a really simple question. We were struggling in the marketing group with how to position MIC, and one of our developers got worked up, like “Look, you give me the parallel performance of an accelerator, but you give me the ease of CPU programming!” Now, ease is a funny word; you can get into religious arguments about ease. But I think what he means is “I don’t have to re-think my algorithm, I don’t have to reorder my data set, there are some things that I don’t have to do.” So that they wanted to have the idea of give me this architecture and get it to scale to be wildly parallel. And that is exactly what we’ve done with the MIC architecture. If you think about what the Knight’s Ferry STP [? Undoubtedly this is SDP - Software Development Platform; I just heard it wrong on the recording.] is, a 32 core, coherent, on a chip, teraflop part, it’s kind of like Paragon or ASCI Red on a chip. [but it is only a TFLOPS in single precision] And the programming model is, surprisingly, kind of like a bunch of processor cores on a network, which a lot of people understand and can get a lot of utility out of in a very well-understood way. So, in a sense, we’re giving people what they want, and that, generally, is good business. And if you don’t give them what they want, they’ll have to go find someone else. So we’re simply doing what our marketplace asked us for.
Q: Well, let me play a little bit of devil’s advocate here, because MIC is very clearly derivative of Larrabee, and…
A: Knight’s Ferry is.
Q: … Knight’s Ferry is. Not MIC?
A: No. I think you have to take a look at what Larrabee was. Larrabee, by the way, was a really cool project, but what Larrabee was was a tile rendering graphics device, which meant its design point, was first of all the programming model was derived from what you do for graphics. It’s going to be API-based, the answer it’s going to generate is going to be a pixel, the pixel is going to have a defined level of sub-pixel accuracy. It’s a very predictable output. The internal optimizations you would make for a graphics implementation of a general many-core architecture is one very specific implementation. Let’s talk about the needs of the high performance computing market. I need bandwidth. I need memory depth. Larrabee didn’t need memory depth; it didn’t have a frame buffer.
Q: It needed bandwidth to local memory [of which it didn’t have enough; see my post The Problem with Larrabee]
A: Yes, but less than you think, because the cache was the critical element in that architecture [again, see that post] if you look through the academic papers on that…
Q: OK, OK.
A: So, they have a common heritage, they’re both derived out of the thoughts that came out of the Intel Labs terascale research. They’re both many-core. But Knight’s Ferry came out with a few, they’re only a few, modifications. But the programming model is completely different. You don’t program a graphics device like you do a computer, and MIC is a computer.
Q: The higher-level programming model is different.
Q: But it is a big, wide, cache-coherent SMP.
A: Well, yes, that’s what Knight’s Ferry is, but we haven’t talked about what Knight’s Corner is yet, and unfortunately I won’t today, and we haven’t talked about where the product line will go from there, either. But there are many things that will remain the same, because there are things you can take and embellish and work and things that will be really different.
Q: But can you at least give me a hint? Is there a chance that Knight’s Corner will be a substantially different hardware model than Knight’s Ferry?
A: I’m going to really love to talk to you about Knight’s Corner. [his emphasis]
Q: But not today.
A: I’m going to duck it today.
Q: Oh, man…
A: The product is going to be in our 22 nm process, and 22 nm isn’t shipping yet. When we get a little bit closer, when it deserves to have the buzz generated, we’ll start generating buzz. Right now, the big thing is that we’re making the investments in the Knight’s Ferry software development platform, to see how codes scale across the many-core, to get the environment and tools up, to let developers poke at it and find stuff, good stuff, bad stuff, in between stuff, that allow us to adjust the product line for ongoing generations. We’ve done that really well since we announced the architecture about 15 months ago.
Q: I was wondering what else I was going to talk about after having talked to both John Hengeveld and Jim Reinders. This is great. Nobody talked about where it really came from, and even hinted that there were changes to the MIC chip [architecture].
A: Oh, no, no, many things will be the same, many things will be different. If you’re targeting trying to do a pixel-renderer, go do a pixel-renderer. If you’re trying to do a general-purpose computing device, do a general-purpose computing device. You’ll see some things and say “well, it’s all the same” and other things “wow, it’s completely different.” We’ll get around to talking about the part when we’re a little closer.
The most important thing that James and/or John should have been talking about is that the key thing is the ability to not force the developer to completely and utterly re-think their problem to use your hardware. There are two models: In an accelerator model, which is something I spent a lot of my life working with, accelerators have the advantage of optimization. You can say “I want to do one thing really well.” So you can then describe a programming model for the hardware. You can say “build your data this way, write your program this way” and if you do it will work. The problem is that not everything fits into the box. Oh, you have sparse data. Oh, you have recursive code.
Q: And there’s madness in that direction, because if you start supporting that you wind yourself around to a general-purpose machine. […usually, a very odd-looking general-purpose machine. I’ve talked about Sutherland’s “Wheel of Reincarnation” in this blog, haven’t I? Oh, there it is: The Cloud Got GPUs, back in November 2010.]
A: Then it’s not an accelerator any more. The thing that you get in MIC is the performance of one of those accelerators. We’ve shown this. We’ve hit 960GF out of a peak 1.2TF without throwing away precision, without playing any circus tricks, just run the hardware. On Knight’s Ferry we’ve shown that. So you get performance, but you’re getting it out of the general purpose programming model.
Q: That’s running LINPACK, or… ?
A: That was an even more basic thing; I’m just talking about SGEMM [single-precision dense matrix multiply].
Q: I just wanted to ground the number.
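To put the quoted SGEMM figure in context, here is a minimal sketch of the arithmetic, assuming the usual convention of counting 2·m·n·k floating-point operations for a dense matrix multiply. The function names are illustrative only and do not come from any Intel tool; the 960 GF and 1.2 TF values are the ones quoted in the interview.

```python
def gemm_flop_count(m, n, k):
    """FLOPs for C = A * B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiplies and k adds,
    so the conventional count is 2*m*n*k.
    """
    return 2 * m * n * k


def gflops(flop_count, seconds):
    """Achieved rate in GFLOPS for a given operation count and run time."""
    return flop_count / seconds / 1e9


def fraction_of_peak(achieved_gflops, peak_gflops):
    """Achieved rate as a fraction of the theoretical peak."""
    return achieved_gflops / peak_gflops


# Figures quoted in the interview: 960 GF achieved, 1.2 TF (1200 GF) peak.
quoted = fraction_of_peak(960.0, 1200.0)
print(f"{quoted:.0%} of peak")
```

On those numbers the SGEMM run reaches about 80% of the quoted peak, in single precision.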
A: For LU factorization, I think we showed hybrid LU, really cool, one of the great things about this hybrid…
Q: They’re demo-ing that downstairs.
A: … OK. When the matrix size is small, I keep it on the host; when the matrix size is large, I move it. But it’s all the same code, the same code either place. I’m just deciding where I want to run the code intelligently, based on the size of the matrix. You can get the exact number, but I think it’s on the order of 750GBytes/sec for LU [GFLOPS?], which is actually, for a first-generation part, not shabby. [They were doing 650-750 GF according to the meter I saw. That's single precision; Knight's Ferry was originally a graphics part.]
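The size-based placement described here can be sketched roughly as follows. This is an illustration under stated assumptions, not Intel's hybrid LU: the threshold value, the target names, and the `run_lu` helper are all hypothetical, and the factorization is a plain unpivoted LU that assumes nonzero pivots. The point is only that one kernel is written once and a separate decision picks where it runs.

```python
HOST = "host"
COPROCESSOR = "coprocessor"


def choose_target(n, threshold=1024):
    """Pick where to run based on matrix size (threshold is an assumed value)."""
    return COPROCESSOR if n >= threshold else HOST


def lu_in_place(a):
    """Unpivoted LU factorization of a square matrix stored as a list of rows.

    Afterwards the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U. Assumes every pivot is nonzero.
    """
    n = len(a)
    for k in range(n):
        pivot = a[k][k]
        for i in range(k + 1, n):
            a[i][k] /= pivot  # multiplier l_ik
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


def run_lu(a, threshold=1024):
    """Decide placement, then run the same kernel either way.

    A real system would ship the work to the chosen device here; this
    sketch only records the decision and calls the one shared kernel.
    """
    target = choose_target(len(a), threshold)
    return target, lu_in_place(a)


target, factors = run_lu([[4.0, 3.0], [6.0, 3.0]])
print(target, factors)
```

Small matrices stay on the host because the cost of moving the data would outweigh the gain; the crossover size has to be measured on the actual system.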
Q: Yaahh, well, there are a lot of people who can deliver something like that.
A: We’ll keep working on it and making it better and better. So, what are we proving today? All we’ve proven today is that the architecture is capable of performance. We’ve got a lot of work to do before we have a product, but the architecture has shown itself to be capable. The programming model, we have people who will speak for us, like the quotes that came from LRZ [data center for the universities of Munich and the Bavarian Academy of Sciences], from Leibniz [same place], a code they couldn’t port to other accelerators was running in two hours and optimized in two days. Now, actual mileage may vary, see dealer for…
Q: So, there are things that just won’t run on a CUDA model? Example?
A: Well, perhaps, again, the thing you try to get to is whether there is evidence growing that what you say is real. So we’re having people who are starting to be able to speak to that, and that gives people the confidence that we’re going to be able to get there. The other thing it ends up doing, it’s kind of an odd benefit, as people have started building their code, trying to optimize it for MIC, they’re finding the parallelism, they’re doing what we wanted them to do all along, they’re taking the same code on their current cluster and they’re getting benefits right now.
Q: That’s got a long history. People would have some grotty old FORTRAN code, and want to vectorize it, but the vectorizing compiler couldn’t make crap out of it. So they cleaned it up, made it obvious what was going on, and the vectorizer did its thing well. Then they put it back on the original machine and it ran twice as fast.
A: So, one of the nice things that’s happened is that as people are looking at ways to scale power, performance, they’re finally getting around to dealing with parallelism. The offer that we’re trying to provide is portable, high level, standards-based, and you can use it now.
You said “why.” That’s why. Our customers and developers say “if you can do that, that’s really valuable.” Now. We’re four men and a pudding, we haven’t shipped a product yet, we’ve got a lot of work to do, but the thought and the promise and the early data is really good.
Q: OK. Well, great.
A: Was that a good use of the time?
Q: That’s a very good use of the time. Let me poke on one thing a little bit. Conceptually, it ought to be simpler to write code to that kind of a shared memory model and get parallelism out of the code that way. Now, on the other hand, there was a talk – sorry, I forget his name, he was one of the software guys working on Larrabee [it was Tom Forsyth; see my post The Problem with Larrabee again] said someone on the project had written four renderers, and three of them were for Larrabee. He was having one hell of a time trying to get something that performed well. His big issue, at least what it came down to from what I remember of the talk, was memory bandwidth.
A: Well, first of all, we’ve said Larrabee’s not a product. As I’ve said, one of the things that is critical, you’ve got the compute-bound, you’ve got the memory-bound, and most people are somewhere in between, but you have to be able to handle the two edge cases. We understand that, and we intend to deliver a really good value across the spectrum. Now, Knight’s Ferry has the RVI silicon [RVI? I’m guessing here], it’s a variation off the silicon we used, no one cares about that, but on Knight’s Ferry, the memory bus is 256 bits wide. Relatively narrow, and for a graphics processor, very narrow. There are definitely design decisions in how that chip was made that would limit the bandwidth. And the memory it was designed with is slower than the memory today, you have all of the normal things. But if you went downstairs to the show floor, and talked to Daniel Pohl, he’s demonstrating a pretty dramatic ray-tracer.
[What follows is a bit confused. He didn’t mean the Austrian Crown stochastic ray-tracing demo, but rather the real-time ray-tracing demo. As I said in my immediately previous post (Random Things of Interest at IDF 2011), the real-time demo is on a set of Knight’s Ferries attached to a Xeon-based node. At the time of the interview, I hadn’t seen the real-time demo, just the stochastic one; the latter is not on Knight’s Ferry.]
Q: I’ve seen that one. The Austrian Crown?
Q: I thought that was on a cluster.
A: In the little box behind there, he’s able to scale from one to eight Knight’s Ferries.
Q: He never told me there was a Knight’s Ferry in there.
A: Yes, it’s all Knight’s Ferry.
Q: Well, I’m going to go down there and beat on him a little bit.
A: I’m about to point you to a YouTube site, it got compressed and thrown up on YouTube. You can’t get the impact of the complexity of the rays, but you can at least get the superficial idea of the responsiveness of the system from Knight’s Ferry.
[He didn’t point me to YouTube, or I lost it, but here’s one I found. Ignore the fact that the introduction is in Swedish or something [it's Dutch, actually]; Daniel – and it’s Daniel, not David – speaks English, and gives a good demo. Yes, everybody in the “Labs” part of the showroom wore white lab coats. I did a bit of teasing. I also updated the Random Things of Interest post to directly include it.]
Well, if you believe that what we’re going to do in our mainstream processors is roughly double the FLOPS every generation for the next many generations, that’s our intent. What if we can do that on the MIC line as well? By the time you get to where ray-tracing would be practical, you could see multiple of those being integrated into a single device [added in transcription: Multiple MICs in a single device? Hierarchical MIC?] becomes practical computationally. That won’t be far from now. So, it’s a nice demo. David’s an expert in his field, I didn’t hear what he said, but if you want to see the device downstairs actually running a fairly strenuous graphics workload, take a look at that.
Q: OK. I did go down there and I did see that, I just didn’t know it was Knight’s Ferry. [It’s not, it’s not, still confused here.] On that HDR display that is gorgeous. [Where “it” = stochastically-ray-traced Austrian Crown. It is.]
[At that point, Dave Patterson walked in, which interrupted us. We said hello – I know Dave of old, a bit – thanks were exchanged with Joe, and I departed.]
[I can’t believe this is the end of the last one. I really don’t like transcribing.]