At the recent Intel Developer Forum (IDF), I was given the opportunity to interview John Hengeveld. John is in the Datacenter and Connected Systems Group in Hillsboro.
Intel-provided information about John:
I recorded our conversation. What follows is a transcript, rather than a summary, since our topics ranged fairly widely and in some cases information is conveyed by the style of the answer. Conditions weren’t optimal for recording; it was in a large open space with many other conversations going on and the “Intel Robotic Orchestra” playing in the background. Hopefully I got all the words right.
I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! To all who responded.)
Full disclosure: As I noted in a prior post, Intel paid for me to attend IDF. Thanks, again.
Occurrences of  indicate words I added for clarification. There aren’t many.
Pfister: What, overall, is HPC to Intel? Is it synonymous with MIC?
Hengeveld: No. Actually, HPC has a research effort, how to scale applications, how to deal with performance and power issues that are upcoming. That’s the labs portion of it. Then we have significant product activity around our mainstream Xeon products, how to support the software and infrastructure when those products are delivered in cluster form to supercomputing activities. In addition to those products also get delivered into what we refer to as the volume HPC market, which is small and medium-sized clusters being used for product design, research activities, such as those in biomed, some in visualization. Then comes the MIC part. So, when we look at MIC, we try to manage and characterize the collection of workloads we create optimized performance for. About 20% of those, and we think these are representative of workloads in the industry, map to what MIC does really well. And the rest, most customers have…
Pfister: What is the distinguishing characteristic?
Hengeveld: There are two distinguishing characteristics. One is what I would refer to as compute density – applications that have relatively small memory footprints but have a high number of compute operations per memory access, and that parallelize well. Then there’s a second set of applications, streaming applications, where size isn’t significant but memory bandwidth is the distinguishing factor. You see some portion of the workload space there.
Pfister: Streaming is something I was specifically going to ask you about. It seems that with the accelerators being used today, there’s this bifurcation in HPC: Things that don’t need, or can’t use, memory streaming; and those that are limited by how fast you can move data to and from memory.
Hengeveld: That’s right. I agree.
Pfister: Is MIC designed for the streaming side?
Hengeveld: MIC will perform well for many streaming applications. Not all. There are some that require a memory access model MIC doesn’t map to particularly well. But a lot of the streaming applications will do very well on MIC in one of the generations. We have a collection of generations of MIC on the roadmap, but we’re not talking about anything beyond the next “Corner” generation [Knight’s Corner, 2012 product successor to the current limited-production Knight’s Ferry software development vehicle]. More beyond that, down the roadmap, you will see more and more effect for that class of application.
Pfister: So you expect that to be competitive in bandwidth and throughput with what comes out of Nvidia?
Hengeveld: Very much so. We’re competing in this market space to be successful; and we understand that we need to be competitive on a performance density, performance per watt basis. The way I kind of think about it is that we have a roadmap with exceptional performance, but, in addition to that, we have a consistent programming model with the rest of the Xeon platforms. The things you do to create an optimized cluster will work in the MIC space pretty much straightforwardly. We’ve done a number of demonstrations of that here and at ISC. That’s the main difference. So we’ll see the performance; we’ll be ahead in the performance. But the real difference is the programming model.
Pfister: But the application has to be amenable.
Hengeveld: The application has to be amenable. For many customers that do a wide range of applications – you know, if you are doing a few things, it’s likely possible that some of those few things will be these highly-parallel, many-core optimized kinds of things. But most customers are doing a range of things. The powerful general-purpose solution is still the mainstream Xeon architecture, which handles the widest range of workloads really robustly, and as we continue with our beat rate in the Xeon space, you know with Sandy Bridge coming out we moved significantly forward with floating-point performance, and you’ll see that again going forward. You see the charts going up and to the right 2X per release.
Pfister: Yes, all marketing charts go up and to the right.
Hengeveld: Yes, all marketing charts go up and to the right, but the point is that there’s a continued investment to drive floating-point performance and effective parallelism and power efficiency in a way that will be useful to HPC customers and mainstream customers.
Pfister: Is MIC going to be something that will continue over time? That you can write code for an expect it to continue to work in the future?
Hengeveld: Absolutely. It’s a major investment on our part on a distinct architectural approach that we expect to continue on as far out as our roadmaps envision today.
Pfister: Can you tell me anything about memory and connectivity? There was some indication at one point of memory being stacked on a MIC chip.
Hengeveld: A lot of research concepts are being explored for future products, and I can’t really talk about much of that kind of thing for things that are out in the roadmap. There’s a lot of work being done around innovative approaches about how to do the system work around this silicon.
Pfister: MIC vs. SCC – Single Chip Cluster.
Hengeveld: SCC! Got it! I thought you meant single chip computer.
Pfister: That would probably be SoC, System on a Chip. Is SCC part of your thinking on this?
Hengeveld: SCC was a research vehicle to try to explore extreme parallelism and some different instruction set architectures. It was a research vehicle. MIC is a series of products. It’s an architecture that underlies them. We always use “MIC” as an adjective: It’s a MIC architecture, MIC products, or something like that. It means Many Integrated Cores, Many Integrated Core architecture is an approach that underlies a collection of products, that are a product mix from Intel. As opposed to SCC, which is a research vehicle. It’s intended to get the academic community thinking about how to solve some of the major problems that remain in parallelism, using computer science to solve problems.
Pfister: One person noted that a big part of NVIDIA’s success in the space is CUDA…
Pfister: …which people can use to get, without too much trouble, really optimized code running on their accelerators. I know there are a lot of other things that can be re-used from Intel architecture – Threaded Building Blocks, etc. – but will CUDA be supported?
Hengeveld: That’s a question you have to ask NVIDIA. CUDA’s not my product. I have a collection of products that have an architectural approach.
Pfister: OpenCL is covered?
Hengeveld: OpenCL is part of our support roadmap, and we announced that previously. So, yes OpenCL.
Pfister: Inside of a MIC, right now, it has dual counter-rotating rings. Are connections other than that being considered? I’m thinking of the SCC mesh and other stuff. Are they in your thinking at this point?
Hengeveld: Yes, so, further out in the roadmap. These are all part of the research concepts. That’s the reason we do SCC and things like that, to see if it makes sense to use that architecture in the longer term products. But that’s a long ways away. Right now we have a fairly reasonable architectural approach that takes us out a bit, and certainly into our first generation of products. We’re not discussing yet how we’re going to use these learnings in future MIC products. But you can imagine that’s part of the thinking.
Hengeveld: So, here’s the key thing. There are problems in exascale that the industry doesn’t know how to solve yet, and we’re working with the industry very actively to try to figure out whether there are architectural breakthroughs, things like mesh architectures. Is that part of the solution to exascale conundrums? Are there workloads in exascale, sort of a wave processing model, that you might see in a mesh architecture, that might make sense. So working with research centers, working with the labs, in part, we’re trying to figure out how to crack some of these nuts. For us it’s about taking all the pieces people are thinking about and seeing what the whole is.
Pfister: I’m glad to hear you express it that way, since the way it seemed to be portrayed at ISC was, from Intel, “Exascale, we’ve got that covered.”
Hengeveld: So, at the very highest strategic level, we have it covered in that we are working closely with a collection of academic and industry partners to try and solve difficult problems. But exascale is a long way off yet. We’re committed to make it happen, committed to solve the problems. That’s the real meat of what Kirk declared at ISC. It’s not that we have the answer; it’s that we have a commitment to make it happen, and to make it happen in a relatively early time period, with a relatively sustainable product architectural approach. But there are many problems to solve in exascale; we can barely get our arms around it.
Pfister: Do you agree with the DARPA targets for exascale, particularly low power, or would you relax those?
Hengeveld: The Intel commit, what we said in the declaration, was not inconsistent with the DARPA thing. It may be slightly relaxed. You can relax one of two things, you can relax time or you can relax DARPA targets. So I think you’re going to reach DARPA’s targets eventually – but when. So the target that Kirk raised is right in there, in the same ballpark. Exascale in 20MW is one set of rational numbers; I’ve heard 10 [MW], I’ve heard 40 [MW], somewhere between those, right? I think 40 [MW] is so easy it’s not worth thinking about. I don’t think it’s economically rational.
Pfister: As you move forward, what do you think are the primary barriers to performance? There are two different axes here, technical barriers, and market barriers.
Hengeveld: The technical barriers are cracking bandwidth and not violating the power budget; tracking how to manage the thread complexity of an exascale system – how many threads are you going to need? A whole lot. So how do you get your arms around that? There are business barriers: How do you get a return on investment through productizing things that apply in the exascale world? This is a John [?] quote, not an Intel quote, but I am far less interested in the first exascale system than I am in the 100th. I would like a proliferation of exascale applications and performance, and have it be accessible to a wide range of people and applications, some applications that don’t exist today. In any ecosystem-building task, you’ve got to create awareness of the need, and create economic momentum behind serving that need. Those problems are equally complex to solve [equal to the technical ones]. In my camp, I think that maybe in some ways the technical problems are more solvable, since you’re not training people in a new way of thinking and working and solving problems. It takes some time to do that.
Pfister: Yes, in some ways the science is on a totally different time schedule.
Hengeveld: Yes, I agree. I agree entirely. A lot of what I’m talking about today is leaps forward in science as technical computing advances, but as the capability grows, the science will move to match it. How will that science be used? Interesting question. How will it be proliferated? Genome work is a great target for some of this stuff. You probably don’t need exascale for genome. You can make it faster, you can make it more cost-effective.
Pfister: From what I have heard from people working on this at CSU, they have a whole lot more problems with storage than with computing capability.
Hengeveld: That’s exactly right.
Pfister: They throw data away because they have no place to put it.
Hengeveld: That’s a fine example of the business problems you have to crack along with the compute problems that you have to crack. There’s a whole infrastructure around those applications that has to grow up.
Pfister: Looking at other questions I had… You wouldn’t call MIC a transitional architecture, would you?
Hengeveld: No. Heavens no. It’s a design point for a set of workloads in HPC and other areas. We believe MIC fits more things than just HPC. We started with HPC. It’s a design point that has a persistence well beyond as far as we can see on the roadmap. It’s not a transitional product.
Pfister: I have a lot of detailed technical questions which probably aren’t appropriate, like whether each of the MIC cores has equal latency to main memory.
Hengeveld: Yes, that’s a fine example of a question I probably shouldn’t answer.
Pfister: Returning to ultimate limits of computing, there are two that stand out, power and bandwidth, both to memory and between chips. Does either of those stand out to you as the sore thumb?
Hengeveld: Wow. So, the guts of that question gets to workload characterization. One of my favorite topics is “It’s the workload, stupid.” People say “it’s the economy, stupid,” well in this space it’s the workload. There aren’t general statements you can make about all workloads in this market.
Pfister: Yes, HPC is not one market.
Hengeveld: Right, it’s not one market, it’s not one class of usages, it’s not one architecture of solutions, it’s one reason why MIC is required, it’s not invisible. One size doesn’t fit all. Xeon does a great job of solving a lot of it really well, but there are individual workloads that are valuable that we want to dive into with more capability in a more targeted way. There are workloads in the industry where the interconnect bandwidth between processors in a node and nodes in a cluster is the dominant factor in performance. There are other workloads where the bandwidth to memory is the dominant factor in performance. All have to be solved. All have to be moved forward at a reasonable pace. I think the ones that are going to map to exascale best are ones where the memory bandwidth required can be solved well by local memory, and the problems that can be addressed well are those that have rational scaling of interconnect requirement between nodes. You’re not going to see problems that have a massive explosion of communication; the bandwidth won’t exist to keep up with that. You can actually see something I call “well-fed FLOPS,” which is how many FLOPS can you rationally support given the rest of this architecture. That’s something you have to know for each workload. You have to study it for each domain of HPC usage before you get to the answer about which is more important.
Pfister: You probably have to go now. I did want to say that I noticed the brass rat. Mine is somewhere in the Gulf of Mexico.
Hengeveld: That’s terrible. Class of ’80.
Pfister: Class of ’67.
Pfister: Stayed around for graduate school, too.
Hengeveld: When’d you leave?
Pfister: In ’74.
Hengeveld: We just missed overlapping, then. Have you been back recently?
Pfister: Not too recently. But there have been a lot of changes.
Hengeveld: That’s true, a lot of changes.
Pfister: But East Campus is still the same?
Hengeveld: You were in East Campus? Where’d you live?
Hengeveld: I was in the black hall of fifth-floor Bemis.
Pfister: That doesn’t ring a bell with me.
Hengeveld: Back in the early 70s, they painted the hall black, and put in red lights in 5th-floor Bemis.
Pfister: Oh, OK. We covered all the lights with green gel.
Hengeveld: Yes, I heard of that. That’s something that they did even to my time period there.
Pfister: Anyway, thank you.
Hengeveld: A pleasure. Nice talking to you, too.