Friday, October 31, 2008

Microsoft Azure Just Says NO to Multicore Apps in the Cloud

At its recent PDC’08, Microsoft unveiled the Azure Services Platform, its full-throttle venture into Cloud Computing. Apparently you shouldn’t bother with multithreading, since Azure doesn’t support multicore applications. It scales only “out,” using virtualization, as I said server code would generally do in IT Departments Should NOT Fear Multicore. I’ll give more details about that below; first, an aside about why this is important.

“Cloud Computing” is the most hyped buzzword in IT these days, all the way up to a multi-article special report in The Economist (recommended). So naturally its definition has been mangled by the many people who apparently reason “Cloud computing is good. My stuff is good. Therefore my stuff is cloud computing.”

My definition, which I’m sure conflicts with many agendas: Cloud computing is hiring someone out on the web to host your computing, where “host your computing” can range across a wide spectrum: from providing raw iron (hardware provisioning), through providing building blocks of varying complexity, through providing standard commercial infrastructure, to providing the whole application. (See my cloud panel presentation at HPDC 2008.)

Clouds are good because they’re easy, fast, cheap, and can easily scale up and down. They’re easy because you don’t have to purchase anything to get started; just upload your stuff and go. They’re fast because you don’t have to go through your own procurement cycle, get the hardware, hire a sysadmin, upgrade your HVAC and power, etc. They’re cheap because you don’t have to shell out up front for hardware and licenses before getting going; you pay for what you use, when you use it. Finally, they scale because they’re on some monster compute center somewhere (think Google, Amazon, Microsoft – all cloud providers with acres of systems, and IBM’s getting in there too) that can populate servers and remove them very quickly – it’s their job, they’re good at it (or should be) – so if your app takes off and suddenly has huge requirements, you’re golden; and if your app tanks, all those servers can be given back. (This implicitly assumes “scale out,” not multicore, but that’s what everybody means by scale, anyway.)

It is possible, if you’re into such things, to have an interminable discussion with Grid Computing people about whether a cloud is a grid, grid includes cloud, a cloud is a grid with a simpler user interface, and so on. Foo. Similar discussions can revolve around terms like utility computing, SaaS (Software as a Service), PaaS (Platform as …), IaaS (Infrastructure …) and so on. Double foo – but with a nod to the antiquity of “utility computing.” Late 60s. Project MAC. Triassic computing.

Microsoft Azure Services slots directly into the spectrum of my definition at the “provide standard commercial infrastructure” point: Write your code using Microsoft .NET, Windows Live, and similar services; upload it to a Microsoft data center; and off you go. Its presentation is replete with assurances that people used to Microsoft’s development environment (.NET and related) can write the same kind of things for the Microsoft cloud. Code doesn’t port without change, since it will have to use different services – Azure’s storage services in particular look new, although SQL Services are there – but it’s the same kind of development process and code structure many people know and are comfortable with.

That sweeps up a tremendous number of potential cloud developers, and so in my estimation bodes very well for Microsoft doing a great hosting business over time. Microsoft definitely got out in front of the curve on this one. This assumes, of course, that the implementation works well enough. It’s all slideware right now, but a beta-ish Community Technology Preview platform is supposed to be available this fall (2008).

For more details, see Microsoft’s web site discussion and a rather well-written white paper.

So this is important, and big, and is likely to be widely used. Let’s get back to the multicore scaling issues.

That issue leaps out of the white paper with an illustration on page 13 (Figure 6) and a paragraph following. That wasn’t the intent of what was written, which was actually intended to show why you don’t have to manage or build your own Windows systems. But it suffices. Here’s the figure:

[Figure explanation: IIS is Microsoft’s web server (Internet Information Services) that receives web HTTP requests. The Web Role Instance is user code that initially processes that, and passes it off to the Worker Role Instance through queues via the Agents. This is all apparently standard .NET stuff (“apparently” because I can’t claim to be a .NET expert). So the two sets of VM boxes roughly correspond to the web tier (1st tier), with IIS instead of Apache, and application tier (2nd tier) in non-Microsoft lingo.]

Here’s the paragraph:

While this might change over time, Windows Azure’s initial release maintains a one-to-one relationship between a VM [virtual machine] and a physical processor core. Because of this, the performance of each application can be guaranteed—each Web role instance and Worker role instance has its own dedicated processor core. To increase an application’s performance, its owner can increase the number of running instances specified in the application’s configuration file. The Windows Azure fabric will then spin up new VMs, assign them to cores, and start running more instances of this application. The fabric also detects when a Web role or Worker role instance has failed, then starts a new one.

The scaling point is this: There’s a one-to-one relationship between a physical processor core and each of these VMs, therefore each role instance you write runs on one core. Period. “To increase an application’s performance, its owner can increase the number of running instances” each of which is a separate single-core virtual computer. This is classic scale out. It simply does not use multiple cores on any given piece of code.
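The shape of that scale-out can be sketched in miniature. Below is a toy model in Python, purely illustrative (Azure's real roles are .NET code and its queues are a hosted storage service, so every name here is made up): web-role code enqueues requests, each worker-role "instance" is plain single-threaded code, and adding performance means adding instances, never threads.

```python
import queue
import threading

# Toy model of the web-role / worker-role pattern: the web role
# enqueues requests, and independent single-threaded worker instances
# dequeue them. Scaling "out" means starting more identical workers.
work = queue.Queue()
results = queue.Queue()

def worker_role_instance():
    # Each instance is ordinary single-threaded code; Azure would run
    # it in a VM pinned one-to-one to a physical core.
    while True:
        item = work.get()
        if item is None:          # shutdown signal
            break
        results.put(item * 2)     # stand-in for real request processing
        work.task_done()

def scale_out(n_instances):
    # "Increase the number of running instances" from the config file,
    # modeled here as simply starting more identical workers.
    workers = [threading.Thread(target=worker_role_instance)
               for _ in range(n_instances)]
    for w in workers:
        w.start()
    return workers

workers = scale_out(4)            # four "VMs," one "core" each
for request in range(10):
    work.put(request)
work.join()                       # wait for all requests to finish
for _ in workers:
    work.put(None)
for w in workers:
    w.join()
print(sorted(results.queue))
```

Note that no worker ever talks to another worker; that independence is exactly what makes this scheme scale out so cheaply.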

There are weasel words up front about things possibly changing, but given the statement about how you increase performance, it’s clear that this refers to initially not sharing a single core among multiple VMs. That would be significantly cheaper, since most apps don’t use anywhere near 100% of even a single core’s performance; it’s more like 12%. Azure doesn’t share cores, at least initially, because they want to ensure performance isolation.

That’s very reasonable; performance isolation is a big reason people use separate servers (there are 5 or so other reasons). In a cloud megacenter, you don’t want your “instances” to be affected by another company’s stuff, possibly your competitor, suddenly pegging the meter. Sharing a core means relying on scheduler code to ensure that isolation, and, well, my experience of Windows systems doing that is somewhat spotty. The biggest benefit I’ve gotten out of dual core on my laptop is that when some application goes nuts and sucks up the CPU, I can still mouse around and kill it because the second core is not being used.

Why do this, when multicore is a known fact of life? I have a couple of speculations:

First, application developers in general shouldn’t have anything to do with parallelism, since it’s difficult, error-prone, and increases cost; developers who can do it don’t come cheap. That’s a lesson from multiple decades of commercial code development. Application developers haven’t had to deal with parallelism since the early 1970s, when SMPs arrived and they wrote single-thread applications that ran under transaction monitors, instantiated much as Azure is planning (but not on VMs).

Second, it’s not just the applications that would have to be robustly multithreaded; there’s also the entire .NET and Azure services framework. That’s got to be multiple millions of lines of code. Making it all really rock for multicore – not just work right, but work fast – would be insanely expensive, get in the way of adding function that can be more easily charged for, and is likely unnecessary given the application developer point above.

Whatever the ultimate reasons, what this all means is that one of the largest providers of what will surely be one of the most used future programming platforms has just said NO! to multicore.

Bootnote: I’ve seen people on discussion lists pick up on the fact that “Azure” is the color of a cloudless sky. Cloudless. Hmmm. I’m more intrigued by its being a shade of blue: Awesome Azure replacing Big Blue?

Monday, October 20, 2008

XKCD on Twitter


Posted because anybody who enjoyed the type of humor in In Search of Clusters is sure to like that one. Enjoy.

Getting Back to Comments

It’s more than time to respond to some of the comments that have been posted. I’ve copied the shorter ones here, and summarized the longer ones. If your comment isn’t here, it’s because no response seemed needed, or I already responded with a comment of my own.

Steve, on IT Departments Should NOT Fear Multicore: I think you're spot on concerning why IT shouldn't fear multicore. Where they fear it, it's because they've had innovation beaten out of them.

I agree, Steve. Another possibility is that they are so very conservative they fear changing anything – which may be saying the same thing. It’s possible Claunch was trying to wake up the change-fearing crowd, but why he didn’t virtualize (or why Dignan didn’t repeat it) I just don’t understand, except in the most cynical possible way (“Oooh, it’s scary! Hire us to tell you what to do!”).

Several folks commented on Vive la (Killer App) Revolution!

Stu: isn't the killer app 'information retrieval' ? i.e. google search, or commoditized parallel data warehousing and mining?

Stu, sure those use multicore, but they’re server apps; see IT Departments Should NOT Fear Multicore for more about how server apps soak up multicore quite well. That article was looking for, and arguing that we need, client apps.

Steve: Embedded systems could soak up a lot of low power multiprocessor chips. At least a couple of start-ups are working on them. Intellasys has a very low power 40 core chip and XMOS has a 4 core chip with 8 hardware threads per core.

They surely can soak up some, Steve, but I don’t think enough. The high end, in things like MRI scanners, has really minuscule volumes. The low end, in things like dishwasher controllers, has enormous volume but only requires trivial capabilities; it doesn’t take much brains to run a dishwasher. The only high-volume, fairly high-price case is PC clients, where new CPUs go for $100s – and possibly game systems. I don’t know the relative volumes of PCs and game systems. I suspect the game systems are pretty big, but not in the PC range, since they’re not on every desk in every business.

Philip Machanick: What is really needed is a killer app that addresses some real mass-market need. The history of computing tells us that technology packaged for low cost eventually overtakes technology packaged for speed, first on price:performance and eventually on performance. More at my blog.

Philip, I share much of your astonishment and disbelief about multicore, but I think what you’re referring to with “low cost eventually overtakes” is the Disruptive Technology process of Clayton Christensen’s "The Innovator's Dilemma." I don’t think that’s what’s happening here, or can happen here. I think multicore is more like the process in which cars eventually just got fast enough for everybody, so to keep selling them manufacturers invented tail fins. I’m not sure that has a name other than “marketing.”

TerryMcIntyre: [My summary of his long comment: He thinks killer apps include crosses between games and tutors, managing your photo collection including face recognition and recognition-driven auto-tagging, optimizing use of applications, and smart implementations of Go (and the like).]

Those are all possibilities, Terry. One requirement I’d put on things, however, is that somebody know how to do it, meaning a practical algorithm is known. I think that rules out things like auto-tagging everything looking like the Grand Canyon from some examples. I’m not sure about Go.

Robert on Clarifying the Black Hole: Greg, With this sentence: "Unfortunately, servers alone don't produce the semiconductor volumes that keep an important fraction of this industry moving forward – a fraction, not all of the industry, but certainly the part that's arguably key, to say nothing of the loudest, and will make the biggest ruckus as it increasingly runs into trouble." you tap dance around this "important fraction of the industry" without actually saying who they are. Who, precisely, are these people who can't easily parallelize, and really need continuing performance gains, and constitute a significant fraction of the computing market? Before we can find new solutions for them we need to know who they are and what they are trying to accomplish.

Ok, Robert, a spade a spade: I believe that the technological trend of multicore hurts the traditional business models of Intel and Microsoft, and their derivative PC hardware and software vendors: HP, Dell, and company. That’s not to say they’ll have financial problems; finances are affected by a lot more than technology.

John on 101 Parallel Languages (Part 3, the last): [My summary of his long comment: My conclusion shouldn’t be so negative; we need to find new modes of expression expressing parallelism.]

John, all I can say is that’s not the evidence I see. Some language extensions can help; but I’m thinking here of relatively low-level support like, say, truly low-weight thread spawning in Java. But I think the attempts to provide parallelism in ordinary, programming-for-the-rest-of-us things are fundamentally misguided.

dvadell on Background: Me and This Blog: I'm very happy I found your blog. I can finally say: Thank you for "In search of clusters"! I couldn't get it here in Argentina so a friend of mine sent it to me from USA. I enjoyed (and laughed) a lot.

Well, thanks, dvadell. I’m glad you found it enjoyable. I hope I manage to get another out there for you in the foreseeable future.

Thursday, October 16, 2008

IT Departments Should NOT Fear Multicore

"Gartner analyst Carl Claunch ... argued Thursday that IT types should fear–yes fear–multicore technology."

This statement appears in "Does IT really need to fear multicore?" on ZDNet, to which I was alerted by Surendra and Sujana Byna's excellent blog. So this is a fourth-level indirection to the source: here, pointing to the summary by Larry Dignan in ZDNet of a talk by Gartner analyst Carl Claunch at Gartner's Symposium/ITxpo (whew).

I'm bringing up this family tree because I don't have access to the original, only to the ZDNet summary; and, as the title of this blog entry indicates, I don't agree. Unfortunately, I don't know for sure whether I'm disagreeing with Claunch, of Gartner, or disagreeing with Dignan's (ZDNet) summary rendition of what Claunch said. So Carl, if what I say interprets what you said wrongly, I apologize.

Let's start off quoting Claunch of Gartner (quote lifted from ZDNet): 
"We have had the easy fix for decades to deal with existing applications whose requirements have outgrown their current system — we refreshed the technology and the application ran better. This led to relatively long life cycles for applications, allowed operations to address application performance problems easily and required little advance notice of capacity shortfalls."

This is completely true. 

In contrast, with multicore, the simple process above becomes more complex. Of the five bullet points detailing that added complexity, three make sense to me:
  • IT departments will need new coding skills;
  • IT departments will need new development tools;
  • what you get out of it, the achieved performance increase, will vary, a lot, by application.
(I'll talk at the end about the ones that I consider flat-out wrong.)

I think this is entirely too alarmist.

Well. Given my past posts, you may be excused for thinking of bristly 300-lb hogs daintily hovering before tiny flowers, sipping nectar like hummingbirds; and of course the sky has fallen (thud!) (didn't even bounce). But I've not said multicore is a problem for server systems.

The issues brought up above are certainly problems if you have a single-thread production program and you need it to go faster. If that's your case, I'd be even more vehement about how you are in deep trouble, how much those skills are going to cost, etc.; see prior posts in this blog. Given the whole spectrum of business processing, I've no doubt such situations exist.

However: The vast majority of server workloads already scale out. They've been developed or recoded over the last decade or so to use multiple whole computers (clusters, farms) to provide greater performance. (Scale out is opposed to scale up, meaning using more CPUs in a single multicore system.) The price-performance of commodity rack systems is just too staggeringly better than the alternatives for anybody to do otherwise.

That server workloads scale out is not speculation. I and others have trudged through all the categories and sub-categories of International Data Corp's (IDC's) Server Workload taxonomy (see this report for a list of those categories, at the end, in Table 11), asking application experts whether their case scaled out: CRM? Check, scales out. Email? Check; got the diagrams. Data mining? Double-triple check, scales way out. And so on, through a dismayingly long list.

(Those are sample applications in IDC's categories "Business Processing," "Collaboration," and "Business Intelligence," respectively.)

In fact, there is so much scale out of commercial workloads that the issue has confused many in the cloud computing crowd completely: Are you talking about computing out in the Internet cloud somewhere, or are you talking about computing on a cloud of systems (cluster)? Usual implicit answer: Both. They assume scale out so deeply that it's built into some of the names of cloud facilities, like Amazon's "Elastic Computing Cloud"; the claimed elasticity arises only if you scale out. But hey, everything scales out. That's the assumption. In fact, in most software discussions, "scales" is used to mean "scales out."

Now, a few of the server workloads in IDC's taxonomy don't scale out without issues. For example:
  • OLTP -- short transaction processing, like ATM transactions -- is problematical, despite the best efforts of Oracle, IBM, and others (IBM comes closer) (or maybe Tandem). 
  • Batch depends a whole lot on what you're batching; if what you're batching is totally single-thread, it can be a big problem. 
Those two, however, amount to less than 10% of server revenue, and... hey, they hark back to old-fashioned mainframe days. Mainframes have been SMP (multicore) since the 1970s, so at least some adaptation has taken place; OLTP in particular scales like a banshee on multicore; all the top record-holders of the OLTP benchmark, TPC-C, are SMPs (= multiprocessors) (= multicores). Batch is not always terrible either, but that's a longer story.

So IT shops already scale out, a whole lot, and if they don't they probably scale up and directly use multicore already. Where does that lead us?

Well, if you already scale out, have I got a deal for you: Virtualization. 

Just partition that 16-core monster into 16 single-core pussycats, each running as a separate, independent, system.

This isn't completely painless. You will need a large amount of memory per system, but probably not much more than would scale up due to the claimed increased performance of the multicore CPU set. As for IO, well, that will hopefully be relieved when PCIe 3.0 starts deploying. It can be a problem now, on some systems. And virtualization adds yet more pain to systems management, which we definitely do not need; this is probably the worst aspect of the virtualization deal. But it totally beats parallelizing all your code, hands down and match over, which is what the article says Claunch says you have to do. I don't think so.
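In miniature, the deal looks like this. The sketch below is a toy model, not a real hypervisor configuration; the VM names and memory numbers are made up for illustration. It just captures the arithmetic: one VM per physical core, each with its own slice of memory, none able to step on its neighbors.

```python
# Toy model of carving a 16-core box into 16 single-core "pussycats":
# a one-to-one VM-to-core pinning, with memory split evenly.
# (Illustrative only; a real hypervisor does the pinning and enforcement.)

def partition(n_cores, total_mem_gb):
    mem_per_vm = total_mem_gb // n_cores
    return [{"vm": f"vm{core:02d}",   # hypothetical VM names
             "core": core,            # the 1:1 pinning
             "mem_gb": mem_per_vm}
            for core in range(n_cores)]

vms = partition(n_cores=16, total_mem_gb=64)
print(len(vms), vms[0])
```

The memory line is the catch mentioned above: each pussycat needs its own working set, so the monster box needs a monster memory.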

I suppose virtualization isn't the best long-term solution, simply because it seems it would be much cleaner to use a multicore system as one parallel thing. Partitioning it up seems artificial and inelegant. However, it at least buys some time for a beleaguered IT manager, and, well, interim solutions that work always last longer than one would anticipate.

Now, about the other bullet points appearing in the ZDNet article:
  • There are multiple core designs that will affect your software.
This differs from the existing situation how? The next generation of hardware has always had a design different from the previous one, requiring at least recompilation to tap the claimed performance. Moreover, with Intel apparently (we don't know for sure, but I'd bet on it) ditching their interconnect busses for point-to-point connections like AMD's, multicore designs are converging, not diverging.
  • Your existing software licensing deals become more complicated.
Just to pick the key example they used: Oracle has been licensing its products on multicore/multiprocessor systems since at least the mid-1980s. See comments above about breaking records. It's not the same as licensing on multiple separate systems, true, and if that's all you've ever dealt with in your IT career, you probably have some education required. But that doesn't mean it's horrid. IT shops have dealt with it for decades.

Really, though, licensing points to a problem with my solution: Licensing across multiple virtual machines has been a quagmire of the first order. It's inching towards something more rational, but there are still problems. Fortunately, the particular case of Oracle (and DB2, and other databases) is irrelevant for the multicore case, since they scale up to many processors and have for a long time; so you don't need to play the virtualization game.

Overall, multicore is actually a pretty good fit for server workloads. That's one aspect of multicore that won't need the alarm bells ringing. If servers alone had the volumes to keep this train rolling, many of us would sleep better at night.

(By the way, apologies to all for neglecting this blog for a while. I've got a lot more to post; I just got tangled up for a while. Comments will be answered.)

((By the way)^2: If somebody knows of an equivalent of the IDC workload taxonomy but for clients, I'd be much obliged if they could give me a pointer to it.)

Thursday, September 25, 2008

Jealousy (Off Topic)

Pretty far off-topic, but I feel like pointing out that at the moment I'm jealous of my son. He lives in Shanghai, China, and is about to go on a "golden week" (national holiday week) trip to Wulingyuan. That's the paradigm of those impossible-seeming mist-shrouded, tiny, steep mountains. He's also visiting an ancient city there, Fenghuang, which looks really interesting.

I'm positively green.

I really hope to get there some day myself.

I'm not particularly jealous of his recent work schedule, though. He's a legal assistant at a Chinese law firm, so he gets traditional Chinese two-hour lunches. On the other hand, he's also expected to exhibit the standard Chinese virtue of hard work by regularly working his brains out until 2 AM. Retirement's a lot less stressful.

Vive la (Killer App) Révolution!

I've pointed to finding applications that are embarrassingly parallel (EP), meaning it is essentially trivial for them to use parallel hardware, as a key way to avoid problems. Why should that be the case?

Actually, what I’m trying to find are not just EP applications, but rather killer apps for mass-market parallel systems: Mass-market applications which force many people to buy a “manycore” system, because it’s the only thing that runs them well. In fact, I’m thinking now that EP is probably a misstatement, because that’s really not the point. If a killer app’s parallelism is so convoluted that only three people on the planet understand it, who cares? The main requirement is that the other 6.6 Billion minus 3 of us can’t live without it (or at least a sufficiently large fraction of that 6.6B-3). Those three might get obscenely rich. That’s OK, just so long as the rest of us don’t go down the tubes.

I’m thinking here by analogy to two prior computer industry revolutions: The introduction of the personal computer, and the cluster takeover. The latter is slightly less well-known, but it was there.

The introduction of the PC was clearly a revolution. It created the first mass market for microprocessors and related gear, as well as the first mass market for software. Prior to that, both were boutique products – a substantial boutique, but still a boutique. Certainly the adoption of the PC by IBM spurred this; it meant software developers could rely on there being a product that would last long enough, and be widespread enough, for them to afford to invest in product development. But none of that would have mattered if at the time this revolution began, microprocessors hadn’t gotten just (barely) fast enough to run a killer application – the spreadsheet. Later, with a bit more horsepower, desktop publishing became a second killer app (and boosted Macs substantially), but the first big bump was the spreadsheet. It was the reason for the purchase of PCs in large numbers by businesses, which kicked everything into gear: the fortuitous combination of a particular form and level of computing power and an application using that form and power.

Note that nobody really lost in the PC revolution. A few people, Intel and Microsoft in particular, really won. But the revenue for other microprocessors didn’t decline, they just didn’t have the enormous increases of those two.

The takeover of servers by clusters was another revolution, ignited by another conjunction of fast-enough microprocessors and a killer app. This time, the killer app was the Internet. Microprocessors got just fast enough that they could provide adequate turnaround time for a straightforward “read this page” web request, and once that happened you just didn’t need single big computers any more to serve up your website, no matter how many requests came in: Just spray the request across a pile of very cheap servers. This revolution definitely had losers. It was a big part of what in IBM was referred to as “our near-death experience”; it’s hard to fight a difference in cost of about 300X (which is what I heard it estimated within IBM development for mainframe vs. commodity clusters, for cases where the clusters were adequate – big sore point there). Later Sun succumbed to this, too, even though they started out being one of the spears in IBM’s flanks. Once again, capability plus application produced revolution.
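"Spray the request across a pile of very cheap servers" is about as simple as it sounds; a stateless round-robin front end can be sketched in a few lines (the server names here are made up for illustration):

```python
import itertools

# Toy sketch of spraying web requests across cheap commodity servers:
# a stateless front end hands each request to the next server in turn,
# so capacity grows by adding boxes, not by making any one box bigger.
servers = ["web01", "web02", "web03"]
next_server = itertools.cycle(servers)

def route(request_id):
    # Any server can handle any "read this page" request, which is
    # exactly why this workload scales out so easily.
    return (request_id, next(next_server))

assignments = [route(r) for r in range(6)]
print(assignments)
```

The point of the sketch is what's missing: no request ever needs two servers to cooperate, so there's no parallel programming anywhere in sight.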

We have another revolution going on now in the turn to explicit parallelism. But so far, it’s not the same kind of revolution because while the sea-change in computer power is there, the killer app, so far, is not – or at least it’s not yet been widely recognized.

We need that killer app. Not a simpler means of parallel programming. What are you going to do with that programming? Parallel Powerpoint? The mind reels.

(By the way, this is perfect – it’s exactly what I hoped to have happen when I started this blog. Trying to respond to a question, I rethought a position and ended with a stronger story as a result. Thanks to David Chess for his comment!)

Wednesday, September 24, 2008

IO, IO, We Really need IO

Now for something completely different:

I had always wondered what the heck was going on with recent large disk systems. They're servers, for heaven's sake, with directly attached disks and, often, some kind of nonvolatile cache. IBM happily makes money selling dual Power systems with theirs, while almost everybody else uses Intel. If your IO system were decent, shouldn't you just attach the disks directly? Is the whole huge, expensive infrastructure of adapters, fibre-channel (soon Data Center Ethernet (probably copyright Cisco)), switches, etc., really the result of the mechanical millisecond delays in disks?

Apparently I'm not the only person thinking this. According to "Is flash a cache or pretend disk drive?" by Chris Mellor, an interview with Rick White, one of the three founders of Fusion-io, you just plop your NAND flash directly onto your PCIe bus. Then use it as a big disk cache. You have to use some algorithms to spread out the writes so you don't wear out the NAND too fast, but lots of overhead just goes away.
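That write-spreading is worth a sketch. Flash blocks survive only a limited number of erase/write cycles, so the translation layer remaps hot logical blocks across all the physical ones. Below is a deliberately naive model of the idea; real flash translation layers are far more elaborate, and every name here is invented for illustration.

```python
# Toy wear-leveling model: repeated writes to one hot logical block get
# scattered across all physical flash blocks, each of which tolerates
# only a limited number of erase/write cycles.
class ToyFlash:
    def __init__(self, n_physical):
        self.wear = [0] * n_physical   # erase/write count per physical block
        self.mapping = {}              # logical block -> physical block

    def write(self, logical_block, data):
        # Always place the write on the least-worn physical block,
        # then remap the logical block to point there.
        target = min(range(len(self.wear)), key=lambda b: self.wear[b])
        self.wear[target] += 1
        self.mapping[logical_block] = target

flash = ToyFlash(n_physical=8)
for _ in range(80):                    # hammer one hot logical block
    flash.write(logical_block=0, data=b"x")

print(flash.wear)                      # wear ends up evenly spread
```

Without the remapping, all 80 writes would land on one physical block; with it, every block carries exactly a tenth of the load.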

Of course, if you want to share that cache among multiple systems, there are a whole lot of other issues; it's a big block of shared memory. But start with "how the heck do you connect to it?" LAN/SAN time again, with processors on both ends to do the communication. Dang.

Relevance to the topic of this blog: As more compute power is concentrated in systems, you need more and faster IO, or there's a whole new reason why this show may have problems.

Clarifying the Black Hole

The announcement of this blog in LinkedIn’s High Performance and Supercomputing group produced some responses indicating that I need to clarify what this blog is about. Thanks for that discussion, folks. One of my (semi-?) writers' block problems with this book is clearly explaining exactly what the issue is.

Here’s a whack at that, an overall outline of the situation as I see it. Everything in this outline requires greater depth of explanation, with solid numbers and references. This is just the top layer, and I’m really unsatisfied with it. Hang in there with me, prod me with questions if there are parts you are really itchy about, and I’ll get there.

First, I don’t mean to imply or say that there is some dastardly conspiracy to force the industry to use explicit parallelism. The basic problems are (1) physics; (2) the exhaustion of other possibilities. There simply isn't any choice -- but, no choice to do what? To increase performance.

This industry has lived and breathed dramatically increasing year-to-year single-thread performance since the 1960s. The promise has always been: Do nothing, and your programs run 45% faster this time next year, on cheaper hardware. That’s an incredible situation, the only time it’s ever happened in history, and it’s a situation that’s the very bedrock of any number of industry business and technical assumptions. Now we are moving to a new paradigm: To actually achieve the additional performance available in new systems, programmers must do something. What they must do is a major effort, it’s not clear how broadly applicable it is, and it’s so difficult that few can master it. That, to put it mildly, is a major paradigm shift. (And it’s not being caused by some nefarious cabal; as I said, it’s physics.)
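To see how much bedrock that was, compound it. A one-liner's worth of arithmetic makes the point:

```python
# "Do nothing, and your programs run 45% faster this time next year,"
# compounded over a decade of doing nothing:
annual_gain = 1.45
years = 10
cumulative = annual_gain ** years
print(f"single-thread speedup after {years} years: {cumulative:.1f}x")
```

Roughly a 41x speedup in ten years, for free, on cheaper hardware. That is the escalator the industry is stepping off of.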

This is not a problem for server systems. Those have been broadly parallel – clusters / farms / warehouses -- since the mid-90s, and server processing has arguably been parallelized since the mid-70s on mainframe SMPs. Transaction monitors, web server software like Apache, and other infrastructure all have enabled this parallelism (but the applications had to be there).

Unfortunately, servers alone don't produce the semiconductor volumes that keep an important fraction of this industry moving forward – a fraction, not all of the industry, but certainly the part that's arguably key, to say nothing of the loudest, and will make the biggest ruckus as it increasingly runs into trouble. The industry is going to be a very different place in the foreseeable future.

At this point, it’s reasonable to ask that if this is actually true, why aren’t the combined voices of the blogosphere, industry rags, and industry stock analysts all shouting it from the housetops? Why are these statements believable? Where’s the consensus? Are we in conspiracy theory territory?

No, we’re not. I think we’re in the union of two territories. First and foremost, we’re in the “left hand doesn’t know what the right hand is doing” territory, greatly enhanced by the narrow range of people with the cross-domain expertise needed to fully understand the problem. There was a flare-up of concern, flashing through the blogosphere and the technical news outlets back in 2004, but it was focused (almost correctly) on crying “Moore’s Law Is Ending!” so it was squashed when the technorati high priests responded “No, Moore’s Law isn’t ending, you dummies” which it isn’t, in the original literal sense. But saying that is a fine case of not seeing the forest for the veins on the leaves of the trees, or, in this case, not seeing a chasm because you’re a botanist, not a geologist.

That is the main territory, or at least I hope so. But having lived in the industry for quite a while, I understand that there’s also a form of denial – or hope – going on: There have definitely been occasions, well-remembered by management and business leaders, where those excitable, barely comprehensible techno-nerds have cried wolf, said the sky was falling, and it didn’t. This produces a reaction like: Nah, this can’t really be that big a deal; it’s the wrong answer; obviously we should not panic, since things have always happened to make such issues just go away. Also, aren’t they really posturing for more funding?

Unfortunately, they’re not. And the truth of the situation is starting to come across, with some industry funding for parallel programming, and pleas for a Manhattan project on parallel programming showing up. For reasons detailed in my 101 Parallel Languages series of posts, I think those are looking under the wrong lamppost.

My choice of the place to look is for applications that are “embarrassingly parallel,” where there’s no search for the parallelism – it’s obvious – and little need for explicit parallel programming. There are a few possibilities there (I’m partial to virtual worlds), but I’m far from certain that they’ll be widespread enough to pick up the slack in the client space. So I fear we may be in for a significant, long-term, downturn for companies whose business relies on the replacement cycle for client systems. This is not a pleasant prospect, since those companies are presently at the heart of the industry.

101 Parallel Languages (Part 3, the last)

So why, after all the effort spent on parallel programming languages, have none seen broad use? I posed this question on the mailing list of the IEEE Technical Committee on Scalable Computing, and got a tremendously, gratifyingly high-quality response from people including major researchers in the field, a funding director of NSF, and programming language luminaries.

Nobody disagreed with the premise: MPI (message-passing interface) has a lock on parallel programming, with OpenMP (straightforward shared memory) a distant second. Saying the rest are in the noise level is being overly generous. The discussion was completely centered around reasons why that is the case.

After things settled down, I was requested to collect it for the on-line newsletter of the IEEE TCSC. You can read the whole thing here.

I also boiled down the results as much as I could; many reasons were provided, all of them good. Here’s that final summary, which appears at the end of the newsletter article, turned into prose instead of powerpoint bullets (which I originally used).

A key reason, and probably the key reason, is application longevity and portability. Applications outlive any given generation of hardware, so developers must create them in languages they are sure will be available over time – which is impossible except for broadly used, standard languages. You can’t assume you can just create a new compiler for wonderful language X on the next hardware generation, since good compilers are very expensive to develop. (I first proposed this as a reason; there was repeated agreement.) That longevity issue particularly scares independent software vendors (ISVs). (Pointed out by John D. McCalpin, from his experience at SGI.)

The investment issue isn’t just for developing a compiler; languages are just one link in an ecosystem, and all the links are needed for success. A quote: “the perfect parallel programming language will not succeed [without, at introduction] effective compilers on a wide range of platforms, … massive support for re-writing applications,” plus analysis, debugging, databases, etc. (Horst Simon, Lawrence Berkeley Laboratory Associate Lab Director CS; from a 2004 NRC report.)

Users are highly averse to using new languages. They have invested a lot of time and effort in the skills they use writing programs in existing languages, and as a result are reluctant to change. “I don't know what the dominant language for scientific programming will be in 10 years, but I do know it will be called Fortran.” (quoted by Allan Gottlieb, NYU). And there’s a Catch-22 in effect: A language will not be used unless it is popular, but it’s not popular unless it’s used by many people. (Edelman, MIT)

Since the only point of parallelism is performance, high-quality object code is required at the introduction of the new language; this is seldom the case, and of course it adds to the cost issue. (Several)

I can completely understand the motivation for research in new languages; my Ph.D. thesis was a new language. There is always frustration with the limitations of existing languages, and also hubris: Much more than most infrastructure development, when you do language development you are directly influencing how people think, how they structure their approach to problems. This is heady stuff. (I am now of the opinion that nobody should presume to work on a new language until they have written at least 200,000 lines of code themselves – a requirement that would have ditched my thesis – but that’s another issue.)

I also would certainly not presume to stop work on language development. Many useful ideas come out of that work – ideas their developers could not initially conceive of outside a new language – which often enough then become embedded in the few standard languages and systems. Key example: Smalltalk, the language in which object-oriented programming was conceived, later to be embedded in several other languages. (I suspect that language-based Aspect-Oriented Programming is headed that way, too.)

But I’d be extremely surprised to find the adoption of some new parallel language – e.g., Sun’s Fortress or Intel’s Ct – as a significant element in taming parallelism. That’s searching under the wrong lamppost.

I think the way forward is to discover killer applications that are “embarrassingly parallel” – so clear and obvious in their parallelism that running them in parallel is not a Computer Science problem, and certainly not a programming language problem, because it’s so straightforward.

Tuesday, September 23, 2008

101 Parallel Languages (Part 2)

So, is there some history and practice that can shed some light on the issue of helping people use parallel computers? There certainly is, in the High-Performance Computing (HPC) community.

The HPC community has been trying to flog the parallel programming language horse into life for almost forty years. These are the people who do scientific and technical computing (mostly), including things like fluid dynamics of combustion in jet engines, pricing of the now-infamous tranches of mortgage-backed securities, and so on. They’re motivated. Faster means more money to the financial guys (I’ve heard estimates of millions of dollars a minute if you’re a millisecond faster than the competition), fame and tenure to scientists, product deadlines for crash simulators, and so on. They include the guys who use (and get the funding for) the government-funded record-smashing installations at national labs like Los Alamos, Livermore, Sandia, etc., the places always reported on the front pages of newspapers as “the world’s fastest computer.” (They’re not, because they’re not one computer, but let that pass for now.)

For HPC, no computer has ever been fast enough and likely no computer ever will be fast enough. (That’s my personal definition of HPC, by the way.) As a result, the HPC community has always wanted to get multiple computers to gang up on problems, meaning operate in parallel, so they have always wanted to make parallel computers easier. Also, it’s not like those people are exactly stupid. We’re talking about a group that probably has the highest density of Ph.D.’s outside a University faculty meeting. Also, they’ve got a good supply of very highly motivated, highly intelligent, mostly highly skilled workers in the persons of graduate students and post-docs.

They have certainly succeeded in using parallelism, in massive quantities. Some of those installations have 1000s of computers, all whacking on the same problem in parallel, and are the very definition of the term “Massively Parallel Processing.”

So they’ve got the motivation, they’ve got the skills, they’ve got the tools, they’ve had the time, and they’ve made it work. What programming methods do they use, after banging on the problem since the late 1960s?

No parallel languages, to a very good approximation. Of course, there are some places that use some of them (Ken, don’t rag on me about HPF in Japan or somewhere), but it’s in the noise. They use:

1. Message-Passing Interface (MPI), a subroutine package callable from Fortran, C, C++, and so on. It does basic “send this to that program on that parallel node” message passing, plus setup, plus collective operations [“Is everybody done with Phase 3?” (also known as a “barrier”) “Who’s got the largest error value – Is it small enough that we can stop?” (also known as a “reduction”)]. There are implementations of MPI exploiting special high performance inter-communication network hardware like InfiniBand, Myrinet, and others, as well as plain old Ethernet LAN, but at its core you get “send X to Y,” and manually write plain old serial code to call procedures to do that.

2. As a distant second, OpenMP. It provides similarly basic use of multiple processors sharing memory, but with more support at the language level. The big difference is the ability to do a FOR loop in parallel (FOR N=1 to some_big_number, do THIS (parameterized by N)).
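Neither model is exotic. The flavor of both can be sketched in plain Python – purely a conceptual analogue, not real MPI or OpenMP, and every function name here is my own invention: a pipe-based “send X to Y” plus a reduction, and a pool-based parallel FOR.

```python
# Conceptual analogues of the two dominant HPC styles, in plain Python.
# This sketches the programming models only, not the real libraries.
from multiprocessing import Pipe, Pool, Process

# --- MPI-style: explicit "send X to Y" plus a reduction ---------------

def worker(rank, conn):
    # Each "node" runs the same serial code on its own piece of the data.
    partial_error = rank * rank       # stand-in for a real computation
    conn.send(partial_error)          # the explicit message pass
    conn.close()

def largest_error(n_ranks):
    """Rank 0 collects everyone's value and reduces with max --
    the "who's got the largest error value?" pattern."""
    conns, procs = [], []
    for rank in range(1, n_ranks):
        parent_end, child_end = Pipe()
        proc = Process(target=worker, args=(rank, child_end))
        proc.start()
        conns.append(parent_end)
        procs.append(proc)
    result = max(conn.recv() for conn in conns)   # the "reduction"
    for proc in procs:
        proc.join()
    return result

# --- OpenMP-style: a FOR loop done in parallel ------------------------

def loop_body(n):
    return n * n                      # "do THIS (parameterized by N)"

def parallel_for(some_big_number, workers=4):
    # The runtime splits the iteration space among the workers.
    with Pool(workers) as pool:
        return pool.map(loop_body, range(1, some_big_number + 1))

if __name__ == "__main__":
    print(largest_error(4))           # 9: max of 1, 4, 9
    print(sum(parallel_for(100)))     # 338350: same answer as serial
```

Note that in both cases the code each worker runs is plain old serial code; the parallelism lives entirely in the plumbing around it, which is exactly the point.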

That’s it. MPI is king, with some OpenMP. This is basic stuff: pass a message, do a loop. This seems a puny result after forty years of work. In fact, it wouldn’t be far from the truth to say that the major research result of this area is that parallel programming languages don’t work.

Why is this?

I effectively convened my own task force on this, in the open, asking the question on a mailing list. A surprising (to me) number of significant lights in the HPC parallelism community chimed in; this is clearly a significant and/or sore point for everybody.

I’ll talk about that in my next post here.

Monday, September 22, 2008

101 Parallel Languages (Part 1)

A while ago, at a “technical” meeting of the Grand Dukes and Peers of the IBM realm, the Corporate Senior VP of Software was heard to comment (approximately) “If all these new processors are going to be parallel, don’t our clients need a new programming language to deal with this?”

This, of course, produced vigorous agreement and the instantaneous creation of a task force to produce a recommendation. Since even the whim of a Senior VP is a source of funding, the recommendation was, of course, running on rails to a preordained conclusion: Fund research and/or productization of research in some parallel language.

I happened to end up participating in that task force (this was far from automatic). In the airport on the way to its opening meeting, I happened to run into Professor Jim Brown of UT, who’s been doing parallel at least as long as I have, with more emphasis on software than I. Since I knew what his reaction was going to be, I couldn’t resist; I told him why I was going. I wasn’t disappointed. He began muttering, stamped around in a circle in frustration, then whipped around and pointed his finger at me with beetled brows and a scowl, saying “You don’t agree with that, do you?”

I surely didn’t. I told him I was arriving carrying the result of about one evening of web searching: A list of 101 parallel programming languages, all in active development, all with zero significant use. (Spreadsheet here.)

He smiled, I smiled back, and we went on our respective ways.

Why that mutual reaction? The issue is certainly not the underlying sentiment expressed by that Senior VP. That’s right on target, a completely appropriate concern: Don’t we (not just IBM) really need to provide as much help as possible to customers, to deal with these changes? It’s also clearly a correct insight that if your company helps more than others, you will win.

The issue is “programming language.” And history. And current practice.

-- To be continued, in my next posting. --

(I’m trying the experiment of breaking up long streams of thought into multiple posts. The huge ones up to now seem ungainly. Comments on that?)

Friday, September 19, 2008

History Is Written by the Victors

There's a new book to be out in October 2008, published by Intel Press, The Business Value of Service Oriented Grids, by Enrique Castro-Leon, Jackson He, Mark Chang, and Parviz Peiravi. A few chapters are available online, or just the preface in a slightly different print layout.

First, my congratulations to Enrique and his co-authors. Getting any book out the door is more work than anybody can appreciate who hasn't done it. Kudos!

I won't comment on the main content, since I'm not really qualified, "business value" books just aren't my cup of tea, and N years (for large N) in IBM left me profoundly allergic to SOA. SOA roolz. OK? It's wonderful. But I've only read one discussion of it, about 5 pages total, that didn't make my skin crawl.

However, part 3 of the download (Chapter 1) had this, which is relevant to the issues of this blog: “[In] 2004 Intel found that a successor to the Pentium 4 processor, codenamed Prescott, would have hit a ‘thermal wall.’ ... the successor chip would have run too hot and consumed too much power at a time when power consumption was becoming an important factor for customer acceptance.”

Big positive here: This event has been published, by Intel.

However, heat and power consumption "becoming an important factor for customer acceptance" is an excessively delicate way of putting it. 

As I recall, that proposed chip family was so hot that all major system vendors actually refused it. They stood up to Intel, a rather major thing to do, and rejected the proposed design as impossible to package in practical, shippable products. This was a huge deal inside Intel, causing the cancellation of all main-line processor projects and the elevation to stardom of a previously denigrated low-power design created by an outlier lab in Israel, far from the center of mass of development.

So history gets exposed in a way that placates the shareholders. 

The preface to this book also had some comments about the ancient history of virtualization that reminded me that I was there. Fodder for a future post.

(My thanks to a good friend and ex-co-worker who pointed me to this book and provided the title phrase.)

Larrabee vs. Nvidia, MIMD vs. SIMD

I'd guess everyone has heard something of the large, public, flame war that erupted between Intel and Nvidia about whose product is or will be superior: Intel Larrabee, or Nvidia's CUDA platforms. There have been many detailed analyses posted about details of these, such as who has (or will have) how many FLOPS, how much bandwidth per cycle, and how many nanoseconds latency when everything lines up right. Of course, all this is “peak values,” which still means “the values the vendor guarantees you cannot exceed” (Jack Dongarra’s definition), and one can argue forever about how much programming genius or which miraculous compiler is needed to get what fraction of those values.

Such discussion, it seems to me, ignores the elephant in the room. I think a key point, if not the key point, is that this is an issue of MIMD (Intel Larrabee) vs. SIMD (Nvidia CUDA).

If you question this, please see the update at the end of this post. Yes, Nvidia is SIMD, not SPMD.

I’d like to point to a Wikipedia article on those terms, from Flynn’s taxonomy, but their article on SIMD has been corrupted by Intel and others’ redefinition of SIMD to “vector.” I mean the original. So this post becomes much longer.

MIMD (Multiple Instruction, Multiple Data) refers to a parallel computer that runs an independent, separate program – that’s the “multiple instruction” part – on each of its simultaneously-executing parallel units. SMPs and clusters are MIMD systems. You have multiple, independent programs barging along, doing things that may have nothing to do with each other, or may be closely related. When they are related, they barge into each other at least occasionally, hopefully as intended by the programmer, to exchange data or to synchronize their otherwise totally separate operation. Quite regularly the barging is unintended, leading to a wide variety of insidious data- and time-dependent bugs.

SIMD (Single Instruction, Multiple Data) refers to a parallel computer that runs the EXACT SAME program – that’s the “single instruction” part – on each of its simultaneously-executing parallel units. When ILLIAC IV, the original 1960s canonical SIMD system, basically a rectangular array of ALUs, was originally explained to me late in grad school (I think possibly by Bob Metcalfe) it was put this way:

Some guy sits in the middle of the room, shouts ADD!, and everybody adds.

I was a programming language hacker at the time (LISP derivatives), and I was horrified. How could anybody conceivably use such a thing? Well, first, it helps that when you say ADD! you really say something like “ADD Register 3 to Register 4 and put the result in Register 5,” and everybody has their own set of registers. That at least lets everybody have a different answer, which helps. Then you have to bend your head so all the world is linear algebra: Add matrix 1 to matrix 2, with each matrix element in a different parallel unit. Aha. Makes sense. For that. I guess. (Later I wrote about 150 KLOC of APL, which bent my head adequately.)

Unfortunately, the pure version doesn’t make quite enough sense, so Burroughs, Cray, and others developed a close relative called vector processing: You have a couple of lists of values, and say ADD ALL THOSE, producing another list whose elements are the pairwise sums of the originals. The lists can be in memory, but dedicated registers (“vector registers”) are more common. Rather than pure parallel execution, vectors lend themselves to pipelining of the operations done. That doesn’t do it all in the same amount of time – longer vectors take longer – but it’s a lot more parsimonious of hardware. Vectors also provide a lot more programming flexibility, since rows, columns, diagonals, and other structures can all be vectors. However, you still spend a lot of thought lining up all those operations so you can do them in large batches. Notice, however, that it’s a lot harder (but not impossible) for one parallel unit (or pipelined unit) to unintentionally barge into another’s business. SIMD and vector, when you can use them, are a whole lot easier to debug than MIMD because SIMD simply can’t exhibit a whole range of behaviors (bugs) possible with MIMD.
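The “ADD ALL THOSE” style is easy to show in miniature – a pure-Python sketch, with plain lists standing in for vector registers:

```python
# Vector-style "ADD ALL THOSE": one operation expresses all the
# pairwise sums of two lists, the way a single vector ADD would.
from operator import add

def vector_add(a, b):
    # Whole-vector form: what a vector unit (or an APL programmer) sees.
    return list(map(add, a, b))

def scalar_add(a, b):
    # The equivalent scalar form: one element per "instruction".
    result = []
    for i in range(len(a)):
        result.append(a[i] + b[i])
    return result

if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [10.0, 20.0, 30.0, 40.0]
    print(vector_add(a, b))           # [11.0, 22.0, 33.0, 44.0]
```

Both forms compute the same thing; the difference is that in the vector form the batching is explicit in one operation, which is precisely what the hardware can pipeline.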

Intel’s SSE and variants, as well as AMD and IBM’s equivalents, are vector operations. But the marketers apparently decided “SIMD” was a cooler name, so this is what is now often called SIMD.

Bah, humbug. This exercises one of my pet peeves: Polluting the language for perceived gain, or just from ignorance, by needlessly redefining words. It damages our ability to communicate, causing people to have arguments about nothing.

Anyway, ILLIAC IV, the CM-1 Connection Machine (which, bizarrely, worked on lists – elements distributed among the units), and a variety of image processing and hard-wired graphics processors have been rather pure SIMD. Clearspeed’s accelerator products for HPC are a current example.

Graphics, by the way, is flat-out crazy mad for linear algebra. Graphics multiplies matrices multiple times for each endpoint of each of thousands or millions of triangles; then, in rasterizing, for each scanline across each triangle it interpolates a texture or color value, with additional illumination calculations involving normals to the approximated surface, doing the same operations for each pixel. There’s an utterly astonishing amount of repetitive arithmetic going on.
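As a toy illustration of that repetitive arithmetic (pure Python, 2D instead of the real 4x4 homogeneous-coordinate case): the same small matrix multiply, applied identically to every vertex in a list.

```python
# The repetitive heart of graphics: one small matrix transform applied
# identically to every vertex. (Toy 2x2 case; real pipelines use 4x4
# homogeneous coordinates, but the pattern is the same.)
import math

def transform(matrix, points):
    (a, b), (c, d) = matrix
    # Identical arithmetic for every point -- a natural fit for
    # SIMD/vector hardware, since there's no per-point control flow.
    return [(a * x + b * y, c * x + d * y) for (x, y) in points]

if __name__ == "__main__":
    theta = math.pi / 2               # rotate every vertex 90 degrees
    rot = ((math.cos(theta), -math.sin(theta)),
           (math.sin(theta),  math.cos(theta)))
    verts = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
    for v in transform(rot, verts):
        print(v)
```

Multiply that by millions of vertices, then by the per-pixel work in rasterization, and the sheer volume of identical operations is why graphics hardware went SIMD in the first place.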

Now that we’ve got SIMD and MIMD terms defined, let’s get back to Larrabee and CUDA, or, strictly speaking, the Larrabee architecture and CUDA. (I’m strictly speaking in a state of sin when I say “Larrabee or CUDA,” since one’s an implementation and the other’s an architecture. What the heck, I’ll do penance later.)

Larrabee is a traditional cache-coherent SMP, programmed as a shared-memory MIMD system. Each independent processor does have its own vector unit (SSE stuff), but all 8, 16, 24, 32, or however many cores it has are independent executors of programs. As are each of the threads in those cores. You program it like MIMD, working in each program to batch together operations for each program’s vector (SIMD) unit.

CUDA, on the other hand, is basically SIMD at its top level: You issue an instruction, and many units execute that same instruction. There is an ability to partition those units into separate collections, each of which runs its own instruction stream, but there aren’t a lot of those (4, 8, or so). Nvidia calls that SIMT, where the “T” stands for “thread” and I refuse to look up the rest because this has a perfectly good term already existing: MSIMD, for Multiple SIMD. (See pet peeve above.) The instructions it can do are organized around a graphics pipeline, which adds its own set of issues that I won’t get into here.

Which is better? Here are basic arguments:

For a given technology, SIMD always has the advantage in raw peak operations per second. After all, it mainly consists of as many adders, floating-point units, shaders, or what have you, as you can pack into a given area. There’s little other overhead. All the instruction fetching, decoding, sequencing, etc., are done once, and shouted out, um, I mean broadcast. The silicon is mainly used for function, the business end of what you want to do. If Nvidia doesn’t have gobs of peak performance over Larrabee, they’re doing something really wrong. Engineers who have never programmed don’t understand why SIMD isn’t absolutely the cat’s pajamas.

On the other hand, there’s the problem of batching all those operations. If you really have only one ADD to do, on just two values, and you really have to do it before you do a batch (like, it’s testing for whether you should do the whole batch), then you’re slowed to the speed of one single unit. This is not good. Average speeds get really screwed up when you average with a zero. Also not good is the basic need to batch everything. My own experience in writing a ton of APL, a language where everything is a vector or matrix, is that a whole lot of APL code is written that is basically serial: One thing is done at a time.
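That “averaging with a zero” arithmetic is just Amdahl’s Law in SIMD clothing; a quick sketch with made-up numbers shows how a small unbatchable fraction caps the speedup of an arbitrarily wide machine:

```python
# Amdahl-style arithmetic for batching: even a small fraction of work
# that runs at one-unit speed caps the speedup, however wide the SIMD.
def effective_speedup(serial_fraction, simd_width):
    # Normalized time: serial part at 1x plus parallel part at width-x.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / simd_width)

if __name__ == "__main__":
    # Hypothetical numbers: a 1024-wide machine, 1% of the work serial.
    print(round(effective_speedup(0.01, 1024), 1))   # ~91x, not 1024x
```

With zero serial work the full 1024x is there; push the serial fraction to just 1% and roughly nine-tenths of the peak evaporates, which is the whole batching problem in one line of arithmetic.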

So Larrabee should have a big advantage in flexibility, and also familiarity. You can write code for it just like SMP code, in C++ or whatever your favorite language is. You are potentially subject to a pile of nasty bugs that aren’t there in the SIMD world, but if you stick to religiously using only parallel primitives pre-programmed by some genius chained in the basement, you’ll be OK.

[Here’s some free advice. Do not ever even program a simple lock for yourself. You’ll regret it. Case in point: A friend of mine is CTO of an Austin company that writes multithreaded parallel device drivers. He’s told me that they regularly hire people who are really good, highly experienced programmers, only to let them go because they can’t handle that kind of work. Granted, device drivers are probably a worst-case scenario among worst cases, but nevertheless this shows that doing it right takes a very special skill set. That’s why they can bill about $190 per hour.]

But what about the experience with these architectures in HPC? We should be able to say something useful about that, since MIMD vs. SIMD has been a topic forever in HPC, where forever really means back to ILLIAC days in the late 60s.

It seems to me that the way Intel's headed corresponds to how that topic finally shook out: A MIMD system with, effectively, vectors. This is reminiscent of the original, much beloved, Cray SMPs. (Well, probably except for Cray’s even more beloved memory bandwidth.) So by the lesson of history, Larrabee wins.

However, that history played out over a time when Moore’s Law was producing a 45% CAGR in performance. So if you start from basically serial code, which is the usual place, you just wait. It will go faster than the current best SIMD/vector/offload/whatever thingy in a short time and all you have to do is sit there on your dumb butt. Under those circumstances, the very large peak advantage of SIMD just dissipates, and doing the work to exploit it simply isn’t worth the effort.

Yo ho. Excuse me. We’re not in that world any more. Clock rates aren’t improving like that any more; they’re virtually flat. But density improvement is still going strong, so those SIMD guys can keep packing more and more units onto chips.

Ha, right back at ‘cha: MIMD can pack more of their stuff onto chips, too, using the same density. But… It’s not sit on your butt time any more. Making 100s of processors scale up performance is not like making 4 or 8 or even 64 scale up. Providing the same old SMP model can be done, but will be expensive and add ever-increasing overhead, so it won’t be done. Things will trend towards the same kinds of synch done in SIMD.

Furthermore, I've seen game developer interviews where they strongly state that Larrabee is not what they want; they like GPUs. They said the same when IBM had a meeting telling them about Cell, but then they just wanted higher clock rates; presumably everybody's beyond that now.

Pure graphics processing isn’t the end point of all of this, though. For game physics, well, maybe my head just isn't built for SIMD; I don't understand how it can possibly work well. But that may just be me.

If either doesn't win in that game market, the volumes won't exist, and how well it does elsewhere won't matter very much. I'm not at all certain Intel's market position matters; see Itanium. And, of course, execution matters. There Intel at least has a (potential?) process advantage.

I doubt Intel gives two hoots about this issue, since a major part of their motivation is to ensure that the x86 architecture rules the world everywhere.

But, on the gripping hand, does this all really matter in the long run? Can Nvidia survive as an independent graphics and HPC vendor? More density inevitably will lead to really significant graphics hardware integrated onto silicon with the processors, so it will be “free,” in the same sense that Microsoft made Internet Explorer free, which killed Netscape. AMD sucked up ATI for exactly this reason. Intel has decided to build the expertise in house, instead, hoping to rise above their prior less-than-stellar graphics heritage.

My take for now is that CUDA will at least not get wiped out by Larrabee for the foreseeable future, just because Intel no longer has Moore’s 45% CAGR on its side. Whether Nvidia will survive as a company depends on many things not relevant here, and on how soon embedded graphics becomes “good enough” for nearly everybody, and “good enough” for HPC.


Update 8/24/09.

There was some discussion on Reddit of this post; it seems to have aged off now – Reddit search doesn’t find it. I thought I'd comment on it anyway, since this is still one of the most-referenced posts I've made, even after all this time.

Part of what was said there was that I was off-base: Nvidia wasn’t SIMD, it was SPMD (separate instructions for each core). Unfortunately, some of the folks there appear to have been confused by Nvidia-speak. But you don’t have to take my word for that. See this excellent tutorial from SIGGRAPH 09 by Kayvon Fatahalian of Stanford. On pp. 49-53, he explains that in “generic-speak” (as opposed to “Nvidia-speak”) the Nvidia GeForce GTX 285 does have 30 independent MIMD cores, but each of those cores is effectively 1024-way SIMD: It works with groups of 32 “fragments” running as I described above, multiplied by 32 contexts also sharing the same instruction stream for memory stall overlap. So, to get performance you have to think SIMD out to 1024 if you want to get the parallel performance that is theoretically possible. Yes, then you have to use MIMD (SPMD) 30-way on top of that, but if you don’t have a lot of SIMD you just won’t exploit the hardware.

Thursday, September 18, 2008

Background: Me and This Blog

Some people still remember me as the author of In Search of Clusters, as I found out when I began posting to the Google group cloud-computing. A hearty Thank You! to all of them. For the rest of you, why not buy a copy? It's still in print, and I still get royalties. :-) (Really only a semi-smiley. Royalties are good.)

For the rest of you, and as a reminder, here’s a short bio:

I was until recently a Distinguished Engineer in the IBM Systems & Technology Group in Austin, Texas. I’m the author of In Search of Clusters, currently in its second edition and still occasionally referred to as “the bible of clusters” even seven years after its publication. I was also, back in the '80s, Chief Scientist of the RP3 massively-parallel computing project in IBM Research (joint with NYU). My job in Austin was to be a leader of future system architecture work, particularly in the areas of accelerators and appliances, and I was chair of the InfiniBand industry standard management workgroup. I've worked on parallel computing for over 30 years, and hold something around 30 patents in parallel computing and computer communications.

I’m currently retired, living in Austin, TX, and intending to move to Colorado (just North of Denver), where I've been invited to be part-time on the faculty of CSU - as soon as I can sell my house. That was supposed to be over and done three or four months ago, but things have ground to a trickle now that buyers find it virtually impossible to get a mortgage. Austin was pretty immune to housing slowdowns until that happened. Dang. Anyway.

In the meantime, I’m working on another book. Its working title is The End of Computing (As We Knew It), and my intention with it is to expose the hole into which the computer industry may be plunging by turning to explicit parallelism. I'm not implying there's any other realistic choice, mind you, but I got a snoot-full while in IBM of how (a) hardware engineers haven’t a clue what a bitch that is to program; (b) most software engineers haven’t a clue that this is being done to them; (c) many upper-management and analyst types have their heads firmly stuck in the sand on what this all will mean.

It particularly bugs me that people still blather on about how Moore's Law will keep on trucking for decades. Maybe it will, interpreted literally. But the Moore's Law that will keep on trucking has been castrated. It lacks a key element (frequency scaling) that drove the computing industry for the last four or more decades. This is a classic case of experts focussing on the veins in the leaves on the trees and ignoring the ravine they're about to fall into. 

More. I have this suspicion that many people who really understand how deep into the doodoo we're going are weasel-wording it deliberately. No point in frightening the hoi polloi, now, is there? Maybe there's a cure, who knows, we're not there yet, hm? Horsepuckey. 

Or else they're just scared to death and hoping like crazy they'll wake up one morning and somebody will have solved the problem. Unfortunately, that ignores 30 years of history, my friends.

That's the topic. Obviously, I feel strongly about it. This is good. It's motivational. 


I am sorry, and somewhat ashamed, to have to say that I’ve made nearly no progress on actually writing the book over the last year. I've messed around starting a couple of chapters, and have an outline, but that's all. 

I have been spending a lot of time just keeping up with what’s being said all around this subject, and have amassed a huge number of web references and comments – 600+ MB of them, in fact. Stitching it all together is, however, requiring more focused continuous effort than retirement with a decent pension and waiting for a house to sell seems to induce. (Why do it today? There’s always tomorrow.)

It finally occurred to me, through a devious chain of events, that it might help make some of my data collection and musings public.

Hence this blog.

You can expect to find here comments on things like:

  • Semiconductor technology – in a simple way. I’m no expert in this area, but it’s the basis of the whole issue.
  • What it takes to make parallel computer hardware, like SMPs and clusters (farms, etc.), interconnects and communications issues in particular.
  • Issues and difficulties involved in programming that hardware, and what has been learned in about 40 years of experience in the HPC arena (a point seemingly lost in most discussions).
  • Ways to make use of that hardware that don’t require explicit parallel programming by the mass of programmers, like virtualization and transaction processing.
  • Yet Another Way to use all those transistors: accelerators. And why they may now finally have legs and stick around for a while (hint: you can’t win against a 45% CAGR increase in clock speed).
  • Graphics accelerators and, in particular, Larrabee vs. Nvidia / ATI (AMD). Why it’s needed. Who’s betting on which characteristic, whether they know it or not. (Lessons from HPC.)
  • Possible killer apps for parallel systems, like virtual worlds, graphics, stream processing, cloud computing (sort of), grids (sort of).
  • What this all means to the industry.
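That accelerator "hint" can be made concrete with a little arithmetic: a fixed speedup from special-purpose hardware gets overtaken in just a few years by general-purpose clocks compounding at 45% a year, but lasts indefinitely once that compounding stops. A minimal sketch, where the 10x speedup and both growth rates are illustrative assumptions:

```python
import math

# How long until general-purpose CPUs, compounding at a given annual
# growth rate, catch up with a fixed accelerator speedup?
def years_to_catch_up(speedup, cagr):
    """Solve (1 + cagr)**t = speedup for t; infinite if clocks are flat."""
    if cagr <= 0:
        return math.inf
    return math.log(speedup) / math.log(1 + cagr)

# Illustrative: a 10x accelerator against the old 45%/yr clock treadmill...
print(years_to_catch_up(10, 0.45))  # a bit over 6 years
# ...and against today's near-flat single-thread clocks
print(years_to_catch_up(10, 0.0))   # inf: the speedup is never overtaken
```

That six-year horizon is roughly why accelerators historically kept dying off; with the treadmill stopped, the calculus changes.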

Readers of In Search of Clusters will notice a lot of familiar material there, and a lot that's new. One big difference is that I'm going to aim for a more popular audience this time, trying to make it accessible to more people. There are two reasons for this: First, I think the topic needs to be brought out into the open, outside the confines of the industry. Second, that sells lots more copies of any book.

In this blog I’ll probably also grouch and rave about some things that have nothing to do with any of the above, just to clear my head now and again. Like a short lesson in why not to try to carry a Tai Chi (Taijiquan) jian, a three-foot straight sword, when flying from Beijing to Shanghai. Or my struggles with using Word to produce a book manuscript, which apparently I have to do these days.

I'm also newly twitter-ified (twitified?) as GregPfister, and will try to keep that channel stuffed, too. But this will be where the data is, of necessity.

Oh, and anybody know what the situation is on copyright for blog contents? Until I know otherwise, I’m going to have to be really conservative about posting excerpts of work in progress on the new book.

Anyway, Hi!

I'm here.

(tap, tap) This thing on?

Anybody listening?