I really wanted an Xbox 360.
My old PS2 is showing its age, and I wanted to upgrade to a new system as soon as I finished the last few missions of GTA: Vice City Stories — especially now that it looks like Manhunt 2 won’t be coming out for PS2 any time soon. I’m a huge fan of the GTA series, and I’m especially psyched about GTA4. I grew up in Brooklyn, on a block that looks a more than a little like a GTA4 screenshot.
But then something happened…
But a couple of weeks ago my plans changed. Jenny was happily thrashing away on Guitar Hero, when her TV screen just went blank. She looked down at her console, which had suddenly gone quiet. (That’s pretty noticeable, apparently, because the Xbox 360 is a really loud machine… which, as it turns out, is important to our story.) Much to her disappointment, she saw those three telltale LEDs that every Xbox owner dreads: the red ring of death.
Luckily, Jenny’s 360 lasted long enough so that she could take advantage of the Xbox 360 service site that Microsoft launched earlier this month. But her poor console was just the latest in a long line of casualties. Some retailers estimate that 30% of Xbox 360s need repair, and we’ve seen plenty of anecdotal evidence that gamers are unhappy. It’s costing Microsoft sales and spooking investors. Microsoft is doing everything they can to fix the problem — they’ve extended the warranty to three years, and it’s costing them over a billion dollars. But it’s a real mess.
As much as I want a new console, I’m not going to buy one until I know that it won’t break. I still plan on getting a 360, but not until I can be reasonably sure that I won’t have to return it. By the time I eventually get one, hopefully they’ll have figured out how to make it quieter. I’m certain that I’m not the only one who’s decided to put off buying an Xbox. And that’s bad news for Microsoft.
So what can we, as software developers, learn from the Xbox 360 fiasco?
Software people like us have a nasty habit of dismissing hardware problems as if they have nothing to do with us. We tend to think that designing software is really different from building hardware. And sure, there are definitely differences. We don’t have to worry about assembly lines, product getting damaged in shipment, or those pesky laws of physics that can prove to be such an irritating limitation when you have to design physical objects.
And it’s easy to dismiss the Xbox 360 failure as one of those unfortunate things that falls into that last category of physical faults. There’s a great Tech-On! article that gives us the dirt on exactly what’s caused the problem. It’s an excellent post-mortem on what amounts to terrible thermal design.
For those of you who’ve never taken a computer apart, here’s a little background information. Dealing with heat is an important part of modern computer design. Computer processors generate a lot of heat — so much that if you don’t come up with a way to get rid of it, they’ll fry themselves. So computer manufacturers will typically attach a heat sink to a processor. A heat sink is basically just a big radiator with fins or poles that lets air circulate and draw away the heat. (I once roasted a Pentium 4 processor by popping its heat sink off while the computer was running, just to see what would happen. It went “poof”.) A lot of processors are too hot even for heat sinks; in that case, you’ll need to stick a fan on top of it to cool it off. That’s why some computers are so noisy: they need fans to keep them cool.
It turns out that the Xbox 360 generates far too much heat, and a lot of people speculate that when that heat builds up past a critical point it unseats the GPU (a separate processor that’s used for graphics). Microsoft has so far refused to comment on exactly what the problem is, but as time goes on there does seem to be some consensus forming about it. And that Tech-On! article seems to have found a smoking (heh) gun.
But that’s just the hardware stuff. What does that have to do with building better software?
The punchline for all of this came at the end of that Tech-On! article, and it’s why I think this whole incident is so interesting. Here’s what it said:
Finally, we opened the chassis of the Xbox 360 repaired in May 2007 and compared it with the other Xbox 360 we purchased in late 2005.
“Huh? The heat sinks and fans are completely identical, aren’t they?”
To our surprise, the composition of the repaired Xbox 360 looked completely the same as that of the Xbox 360 purchased in late 2005. It turned out that Microsoft provided repair without changing the Xbox 360’s thermo design at least until May 2007.
The repaired units weren’t replaced with ones that had a better design. They were the same — as far as they could tell, Microsoft just replaced a broken unit with one that hadn’t broken yet. That’s probably why we’re seeing various reports of repeated breakdowns.
What that tells me is that the design of the Xbox 360 is deeply flawed, and that design flaw has already cost Microsoft well over a billion dollars. And it’s that flawed design that can teach us a whole lot about our own software projects.
So what does this all mean for us developers? Well, for the more cynical among us, it could just mean a whole lot of job security. I’ve met COBOL programmers who charge ridiculous amounts of money to maintain aging systems. But while their jobs pay well, personally they sound tedious and awful to me. Does anyone really aspire to spend years patching an aging software system? Most programmers will tell you that maintaining old systems is the worst part of the job. If you love designing new and innovative software, then the last thing you want to do is get your career stuck in maintenance mode.
And that’s what Microsoft is learning with the Xbox 360. I’m not a thermal design expert, but I am absolutely positive that they could have come up with a different design that wouldn’t fail so often. And while it may have cost more money to design the system and build each unit, I sincerely doubt those extra costs would have added up to over a billion dollars. And maybe the extra design time might have cost them more time… but now there are plenty of us who aren’t buying the system because we don’t want to be stung by the rampant quality problems.
Had Microsoft designed the system properly in the first place, they wouldn’t be in this mess now. And that’s the big lesson for us to learn. Oddly enough, it’s not a new lesson… in fact, it’s a pretty old one. One way to look at the Xbox thermal problem is to see it as a design defect that wasn’t caught until after the product was shipped.
Look what I found in an old 1997 issue of Windows Tech Journal… it’s an article by one of our favorite authors, Steve McConnell, called “Upstream Decisions, Downstream Costs”. The article lays out a scenario that most of us will recognize immediately: a fictional software company runs into problems because they don’t do enough planning up front, and end up getting buried with bugs, which cause awful delays. It also has a chart that anyone who’s read a few software engineering textbooks will recognize, showing that the earlier a bug is introduced in the project and the later it’s caught, the more expensive it is to fix.
So now we’ve seen a good, real-world situation where better design practices would have saved a whole lot of money. But what can we do about it in our own projects?
First and foremost, this gives us more ammunition when arguing with our coworkers and our bosses for more time to design our software. It’s really easy to get frustrated during the design phase of a software project, when a few people are generating a lot of paper or diagrams but nobody’s working on the code yet. That’s one of the things that we pointed out in our first book, Applied Software Project Management — that finding problems too late can sink projects. Luckily, there’s a relatively painless fix: adopt good review practices.
This is something that our friends in the open source world are really good at. Jenny and I talked about this in an ONLamp.com article we wrote last year called “What Corporate Projects Should Learn from Open Source”. A lot of high-profile, successful open source projects have very careful reviews, where they scrutinize scope and design decisions before they start coding. (To be fair, a lot of high-profile, successful closed source projects do the same, but we can’t just go to their websites and see their review results.)
So the moral of the story is that it often costs less to spend more time and money on design up front. And I bet there are some Microsoft shareholders that will agree.