Live for the Ding!

March 11, 2009

Patch Happy, Part Three

Filed under: Software — iridar @ 4:10 pm

Well, it seems the patches from yesterday (was I the only one that got two?  Or did I just miss one from Monday?) have cleared up the crash to desktop at the selection screen, at least for me.  So, it’s high time to wrap up this little rant about what happens and why.

We’ve answered the what and the why, really, so the only question that I had revolves around my guildmate’s comment, “Eh, patches cause crashes”.  Why are we all so indifferent to this?  Why do we accept software that locks, crashes, eats our data and causes us to tear our hair out in frustration as the norm?

This is not limited to games, not by any means.  I’m talking about everyone, all users of software.  Do we like crashy software?  Hell, no.  Do we get frustrated, call customer service, go round and round and finally delete the piece of crap, fuming all the while about how nice it would have been if it had just worked?  Oh, yes.  But are we ever surprised?  No.  Because software causes crashes.  And why do we accept that as Truth?

Let me tell you a little story.  Now, this is apocryphal, because I did not actually see it happen; it was before I entered the workforce.  But it rings so true that, even if it is not completely correct in detail, it certainly captures the spirit of the issue.  Some years ago there was a program that was written to help do complicated and tedious mathematics.  It started as a small idea and was added to and modified over the course of a couple of years.  Eventually, someone realized that people might pay money for this no-longer-little application, so they started selling it.  Eventually, someone did something with the program that had not ever been done (or tested!) before and noted that the program supplied the wrong answer.  They also noted that if they had been using the program to do something important, like, say, designing load tolerance for a bridge, this kind of error could lead to the collapse of the bridge.  While there were cars and such on it.  Ouch.

So they went back to the developer and told them about the problem.  Now, at some level, the developer or one of his office mates must have known something like this was likely to happen because when they sold them the application, there was a little bit of text saying that they (the selling/developing company) could not be held responsible in the event the program was wrong.

WTH?  Talk about the ultimate get out of jail free card!  So for years, we as a consumer base have been buying into the idea that of course, software ships with bugs.  You can’t get every one, after all.

Bull.  Remember the triangle?  That line is a way to wriggle out from the “quality” side.  You have a deadline, you’ve capped your budget, you need to get this puppy out the door!  What do you do?  Slash quality of course.  The users aren’t expecting a bug-free product anyhow.  Why bust the bank over this?

This is the prevailing attitude of almost everyone in the software business I have ever worked with, from maintenance programmers to VPs of development.  Release the product when it’s “good enough”, and the users will complain, but not leave.

Well, that’s wrong.  And we, all of us, are letting them get away with it.  As long as we shrug our shoulders and say that it’s OK for the developers to cut corners on quality and give us something less than what we purchased, they will continue.  It’s a winning model for them.  They get our dollars, users stay and are at least satisfied if not happy, and they get to keep their jobs.

I say enough.  I say it’s not right that Mythic obviously shorted their testing cycle and rushed 1.2 out the door.  My user experience was the worse for it, and so was that of many others.  Is it life threatening?  Of course not.  But the scary thing is, one day it will be.  Do you really want a medical device/automatic braking system/space shuttle designed by an industry that ships software when it’s “good enough”?  I sure don’t.

Shape up, Mythic.  We’re watching you.

March 6, 2009

Patch Happy, Part Two

Filed under: Software — iridar @ 4:02 pm

I’m baaaaack.

Last time I was talking about the recent patch to WAR, and the sudden CTDs I was getting as a result each time I tried to log out to the character selection screen.  My guildmate’s reaction to this was “Eh, patches cause crashes.”

There are two things I want to examine about this situation.  One, what might have caused a game to be released with a bug so severe that it shuts the client down, and two, why we as users think not just that this is OK, but that it is expected and accepted.

As I said before, I have been professionally developing commercial software for 20 years now.  Standard disclaimer applies: I have never programmed a game in my life, and I certainly have no knowledge of the working of the WAR program.  That said, there are some overarching concepts in software development that apply regardless of what you are developing.

In part one I brought up the concept of the “software triangle” and mentioned I did not think that it was responsible for the bugs in this latest patch.  While it’s certainly possible, the honest truth of the matter is that this patch has been very smooth for the most part.  The fact that not everyone is crashing to desktop (I and Jennifer at GirlIRL are the only two that I know of at this point) indicates that the software triangle probably was not the cause.  Problems with the triangle are usually more catastrophic.

So what did happen?  Most likely a simple truth is the cause of our CTD woes: Software is complicated.  All software, even the simplest “Hello, world” program, is complicated.  Those out there who remember Charles Petzold’s “Programming the OS/2 Presentation Manager”, and the eight pages of raw C init code you needed just to get your “Hello” to display, know what I’m talking about.  Add multiple layers to that, interacting subsystems, OS, hardware and myriad other things and you start to get an idea what I mean by “complicated”.

So, let’s look at this CTD issue in WAR 1.2.  For those not experiencing it, when you log out of the game to go to the character selection screen, the game window disappears and shuts itself down, poof.  This is colloquially known as a “crash to desktop”; developers refer to it as an “abend” (that’s ab-end, as in “abnormal end”).  The most common reason for that to happen is that the program tried to use some data that simply was not there, something known as a “null pointer reference”.  When the program cannot find what it is looking for, and there is no protective code surrounding it (i.e., no error catching), the application simply shuts itself off.
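For the curious, here is roughly what that looks like in code.  This is a minimal sketch in Python, not anything from WAR: the Session class, the character field and both functions are names I made up for illustration, and Python’s None stands in for a null pointer.

```python
# Hypothetical names throughout; this is an illustration, not WAR's code.

class Session:
    def __init__(self, character):
        # character is a dict of character data, or None if it never loaded
        self.character = character


def character_name_unguarded(session):
    # No protective code: if session.character is None this raises
    # TypeError, and with nothing to catch it the program terminates.
    return session.character["name"]


def character_name_guarded(session):
    # Protective code: check for the missing data before using it.
    if session.character is None:
        return "<unknown>"
    return session.character["name"]


if __name__ == "__main__":
    broken = Session(None)
    print(character_name_guarded(broken))  # falls back instead of crashing
```

Call character_name_unguarded on that same broken session and the program dies with a traceback, which is the scripting-language cousin of the window going poof.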

So, that does not sound too bad, right?  We know that it happens something like 4 out of 5 times (I was able to log out successfully once) going to the character screen, we know we are looking for a “null pointer reference”, whatever that is, so it should not be too hard to find, right?

Wrong.  The great bugaboo with errors like this is inconsistency.  4 out of 5 times means that 20% of the time the code is not hitting the “null pointer reference”.  Why?  It’s only happening on some people’s machines.  Why?

Now you start to see (I hope) why I said I did not think that the software triangle was responsible for this error.  If it were, this would have happened on everyone’s machine, all the time (OK, OK, most machines most of the time…happy?).  The point is, it probably would have been caught before it went out the door.  But, a crash that only happens on some machines some of the time is much less likely to be caught for the reason I stated above.  It’s complicated.  Lining up all the moving parts of a complex program like an MMO and stepping through every single iteration of every single possibility of everything that could go wrong…would be nearly impossible.  It makes cleaning the Augean stables look like child’s play.
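To put a rough number on “nearly impossible”, here is a back-of-the-envelope sketch in Python.  Every count in it is invented purely for illustration (I have no idea what Mythic’s real test matrix looks like), but it shows how quickly a handful of independent variables multiply out.

```python
import math

# Invented counts, purely for illustration; none of these come from Mythic.
variables = {
    "video_driver_versions": 40,
    "os_versions": 6,
    "sound_cards": 15,
    "ui_settings": 2 ** 10,  # ten independent on/off options
    "zones_to_log_out_from": 30,
}

# Every value of every variable can be combined with every other one.
total_combinations = math.prod(variables.values())

SECONDS_PER_DAY = 86400
days_at_one_test_per_second = total_combinations / SECONDS_PER_DAY

print(total_combinations)           # 110592000
print(days_at_one_test_per_second)  # 1280.0
```

That is about three and a half years of running one test per second around the clock, from just five made-up variables, and a real MMO has far more than five.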

So we’ve answered question one, above, why did this happen.  Because testing every single possibility in a complex system is not feasible (note I did not say ‘possible’) in a reasonable amount of time, so bugs will slip through.  I’ll save the answer to question two for the next installment.  See you then!

March 4, 2009

Patch Happy, Part One

Filed under: Software — iridar @ 7:02 pm

Well, we got the new patch last night.  Overall I think the quality of the patch is pretty good, but I do have a couple of points that are worth examining.

First, would Mythic please take a clue from other MMOs like City of Heroes and kick off the “big patch” when the user logs off, instead of when they log on?  I sat down to play and was treated to a 171MB update, which was of course happening at the same time as everyone else was trying to get the same 171 megs.  To be fair, I have a slow (read: DSL) connection and even so the entire patch took only about an hour, but that was an hour I could have been snuffing Orcs and other tasty Destro minions.

OK, rant off.  Now to climb on my soapbox for real.  Since patching, I am getting a lot of CTD (crash to desktop, for the two of you who have not heard the term) whenever I log out to the character selection screen.  It’s not a huge big deal (at least it has not happened in the middle of a scenario or something!), but it does mean that I have to re-start the game and wait while it does all its leader screens yadda, yadda, yadda. 

More importantly, when an application crashes like that it’s very often the result of a misplaced pointer in the code (or as we used to say “the pointer is IN THE OPERATING SYSTEM”), which can have nasty side effects and lead to even less desirable results, like crashing the video card or the dreaded Blue Screen of Death.

So I log back in and apologize to the guildmate I was talking to when it happened for abruptly leaving the conversation.  His reaction was “Eh, patches cause crashes.”

WTH?

Now, I’m not taking umbrage with him, because of course this is a true statement.  My reaction is more to the fact that we all accept that “patches cause crashes” when what we should be demanding is a better quality product for our dollar.  Let me say up front that I develop software for a living (not games, but application software), and this is not a problem that is exclusive to games.  This is endemic to all software development I have ever seen or worked on.  The question is, why do we keep letting the developers get away with writing crap and then patching over the crap with more crap until the system basically accidentally works and they call it done?

To avoid the hypocrisy inherent in my argument, I include myself in the list of developers that release crap.  Under the guise of anonymity, I will admit that I have released software knowing full well that there were bugs in it that could cause as serious an error as a CTD, and the decision was made to send it out and patch it later.  Notice that I switched to the passive voice, there.  While I will take responsibility for the release (I was running the engineering department, after all), the decision to release the application came from higher up.

Someone a long time ago told me that the answer to “Why?” was always “Money”, and that is certainly the case in every instance I have been involved in where poor quality software was released to a customer base.  It is deemed too expensive to develop most software to the quality level needed to ensure a really reliable, robust product.  Is that true?  I contend that it is not, but that is an argument for a different post.

From a developer’s perspective, we want to write good software (yes, there are nuts out there who like to write viruses, but I’m talking about the majority of people, not the crazy minority).  We got into software because we love it, we want people to use our applications, play our games, and we want to be the solution to a problem.  We don’t want to write buggy crap, applications that don’t do what they should, or that are incomprehensible, or that crash every 15 minutes.

So why don’t we?  Well, there are a lot of reasons.  Let’s start with the one that I have seen be responsible for inflicting the most heinous releases in my career.  There is a little triangle we in the software biz use that has cost on one side, quality on another and time on the third.  What this means is that with any development effort, you can have any two of these, but not all three.  You can be on time and spend little, but your quality will suffer.  You can have a quality product on schedule, but cost goes up.  Or you can have a quality product with low cost, but your timeline grows.

These are all relative to each other.  The more quality you try to pack in, the higher your costs and the longer your timeline.  Try to reduce costs and your deadline stretches over the horizon and you never finish.  Reining in the timeline means increased costs (often in the form of added bodies, which is another topic for a different post) and reduces the quality of the product.  Decreasing the cost (possibly by NOT adding more bodies) means that the end product will be so buggy as to be unusable.

Is this what I think happened with the latest patch for WAR?  Actually, it’s not.  But, I’m going to save my thoughts on that for another post, because I note that as I warm to my rhetoric, this post is getting awfully lengthy, so I’ll spare a few tidbits for later on.
