Thursday, October 27, 2005

Debuggers

It frequently surprises me, when I join a software development organisation, how difficult people find debugging customer problems.

When a program crashes on a developer's machine, the developer is perfectly happy running their debugger of choice to find out why. However, when that same program crashes on someone else's box, or at a customer site, it's back to reading sparse logs and static code analysis (staring at the monitor).

Part of this is down to ignorance - there are several ways to debug a remote issue, and most developers seem not to know about them. My personal favourites are crash dumps and remote debugging, and we'll go over each.

Crash Dumps

In Windows, if you don't have a debugger installed, a crash dump file is written whenever an app crashes. The dump is written by a tool called Dr. Watson, after you click "Don't Send" in the Windows error reporting box. Occasionally, you'll find that an application you've killed takes a long time to exit - that's because it's writing a crash dump then, too.

This crash dump file is written to C:\Documents and Settings\All Users\Application Data\Microsoft\Dr Watson\user.dmp. This file hangs around; it represents a crash dump of the last application that crashed.

You can load this file into Visual Studio. Open it as you would open a project file. Then click Run, and VS will pretend that the app is running and just crashed - so you can see the stacks of all threads, locals and so on.

In order to make this useful, you'll need to ensure you have symbols for everything. The customer's machine is highly unlikely, in my experience, to have exactly the same set of Windows DLLs as your development machine, so configuring your debugger to use the Microsoft Symbol Server is important. You also need to have built your app with debug information.
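Configuring the symbol server is just a matter of setting an environment variable before starting the debugger - VS and WinDbg both pick it up. The cache directory (c:\symbols here, but that's just my choice) is where downloaded symbols end up:

set _NT_SYMBOL_PATH=srv*c:\symbols*http://msdl.microsoft.com/download/symbols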

It's vital that you keep a copy of every EXE and DLL you ship to a customer, along with their associated PDBs. If you're not generating PDBs for some reason, then change your release builds now to do so (/Zi on the compiler, /DEBUG on the linker). After all, the only difference in the resulting .EXE file is a path to the original PDB, and a couple of hundred bytes extra in the EXE file is not going to kill you.
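If you want to see what that looks like on a command line, here's a minimal sketch - myapp.cpp stands in for your real project, and /OPT:REF and /OPT:ICF re-enable the linker optimisations that /DEBUG would otherwise switch off in a release build:

cl /Zi /O2 myapp.cpp /link /DEBUG /OPT:REF /OPT:ICF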

Once you have all these, when you load the user.dmp file into Visual Studio, you should get source lines and call stacks for all threads. You may have to put the EXE and DLL files into the same locations on your machine as they are on the crashing machine. This is a peculiarity of Visual Studio, and can be circumvented using WinDbg (see below).

If the user has a debugger installed - sometimes people have old versions of Visual C++, for example - Windows won't write the crash dump file. This is irritating. To fix it, run the following command from Start - Run:

drwtsn32 -i

which replaces the current debugger with Dr. Watson again.
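(If you'd rather inspect the setting by hand, it lives in the registry under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug - the Debugger value holds the command line that gets run when an app falls over.)

Incidentally, you don't have to rely on Dr. Watson at all: dbghelp.dll exports MiniDumpWriteDump, so your app can write its own dump from an unhandled-exception filter. A minimal sketch - the file name and dump type here are just illustrative choices:

#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

// Last-chance handler: write a minidump of the crashing process.
static LONG WINAPI WriteCrashDump(EXCEPTION_POINTERS* ep)
{
    HANDLE file = CreateFileA("crash.dmp", GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file != INVALID_HANDLE_VALUE)
    {
        // Tell dbghelp which exception we're dumping.
        MINIDUMP_EXCEPTION_INFORMATION mei;
        mei.ThreadId = GetCurrentThreadId();
        mei.ExceptionPointers = ep;
        mei.ClientPointers = FALSE;

        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(),
                          file, MiniDumpNormal, &mei, NULL, NULL);
        CloseHandle(file);
    }
    // Let normal crash handling (and Dr. Watson) carry on afterwards.
    return EXCEPTION_CONTINUE_SEARCH;
}

// Somewhere in startup:
//     SetUnhandledExceptionFilter(WriteCrashDump);

The resulting file loads into Visual Studio or WinDbg exactly as user.dmp does.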

WinDbg

Visual Studio has a nice interface, but sometimes you need a few extra features, and most of those are present in a little tool called WinDbg.

WinDbg is part of the Debugging Tools for Windows. It's a free debugger, like VS only a lot less friendly. I tend to use a small subset of the features, because I mostly turn to WinDbg when Visual Studio lacks the specials.

First, ensure you've configured the Microsoft Symbol Server - WinDbg respects the same _NT_SYMBOL_PATH variable. To load a crash dump into WinDbg, open it from the File menu (File - Open Crash Dump). WinDbg will then show you its Command window - yes, we're in command-line territory here.

The next thing to do is ensure you have symbols for your own EXE and DLL modules. Type "!sym noisy" to turn on symbol logging, then go to the Debug - Modules menu item. Select your EXE and DLL files and click Reload for each. Close the dialog box and you'll see information about each of your modules as WinDbg tries (and fails) to load the original modules. Close inspection of this information will reveal where (inside your symbol download directory) it is looking for these modules. Note the hex number in each path: it identifies the particular build (for EXEs and DLLs it's derived from the module's timestamp and size, rather than being a hash of the contents). Copy your archived copies of the modules into the correct places.

Going to Debug - Modules and clicking Reload again will this time find the modules but fail to load the symbol files. Again, inspect the log for the correct locations in which to place your PDB files. Reload once more should load all your symbols, and you're ready to go.
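For reference, the command-line equivalent of all that clicking looks something like this - the archive path and module name are made up, so substitute wherever you keep your shipped binaries and PDBs:

!sym noisy
.exepath+ c:\archive\myapp\1.0.3
.sympath+ c:\archive\myapp\1.0.3
.reload /f myapp.exe

.exepath is also the answer to Visual Studio's insistence (mentioned above) that modules live at the same paths as on the crashing machine - WinDbg will happily pull them from wherever you point it.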

Like VS, you can open a Processes and Threads view, and a call stack. My experience is that WinDbg is slightly better at call stacks. More powerful is the Memory window, in which you can type "esp" at the top to show the raw stack contents. Remember that the stack grows from high memory addresses to low, so as you scroll downwards, you are seeing earlier stack entries. If you set the data format to "Pointers and Symbols", you get a symbol name after every stack entry that resolves to one.

This data format alone is worth using WinDbg for. You can scroll down the stack, identifying the return addresses of each of the functions shown in the Call Stack view. If there are entries missing from the Call Stack view, you may well be able to find them in the memory view. In addition, you can identify parameters (which will come immediately below the return address) and local variables (which will come immediately above).
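The Command-window equivalent is dds ("dump dwords with symbols"), which walks memory from a given address and annotates every value that resolves to a symbol:

dds esp

Entering dds again with no address carries on from where the last dump stopped, which makes walking down a deep stack quite painless.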

If you're really canny, you can load both WinDbg and VS together on the same crash dump. Then you can identify memory locations in WinDbg and view them in VS.

More surprises come from typing !address and !vadump in the command window. These commands will show you the virtual memory contents in various ways - useful for finding out just what all that memory is for, and why you're running out of it.

Recently, I was debugging an app with a variable buffer size. The number of devices that could be connected to this app actually went down with a larger buffer size. However, there was only the one buffer...

Debugging with VS, I discovered an out-of-memory situation - the app was actually failing to create its threads, with ERROR_OUT_OF_MEMORY. Where was all that memory going?

WinDbg found it pretty quickly - !address displayed a number of 10MB chunks of memory marked as "stack". It turned out that CMake (considered harmful) had set the thread stack size to 10MB. Each thread reserves its full stack of address space up front, whether it ever touches it or not, and this particular app had about five threads per device... so plugging in 20 devices took up 1GB of virtual memory!
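If you can't beat the build system into submission, you can also override the reservation per thread at creation time. A sketch, with made-up names - the worker routine and the 64KB figure are purely illustrative:

#include <windows.h>
#include <process.h>

// Placeholder worker routine - imagine one of those five-per-device threads.
unsigned __stdcall DeviceThread(void*)
{
    return 0;
}

int main()
{
    // The second argument is the stack size in bytes; passing
    // STACK_SIZE_PARAM_IS_A_RESERVATION (XP and later) makes it the
    // *reserve* size - the address space the thread eats - rather than
    // just the initial commit.
    HANDLE thread = (HANDLE)_beginthreadex(
        NULL,                               // default security
        64 * 1024,                          // 64KB reserve, not 10MB
        DeviceThread,
        NULL,                               // no argument
        STACK_SIZE_PARAM_IS_A_RESERVATION,
        NULL);                              // thread id not needed

    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
    return 0;
}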

Remote Debugging

Remote debugging is not nearly as useful at customer sites as simply collecting crash dumps - but if you have an in-house test facility, even if it's just a second machine for you, it can be invaluable.

VS' remote debugging is usually installed from the CD-ROM, but if you don't have that handy, you can just copy the Common7\Packages\Debugger folder from your Visual Studio installation to the target machine. I then usually run:

msvcmon -anyuser -tcpip -timeout -1

It's not secure - but it's certainly the most convenient way. You can then go to Tools - Debug Processes in Visual Studio, change the Transport to TCP/IP and type in the destination machine's IP address. Click Refresh to show the processes on the remote machine, click one and Attach to start debugging.

Wednesday, October 12, 2005

Disjunction

Those of you with MP3 players probably discovered this some time ago; I myself only noticed when I started keeping large quantities of music on a hard disk (now an iPod)...

Sometimes, when you have it all on shuffle, you get pairs of songs that run together as if they were designed that way. For example, Steve Hackett's "Vampyre with a Healthy Appetite", from the Tokyo Tapes, just ran into Pallas' "Wilderness Years" from Beat the Drum, and I only noticed when Mr. Reed started singing and thought "hang on, is this still Hackett?"!

Of course, it happens the other way, too: my iPod has a disturbing habit of playing the Corrs between Dream Theater and Evergrey, which makes you blink.

Tuesday, October 04, 2005

Delusion

When you're a developer - of any kind - you rely on the tools, equipment and materials that you need to do your job.

Some things don't need introductions to their use. Hammers, for example, generally are hammers, and you hit things with them. It's kind of expected that when you buy a hammer, you already know how to hit things (and this is more complicated than it sounds!). Still, if I bought Homer Simpson's automatic hammer, I'd expect a manual that went into detail about its use - even if it should be bloody obvious.

Why doesn't this happen with the bleeding edge - or even the dull edge - of software?

When Microsoft publishes a new technology, it goes the whole hog - examples, documentation, tutorials, extensive testing - before it ever goes Beta 1. This means that you and I can plough in straight away and start trying it out without having to know every nasty little detail. You might not like the technology, and you might not approve of their motives - but their execution is stellar.

Other people aren't quite so complete. Take log4cplus (no, please, take it away!), for example. There seems to be a dearth of logging libraries for C++, otherwise surely this would never be used, much less adopted by companies as their "logging standard". It has no documentation of significance. In order to figure out how to use it, you study code written with it, you study its code, you study (and discard) the meagre API docs, and then you experiment, trial-and-error style.

The purpose of this post is not to dig at log4cplus. Rather, I would dig at the entire community which finds it acceptable to do the minimum possible for a project, then mark it "Production".

It ain't done until it's complete, people. Complete doesn't mean bug-free, and as much as I'd like to see that happen, it just never does. Complete does mean that the package is done, though, and the package is more than software.

You have a customer, even if it's just you. You need to get the software onto their machine and help them, every step of the way, to do what they want to do with it. This means installation. This means documentation (and documentation is not just Doxygen or Javadoc). This means intuitive and consistent interfaces. This means coping with non-admins. This means testing. This means support.

Developer tools are not exempt from these requirements. In fact, developers are an even more demanding set - or should be. A developer has to meet all these requirements themselves, and if you haven't provided what it was your duty to provide, you have failed your customer.

Even if you have no bugs.

If you can't deliver a package, think again about what you're doing.

Dysfunction

Well, LA was not as bad as I feared, and indeed it was interesting to see a customer's system for the first time. It really gave me a new perspective on what I do, and it's always valuable to get feedback from the horse's mouth.

I made a few mistakes, though.

It's curious how technical users can put up with the craziest of things. People will wait and restart after crashes; they will endure counter-intuitive interfaces; they will generally find the best way they can to do the job. And provided you don't get data loss, they will live with what they have.

Now, under those circumstances, my heart always goes out to the poor fool on the front line who is actually doing this, and I really think "what can I do to ease their pain?". When I feel comfortable in a job, I occasionally generalise this to "what can we do?".

This generalisation turns out to be a mistake.

To be clear, I don't believe for a moment that helping the user is a bad thing; quite the opposite. However, when you're on the front line, you tend to believe that a customer's needs are more pressing than the "grand plan" being expressed back in the office. Caring about the user is a good thing. Translating this "tactical" care into code is, apparently, bad.

Well, you live and learn.

The thing is, I over-generalise. I go in and solve a customer problem, and I assume that that customer is both all-important and representative of the customer base. I don't necessarily take my eye completely off the ball; I am aware of the conflicting needs of others, so I conceive things that are optimised for this customer, and appear to give everyone what they want.

At this point, I infringe on the "grand plan". The plan is generally shepherded by one person who has been at the company between three and thirty times as long as I. I'd like to believe this means they've seen many more customers and know more about the domain than I do, and so I will. However, I still get annoyed to have my idea of the moment - which has been carefully constructed with my eye firmly on the user - cut down without investigation. In fact, this happens even when it is within my area of speciality. That really annoys me.

In the past, I have reached a point within a job at which I can go ahead with an idea anyway. Sometimes these ideas don't work - but I still learn something. More often than not, however, they do work, and some real, tangible benefits arise.

If you want to innovate as a company, you need to nurture new ideas - even ones you think are poor ones. Few ideas in software require significant investment to get to a point where any hidden benefits are visible.

And, after all - if you didn't do anything new today, why did you get up?