OddThinking

A blog for odd things and odd thoughts.

Hunting Intermittent Bugs

The Situation

I was testing my real-time code. The problem with testing real-time systems is that errors can be intermittent. I dutifully ran the unit test repeatedly to make sure it was stable.

It wasn’t until the sixth iteration that an odd error showed up.

I spent some time on it, made a change to fix it, and tried again…

After six iterations of the test, I hadn’t seen the error.

But that’s hardly good enough. The bug had roughly an 83% chance of simply not occurring in each run. If the bug wasn’t fixed, there would still be about a 33% chance (0.83^6 ≈ 0.33) that it simply didn’t show for six straight runs.

The Puzzle

Here was an interesting real-world puzzle: after fixing an intermittent bug, how many test runs do you need to do in order to be convinced that it has gone?

Fumbling for an Answer

The answer depends on several factors:

  1. How often did the intermittent bug occur?
  2. How often do you think you have fixed a bug, only to find you haven’t?
  3. How sure do you need to be that the bug is gone?

I think the formula is:

P(bug still exists | you think you solved it)
  * (P(test run passes | bug still exists) ^ n)
    <= PAcceptable(bug still exists)

Solve for n.
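As a sanity check, here is a minimal sketch of that calculation in Python. It is not a tool I actually used, and the function name and numbers below are purely illustrative:

    import math

    def runs_needed(p_unfixed, p_pass_given_bug, p_acceptable):
        # Smallest n with p_unfixed * p_pass_given_bug**n <= p_acceptable.
        if p_unfixed <= p_acceptable:
            return 0  # already confident enough before any test runs
        return math.ceil(math.log(p_acceptable / p_unfixed)
                         / math.log(p_pass_given_bug))

    # Illustrative estimates only -- plug in your own:
    print(runs_needed(p_unfixed=0.05, p_pass_given_bug=0.5, p_acceptable=0.01))  # 3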

For the first question, I had to assume that the initial sample of six runs was representative. Why? Because I didn’t have the maths skills to work out what value I should use, and because I wasn’t about to re-run the old known-faulty code just to get a better idea of the MTBF.

The second question I could estimate. Maybe 2% of the time? If you include “introducing a new bug”, the rate is probably much higher, but here I mean only the case where you think the bug has gone and it is still there.

How sure do you need to be that the bug is gone? This was a commercial project – I was a junior developer, so I figured that decision wasn’t mine. I asked my Project Manager. She looked at me, bemused by the nerdy question, and hazarded “99% sure?” I was horrified – that seemed far too risky. That only required one iteration.

A Pragmatic Solution

Rather than debate it further, I went back to my desk and quietly ran it over and over again, until I couldn’t handle it any more.

On the 35th run, it failed! I hadn’t found the bug!

Coda

More careful examination of the code revealed the true problem. I fixed it, more confident than before. But now I had a further dilemma. If the bug only occurred once in 35 runs, how many times would I need to run it this time to reassure myself?
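Applying the earlier formula with the same rough estimates – a 2% chance the fix was wrong, a 1% target, and now roughly a 34/35 chance of a faulty run passing – suggests something like:

  0.02 * (34/35)^n <= 0.01
  => n >= log(0.5) / log(34/35) ≈ 24

A couple of dozen runs, if the rough estimates were to be believed.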

I spent a few hours producing a far more elaborate test harness, and let it run in a corner for several days straight, many thousands of times. This time I was going to be sure the bug was gone!


Comments

  1. Firstly, wouldn’t this type of thing be the proper role of stress testing rather than unit testing?

    Secondly, if you cannot explain the behaviour, and _why_ your change fixes the issue, then you must assume that you haven’t fixed the bug. It doesn’t matter how many runs through the tests you do.

    Raise a task and then leave it…if you don’t find it again then you are done…

    …and then when the customer calls and says that it happened you can point to the task and say “Yeah, thanks, but we already know about that one”

    while inside you are thinking “Damn, now I’m going to have to do some actual work for a change”.

  2. Pete,

    stress testing rather than unit testing

    I agree with you. I was using the term “unit testing” more loosely here. In this system, there were multiple threads per process, and multiple processes per system. I was stress-testing one “unit” (where unit = whole process), while a separate team would stress-test the whole system. This is an issue of definitions, and your usage is more conventional.

    Secondly, if you cannot explain the behaviour, and _why_ your change fixes the issue, then you must assume that you haven’t fixed the bug. It doesn’t matter how many runs through the tests you do.

    Again, I agree with you. However, it becomes tricky when the cause of the defect is an intermittent corruption.

    This incident was many, many years ago, and I can’t claim to remember the details well. However, I can paint the kind of picture it might have been, to illustrate the point.

    Suppose that there was some data shared amongst multiple threads, and each access was protected by a semaphore.

    Suppose the symptom of the bug was an “assert” statement reporting some corruption in some shared data.

    Suppose that a careful examination found a reference to the data which was not properly protected by a semaphore, despite multiple code reviews.

    That unprotected access might sometimes cause corruption of that data when it is accessed by two threads simultaneously.

    Suppose that proper semaphore protection was inserted. You can probably see the dilemma. A possible cause has been fixed, but it is impossible to say whether that was really the root cause of the error.
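    In rough, illustrative Python – not the real code, and not even the real language – the shape of the flaw might have been something like this:

        import threading

        lock = threading.Lock()
        shared = {"count": 0, "checksum": 0}

        def protected_update():
            with lock:  # every access is supposed to look like this
                shared["count"] += 1
                shared["checksum"] = shared["count"] % 7

        def unprotected_update():
            # The one access that slipped through the code reviews: no lock.
            # Two threads in here at once can leave count and checksum
            # inconsistent with each other.
            shared["count"] += 1
            shared["checksum"] = shared["count"] % 7

        def check():
            with lock:
                assert shared["checksum"] == shared["count"] % 7, "corruption!"

        if __name__ == "__main__":
            workers = [threading.Thread(target=unprotected_update) for _ in range(100)]
            for w in workers:
                w.start()
            for w in workers:
                w.join()
            check()  # may or may not fire -- which was exactly the problem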

    Raise a task and then leave it…

    This approach depends on the economics of the situation. In many projects – especially ones with a low level of corporate importance, or ones with highly iterative development processes – it may well be appropriate.

    On this project, the software was “mission-critical”, the customer was paying a high price for high quality, and the iterations were so long that they could be considered traditional waterfall.

    Hence, I was horrified that the Project Manager’s off-the-cuff figure for how sure we needed to be that each bug fix had worked was as low as 99%.

  3. I think your formula is slightly wrong:

    Writing H = P(test passes | bug exists) and Q = P(bug exists) [the prior estimate]:

    P(bug exists | n tests pass)
      = P(bug exists & n tests pass) / P(n tests pass)
      = Q * H^n / (Q * H^n + (1-Q) * P(n tests pass | bug doesn’t exist))

    Assuming no false positives,

    P(bug exists | n tests pass) = Q * H^n / (Q * H^n + (1-Q))

    where your equation was just Q * H^n.

    In this case the denominator is very close to 1 (because Q is close to zero), so there is little difference. But if we take the case H = Q = 0.5, the difference becomes significant: after one passing test your equation suggests the chance of the bug being there reduces to 0.25; in fact it reduces only to 0.3333…

  4. Maybe it’s better to first put in some diagnostic code that somehow proves that the bug you think you’ve found actually causes the corruption behavior, so that you can be certain you’re fixing the right thing. The only way to be sure the bug is gone is to understand why it’s there and to know what you’ve done about it.

  5. Chris,

    You are absolutely right. My penalty for the error was to spend 10 minutes going back to Bayes Theorem and the basics to ensure I understood my mistake. I think I’ve learnt my lesson.

    [Note: I copy-edited an N to an H in the final paragraph of your conclusion. After 10 minutes’ study, I am pretty sure it was just a typo.]
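    For anyone who wants to check the arithmetic, a quick Python sketch of the H = Q = 0.5 comparison after a single passing run:

        # H = P(test passes | bug exists), Q = prior P(bug exists), n passing runs
        H, Q, n = 0.5, 0.5, 1

        naive = Q * H**n                          # my original formula
        bayes = Q * H**n / (Q * H**n + (1 - Q))   # Chris's correction, no false positives

        print(naive)  # 0.25
        print(bayes)  # 0.3333...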

  6. Joost,

    I agree in principle. However, I am still pondering how this could be done effectively in practice, here.

    The common practice of adding a plethora of debug output and assert statements to code when tracking down a bug is about confidently locating the cause of the defect. However, writing debug output is notoriously slow.

    In this case, we were pushing the system to its capacity (and if more powerful hardware had been available, the requirements would have been upped until it was being stressed again!). Adding debug output made the system too slow to meet its run-time deadlines, which made this form of debugging less practical.

    It wasn’t a simple Heisenbug that disappeared when you tried to debug it. It was just that adding debug output would slow the system down and trigger all sorts of failover behaviour which hid the original behaviour.

    I did, when I was desperate, write all sorts of special “black box” code. It would wait until the system was humming along, then store traces of various variables into a buffer of memory for a few seconds. When the buffer was full, it would start printing out values, slowing the system down enough that all the failover code went wild.
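    In outline – a Python sketch of the idea, not the real code, which I no longer have – it worked something like this:

        TRACE_CAPACITY = 10000
        trace = []

        def record(name, value, timestamp):
            # Cheap enough to call from the time-critical path.
            if len(trace) < TRACE_CAPACITY:
                trace.append((timestamp, name, value))
            else:
                dump()  # buffer full: start printing, and accept the slow-down

        def dump():
            # Printing is slow, but by now the evidence is already captured.
            for timestamp, name, value in trace:
                print(timestamp, name, value)
            trace.clear()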

    This was quite an effort to go to, and didn’t seem warranted in this case (where I wrongly thought I had found the cause).

    Even if I could write the output quickly enough, I am not sure what I would write to prove whether a particular write-access that was not protected by a semaphore was the cause of the corruption that had been detected.

    Again, time has dulled my memory of the actual code. Perhaps if I was looking at it again, the solution would be obvious.

  7. Very interesting problem. I would bet that there is no “direct cause” of this intermittent error; rather, it is the result of a combination of several independent factors coming into force. The traditional way of locating the “cause” will lead nowhere. One needs to think in terms of a “chaotic system”.

    The effective way, in my view, is:
    1. identify those factors
    2. build/allow sufficient margin for those factors


Web Mentions

  1. The Other Kind of Reentrant

  2. OddThinking » Self-Scaling Log Files