<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Hunting Intermittent Bugs</title>
	<atom:link href="http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/</link>
	<description>A blog for odd things and odd thoughts.</description>
	<lastBuildDate>Wed, 10 Mar 2010 17:00:21 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: OddThinking &#187; Self-Scaling Log Files</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-221266</link>
		<dc:creator>OddThinking &#187; Self-Scaling Log Files</dc:creator>
		<pubDate>Thu, 21 Jan 2010 02:27:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-221266</guid>
		<description>[...] the problem seems to be finally fixed, and hasn&#8217;t recurred for several days or weeks (or 6 runs) so there is no point wasting time and disk-space with pointless [...]</description>
		<content:encoded><![CDATA[<p>[...] the problem seems to be finally fixed, and hasn&#8217;t recurred for several days or weeks (or 6 runs) so there is no point wasting time and disk-space with pointless [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: chris</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-204593</link>
		<dc:creator>chris</dc:creator>
		<pubDate>Thu, 16 Jul 2009 23:03:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-204593</guid>
		<description>Very interesting problem.  I would bet that there is no &quot;direct cause&quot; of this intermittent error; rather, it is the result of a combination of several independent factors coming into play.  The traditional way of locating the &quot;cause&quot; will lead nowhere.  One needs to think in terms of a &quot;chaotic system&quot;.

The effective way, in my view, is:
1. identify those factors
2. build/allow sufficient margin for those factors</description>
		<content:encoded><![CDATA[<p>Very interesting problem.  I would bet that there is no &#8220;direct cause&#8221; of this intermittent error; rather, it is the result of a combination of several independent factors coming into play.  The traditional way of locating the &#8220;cause&#8221; will lead nowhere.  One needs to think in terms of a &#8220;chaotic system&#8221;.</p>
<p>The effective way, in my view, is:<br />
1. identify those factors<br />
2. build/allow sufficient margin for those factors</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: The Other Kind of Reentrant</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-22408</link>
		<dc:creator>The Other Kind of Reentrant</dc:creator>
		<pubDate>Mon, 18 Dec 2006 10:27:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-22408</guid>
		<description>[...] I verified that, after removing all traces of logging from my signal handlers, the unit tests ran perfectly, to completion. But this was still a mildly unsatisfactory explanation. It didn&#8217;t explain why the signals weren&#8217;t being delivered or what mutex was locked that could not be acquired by the signal handler. And maybe I just hadn&#8217;t tested enough to reproduce it? So I went digging further. [...]</description>
		<content:encoded><![CDATA[<p>[...] I verified that, after removing all traces of logging from my signal handlers, the unit tests ran perfectly, to completion. But this was still a mildly unsatisfactory explanation. It didn&#8217;t explain why the signals weren&#8217;t being delivered or what mutex was locked that could not be acquired by the signal handler. And maybe I just hadn&#8217;t tested enough to reproduce it? So I went digging further. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Julian</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-1964</link>
		<dc:creator>Julian</dc:creator>
		<pubDate>Sat, 26 Nov 2005 11:27:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-1964</guid>
		<description>Joost,

I agree in principle. However, I am still pondering how this could be done effectively in practice, here.


The common practice of adding a plethora of debug output and assert statements to code when tracking down a bug is about confidently locating the cause of the defect. However, writing debug output is notoriously slow.

In this case, we were pushing the system to its capacity (and if more powerful hardware had been available, the requirements would have been upped until it was being stressed again!)  Adding debug output made the system too slow to meet its run-time deadlines. This made this form of debugging less practical.

It wasn&#039;t a simple &lt;a href=&quot;http://catb.org/~esr/jargon/html/H/heisenbug.html&quot; rel=&quot;nofollow&quot;&gt;Heisenbug&lt;/a&gt; that disappeared when you tried to debug it. It was just that adding debug output would slow the system down and trigger all sorts of failover behaviour which hid the original behaviour.

I did, when I was desperate, write all sorts of special &quot;black box&quot; code. It would wait until the system was humming along, then store traces of various variables into a buffer of memory for a few seconds. When the buffer was full, it would start printing out values, slowing the system down enough to trigger all the failover code to go wild.

This was quite an effort to go to, and didn&#039;t seem warranted in this case (where I wrongly thought I had found the cause).

Even if I could write the output quickly enough, I am not sure what I would write to prove that a particular write-access that was not protected by a semaphore wasn&#039;t the cause of the corruption that had been detected.

Again, time has dulled my memory of the actual code. Perhaps if I was looking at it again, the solution would be obvious.</description>
		<content:encoded><![CDATA[<p>Joost,</p>
<p>I agree in principle. However, I am still pondering how this could be done effectively in practice, here.</p>
<p>The common practice of adding a plethora of debug output and assert statements to code when tracking down a bug is about confidently locating the cause of the defect. However, writing debug output is notoriously slow.</p>
<p>In this case, we were pushing the system to its capacity (and if more powerful hardware had been available, the requirements would have been upped until it was being stressed again!)  Adding debug output made the system too slow to meet its run-time deadlines. This made this form of debugging less practical.</p>
<p>It wasn&#8217;t a simple <a href="http://catb.org/~esr/jargon/html/H/heisenbug.html" rel="nofollow" class="liexternal">Heisenbug</a> that disappeared when you tried to debug it. It was just that adding debug output would slow the system down and trigger all sorts of failover behaviour which hid the original behaviour.</p>
<p>I did, when I was desperate, write all sorts of special &#8220;black box&#8221; code. It would wait until the system was humming along, then store traces of various variables into a buffer of memory for a few seconds. When the buffer was full, it would start printing out values, slowing the system down enough to trigger all the failover code to go wild.</p>
<p>This was quite an effort to go to, and didn&#8217;t seem warranted in this case (where I wrongly thought I had found the cause).</p>
<p>Even if I could write the output quickly enough, I am not sure what I would write to prove that a particular write-access that was not protected by a semaphore wasn&#8217;t the cause of the corruption that had been detected.</p>
<p>Again, time has dulled my memory of the actual code. Perhaps if I was looking at it again, the solution would be obvious.</p>
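<p>For illustration only, the &#8220;black box&#8221; idea mentioned above can be sketched roughly as follows. This is a minimal Python sketch, not the original code: the class name <code>TraceBuffer</code>, its methods and the sample name <code>queue_depth</code> are all hypothetical, and it assumes the cheap step is appending to memory while the slow step is formatting and printing.</p>

```python
class TraceBuffer:
    """Fixed-capacity in-memory trace; slow output is deferred until it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = []

    def record(self, name, value):
        """Store one sample cheaply; return True once the buffer has filled."""
        if len(self.samples) == self.capacity:
            # Already full: drop the sample rather than grow or block.
            return True
        self.samples.append((name, value))
        return len(self.samples) == self.capacity

    def dump(self):
        """Slow path: format everything collected so far, then reset the buffer."""
        lines = [f"{name}={value}" for name, value in self.samples]
        self.samples = []
        return lines


# Hypothetical usage: sample a variable each tick, print only once the buffer fills.
trace = TraceBuffer(capacity=3)
for tick in range(5):
    if trace.record("queue_depth", tick):
        break
print("\n".join(trace.dump()))
```

<p>The point of the design is that recording never prints, so the timing disturbance is pushed to the moment after the interesting window has already been captured.</p>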
]]></content:encoded>
	</item>
	<item>
		<title>By: Julian</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-1963</link>
		<dc:creator>Julian</dc:creator>
		<pubDate>Sat, 26 Nov 2005 10:56:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-1963</guid>
		<description>Chris,

You are absolutely right. My penalty for the error was to spend 10 minutes going back to Bayes&#039; Theorem and the basics to ensure I understood my mistake. I think I&#039;ve learnt my lesson.

[Note: I copy-edited an N to an H in the final paragraph of your conclusion. After 10 minutes&#039; study, I am pretty sure it was just a typo.]</description>
		<content:encoded><![CDATA[<p>Chris,</p>
<p>You are absolutely right. My penalty for the error was to spend 10 minutes going back to Bayes&#8217; Theorem and the basics to ensure I understood my mistake. I think I&#8217;ve learnt my lesson.</p>
<p>[Note: I copy-edited an N to an H in the final paragraph of your conclusion. After 10 minutes' study, I am pretty sure it was just a typo.]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joost</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-1921</link>
		<dc:creator>Joost</dc:creator>
		<pubDate>Thu, 24 Nov 2005 18:11:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-1921</guid>
		<description>Maybe it&#039;s better to first put in some diagnostic code that somehow proves that the bug you think you&#039;ve found actually causes the corruption behavior, so that you can be certain you&#039;re fixing the right thing. The only way to be sure the bug is gone is to understand why it&#039;s there and to know what you&#039;ve done about it.</description>
		<content:encoded><![CDATA[<p>Maybe it&#8217;s better to first put in some diagnostic code that somehow proves that the bug you think you&#8217;ve found actually causes the corruption behavior, so that you can be certain you&#8217;re fixing the right thing. The only way to be sure the bug is gone is to understand why it&#8217;s there and to know what you&#8217;ve done about it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-1901</link>
		<dc:creator>Chris</dc:creator>
		<pubDate>Tue, 22 Nov 2005 23:14:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-1901</guid>
		<description>I think your formula is slightly wrong:

writing 
H = P( test passes &#124; bug exists) and Q = P( bug exists ) [the prior estimate]

P( bug exists &#124; n tests pass ) = P( bug exists &amp; n tests pass ) / P( n tests pass )

= Q * H^n /  ( Q * H^n + (1-Q) * P( n tests pass &#124; bug doesn&#039;t exist ) )

assuming no false positives, 

P( bug exists &#124; n tests pass ) = Q * H^n / ( Q * H^n + (1-Q) )

where your equation was Q * H^n

In this case the denominator is very close to 1 (because Q is close to zero) so there is little difference. But if we take the case H=Q=0.5 the difference becomes significant: after one passing test your equation suggests the chance of the bug being there reduces to 0.25, whereas in fact it reduces only to 0.3333....</description>
		<content:encoded><![CDATA[<p>I think your formula is slightly wrong:</p>
<p>writing<br />
H = P( test passes | bug exists) and Q = P( bug exists ) [the prior estimate]</p>
<p>P( bug exists | n tests pass ) = P( bug exists &amp; n tests pass ) / P( n tests pass )</p>
<p>= Q * H^n /  ( Q * H^n + (1-Q) * P( n tests pass | bug doesn&#8217;t exist ) )</p>
<p>assuming no false positives, </p>
<p>P( bug exists | n tests pass ) = Q * H^n / ( Q * H^n + (1-Q) )</p>
<p>where your equation was Q * H^n</p>
<p>In this case the denominator is very close to 1 (because Q is close to zero) so there is little difference. But if we take the case H=Q=0.5 the difference becomes significant: after one passing test your equation suggests the chance of the bug being there reduces to 0.25, whereas in fact it reduces only to 0.3333&#8230;.</p>
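<p>As a rough check of the two formulas above, here is a minimal Python sketch. The function names are illustrative only, and it keeps the same assumption of no false positives, i.e. a bug-free system always passes.</p>

```python
def posterior_bug_probability(prior, pass_given_bug, n_passes):
    """P(bug exists | n tests pass), assuming a bug-free system always passes."""
    # Numerator Q * H^n: the bug exists and all n tests pass anyway.
    joint = prior * pass_given_bug ** n_passes
    # Denominator P(n tests pass) = Q * H^n + (1 - Q) * 1.
    return joint / (joint + (1.0 - prior))


def unnormalised_estimate(prior, pass_given_bug, n_passes):
    """The value Q * H^n on its own, without dividing by P(n tests pass)."""
    return prior * pass_given_bug ** n_passes


# The H = Q = 0.5 case after a single passing test.
print(posterior_bug_probability(0.5, 0.5, 1))
print(unnormalised_estimate(0.5, 0.5, 1))
```

<p>With H = Q = 0.5 and n = 1 the first call gives one third and the second gives 0.25, matching the figures above.</p>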
]]></content:encoded>
	</item>
	<item>
		<title>By: Julian</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-1892</link>
		<dc:creator>Julian</dc:creator>
		<pubDate>Tue, 22 Nov 2005 12:49:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-1892</guid>
		<description>Pete,

&lt;blockquote&gt;stress testing rather than unit testing&lt;/blockquote&gt;

I agree with you. I was using the term &quot;unit testing&quot; more loosely here. In this system, there were multiple threads per process, and multiple processes per system. I was stress-testing one &quot;unit&quot; (where unit = whole process), while a separate team would stress-test the whole system. This is an issue of definitions, and your usage is more conventional.

&lt;blockquote&gt;Secondly, if you cannot explain the behaviour and then _why_ your change fixes the issue then you must assume that you haven’t fixed the bug. It doesn’t matter how many runs through the tests you do.&lt;/blockquote&gt;

Again, I agree with you. However, it becomes tricky when the cause of the defect is an intermittent corruption.

This incident was many, many years ago, and I can&#039;t claim to remember many of the details well. However, I can paint the &lt;em&gt;kind&lt;/em&gt; of picture it might have been to illustrate the point.

Suppose that there was some data shared amongst multiple threads, and each access was protected by a semaphore.

Suppose the symptom of the bug was an &quot;assert&quot; statement reporting some corruption in some shared data.

Suppose that a careful examination found a reference to the data which was not properly protected by a semaphore, despite multiple code reviews.

That unprotected access &lt;em&gt;might&lt;/em&gt;, &lt;em&gt;sometimes&lt;/em&gt; cause some corruption to that data when it is accessed by two threads simultaneously. 

Suppose that proper semaphore protection was inserted. You can probably see the dilemma. A possible cause has been fixed, but it is impossible to say whether that was really the root cause of the error.

&lt;blockquote&gt;Raise a task and then leave it...&lt;/blockquote&gt;

This approach depends on the economics of the situation. In many projects - especially ones which have a low level of corporate importance or ones which have highly-iterative development processes - it may well be appropriate.

On this project, the software was &quot;mission-critical&quot;, the customer was paying a high price for high quality, and the iterations were so long that they could be considered traditional waterfall. 

Hence, I was horrified that the Project Manager&#039;s off-the-cuff assessment for the success rate for each bug fix was as low as 99%.</description>
		<content:encoded><![CDATA[<p>Pete,</p>
<blockquote><p>stress testing rather than unit testing</p></blockquote>
<p>I agree with you. I was using the term &#8220;unit testing&#8221; more loosely here. In this system, there were multiple threads per process, and multiple processes per system. I was stress-testing one &#8220;unit&#8221; (where unit = whole process), while a separate team would stress-test the whole system. This is an issue of definitions, and your usage is more conventional.</p>
<blockquote><p>Secondly, if you cannot explain the behaviour and then _why_ your change fixes the issue then you must assume that you haven’t fixed the bug. It doesn’t matter how many runs through the tests you do.</p></blockquote>
<p>Again, I agree with you. However, it becomes tricky when the cause of the defect is an intermittent corruption.</p>
<p>This incident was many, many years ago, and I can&#8217;t claim to remember many of the details well. However, I can paint the <em>kind</em> of picture it might have been to illustrate the point.</p>
<p>Suppose that there was some data shared amongst multiple threads, and each access was protected by a semaphore.</p>
<p>Suppose the symptom of the bug was an &#8220;assert&#8221; statement reporting some corruption in some shared data.</p>
<p>Suppose that a careful examination found a reference to the data which was not properly protected by a semaphore, despite multiple code reviews.</p>
<p>That unprotected access <em>might</em>, <em>sometimes</em> cause some corruption to that data when it is accessed by two threads simultaneously. </p>
<p>Suppose that proper semaphore protection was inserted. You can probably see the dilemma. A possible cause has been fixed, but it is impossible to say whether that was really the root cause of the error.</p>
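<p>To make the hypothetical concrete, here is a minimal Python sketch of shared data where every access goes through a semaphore. It is an illustration under assumptions, not the original system: <code>SharedCounter</code> and <code>run_workers</code> are invented names, and a simple counter stands in for the shared data.</p>

```python
import threading


class SharedCounter:
    """Shared data whose every read-modify-write is guarded by one semaphore."""

    def __init__(self):
        # A semaphore with a count of 1 admits one thread at a time.
        self._guard = threading.Semaphore(1)
        self.value = 0

    def increment(self):
        # Without the guard, two threads could interleave here and lose an update.
        with self._guard:
            self.value += 1


def run_workers(counter, n_threads, n_increments):
    """Hammer the counter from several threads and return the final value."""
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(run_workers(SharedCounter(), 4, 1000))
```

<p>A run like this coming out clean shows only that the guarded access behaves. It does not show that the previously unguarded access was what triggered the original assert, which is exactly the dilemma.</p>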
<blockquote><p>Raise a task and then leave it&#8230;</p></blockquote>
<p>This approach depends on the economics of the situation. In many projects &#8211; especially ones which have a low level of corporate importance or ones which have highly-iterative development processes &#8211; it may well be appropriate.</p>
<p>On this project, the software was &#8220;mission-critical&#8221;, the customer was paying a high price for high quality, and the iterations were so long that they could be considered traditional waterfall. </p>
<p>Hence, I was horrified that the Project Manager&#8217;s off-the-cuff assessment for the success rate for each bug fix was as low as 99%.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pete</title>
		<link>http://www.somethinkodd.com/oddthinking/2005/11/22/hunting-intermittent-bugs/comment-page-1/#comment-1891</link>
		<dc:creator>Pete</dc:creator>
		<pubDate>Tue, 22 Nov 2005 12:21:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.somethinkodd.com/oddthinking/?p=135#comment-1891</guid>
		<description>Firstly, wouldn&#039;t this type of thing be the proper role of stress testing rather than unit testing?

Secondly, if you cannot explain the behaviour and then _why_ your change fixes the issue then you must assume that you haven&#039;t fixed the bug. It doesn&#039;t matter how many runs through the tests you do.

Raise a task and then leave it...if you don&#039;t find it again then you are done...

...and then when the customer calls and says that it happened you can point to the task and say &quot;Yeah, thanks, but we already know about that one&quot;

while inside you are thinking &quot;Damn, now I&#039;m going to have to do some actual work for a change&quot;.</description>
		<content:encoded><![CDATA[<p>Firstly, wouldn&#8217;t this type of thing be the proper role of stress testing rather than unit testing?</p>
<p>Secondly, if you cannot explain the behaviour and then _why_ your change fixes the issue then you must assume that you haven&#8217;t fixed the bug. It doesn&#8217;t matter how many runs through the tests you do.</p>
<p>Raise a task and then leave it&#8230;if you don&#8217;t find it again then you are done&#8230;</p>
<p>&#8230;and then when the customer calls and says that it happened you can point to the task and say &#8220;Yeah, thanks, but we already know about that one&#8221;</p>
<p>while inside you are thinking &#8220;Damn, now I&#8217;m going to have to do some actual work for a change&#8221;.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
