Some Implications of Bazaar Size

Third Draft. Aug 11, 1998
Copyright 1997-1998, Forrest J. Cavalier, III Mib Software
All rights reserved. Comments welcome!

2. Debugging by testing dramatically lowers effective size

Conditions for making all defects "shallow"
Raymond states that when a bazaar is "large enough," all "defects are found quickly."
To quote at length from Eric Raymond's paper, "The Cathedral and the Bazaar":

8. Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone.

Or, less formally, ``Given enough eyeballs, all bugs are shallow.'' I dub this: ``Linus's Law''.

My original formulation was that every problem ``will be transparent to somebody''. Linus demurred that the person who understands and fixes the problem is not necessarily or even usually the person who first characterizes it. ``Somebody finds the problem,'' he says, ``and somebody else understands it. And I'll go on record as saying that finding it is the bigger challenge.'' But the point is that both things tend to happen quickly.

Here, I think, is the core difference underlying the cathedral-builder and bazaar styles. In the cathedral-builder view of programming, bugs and development problems are tricky, insidious, deep phenomena. It takes months of scrutiny by dedicated few to develop confidence that you've winkled them all out. Thus the long release intervals, and the inevitable disappointment when long-awaited releases are not perfect.

In the bazaar view, on the other hand, you assume that bugs are generally shallow phenomena -- or, at least, that they turn shallow pretty quick when exposed to a thousand eager co-developers pounding on every single new release. Accordingly you release often in order to get more corrections, and as a beneficial side effect you have less to lose if an occasional botch gets out the door.

And that's it. That's enough. If ``Linus's Law'' is false, then any system as complex as the Linux kernel, being hacked over by as many hands as the Linux kernel, should at some point have collapsed under the weight of unforseen bad interactions and undiscovered ``deep'' bugs. If it's true, on the other hand, it is sufficient to explain Linux's relative lack of bugginess.

And maybe it shouldn't have been such a surprise, at that. Sociologists years ago discovered that the averaged opinion of a mass of equally expert (or equally ignorant) observers is quite a bit more reliable a predictor than that of a single randomly chosen one of the observers. They called this the ``Delphi effect''. It appears that what Linus has shown is that this applies even to debugging an operating system -- that the Delphi effect can tame development complexity even at the complexity level of an OS kernel.

I am indebted to Jeff Dutky <dutky@wam.umd.edu> for pointing out that Linus's Law can be rephrased as ``Debugging is parallelizable''. Jeff observes that although debugging requires debuggers to communicate with some coordinating developer, it doesn't require significant coordination between debuggers. Thus it doesn't fall prey to the same quadratic complexity and management costs that make adding developers problematic.

In practice, the theoretical loss of efficiency due to duplication of work by debuggers almost never seems to be an issue in the Linux world. One effect of a ``release early and often policy'' is to minimize such duplication by propagating fed-back fixes quickly.

Brooks even made an off-hand observation related to Jeff's: ``The total cost of maintaining a widely used program is typically 40 percent or more of the cost of developing it. Surprisingly this cost is strongly affected by the number of users. More users find more bugs.'' (my emphasis).

More users find more bugs because adding more users adds more different ways of stressing the program. This effect is amplified when the users are co-developers. Each one approaches the task of bug characterization with a slightly different perceptual set and analytical toolkit, a different angle on the problem. The ``Delphi effect'' seems to work precisely because of this variation. In the specific context of debugging, the variation also tends to reduce duplication of effort.

So adding more beta-testers may not reduce the complexity of the current ``deepest'' bug from the developer's P.O.V., but it increases the probability that someone's toolkit will be matched to the problem in such a way that the bug is shallow to that person.

Linus coppers his bets, too. In case there are serious bugs, Linux kernel version are numbered in such a way that potential users can make a choice either to run the last version designated ``stable'' or to ride the cutting edge and risk bugs in order to get new features. This tactic is not yet formally imitated by most Linux hackers, but perhaps it should be; the fact that either choice are available makes both more attractive.
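The quoted argument that debugging is parallelizable can be illustrated with a simple channel-counting model: the n(n-1)/2 pairwise count usually associated with Brooks, set against one channel per debugger when everyone reports to a single coordinating developer. This is a minimal sketch under that assumption, not a measurement. The function names are introduced here for illustration, and every channel is treated as equally costly.

```python
# Hypothetical channel-counting model (a simplification, not a measurement):
# developers who must all coordinate with each other need one channel per pair,
# while debuggers who report only to a coordinating developer need one each.

def pairwise_channels(people: int) -> int:
    """Channels when everyone must coordinate with everyone: n(n-1)/2."""
    return people * (people - 1) // 2

def hub_channels(debuggers: int) -> int:
    """Channels when each debugger talks only to one coordinating developer."""
    return debuggers

if __name__ == "__main__":
    print("people  pairwise  hub")
    for n in (10, 100, 1000):
        print(n, pairwise_channels(n), hub_channels(n))
```

Going from 100 to 1,000 participants multiplies the pairwise count by about a hundred (4,950 to 499,500) but the hub count only by ten. That asymmetry is what the quoted passage relies on.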

Raymond then goes on to describe the necessity of "frequent releases" to permit testing and defect detection. He does not discuss shortcomings of this method of testing or alternative defect detection methods.

It has been well established elsewhere that testing is not a silver bullet for defect detection; that is, some defects are not easily located by testing. Some reasons for this become apparent when we examine the requirements for detecting a defect by testing. These requirements have implications for the bazaar debugging method of locating defects through user testing.

In order for a defect to be detected using testing:

1) The defective code must be "present" and "reachable"
2) The defective code must be "triggered" and run
3) The erroneous or unintended result must be detected (noticed).
4) The erroneous or unintended result must be "reported" (advertised) to the developers.

In general, for any given defect, each successive condition will be met at only a subset of the sites where the preceding conditions were met. Therefore the "effective size" at each step is never larger than, and usually smaller than, the effective size at the preceding step. This is extremely important: many factors reduce the "pass-through" portion at each step, and because these reductions compound, the "effective working size" of the entire defect removal sequence can be quite small.
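The compounding described above can be sketched as a running product of per-step pass-through fractions. In this minimal sketch the user count and all four fractions are hypothetical numbers chosen for illustration, not data from any real bazaar, and `effective_sizes` is a name introduced here.

```python
from typing import List, Tuple

def effective_sizes(users: float,
                    pass_through: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return the effective size remaining after each step.

    Each fraction is the (assumed) portion of sites that met the previous
    condition and also meet this one, so sizes can only shrink or stay equal.
    """
    sizes = []
    remaining = users
    for step, fraction in pass_through:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction for {step!r} must be in [0, 1]")
        remaining *= fraction
        sizes.append((step, remaining))
    return sizes

if __name__ == "__main__":
    # All fractions below are hypothetical, for illustration only.
    steps = [
        ("present/reachable", 0.30),  # installed the affected code
        ("triggered", 0.10),          # of those, exercise the defective path
        ("detected", 0.20),           # of those, notice the wrong result
        ("reported", 0.05),           # of those, send a report
    ]
    for step, size in effective_sizes(10_000, steps):
        print(f"{step:>18}: {size:8.2f}")
```

With these assumed fractions, 10,000 users shrink to 3,000, then 300, then 60, and finally an expected 3 reporters. No single step has to be extreme for the product to be small.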

Only if all four conditions are met is the defect "advertised" and fed back to the developers for product improvement.

The debugging method Raymond describes assumes an effective size large enough that step 4 is completed (that is, the effective size at step 4 is at least one). But there are reduction factors at each step which threaten to overwhelm the numbers of even the largest bazaars.
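Whether the effective size at step 4 reaches one can also be framed as a probability. This minimal sketch assumes, as a simplification not made in the text, that every user independently carries the defect through all four steps with the same probability q (the product of the four pass-through fractions). The function names and numbers are hypothetical.

```python
import math

def prob_at_least_one_report(users: int, q: float) -> float:
    """Chance that at least one of `users` independent users reports the defect.

    Assumes (hypothetically) each user passes all four steps with probability q.
    """
    return 1.0 - (1.0 - q) ** users

def users_needed(q: float, target: float) -> int:
    """Smallest user count whose chance of at least one report reaches `target`."""
    if not (0.0 < q < 1.0 and 0.0 < target < 1.0):
        raise ValueError("q and target must both lie strictly between 0 and 1")
    # Solve 1 - (1 - q)**n >= target for n; log1p keeps precision for small q.
    return math.ceil(math.log1p(-target) / math.log1p(-q))

if __name__ == "__main__":
    q = 0.30 * 0.10 * 0.20 * 0.05  # hypothetical per-step fractions, q = 0.0003
    print("users for a 95% chance of a report:", users_needed(q, 0.95))
    print("chance of a report with 10,000 users:", prob_at_least_one_report(10_000, q))
```

With q = 0.0003, a 95% chance of at least one report requires close to 10,000 users. Even then there remains roughly a 5% chance that nobody reports the defect at all.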

For a new defect to be exposed by bazaar-style debugging by testing, the defective code must be installed, reachable, and triggered.

2.1 Factors affecting defect "installation"

2.2 Factors affecting defect "reachability" and "triggering"

2.3 Factors affecting defect "detection"

2.4 Factors affecting defect "reporting"

[In an e-mail discussion in August 1998, Randy Boring <randy.boring@thursby.com> pointed out that despite the factors mentioned here, open source debugging has advantages over closed source attempts at the same thing. He allows me to include his summary commentary, "Commentary by Randy Boring on debugging by testing: Open is still much better than closed source." His points are insightful and, since I have little basis for comparing the "cathedral" with the "bazaar" in this respect, they provide good balance to what I wrote. - F. Cavalier 11 Aug 1998]

Other steps in defect removal
After a defect is reported, it must be characterized and corrected. But once a defect is advertised, the effective size for these tasks is no longer determined by the pass-through portions of the preceding sequence. It is not necessary for the reporting participant to actually find and fix the defect. This is recognized by Linus and mentioned by Raymond: "Linus demurred that the person who understands and fixes the problem is not necessarily or even usually the person who first characterizes it."

Secondary Effects of Defects
There can be negative effects on the bazaar if a defect propagates only partially through the sequence. If a defect is exposed through step 3 but the observer does not report it, then the bazaar and the developers cannot learn from it and improve. Worse, this partial exposure often harms the reputation of the bazaar and the product. Observing a defect influences the observer's opinion of the product, and may even lead the observer to abandon it for an alternative, share complaints with others, and so on.
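The imbalance between silent observers and reporters can be estimated from the reporting step alone. This minimal sketch assumes a single hypothetical `report_rate`, the fraction of users who notice a defect (step 3) and then report it (step 4). The function name is introduced here for illustration.

```python
def silent_observers_per_report(report_rate: float) -> float:
    """Expected number of users who noticed the defect but stayed silent,
    for each report the developers receive (hypothetical model)."""
    if not 0.0 < report_rate <= 1.0:
        raise ValueError("report_rate must be in (0, 1]")
    # Observers split into reporters (rate r) and silent ones (1 - r),
    # so the silent-to-reporter ratio is (1 - r) / r.
    return (1.0 - report_rate) / report_rate

if __name__ == "__main__":
    for rate in (0.5, 0.2, 0.05):  # hypothetical reporting rates
        print(f"report rate {rate}: {silent_observers_per_report(rate):.1f} silent per report")
```

At a report rate of 0.05, roughly nineteen users have seen the defect and said nothing for every report the developers receive. The ratio does not depend on the size of the bazaar.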
