Jan 11

The Binomial Distribution and the Importance of Continuous Data


Working with proportions can sometimes require huge sample sizes to distinguish between different treatments or products, making a hypothesis test unfeasible due to cost or time constraints. To illustrate, consider ACME Widgets, a company with fairly low margins that nevertheless feels pressure to innovate its widget offering in order to bolster sales. To qualify a new product, the company runs an assortment of qualifying tests, one of which is a screening test for certain cosmetic flaws that show up during use. These flaws haven’t been deal-breakers in the past, but customers are starting to buy the competitor’s widgets, which reportedly don’t have these flaws. It seems ACME has to do better to remain viable.

Their current test consists of running 30 widgets through a destructive test that simulates one year’s worth of use and counting the number of widgets that exhibit the flaw in question. Typically, about 2 out of 30 fail the test, and they scrap the entire batch if they get more than 3 failures. ACME wants to improve its failure rate, and its engineers have a few ideas up their sleeves. So, how can ACME know if their ideas will result in fewer failures?

Solve, Part I:

Working with proportions can be tricky. Let’s take a look at the binomial probabilities associated with their test as it currently stands (I’ll be using my very own Stats Helper, an inexpensive Excel tool, throughout this post).

Binomial Probabilities

As you can see, with ACME running only 30 parts and typically seeing 2 failures, they should expect anywhere from 0 (you can’t have -1 bad parts) to 5 bad parts in a batch. Already, you should see a problem with their current procedure of scrapping an entire batch on more than 3 failures: their typical “good” product can sometimes produce as many as 5 failed widgets per batch.

But the real problem is figuring out if their new top secret Widget Formula B is better. Even if they test 30 parts and experience 0 failures, that doesn’t mean they’ve made a better widget. Hovering over 0 in the distribution graph shows us that there’s a 12.6% chance of getting zero failures in a standard batch.

Hovering shows a 12.6% chance of getting zero failures
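The same numbers fall out of a quick calculation with scipy (the 2/30 historical failure rate is taken straight from ACME’s typical results; Stats Helper may round slightly differently):

```python
# Binomial probabilities behind ACME's 30-widget screening test,
# assuming the historical failure rate is 2/30 (about 6.7%).
from scipy.stats import binom

n, p = 30, 2 / 30  # 30 widgets per batch, ~6.7% historical failure rate

# Chance of zero failures in a perfectly ordinary batch
p_zero = binom.pmf(0, n, p)
print(f"P(0 failures)     = {p_zero:.3f}")          # ~0.126, i.e. 12.6%

# Nearly all "good" batches land between 0 and 5 failures
p_zero_to_five = binom.cdf(5, n, p)
print(f"P(0-5 failures)   = {p_zero_to_five:.3f}")  # ~0.987

# Chance the scrap rule (more than 3 failures) condemns a good batch
p_scrap = binom.sf(3, n, p)
print(f"P(more than 3)    = {p_scrap:.3f}")         # ~0.136
```

That last line is the kicker: under the current rule, roughly one in seven perfectly ordinary batches gets scrapped for no reason at all.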

Looking at power and sample size, ACME would have to run about 215 widgets through its destructive test in order to determine if their new Widget Formula B has cut the problem in half (from about 6% failure in the test to about 3% failure).

Over 200 samples are needed to determine a change
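I can’t reproduce Stats Helper’s exact power calculation here, but the textbook normal-approximation formula for a one-sample, one-sided test of a proportion lands in the same ballpark; the 6% and 3% rates come from the scenario above, while the alpha and power settings are my assumptions:

```python
# Ballpark sample size for detecting a drop in failure rate from ~6% to ~3%.
# Uses the standard normal-approximation formula for a one-sample, one-sided
# proportion test; alpha = 0.05 and power = 0.80 are assumed settings.
# The exact answer depends on the test and software used, but any way you
# slice it you need hundreds of parts.
from math import sqrt
from scipy.stats import norm

p0, p1 = 0.06, 0.03          # current vs hoped-for failure rate
alpha, power = 0.05, 0.80    # assumed settings

z_a = norm.ppf(1 - alpha)    # one-sided critical value
z_b = norm.ppf(power)

n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
print(f"Approximate sample size: {n:.0f} widgets")  # several hundred
```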

Looking ahead, ACME’s new widget would have to run 215 parts with 2 or fewer failures to be considered better than its current widget. At 3 failures, the p-value is .056, but at 2 failures, the p-value drops to .02, which is low enough to conclude that there’s a difference between the two types of widget.

215 samples with 2 or fewer failures is enough to show a difference

Unfortunately, with ACME’s current low margins, testing over 200 parts every time the engineers have an idea just won’t fly. There has to be a better way.

Solve, Part II:

Fortunately, with a simple shift in perspective, proportional data can often be transformed into continuous data. In ACME’s case, all they need to do is move from their current method (testing a sample of parts once and recording the number of failures) to a new method that tests each part to failure, which will allow them to create distributions of continuous data, drastically reducing their required sample size.

To illustrate, let’s assume that ACME runs only 10 of its current widgets through their screening test. As predicted, 2 of them fail the first time the test is run. Then they run the test again, and 4 more widgets fail. They repeat this cycle until all their widgets have failed, and they record the cycle number for which each widget failed. Then they perform the test again, but this time with 10 widgets of their new Widget Formula B. The results of their test are shown below:
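The bookkeeping in that procedure is simple: each widget’s recorded value is just the cycle on which it failed. Only the cycle 1 and cycle 2 counts are given above, so the later cycles below are hypothetical stand-ins to round out the batch of 10:

```python
# Turn per-cycle failure counts into one cycles-to-failure value per widget.
# Cycle 1 (2 failures) and cycle 2 (4 more) come from the example above;
# the cycle 3 and 4 counts are made up to bring the batch to 10 widgets.
failures_per_cycle = {1: 2, 2: 4, 3: 3, 4: 1}  # cycle -> widgets that failed

cycles_to_failure = [
    cycle
    for cycle, count in failures_per_cycle.items()
    for _ in range(count)
]
print(cycles_to_failure)  # [1, 1, 2, 2, 2, 2, 3, 3, 3, 4]
```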

Results of the two experiments

Now, ACME can run a 2-Sample T-Test, and see if there’s a difference between the two types of widgets. The results are shown below:

Results of ACME's better testing methods

All indications show that there is (just barely) a difference between the old and the new widget: the confidence interval graphs don’t show any overlap, there’s a very low p-value, and the confidence interval for the difference between the means does not cross zero. Another important point is that ACME has been able to glean a lot more information while cutting the number of widgets needed for each test by about 90%. Clearly, having continuous data is far better than working with proportions.
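ACME’s actual numbers live in the screenshots above, so as a sketch, here are made-up cycles-to-failure values (the old formula failing earlier, Formula B lasting longer) run through scipy’s two-sample t-test:

```python
# Two-sample t-test on cycles-to-failure, mirroring ACME's improved method.
# The data below are illustrative stand-ins, not ACME's actual results.
from scipy.stats import ttest_ind

old_widget = [1, 1, 2, 2, 2, 2, 3, 3, 4, 5]   # hypothetical cycles to failure
formula_b  = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]   # hypothetical; lasts longer

result = ttest_ind(formula_b, old_widget, equal_var=False)  # Welch's t-test
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

With 10 parts per group instead of 215, the test still resolves a difference, because each part now contributes a measurement instead of a single pass/fail bit.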


This problem isn’t really one that requires building a tool to solve. I suppose I could slip in another blatant plug for Stats Helper, but really, the point is that sometimes a simple twist in the way we gather data can move us from very poor knowledge to making sound, solid decisions. In this case, moving from testing once to testing to failure does the trick. In nearly every case where attribute data is gathered, there is a way to change it to continuous data. It’s not always apparent at first, but once you’ve figured it out, you’ll wonder why it took you so long.


Consider this an invitation to spread the word about continuous data. Can you think of examples from your work or studies that could benefit from switching to continuous data? Are you stumped on how to make the transition? Ask away in the comments section below.
