18 Apr 11

Why is 30 the “Magic Number” for Sample Size?

Problem:

It seems like whenever people learn about statistical problem solving, the sample size question comes up. Invariably, the number 30 is bandied about as a sweet spot that should get the job done. Astute learners generally want to understand why 30 seems to work. Read on to find out why.

Solve:

The answer really hinges on an understanding of how confidence intervals for the standard deviation are created, and how they rely on the sample size for their accuracy: the larger the sample size, the better the accuracy of the standard deviation estimate. Here’s the formula for the upper and lower confidence limits on standard deviation:
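In the usual textbook notation (assumed here, along with normally distributed data), with s the sample standard deviation, n the sample size, α = 0.05 for 95% confidence, and χ²(p, n−1) the value below which a fraction p of the chi-squared distribution with n − 1 degrees of freedom falls, the limits are:

```latex
\sqrt{\frac{(n-1)\,s^{2}}{\chi^{2}_{1-\alpha/2,\;n-1}}}
\;\le\; \sigma \;\le\;
\sqrt{\frac{(n-1)\,s^{2}}{\chi^{2}_{\alpha/2,\;n-1}}}
```

The larger chi-squared value sits under the lower limit and the smaller one under the upper limit, which is why the upper limit is the one that swings the most when n is small.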

Rather than go into a lengthy explanation of chi-squared distributions and how the formula is derived, it’s easier to visualize what’s going on. Imagine that we’re measuring the melting point of blue candles, and after each measurement we calculate the mean, the standard deviation, and the 95% confidence limits on the standard deviation. For the sake of argument, let’s assume that we know from previous experience that the mean melting point is 100°F, with a standard deviation of 3°F. If we start taking samples and calculating as we go, we get something like this:

[Table: The first few samples]

With only one sample, we can’t calculate a standard deviation at all, but look at what happens to our best guess of the standard deviation (s) as we take each additional sample. It starts at 1.7, moves down to 1.3, and then jumps up to 1.9. Furthermore, look at the limits. While the lower limit isn’t changing much, the upper limit is certainly bouncing around. How long do we have to continue taking samples until the standard deviation and limits stop bouncing around?
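To try the running calculation yourself, here is a minimal Python sketch. It is not the Excel demo’s actual formulas: the function names (chi2_cdf, chi2_quantile, std_dev_limits, running_limits), the random seed, and the choice to invert the chi-squared distribution by simple bisection are all assumptions made for illustration, and the readings are assumed to be normally distributed with the mean of 100 and standard deviation of 3 from the example.

```python
import math
import random
import statistics


def chi2_cdf(x, dof):
    """Chi-squared CDF, using the power series for the regularized
    lower incomplete gamma function (fine for modest dof, not huge ones)."""
    if x <= 0:
        return 0.0
    a, z = dof / 2.0, x / 2.0
    term = total = 1.0
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total


def chi2_quantile(p, dof):
    """Lower-tail quantile, found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, dof) < p:  # widen the bracket until it contains p
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def std_dev_limits(s, n, confidence=0.95):
    """Lower and upper confidence limits on sigma, given s from n readings."""
    alpha = 1.0 - confidence
    dof = n - 1
    lower = math.sqrt(dof * s**2 / chi2_quantile(1.0 - alpha / 2.0, dof))
    upper = math.sqrt(dof * s**2 / chi2_quantile(alpha / 2.0, dof))
    return lower, upper


def running_limits(n_samples=30, mean=100.0, sigma=3.0, seed=1):
    """Take readings one at a time and recompute s and its limits each time."""
    rng = random.Random(seed)
    data, rows = [], []
    for n in range(1, n_samples + 1):
        data.append(rng.gauss(mean, sigma))
        if n < 2:
            continue  # a single reading has no standard deviation
        s = statistics.stdev(data)
        lower, upper = std_dev_limits(s, n)
        rows.append((n, s, lower, upper))
    return rows


for n, s, lower, upper in running_limits():
    print(f"n={n:2d}  s={s:5.2f}  lower={lower:5.2f}  upper={upper:6.2f}")
```

Each seed gives a different run, so the printed values will not match the numbers quoted above, but the pattern should look similar: a very wide upper limit for the first handful of readings that tightens as n grows.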

Create:

The best way to see is to create a graph of the standard deviation and limits, calculated at each sample. Here’s the graph:

Notice how the confidence limits bounce around a lot at the beginning, then calm down after a while? That settling is why 30 samples is usually deemed sufficient. If we recreate our chart with some new measurements, here’s what we get:

In this case, we didn’t get as much bouncing as the first time, so we can be confident earlier on. However, it’s very hard to know beforehand how much bouncing around you’ll get, so most people stick with 30 samples, just to be sure.
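One way to check this without any formulas is to simulate it. The sketch below is a rough illustration, not the Excel demo: the function name spread_of_s, the seed, and the trial count are assumptions, and the readings are again assumed to be normally distributed with a mean of 100 and a standard deviation of 3. It draws many samples of a given size and reports the range that the middle 95% of the resulting s values fall into.

```python
import random
import statistics


def spread_of_s(n, sigma=3.0, mean=100.0, trials=1000, seed=0):
    """Return the approximate 2.5th and 97.5th percentiles of s
    across many simulated samples of size n."""
    rng = random.Random(seed)
    estimates = sorted(
        statistics.stdev([rng.gauss(mean, sigma) for _ in range(n)])
        for _ in range(trials)
    )
    low = estimates[int(0.025 * trials)]
    high = estimates[int(0.975 * trials) - 1]
    return low, high


for n in (5, 10, 30, 100):
    low, high = spread_of_s(n)
    print(f"n={n:3d}: s mostly falls between {low:.2f} and {high:.2f} (true value 3)")
```

In runs like this, the range typically narrows quickly over the first few dozen readings and more slowly after that, which fits the rule of thumb.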

Share:

But don’t just take my word for it. I’ve made an Excel Demo that you can play with. Just input your parameters, and it will calculate a sampling scenario. Pressing F9 will force Excel to choose new samples and recalculate the graph. If you’re curious about the math and the Excel functions, just unlock the worksheet and have a look (there’s no password required; I locked the worksheet only to keep it simple).

Looking for a way to calculate how many samples you need? Take a look at the software page, and see if Stats Helper is right for you.

5 Comments

  1. Dalila says:

    Does this type of analysis have a name? I have never taken a statistical analysis class and I have been combing the internet to find the best way to determine when I have taken enough samples. I want my mean, standard deviation and confidence level to be meaningful but don’t want to take 1000 samples when 100 would work. Can you explain this analysis or point me in the right direction? I appreciate your time in reading this.
    Dalila

    1. Jed Campbell says:

      I think what you’re looking for is called “Power and Sample Size.” There’s a great online manual for statistics done by NIST and Sematech. Their entry on this subject is at: http://www.itl.nist.gov/div898/handbook/prc/section2/prc222.htm

      Also, Googling “Power and Sample Size” will lead you to some other great articles and free calculators. Let me know if you need more direction.

  2. Dalila says:

    Jed,
    Thank you. This has helped me understand this area of statistics a little more. However, I found that I had more questions than answers after looking up chi-square distributions. I was looking for an equation since in your spreadsheet the chi squared value was calculated with an excel formula. I came across several different chi-squared equations and chi test. Do you mind posting the details of the equation used by excel? No need to explain it in great detail since you said it would be lengthy. I just need a place to start. Thank you.
    Dalila

    1. Jed Campbell says:

      From what I understand, most of the …INV functions in Excel use an iterative guessing process to home in on the correct number through brute force. This is basically due to the nature of a “Complex Logarithm” (http://en.wikipedia.org/wiki/Complex_logarithm – not an easy read) that would be needed if you tried to solve the inverse function algebraically.

      For a more down-to-earth approach to the CHIINV function, I like Microsoft’s explanation on their Excel 2003 help page, found here: http://support.microsoft.com/kb/828313

      A more in-depth treatment of the subject is here: http://www.aaec.ttu.edu/faculty/eelam/3401/CourseMaterials/Notes_Fall07/Notes_Chi-Square.pdf (PDF warning).

      If I were just starting out, I’d check out a few library books before buying anything: Stats books are notoriously polarizing. One that I might be tempted to recommend has 5 reviews, 2 that like it and 3 that hate it.
      http://www.amazon.com/Applied-Statistics-Engineers-Physical-Scientists/dp/0136017983/ref=sr_1_4?s=books&ie=UTF8&qid=1316901182&sr=1-4
