18 Apr 11

## Why is 30 the “Magic Number” for Sample Size?

### Problem:

It seems like whenever people learn about statistical problem solving, the sample size question comes up. Invariably, the number 30 is bandied about as a sweet spot that should get the job done. Astute learners generally want to understand why 30 seems to work. Read on to find out why.

### Solve:

The answer really hinges on an understanding of how confidence intervals for the standard deviation are created, and how they rely on the sample size for their accuracy: the larger the sample size, the better the accuracy of the standard deviation estimate. Here’s the formula for the upper and lower confidence limits on standard deviation:
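
For reference, the usual chi-squared form of these limits can be written as below, assuming the measurements are roughly normally distributed. The symbols are labels chosen for this sketch: $n$ is the sample size, $s$ is the sample standard deviation, $\alpha$ is one minus the confidence level (0.05 for 95%), and $\chi^2_{p,\,n-1}$ is the value that a chi-squared variable with $n-1$ degrees of freedom falls below with probability $p$.

```latex
\sqrt{\frac{(n-1)\,s^{2}}{\chi^{2}_{1-\alpha/2,\;n-1}}}
\;\le\; \sigma \;\le\;
\sqrt{\frac{(n-1)\,s^{2}}{\chi^{2}_{\alpha/2,\;n-1}}}
```

The larger chi-squared value sits under the lower limit and the smaller one under the upper limit, which is why the interval is lopsided around $s$ for small $n$.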

Rather than go into a lengthy explanation of chi-squared distributions and how the formula is derived, it’s easier to visualize what’s going on. Imagine that we’re taking samples of the melting point of blue candles, and after each sample, we calculate the mean, the standard deviation, and the 95% confidence limits for where the true standard deviation could be. For the sake of argument, let’s assume that we know from previous experience that the mean melting point is 100°F, with a standard deviation of 3°F. If we start taking samples and calculating as we go, we get something like this:

With only one sample, we can’t calculate much in terms of standard deviation, but look at what happens to our best guess of the standard deviation (s) as we take each sample. It starts at 1.7, moves down to 1.3, and then jumps up to 1.9. Furthermore, look at the limits. While the lower limit isn’t changing much, the upper limit is certainly bouncing around. How long do we have to continue taking samples until the standard deviation and limits stop bouncing around?

### Create:

The best way to see is to create a graph of the standard deviation and limits, calculated at each sample. Here’s the graph:

Notice how the confidence limits bounce around a lot at the beginning, then calm down after a while? This is why 30 samples is usually deemed sufficient. If we recreate our chart with some new measurements, here’s what we get:

In this case, we didn’t get as much bouncing as the first time, so we can be confident earlier on. However, it’s very hard to know beforehand how much bouncing around you’ll get, so most people stick with 30 samples, just to be sure.
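
If you’d rather experiment in code, here is a rough Python sketch of the same running calculation. It assumes normally distributed measurements with the example’s mean of 100 and standard deviation of 3, and it uses only the standard library. All of the function names (`chi2_cdf`, `chi2_ppf`, `sd_limits`, `running_limits`) are made up for this illustration, and the chi-squared inverse is found by a simple bisection search rather than by whatever method a spreadsheet uses.

```python
import math
import random
import statistics


def chi2_cdf(x, df):
    """Chi-squared CDF, via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > total * 1e-15:  # all terms are positive, so stop once they are negligible
        term *= half_x / (a + k)
        total += term
        k += 1
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))


def chi2_ppf(p, df):
    """Invert chi2_cdf by bisection: find x such that chi2_cdf(x, df) is p."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:  # widen the bracket until it contains the answer
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def sd_limits(s, n, confidence=0.95):
    """Chi-squared confidence limits for sigma, given sample SD s from n points."""
    alpha = 1.0 - confidence
    df = n - 1
    lower = s * math.sqrt(df / chi2_ppf(1.0 - alpha / 2.0, df))
    upper = s * math.sqrt(df / chi2_ppf(alpha / 2.0, df))
    return lower, upper


def running_limits(mean=100.0, sd=3.0, n_samples=30, seed=None):
    """Draw samples one at a time; from the 2nd on, record (n, s, lower, upper)."""
    rng = random.Random(seed)
    data, rows = [], []
    for n in range(1, n_samples + 1):
        data.append(rng.gauss(mean, sd))
        if n >= 2:  # a standard deviation needs at least two points
            s = statistics.stdev(data)
            lower, upper = sd_limits(s, n)
            rows.append((n, s, lower, upper))
    return rows


if __name__ == "__main__":
    for n, s, lower, upper in running_limits(seed=1):
        print(f"n={n:2d}  s={s:5.2f}  lower={lower:5.2f}  upper={upper:6.2f}")
```

Changing the seed plays the same role as pressing F9 in a spreadsheet: each run bounces differently early on. Under these assumptions, the limits at n = 30 work out to roughly 0.80 × s and 1.34 × s, so the interval is much tighter than at the start but still far from zero width.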

### Share:

But don’t just take my word for it. I’ve made an Excel Demo that you can play with. Just input your parameters, and it will calculate a sampling scenario. Pressing ‘F9’ will force Excel to choose new samples and recalculate the graph. If you’re curious about the math and the Excel functions, just unlock the worksheet and have a look (no password is required; I only locked the worksheet to make it simpler).

Looking for a way to calculate how many samples you need? Take a look at the software page, and see if Stats Helper is right for you.

Does this type of analysis have a name? I have never taken a statistical analysis class and I have been combing the internet to find the best way to determine when I have taken enough samples. I want my mean, standard deviation and confidence level to be meaningful but don’t want to take 1000 samples when 100 would work. Can you explain this analysis or point me in the right direction? I appreciate your time in reading this.

Dalila

I think what you’re looking for is called “Power and Sample Size.” There’s a great online manual for statistics done by NIST and Sematech. Their entry on this subject is at: http://www.itl.nist.gov/div898/handbook/prc/section2/prc222.htm

Also, Googling “Power and Sample Size” will lead you to some other great articles and free calculators. Let me know if you need more direction.

Jed,

Thank you. This has helped me understand this area of statistics a little more. However, I found that I had more questions than answers after looking up chi-squared distributions. I was looking for an equation, since in your spreadsheet the chi-squared value was calculated with an Excel formula. I came across several different chi-squared equations and chi-squared tests. Do you mind posting the details of the equation used by Excel? No need to explain it in great detail since you said it would be lengthy. I just need a place to start. Thank you.

Dalila

From what I understand, most of the …INV functions in Excel use an iterative guessing process to home in on the correct number through brute force. This is basically because the underlying distribution function (for chi-squared, it’s built on the incomplete gamma function) has no closed-form inverse, so you can’t just solve for the answer algebraically.
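
To make the “guess and refine” idea concrete, here is a minimal Python sketch that inverts the normal distribution by bisection. It only illustrates the general approach and is not a description of Excel’s actual algorithm; the names `normal_cdf` and `norm_inv_bisect` are invented for this example, and the starting bracket of ten standard deviations either side of the mean is an assumption that suits ordinary probabilities.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative probability of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def norm_inv_bisect(p, mean=0.0, sd=1.0):
    """Find x with normal_cdf(x) equal to p by repeatedly halving an interval."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    lo, hi = mean - 10.0 * sd, mean + 10.0 * sd  # assumed wide enough to hold the answer
    for _ in range(60):  # each pass halves the interval
        mid = (lo + hi) / 2.0
        if normal_cdf(mid, mean, sd) < p:
            lo = mid  # guess too small: keep the upper half
        else:
            hi = mid  # guess too large (or exact): keep the lower half
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(norm_inv_bisect(0.975))  # close to the familiar 95% z-value of 1.96
```

The same loop works for any distribution where you can compute the cumulative probability but can’t rearrange it to solve for x: swap in a different CDF and a suitable starting bracket.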

For a more down-to-earth approach to the CHIINV function, I like Microsoft’s explanation on their Excel 2003 help page, found here: http://support.microsoft.com/kb/828313

A more in-depth treatment of the subject is here: http://www.aaec.ttu.edu/faculty/eelam/3401/CourseMaterials/Notes_Fall07/Notes_Chi-Square.pdf (PDF warning).

If I were just starting out, I’d check out a few library books before buying anything: Stats books are notoriously polarizing. One that I might be tempted to recommend has 5 reviews, 2 that like it and 3 that hate it.

http://www.amazon.com/Applied-Statistics-Engineers-Physical-Scientists/dp/0136017983/ref=sr_1_4?s=books&ie=UTF8&qid=1316901182&sr=1-4

I downloaded it, but it doesn’t work in my Excel 2007. The macro was not included in the spreadsheet. Could you help me?

No macro is needed. Just press F9 a few times to cause Excel to recalculate, and you should see the results on the screen change.

In the spreadsheet’s table, #NAME? errors appear under the Sample column as well as under “Mean”, “Lower Limit”, “s”, and “Upper Limit”. So I couldn’t run it; in other words, hitting F9 didn’t make it recalculate.

Ah, it turns out that, starting with Excel 2010 or so, the norminv() function became the norm.inv() function. To fix this in Excel 2007, just edit the entries in column B:

-From: =NORM.INV(RAND(),mean,st_dev)

-To: =NORMINV(RAND(),mean,st_dev)

In order to do this, you’ll first need to unprotect the sheet (no password is required). Let me know if this works out for you.

After I changed the function, it works very well. Thanks for your kind help.

Hi Jed,

Great article, has helped me better understand the significance of sample size.

Just one note: could you please check the formula in column D? I believe the -1 should be outside the COUNT function. As it is, it returns a constant 1 degree of freedom no matter what n is.

Are you saying, then, that 30 observations would be sufficient to study the characteristics of a homogeneous population of a few million?

And that the results I get from these 30 samples can therefore be extrapolated to the population?

Please clarify.

My answer would be: “It depends.” If you’re absolutely certain the population is homogeneous, and you’re dealing with a continuous measurement (not an attribute/binomial/proportion type measurement), then somewhere around 30 samples would likely be indicative of the population. However, I strongly doubt that a population of a few million would, in most cases, be truly homogeneous. In determining a sample size, you should also consider possible rational subgroups. Also, bootstrapping is an interesting method that repeatedly resamples your data, with replacement, to build up a picture of how much a statistic varies, and it is very useful for larger datasets.
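
As a rough illustration of the bootstrapping idea, here is a small Python sketch that estimates an interval for the standard deviation by resampling a data set with replacement. This is the plain “percentile” flavor of the bootstrap, chosen for simplicity; the function name `bootstrap_sd_interval` and the default of 2000 resamples are assumptions made for this example, not recommendations.

```python
import random
import statistics


def bootstrap_sd_interval(data, n_resamples=2000, confidence=0.95, seed=None):
    """Percentile bootstrap interval for the standard deviation of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement many times, recording s each time.
    estimates = sorted(
        statistics.stdev(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2.0) * n_resamples)
    hi_idx = int((1.0 - alpha / 2.0) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(100.0, 3.0) for _ in range(30)]  # made-up demo data
    print(statistics.stdev(sample), bootstrap_sd_interval(sample, seed=1))
```

Unlike the chi-squared limits, this makes no assumption that the data are normally distributed, but it can only reflect the variation present in the sample you actually collected.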

I cannot run the Excel file. Please help.

What version of Excel are you using, and what problem are you having? It seems to be working fine on my end.

I am using Excel 2007. I just change the standard deviation, mean, or confidence level, then press F9, but nothing happens. Please tell me, step by step, how to run this Excel file.

Thanks.

In some cases, isn’t sample size going to be a function of your overall population? So, for example, if I want to run a survey on my website and I know I get 50k visitors a week, in order to get a decent sample size, am I not going to need more than just 30 survey responses?

From calculators I’ve seen online (e.g., http://www.surveysystem.com/sscalc.htm), in order for me to have a 95% confidence level and 2.5% confidence interval my sample should be 1491.

Are you saying that there wouldn’t be that much difference between 30 and 1491, or am I missing something?

A sample size of 30 is only “sufficient” if the entire population is homogeneous, and only if you are sampling a continuous variable. In your case, neither assumption would be valid, since it’s highly unlikely that a population of 50k people would be homogeneous, and since (I’m assuming from your mention of a 2.5% confidence interval) you’re not measuring a continuous variable as your response. What you’re actually getting at is referred to as “power and sample size,” which is a very different topic. By the way, Stats Helper, available on the “Software” tab above, has sample size calculators for both continuous and attribute variables.
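
For what it’s worth, the 1491 figure quoted in the question can be reproduced with the textbook sample-size formula for a proportion, plus a finite population correction. The Python sketch below assumes the worst-case proportion of 0.5 and treats the “2.5% confidence interval” as a margin of error of ±0.025; the function name `survey_sample_size` is made up for this example, and I can’t say for certain that any particular online calculator uses exactly this method.

```python
import math
from statistics import NormalDist


def survey_sample_size(population, margin, confidence=0.95, proportion=0.5):
    """Responses needed to estimate a proportion to within +/- margin."""
    # Two-sided z-value for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Size needed for an effectively infinite population.
    n0 = z * z * proportion * (1.0 - proportion) / (margin * margin)
    # Finite population correction shrinks it when the population is small.
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(survey_sample_size(50_000, 0.025))  # 1491
```

Notice that the population size barely matters here: without the correction the answer is 1537, so almost all of the requirement comes from the tight margin of error on a yes/no style measurement, not from having 50k visitors.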

Thanks for the explanation of the difference.