View Full Version : Math question


M.
10-19-2004, 01:12 PM
I know the basic rules for using a normal approximation to estimate the sample size one would need in order to get an estimate of the percentage of a population that has a particular condition (such as a political view, etc.)

However, I'm interested in how one changes the formulas for limited populations. For instance, let's say I have a population of only 75. How many of them would I need to poll to estimate (with 90% confidence) the percentage that have a particular condition? I'm thinking the negative binomial distribution is involved, but I could easily be mistaken. It's been way too long since I took Course 3 or 4.

whisper
10-19-2004, 01:19 PM
However, I'm interested in how one changes the formulas for limited populations.


Why would the formula change just because it's a smaller population?

M.
10-19-2004, 01:23 PM
Why would the formula change just because it's a smaller population?
Well, if the normal formula tells me I need 250 people, for example, that answer doesn't make much sense for a limited population of 75.

Tim><
10-19-2004, 01:26 PM
However, I'm interested in how one changes the formulas for limited populations.


Why would the formula change just because it's a smaller population?
Because you are estimating in a finite population and not an infinite population. Your sample population is a significant part of the point you are estimating.

MountainHawk
10-19-2004, 01:27 PM
The normal approximation doesn't work for small population sizes, does it? You'd need the actual distribution of the population to get that.

asdfasdf
10-19-2004, 01:28 PM
I think it has something to do with the multinomial distribution. I don't think it explicitly follows the normal any more. You can use the normal for an infinite population because of the central limit theorem, which I don't think still holds for such a small population. But I clearly don't know the whole answer .....

Tim><
10-19-2004, 01:28 PM
Typically, when estimating a point based on sample data, you are really estimating a point for the average on non-sampled data. Therefore, I would just divide my confidence interval by the weight of the remaining population. For example, if you wanted to determine a 1% confidence interval, your answer would be twice the interval you solved for if your sample population is half of the total population.

bm1729
10-19-2004, 01:58 PM
M, I don't know if this is what you're looking for, but here goes (from my sampling knowledge):

This is how you would determine a confidence interval for your estimate. Suppose you have a population of size N, and you sample n members from this population in order to determine the proportion, p, that have blue eyes. Suppose t people in your sample have blue eyes. Then clearly p = t/n is your estimate of the actual value of p, and a 95% confidence interval for p is given by:

p +/- 2*sqrt[V(p)] = p +/- 2*sqrt[(pq/(n-1)) * ((N-n)/N)], where q = 1 - p.

Now when N is much greater than n, (N-n)/N is approximately equal to 1, so the above simplifies to:

p +/- 2*sqrt[pq/(n-1)]

So, to answer your question, the formula is always the same whether n is large or small, but the smaller n is, the larger your confidence interval will be (obviously). If you want to have your confidence interval a certain size, then you would solve for the smallest value of n that yields a confidence interval of that size.
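As a rough sketch of the calculation described above (not anyone's official method), the interval and the "solve for the smallest n" step might look like this in Python. The function names, the planning value p = 0.5, and the target half-width are assumptions made for illustration; z = 2 matches the formula above, and z = 1.645 is the usual normal multiplier for 90% confidence.

```python
import math

def proportion_ci(t, n, N, z=2.0):
    """Interval p +/- z*sqrt(pq/(n-1) * (N-n)/N) for t successes
    in a sample of n drawn from a population of N."""
    p = t / n
    q = 1.0 - p
    variance = p * q / (n - 1) * (N - n) / N  # includes the finite-population factor
    half_width = z * math.sqrt(variance)
    return p - half_width, p + half_width

def smallest_n(N, half_width, z=2.0, p=0.5):
    """Smallest sample size whose interval half-width is at most the target.
    p is a planning guess; 0.5 is the most conservative choice (assumption)."""
    for n in range(2, N + 1):
        variance = p * (1.0 - p) / (n - 1) * (N - n) / N
        if z * math.sqrt(variance) <= half_width:
            return n
    return N

if __name__ == "__main__":
    # Population of 75, 90% confidence, estimate within +/- 10 percentage points.
    print(smallest_n(75, 0.10, z=1.645))
    # Polling all 75 people collapses the interval to a single point.
    print(proportion_ci(30, 75, 75))
```

Because of the (N-n)/N factor, the required n can never exceed N, and the interval shrinks to zero width when the whole population is polled.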

M.
10-19-2004, 02:04 PM
bm1729, that's exactly what I'm after. Thanks so much!

Avi
10-19-2004, 02:05 PM
Wouldn't one use Student's t distribution for a smaller sample size with unknown variance, which tends to the normal as the sample gets large?

M.
10-19-2004, 02:20 PM
bm1729, that's exactly what I'm after. Thanks so much!
Actually, after I said that, I started thinking that this still doesn't intuitively make sense. The approach you describe would still produce a confidence interval even if we had a response from all 75 people. It seems that you should have a 100% confidence interval of size 0 when you have everybody's response.

John F. Kennedy
10-19-2004, 02:25 PM
You haven't met 2pac yet, have you?

bm1729
10-19-2004, 02:30 PM
bm1729, that's exactly what I'm after. Thanks so much!
Actually, after I said that, I started thinking that this still doesn't intuitively make sense. The approach you describe would still produce a confidence interval even if we had a response from all 75 people. It seems that you should have a 100% confidence interval of size 0 when you have everybody's response.
When you put n=N into the first confidence interval formula that I gave you, don't you get an interval of size zero?

Actuary321
10-19-2004, 03:13 PM
You haven't met 2pac yet, have you?
:rofl:

M.
10-19-2004, 04:32 PM
When you put n=N into the first confidence interval formula that I gave you, don't you get an interval of size zero?
Whoops, yes I do. Thanks!

BC
10-19-2004, 04:41 PM
For a sufficiently small sample size, you can actually calculate the exact hypergeometric distribution for each possible result.

Actuary321
10-19-2004, 05:57 PM
M, I don't know if this is what you're looking for, but here goes (from my sampling knowledge):

This is how you would determine a confidence interval for your estimate. Suppose you have a population of size N, and you sample n members from this population in order to determine the proportion, p, that have blue eyes. Suppose t people in your sample have blue eyes. Then clearly p = t/n is your estimate of the actual value of p, and a 95% confidence interval for p is given by:

p +/- 2*sqrt[V(p)] = p +/- 2*sqrt[(pq/(n-1)) * ((N-n)/N)], where q = 1 - p.

Now when N is much greater than n, (N-n)/N is approximately equal to 1, so the above simplifies to:

p +/- 2*sqrt[pq/(n-1)]

So, to answer your question, the formula is always the same whether n is large or small, but the smaller n is, the larger your confidence interval will be (obviously). If you want to have your confidence interval a certain size, then you would solve for the smallest value of n that yields a confidence interval of that size.

Isn't the 2 in your formula the approximation for the z at 95% from the normal?

I should know the formula off the top of my head but don't, but IIRC there is a formula for the exact numbers for the CI. Anyone have that? Or should I get out of the profession?

bm1729
10-19-2004, 06:13 PM
Isn't the 2 in your formula the approximation for the z at 95% from the normal?
Yes... maybe... I don't know -- I'm confused.

Well, that's what you would use when sampling a population, anyway, and I hope M found it useful, and doesn't come back to rescind his thank-you a second time. That would be embarrassing. :oops:

Avi
10-19-2004, 07:21 PM
For a sufficiently small sample size, you can actually calculate the exact hypergeometric distribution for each possible result.

I am not sure I follow what you are trying to accomplish.

E.g. for a population size of 10, p is one of {0, .1, .2 ... .9, 1}

poll 5 and get 3 successes.

p-hat = .6

So one could calculate the probability of getting .6 given each of the 11 possible values of p.

How does that help set a confidence interval?

Anyway, hypergeometric? Isn't the question "how many people have a certain condition?" Wouldn't that be binomial, or multi-nomial for conditions with >2 states (like political viewpoints)? I guess I do not understand, but where is the sampling without replacement?

Utanapishtim
10-19-2004, 07:49 PM
Isn't the 2 in your formula the approximation for the z at 95% from the normal?
Yes... maybe... I don't know -- I'm confused.

The interval extends 2 standard deviations on either side of the estimate, which is the approximate half-width of a 95% confidence interval based on a normal distribution. Since we are sampling n individuals, we treat the number of successes in the sample as Binomial(n,p) and approximate it with a Normal(np,npq) in order to compute the confidence interval. The Normal approximation is generally used as long as n>30, isn't it?
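For reference, here is a small sketch of where the multiplier comes from: the two-sided normal quantile for a given confidence level, computed with Python's standard-library `statistics.NormalDist`. The helper name `z_for_confidence` is made up for illustration.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided standard normal multiplier for a confidence level such as 0.95."""
    # Half of the excluded probability sits in each tail.
    return NormalDist().inv_cdf(0.5 + level / 2.0)

if __name__ == "__main__":
    for level in (0.90, 0.95, 0.99):
        print(level, round(z_for_confidence(level), 3))
```

The 95% value is close to 1.96, which is why 2 is commonly used as a shorthand.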

BC
10-19-2004, 08:54 PM
I know the basic rules for using a normal approximation to estimate the sample size one would need in order to get an estimate of the percentage of a population that has a particular condition (such as a political view, etc.)

However, I'm interested in how one changes the formulas for limited populations. For instance, let's say I have a population of only 75. How many of them would I need to poll to estimate (with 90% confidence) the percentage that have a particular condition? I'm thinking the negative binomial distribution is involved, but I could easily be mistaken. It's been way too long since I took Course 3 or 4.

Okay, let's start with the philosophical meaning of a confidence interval. This is more instructive than going by rote formulae. As applied to a similar problem: Suppose I draw 5 white balls out of a bag that has 10 balls in it. What can I say about the probability that the other 5 balls are black?

The answer: that it is between 0 and 1. Absolutely nothing more. For example, if I put 5 white balls and 5 black balls in the bag this morning, I might be a bit surprised that I picked the 5 white ones out first, but I would not be particularly stunned when the next 5 that came out were black.

The statistician turns the problem around as follows: "Given that I have 5 white balls and 5 black balls in the bag, what are the chances that I will take the 5 white ones out first?" The answer, of course, is 251 to 1 against. Hence, the statistician says "that just doesn't happen; the p-value is about 0.4%, so I will reject my original assumption at the 99.6% confidence level."

So: Out of the 75 balls, suppose you draw n balls, and have m white balls (and n-m black balls).

Given that there were originally B white balls, the probability of drawing m white balls in n tries is

p(m|B) = C(B, m) * C(75-B, n-m) / C(75, n).

If sum(k = 0..m) p(k|B) <= 2.5%, then you can reject the hypothesis that there are at least B white balls in the bag at the 95% confidence level. Similarly, if sum(k = m..n) p(k|B) <= 2.5%, you can reject the hypothesis that there are at most B white balls in the bag at the 95% confidence level (note that I used 2.5% instead of 5% to allow for a 2-sided test).

This shouldn't be hard to calculate on a spreadsheet for small N.
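In place of a spreadsheet, the probability and the two tail sums described above could be sketched in Python roughly as follows. The function names are invented for illustration, and the tails are summed over the number of white balls drawn (0 to m for the lower tail, m to n for the upper tail), which is an interpretation of the sums above.

```python
from math import comb

def hypergeom_pmf(k, n, B, N=75):
    """P(k white balls in n draws without replacement | B white among N)."""
    # Impossible outcomes get probability 0 (also avoids negative arguments to comb).
    if k < 0 or k > n or k > B or n - k > N - B:
        return 0.0
    return comb(B, k) * comb(N - B, n - k) / comb(N, n)

def lower_tail(m, n, B, N=75):
    """P(at most m white balls drawn | B)."""
    return sum(hypergeom_pmf(k, n, B, N) for k in range(0, m + 1))

def upper_tail(m, n, B, N=75):
    """P(at least m white balls drawn | B)."""
    return sum(hypergeom_pmf(k, n, B, N) for k in range(m, n + 1))

if __name__ == "__main__":
    # 10 balls, 5 white: chance that the first 5 draws are all white.
    print(hypergeom_pmf(5, 5, 5, N=10))  # 1/252
```

A value of B is rejected when either tail is at or below 2.5%.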

Incidentally, if you think n << N and want to use an approximation (for N large) I suggest using the method above with a Poisson, rather than a normal.

10 balls, 5 draws, all 5 white:

B = 5: 1/252; reject.
B = 6: 6/252; reject (barely).
B = 7: 21/252; accept.
B = 8: 56/252; accept.
B = 9: 1/2; accept.
B = 10: 1; accept.

2-sided 95% confidence interval would be [7,10].

10 balls, 5 draws, 4 white, 1 black:

B = 4: 6/252; reject (barely).
B = 5: 25/252 (4/1) + 1/252 (5/0); accept - 4 or more white happens often enough.
...
B = 9: 1/2+1/2 = 1; accept - you will always have 4 (slightly less than expected) or 5 (slightly more than expected) white balls.
B = 10: 0; reject. 2-sided confidence interval would be [5,9].

10 balls, 5 draws, 3 white, 2 black:

B = 0-2: 0; reject.
B = 3: 21/252; accept.
B = 4: 60/252 (3w/2b) + 6/252 (4w/1b); accept. The upper-bound test would be 60/252 (3w/2b) + 120/252 (2w/3b) + 60/252 (1w/4b) + 6/252 (0w/5b) = 246/252 (accept). Usually, if above expected, you only need to do the upper bound test and if below expected, you only need to do the lower bound test.
B = 5: ...
B = 8: 56/252 (3w/2b); accept
B = 9: 0; reject.

So we have a confidence interval of [3,8]. This is not surprising, and tells us nothing we hadn't already figured out.

Results will look a LOT more reasonable with more than 10 balls, or a narrower confidence interval, but they do demonstrate the mechanics.

Calculation of these numbers is more challenging, but not impossible (as I recently proved to myself) for some Ns as high as the 5-digit range.
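The 10-ball examples above can be reproduced with a short script. This is only a sketch of the accept/reject procedure as described in this post: for each candidate B, both tail probabilities are computed and B is kept when neither is at or below 2.5%. The function names and the `alpha` parameter are assumptions for illustration.

```python
from math import comb

def pmf(k, n, B, N):
    """P(k white in n draws without replacement | B white among N balls)."""
    if k < 0 or k > n or k > B or n - k > N - B:
        return 0.0
    return comb(B, k) * comb(N - B, n - k) / comb(N, n)

def confidence_set(m, n, N, alpha=0.05):
    """Range of B not rejected by the 2-sided test after drawing m white in n."""
    accepted = []
    for B in range(N + 1):
        lower = sum(pmf(k, n, B, N) for k in range(0, m + 1))  # P(<= m white)
        upper = sum(pmf(k, n, B, N) for k in range(m, n + 1))  # P(>= m white)
        if lower > alpha / 2 and upper > alpha / 2:
            accepted.append(B)
    if not accepted:
        return None
    return accepted[0], accepted[-1]

if __name__ == "__main__":
    for m in (5, 4, 3):
        print(m, "white of 5 drawn from 10:", confidence_set(m, 5, 10))
```

For the population of 75 from the original question, the same function can be called with N=75 and whatever n and m were observed.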

Mel-o-rama
10-20-2004, 03:52 PM
I saw an episode of the Gilmore Girls a week ago, and they were having this election, and this guy said he was conducting a poll to see who would win, and he said something like "Gallup uses 1000 people to represent the whole nation. So using the same proportions, I only need to poll .002 people, but that rounds up to one. So I polled myself and Candidate X is going to win." And wouldn't you know it? Candidate X won!

Frenchie
10-20-2004, 04:24 PM
i remember my int. theory prof saying rule of thumb was n>= 30 to use normal approximation. But then again, he wrote questions for the exams, so he probably wasn't all that reliable... :lol: