Context
Part VI: Sampling. Accuracy of Percentages.

Previously, we assumed that we knew the contents of the box and argued about chances for the draws based on this knowledge. In survey work, we frequently need to turn this reasoning around and argue from the draws to the box. This is called inference.

Three main new ideas: the bootstrap method to estimate the SE, the margin of error, and confidence intervals.

Example 1
A simple random sample of 2500 voters is taken. In the sample, 1328 people favor the candidate. This is

$\frac{1328}{2500} \times 100\% \approx 53\%$

Should the candidate enter the primary? The crucial question is: how wrong is this estimate likely to be?

Terminology: a number that describes the population is a parameter; a number computed from the sample is a statistic.
The likely size of the chance error is given by the standard error, and to calculate that we need a box model: a box with one ticket per voter, marked 1 (favors the candidate) or 0 (does not), where the number of each kind of ticket is unknown.

Population: 100,000 voters in the district
Parameter: percentage of voters who favor the candidate
Sample: 2500 people who were polled
Statistic: percentage of voters in the sample who favor the candidate: 53%

SD of the box = $(1 - 0) \times \sqrt{(\text{fraction of 1's}) \times (\text{fraction of 0's})}$

Problem: We do not know the composition of the box.

Solution: Substitute the fractions observed in the sample for the unknown fractions in the box. So

SD of the box $\approx (1 - 0) \times \sqrt{\frac{1328}{2500} \times \frac{1172}{2500}} \approx 0.50$

SE for the sum $= \sqrt{2500} \times 0.5 = 25$

SE for the sample percentage $= \frac{25}{2500} \times 100\% = 1\%$

This technique is called the bootstrap. When sampling from a 0-1 box whose composition is unknown, the SD of the box can be estimated by substituting the fractions of 0's and 1's in the sample for the unknown fractions in the box. This estimate is good when the sample is reasonably large.

Thus, the estimate of 53% is likely to be off by 1% or so. Since 53% is 3 SEs above 50%, the candidate is very likely to win.
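The bootstrap calculation for the 2500-voter sample can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names are chosen here and are not part of the lecture.

```python
import math

# Sample from the example: 2500 voters polled, 1328 favor the candidate.
n = 2500
ones = 1328

# Bootstrap step: use the sample fractions in place of the unknown box fractions.
frac_ones = ones / n
frac_zeros = 1 - frac_ones

# SD of a 0-1 box: (1 - 0) * sqrt(fraction of 1's * fraction of 0's)
sd_box = (1 - 0) * math.sqrt(frac_ones * frac_zeros)

# SE for the sum of the draws, then converted to a percentage of the sample size.
se_sum = math.sqrt(n) * sd_box
se_pct = se_sum / n * 100

print(f"estimated SD of the box: {sd_box:.3f}")
print(f"SE for the sum:          {se_sum:.2f}")
print(f"SE for the percentage:   {se_pct:.2f}%")
```

Without rounding the SD to 0.50, the SE for the percentage comes out slightly below 1%, which agrees with the hand calculation to the precision used there.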
The margin of error (not in textbook)
In the media, a margin of error is commonly reported for polls. This is just twice the standard error.

Confidence intervals
In Example 1, 53% of the voters in the sample were in favor of the candidate. The SE for the percentage was estimated as 1%. How far can the population percentage (parameter) be from 53%?

We know that a chance error of more than 2 SEs is unlikely. So let's go 2 SEs in each direction: (51%, 55%). This interval is called a 95% confidence interval for the population percentage.

Interpretation: we are about 95% confident that this interval captures the percentage of voters in the population who favor the candidate.

We can make confidence intervals with any confidence level. Some common levels are:
estimate ± 1 SE: 68% confidence interval
estimate ± 2 SEs: 95% confidence interval
estimate ± 3 SEs: 99.7% confidence interval

These numbers are based on the normal approximation; the method only works if the normal approximation works. The more asymmetric the box, the larger the sample size we need (because of the Central Limit Theorem, see Ch 18.5).
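As a small illustration of the "estimate ± k SEs" rule, the following Python sketch builds the intervals for the voter example (estimate 53%, SE 1%). The function name confidence_interval is a hypothetical helper introduced here, and the confidence levels are the normal-approximation values listed above.

```python
def confidence_interval(estimate_pct, se_pct, num_ses):
    """Return (low, high) for estimate ± num_ses standard errors, in percent."""
    return (estimate_pct - num_ses * se_pct, estimate_pct + num_ses * se_pct)

estimate_pct = 53  # sample percentage from the voter example
se_pct = 1         # SE for the percentage, estimated by the bootstrap

# Confidence levels from the normal approximation.
levels = {1: "68%", 2: "95%", 3: "99.7%"}

for num_ses, level in levels.items():
    low, high = confidence_interval(estimate_pct, se_pct, num_ses)
    print(f"{level} confidence interval: ({low}%, {high}%)")

# The margin of error reported in the media is twice the SE.
margin_of_error = 2 * se_pct
```

The 95% interval is the estimate plus or minus the margin of error, so both describe the same range.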
Consider 10, 100, 1000, or 10,000 draws from the following boxes:
a box with 500,000 0's and 500,000 1's
a box with 990,000 0's and 10,000 1's
a box with 999,995 0's and 5 1's

Interpretation of confidence intervals
What does it mean that we are about 95% confident that the interval captures the population parameter?

Remember that the population percentage is a fixed number. Each time we take a different sample, we get a different sample percentage, and thus also a different estimate for the SE. If we repeated this a million times, about 95% of the confidence intervals would contain the true population percentage, and about 5% would not.

Problem: after computing a confidence interval, we don't know whether it is one that contains the true parameter or one of the few that do not.

Considerations
1 The methods in this chapter only work for simple random samples. For more complicated sampling methods like cluster sampling, we need more complicated formulas. For non-probability sampling methods, we basically have no formulas.
2 The sample size should be small relative to the population (say < 1/10th), so that we can ignore that we draw without replacement.
3 For the bootstrap method to work, the sample size should be reasonably large.
4 For the normal approximation to work, the sample size should be reasonably large. The more asymmetric the box is, the larger the sample size we need.

Example 2
A survey organization takes a simple random sample of 1500 persons from residents in a large city. Among the sampled persons, 1035 were renters.

Fill in the blanks: We estimate that the percentage of renters in the city is ... This estimate is likely to be off by ... or so.

If possible, also construct a 95% confidence interval for the percentage of renters.
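The repeated-sampling interpretation of a 95% confidence interval, and the warning about asymmetric boxes, can be checked with a small simulation. The sketch below makes several assumptions that are not in the lecture: draws are made with replacement (reasonable when the sample is small relative to the box), only 2000 repetitions are used instead of a million, and the function name coverage, the sample sizes, and the seed are arbitrary choices.

```python
import math
import random

def coverage(p_true, n, reps, seed=0):
    """Fraction of simulated 'estimate ± 2 SE' intervals that contain p_true.

    Assumption: draws are made with replacement from a 0-1 box whose
    fraction of 1's is p_true.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw n tickets and count the 1's.
        ones = sum(rng.random() < p_true for _ in range(n))
        p_hat = ones / n
        # Bootstrap SE for the sample fraction: SD of the box / sqrt(n).
        se = math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n)
        if p_hat - 2 * se <= p_true <= p_hat + 2 * se:
            hits += 1
    return hits / reps

# Symmetric box (half 0's, half 1's), 400 draws.
symmetric = coverage(0.5, 400, 2000)

# Asymmetric box (1% 1's, like the 990,000 / 10,000 box), only 100 draws.
skewed = coverage(0.01, 100, 2000)

print(f"coverage, symmetric box: {symmetric:.3f}")
print(f"coverage, skewed box:    {skewed:.3f}")
```

For the symmetric box the coverage should land close to 95%. For the skewed box with only 100 draws it falls well short, because many samples contain no 1's at all and the estimated SE is then 0.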
Example 3
A simple random sample of 6,000 17-year-olds in school was taken. Only 36.1% of the students in the sample knew that Chaucer wrote The Canterbury Tales, but 95.2% knew that Edison invented the light bulb.

(a) If possible, find a 95% confidence interval for the percentage of all 17-year-olds in school who knew that Chaucer wrote The Canterbury Tales.
(b) If possible, find a 95% confidence interval for the percentage of all 17-year-olds in school who knew that Edison invented the light bulb.