Chapter 27. Inferences for Regression

Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

An Example: Body Fat and Waist Size

Our chapter example revolves around the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set:

[Figure: scatterplot of % body fat against waist size]

Remembering Regression

In regression, we want to model the relationship between two quantitative variables, one the predictor and the other the response. To do that, we imagine an idealized regression line, which assumes that the means of the distributions of the response variable fall along the line even though individual values are scattered around it.

Remembering Regression (cont.)

Now we'd like to know what the regression model can tell us beyond the individuals in the study. We want to make confidence intervals and test hypotheses about the slope and intercept of the regression line.

The Population and the Sample

When we found a confidence interval for a mean, we could imagine a single, true underlying value for the mean. When we tested whether two means or two proportions were equal, we imagined a true underlying difference. What does it mean to do inference for regression?

We know better than to think that even if we knew every population value, the data would line up perfectly on a straight line. In our sample, there's a whole distribution of % body fat for men with 38-inch waists:

[Figure: distribution of % body fat for men with 38-inch waists]

This is true at each waist size. We could depict the distribution of % body fat at different waist sizes like this:

[Figure: distributions of % body fat at different waist sizes]

The model assumes that the means of the distributions of % body fat for each waist size fall along the line even though the individuals are scattered around it. The model is not a perfect description of how the variables are associated, but it may be useful. If we had all the values in the population, we could find the slope and intercept of the idealized regression line explicitly by using least squares.

We write the idealized line with Greek letters and consider the coefficients to be parameters: $\beta_0$ is the intercept and $\beta_1$ is the slope. Corresponding to our fitted line of $\hat{y} = b_0 + b_1 x$, we write

$$\mu_y = \beta_0 + \beta_1 x$$

Now, not all the individual y's are at these means: some lie above the line and some below. Like all models, this one has errors.

Denote the errors by $\varepsilon$ and write $\varepsilon = y - \mu_y$ for each data point (x, y). When we add error to the model, we can talk about individual y's instead of means:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

This equation is now true for each data point (since the individual $\varepsilon$'s soak up the deviations) and gives a value of y for each x.

Assumptions and Conditions

In Chapter 8, when we fit lines to data, we needed to check only the Straight Enough Condition. Now, when we want to make inferences about the coefficients of the line, we'll have to make more assumptions (and thus check more conditions).

We need to be careful about the order in which we check conditions. If an initial assumption is not true, it makes no sense to check the later ones.
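The relationship between the idealized line, the means $\mu_y$, and the errors $\varepsilon$ can be made concrete with a small sketch. The "population" below is invented purely for illustration (it is not the chapter's body fat data), and the helper name `least_squares` is our own.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for paired values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical "population" of (x, y) pairs, invented for illustration only.
population_x = [32, 34, 36, 38, 38, 40, 42, 44]
population_y = [10, 15, 17, 20, 26, 24, 31, 33]

# With every population value in hand, least squares gives beta0 and beta1.
beta0, beta1 = least_squares(population_x, population_y)

# Mean of y at each x according to the idealized line.
mu_y = [beta0 + beta1 * x for x in population_x]

# Errors: how far each individual y lies from the mean at its x.
errors = [y - m for y, m in zip(population_y, mu_y)]

# Adding the error back reproduces every individual y.
rebuilt_y = [beta0 + beta1 * x + e for x, e in zip(population_x, errors)]
```

Note that the two individuals with x = 38 share one mean but have different errors, which is exactly what the model allows.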

1. Linearity Assumption: Straight Enough Condition: Check the scatterplot. The shape must be linear or we can't use regression at all. If the scatterplot is straight enough, we can go on to some assumptions about the errors. If not, stop here, or consider re-expressing the data to make the scatterplot more nearly linear.

2. Independence Assumption: Randomization Condition: The individuals are a representative sample from the population. Check the residual plot (part 1): the residuals should appear to be randomly scattered.

3. Equal Variance Assumption: Does the Plot Thicken? Condition: Check the residual plot (part 2): the spread of the residuals should be uniform.

4. Normal Population Assumption: Nearly Normal Condition: Check a histogram of the residuals. The distribution of the residuals should be unimodal and symmetric.

If all four assumptions are true, the idealized regression model would look like this:

[Figure: idealized regression model with a Normal curve at each value of x]

At each value of x there is a distribution of y-values that follows a Normal model, and each of these Normal models is centered on the line and has the same standard deviation.
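A minimal sketch of that idealized model: at every x we draw y from a Normal model centered on the line with one common standard deviation. The parameter values and x-values are hypothetical, and `spread_ratio` is only a crude numeric stand-in for looking at the residual plot (it is our own helper, not a method from the chapter).

```python
import random
import statistics


def simulate_idealized(beta0, beta1, sigma, xs, seed=1):
    """Draw one y per x: Normal, centered on the line, same SD at every x."""
    rng = random.Random(seed)
    return [rng.gauss(beta0 + beta1 * x, sigma) for x in xs]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


def spread_ratio(xs, ys):
    """Residual SD for the upper half of x divided by that for the lower half.

    A value far from 1 hints that the plot "thickens" (unequal variance).
    """
    b0, b1 = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))
    resid = [y - (b0 + b1 * x) for x, y in pairs]
    half = len(resid) // 2
    return statistics.stdev(resid[half:]) / statistics.stdev(resid[:half])


# Hypothetical x-values and parameters, chosen only for illustration.
xs = [30 + 0.05 * i for i in range(400)]
ys = simulate_idealized(beta0=-40.0, beta1=1.7, sigma=4.0, xs=xs)
ratio = spread_ratio(xs, ys)  # should be near 1 under the idealized model
```

A number like this does not replace the plots; bends and outliers are much easier to see than to summarize.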

Which Come First: the Conditions or the Residuals?

There's a catch in regression: the best way to check many of the conditions is with the residuals, but we get the residuals only after we compute the regression model. To compute the regression model, however, we should check the conditions. So we work in this order:

1. Make a scatterplot of the data to check the Straight Enough Condition. (If the relationship isn't straight, try re-expressing the data. Or stop.)

2. If the data are straight enough, fit a regression model and find the residuals, e, and predicted values, $\hat{y}$.

3. Make a scatterplot of the residuals against x or the predicted values. This plot should have no pattern. Check in particular for any bend, any thickening, or any outliers.

4. If the data are measured over time, plot the residuals against time to check for evidence of patterns that might suggest they are not independent.

5. If the scatterplots look OK, then make a histogram and Normal probability plot of the residuals to check the Nearly Normal Condition.

6. If all the conditions seem to be satisfied, go ahead with inference.

Intuition About Regression Inference

We expect any sample to produce a $b_1$ whose expected value is the true slope, $\beta_1$. What about its standard deviation? What aspects of the data affect how much the slope and intercept vary from sample to sample?

Spread around the line: Less scatter around the line means the slope will be more consistent from sample to sample. The spread around the line is measured with the residual standard deviation $s_e$. You can always find $s_e$ in the regression output, often just labeled s.
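As a rough illustration of the residual standard deviation and of why less scatter gives a more consistent slope, the sketch below computes $s_e$ with the usual $n - 2$ divisor and then simulates many samples at two error SDs. The true line (intercept 1, slope 2), the x-values, and the function names are all assumptions made for the example.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


def residual_sd(xs, ys):
    """s_e = sqrt(sum of squared residuals / (n - 2))."""
    b0, b1 = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


def slope_sd(sigma, xs, reps=500, seed=2):
    """SD of the fitted slope across many simulated samples with error SD sigma."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        # Hypothetical true line: intercept 1, slope 2.
        ys = [rng.gauss(1.0 + 2.0 * x, sigma) for x in xs]
        slopes.append(fit_line(xs, ys)[1])
    return statistics.stdev(slopes)


xs = list(range(1, 21))
tight = slope_sd(1.0, xs)  # little scatter around the line
loose = slope_sd(4.0, xs)  # much more scatter around the line
```

Under these assumptions `tight` comes out well below `loose`: the slopes from low-scatter samples cluster much more closely around the true slope.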

Spread of the x's: A large standard deviation of x provides a more stable regression.

Sample size: Having a larger sample size, n, gives more consistent estimates.

Standard Error for the Slope

Three aspects of the scatterplot affect the standard error of the regression slope: the spread around the line, $s_e$; the spread of x values, $s_x$; and the sample size, n. The formula for the standard error (which you will probably never have to calculate by hand) is:

$$SE(b_1) = \frac{s_e}{\sqrt{n-1}\; s_x}$$

Sampling Distribution for Regression Slopes

When the conditions are met, the standardized estimated regression slope

$$t = \frac{b_1 - \beta_1}{SE(b_1)}$$

follows a Student's t-model with $n - 2$ degrees of freedom.

Sampling Distribution for Regression Slopes (cont.)

We estimate the standard error with

$$SE(b_1) = \frac{s_e}{\sqrt{n-1}\; s_x}$$

where $s_e = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n-2}}$, n is the number of data values, and $s_x$ is the ordinary standard deviation of the x-values.

What About the Intercept?

The same reasoning applies for the intercept. We can write

$$\frac{b_0 - \beta_0}{SE(b_0)} \sim t_{n-2}$$

but we rarely use this fact for anything. The intercept usually isn't interesting. Most hypothesis tests and confidence intervals for regression are about the slope.
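The standard error formula can be computed directly, which makes the roles of $s_e$, $s_x$, and n easy to see. This is a sketch on a tiny invented data set; the function name `slope_inference` and the returned dictionary keys are our own choices.

```python
import math
import statistics


def slope_inference(xs, ys, beta1_null=0.0):
    """Return the fitted line, s_e, SE(b1), and the t-ratio for the slope."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar

    # Residual standard deviation: spread around the line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))

    # Ordinary standard deviation of the x-values: spread of the x's.
    s_x = statistics.stdev(xs)

    # SE(b1) = s_e / (sqrt(n - 1) * s_x)
    se_b1 = s_e / (math.sqrt(n - 1) * s_x)

    # Standardized slope, to be compared with a t-model on n - 2 df.
    t_ratio = (b1 - beta1_null) / se_b1
    return {"b0": b0, "b1": b1, "s_e": s_e, "se_b1": se_b1,
            "t": t_ratio, "df": n - 2}


# Tiny invented data set, for illustration only.
result = slope_inference([1, 2, 3, 4], [1, 3, 2, 4])

# Doubling the spread of the x's (same y's) halves the standard error.
wider = slope_inference([2, 4, 6, 8], [1, 3, 2, 4])
```

Here the slope is 0.8 with a standard error of about 0.424, so the t-ratio is about 1.89 on 2 degrees of freedom.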

Regression Inference

A null hypothesis of a zero slope questions the entire claim of a linear relationship between the two variables, which is often just what we want to know. To test $H_0: \beta_1 = 0$, we find

$$t_{n-2} = \frac{b_1 - 0}{SE(b_1)}$$

and continue as we would with any other t-test. The formula for a confidence interval for $\beta_1$ is

$$b_1 \pm t^*_{n-2} \times SE(b_1)$$

Standard Errors for Predicted Values

Once we have a useful regression, how can we indulge our natural desire to predict, without being irresponsible? Now that we have standard errors, we can use those to construct a confidence interval for the predictions, smudging the results in the right way to report our uncertainty honestly.

For our % body fat and waist size example, there are two questions we could ask: Do we want to know the mean % body fat for all men with a waist size of, say, 38 inches? Do we want to estimate the % body fat for a particular man with a 38-inch waist?

The predicted % body fat is the same in both questions, but we can predict the mean % body fat for all men whose waist size is 38 inches with a lot more precision than we can predict the % body fat of a particular individual whose waist size happens to be 38 inches.

We start with the same prediction in both cases. We are predicting for a new individual, one that was not in the original data set. Call his x-value $x_\nu$. The regression predicts % body fat as

$$\hat{y}_\nu = b_0 + b_1 x_\nu$$

Both intervals take the form

$$\hat{y}_\nu \pm t^*_{n-2} \times SE$$

The SE's will be different for the two questions we have posed.

The standard error of the mean predicted value is:

$$SE(\hat{\mu}_\nu) = \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \frac{s_e^2}{n}}$$

Individuals vary more than means, so the standard error for a single predicted value is larger than the standard error for the mean:

$$SE(\hat{y}_\nu) = \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \frac{s_e^2}{n} + s_e^2}$$
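The slope interval and the two kinds of prediction interval can be sketched together, since they share the same ingredients. The critical value `t_star` is taken as an input that the reader looks up for $n - 2$ degrees of freedom (4.303 is the 95% value for 2 df); the data and the function name `regression_intervals` are invented for illustration.

```python
import math
import statistics


def regression_intervals(xs, ys, x_new, t_star):
    """Intervals for the slope, the mean response, and an individual at x_new.

    t_star is the critical value for n - 2 degrees of freedom, supplied by
    the caller (for example, from a t-table).
    """
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    se_b1 = s_e / (math.sqrt(n - 1) * statistics.stdev(xs))

    # Same point prediction for both questions.
    y_hat = b0 + b1 * x_new

    # Uncertainty shared by both: slope uncertainty plus uncertainty in the mean.
    shared = se_b1 ** 2 * (x_new - x_bar) ** 2 + s_e ** 2 / n
    se_mean = math.sqrt(shared)
    # An individual adds the scatter of individuals around the line.
    se_indiv = math.sqrt(shared + s_e ** 2)

    return {
        "slope_ci": (b1 - t_star * se_b1, b1 + t_star * se_b1),
        "y_hat": y_hat,
        "mean_ci": (y_hat - t_star * se_mean, y_hat + t_star * se_mean),
        "individual_pi": (y_hat - t_star * se_indiv, y_hat + t_star * se_indiv),
    }


# Tiny invented data set; 4.303 is the 95% critical value for 2 df.
result = regression_intervals([1, 2, 3, 4], [1, 3, 2, 4], x_new=2.5, t_star=4.303)
```

With so few points every interval is very wide, but the ordering still holds: the interval for an individual is wider than the interval for the mean at the same x.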

Confidence Intervals for Predicted Values

Here's a look at the difference between predicting for a mean and predicting for an individual. The solid green lines near the regression line show the 95% confidence interval for the mean predicted value, and the dashed red lines show the prediction intervals for individuals.

[Figure: regression line with confidence bands for the mean (solid green) and prediction bands for individuals (dashed red)]

What Can Go Wrong?

Don't fit a linear regression to data that aren't straight.

Watch out for the plot thickening. If the spread in y changes with x, our predictions will be very good for some x-values and very bad for others.

Make sure the errors are Normal. Check the histogram and Normal probability plot of the residuals to see if this assumption looks reasonable.

What Can Go Wrong? (cont.)

Watch out for extrapolation. It's always dangerous to predict for x-values that lie far from the center of the data.

Watch out for high-influence points and outliers.

Watch out for one-tailed tests. Tests of hypotheses about regression coefficients are usually two-tailed, so software packages report two-tailed P-values. If you are using software to conduct a one-tailed test about slope, you'll need to divide the reported P-value in half.

What have we learned?

We have now applied inference to regression models. As in all inference situations, there are conditions that we must check. We can test a hypothesis about the slope and find a confidence interval for the true slope. And, again, we are reminded never to mistake the presence of an association for proof of causation.
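The one-tailed adjustment mentioned under What Can Go Wrong? can be sketched as follows. The tail area of the t-model is approximated here by numerical integration (Simpson's rule) so that the example needs only the standard library; the function names are our own. One added assumption worth stating: halving the two-tailed P-value applies when the observed slope falls on the side named by the alternative, and otherwise the one-tailed P-value is 1 minus that half.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-model with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const) * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def two_tailed_p(t, df):
    """P-value as typically reported by software for H0: slope = 0."""
    return 2 * upper_tail(abs(t), df)


def one_tailed_p(t, df, alternative="greater"):
    """Convert to a one-tailed P-value for HA: slope > 0 or HA: slope < 0."""
    half = two_tailed_p(t, df) / 2
    agrees = t > 0 if alternative == "greater" else t < 0
    return half if agrees else 1 - half


p_two = two_tailed_p(4.303, 2)                 # close to 0.05
p_one = one_tailed_p(4.303, 2, "greater")      # half of p_two
p_wrong_side = one_tailed_p(-4.303, 2, "greater")
```

The last call shows why direction matters: a strongly negative t-ratio gives no support to an alternative of a positive slope, so its one-tailed P-value is close to 1.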