Chapter 5. Describing Distributions Numerically


Chapter 5: Describing Distributions Numerically
Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Finding the Center: The Median
When we think of a typical value, we usually look for the center of the distribution. For a unimodal, symmetric distribution, it's easy to find the center: it's just the center of symmetry.

Finding the Center: The Median (cont.)
As a measure of center, the midrange (the average of the minimum and maximum values) is very sensitive to skewed distributions and outliers. The median is a more reasonable choice for center than the midrange.

Finding the Center: The Median (cont.)
The median is the value with exactly half the data values below it and half above it. It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas. It has the same units as the data.

Spread: Home on the Range
Always report a measure of spread along with a measure of center when describing a distribution numerically. The range of the data is the difference between the maximum and minimum values:
Range = max - min
A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.
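The definitions above can be sketched in a few lines of Python. This is only an illustration: the data values and the helper names `midrange` and `data_range` are made up here and do not come from the slides. It shows how a single extreme value pulls the midrange and the range while leaving the median near the bulk of the data.

```python
from statistics import median


def midrange(values):
    """Average of the minimum and maximum values."""
    return (min(values) + max(values)) / 2


def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)


# Hypothetical data with one extreme value (40).
data = [2, 3, 3, 4, 5, 6, 40]

if __name__ == "__main__":
    print("median:  ", median(data))      # middle value of the sorted data
    print("midrange:", midrange(data))    # dragged toward the extreme value
    print("range:   ", data_range(data))  # inflated by the extreme value
```

With these made-up values the median stays at 4, while the midrange (21.0) and the range (38) are both dominated by the single value 40.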

Spread: The Interquartile Range
The interquartile range (IQR) lets us ignore extreme data values and concentrate on the middle of the data. To find the IQR, we first need to know what quartiles are.

Spread: The Interquartile Range (cont.)
Quartiles divide the data into four equal sections. The lower quartile is the median of the half of the data below the median. The upper quartile is the median of the half of the data above the median. The difference between the quartiles is the IQR, so
IQR = upper quartile - lower quartile

Spread: The Interquartile Range (cont.)
The lower and upper quartiles are the 25th and 75th percentiles of the data, so the IQR contains the middle 50% of the values of the distribution, as shown in Figure 5.3 from the text.

The Five-Number Summary
The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum). Example: the five-number summary for the ages at death for rock concert goers who died from being crushed is

Max     47 years
Q3      22
Median  19
Q1      17
Min     13

Rock Concert Deaths: Making Boxplots
A boxplot is a graphical display of the five-number summary. Boxplots are particularly useful when comparing groups.

Constructing Boxplots
1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box.
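As a rough sketch of the quartile and five-number-summary definitions above, the following Python computes the quartiles as medians of the lower and upper halves of the sorted data. One assumption is labeled loudly: when the number of values is odd, the median itself is left out of both halves. That is only one common convention, and other textbooks and software packages split the data differently. The function names and sample data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return min, Q1, median, Q3, max using the 'median of each half' rule.

    Assumption: for an odd number of values, the middle value is excluded
    from both halves. Requires at least two values.
    """
    data = sorted(values)  # quartiles only make sense on ordered data
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }


def iqr(summary):
    """IQR = upper quartile - lower quartile."""
    return summary["q3"] - summary["q1"]


if __name__ == "__main__":
    summary = five_number_summary([8, 1, 7, 2, 6, 3, 5, 4])  # made-up data
    print(summary)
    print("IQR:", iqr(summary))
```

Sorting inside the function guards against the mistake of taking a median or quartile from unordered values.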

Constructing Boxplots (cont.)
2. Erect fences around the main part of the data. The upper fence is 1.5 IQRs above the upper quartile. The lower fence is 1.5 IQRs below the lower quartile. Note: the fences only help with constructing the boxplot and should not appear in the final display.

Constructing Boxplots (cont.)
3. Use the fences to grow whiskers. Draw lines from the ends of the box up and down to the most extreme data values found within the fences. If a data value falls outside one of the fences, we do not connect it with a whisker.

Constructing Boxplots (cont.)
4. Add the outliers by displaying any data values beyond the fences with special symbols. We often use a different symbol for far outliers that are farther than 3 IQRs from the quartiles.

Rock Concert Deaths: Making Boxplots (cont.)
Compare the histogram and boxplot for rock concert deaths. How does each display represent the distribution?
[Figure: histogram and boxplot of ages at death for rock concert deaths]

Comparing Groups With Boxplots
The following set of boxplots compares the effectiveness of various coffee containers. What does this graphical display tell you?
[Figure: side-by-side boxplots for several coffee containers]

Summarizing Symmetric Distributions
Medians do a good job of identifying the center of skewed distributions. When we have symmetric data, the mean is a good measure of center. We find the mean by adding up all of the data values and dividing by n, the number of data values we have.
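The fence rule from the boxplot construction steps can be checked against the rock concert five-number summary quoted in the slides (Q1 = 17, Q3 = 22, min = 13, max = 47). The sketch below is a minimal illustration; the function names are hypothetical, and treating a value that lies exactly on a fence as inside it is an assumption, since the slides do not say.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence), placed k IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(value, q1, q3):
    """Label a value as inside the fences, an outlier, or a far outlier.

    Assumption: values exactly on a fence count as inside it.
    """
    low, high = fences(q1, q3, 1.5)
    far_low, far_high = fences(q1, q3, 3.0)
    if value < far_low or value > far_high:
        return "far outlier"
    if value < low or value > high:
        return "outlier"
    return "inside fences"


if __name__ == "__main__":
    q1, q3 = 17, 22  # quartiles from the rock concert example
    print(fences(q1, q3))        # (9.5, 29.5)
    print(classify(13, q1, q3))  # inside fences
    print(classify(47, q1, q3))  # far outlier
```

With an IQR of 5, the fences sit at 9.5 and 29.5. The minimum age of 13 falls inside them, while the maximum of 47 is more than 3 IQRs above Q3 (beyond 37), so it would be drawn with the far-outlier symbol rather than joined to the box by a whisker.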

Summarizing Symmetric Distributions (cont.)
The distribution of pulse rates for 52 adults is generally symmetric, with a mean of 72.7 beats per minute (bpm) and a median of 73 bpm.
[Figure: histogram of pulse rates]

The Formula for Averaging
The formula for the mean is given by
$$\bar{y} = \frac{\sum y}{n}$$
The formula says that to find the mean, we add up the numbers and divide by n.

Mean or Median?
Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance.
[Figure: histogram balanced at its mean]

Mean or Median? (cont.)
In symmetric distributions, the mean and median are approximately the same in value, so either measure of center may be used. For skewed data, though, it's better to report the median than the mean as a measure of center.

What About Spread? The Standard Deviation
A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean. A deviation is the distance that a data value is from the mean. Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations.

What About Spread? The Standard Deviation (cont.)
The variance, notated by s^2, is found by summing the squared deviations and (almost) averaging them:
$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units!
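To make the arithmetic concrete, here is a small Python sketch of the mean, the deviations, and the variance as described above. The data are made up, and the helper names are hypothetical. The "almost averaging" is taken to mean dividing by n - 1 rather than n, which is the usual sample-variance convention.

```python
def mean(values):
    """Add up the values and divide by n."""
    return sum(values) / len(values)


def deviations(values):
    """Signed distance of each value from the mean."""
    m = mean(values)
    return [y - m for y in values]


def variance(values):
    """Sum of squared deviations divided by n - 1 (needs at least 2 values)."""
    return sum(d ** 2 for d in deviations(values)) / (len(values) - 1)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]  # hypothetical values
    print("mean:      ", mean(data))             # 3.0
    print("deviations:", deviations(data))       # these sum to zero
    print("variance:  ", variance(data))         # 10 / 4 = 2.5, in squared units
    print("std. dev.: ", variance(data) ** 0.5)  # back in the original units
```

The deviations here are -2, -1, 0, 1, 2, which cancel exactly. Squaring them before summing is what keeps the spread from collapsing to zero.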

What About Spread? The Standard Deviation (cont.)
The standard deviation, s, is just the square root of the variance and is measured in the same units as the original data.

Thinking About Variation
Since Statistics is about variation, spread is an important fundamental concept of Statistics. Measures of spread help us talk about what we don't know. When the data values are tightly clustered around the center of the distribution, the IQR and standard deviation will be small. When the data values are scattered far from the center, the IQR and standard deviation will be large.

Shape, Center, and Spread
When telling about a quantitative variable, always report the shape of its distribution, along with a center and a spread. If the shape is skewed, report the median and IQR. If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.

What About Outliers?
If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing. Note: The median and IQR are not likely to be affected by the outliers.

What Can Go Wrong?
Don't forget to do a reality check: don't let technology do your thinking for you.
Don't forget to sort the values before finding the median or percentiles.
Don't compute numerical summaries of a categorical variable.
Watch out for multiple modes: multiple modes might indicate multiple groups in your data.

What Can Go Wrong? (cont.)
Be aware of slightly different methods: different statistics packages and calculators may give you different answers for the same data.
Beware of outliers.
Make a picture (make a picture, make a picture).
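The advice about reporting summaries with and without outliers can be illustrated with Python's standard `statistics` module. The data below are invented for the purpose, with 95 playing the role of a clear outlier, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Return the mean, sample standard deviation, and median of the values."""
    return {
        "mean": mean(values),
        "stdev": stdev(values),
        "median": median(values),
    }


# Hypothetical data: seven ordinary values and one clear outlier.
with_outlier = [10, 12, 13, 14, 15, 16, 18, 95]
without_outlier = [y for y in with_outlier if y != 95]

if __name__ == "__main__":
    print("with outlier:   ", summarize(with_outlier))
    print("without outlier:", summarize(without_outlier))
```

In this made-up example, dropping the one extreme value moves the mean from 24.125 to 14 and shrinks the standard deviation considerably, while the median only shifts from 14.5 to 14.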

What Can Go Wrong? (cont.)
Be careful when comparing groups that have very different spreads. Consider these side-by-side boxplots of cotinine levels:
[Figure: side-by-side boxplots of cotinine levels]

*Re-expressing to Equalize the Spread of Groups
Here are the side-by-side boxplots of the log(cotinine) values:
[Figure: side-by-side boxplots of log(cotinine) values]

What have we learned?
We can now summarize distributions of quantitative variables numerically.
The 5-number summary displays the min, Q1, median, Q3, and max.
Measures of center include the mean and median.
Measures of spread include the range, IQR, and standard deviation.
We know which measures to use for symmetric distributions and skewed distributions.

What have we learned? (cont.)
We can also display distributions with boxplots.
While histograms better show the shape of the distribution, boxplots reveal the center, middle 50%, and any outliers in the distribution.
Boxplots are useful for comparing groups.
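The idea of re-expressing data to equalize spread can be sketched with a logarithm. The cotinine measurements themselves are not reproduced in this transcription, so the two groups below are entirely hypothetical: they were chosen so that one group is simply 100 times the other. The base-10 logarithm is also an assumption, since the slides only say "log".

```python
from math import log10


def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)


def log_reexpress(values):
    """Replace each (positive) value by its base-10 logarithm."""
    return [log10(y) for y in values]


# Hypothetical groups whose spreads differ by a factor of 100.
group_a = [1, 2, 4, 8]
group_b = [100, 200, 400, 800]

if __name__ == "__main__":
    print("raw ranges:", data_range(group_a), data_range(group_b))
    print("log ranges:",
          data_range(log_reexpress(group_a)),
          data_range(log_reexpress(group_b)))
```

On the raw scale the ranges are 7 and 700, so side-by-side boxplots would squash the first group flat. After taking logs both groups span about 0.9, which makes them comparable on one axis.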