Introductory Statistics, Lecture 1. Sinan Hanay
"There are three kinds of lies: lies, damned lies, and statistics." (popularized by Mark Twain) (Image: wikipedia.org)
Statistics Does Not Lie
But what is statistics?
Before that, what can Statistics do?
Statistics: the heart of data science and machine learning. Machine learning: the process of learning patterns by computer. Some examples: Google Translate, Google driverless car.
Google driverless car completed 500,000 km accident-free
Netflix: watch movies online (or rent DVDs), like Tsutaya or Hulu
Why Is Netflix #1? Alternatives: Amazon Prime, iTunes, Hulu, Vudu, PSN, etc. There may be many answers: price, number of movies. One cool feature of Netflix: its movie recommendation system.
Netflix Competition: from the movies watched to the movies suggested. Develop a suggestion system that improves prediction accuracy by 10%. Prize: 1 million dollars.
How Does It Work? Machine learning, based on statistical inference. Other applications: credit scores, Amazon recommendations, price predictors on travel sites.
A flight from Brazil to France on 31 May 2009 lost contact after a few hours. Five days later, the first wreckage was discovered. What was the cause of the accident? (Photo: wikipedia.org)
The Cause: they had to find the voice recorder (i.e., the black box). (Photo: wikipedia.org)
6,300 square km search area (Image: wikipedia.org)
Search for the Black Box: By April 2011 (22 months after the crash), it had still not been found. Metron then started to search using a statistical method. Within one week, a large part of the wreckage was found. In May 2011, the black box was found.
Probability: What is the probability of at least two people in this class having the same birthday (i.e., same month and day)? Guess?
For 23 people, it is about 50%. For 30 people, it is about 70%. For 66 people, it is more than 99%.
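The figures above can be checked with a short calculation. The sketch below assumes 365 equally likely birthdays and ignores leap years; the function name `birthday_collision_probability` is chosen here for illustration. It computes the probability that all birthdays are distinct and subtracts it from 1.

```python
def birthday_collision_probability(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes every day of the year is equally likely (no leap years).
    """
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


for n in (23, 30, 66):
    print(n, round(birthday_collision_probability(n), 3))
```

The result grows much faster than most people guess, because the number of possible pairs of people grows roughly with the square of the class size.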
Statistics for Experiments: uncertainties in experiments and populations. (Image: freerangestock.com)
Expressing Values: Is this 6.80 or 6.89 grams? The display shows only one decimal digit. Furthermore, does the device round up or round down? Both are possible. It could even be 6.71 grams. Express it as 6.8 ± 0.1 g.
Measure Length Image: flicker.com
Systematic Error: thermal expansion (Image: ebay.com)
Statistics cannot fix systematic errors.
Systematic Error: Statistics cannot eliminate systematic errors. You need to calibrate the measurement devices to get accurate measurements.
Are we done after fixing the devices?
Random Errors: a device could be perfectly accurate but not precise enough. (Photo: rolex.com)
Random Errors: Maybe you have the perfect device. But you are not punctual enough. (Photos: riverviews.net, timex.com)
Random Errors: they depend on the measurement. Fortunately, statistics can reduce the uncertainty.
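One common way to reduce the uncertainty from random errors is to repeat the measurement and average the readings. The simulation below is only a sketch under stated assumptions: the true value, the noise level, and the normally distributed error are all hypothetical choices made for illustration, not taken from the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

TRUE_VALUE = 6.80  # hypothetical true mass in grams
NOISE_SD = 0.10    # hypothetical random error of a single reading


def averaged_reading(n):
    """Average of n noisy readings of the same quantity."""
    readings = [random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(n)]
    return statistics.mean(readings)


def spread_of_averages(n, trials=500):
    """How much the averaged result varies from one experiment to the next."""
    averages = [averaged_reading(n) for _ in range(trials)]
    return statistics.pstdev(averages)


single = spread_of_averages(1)
hundred = spread_of_averages(100)
print("spread with 1 reading:   ", round(single, 3))
print("spread with 100 readings:", round(hundred, 3))
```

Averaging more readings shrinks the random scatter of the result. It does nothing for a systematic error, which shifts every reading in the same direction.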
What Is Statistics? Statistics emerged as a communication tool. Censuses were taken as early as 3000 BC in Egypt.
Data: 436 NBA players. How do we summarize?
Name             Height (m)
Acker Alex       1.96
Adams Hassan     1.93
Afflalo Arron    1.96
...
Young Nick       1.98
Young Thaddeus   2.03
Centrality - Mean: mean = (sum of heights) / (number of players). For the 436 NBA players: (1.96 + 1.93 + 1.96 + ... + 2.03) / 436 = 2.01 meters.
Centrality - Median: sort the heights; the median is the value in the middle (with an even number of values, the average of the two middle values). For the 436 NBA players: median = 2.03 meters.
Centrality - Mode: the most frequent value. For the 436 NBA players: mode = 2.06 meters.
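The three measures of centrality can be tried out with Python's standard `statistics` module. This sketch uses only the five heights visible on the slides, not the full list of 436 players, so its results differ from the 2.01, 2.03, and 2.06 meters quoted for the whole data set.

```python
import statistics

# Only the five rows shown on the slide (heights in meters),
# not the complete 436-player data set.
heights = [1.96, 1.93, 1.96, 1.98, 2.03]

mean_height = statistics.mean(heights)      # sum of heights / number of players
median_height = statistics.median(heights)  # middle value after sorting
mode_height = statistics.mode(heights)      # most frequent value

print("mean:  ", round(mean_height, 3))
print("median:", median_height)
print("mode:  ", mode_height)
```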
Example 1: You own a shoe store. You can only manufacture one size, and the fitting should be exact. Shoe size: mean 27, median 28, mode 29. Which size would you set? You should choose the mode, 29, because it is the size that fits the largest number of customers exactly.
Example 2: You are the governor of 1,000 people. You need to collect a tax of 1 billion yen, using only a fixed percentage that is neither too high nor too low. Salary (million yen): mean 5, median 4.5, mode 3.5. You should consider the mean: total income is 1,000 × 5 million = 5 billion yen, so set the tax at 20%.
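The 20% figure follows from the mean alone, because the mean times the number of people gives the total income. A minimal sketch of that arithmetic, with variable names chosen here for illustration:

```python
population = 1_000
mean_salary_yen = 5_000_000      # mean salary: 5 million yen
tax_target_yen = 1_000_000_000   # tax to collect: 1 billion yen

# Mean times population gives the total income of everyone together.
total_income_yen = population * mean_salary_yen

# A flat rate applied to every salary collects rate * total income.
flat_rate = tax_target_yen / total_income_yen
print(f"{flat_rate:.0%}")  # prints 20%
```

The median and the mode cannot be used this way, since neither of them determines the total income.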
Is Centrality Enough? Yearly salaries (million yen):
Country A   Country B
4           10
4           1
3           2
4           4
6           4
The mean, median, and mode are the same for both countries: mean 4.20, median 4, mode 4. Are the countries equal? No, we need another measure.
Measure of Dispersion (mean: 4.20)
Salary A   Difference from mean   Salary B   Difference from mean
4          -0.20                  10         5.80
4          -0.20                  1          -3.20
3          -1.20                  2          -2.20
4          -0.20                  4          -0.20
6          1.80                   4          -0.20
Sum of absolute differences, A: 0.20 + 0.20 + 1.20 + 0.20 + 1.80 = 3.60
Sum of absolute differences, B: 5.80 + 3.20 + 2.20 + 0.20 + 0.20 = 11.60
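The two sums can be reproduced in a few lines. This is a sketch using the salary lists from the slides; the helper name `sum_abs_differences` is made up for illustration, and results are rounded to hide floating-point noise.

```python
import statistics

salaries_a = [4, 4, 3, 4, 6]   # Country A, million yen
salaries_b = [10, 1, 2, 4, 4]  # Country B, million yen


def sum_abs_differences(values):
    """Sum of the absolute differences between each value and the mean."""
    mean = statistics.mean(values)
    return sum(abs(v - mean) for v in values)


print(round(sum_abs_differences(salaries_a), 2))  # 3.6
print(round(sum_abs_differences(salaries_b), 2))  # 11.6
```

The absolute value matters: without it, the positive and negative differences from the mean always cancel to zero.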
Variance: The sum of differences is a rather subjective choice of measure. Statisticians use something different: instead of the differences, take the squares of the differences. Sum of differences, A: 0.20 + 0.20 + 1.20 + 0.20 + 1.80 = 3.60. Using squares: 0.20² + 0.20² + 1.20² + 0.20² + 1.80² = 4.8. Finally, divide by the number of elements: 4.8 / 5 = 0.96. Var(A) = 0.96, Var(B) = 9.76, or σ²(A) = 0.96, σ²(B) = 9.76.
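The same steps can be written out directly. The sketch below divides by the number of elements, as the slide does, which is the population variance; the standard library's `statistics.pvariance` uses the same definition, while `statistics.variance` divides by n - 1 and would give different numbers. The function name `population_variance` is chosen here for illustration.

```python
import statistics

salaries_a = [4, 4, 3, 4, 6]   # Country A, million yen
salaries_b = [10, 1, 2, 4, 4]  # Country B, million yen


def population_variance(values):
    """Mean of the squared differences from the mean (divides by n)."""
    mean = sum(values) / len(values)
    squared_differences = [(v - mean) ** 2 for v in values]
    return sum(squared_differences) / len(values)


var_a = population_variance(salaries_a)
var_b = population_variance(salaries_b)
print(round(var_a, 2), round(var_b, 2))  # 0.96 9.76

# Cross-check against the standard library's population variance.
print(statistics.pvariance(salaries_a), statistics.pvariance(salaries_b))
```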
Yearly salaries (million yen): Country A: 4, 4, 3, 4, 6; Country B: 10, 1, 2, 4, 4. Both have mean 4.20, median 4, and mode 4, but σ²(A) = 0.96 and σ²(B) = 9.76. What is σ?
Standard Deviation: denoted by σ; the square root of the variance (σ²); a measure of dispersion.
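For the two countries, taking the square root brings the measure back to the original unit, million yen. A short sketch using the standard library's `statistics.pstdev`, which is the square root of the population variance (the divide-by-n definition used in this lecture):

```python
import math
import statistics

salaries_a = [4, 4, 3, 4, 6]   # Country A, million yen
salaries_b = [10, 1, 2, 4, 4]  # Country B, million yen

sigma_a = statistics.pstdev(salaries_a)
sigma_b = statistics.pstdev(salaries_b)
print(round(sigma_a, 2), round(sigma_b, 2))  # 0.98 3.12

# Same result as taking the square root of the variance by hand.
print(round(math.sqrt(0.96), 2), round(math.sqrt(9.76), 2))  # 0.98 3.12
```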
Overview: Why do we need statistics? What is statistics? Reading assignment: Sections 1.1-1.5 from the book. Download and install R and RStudio.
The End