Fundamentals and applications of resampling methods for the analysis of speech production and perception data.


Olivier Crouzet
1 Laboratoire de Linguistique de Nantes (LLING UMR 6310, Université de Nantes / CNRS)
2 University Medical Center Groningen (UMCG, ENT department, Rijksuniversiteit Groningen)
Workshop on Statistical Methods in Phonetic Sciences, University of Cologne, June 11th 2017.
LLING UMR6310 (Nantes) & UMCG RUG (Groningen), O. Crouzet, Resampling methods

Talk outline
1 Asymptotic vs. Resampling frameworks
    Example: simulated Normal data
    Example: simulated non-gaussian (log-normal) data
2 The standard bootstrap
    Drawing a random sample from an existing sample
    Performing the standard bootstrap
3
4


Aims of statistical analyses
- Estimating properties of a population, or evaluating hypotheses about a population...
- ... from the observation of a random sample.

Specific applications
- Estimating a statistical parameter (central tendency, dispersion, correlation...) and computing the associated confidence intervals;
- Hypothesis testing (comparing means...).

Approaches
- Asymptotic results (traditional inference approach);
- Resampling methods:
    - Bootstrap: parameter estimation;
    - Permutation tests: hypothesis testing;
    - Outlier detection: data cleanup (though one should consider the implications of definitively removing observations from the data; computing confidence intervals may often be sufficient);
- Bayesian framework (not included in this presentation).

Asymptotic results
- Assumptions are made about the underlying distribution;
- A mathematical model of the underlying distribution is referred to;
- The sample is viewed as a random exemplar drawn from the underlying population;
- Computing a Confidence Interval requires a specific mathematical formula for each parameter (mean, median...).

Resampling approaches
- No assumptions are made about the shape of the underlying distribution;
- The mathematical model of the underlying distribution is replaced with a computational, simulated estimate of the population, obtained by generating "bootstrap samples";
- The (original) sample is the source of this computational simulation;
- Computing a Confidence Interval is possible for any parameter without requiring specific formulas.

A Gaussian distributed variable
[Figure 1: An illustration of a Gaussian distribution from which data may be randomly sampled. The QQ-plot on the left shows compatibility with the Gaussian assumption. Left panel: QQ-plot (Theoretical Quantiles vs. Sample Quantiles); right panel: histogram (Measurement scale (arbitrary) vs. Frequency).]

Confidence Intervals: a reminder
- We talk about 95%, 99%... Confidence Intervals (CIs);
- This means that, in the long run, 95% (resp. 99%) of the CIs computed from repeated random samples would contain the true value of the estimated parameter.
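The "long run" reading of a CI can be checked by simulation. The following is a minimal sketch, not taken from the original slides: the variable names, the sample size (n = 30) and the number of simulated samples are arbitrary choices made for illustration. It draws many samples from a population with a known mean and counts how often the conventional 95% CI contains that mean.

```r
# Illustrative coverage simulation (all names and settings are hypothetical)
set.seed(123)            # only to make the illustration reproducible
true_mean <- 0           # known population mean
n <- 30                  # size of each simulated sample
nsim <- 2000             # number of simulated samples

covered <- logical(nsim)
for (i in seq_len(nsim)) {
  x <- rnorm(n, mean = true_mean, sd = 1)
  delta <- qnorm(0.975) * sd(x) / sqrt(n)      # half-width of the 95% CI
  covered[i] <- abs(mean(x) - true_mean) <= delta
}

coverage <- mean(covered)   # proportion of CIs that contain the true mean
coverage
```

With Gaussian data the resulting proportion should be close to the nominal 95% (slightly lower for small samples, because the standard deviation is itself estimated).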

Asymptotic framework: estimating a 95% CI for the mean
Estimating a 95% CI for a parameter's mean is done with the following formulas:

    \delta = 1.96 \times \frac{SD}{\sqrt{n}}    (1)
    CI = \text{mean} \pm \delta    (2)

- Gaussian assumption: the formula is valid for a normally distributed variable;
- 95% of the area under a normal curve lies within the mean ± 1.96 sd;
- 99% of the area under a normal curve lies within the mean ± 2.58 sd.
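Equation (1) is the 95% special case of a more general expression. As a sketch, writing z_{1-\alpha/2} for the standard normal quantile (a notation that does not appear in the original slides), the interval at confidence level 1 - \alpha can be written as:

```latex
\delta = z_{1-\alpha/2} \, \frac{SD}{\sqrt{n}},
\qquad
CI = \text{mean} \pm \delta,
\qquad
z_{0.975} \approx 1.96, \quad z_{0.995} \approx 2.58
```

In R, the quantile z_{1-\alpha/2} is returned by qnorm(1 - alpha / 2).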

Conventional CI for the mean: function definition in R

    CI <- function(vector, targetprob = 0.95) {
      # CI for the mean
      # Compute the required percentile point from the target probability
      param <- qnorm(1 - ((1 - targetprob) / 2))
      # Estimate the delta
      delta <- (param * sd(vector)) / sqrt(length(vector))
      # Generate the CI values
      ci <- c(mean(vector) - delta, mean(vector) + delta)
      # Give a name to the resulting vector values
      names(ci) <- as.character(
        c(
          paste0((1 - targetprob) / 2 * 100, "%"),
          paste0((1 - (1 - targetprob) / 2) * 100, "%")
        )
      )
      return(ci)
    }

Conventional CI for the mean: function application

    samplesize <- 10
    set.seed(1)
    vecn <- rnorm(samplesize, mean = 0); vecn
     [1] -0.626  0.184 -0.836  1.595  0.330 -0.820  0.487  0.738  0.576 -0.305
    CI(vecn)
      2.5%  97.5%
    -0.352  0.616
    CI(vecn, targetprob = .99)
      0.5%  99.5%
    -0.504  0.768

    par(mfrow = c(1, 1), cex = 0.85)
    hist(vecn, breaks = 40, main = "", xlab = "Measurement scale (arbitrary)")
    abline(v = CI(vecn), col = "red")

[Figure 2: Conventional Confidence Interval for the mean. Histogram; x-axis: Measurement scale (arbitrary); y-axis: Frequency.]

Conventional CI for the mean: function application

    samplesize <- 50
    set.seed(1)
    vecn <- rnorm(samplesize, mean = 0); vecn
     [1] -0.6265  0.1836 -0.8356  1.5953  0.3295 -0.8205  0.4874  0.7383
     [9]  0.5758 -0.3054  1.5118  0.3898 -0.6212 -2.2147  1.1249 -0.0449
    [17] -0.0162  0.9438  0.8212  0.5939  0.9190  0.7821  0.0746 -1.9894
    [25]  0.6198 -0.0561 -0.1558 -1.4708 -0.4782  0.4179  1.3587 -0.1028
    [33]  0.3877 -0.0538 -1.3771 -0.4150 -0.3943 -0.0593  1.1000  0.7632
    [41] -0.1645 -0.2534  0.6970  0.5567 -0.6888 -0.7075  0.3646  0.7685
    [49] -0.1123  0.8811
    CI(vecn)
      2.5%  97.5%
    -0.130  0.331

[Figure 3: Conventional Confidence Interval for the mean. Histogram; x-axis: Measurement scale (arbitrary); y-axis: Frequency.]

Conventional CI for the mean: function application

    samplesize <- 1000
    set.seed(1)
    vecn <- rnorm(samplesize, mean = 0); CI(vecn)
       2.5%   97.5%
    -0.0758  0.0525

[Figure 4: Conventional Confidence Interval for the mean. Histogram; x-axis: Measurement scale (arbitrary); y-axis: Frequency.]

[Figure 5: An illustration of a (strongly) non-gaussian distribution from which data may be randomly sampled. The QQ-plot on the left shows strong departure from the Gaussian assumption. This distribution will be used as an example for the computation of Confidence Intervals in both the asymptotic and the resampling framework. Left panel: QQ-plot (Theoretical Quantiles vs. Sample Quantiles); right panel: histogram (Measurement scale (arbitrary) vs. Frequency).]

Conventional CI for the mean: function application

    samplesize <- 1000
    set.seed(1)
    vec <- rlnorm(samplesize, meanlog = 0); CI(vec)
     2.5% 97.5%
     1.54  1.83

    par(mfrow = c(1, 1), cex = 0.85)
    hist(vec, breaks = 40, main = "", xlab = "Measurement scale (arbitrary)")
    abline(v = CI(vec), col = "red")

[Figure 6: Conventional Confidence Interval for the mean. Histogram; x-axis: Measurement scale (arbitrary); y-axis: Frequency.]

Conventional CI for the mean: function application

    samplesize <- 50
    set.seed(1)
    vec <- rlnorm(samplesize, meanlog = 0); CI(vec)
     2.5% 97.5%
     1.17  1.77

[Figure 7: Conventional Confidence Interval for the mean. Histogram; x-axis: Measurement scale (arbitrary); y-axis: Frequency.]

Conventional CI for the mean: function application

    samplesize <- 10
    set.seed(1)
    vec <- rlnorm(samplesize, meanlog = 0); CI(vec)
     2.5% 97.5%
    0.688 2.345

[Figure 8: Conventional Confidence Interval for the mean. Histogram; x-axis: Measurement scale (arbitrary); y-axis: Frequency.]

Issues with conventional CIs
- They rely on distributional assumptions;
- These distributional assumptions imply that estimating different parameters involves different formulas.
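The cost of relying on the Gaussian assumption can be illustrated by simulation. The sketch below is an addition to the slides: the helper coverage_for(), the seed and the number of simulated samples are hypothetical choices. It estimates how often the conventional 95% CI for the mean contains the true mean of a log-normal population (meanlog = 0, sdlog = 1, whose theoretical mean is exp(0.5)), for a small and for a large sample size.

```r
# Coverage of the conventional 95% CI for log-normal data (illustrative sketch)
set.seed(42)                 # only to make the illustration reproducible
true_mean <- exp(0.5)        # theoretical mean of rlnorm(meanlog = 0, sdlog = 1)

# Hypothetical helper: proportion of simulated CIs that contain the true mean
coverage_for <- function(n, nsim = 2000) {
  hits <- replicate(nsim, {
    x <- rlnorm(n, meanlog = 0, sdlog = 1)
    delta <- qnorm(0.975) * sd(x) / sqrt(n)
    abs(mean(x) - true_mean) <= delta
  })
  mean(hits)
}

cov_small <- coverage_for(10)     # small samples
cov_large <- coverage_for(1000)   # large samples
c(n10 = cov_small, n1000 = cov_large)
```

For strongly skewed data such as these, the coverage obtained with small samples typically falls noticeably below the nominal 95%, and it moves closer to the nominal level as the sample size grows.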

Resampling or bootstrap framework: the bootstrap principle
"The sample is to the population... what the bootstrap sample is to the sample."

Resampling or bootstrap framework: the bootstrap principle
- We can then use this principle to build a population of bootstrap samples;
- Principle: draw random samples from the original sample (with replacement) a very high number of times;
- This can be done for any parameter (mean, median, linear regression parameter...).

Resampling or bootstrap framework: drawing a single bootstrap sample

    vec
     [1] 0.534 1.202 0.434 4.930 1.390 0.440 1.628 2.092 1.779 0.737
    median(vec)
    [1] 1.3

Resampling or bootstrap framework: drawing a single bootstrap sample (no. 1)
Note that the call to set.seed(n) is used here only to enforce reproducibility in a pedagogical setting. It should not be used in real settings, as we really need to get random samples.

    set.seed(10)
    n <- length(vec)  # Bootstrap sample size
    samb <- sample(vec, n, replace = TRUE)
    samb
     [1] 0.440 4.930 1.390 1.628 0.534 0.434 0.434 0.434 1.628 1.390
    median(samb)
    [1] 0.962
    vec
     [1] 0.534 1.202 0.434 4.930 1.390 0.440 1.628 2.092 1.779 0.737

Comparing the sample and a given bootstrap sample
[Figure 9: Comparing the original and a bootstrap sample. Panels: Original sample; Bootstrap sample.]

Resampling or bootstrap framework: drawing a single bootstrap sample (no. 2)

    set.seed(20)
    n <- length(vec)  # Bootstrap sample size
    samb <- sample(vec, n, replace = TRUE)
    samb
     [1] 1.779 2.092 0.434 0.440 0.737 0.737 0.534 0.534 4.930 4.930
    median(samb)
    [1] 0.737
    vec
     [1] 0.534 1.202 0.434 4.930 1.390 0.440 1.628 2.092 1.779 0.737

Comparing the sample and a given bootstrap sample
[Figure 10: Comparing the original and a bootstrap sample. Panels: Original sample; Bootstrap sample.]

Resampling or bootstrap framework: drawing a single bootstrap sample (no. 3)

    set.seed(30)
    n <- length(vec)  # Bootstrap sample size
    samb <- sample(vec, n, replace = TRUE)
    samb
     [1] 0.534 1.390 4.930 1.390 4.930 1.202 1.779 0.434 0.737 1.202
    median(samb)
    [1] 1.3
    vec
     [1] 0.534 1.202 0.434 4.930 1.390 0.440 1.628 2.092 1.779 0.737

Comparing the sample and a given bootstrap sample
[Figure 11: Comparing the original and a bootstrap sample. Panels: Original sample; Bootstrap sample.]

Performing a bootstrap estimation
- Define the number of replications;
- Generate a loop and repeat the following for each replication / iteration:
    1. Generate a bootstrap sample;
    2. Compute the required statistical parameter on this bootstrap sample;
    3. Store the result in a vector;
- Then compute the distribution of these results (the parameter distribution);
- Estimate the relevant quantiles of this distribution in order to compute the CI.

Performing a bootstrap estimation

    n <- length(vec)              # Bootstrap sample size
    nreps <- 4000                 # Number of replications
    statparam <- rep(NA, nreps)   # Storage vector for the estimate
    for (i in 1:nreps) {
      samb <- sample(vec, n, replace = TRUE)
      statparam[i] <- median(samb)
    }
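The same loop can be written more compactly with replicate(), which handles the storage vector itself. This is only an alternative formulation added for illustration, not the code used in the slides; vec is re-entered here from the values printed earlier so that the snippet runs on its own.

```r
# Compact equivalent of the bootstrap loop, using replicate() (illustrative)
vec <- c(0.534, 1.202, 0.434, 4.930, 1.390, 0.440, 1.628, 2.092, 1.779, 0.737)
nreps <- 4000   # number of replications

# Each replication draws a bootstrap sample and returns its median
statparam <- replicate(nreps, median(sample(vec, length(vec), replace = TRUE)))

length(statparam)                               # one median per replication
quantile(statparam, probs = c(0.025, 0.975))    # percentile interval
```

Both versions produce a vector of nreps bootstrapped medians; the explicit loop makes the three steps visible, while replicate() is shorter.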

Performing a bootstrap estimation
[Figure: Histogram of statparam. x-axis: statparam; y-axis: Frequency.]

Performing a bootstrap estimation

    bci <- quantile(statparam, probs = c(2.5, 97.5) / 100)
    bci
     2.5% 97.5%
    0.534 1.860

Compare with the original CI (for the mean):

    CI(vec)
     2.5% 97.5%
    0.688 2.345

[Figure: histogram. x-axis: Measurement scale (arbitrary); y-axis: Frequency.]

Issues with the standard bootstrap
- The number of replications is chosen so that the estimate reaches relative stability. Some time must be spent on evaluating the adequate number of replications;
- Standard bootstrap interval estimates are inaccurate: they will include the true value less often than the predicted probability;
- They are imprecise: they will include more erroneous values than is desirable (Good, 2005a);
- The R boot library provides CI computation functions with methods that deal with these errors.
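One possible way of evaluating the number of replications is to repeat the whole bootstrap several times for each candidate value and to look at how much the resulting CI end-points vary. The sketch below is an assumption about how this could be done, not a procedure given in the slides; boot_ci(), nreps_grid and the number of repetitions are hypothetical names and settings.

```r
# How stable is the upper CI end-point for different numbers of replications?
# (illustrative sketch; vec is re-entered from the values printed earlier)
vec <- c(0.534, 1.202, 0.434, 4.930, 1.390, 0.440, 1.628, 2.092, 1.779, 0.737)

# Hypothetical helper: percentile bootstrap CI for the median
boot_ci <- function(x, nreps) {
  meds <- replicate(nreps, median(sample(x, length(x), replace = TRUE)))
  quantile(meds, probs = c(0.025, 0.975))
}

nreps_grid <- c(100, 500, 2000, 4000)   # candidate numbers of replications

# For each candidate, rerun the bootstrap 10 times and measure the
# standard deviation of the upper end-point across these reruns
spread <- sapply(nreps_grid, function(r) {
  upper <- replicate(10, boot_ci(vec, r)[2])
  sd(upper)
})
names(spread) <- nreps_grid
spread
```

A number of replications can be considered adequate once the variation between reruns becomes small relative to the width of the interval.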

Issues with sample size
- Issues may arise concerning the applicability of bootstrap methods to small initial sample sizes;
- As mentioned above, it has been shown that the standard bootstrap generates inaccurate and imprecise CI end-points;
- Several solutions are available to address this issue;
- Efron (1987) describes the non-parametric BCa (bias-corrected and accelerated) Confidence Interval (see also DiCiccio & Efron, 1996);
- See also Ho & Lee (2005) for evaluations of various solutions (among which parametric bootstraps).

The boot library
The boot library is made available by Canty & Ripley (2016). If it is not already installed:

    install.packages("boot")

Then load the library:

    library(boot)

Bootstrapping with the boot library
- The bootstrap parameter estimation must be defined in a user-written function;
- The boot() function then calls this user-written function.

Bootstrapping with the boot library: defining the parameter estimation function
The parameter estimation function takes 2 arguments:
    1. The data object;
    2. The indexing vector in the data object.

    SPar <- function(data, index) {
      res <- median(data[index])
      return(res)
    }
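Conceptually, each replication amounts to calling this function with an index vector that has been resampled with replacement. The snippet below mimics a single such call by hand in base R; it is meant as an illustration of the mechanism under that assumption, not as the internal code of boot(), and the names idx and one_rep are hypothetical.

```r
# Mimicking one bootstrap replication by hand (illustrative sketch)
vec <- c(0.534, 1.202, 0.434, 4.930, 1.390, 0.440, 1.628, 2.092, 1.779, 0.737)

SPar <- function(data, index) {
  res <- median(data[index])
  return(res)
}

# A resampled index vector: positions drawn with replacement
idx <- sample(seq_along(vec), size = length(vec), replace = TRUE)
idx

# The statistic computed on the corresponding bootstrap sample
one_rep <- SPar(vec, idx)
one_rep
```

Passing indices rather than data is what lets the same mechanism work for data frames: the function can select whole rows with data[index, ].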

Bootstrapping with the boot library
It is useful to verify the function application:

    SPar(vec, 1:length(vec))
    [1] 1.3

Confirm that it is equal to:

    median(vec)
    [1] 1.3

Bootstrapping with the boot library: performing the bootstrap

    nreps <- 2000
    # bootres <- boot(vec, statistic = SPar, R = nreps, sim = "ordinary", stype = "i")
    bootres <- boot(vec, statistic = SPar, R = nreps)
    str(bootres)
    List of 11
     $ t0       : num 1.3
     $ t        : num [1:2000, 1] 1.628 1.202 1.703 0.534 1.415 ...
     $ R        : num 2000
     $ data     : num [1:10] 0.534 1.202 0.434 4.93 1.39 ...
     $ seed     : int [1:626] 403 84 356515316 1424289583 -339859737 -1122151017 963274428 -22198097 -430865073 -146139
     $ statistic:function (data, index)
      ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 1 9 4 1 9 1 1 4
      .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7da0930>
     $ sim      : chr "ordinary"
     $ call     : language boot(data = vec, statistic = SPar, R = nreps)
     $ stype    : chr "i"
     $ strata   : num [1:10] 1 1 1 1 1 1 1 1 1 1
     $ weights  : num [1:10] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
     - attr(*, "class")= chr "boot"

Bootstrapping with the boot library: accessing the information
The boot() function returns a list object which contains the following information (among others):
    t0    the value of the statistical parameter in the original sample;
    t     the bootstrapped values (as many as there are replications);
    R     the number of replications;
    data  the original sample's data.

Bootstrapping with the boot library
It is then possible to use the library to compute various (uncorrected and corrected) estimates of a Confidence Interval.

Bootstrapping with the boot library: computing a Confidence Interval
For example:

    CIs <- boot.ci(bootres, conf = 0.95, type = c("norm", "basic", "bca"))
    str(CIs)
    List of 6
     $ R     : int 2000
     $ t0    : num 1.3
     $ call  : language boot.ci(boot.out = bootres, conf = 0.95, type = c("norm", "basic", "bca"))
     $ normal: num [1, 1:3] 0.95 0.61 2.11
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : NULL
      .. ..$ : chr [1:3] "conf" "" ""
     $ basic : num [1, 1:5] 0.95 1950.97 50.03 0.732 2.057
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : NULL
      .. ..$ : chr [1:5] "conf" "" "" "" ...
     $ bca   : num [1, 1:5] 0.95 29.15 1919.61 0.487 1.779
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : NULL
      .. ..$ : chr [1:5] "conf" "" "" "" ...
     - attr(*, "class")= chr "bootci"

Bootstrapping with the boot library: computing a Confidence Interval
For example, Efron's (1987) non-parametric BCa Confidence Interval is available:

    CIs$bca[4:5]
    [1] 0.487 1.779

Compare with what we found earlier, namely the standard bootstrap CI for the median (bci) and the conventional CI for the mean (CI(vec)):

    bci
     2.5% 97.5%
    0.534 1.860
    CI(vec)
     2.5% 97.5%
    0.688 2.345

Bootstrapping a linear regression from real data
- We will use a subset of a dataset that was generated from a speech production study in which locus equations in Jordanian Arabic were investigated (Abuoudeh & Crouzet, 2014);
- In order to replicate these analyses, you will need to download the corresponding dataset extract from: https://osf.io/j8pys/download
- and then load the corresponding file in R:

    load("locusdata.rdata")

Bootstrapping a linear regression from real data
Data are usually stored in two-dimensional datasets (data frames in R):
         C V position num locuteur atburst F2ons F2mid F3mid duration length intervalsize sex
    2422 d a  attaque 623       Mo    1593  1619  1464  2480     81.2 courte          101   m
    2443 d i  attaque 644       Mo    1759  1922  1964  2791     68.8 courte           93   m
    2463 d u  attaque 664       Mo    1705  1580  1326  2192     87.5 courte          101   m
    2489 d u  attaque 691       Mo      NA  1450    NA  2368     75.0 courte           96   m
    2506 d i  attaque 708       Mo    1724  1754  1852  2721     93.8 courte          114   m
    2518 d a  attaque 720       Mo    1595  1588  1596  2524     87.5 courte          101   m

Bootstrapping a linear regression from real data
These data originate from speech recordings aimed at investigating locus equations.
Locus equations are linear regressions expressing the relation between the frequency of F2 at the burst of a consonant and at the middle of a coarticulated vowel (e.g. in a CV sequence).
The resulting linear function, of the form y = ax + b (with a the slope and b the intercept), is usually described as an indicator of the degree of coarticulation between the consonant and the vowel.
    'data.frame': 30 obs. of 13 variables:
     $ C           : Factor w/ 5 levels "b","d","g","k",..: 2 2 2 2 2 2 2 2 2 2 ...
     $ V           : Factor w/ 6 levels "a","a:","i","i:",..: 1 3 5 5 3 1 1 3 5 5 ...
     $ position    : Factor w/ 2 levels "attaque","finale": 1 1 1 1 1 1 1 1 1 1 ...
     $ num         : int 623 644 664 691 708 720 761 765 791 802 ...
     $ locuteur    : Factor w/ 7 levels "Ah","Al","As",..: 5 5 5 5 5 5 5 5 5 5 ...
     $ atburst     : int 1593 1759 1705 NA 1724 1595 1639 1755 1434 1550 ...
     $ F2ons       : int 1619 1922 1580 1450 1754 1588 1609 1879 1550 1660 ...
     $ F2mid       : int 1464 1964 1326 NA 1852 1596 1629 1848 1289 1286 ...
     $ F3mid       : int 2480 2791 2192 2368 2721 2524 2528 2552 2332 2282 ...
     $ duration    : num 81.2 68.8 87.5 75 93.8 ...
     $ length      : Factor w/ 2 levels "courte","longue": 1 1 1 1 1 1 1 1 1 1 ...
     $ intervalsize: int 101 93 101 96 114 101 100 108 100 98 ...
     $ sex         : Factor w/ 1 level "m": 1 1 1 1 1 1 1 1 1 1 ...
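To make the form y = ax + b concrete, the sketch below fits a locus equation to the six rows printed on the earlier slide only (one of them is incomplete), so the estimates are not those of the full subset analysed in the following slides. It computes the least-squares slope and intercept in closed form and checks them against lm().

```r
# The six rows printed on the earlier slide (one has missing values)
d <- data.frame(
  V       = c("a", "i", "u", "u", "i", "a"),
  atburst = c(1593, 1759, 1705, NA, 1724, 1595),
  F2mid   = c(1464, 1964, 1326, NA, 1852, 1596)
)
d <- na.omit(d)

# Least-squares slope (a) and intercept (b) in closed form
a <- cov(d$F2mid, d$atburst) / var(d$F2mid)
b <- mean(d$atburst) - a * mean(d$F2mid)

# The same estimates obtained with lm()
fit <- lm(atburst ~ F2mid, data = d)
c(slope = a, intercept = b)
coef(fit)
```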

Bootstrapping a linear regression from real data
Let's take a locus equation (LE) for the voiced alveolar stop /d/ in various vocalic contexts (Jordanian Arabic, short vowels only):
[Figure: scatter plot of select$atburst (y-axis, 1400 to 1800) against select$F2mid (x-axis, 1400 to 1800); each point is labelled with its vowel (a, i or u).]
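The slides do not show how the `select` data frame is built. One way it could be obtained is sketched below, assuming a data frame named `locusdata` (a hypothetical name, filled here with toy values) that has the columns C, V, length, atburst and F2mid shown earlier.

```r
# Toy stand-in for the loaded data frame; name and values are hypothetical
locusdata <- data.frame(
  C       = factor(c("d", "d", "b", "d", "g")),
  V       = factor(c("a", "i", "u", "u", "i")),
  length  = factor(c("courte", "courte", "courte", "longue", "courte")),
  atburst = c(1593, 1759, 1100, 1450, 2100),
  F2mid   = c(1464, 1964, 900, 1100, 2300)
)

# Keep /d/ followed by short vowels only, and drop unused factor levels
select <- droplevels(subset(locusdata, C == "d" & length == "courte"))

# Scatter plot with vowel labels instead of points (interactive sessions only)
if (interactive()) {
  plot(select$F2mid, select$atburst, type = "n",
       xlab = "F2 at vowel midpoint (Hz)", ylab = "F2 at burst (Hz)")
  text(select$F2mid, select$atburst, labels = as.character(select$V))
}
```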

Bootstrapping a linear regression from real data
Computing Locus Equations
    ## Compute LE = (simple) linear regression
    model <- lm(select$atburst ~ select$F2mid)

Bootstrapping a linear regression from real data
[Figure: scatter plot of select$atburst (y-axis, 1400 to 1800) against select$F2mid (x-axis, 1400 to 1800); each point is labelled with its vowel (a, i or u).]

Bootstrapping a linear regression from real data
    ## Extract LE parameters
    slope <- model$coefficients[2]
    intercept <- model$coefficients[1]
    slope
    select$F2mid
           0.506
    intercept
    (Intercept)
            818

Bootstrapping a linear regression from real data
    y = 0.506 x + 817.709    (3)
[Figure: scatter plot of select$atburst against select$F2mid, points labelled by vowel (a, i or u), with both axes extended to the range 0 to 2000.]
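Equation (3) can be used directly for simple arithmetic. The sketch below predicts F2 at the burst for a few vowel midpoint frequencies and computes the point where the fitted line crosses y = x, a quantity often read as the locus frequency in the locus equation literature (an interpretation added here, not stated in the slides). The function name `predict_burst` is hypothetical.

```r
slope <- 0.506
intercept <- 817.709

# Predicted F2 at burst for a given F2 at vowel midpoint (in Hz)
predict_burst <- function(f2mid) slope * f2mid + intercept
predict_burst(c(1400, 1600, 1800))

# Point where the regression line crosses y = x: x = b / (1 - a)
locus <- intercept / (1 - slope)
locus
```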

Bootstrapping a linear regression from real data
For illustration only, one may plot the regression lines fitted to all the bootstrap samples:
[Figure: scatter plot of select$atburst (y-axis, 1400 to 1800) against select$F2mid (x-axis, 1400 to 1800), points labelled by vowel (a, i or u).]
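Such a plot can be produced with a hand-written bootstrap loop: resample the rows with replacement, refit the regression, and draw one line per bootstrap sample. The following is a minimal sketch on simulated data (a hypothetical stand-in for the `select` data frame); it is not the code used to produce the slide.

```r
set.seed(42)
# Simulated stand-in for the /d/ subset (hypothetical values)
n <- 25
F2mid <- runif(n, 1300, 2000)
atburst <- 800 + 0.5 * F2mid + rnorm(n, sd = 40)
select <- data.frame(atburst, F2mid)

nreps <- 200
coefs <- matrix(NA_real_, nrow = nreps, ncol = 2,
                dimnames = list(NULL, c("intercept", "slope")))
for (r in seq_len(nreps)) {
  idx <- sample(nrow(select), replace = TRUE)   # resample rows with replacement
  coefs[r, ] <- coef(lm(atburst ~ F2mid, data = select[idx, ]))
}

# One grey line per bootstrap sample, plus the fit on the original sample
if (interactive()) {
  plot(select$F2mid, select$atburst, xlab = "F2mid", ylab = "atburst")
  for (r in seq_len(nreps)) abline(coefs[r, 1], coefs[r, 2], col = "grey")
  abline(lm(atburst ~ F2mid, data = select), lwd = 2)
}

apply(coefs, 2, sd)   # spread of the bootstrapped coefficients
```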

Bootstrapping a linear regression from real data
Using the boot library
Define the parameter estimation functions; boot() calls them with the data and a vector of resampled row indices:
    ## Slope of the locus equation fitted to the resampled rows
    bslope <- function(data, index) {
      slope <- lm(data[index, ]$atburst ~ data[index, ]$F2mid)$coefficients[2]
      return(slope)
    }
    ## Intercept of the locus equation fitted to the resampled rows
    bintercept <- function(data, index) {
      intercept <- lm(data[index, ]$atburst ~ data[index, ]$F2mid)$coefficients[1]
      return(intercept)
    }

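Bootstrapping a linear regression from real data
Test the functions on the original sample, i.e. with the indices of all rows:
    bslope(select, 1:nrow(select))
    data[index, ]$F2mid
                  0.506
    bintercept(select, 1:nrow(select))
    (Intercept)
            818
Beware that length(select) returns the number of columns of a data frame, not its number of rows: bslope(select, 1:length(select)) and bintercept(select, 1:length(select)) fit the model on the first rows only and give different estimates (slope 0.353, intercept 1087).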
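A data frame is a list of columns, so length() counts columns while nrow() counts rows, and boot() computes its reference value from all rows. The sketch below illustrates the difference on the five complete rows printed on an earlier slide, reduced to two columns for the purpose of the example.

```r
df <- data.frame(atburst = c(1593, 1759, 1705, 1724, 1595),
                 F2mid   = c(1464, 1964, 1326, 1852, 1596))

length(df)   # number of columns
nrow(df)     # number of rows

bslope <- function(data, index) {
  lm(data[index, ]$atburst ~ data[index, ]$F2mid)$coefficients[2]
}

full <- bslope(df, seq_len(nrow(df)))     # every row is used
part <- bslope(df, seq_len(length(df)))   # only the first two rows are used
c(full = unname(full), part = unname(part))
```

Using seq_len() rather than 1:n also behaves sensibly when the count is zero.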

Bootstrapping a linear regression from real data
Perform the bootstrap (separately on the slope / intercept)
    nreps <- 2000
    bootsl <- boot(select, statistic = bslope, R = nreps)
    bootsl

    ORDINARY NONPARAMETRIC BOOTSTRAP

    Call:
    boot(data = select, statistic = bslope, R = nreps)

    Bootstrap Statistics :
        original      bias    std. error
    t1*    0.506  9.45e-05         0.068
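The "bias" and "std. error" columns of this printout can be recomputed from the returned object: the bias is the mean of the bootstrapped values minus the original estimate, and the standard error is their standard deviation. A minimal sketch on simulated data (a hypothetical stand-in for `select`), with the statistic written using the formula interface of lm():

```r
library(boot)

set.seed(7)
n <- 25
F2mid <- runif(n, 1300, 2000)
atburst <- 800 + 0.5 * F2mid + rnorm(n, sd = 40)
select <- data.frame(atburst, F2mid)            # simulated stand-in

bslope <- function(data, index) {
  lm(atburst ~ F2mid, data = data[index, ])$coefficients[2]
}
bootsl <- boot(select, statistic = bslope, R = 2000)

bias <- mean(bootsl$t) - bootsl$t0   # "bias" column
se   <- sd(bootsl$t)                 # "std. error" column
c(original = unname(bootsl$t0), bias = unname(bias), std.error = se)
```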

Bootstrapping a linear regression from real data
Perform the bootstrap (separately on the slope / intercept)
    bootint <- boot(select, statistic = bintercept, R = nreps)
    bootint

    ORDINARY NONPARAMETRIC BOOTSTRAP

    Call:
    boot(data = select, statistic = bintercept, R = nreps)

    Bootstrap Statistics :
        original   bias    std. error
    t1*      818  -2.98           113

Bootstrapping a linear regression from real data
Compute the bootstrapped CIs
    CISl <- boot.ci(bootsl, conf = 0.95, type = "bca")
    CISl$bca[4:5]
    CIInt <- boot.ci(bootint, conf = 0.95, type = "bca")
    CIInt$bca[4:5]
    [1]  642 1118
Note that the slope interval is stored in CISl: calling CIs$bca[4:5] instead returns the interval from the earlier example ([1] 0.487 1.779), not the interval for the slope.
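As an alternative to running two separate bootstraps, the statistic function may return both coefficients at once, and the `index` argument of boot.ci() then selects the one for which an interval is wanted. The sketch below is an assumption-laden variant, not the slides' procedure: the data are simulated and the helper name `bcoef` is hypothetical.

```r
library(boot)

set.seed(7)
n <- 25
F2mid <- runif(n, 1300, 2000)
atburst <- 800 + 0.5 * F2mid + rnorm(n, sd = 40)
select <- data.frame(atburst, F2mid)            # simulated stand-in

# Hypothetical helper: returns c(intercept, slope) for one bootstrap sample
bcoef <- function(data, index) {
  coef(lm(atburst ~ F2mid, data = data[index, ]))
}
bootle <- boot(select, statistic = bcoef, R = 2000)

# index = 1 selects the intercept, index = 2 the slope
ci_int   <- boot.ci(bootle, conf = 0.95, type = "bca", index = 1)$bca[4:5]
ci_slope <- boot.ci(bootle, conf = 0.95, type = "bca", index = 2)$bca[4:5]
rbind(intercept = ci_int, slope = ci_slope)
```

Both intervals then come from the same set of resamples, which also keeps the slope and intercept replicates paired in `bootle$t`.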

Bootstrap procedures: There's more to discover
We've only addressed parameter estimation (and only partially).
Resampling may also be used for hypothesis testing (comparing means for continuous variables, comparing frequencies for categorical variables) in so-called permutation tests.
Though it then remains part of the NHST (Null-Hypothesis Significance Testing) framework, it may also help (me) to understand parts of Bayesian approaches to statistics.
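A permutation test for a difference between two means can be sketched in a few lines: shuffle the group labels many times and see how often the shuffled difference is at least as extreme as the observed one. The data below are simulated and the variable names are hypothetical; this is one common formulation, not a procedure taken from the slides.

```r
set.seed(123)
# Hypothetical measurements for two conditions (simulated)
short <- rnorm(20, mean = 80, sd = 10)
long  <- rnorm(20, mean = 95, sd = 10)

values <- c(short, long)
labels <- rep(c("short", "long"), times = c(length(short), length(long)))
observed <- mean(values[labels == "long"]) - mean(values[labels == "short"])

# Null distribution: shuffle the labels and recompute the difference
nperm <- 5000
perm <- replicate(nperm, {
  shuffled <- sample(labels)
  mean(values[shuffled == "long"]) - mean(values[shuffled == "short"])
})

# Two-sided p-value; the +1 counts the observed arrangement itself
p_value <- (sum(abs(perm) >= abs(observed)) + 1) / (nperm + 1)
p_value
```

Unlike the bootstrap, the labels are sampled without replacement here, so every permuted dataset contains exactly the original values.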

Suggested readings (incomplete)
  Good, P. I. (2005c). Resampling Methods: A Practical Guide to Data Analysis. Birkhäuser, 3rd ed.
  Good, P. (2005b). Permutation, Parametric and Bootstrap Tests of Hypotheses. Springer Series in Statistics, New York, USA: Springer-Verlag Inc., 3rd ed.
  Robert, C., & Casella, G. (2010). Introducing Monte Carlo Methods with R. UseR!, New York, USA: Springer-Verlag.
  Carpenter, J., & Bithell, J. (2000). Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. ...
Concerning the specific issues associated with the computation of confidence intervals, several interesting sources are available (DiCiccio & Efron, 1996; Efron, 1987; Ho & Lee, 2005).

Bibliography
  Abuoudeh, M., & Crouzet, O. (2014). Vowel length impact on locus equation parameters: An investigation on Jordanian Arabic. In Interspeech 2014, 15th Annual Conference of the International Speech Communication Association, pp. 184-188. Singapore: Chinese and Oriental Languages Information Processing Society (COLIPS), 14th-18th September 2014.
  Canty, A., & Ripley, B. D. (2016). boot: Bootstrap R (S-Plus) Functions. R package version 1.3-18.
  Carpenter, J., & Bithell, J. (2000). Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians.
  DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3), 189-228.
  Efron, B. (1987). Better Bootstrap Confidence Intervals. Journal of the American Statistical Association, 82(397), 171-185.
  Good, P. (2005a). Introduction to Statistics through Resampling Methods and R/S-Plus. Hoboken, NJ, USA: Wiley.
  Good, P. (2005b). Permutation, Parametric and Bootstrap Tests of Hypotheses. Springer Series in Statistics, New York, USA: Springer-Verlag Inc., 3rd ed.
  Good, P. I. (2005c). Resampling Methods: A Practical Guide to Data Analysis. Birkhäuser, 3rd ed.
  Ho, Y. H. S., & Lee, S. M. S. (2005). Iterated smoothed bootstrap confidence intervals for population quantiles. The Annals of Statistics, 33(1), 437-462.
  Robert, C., & Casella, G. (2010). Introducing Monte Carlo Methods with R. UseR!, New York, USA: Springer-Verlag.