Resampling Statistics

- Introduction to Resampling
- Probability Modeling
- Resample add-in
- Bootstrapping values, vectors, matrices
- R boot package
- Conclusions

Conventional Statistics

Assumptions of conventional statistics:
- Variables are randomly sampled
- Follow a normal distribution (Gaussian)

Thus, the basis of conventional inference is that samples are drawn at random from a larger population, and the observations in the sample are then presumed to reflect the population (e.g., mean & variance).

Resampling Statistics

In resampling statistics, statistical estimates are formed by taking random samples directly from the data at hand. In other words, you randomly sample your random sample!
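As a minimal sketch of "randomly sampling your random sample", the R lines below draw resamples with replacement from a small data vector and record the mean each time. The data values, the seed, and the choice of 2000 resamples are illustrative assumptions, not part of the original slides.

```r
set.seed(1)   # arbitrary seed, only so the run is reproducible
x <- c(4.1, 5.3, 2.8, 6.0, 3.7, 5.1, 4.4, 4.9)   # made-up sample (assumption)

# One resample: draw n values from the sample itself, with replacement
resample <- sample(x, size = length(x), replace = TRUE)

# Repeat many times, recording the statistic of interest each time
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

mean(x)          # estimate from the original sample
sd(boot_means)   # resampling estimate of the standard error of the mean
```

The spread of `boot_means` stands in for the sampling variability that conventional methods would derive from a distributional assumption.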

Resampling Statistics - Key Features -

1. For small data sets, resampling procedures probably provide more accurate statistical answers than conventional statistics.
2. For large data sets, resampling answers and conventional answers usually agree.
3. Resampling can handle virtually any statistic, not just those for which a distribution is known.
4. Resampling typically generates accurate 95% CIs.

Resampling Statistics - Terminology -

Resampling is a generic term that refers to a whole array of computer-intensive methods for testing hypotheses based on Monte Carlo and resampling simulations. Bootstrapping and jackknifing are the two most common forms applied to conventional statistical designs. This lecture will focus primarily on bootstrapping procedures.

Resampling Statistics - References -

These procedures have been around for a long time but have only recently begun to be widely applied because of enhanced computer technology.

Selected References:
Efron, B. 1982. The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.
Simon, J.L. 1997. Resampling: The new statistics, 2nd ed. (online) http://www.resample.com/content/text/index.shtml
Good, P.I. 2005. Introduction to statistics through resampling methods and R/S-Plus. Wiley Interscience, New York, NY.
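The lecture concentrates on the bootstrap, but the jackknife named under Terminology can also be sketched in a few lines of R. The data vector here is made up for illustration. The example uses the mean because, for that statistic, the jackknife standard error coincides with the textbook formula, which makes the result easy to check.

```r
x <- c(4.1, 5.3, 2.8, 6.0, 3.7, 5.1, 4.4, 4.9)   # made-up sample (assumption)
n <- length(x)

# Leave-one-out estimates: recompute the statistic with each observation dropped
theta_loo <- sapply(seq_len(n), function(i) mean(x[-i]))

# Jackknife standard error
se_jack <- sqrt((n - 1) / n * sum((theta_loo - mean(theta_loo))^2))

se_jack
sd(x) / sqrt(n)   # conventional standard error of the mean, for comparison
```

Swapping `mean` for another statistic inside `sapply()` gives a jackknife standard error where no simple formula is available, although the jackknife is known to behave poorly for non-smooth statistics such as the median.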

Probability Modeling

Direct modeling of probabilities is the primary point of resampling statistics. Consider a simple coin flip example. A coin has two outcomes: heads (1) and tails (0). If you flip it 100 times, the expectation is 50:50, or half 1s and half 0s.

Probability Modeling

Consider a less trivial and more biological case of probabilities: in clutches of size 8, how often would you expect to see 3 males and 5 females (i.e., a 3:5 ratio)? This can be modeled using a coin flip algorithm. Assume the probabilities of male and female are equal and independent of previous clutches. One can flip 8 coins, count the heads (males), and repeat this procedure many times.
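The coin flip algorithm for clutches can be sketched directly in R. The seed and the number of simulated clutches (10000) are arbitrary choices for this illustration. The exact binomial probability is computed alongside as a check on the simulation.

```r
set.seed(42)        # arbitrary seed, only so the run is reproducible
n_trials <- 10000   # number of simulated clutches (assumption)

# Each clutch: flip 8 fair coins (1 = male, 0 = female) and count the males
males <- replicate(n_trials, sum(sample(c(0, 1), size = 8, replace = TRUE)))

# Simulated proportion of clutches with exactly 3 males
p_sim <- mean(males == 3)

# Exact binomial probability: choose(8, 3) / 2^8 = 56/256
p_exact <- dbinom(3, size = 8, prob = 0.5)

p_sim
p_exact
table(males) / n_trials   # full simulated distribution of males per clutch
```

The simulated proportion should land close to the exact value of 56/256, or roughly 1 clutch in 5.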

Probability Modeling

The only possible logistical difficulty in this is the "many times" part. Resampling statistical software is available in a variety of forms. A simple Excel add-in is available for $99 (academic pricing), or the calculations can be done in various ways in R. Let's first look at a simple example using the Excel add-in to get the general idea, using our clutch size case. We can mathematically flip a coin 8 times, determine how many males there are, and do this many, many times.

Resampling Software

Select Resample, input range A1:A2, and place the data in D1 in a group of 8.

Resampling Software

The result is 8 values of 0 or 1 placed in column D. Cell D9 contains the column sum (5 males for this one case of 8 flips). We need to do this 999 more times!

Resampling Software

Click OK, then double-click on this cell (it will turn red when selected), then double-click on any empty cell. One score is recorded.

Resampling Software

Next, click on RS (Repeat and Score), enter 1000 trials, click OK, and go to the output tab. The sums (males) of 1000 groups of 8 flips are placed in column A on the output sheet. Data are sorted high to low.

Resampling Software

Now, using the stats add-in from Excel, construct a histogram of the 1000 resamples. Three males occur in 210 of 1000 clutches, a proportion of 0.210, or ca. 1 in 5 clutches.

Boot Package
v. 1.2-43 25-SEP-11
http://cran.r-project.org/web/packages/boot/boot.pdf

The boot package is designed to provide extensive facilities for all forms of bootstrapping and resampling. One can bootstrap a simple statistic (e.g., median), a vector (e.g., regression weights), or an entire matrix. The main bootstrapping function is boot() and has the following format:

bootobject <- boot(data=, statistic=, R=, ...)

where:
data = a vector, matrix, or data frame
statistic = a function that produces the k statistics to be bootstrapped (k=1 if bootstrapping a single statistic). The function should include an indices parameter that the boot() function can use to select cases for each replication.
R = the number of bootstrap replicates
... = additional parameters, passed on to the statistic function

boot() calls the statistic function R times. Each time, it generates a set of random indices, with replacement. (Just like the Resample Excel add-in.) These indices are used within the statistic function to select a sample. The statistics are calculated on the sample and the results are accumulated in bootobject. The bootobject structure includes:

t0 = the observed values of the k statistics applied to the original data
t = an R x k matrix where each row is a bootstrap replicate of the k statistics

You can access these as bootobject$t0 and bootobject$t.

Once the bootstrap samples have been generated, use print(bootobject) and plot(bootobject) to examine the results. boot.ci() can be used to obtain confidence intervals for the statistic(s).

Let's load the boot library and work with a built-in dataset (the examples that follow use mtcars, which ships with base R):...
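Before moving to regression, the calling convention just described can be tried on a single statistic. The sketch below bootstraps the median of mtcars$mpg. The function name med_fun, the seed, and the choice of statistic are assumptions made for illustration and do not come from the slides.

```r
library(boot)   # 'boot' is a recommended package included with standard R installs

# Statistic function: first argument is the data, second the resampled indices
med_fun <- function(data, indices) {
  median(data[indices])   # the indices select the cases for this replicate
}

set.seed(123)   # arbitrary seed, only so the run is reproducible
bootobject <- boot(data = mtcars$mpg, statistic = med_fun, R = 1000)

bootobject$t0           # median of the original data
dim(bootobject$t)       # R x k matrix: 1000 replicates of k = 1 statistic
sd(bootobject$t[, 1])   # bootstrap standard error of the median
```

Because boot() supplies the indices itself, the statistic function never calls sample() directly. It only has to subset the data with whatever indices it receives.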

We can try a standard linear model of mpg as a function of weight and displacement:

> summary(reg)

Call:
lm(formula = mpg ~ wt + disp)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4087 -2.3243 -0.7683  1.7721  6.3484

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
wt          -3.35082    1.16413  -2.878  0.00743 **
disp        -0.01773    0.00919  -1.929  0.06362 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.917 on 29 degrees of freedom
Multiple R-squared: 0.7809, Adjusted R-squared: 0.7658
F-statistic: 51.69 on 2 and 29 DF, p-value: 2.744e-10
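The bootstrap output that follows was produced with statistic = rsq, but the definition of rsq does not appear in the transcript. The sketch below is a plausible reconstruction, assuming rsq refits the model on the resampled rows and returns R-squared. The function body and the seed are assumptions, so the bias and standard error from a rerun will differ slightly from the values shown in the slides.

```r
library(boot)

# Hypothetical reconstruction of 'rsq' (not shown in the original slides):
# refit the model on the resampled rows and return the multiple R-squared
rsq <- function(formula, data, indices) {
  d <- data[indices, ]           # rows selected by boot() for this replicate
  fit <- lm(formula, data = d)
  summary(fit)$r.squared
}

set.seed(2011)   # arbitrary seed, only so the run is reproducible
results <- boot(data = mtcars, statistic = rsq, R = 1000,
                formula = mpg ~ wt + disp)

results$t0   # R-squared on the original data; matches the lm summary above
```

The formula argument is not a parameter of boot() itself. It travels through the ... slot to the statistic function, which is why rsq takes formula as its first argument.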

> results

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ wt + disp)

Bootstrap Statistics :
     original       bias  std. error
t1* 0.7809306  0.009334923  0.04890951

> quartz(height=4,width=7)
> plot(results)
> boot.ci(results, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca")

Intervals :
Level       BCa
95%   ( 0.6314,  0.8525 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
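The BCa interval adjusts for bias and skewness. A simpler alternative, the percentile interval, can be read straight off the bootstrap replicates. The sketch below computes it by hand and compares it with boot.ci(type = "perc"). The statistic function rsq_fun and the seed are assumptions for illustration, so the numbers will not reproduce the slide output exactly.

```r
library(boot)

# Hypothetical statistic function (assumption): R-squared of the refitted model
rsq_fun <- function(formula, data, indices) {
  fit <- lm(formula, data = data[indices, ])
  summary(fit)$r.squared
}

set.seed(7)   # arbitrary seed, only so the run is reproducible
results <- boot(data = mtcars, statistic = rsq_fun, R = 1000,
                formula = mpg ~ wt + disp)

# Percentile interval by hand: 2.5% and 97.5% quantiles of the replicates
ci_hand <- quantile(results$t[, 1], probs = c(0.025, 0.975))

# The same kind of interval from boot.ci(); columns 4 and 5 hold the limits
ci_perc <- boot.ci(results, type = "perc")$percent[4:5]

ci_hand
ci_perc
```

The two intervals differ only slightly, because boot.ci() uses its own interpolation between order statistics rather than the default rule in quantile().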

We can extend a single-value bootstrap to an entire vector. Continuing with the same example, this time we determine the model regression coefficients:

> bsmodel <- function(formula, data, indices) {
+   d <- data[indices,]   # allows boot to select sample
+   fit <- lm(formula, data=d)
+   return(coef(fit))
+ }
> results <- boot(data=mtcars,
+                 statistic=bsmodel,
+                 R=1000, formula=mpg~wt+disp)
> results

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = bsmodel, R = 1000, formula = mpg ~ wt + disp)

Bootstrap Statistics :
       original          bias   std. error
t1* 34.96055404   9.262732e-02  2.493484690
t2* -3.35082533  -5.329619e-02  1.180377872
t3* -0.01772474   3.939446e-05  0.008735869

> results$t
          [,1]        [,2]           [,3]
 [1,] 31.65568 -2.06400409  -2.212067e-02
 [2,] 34.12020 -2.88466428  -1.819257e-02
 [3,] 38.02991 -4.35540788  -1.735722e-02
 [4,] 33.95197 -3.77649064  -9.752654e-03
 [5,] 34.43601 -3.16552898  -1.873982e-02
 [6,] 34.47165 -2.89633129  -2.302154e-02
 [7,] 35.48928 -3.69683419  -1.510129e-02
 [8,] 35.47456 -3.11758947  -2.271243e-02
 [9,] 33.57981 -2.30608721  -2.730837e-02
[10,] 36.10200 -4.51600675  -4.876640e-03
[11,] 31.67622 -2.60958056  -1.730342e-02
...

> results$t0
(Intercept)          wt        disp
34.96055404 -3.35082533 -0.01772474
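The bias and std. error columns in the printed summary are simple functions of results$t and results$t0, and it can help to recompute them by hand. The sketch below repeats the bsmodel definition from the slide so that it runs on its own. The seed is an arbitrary assumption, so the bias and standard errors will differ slightly from the values shown above.

```r
library(boot)

# Same statistic function as on the slide: coefficients of the refitted model
bsmodel <- function(formula, data, indices) {
  d <- data[indices, ]          # allows boot to select the sample
  fit <- lm(formula, data = d)
  coef(fit)
}

set.seed(99)   # arbitrary seed, only so the run is reproducible
results <- boot(data = mtcars, statistic = bsmodel, R = 1000,
                formula = mpg ~ wt + disp)

# Each column of results$t holds the replicates of one coefficient
boot_se   <- apply(results$t, 2, sd)            # bootstrap standard errors
boot_bias <- colMeans(results$t) - results$t0   # mean of replicates minus original

summary_tab <- cbind(original = results$t0, bias = boot_bias, std.error = boot_se)
summary_tab
```

The bootstrap standard errors can be set against the Std. Error column of the conventional lm summary as a rough check on the normal-theory values.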

> boot.ci(results, type="bca", index=1) # intercept

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca", index = 1)

Intervals :
Level       BCa
95%   (29.83, 39.96 )
Calculations and Intervals on Original Scale

> boot.ci(results, type="bca", index=2) # wt
> boot.ci(results, type="bca", index=3) # disp

CarBoot.R Script File

Resampling - Conclusions -

Hopefully, by now, you can see that there is a very general principle here that can be applied to virtually any statistical design. Resampling via bootstrapping is a powerful tool in many statistical situations (cf. Chpt. 19 in the W&S textbook).

A nice overview of the concepts examined here can be found in:
Diaconis, P. and Efron, B. 1983. Computer-intensive methods in statistics. Scientific American, May, 116-130.