Resampling Statistics

Resampling Statistics

- Introduction to Resampling
- Probability Modeling
- Resample add-in
- Bootstrapping values, vectors, matrices
- R boot package
- Conclusions

Conventional Statistics

Assumptions of conventional statistics:
- Variables are randomly sampled
- Variables follow a normal (Gaussian) distribution

Thus, the basis of conventional inference is that samples are drawn at random from a larger population, and the observations in the sample are then presumed to reflect the population (e.g., its mean & variance).

Resampling Statistics

In resampling statistics, statistical estimates are formed by taking random samples directly from the data at hand. In other words, you randomly sample your random sample!
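To make "sampling your sample" concrete, here is a minimal sketch in base R. The data vector `x`, the seed, and the choice of 1000 resamples are made up for illustration and are not from the lecture; the interval shown is a simple percentile interval, one of several possible choices.

```r
set.seed(1)   # assumption: fixed seed so the sketch is reproducible

# Hypothetical sample of 8 observations (illustrative values only)
x <- c(4.1, 5.3, 2.8, 6.0, 3.7, 5.1, 4.4, 4.9)

n_boot <- 1000                  # number of resamples
boot_means <- numeric(n_boot)   # storage for the resampled statistic

for (i in seq_len(n_boot)) {
  # Draw a new sample of the same size from the data itself, with replacement
  resample <- sample(x, size = length(x), replace = TRUE)
  boot_means[i] <- mean(resample)
}

mean(x)                                  # statistic on the original sample
sd(boot_means)                           # bootstrap estimate of its standard error
quantile(boot_means, c(0.025, 0.975))    # simple percentile 95% interval
```

Any other statistic (median, correlation, a regression coefficient) could replace `mean()` inside the loop without changing the structure.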

Resampling Statistics - Key Features -

1. For small data sets, resampling procedures probably provide more accurate statistical answers than conventional statistics.
2. For large data sets, resampling answers and conventional answers usually agree.
3. Resampling can handle virtually any statistic, not just those for which a distribution is known.
4. Resampling typically generates accurate 95% CIs.

Resampling Statistics - Terminology -

Resampling is a generic term for a whole array of computer-intensive methods for testing hypotheses, based on Monte Carlo and resampling simulations. Bootstrapping and jackknifing are the two most common forms applied to conventional statistical designs. This lecture will focus primarily on bootstrapping procedures.

Resampling Statistics - References -

These procedures have been around for a long time, but they have only recently come into wide use because of improved computer technology.

Selected References:

Efron, B. 1982. The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.

Simon, J.L. 1997. Resampling: The new statistics, 2nd ed. (online) http://www.resample.com/content/text/index.shtml

Good, P.I. 2005. Introduction to statistics through resampling methods and R/S-Plus. Wiley Interscience, New York, NY.

Probability Modeling

Direct modeling of probabilities is the primary point of resampling statistics. Consider a simple coin flip example. A coin has two outcomes: heads (1) and tails (0). If you flip it 100 times, the expectation is 50:50, or half 1s and half 0s.

Probability Modeling

Consider a less trivial & more biological case of probabilities: in clutches of size 8, how often would you expect to see 3 males and 5 females (i.e., a 3:5 ratio)?

This can be modeled using a coin flip algorithm. Assume that male and female are equally probable and that each outcome is independent of previous clutches. One can flip 8 coins, count the heads (males), and repeat this procedure many times.
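The coin flip algorithm just described can be sketched in a few lines of base R. The seed and the variable names are assumptions made for illustration; the exact binomial value is included only as a check on the simulation.

```r
set.seed(42)   # assumption: fixed seed so the sketch is reproducible

n_clutches  <- 1000   # how many times the 8-flip experiment is repeated
clutch_size <- 8      # offspring per clutch

# For each clutch: flip 8 fair coins (1 = male, 0 = female) and count the males
males <- replicate(
  n_clutches,
  sum(sample(c(0, 1), size = clutch_size, replace = TRUE))
)

table(males)                  # distribution of males per clutch
prop_3 <- mean(males == 3)    # simulated proportion of clutches with 3 males

# Exact binomial probability for comparison: choose(8, 3) / 2^8 = 56/256
exact_3 <- dbinom(3, size = clutch_size, prob = 0.5)

prop_3
exact_3
```

With 1000 clutches the simulated proportion should land near the exact value of 0.21875, and it moves closer as `n_clutches` grows.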

Probability Modeling

The only possible logistical difficulty in this is the "many times" part. Resampling statistical software is available in a variety of forms. A simple Excel add-in is available for $99 (academic pricing), or the calculations can be done in various ways in R.

Let's first look at a simple example using the Excel add-in to get the general idea, using our clutch size data. We can mathematically flip a coin 8 times, determine how many males there are, and do this many, many times.

Resampling Software

Select Resample, set the input range to A1:A2, and place the data in D1 in a group of 8.

Resampling Software

The result is 8 values of 0 or 1 placed in column D. Cell D9 contains the column sum (5 males for this one case of 8 flips). We need to do this 999 more times!

Resampling Software

Click OK, then double-click on this cell (it will turn red when selected), then double-click on any empty cell; 1 score is recorded.

Resampling Software

Next, click on RS (Repeat and Score), enter 1000 trials, click OK, and go to the output tab.

- The sums (males) of the 1000 groups of 8 flips are placed in column A on the output sheet
- Data are sorted high to low

Resampling Software

Now, using the stats add-in from Excel, construct a histogram of the 1000 resamples. 3 males occurs in 210 of 1000 clutches, a proportion of 0.210, or ca. 1 in 5 clutches.

Boot Package

v. 1.2-43, 25-SEP-11
http://cran.r-project.org/web/packages/boot/boot.pdf

The boot package is designed to provide extensive facilities for all forms of bootstrapping and resampling. One can bootstrap a simple statistic (e.g., median), a vector (e.g., regression weights), or an entire matrix.

The main bootstrapping function is boot(), which has the following format:

    bootobject <- boot(data=, statistic=, R=, ...)

where,
- data = a vector, matrix, or data frame
- statistic = a function that produces the k statistics to be bootstrapped (k=1 if bootstrapping a single statistic). The function should include an indices parameter that boot() can use to select cases for each replication.
- R = the number of bootstrap replicates
- ... = additional parameters passed on to the statistic function

boot() calls the statistic function R times. Each time, it generates a set of random indices, sampled with replacement (just like the Resample Excel add-in). These indices are used within the statistic function to select a sample. The statistics are calculated on the sample and the results accumulated in bootobject.

The bootobject structure includes:
- t0 = the observed values of the k statistics applied to the original data
- t = an R x k matrix where each row is a bootstrap replicate of the k statistics

You can access these as bootobject$t0 and bootobject$t.

Once the bootstrap samples have been generated, use print(bootobject) and plot(bootobject) to examine the results. boot.ci() can be used to obtain confidence intervals for the statistic(s).

Let's load the boot library and use one of R's built-in datasets:...
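Before turning to the regression example, the pieces described above (a statistic function with an indices argument, `t0`, `t`, and `boot.ci()`) can be seen together in a minimal single-statistic sketch. The data vector `x`, the function name `med_fun`, and the seed are hypothetical and chosen only for illustration.

```r
library(boot)   # boot is a recommended package distributed with R

set.seed(123)   # assumption: fixed seed so the sketch is reproducible

# Hypothetical data vector (illustrative values only)
x <- c(12, 15, 9, 21, 14, 18, 11, 30, 16, 13)

# Statistic function: first argument is the data, second is the vector of
# indices that boot() generates for each replicate
med_fun <- function(data, indices) {
  median(data[indices])
}

bootobject <- boot(data = x, statistic = med_fun, R = 1000)

bootobject$t0        # median of the original data
dim(bootobject$t)    # R x k matrix of replicates, here 1000 x 1
print(bootobject)    # original value, bias, and standard error

boot.ci(bootobject, type = "perc")   # percentile confidence interval
```

The percentile interval is used here because it is the simplest type to read; the lecture's own examples use `type="bca"`.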

We can try a standard linear model of mpg as a function of weight and displacement:

    > summary(reg)

    Call:
    lm(formula = mpg ~ wt + disp)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.4087 -2.3243 -0.7683  1.7721  6.3484

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
    wt          -3.35082    1.16413  -2.878  0.00743 **
    disp        -0.01773    0.00919  -1.929  0.06362 .
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.917 on 29 degrees of freedom
    Multiple R-squared: 0.7809, Adjusted R-squared: 0.7658
    F-statistic: 51.69 on 2 and 29 DF, p-value: 2.744e-10
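The bootstrap output that follows calls a statistic function named rsq whose definition is not shown in these notes. The sketch below is one plausible reconstruction, not the original script: the body of `rsq` is an assumption modeled on the statistic-function pattern described earlier, the seed is arbitrary, and `data = mtcars` is passed explicitly to `lm()` although the summary above shows a call without it.

```r
library(boot)

# The fit summarized above; data = mtcars is passed explicitly here (assumption)
reg <- lm(mpg ~ wt + disp, data = mtcars)

# Hypothetical reconstruction of the rsq statistic function:
# refit the model on the resampled rows and return its R-squared
rsq <- function(formula, data, indices) {
  d <- data[indices, ]            # rows selected by boot() for this replicate
  fit <- lm(formula, data = d)
  summary(fit)$r.squared
}

set.seed(2011)   # assumption: arbitrary seed, so replicates will differ from the slides
results <- boot(data = mtcars, statistic = rsq,
                R = 1000, formula = mpg ~ wt + disp)

results$t0   # R-squared on the original data
```

Because the replicates are random, the bias and standard error from a run of this sketch will not match the printed output digit for digit; only the original value `t0` is deterministic.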

    > results

    ORDINARY NONPARAMETRIC BOOTSTRAP

    Call:
    boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ wt + disp)

    Bootstrap Statistics :
         original       bias    std. error
    t1* 0.7809306  0.009334923  0.04890951

    > quartz(height=4,width=7)
    > plot(results)

    > boot.ci(results, type="bca")

    BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    Based on 1000 bootstrap replicates

    CALL :
    boot.ci(boot.out = results, type = "bca")

    Intervals :
    Level       BCa
    95%   ( 0.6314,  0.8525 )
    Calculations and Intervals on Original Scale
    Some BCa intervals may be unstable

We can extend the single-value bootstrap to an entire vector. Continuing with the same example, this time we bootstrap the model regression coefficients:

    > bsmodel <- function(formula, data, indices) {
    +   d <- data[indices,]   # allows boot to select sample
    +   fit <- lm(formula, data=d)
    +   return(coef(fit))
    + }
    > results <- boot(data=mtcars,
    +   statistic=bsmodel,
    +   R=1000, formula=mpg~wt+disp)
    > results

    ORDINARY NONPARAMETRIC BOOTSTRAP

    Call:
    boot(data = mtcars, statistic = bsmodel, R = 1000, formula = mpg ~ wt + disp)

    Bootstrap Statistics :
           original         bias     std. error
    t1* 34.96055404   9.262732e-02  2.493484690
    t2* -3.35082533  -5.329619e-02  1.180377872
    t3* -0.01772474   3.939446e-05  0.008735869

    > results$t
              [,1]         [,2]           [,3]
     [1,] 31.65568  -2.06400409  -2.212067e-02
     [2,] 34.12020  -2.88466428  -1.819257e-02
     [3,] 38.02991  -4.35540788  -1.735722e-02
     [4,] 33.95197  -3.77649064  -9.752654e-03
     [5,] 34.43601  -3.16552898  -1.873982e-02
     [6,] 34.47165  -2.89633129  -2.302154e-02
     [7,] 35.48928  -3.69683419  -1.510129e-02
     [8,] 35.47456  -3.11758947  -2.271243e-02
     [9,] 33.57981  -2.30608721  -2.730837e-02
    [10,] 36.10200  -4.51600675  -4.876640e-03
    [11,] 31.67622  -2.60958056  -1.730342e-02
    ...

    > results$t0
    (Intercept)          wt        disp
    34.96055404 -3.35082533 -0.01772474
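Since each column of results$t holds the replicates for one coefficient, per-coefficient summaries can be computed directly from that matrix. The sketch below repeats the bsmodel definition so that it runs on its own; the seed, the helper names `reps`, `boot_se`, and `perc_ci`, and the use of simple quantile-based percentile intervals are assumptions for illustration and are not the BCa intervals that boot.ci() reports.

```r
library(boot)

# Statistic function from the example above: returns the coefficient vector
bsmodel <- function(formula, data, indices) {
  d <- data[indices, ]
  fit <- lm(formula, data = d)
  coef(fit)
}

set.seed(2011)   # assumption: arbitrary seed, so values differ from the slides
results <- boot(data = mtcars, statistic = bsmodel,
                R = 1000, formula = mpg ~ wt + disp)

# Copy the R x k replicate matrix and label its columns by coefficient name
reps <- results$t
colnames(reps) <- names(results$t0)

boot_se <- apply(reps, 2, sd)                                  # bootstrap SE per coefficient
perc_ci <- apply(reps, 2, quantile, probs = c(0.025, 0.975))   # simple percentile limits

boot_se
perc_ci
```

The standard errors computed this way are the same quantity that print(results) reports in its std. error column.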

    > boot.ci(results, type="bca", index=1)   # intercept

    BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    Based on 1000 bootstrap replicates

    CALL :
    boot.ci(boot.out = results, type = "bca", index = 1)

    Intervals :
    Level       BCa
    95%   (29.83, 39.96 )
    Calculations and Intervals on Original Scale

    > boot.ci(results, type="bca", index=2)   # wt
    > boot.ci(results, type="bca", index=3)   # disp

Script file: CarBoot.R

Resampling - Conclusions -

Hopefully, by now, you can see that there is a very general principle here that can be applied to virtually any statistical design. Resampling via bootstrapping is a powerful tool in many statistical situations (cf. Chpt. 19 in W&S textbook).

A nice overview of the concepts examined here can be found in:

Diaconis, P. and B. Efron. 1983. Computer-intensive methods in statistics. Scientific American, May, 116-130.