Package ForImp: Imputation of Missing Values Through a Forward Imputation Algorithm


Package ForImp
February 19, 2015

Type: Package
Title: Imputation of Missing Values Through a Forward Imputation Algorithm
Version: 1.0.3
Date: 2014-11-24
Author: Alessandro Barbiero, Pier Alda Ferrari, Giancarlo Manzi
Maintainer: Alessandro Barbiero <alessandro.barbiero@unimi.it>
Description: Imputation of missing values in datasets of ordinal variables through a forward imputation algorithm
License: GPL
LazyLoad: yes
Depends: homals, sampling, mvtnorm
Repository: CRAN
Date/Publication: 2015-01-02 17:47:37
NeedsCompilation: no

R topics documented:

    ForImp-package
    ForImp
    ld
    meanimp
    medianimp
    missing
    missing2
    missingness
    modeimp
    rancat
    transfcat
    vcosw

ForImp-package    Forward Imputation

Description

    The package contains a function for the imputation of missing values in matrices of ordinal data, called Forward Imputation, and other functions for generating ordinal data or imputing missing values.

Details

    Package: ForImp
    Type: Package
    Version: 1.0
    Date: 2013-01-30
    License: GPL
    LazyLoad: yes

Author(s)

    Alessandro Barbiero <alessandro.barbiero@unimi.it>, Giancarlo Manzi <giancarlo.manzi@unimi.it>, Pier Alda Ferrari <pieralda.ferrari@unimi.it>

    Maintainer: Alessandro Barbiero <alessandro.barbiero@unimi.it>

References

    Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://ideas.repec.org/a/eee/csdana/v55y2011i7p2410-2420.html, http://www.sciencedirect.com/science/article/pii/s0167947311000521

    Ferrari P.A., Barbiero A., Manzi G. (2011) Handling missing data in presence of ordinal variables: a new imputation procedure. In "New Perspectives in Statistical Modeling and Data Analysis", S. Ingrassia, R. Rocci, M. Vichi, Eds., Springer

    Little R.J.A., Rubin D.B. (2002) Statistical Analysis with Missing Data, 2nd ed., John Wiley & Sons, Inc.

ForImp    Forward Imputation procedure

Description

    Forward Imputation of missing data

Usage

    ForImp(mat, p=2)

Arguments

    mat    a matrix/dataframe
    p      the parameter for computing the Minkowski distance used in the nearest neighbor procedure for missing value imputation. p can be any positive number (p=2 gives the Euclidean distance); if a negative number or Inf is entered, the procedure will use the maximum distance (or supremum norm)

Details

    The function implements the Forward Imputation algorithm (see References) on a matrix of ordinal data with missing values. The algorithm alternates NonLinear Principal Component Analysis (NLPCA) on a subset of the data with no missing data and sequential imputations of missing values by the nearest neighbor method. This sequential process starts from the units with the lowest number of missing values and ends with the units with the highest number of missing values.

Value

    the imputed matrix

References

    Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://ideas.repec.org/a/eee/csdana/v55y2011i7p2410-2420.html, http://www.sciencedirect.com/science/article/pii/s0167947311000521

    Ferrari P.A., Barbiero A., Manzi G. (2011) Handling missing data in presence of ordinal variables: a new imputation procedure. In "New Perspectives in Statistical Modeling and Data Analysis", S. Ingrassia, R. Rocci, M. Vichi, Eds., Springer
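Two ingredients of the procedure described for ForImp, the Minkowski distance governed by p and the processing of units from the fewest to the most missing values, can be illustrated in base R. This is only a sketch: the helper names minkowski_sketch and order_by_missing_sketch are hypothetical and not part of ForImp, the check that p is nonzero is an added assumption, and the package's own implementation may differ.

```r
# Hypothetical helpers for illustration; they are not part of ForImp.

# Minkowski distance between two numeric vectors of equal length.
# A negative or infinite p selects the maximum distance (supremum norm),
# following the convention stated for the p argument.
minkowski_sketch <- function(a, b, p = 2) {
  stopifnot(length(a) == length(b), p != 0)  # p != 0 is an assumption
  d <- abs(a - b)
  if (p < 0 || is.infinite(p)) {
    return(max(d))
  }
  sum(d^p)^(1 / p)
}

# Row indices sorted from the fewest to the most missing values.
order_by_missing_sketch <- function(mat) {
  order(rowSums(is.na(mat)))
}

a <- c(1, 2, 3)
b <- c(4, 6, 3)
minkowski_sketch(a, b, p = 2)    # sqrt(3^2 + 4^2) = 5
minkowski_sketch(a, b, p = 1)    # 3 + 4 = 7
minkowski_sketch(a, b, p = Inf)  # max(3, 4, 0) = 4

# rows have 2, 0 and 1 missing values respectively
mat <- matrix(c(NA, 1, NA, NA, 2, 2, 3, 1, 2), 3, 3)
order_by_missing_sketch(mat)     # 2 3 1
```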

See Also

    modeimp, medianimp, meanimp

Examples

    set.seed(1)
    # correlation matrix
    sigma<-matrix(c(1,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,1),4,4)
    # generate a 500*4 matrix from a multivariate normal
    matc<-rmvnorm(n=500, mean=rep(0,4), sigma=sigma)
    # transform the numerical values into ordinal categories (Likert scale)
    # obtaining matrix mato
    mato<-transfcat(matc,4)
    # set the number of desired missing values
    nummissing<-100
    # create the random missing values, obtaining matrix mat
    mat<-missing(mato, nummissing, pattern="r")
    # use function ForImp to impute missing values, obtaining matrix mati
    mati<-ForImp(mat)
    # number of correct imputations
    nummissing-sum(mati!=mato)

ld    Listwise deletion

Description

    Listwise deletion

Usage

    ld(mat)

Arguments

    mat    a matrix or a dataframe

Details

    This function implements listwise deletion on a given dataset, removing all the rows (units) containing at least one missing value.

Value

    The input matrix/dataframe with the rows (units) containing missing values removed
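Listwise deletion as described for ld can be sketched with base R's complete.cases. The name listwise_sketch is hypothetical and the code is not taken from ForImp; it only illustrates the row-removal rule.

```r
# Hypothetical base R sketch of listwise deletion; not ForImp's ld.
listwise_sketch <- function(mat) {
  # keep only the rows with no missing value; drop = FALSE keeps the
  # matrix/dataframe shape even if a single row or column remains
  mat[stats::complete.cases(mat), , drop = FALSE]
}

set.seed(1)
mat <- matrix(rnorm(12), 4, 3)
mat[2, 1] <- NA
mat[4, 3] <- NA
listwise_sketch(mat)   # rows 1 and 3 of mat remain
```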

See Also

    meanimp, modeimp, medianimp

Examples

    n<-10
    m<-4
    mat<-matrix(rnorm(n*m),n,m)
    mat[c(3,6),1]<-NA
    mat[10,2]<-NA
    ld(mat)

meanimp    Mean imputation

Description

    Mean imputation

Usage

    meanimp(mat)

Arguments

    mat    A numerical matrix

Details

    The function implements unconditional mean imputation on a numerical matrix with missing values, replacing each missing value with the arithmetic mean of the corresponding variable.

Value

    The imputed matrix

See Also

    modeimp, medianimp
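The unconditional mean imputation described for meanimp amounts to a column-wise replacement. A minimal base R sketch follows; mean_impute_sketch is a hypothetical name, and the package's own code may handle edge cases (such as a column that is entirely missing) differently.

```r
# Hypothetical base R sketch of unconditional mean imputation;
# not ForImp's meanimp.
mean_impute_sketch <- function(mat) {
  for (j in seq_len(ncol(mat))) {
    miss <- is.na(mat[, j])
    # mean of the observed values of variable j
    mat[miss, j] <- mean(mat[, j], na.rm = TRUE)
  }
  mat
}

mat <- matrix(c(1, NA, 3, 4, 6, NA), 3, 2)
mean_impute_sketch(mat)
# column 1 becomes 1, 2, 3 (observed mean 2)
# column 2 becomes 4, 6, 5 (observed mean 5)
```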

Examples

    set.seed(1)
    n<-10
    m<-3
    mat<-matrix(rnorm(n*m),n,m)
    matm<-mat
    matm[1,1]<-NA
    matm[2,2:3]<-NA
    # matrix with missing values
    matm
    # imputed matrix
    meanimp(matm)
    # original matrix with no missing values
    mat

medianimp    Median imputation

Description

    Median imputation

Usage

    medianimp(mat)

Arguments

    mat    A matrix of ordinal values, ordered according to the Likert scale (1, 2, 3,...)

Details

    The function implements median imputation on a matrix of ordinal data with missing values. It replaces each missing value with the median of the corresponding variable.

Value

    The imputed matrix

See Also

    modeimp, meanimp

Examples

    set.seed(1)
    n<-10
    m<-3
    mat<-matrix(ceiling(runif(n*m)*4),n,m)
    matm<-mat
    matm[1,3]<-NA
    matm[9:10,1]<-NA
    # matrix with missing values
    matm
    # imputed matrix
    medianimp(matm)
    # original matrix with no missing values
    mat

missing    Random generation of missing values

Description

    Random generation of missing values in matrices of numerical data or, preferably, categorical data coded as integers

Usage

    missing(mat, nummissing, pattern = "r", nk = 1, p = 0.1, w = 3)

Arguments

    mat           A matrix of numerical values
    nummissing    number of missing values
    pattern       pattern of missing values ("r" random, "l" lowest value, "b" block, "n" not at random)
    nk            category
    p             percentage of missing values
    w             weight for the lowest category in pps sampling (pattern "n")

Details

    The function generates random missing values on a matrix of categorical data according to a specific pattern. "r" is the random pattern; "l" generates a percentage p of missing values on the lowest values of variable nk; "b" generates random blocks of missing values on the group of variables indexed by nk; "n" generates a kind of not-at-random missing values: specifically, the lowest values are more likely to be missing, since they are assigned a weight w (greater than 1, the default is 3) and the values are sampled according to an unequal probability sampling design (pivotal, see References for more details).
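The simplest of these patterns, "r", can be illustrated in base R by drawing cell positions without replacement and setting them to NA. The sketch below covers only that pattern; random_missing_sketch is a hypothetical name, and the "l", "b" and "n" patterns (the last of which relies on an unequal probability sampling design) are not reproduced here.

```r
# Hypothetical base R sketch of the completely random pattern ("r");
# not ForImp's missing.
random_missing_sketch <- function(mat, nummissing) {
  stopifnot(nummissing <= length(mat))
  # choose nummissing distinct cells and blank them out
  idx <- sample(length(mat), nummissing, replace = FALSE)
  mat[idx] <- NA
  mat
}

set.seed(1)
mat <- matrix(sample(1:4, 40, replace = TRUE), 10, 4)
matm <- random_missing_sketch(mat, 8)
sum(is.na(matm))   # 8
```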

Value

    The original matrix with the desired number of values randomly substituted by missing values

References

    Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://www.sciencedirect.com/science/article/pii/s0167947311000521

Examples

    set.seed(1)
    # correlation matrix
    sigma<-matrix(c(1,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,1),4,4)
    # generate a n*m matrix from a multivariate normal
    n<-500
    m<-4
    matc<-rmvnorm(n, mean=rep(0,m), sigma=sigma)
    # transform the numerical values into ordinal categories (Likert scale)
    # obtaining matrix mato
    mato<-transfcat(matc,c(2,3,4,5))
    # set the number of desired missing values
    nummissing<-150
    # create the random missing values
    # random missing values
    matc<-missing(mato, nummissing, pattern= "r")
    matc
    # random blocks of missing values on variables 2 and 3
    matc<-missing(mato, nummissing, pattern= "b", nk=c(2,3))
    matc
    # missing values on lowest category of variable 4
    matl<-missing(mato, nummissing, pattern= "l", nk=4, p=0.1)
    matl
    # not at random missing values on variable 4
    matn<-missing(mato, nummissing, pattern= "n", nk=4, w=4)
    matn

missing2    Random generation of missing values

Description

    Random generation of missing values in matrices

Usage

    missing2(mat, missing)

Arguments

    mat        a matrix (n rows, m columns)
    missing    a vector: element i contains the desired number of rows with i missing values (1<=i<=m)

Value

    a matrix with the specified pattern of missing values

See Also

    missing, missingness

Examples

    mat<-matrix(rnorm(500),100,5)
    # if you want 20 rows with 1 missing, 10 rows with 2 missing,
    # 4 rows with 3 missing, 1 row with 4 missing
    missing<-c(20,10,4,1)
    matm<-missing2(mat, missing)
    matm
    # check that the function works
    missingness(matm)

missingness    Missing values

Description

    Summary for the missing values in a matrix

Usage

    missingness(mat)

Arguments

    mat    a matrix/dataframe with missing values

Details

    The function provides a summary for the missing values in a matrix (units by variables)

Value

    number_of_missing_values       Total number of missing values in the matrix
    missing_values_per_unit        Number of units with a certain number of missing values
    missing_values_per_variable    Number of missing values for each variable

Examples

    n<-100
    m<-3
    mat<-matrix(rnorm(n*m),n,m)
    nummissing<-50
    index<-sample(n*m,nummissing,replace=FALSE)
    mat[index]<-NA
    missingness(mat)

modeimp    Mode imputation

Description

    Mode imputation

Usage

    modeimp(mat)

Arguments

    mat    A matrix of categorical or ordinal values, coded as integer values (1, 2, 3,...)

Details

    The function implements mode imputation on a matrix of categorical or ordinal data with missing values. It replaces each missing value with the mode of the corresponding variable.

Value

    The imputed matrix
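Mode imputation on integer-coded categories can be sketched with tabulate and which.max. The name mode_impute_sketch is hypothetical; resolving ties in favour of the lowest category is an assumption of this sketch, since the text does not say how modeimp treats ties.

```r
# Hypothetical base R sketch of mode imputation; not ForImp's modeimp.
# Assumes categories are coded as positive integers (1, 2, 3, ...).
mode_impute_sketch <- function(mat) {
  for (j in seq_len(ncol(mat))) {
    # frequency of each category; tabulate ignores NA entries
    counts <- tabulate(mat[, j])
    # which.max returns the first maximum, so ties go to the lowest
    # category (an assumption of this sketch)
    mat[is.na(mat[, j]), j] <- which.max(counts)
  }
  mat
}

mat <- matrix(c(1, 2, 2, NA, 3, 3, NA, 1), 4, 2)
mode_impute_sketch(mat)
# column 1 becomes 1, 2, 2, 2 (mode 2)
# column 2 becomes 3, 3, 3, 1 (mode 3)
```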

Author(s)

    Alessandro Barbiero, Giancarlo Manzi, Pier Alda Ferrari

See Also

    medianimp, modeimp

Examples

    set.seed(1)
    n<-10
    m<-3
    mat<-matrix(ceiling(runif(n*m)*4),n,m)
    matm<-mat
    matm[1,3]<-NA
    matm[9:10,1]<-NA
    # matrix with missing values
    matm
    # imputed matrix
    modeimp(matm)
    # original matrix with no missing values
    mat

rancat    Generating a random matrix of ordinal variables

Description

    The function generates a random matrix of integer (ordinal) variables, with independent and uniform marginal distributions

Usage

    rancat(n, m, cat = 3)

Arguments

    n      number of rows/units
    m      number of columns, variables
    cat    number of categories for each variable

Details

    The function generates a random matrix of integer (ordinal) variables (coded with 1, 2, 3...), with independent and uniform marginal distributions

Value

    a matrix of ordinal values
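Independent, uniformly distributed ordinal columns of this kind can be generated in base R with sample.int. In the sketch below, rancat_sketch is a hypothetical name, and recycling a single cat value across all columns is an assumption about how a scalar argument should behave.

```r
# Hypothetical base R sketch; not ForImp's rancat.
rancat_sketch <- function(n, m, cat = 3) {
  cat <- rep_len(cat, m)   # one number of categories per column (assumed)
  mat <- matrix(NA_integer_, n, m)
  for (j in seq_len(m)) {
    # each category 1..cat[j] is drawn with equal probability
    mat[, j] <- sample.int(cat[j], n, replace = TRUE)
  }
  mat
}

set.seed(1)
mat <- rancat_sketch(500, 3, c(3, 4, 5))
# the observed frequencies should be roughly equal within each column
apply(mat, 2, table)
```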

Author(s)

    Alessandro Barbiero, Giancarlo Manzi, Pier Alda Ferrari

See Also

    transfcat

Examples

    n<-500
    m<-3
    mat<-rancat(n,m,c(3,4,5))
    # let's check the marginal distributions...
    apply(mat,2,tabulate)
    #... should be "quite" uniform

transfcat    Transforming a matrix of continuous values into a matrix of ordinal values

Description

    The function transforms a matrix of continuous numerical values into a matrix of integer (ordinal) values, with uniform marginal distributions and the desired number of categories

Usage

    transfcat(mat, cat = 3)

Arguments

    mat    a matrix or a dataframe
    cat    the number of categories, one for each column/variable of the matrix/dataframe

Details

    The function converts the input matrix, containing continuous numerical values, into a matrix of ordinal values (1, 2, 3,..., i.e. a Likert scale) according to the cat-1 normal quantiles corresponding to each variable (column) of mat.

Value

    the matrix of ordinal values
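One way to read the quantile-based conversion is: for a column with k categories, cut the values at the k-1 equally spaced normal quantiles. The base R sketch below does this with qnorm and findInterval. The name transfcat_sketch is hypothetical, and taking the mean and standard deviation of the normal distribution from each column's sample values is an assumption; the package may determine the quantiles differently.

```r
# Hypothetical base R sketch; not ForImp's transfcat.
transfcat_sketch <- function(mat, cat = 3) {
  mat <- as.matrix(mat)
  cat <- rep_len(cat, ncol(mat))
  out <- matrix(NA_integer_, nrow(mat), ncol(mat))
  for (j in seq_len(ncol(mat))) {
    k <- cat[j]
    # k - 1 cut points at probabilities 1/k, 2/k, ..., (k-1)/k of a
    # normal with the column's sample mean and sd (an assumption)
    cuts <- qnorm(seq_len(k - 1) / k,
                  mean = mean(mat[, j]), sd = sd(mat[, j]))
    # interval index 0..k-1 shifted to categories 1..k
    out[, j] <- findInterval(mat[, j], cuts) + 1L
  }
  out
}

set.seed(1)
mat <- matrix(rnorm(2000), 500, 4)
mato <- transfcat_sketch(mat, c(2, 3, 4, 5))
apply(mato, 2, table)   # roughly uniform within each column
```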

References

    Ferrari P.A., Barbiero A., Manzi G. (2011) Handling missing data in presence of ordinal variables: a new imputation procedure. In "New Perspectives in Statistical Modeling and Data Analysis", S. Ingrassia, R. Rocci, M. Vichi, Eds., Springer

    Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://www.sciencedirect.com/science/article/pii/s0167947311000521

See Also

    rancat

Examples

    # generate a 10*4 matrix of independent normal values
    # with mean 3 and standard deviation 1
    mat<-matrix(rnorm(40,3),10,4)
    # transform the matrix of normal data into a matrix of ordinal data
    transfcat(mat, cat=c(2,3,4,3))

vcosw    Cosine of the angle between two vectors

Description

    The function calculates the cosine of the angle between two vectors, defined as the inner product of the vectors divided by the product of their Euclidean norms

Usage

    vcosw(v, w)

Arguments

    v    a vector
    w    a vector, of the same length as v

Value

    The cosine of the angle between the two vectors
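The definition given for vcosw translates directly into base R. The sketch below uses the hypothetical name cosine_sketch and is not the package's code.

```r
# Hypothetical base R sketch of the stated definition; not ForImp's vcosw.
cosine_sketch <- function(v, w) {
  stopifnot(length(v) == length(w))
  # inner product divided by the product of the Euclidean norms
  sum(v * w) / (sqrt(sum(v^2)) * sqrt(sum(w^2)))
}

cosine_sketch(c(1, 0), c(0, 2))   # 0: the vectors are orthogonal
cosine_sketch(c(2, 0), c(4, 0))   # 1: the vectors point the same way
cosine_sketch(c(1, 0), c(1, 1))   # 1/sqrt(2), an angle of 45 degrees
```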

References

    Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://www.sciencedirect.com/science/article/pii/s0167947311000521

Examples

    a<-1:10
    b<-2:11
    vcosw(a,b)
    #
    e<-c(1,2,3)
    f<-c(3,-3,1)
    vcosw(e,f)
    # e and f are orthogonal vectors!

Index

Topic datagen
    ForImp, ld, meanimp, medianimp, missing, missing2, missingness, modeimp, rancat, transfcat, vcosw

Topic multivariate
    ForImp, ld, meanimp, medianimp, missing, missing2, missingness, modeimp, rancat, transfcat, vcosw

Topic package
    ForImp-package