
Package ForImp

February 19, 2015

Type Package
Title Imputation of Missing Values Through a Forward Imputation Algorithm
Version 1.0.3
Date 2014-11-24
Author Alessandro Barbiero, Pier Alda Ferrari, Giancarlo Manzi
Maintainer Alessandro Barbiero <alessandro.barbiero@unimi.it>
Description Imputation of missing values in datasets of ordinal variables through a forward imputation algorithm
License GPL
LazyLoad yes
Depends homals, sampling, mvtnorm
Repository CRAN
Date/Publication 2015-01-02 17:47:37
NeedsCompilation no

R topics documented:

    ForImp-package
    ForImp
    ld
    meanimp
    medianimp
    missing
    missing2
    missingness
    modeimp
    rancat
    transfcat
    vcosw

ForImp-package          Forward Imputation

Description

The package contains a function for the imputation of missing values in matrices of ordinal data, called Forward Imputation, and other functions for generating ordinal data or imputing missing values.

Details

    Package:   ForImp
    Type:      Package
    Version:   1.0
    Date:      2013-01-30
    License:   GPL
    LazyLoad:  yes

Author(s)

Alessandro Barbiero <alessandro.barbiero@unimi.it>, Giancarlo Manzi <giancarlo.manzi@unimi.it>, Pier Alda Ferrari <pieralda.ferrari@unimi.it>

Maintainer: Alessandro Barbiero <alessandro.barbiero@unimi.it>

References

Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://ideas.repec.org/a/eee/csdana/v55y2011i7p2410-2420.html, http://www.sciencedirect.com/science/article/pii/s0167947311000521

Ferrari P.A., Barbiero A., Manzi G. (2011) Handling missing data in presence of ordinal variables: a new imputation procedure. In "New Perspectives in Statistical Modeling and Data Analysis", S. Ingrassia, R. Rocci, M. Vichi, Eds., Springer

Little R.J.A., Rubin D.B. (2002) Statistical Analysis with Missing Data, 2nd ed. John Wiley & Sons, Inc.

ForImp                  Forward Imputation procedure

Description

Forward Imputation of missing data

Usage

    ForImp(mat, p=2)

Arguments

    mat   a matrix/dataframe
    p     the parameter for computing the Minkowski distance used in the
          nearest neighbor procedure for missing value imputation. p can be
          any positive number (p=2 gives the Euclidean distance); if a
          negative number or Inf is entered, the procedure uses the maximum
          distance (or supremum norm)

Details

The function implements the Forward Imputation algorithm (see References) on a matrix of ordinal data with missing values. The algorithm alternates NonLinear Principal Component Analysis (NLPCA) on a subset of the data with no missing values and sequential imputations of missing values by the nearest neighbor method. This sequential process starts from the units with the lowest number of missing values and ends with the units with the highest number of missing values.

Value

the imputed matrix

References

Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://ideas.repec.org/a/eee/csdana/v55y2011i7p2410-2420.html, http://www.sciencedirect.com/science/article/pii/s0167947311000521

Ferrari P.A., Barbiero A., Manzi G. (2011) Handling missing data in presence of ordinal variables: a new imputation procedure. In "New Perspectives in Statistical Modeling and Data Analysis", S. Ingrassia, R. Rocci, M. Vichi, Eds., Springer
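The role of the argument p can be illustrated with a small base R sketch. The helper minkowski() below is hypothetical and is not a function of the package; it only assumes the behaviour stated for p (a positive p gives the Minkowski distance of that order, a negative or infinite p gives the maximum distance).

```r
# Hypothetical helper, not part of ForImp: distance between two numeric
# vectors x and y for a given order p.
minkowski <- function(x, y, p = 2) {
  d <- abs(x - y)
  # negative or infinite p: maximum distance (supremum norm)
  if (p < 0 || is.infinite(p)) {
    return(max(d))
  }
  # positive p: Minkowski distance of order p
  sum(d^p)^(1 / p)
}

x <- c(0, 0)
y <- c(3, 4)
minkowski(x, y, p = 2)    # Euclidean distance
minkowski(x, y, p = 1)    # city-block distance
minkowski(x, y, p = Inf)  # maximum distance
```

In a nearest neighbor step, a unit with missing values would be matched to the complete unit that minimises such a distance; how ForImp computes it internally is not shown here.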

See Also

modeimp, medianimp, meanimp

Examples

    set.seed(1)
    # correlation matrix
    sigma <- matrix(c(1,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,1),4,4)
    # generate a 500*4 matrix from a multivariate normal
    matc <- rmvnorm(n=500, mean=rep(0,4), sigma=sigma)
    # transform the numerical values into ordinal categories (Likert scale)
    # obtaining matrix mato
    mato <- transfcat(matc,4)
    # set the number of desired missing values
    nummissing <- 100
    # create the random missing values, obtaining matrix mat
    mat <- missing(mato, nummissing, pattern="r")
    # use function ForImp to impute missing values, obtaining matrix mati
    mati <- ForImp(mat)
    # number of correct imputations
    nummissing-sum(mati!=mato)


ld                      Listwise deletion

Description

Listwise deletion

Usage

    ld(mat)

Arguments

    mat   a matrix or a dataframe

Details

This function implements listwise deletion on a given dataset, removing all the rows (units) that contain at least one missing value.

Value

The input matrix/dataframe with the rows (units) containing missing values removed
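Listwise deletion as described above can be sketched in base R with complete.cases(). The function name ld_sketch() is hypothetical; it is an illustration of the technique, not the package's own code.

```r
# Hypothetical sketch of listwise deletion: keep only the rows (units)
# that contain no missing value.
ld_sketch <- function(mat) {
  keep <- stats::complete.cases(mat)
  mat[keep, , drop = FALSE]
}

mat <- matrix(as.numeric(1:12), 4, 3)
mat[2, 1] <- NA
mat[4, 3] <- NA
# rows 2 and 4 are dropped, rows 1 and 3 are kept
ld_sketch(mat)
```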

See Also

meanimp, modeimp, medianimp

Examples

    n <- 10
    m <- 4
    mat <- matrix(rnorm(n*m),n,m)
    mat[c(3,6),1] <- NA
    mat[10,2] <- NA
    ld(mat)


meanimp                 Mean imputation

Description

Mean imputation

Usage

    meanimp(mat)

Arguments

    mat   A numerical matrix

Details

The function implements unconditional mean imputation on a numerical matrix with missing values, replacing each missing value with the arithmetic mean of the corresponding variable.

Value

The imputed matrix

See Also

modeimp, medianimp
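Unconditional mean imputation can be sketched column by column in base R. The name meanimp_sketch() is hypothetical and the code is only a minimal illustration of the description above.

```r
# Hypothetical sketch of unconditional mean imputation: each NA is replaced
# by the mean of the observed values in the same column.
meanimp_sketch <- function(mat) {
  for (j in seq_len(ncol(mat))) {
    miss <- is.na(mat[, j])
    mat[miss, j] <- mean(mat[, j], na.rm = TRUE)
  }
  mat
}

mat <- matrix(c(1, 2, NA, 4,
                5, 6, NA, 8), 4, 2)
# the NA in column 1 becomes mean(c(1, 2, 4)),
# the NA in column 2 becomes mean(c(5, 6, 8))
meanimp_sketch(mat)
```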

Examples

    set.seed(1)
    n <- 10
    m <- 3
    mat <- matrix(rnorm(n*m),n,m)
    matm <- mat
    matm[1,1] <- NA
    matm[2,2:3] <- NA
    # matrix with missing values
    matm
    # imputed matrix
    meanimp(matm)
    # original matrix with no missing values
    mat


medianimp               Median imputation

Description

Median imputation

Usage

    medianimp(mat)

Arguments

    mat   A matrix of ordinal values, ordered according to the Likert scale
          (1, 2, 3,...)

Details

The function implements median imputation on a matrix of ordinal data with missing values, replacing each missing value with the median of the corresponding variable.

Value

The imputed matrix

See Also

modeimp, meanimp
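Median imputation follows the same column-wise scheme. The function medianimp_sketch() below is a hypothetical base R illustration, not the package's code.

```r
# Hypothetical sketch of median imputation: each NA is replaced by the
# median of the observed values in the same column.
medianimp_sketch <- function(mat) {
  for (j in seq_len(ncol(mat))) {
    miss <- is.na(mat[, j])
    mat[miss, j] <- stats::median(mat[, j], na.rm = TRUE)
  }
  mat
}

mat <- matrix(c(1, 2, 2, 4, NA,
                3, 3, 1, NA, 4), 5, 2)
# observed medians: 2 for column 1, 3 for column 2
medianimp_sketch(mat)
```

With an even number of observed values, stats::median() can return a half-integer that is not a valid category; how medianimp treats that case is not stated in this manual.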

Examples

    set.seed(1)
    n <- 10
    m <- 3
    mat <- matrix(ceiling(runif(n*m)*4),n,m)
    matm <- mat
    matm[1,3] <- NA
    matm[9:10,1] <- NA
    # matrix with missing values
    matm
    # imputed matrix
    medianimp(matm)
    # original matrix with no missing values
    mat


missing                 Random generation of missing values

Description

Random generation of missing values in matrices of numerical data or, preferably, categorical data coded as integers

Usage

    missing(mat, nummissing, pattern = "r", nk = 1, p = 0.1, w = 3)

Arguments

    mat          A matrix of numerical values
    nummissing   number of missing values
    pattern      pattern of missing values ("r" random, "l" lowest value,
                 "b" block, "n" not at random)
    nk           index of the variable(s) on which missing values are
                 generated (patterns "l", "b" and "n"; see Details)
    p            percentage of missing values (pattern "l")
    w            weight for the lowest category in pps sampling (pattern "n")

Details

The function generates random missing values on a matrix of categorical data according to a specific pattern:

    "r"   random pattern
    "l"   generates a percentage p of missing values on the lowest values
          of variable nk
    "b"   generates random blocks of missing values on the group of
          variables indexed by nk
    "n"   generates a kind of not at random missing values: lowest values
          are more likely to be missing, since they are assigned a weight w
          (greater than 1, the default is 3) and the values are sampled
          according to an unequal probability sampling design (pivotal, see
          the reference for more details)
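The idea behind pattern "n" can be sketched for a single variable in base R. This is an assumption-laden illustration: the hypothetical function mnar_sketch() uses sample() with unequal probabilities instead of the pivotal design mentioned above, so it mimics the weighting but not the package's sampling scheme.

```r
# Hypothetical sketch of "not at random" missingness on one ordinal variable:
# cells holding the lowest category get weight w, all other cells weight 1.
# NOTE: sample(prob = ...) is used here in place of pivotal sampling.
mnar_sketch <- function(x, nummissing, w = 3) {
  weights <- ifelse(x == min(x), w, 1)
  miss <- sample(length(x), size = nummissing, prob = weights)
  x[miss] <- NA
  x
}

set.seed(1)
x <- sample(1:4, 200, replace = TRUE)
x_miss <- mnar_sketch(x, nummissing = 30, w = 3)
# exactly nummissing cells are set to NA
sum(is.na(x_miss))
# distribution of the original values that were removed
table(x[is.na(x_miss)])
```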

Value

The original matrix with the desired number of values randomly substituted by missing values

References

Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://www.sciencedirect.com/science/article/pii/s0167947311000521

Examples

    set.seed(1)
    # correlation matrix
    sigma <- matrix(c(1,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,1),4,4)
    # generate a n*m matrix from a multivariate normal
    n <- 500
    m <- 4
    matc <- rmvnorm(n, mean=rep(0,m), sigma=sigma)
    # transform the numerical values into ordinal categories (Likert scale)
    # obtaining matrix mato
    mato <- transfcat(matc,c(2,3,4,5))
    # set the number of desired missing values
    nummissing <- 150
    # create the random missing values
    # random missing values
    matr <- missing(mato, nummissing, pattern="r")
    matr
    # random blocks of missing values on variables 2 and 3
    matb <- missing(mato, nummissing, pattern="b", nk=c(2,3))
    matb
    # missing values on lowest category of variable 4
    matl <- missing(mato, nummissing, pattern="l", nk=4, p=0.1)
    matl
    # not at random missing values on variable 4
    matn <- missing(mato, nummissing, pattern="n", nk=4, w=4)
    matn


missing2                Random generation of missing values

Description

Random generation of missing values in matrices

Usage

    missing2(mat, missing)

Arguments

    mat       a matrix (n rows, m columns)
    missing   a vector: element i contains the desired number of rows with
              i missing values (1<=i<=m)

Value

a matrix with the specified pattern of missing values

See Also

missing, missingness

Examples

    mat <- matrix(rnorm(500),100,5)
    # if you want 20 rows with 1 missing, 10 rows with 2 missing,
    # 4 rows with 3 missing, 1 row with 4 missing
    missing <- c(20,10,4,1)
    matm <- missing2(mat, missing)
    matm
    # check that the function works
    missingness(matm)


missingness             Missing values

Description

Summary for the missing values in a matrix

Usage

    missingness(mat)

Arguments

    mat   a matrix/dataframe with missing values

Details

The function provides a summary for the missing values in a matrix (units by variables)

Value

    number_of_missing_values      Total number of missing values in the
                                  matrix
    missing_values_per_unit       Number of units with a certain number of
                                  missing values
    missing_values_per_variable   Number of missing values for each variable

Examples

    n <- 100
    m <- 3
    mat <- matrix(rnorm(n*m),n,m)
    nummissing <- 50
    index <- sample(n*m,nummissing,replace=FALSE)
    mat[index] <- NA
    missingness(mat)


modeimp                 Mode imputation

Description

Mode imputation

Usage

    modeimp(mat)

Arguments

    mat   A matrix of categorical or ordinal values, coded as integer
          values (1, 2, 3,...)

Details

The function implements mode imputation on a matrix of categorical or ordinal data with missing values, replacing each missing value with the mode of the corresponding variable.

Value

The imputed matrix
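Mode imputation can be sketched in base R with table(). The function modeimp_sketch() is hypothetical; in particular, taking the smallest category when several are tied for the mode is an assumption of this sketch, since the manual does not say how ties are resolved.

```r
# Hypothetical sketch of mode imputation: each NA is replaced by the most
# frequent observed category in the same column.
modeimp_sketch <- function(mat) {
  for (j in seq_len(ncol(mat))) {
    counts <- table(mat[, j])  # table() ignores NA by default
    # which.max() returns the first maximum, i.e. the smallest tied category
    mode_j <- as.numeric(names(counts)[which.max(counts)])
    mat[is.na(mat[, j]), j] <- mode_j
  }
  mat
}

mat <- matrix(c(1, 2, 2, NA, 3,
                3, 3, 1, NA, 2), 5, 2)
# observed modes: 2 for column 1, 3 for column 2
modeimp_sketch(mat)
```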

Author(s)

Alessandro Barbiero, Giancarlo Manzi, Pier Alda Ferrari

See Also

medianimp, modeimp

Examples

    set.seed(1)
    n <- 10
    m <- 3
    mat <- matrix(ceiling(runif(n*m)*4),n,m)
    matm <- mat
    matm[1,3] <- NA
    matm[9:10,1] <- NA
    # matrix with missing values
    matm
    # imputed matrix
    modeimp(matm)
    # original matrix with no missing values
    mat


rancat                  Generating a random matrix of ordinal variables

Description

The function generates a random matrix of integer (ordinal) variables, with independent and uniform marginal distributions

Usage

    rancat(n, m, cat = 3)

Arguments

    n     number of rows/units
    m     number of columns/variables
    cat   number of categories for each variable

Details

The function generates a random matrix of integer (ordinal) variables (coded with 1, 2, 3...), with independent and uniform marginal distributions

Value

a matrix of ordinal values
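Independent, uniformly distributed ordinal columns can be sketched in base R with sample.int(). The function rancat_sketch() is hypothetical, and recycling a scalar cat over all columns is an assumption of this sketch.

```r
# Hypothetical sketch: n x m matrix whose j-th column is drawn uniformly
# and independently from 1, ..., cat[j].
rancat_sketch <- function(n, m, cat = 3) {
  cat <- rep(cat, length.out = m)  # one number of categories per column
  cols <- lapply(cat, function(k) sample.int(k, n, replace = TRUE))
  matrix(unlist(cols), n, m)       # fill column by column
}

set.seed(1)
mat <- rancat_sketch(500, 3, c(3, 4, 5))
dim(mat)
# frequencies per column should be roughly equal within each column
apply(mat, 2, tabulate)
```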

Author(s)

Alessandro Barbiero, Giancarlo Manzi, Pier Alda Ferrari

See Also

transfcat

Examples

    n <- 500
    m <- 3
    mat <- rancat(n,m,c(3,4,5))
    # let's check the marginal distributions...
    apply(mat,2,tabulate)
    # ... should be "quite" uniform


transfcat               Transforming a matrix of continuous values into a
                        matrix of ordinal values

Description

The function transforms a matrix of continuous numerical values into a matrix of integer (ordinal) values, with uniform marginal distributions and the desired number of categories

Usage

    transfcat(mat, cat = 3)

Arguments

    mat   a matrix or a dataframe
    cat   the number of categories, one for each column/variable of the
          matrix/dataframe

Details

The function converts the input matrix, containing continuous numerical values, into a matrix of ordinal values (1,2,3,... i.e. a Likert scale) according to the cat-1 normal quantiles corresponding to each variable (column) of mat.

Value

the matrix of ordinal values
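One possible reading of the quantile-based discretisation is sketched below in base R. The function transfcat_sketch() is hypothetical, and two choices are assumptions of this sketch rather than documented behaviour: the cut points are the normal quantiles at probabilities 1/cat, ..., (cat-1)/cat, and they are computed from each column's sample mean and standard deviation.

```r
# Hypothetical sketch: cut each column at cat[j]-1 normal quantiles so that
# the resulting categories 1, ..., cat[j] are roughly equally likely for
# normally distributed data.
transfcat_sketch <- function(mat, cat = 3) {
  mat <- as.matrix(mat)
  cat <- rep(cat, length.out = ncol(mat))
  out <- mat
  for (j in seq_len(ncol(mat))) {
    k <- cat[j]
    probs <- seq_len(k - 1) / k
    # ASSUMPTION: quantiles of a normal fitted to the column
    cuts <- stats::qnorm(probs, mean = mean(mat[, j]), sd = stats::sd(mat[, j]))
    # findInterval() counts the cut points at or below each value: 0, ..., k-1
    out[, j] <- findInterval(mat[, j], cuts) + 1
  }
  out
}

set.seed(1)
mat <- matrix(rnorm(400, mean = 10, sd = 4), 100, 4)
res <- transfcat_sketch(mat, cat = c(2, 3, 4, 3))
apply(res, 2, range)
```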

References

Ferrari P.A., Barbiero A., Manzi G. (2011) Handling missing data in presence of ordinal variables: a new imputation procedure. In "New Perspectives in Statistical Modeling and Data Analysis", S. Ingrassia, R. Rocci, M. Vichi, Eds., Springer

Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://www.sciencedirect.com/science/article/pii/s0167947311000521

See Also

rancat

Examples

    # generate a 10*4 matrix of independent normal values
    # with mean 3 and standard deviation 1
    mat <- matrix(rnorm(40,3),10,4)
    # transform the matrix of normal data into a matrix of ordinal data
    transfcat(mat, cat=c(2,3,4,3))


vcosw                   Cosine of the angle between two vectors

Description

The function calculates the cosine of the angle between two vectors, defined as the inner product of the vectors divided by the product of their Euclidean norms

Usage

    vcosw(v, w)

Arguments

    v   a vector
    w   a vector, of the same length as v

Value

The cosine of the angle between the two vectors
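The definition above translates directly into base R. The function vcos_sketch() is a hypothetical illustration of the formula, not the package's code.

```r
# Hypothetical sketch: cosine of the angle between v and w, i.e. the inner
# product divided by the product of the Euclidean norms.
vcos_sketch <- function(v, w) {
  stopifnot(length(v) == length(w))
  sum(v * w) / (sqrt(sum(v^2)) * sqrt(sum(w^2)))
}

# inner product 1*3 + 2*(-3) + 3*1 = 0, so the vectors are orthogonal
vcos_sketch(c(1, 2, 3), c(3, -3, 1))
# angle of 45 degrees, cosine 1/sqrt(2)
vcos_sketch(c(1, 0), c(1, 1))
```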

References

Ferrari P.A., Annoni P., Barbiero A., Manzi G. (2011) An imputation method for categorical variables with application to nonlinear principal component analysis, Computational Statistics & Data Analysis, vol. 55, issue 7, pages 2410-2420, http://www.sciencedirect.com/science/article/pii/s0167947311000521

Examples

    a <- 1:10
    b <- 2:11
    vcosw(a,b)
    #
    e <- c(1,2,3)
    f <- c(3,-3,1)
    vcosw(e,f)
    # e and f are orthogonal vectors!
