Missing values
Types of missing values
In statistics, three types of missing values are currently distinguished:
- MCAR - Missing completely at random
- MAR - Missing at random
- MNAR - Missing not at random
The three types correspond to different levels of knowledge about the distribution of the missing values.
Let $X_{obs}$ denote the variables where no information is missing and $X_{mis}$ the variables where the information is missing for some observations. $R$ is a matrix of ones and zeros, with a zero entry whenever the information in $X_{mis}$ is missing and a one entry whenever the information in $X_{mis}$ is not missing. Then
- MCAR means that the probability that a value is missing depends neither on $X_{obs}$ nor on $X_{mis}$,
- MAR means that the probability that a value is missing does not depend on $X_{mis}$, but may depend on $X_{obs}$, and
- MNAR means that the probability that a value is missing does depend on $X_{mis}$.
Note that MCAR is a special case of MAR.
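The three definitions can also be written as conditions on the distribution of the indicator matrix. The following is only a sketch in assumed notation: $X_{obs}$ stands for the completely observed variables, $X_{mis}$ for the variables with missing entries, $R$ for the 0-1 indicator matrix and $\xi$ for possible parameters of the missingness process. These symbols are an assumption made for illustration.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid X_{obs}, X_{mis}, \xi) = P(R \mid \xi) \\
\text{MAR:}\quad  & P(R \mid X_{obs}, X_{mis}, \xi) = P(R \mid X_{obs}, \xi) \\
\text{MNAR:}\quad & P(R \mid X_{obs}, X_{mis}, \xi) \text{ depends on } X_{mis}
\end{align*}
```

Written this way, the MCAR condition is the MAR condition with the additional restriction that $X_{obs}$ plays no role either, which is why MCAR is a special case of MAR.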
We will see later that in the MCAR and MAR cases statistics provides algorithms that allow us to estimate unknown quantities from the data without bias. In the MNAR case an unbiased estimation of the parameters requires that we know the process that generates the missing values; this is a necessary but not a sufficient condition. In practice we often do not know whether the missing values are MCAR, MAR or MNAR.
Example data
Consider the points a student gets in two different exams, e.g. mathematics and statistics. For mathematics we observe the points for all students. For statistics we generate missing values in three different ways:
- MCAR: for each observation we decide with a 0-1 random generator if the observation is recorded or not, e.g. the students decide randomly if they take the statistics class or not,
- MAR: only those observations in statistics that are above a certain limit in mathematics are recorded, e.g. only those students that are good in mathematics take the statistics class, and
- MNAR: only those observations in statistics, that are above a certain limit in statistics, are recorded, e.g. only the good student results in statistics are recorded.
A perfect replacement method for the missing values needs to recover the underlying distribution of the stat points. In the MCAR case we have information about the whole distribution, but fewer observations, which induces higher variability. In the MAR case we have only indirect information about a part of the distribution, via the relationship between math and stat points; that means that for a part of the distribution we have even fewer observations than in the MCAR case. Finally, in the MNAR case we have no information about a part of the distribution, so there is no way to recover this information.
<R output="html" onsave iframe="width:100%;height:400px;" workspace="missing">
n         <- 30
mean_math <- 30
sd_math   <- 10
mean_stat <- 30
sd_stat   <- 10
corr_ms   <- 0.6
# probability that a stat value is set to missing (about 40% stay observed)
prob_mcar <- 0.6
# cut point: the prob_mcar quantile of the stat distribution
cut       <- mean_stat+sd_stat*qnorm(prob_mcar)
# start computing math and stat
covm <- matrix(c(sd_math^2, corr_ms*sd_math*sd_stat, corr_ms*sd_math*sd_stat, sd_stat^2), ncol=2)
ev   <- eigen(covm)
# multiply standard normal draws by the symmetric square root of covm
pts  <- matrix(rnorm(2*n), ncol=2)%*%ev$vectors%*%diag(sqrt(ev$values))%*%t(ev$vectors)
math <- round(mean_math + pts[,1])
stat <- round(mean_stat + pts[,2])
# compute stat mcar, mar and mnar
stat_mcar <- stat
stat_mcar[runif(n)<prob_mcar] <- NA
stat_mar <- stat
# the same cut is applied to math; this works here because math and stat share mean and sd
stat_mar[math<cut] <- NA
stat_mnar <- stat
stat_mnar[stat<cut] <- NA
data <- cbind(math, stat, stat_mcar, stat_mar, stat_mnar)
rownames(data) <- format(1:n)
colnames(data) <- c("Math points", "Stat points", "Stat points (MCAR)", "Stat points (MAR)", "Stat points (MNAR)")
outHTML(rhtml, data, title='No.')
</R>
For our example we have generated the data from a bivariate normal distribution with means of 30, standard deviations of 10 and a correlation of 0.6. In all three cases we have chosen to keep approximately 40% of the original data as non-missing data. Please take the exact numbers from the R program. Note that R uses NA for missing values.
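That roughly 40% of the stat points stay observed under all three mechanisms can be checked with a small stand-alone sketch. It is not the program above: it uses a large sample, unrounded points, a simpler construction of the correlated normals, and hypothetical names such as prob_miss and mean_pts.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
n         <- 10000                # large n so that the proportions are stable
mean_pts  <- 30                   # common mean of math and stat (assumption)
sd_pts    <- 10                   # common standard deviation (assumption)
corr_ms   <- 0.6
prob_miss <- 0.6                  # probability that a stat value is deleted

# two correlated normal variables built from independent standard normals
z1   <- rnorm(n)
z2   <- rnorm(n)
math <- mean_pts + sd_pts * z1
stat <- mean_pts + sd_pts * (corr_ms * z1 + sqrt(1 - corr_ms^2) * z2)

# 60% quantile of N(30, 10^2); values below it are deleted under MAR and MNAR
cut <- mean_pts + sd_pts * qnorm(prob_miss)

stat_mcar <- ifelse(runif(n) < prob_miss, NA, stat)  # deletion ignores the data
stat_mar  <- ifelse(math < cut, NA, stat)            # deletion depends on math only
stat_mnar <- ifelse(stat < cut, NA, stat)            # deletion depends on stat itself

# proportion of non-missing stat values for each mechanism
observed <- c(MCAR = mean(!is.na(stat_mcar)),
              MAR  = mean(!is.na(stat_mar)),
              MNAR = mean(!is.na(stat_mnar)))
print(round(observed, 3))
```

All three proportions should be close to 0.4, although the observed values come from very different parts of the stat distribution.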
Single imputation methods
By imputation methods we mean all methods that tell us how to replace missing values. With the help of simulations we will analyze the following methods:
- Casewise deletion
- Mean substitution
- Hot deck
- Conditional mean
- Predictive distribution
Note that casewise deletion is actually not an imputation method, but it is very often used in practice.
To judge the quality of a method for handling missing values we have to generate many samples and analyze the distribution of the parameter estimates. We therefore compute the following parameters:
- mean,
- median,
- standard deviation,
- mean absolute deviation,
- correlation and
- the regression coefficients $b_0$ (intercept) and $b_1$ (slope).
The correlation and regression coefficients are computed for the math and stat points for all four cases (Complete data, MCAR, MAR and MNAR).
In the "Simulation results" sections the distribution of each parameter is generated from a large number of samples. Each sample is generated from a bivariate normal distribution; the missing values are generated for MCAR, MAR and MNAR, then the imputation method is applied and the parameters are recorded. For each parameter and each type of missing values a histogram is drawn. The example data set above shows what one such sample looks like.
Casewise deletion
Most statistical software packages offer, as the simplest way to handle missing values, the deletion of all observations that contain missing values. See the parameter estimates from the example data:
<R output="html" onsave iframe="width:100%;height:200px;" workspace="missing">
deletion <- function (x) {
  # mean absolute deviation around the mean; for a normal distribution it equals 0.7979*sd
  # (R's mad() is the scaled median absolute deviation and would not match the true value below)
  mean_abs_dev <- function (v) mean(abs(v-mean(v)))
  # create data: for columns 3 to 5 keep only the rows where the stat value is observed
  d12 <- x[,2]
  d13 <- x[!is.na(x[,3]),3]
  d14 <- x[!is.na(x[,4]),4]
  d15 <- x[!is.na(x[,5]),5]
  d22 <- x[,c(1,2)]
  d23 <- x[!is.na(x[,3]), c(1,3), drop=FALSE]
  d24 <- x[!is.na(x[,4]), c(1,4), drop=FALSE]
  d25 <- x[!is.na(x[,5]), c(1,5), drop=FALSE]
  # mean
  est_mean   <- c(mean(d12), mean(d13), mean(d14), mean(d15))
  # median
  est_median <- c(median(d12), median(d13), median(d14), median(d15))
  # sd
  est_sd     <- c(sd(d12), sd(d13), sd(d14), sd(d15))
  # mean absolute deviation
  est_mad    <- c(mean_abs_dev(d12), mean_abs_dev(d13), mean_abs_dev(d14), mean_abs_dev(d15))
  # cor
  est_cor    <- c(cor(d22)[1,2], cor(d23)[1,2], cor(d24)[1,2], cor(d25)[1,2])
  # b0 & b1
  lm2 <- lm (y~x, data.frame(x=d22[,1], y=d22[,2]))
  lm3 <- lm (y~x, data.frame(x=d23[,1], y=d23[,2]))
  lm4 <- lm (y~x, data.frame(x=d24[,1], y=d24[,2]))
  lm5 <- lm (y~x, data.frame(x=d25[,1], y=d25[,2]))
  b0 <- c(coef(lm2)[1], coef(lm3)[1], coef(lm4)[1], coef(lm5)[1])
  b1 <- c(coef(lm2)[2], coef(lm3)[2], coef(lm4)[2], coef(lm5)[2])
  return (cbind(mean=est_mean, median=est_median, sd=est_sd, mad=est_mad, cor=est_cor, b0, b1))
}
n         <- 30
mean_math <- 30
sd_math   <- 10
mean_stat <- 30
sd_stat   <- 10
corr_ms   <- 0.6
prob_mcar <- 0.6
cut       <- mean_stat+sd_stat*qnorm(prob_mcar)
# data comes from the workspace !
# first row: true values of the parameters under the bivariate normal model
param <- rbind(c(mean_stat, mean_stat, sd_stat, 0.7979*sd_stat, corr_ms, mean_stat-corr_ms*sd_stat/sd_math*mean_math, corr_ms*sd_stat/sd_math), deletion(data))
rownames(param) <- c("True value", "Complete data", "MCAR", "MAR", "MNAR")
colnames(param) <- c("Mean", "Median", "Std.Dev.", "Mean abs. Dev.", "Correlation", "Intercept", "Slope")
outHTML(rhtml, param, digits=4, harow='left')
</R>
Although we have only one sample, we can see that the MNAR case is the most problematic one in terms of deviation from the true values. Even the use of more robust measures (median, mean absolute deviation) does not improve the situation.
In general, casewise deletion is applicable in the MCAR case, but it is not necessarily efficient. In the MAR case the parameter estimates are often biased.
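The behaviour of casewise deletion over many samples can be illustrated with a small simulation. The sketch below is not the simulation program of this page; it only tracks the mean of the stat points, and names such as one_sample and prob_miss are hypothetical.

```r
set.seed(2)                        # arbitrary seed, only for reproducibility
n         <- 30                    # observations per sample
nboot     <- 500                   # number of replications
mean_pts  <- 30
sd_pts    <- 10
corr_ms   <- 0.6
prob_miss <- 0.6                   # probability that a stat value is deleted
cut       <- mean_pts + sd_pts * qnorm(prob_miss)

# draw one sample and return the mean of the stat points after casewise deletion
one_sample <- function() {
  z1   <- rnorm(n)
  z2   <- rnorm(n)
  math <- mean_pts + sd_pts * z1
  stat <- mean_pts + sd_pts * (corr_ms * z1 + sqrt(1 - corr_ms^2) * z2)
  keep_mcar <- runif(n) >= prob_miss   # kept independently of the data
  keep_mar  <- math >= cut             # kept if the math points are high
  keep_mnar <- stat >= cut             # kept if the stat points are high
  c(Complete = mean(stat),
    MCAR     = mean(stat[keep_mcar]),
    MAR      = mean(stat[keep_mar]),
    MNAR     = mean(stat[keep_mnar]))
}

sim <- replicate(nboot, one_sample())  # matrix with 4 rows and nboot columns
avg <- rowMeans(sim, na.rm = TRUE)     # average estimate per mechanism
spread <- apply(sim, 1, sd, na.rm = TRUE)
print(round(rbind(average = avg, std.dev = spread), 2))
```

Under these assumptions the MCAR average should stay close to the true mean of 30, but with a larger spread than for the complete data, while the MAR and MNAR averages are shifted upwards because only students with high points remain.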
Available case analysis
A variant of casewise deletion is the available case analysis. For example, we know that the correlation can be computed as $r_{xy} = s_{xy}/(s_x s_y)$. Rather than computing the correlation directly from the complete cases, we compute the covariance $s_{xy}$ and the standard deviations $s_x$ and $s_y$ separately, each from as many observations as possible.
But the resulting correlation coefficient can obviously lie outside the interval $[-1, 1]$. Also, different sets of observations are used to compute the required quantities, which makes it difficult to compute standard errors.
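A minimal sketch of the available case idea for the correlation, on made-up data: the covariance can only use the pairs where both values are present, whereas each standard deviation uses every value that is available for its own variable. The data and all variable names are assumptions for illustration.

```r
set.seed(3)                                   # arbitrary seed
n    <- 30
math <- rnorm(n, mean = 30, sd = 10)
stat <- 30 + 0.6 * (math - 30) + rnorm(n, mean = 0, sd = 8)
stat[sample(n, 18)] <- NA                     # delete 18 stat values at random

# covariance from the pairs in which both math and stat are observed
s_xy <- cov(math, stat, use = "pairwise.complete.obs")
# standard deviations from all values available for each variable
s_x  <- sd(math)                              # all 30 math values
s_y  <- sd(stat, na.rm = TRUE)                # the 12 observed stat values

r_available <- s_xy / (s_x * s_y)
# for comparison: correlation from the complete cases only
r_casewise  <- cor(math, stat, use = "complete.obs")
print(c(available = r_available, casewise = r_casewise))
```

In this two-variable setting the two estimates differ exactly by the ratio of the math standard deviation in the complete cases to the math standard deviation in all cases. If that ratio is large enough, the available case estimate can leave the admissible range.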
Mean substitution
Another popular method is unconditional mean imputation, or mean substitution: each missing value is replaced by the mean of the available data. In the MCAR case the average is preserved, but other quantities such as the correlation cannot be computed correctly. Even the estimated standard error of the mean, $\sigma/\sqrt{n}$, is too small: we need to replace the unknown true $\sigma$ by the estimated standard deviation $s$. As we can see from the simulation, the standard deviation is biased downward and the number of observations is overstated. Both reduce the estimated standard error of the mean. In the MAR case the estimates are generally biased.
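The effect on the standard deviation and on the estimated standard error can be seen in a few lines. This is a sketch on made-up MCAR data; the variable names are hypothetical.

```r
set.seed(4)                                  # arbitrary seed
n         <- 30
stat      <- round(rnorm(n, mean = 30, sd = 10))
stat_mcar <- stat
stat_mcar[sample(n, 18)] <- NA               # 18 of 30 values missing at random

observed <- stat_mcar[!is.na(stat_mcar)]

# mean substitution: every missing value gets the mean of the observed values
stat_imp <- stat_mcar
stat_imp[is.na(stat_imp)] <- mean(observed)

# estimated standard error of the mean, s / sqrt(n)
se_observed <- sd(observed) / sqrt(length(observed))   # based on 12 values
se_imputed  <- sd(stat_imp) / sqrt(length(stat_imp))   # pretends to have 30 values

print(round(c(mean_observed = mean(observed), mean_imputed = mean(stat_imp),
              sd_observed   = sd(observed),   sd_imputed   = sd(stat_imp),
              se_observed   = se_observed,    se_imputed   = se_imputed), 3))
```

The mean is unchanged by construction, while the standard deviation shrinks because the imputed values add nothing to the sum of squares but increase the divisor. Together with the larger nominal sample size this makes the standard error look smaller than the data justify.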
Hot deck
To overcome the problem that unconditional mean imputation understates the variance, the unconditional distribution imputations try to preserve the distribution of the variable. Hot deck imputation simply draws each imputation value at random from the observed data.
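A minimal sketch of hot deck imputation, assuming that the donor values are drawn with replacement from all observed values of the same variable. The toy vector and the names donors and stat_imp are made up for illustration.

```r
set.seed(5)                                   # arbitrary seed
# toy data: stat points with missing values (hypothetical numbers)
stat_mcar <- c(31, NA, 24, NA, NA, 42, 35, NA, 28, NA)

missing <- is.na(stat_mcar)
donors  <- stat_mcar[!missing]                # observed values act as donors

# draw one donor value for every missing entry, with replacement;
# sampling indices avoids the special behaviour of sample() for length-one vectors
stat_imp <- stat_mcar
stat_imp[missing] <- donors[sample.int(length(donors), sum(missing), replace = TRUE)]

print(stat_imp)
```

Because every imputed value is a real observed value, the imputed variable keeps roughly the shape of the observed distribution, but no use is made of the relationship to the math points.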
Conditional mean
A more sophisticated method is conditional mean imputation. Here we use the information about the relationship between the math and stat points. From the available observations we compute a linear regression and impute the regression prediction for each missing stat value. If there is no relationship, the method reduces to the unconditional mean. This method is nearly optimal for a limited class of problems, but not for estimating coefficients of association, e.g. the correlation. Obviously, using such imputed values leads to an overstated strength of the relationship, because the imputed points lie exactly on the regression line.
Please note that the good results for estimating the correlation in the MNAR case rely on the fact that a linear transformation of a normal distribution is again a normal distribution.
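Conditional mean imputation can be sketched with lm and predict. The example uses made-up MCAR data and hypothetical names, so it illustrates the mechanics only and does not reproduce the simulation results.

```r
set.seed(6)                                   # arbitrary seed
n    <- 30
math <- round(rnorm(n, mean = 30, sd = 10))
stat <- round(30 + 0.6 * (math - 30) + rnorm(n, mean = 0, sd = 8))
stat_mcar <- stat
stat_mcar[sample(n, 18)] <- NA                # 18 of 30 stat values missing

d    <- data.frame(math = math, stat = stat_mcar)
miss <- is.na(d$stat)

# regression of stat on math; lm() drops the rows with missing stat by default
fit <- lm(stat ~ math, data = d)

# impute the fitted regression value for every missing stat value
d$stat_imp       <- d$stat
d$stat_imp[miss] <- predict(fit, newdata = d[miss, ])

r_casewise <- cor(d$math, d$stat, use = "complete.obs")
r_imputed  <- cor(d$math, d$stat_imp)
print(round(c(casewise = r_casewise, imputed = r_imputed), 3))
```

Since the imputed points have zero residuals, the correlation computed from the completed data is at least as large in absolute value as the one from the complete cases.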
Predictive distribution
As in conditional mean imputation, we use the relationship between the math and stat points to estimate a replacement value. The linear regression gives us not only a prediction but also residuals. From the distribution of the residuals we randomly sample a residual value, which is added to the predicted value. This gives us the final replacement value.
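A minimal sketch of this idea, assuming that the residual is drawn with replacement from the empirical residuals of the fitted regression; drawing from a fitted normal distribution would be another reading of the description. Data and names are made up for illustration.

```r
set.seed(7)                                   # arbitrary seed
n    <- 30
math <- round(rnorm(n, mean = 30, sd = 10))
stat <- round(30 + 0.6 * (math - 30) + rnorm(n, mean = 0, sd = 8))
stat_mcar <- stat
stat_mcar[sample(n, 18)] <- NA                # 18 of 30 stat values missing

d    <- data.frame(math = math, stat = stat_mcar)
miss <- is.na(d$stat)

fit  <- lm(stat ~ math, data = d)             # fitted on the complete cases
pred <- unname(predict(fit, newdata = d[miss, ]))
res  <- unname(residuals(fit))                # one residual per complete case

# predicted value plus a randomly drawn residual
drawn <- res[sample.int(length(res), sum(miss), replace = TRUE)]
d$stat_imp       <- d$stat
d$stat_imp[miss] <- pred + drawn

print(round(c(sd_observed = sd(d$stat, na.rm = TRUE), sd_imputed = sd(d$stat_imp)), 3))
```

Compared with the conditional mean, the added residuals put the scatter around the regression line back into the imputed values, so the variability of the stat points is no longer understated in the same systematic way.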
Simulation results
Missing value imputation method:
- simulation results for casewise deletion
- simulation results for mean substitution
- simulation results for hot deck
- simulation results for conditional mean
- simulation results for predictive distribution
The parameters that can be changed in the programs are:
- n the number of observations in the sample (30)
- mean_math the mean of the math points (30)
- sd_math the standard deviation of the math points (10)
- mean_stat the mean of the stat points (30)
- sd_stat the standard deviation of the stat points (10)
- corr_ms the correlation of the math and stats points (0.6)
- p_mcar the proportion of missing observations (0.6), so that approximately 40% of the stat points remain non-missing
- nboot the number of replications (500)