Similar literature
 20 similar articles found (search time: 578 ms)
1.
Imputation: Methods, Simulation Experiments and Practical Examples   (Total citations: 1; self-citations: 0; citations by others: 1)
When conducting surveys, two kinds of nonresponse may cause incomplete data files: unit nonresponse (complete nonresponse) and item nonresponse (partial nonresponse). The selectivity of the unit nonresponse is often corrected for. Various imputation techniques can be used for the missing values because of item nonresponse. Several of these imputation techniques are discussed in this report. One is the hot deck imputation. This paper describes two simulation experiments of the hot deck method. In the first study, data are randomly generated, and various percentages of missing values are then non-randomly 'added' to the data. The hot deck method is used to reconstruct the data in this Monte Carlo experiment. The performance of the method is evaluated for the means, standard deviations, and correlation coefficients and compared with the available case method. In the second study, the quality of an imputation method is studied by running a simulation experiment. A selection of the data of the Dutch Housing Demand Survey is perturbed by leaving out specific values on a variable. Again hot deck imputations are used to reconstruct the data. The imputations are then compared with the true values. In both experiments the conclusion is that the hot deck method generally performs better than the available case method. This paper also deals with the questions of which variables should be imputed and how long the imputation process takes. Finally the theory is illustrated by the imputation approaches of the Dutch Housing Demand Survey, the European Community Household Panel Survey (ECHP) and the new Dutch Structure of Earnings Survey (SES). These examples illustrate the levels of missing data that can be experienced in such surveys and the practical problems associated with choosing an appropriate imputation strategy for key items from each survey.
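The abstract above does not say which hot deck variant the authors used. As a rough illustration only, the sketch below shows one common variant, a random hot deck within imputation cells: each missing value is replaced by an observed value drawn from a donor in the same cell. The function name `hot_deck_impute`, the record layout, and the fallback to all donors when a cell is empty are assumptions made for this example, not details from the paper.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, target, cell_keys, seed=0):
    """Random hot deck within imputation cells (illustrative sketch).

    records   : list of dicts; a missing value of `target` is None
    target    : name of the variable to impute
    cell_keys : names of fully observed variables that define the cells
    Returns a new list of dicts; the input list is not modified.
    """
    rng = random.Random(seed)

    # Collect the observed target values (donors) in each cell.
    donors = defaultdict(list)
    for r in records:
        if r[target] is not None:
            donors[tuple(r[k] for k in cell_keys)].append(r[target])

    all_donors = [v for values in donors.values() for v in values]
    if not all_donors:
        raise ValueError("no observed values available as donors")

    completed = []
    for r in records:
        r = dict(r)  # copy so the caller's data stays untouched
        if r[target] is None:
            cell = tuple(r[k] for k in cell_keys)
            # Assumption: fall back to the full donor pool if the cell is empty.
            pool = donors.get(cell) or all_donors
            r[target] = rng.choice(pool)
        completed.append(r)
    return completed


survey = [
    {"region": "N", "rent": 500},
    {"region": "N", "rent": None},
    {"region": "S", "rent": 700},
    {"region": "S", "rent": 720},
    {"region": "S", "rent": None},
]
completed_survey = hot_deck_impute(survey, "rent", ["region"])
```

Because donors are real observed values, the imputed data keep the observed value range, which is one reason hot deck methods are often compared with available case analysis on means, standard deviations and correlations.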

2.
Although item nonresponse can never be totally prevented, it can be considerably reduced, providing the researcher not only with more usable data but also with helpful auxiliary information for better imputation and adjustment. To achieve this, an optimal data collection design is necessary. Optimizing the questionnaire and the survey design is the main tool a researcher has to reduce the amount of missing data in any such survey. In this contribution a concise typology of missing data patterns and their sources is presented. Based on this typology, the mechanisms responsible for missing data are identified, followed by a discussion on how item nonresponse can be prevented.

3.
In many surveys, imputation procedures are used to account for non‐response bias induced by either unit non‐response or item non‐response. Such procedures are optimised (in terms of reducing non‐response bias) when the models include covariates that are highly predictive of both response and outcome variables. To achieve this, we propose a method for selecting sets of covariates used in regression imputation models or to determine imputation cells for one or more outcome variables, using the fraction of missing information (FMI) as obtained via a proxy pattern‐mixture (PPM) model as the key metric. In our variable selection approach, we use the PPM model to obtain a maximum likelihood estimate of the FMI for separate sets of candidate imputation models and look for the point at which changes in the FMI level off and further auxiliary variables do not improve the imputation model. We illustrate our proposed approach using empirical data from the Ohio Medicaid Assessment Survey and from the Service Annual Survey.

4.
A common problem in survey sampling is to compare two cross‐sectional estimates for the same study variable taken from two different waves or occasions. These cross‐sectional estimates often include imputed values to compensate for item non‐response. The estimation of the sampling variance of the estimator of change is useful to judge whether the observed change is statistically significant. Estimating the variance of a change is not straightforward because of the rotation in repeated surveys and imputation. We propose using a multivariate linear regression approach and show how it can be used to accommodate the effect of rotation and imputation. The regression approach gives a design‐consistent estimation of the variance of change when the sampling fraction is small. We illustrate the proposed approach using random hot‐deck imputation, although the proposed estimator can be implemented with other imputation techniques.

5.
This paper discusses the importance of managing data quality in academic research in its relation to satisfying the customer. The focus is on the data completeness objective dimension of data quality in relation to recent advancements which have been made in the development of methods for analysing incomplete multivariate data. An overview and comparison of the traditional techniques with the recent advancements are provided. Multiple imputation is also discussed as a method of analysing incomplete multivariate data, which can potentially reduce some of the biases which can occur from using some of the traditional techniques. Despite these recent advancements in the analysis of incomplete multivariate data, evidence is presented which shows that researchers are not using these techniques to manage the data quality of their current research across a variety of academic disciplines. An analysis is then provided as to why these techniques have not been adopted along with suggestions to improve the frequency of their use in the future. Source-Reference. The ideas for this paper originated from research work on David J. Fogarty's Ph.D. dissertation. The subject area is the use of advanced techniques for the imputation of incomplete multivariate data on corporate data warehouses.

6.
The missing data problem has been widely addressed in the literature. The traditional methods for handling missing data may not be suited to spatial data, which can exhibit distinctive structures of dependence and/or heterogeneity. As a possible solution to the spatial missing data problem, this paper proposes an approach that combines the Bayesian Interpolation method [Benedetti, R. & Palma, D. (1994) Markov random field-based image subsampling method, Journal of Applied Statistics, 21(5), 495–509] with a multiple imputation procedure. The method is developed in a univariate and a multivariate framework, and its performance is evaluated through an empirical illustration based on data related to labour productivity in European regions.

7.
This study investigated the performance of multiple imputation with the Expectation-Maximization (EM) algorithm and the Markov chain Monte Carlo (MCMC) method in missing data imputation. We compared the accuracy of imputation based on some real data, set up two extreme scenarios, and conducted both empirical and simulation studies to examine the effects of missing data rates and the number of items used for imputation. In the empirical study, the scenario represented the item with the highest missing rate from a domain with the fewest items. In the simulation study, we selected a domain with the most items, and the item imputed had the lowest missing rate. In the empirical study, the results showed no significant difference between the EM algorithm and the MCMC method for item imputation, and the number of items used for imputation had little impact. Compared with the actual observed values, the middle responses of 3 and 4 were over-imputed, and the extreme responses of 1, 2 and 5 were under-represented. Similar patterns occurred for domain imputation: there was no significant difference between the EM algorithm and the MCMC method, and the number of items used for imputation had little impact. In the simulation study, we chose the environmental domain to examine the effect of the following variables: imputation method (EM algorithm or MCMC method), missing data rate, and number of items used for imputation. Again, there was no significant difference between the EM algorithm and the MCMC method. The accuracy rates did not decrease significantly as the proportion of missing data increased. The number of items used for imputation made some contribution to the accuracy of imputation, but not as much as expected.

8.
There has been a growing interest regarding generalized classes of distributions in statistical theory and practice because of their flexibility in model formation. Multiple imputation under such distributions that span a broader area in the symmetry–kurtosis plane appears to have the potential of better capturing real incomplete data trends. In this article, we impute continuous univariate data that exhibit varying characteristics under two well-known distributions, assess the extent to which this procedure works properly, make comparisons with normal imputation models in terms of commonly accepted bias and precision measures, and discuss possible generalizations to the multivariate case and to larger families of distributions.

9.
The statistical analysis of empirical questionnaire data can be hampered by the fact that not all questions are answered by all individuals. In this paper we propose a simple practical method to deal with such item nonresponse in case of ordinal questionnaire data, where we assume that item nonresponse is caused by an incomplete set of answers between which the individuals are supposed to choose. Our statistical method is based on extending the ordinal regression model with an additional category for nonresponse, and on investigating whether this extended model describes and forecasts the data well. We illustrate our approach for two questions from a questionnaire held amongst a sample of clients of a financial investment company.

10.
We studied undercoverage and nonresponse in a telephone survey among the population of the City of Groningen, the Netherlands. The original sample, drawn from the municipal population register, contained 7000 individuals. For 37 percent of them, the telephone company was unable to produce a valid telephone number. Of those with a known telephone number, 49 percent did not answer the telephone or refused to cooperate. As a result, the final respondents comprised merely 32 percent of the original sample. To study distributional bias, we used individual-level data compiled from municipal records as our benchmark. Bivariate as well as multivariate analyses showed the undercoverage to be strongly related to all sociodemographic variables studied, except gender. Nonresponse was related to age, country of origin, marital status, and household size and composition, but not to gender, unemployment, social assistance benefit, and education. Both undercoverage and nonresponse contributed to a strong middle class bias in the final data set: middle-aged and older, economically secure people, of Dutch origin and living with others in a household are overrepresented, while persons in disadvantaged and marginal positions, such as the young, people of foreign stock, the unemployed, persons depending on public income support and singles are underrepresented.
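The 32 percent figure in the abstract above follows from multiplying the covered share of the sample by the response rate among covered individuals. A minimal check of that arithmetic, using only the rounded percentages quoted in the abstract (the function name is made up for this example):

```python
def final_response_fraction(undercoverage, nonresponse_given_covered):
    """Share of the original sample that ends up responding.

    undercoverage             : fraction with no valid telephone number
    nonresponse_given_covered : fraction of covered people who did not respond
    """
    covered = 1.0 - undercoverage
    return covered * (1.0 - nonresponse_given_covered)


# 37% undercoverage, then 49% nonresponse among the remaining 63%.
fraction = final_response_fraction(0.37, 0.49)
print(round(fraction, 4))  # 0.3213, i.e. about 32 percent
```

Applied to the sample of 7000, this corresponds to roughly 2250 respondents; the exact count cannot be recovered from the rounded percentages.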

11.
This study concerns list augmentation in direct marketing. List augmentation is a special case of missing data imputation. We review previous work on the mixed outcome factor model and apply it for the purpose of list augmentation. The model deals with both discrete and continuous variables and allows us to augment the data for all subjects in a company's transaction database with soft data collected in a survey among a sample of those subjects. We propose a bootstrap-based imputation approach, which is appealing to use in combination with the factor model, since it allows one to include estimation uncertainty in the imputation procedure in a simple, yet adequate manner. We provide an empirical case study of the performance of the approach on the transaction database of a bank.

12.
Empirical count data are often zero‐inflated and overdispersed. Currently, there is no software package that allows adequate imputation of these data. We present multiple‐imputation routines for these kinds of count data based on a Bayesian regression approach or alternatively based on a bootstrap approach that work as add‐ons for the popular multiple imputation by chained equations (mice) software in R (van Buuren and Groothuis‐Oudshoorn, Journal of Statistical Software, vol. 45, 2011, p. 1). We demonstrate in a Monte Carlo simulation that our procedures are superior to currently available count data procedures. It is emphasized that thorough modeling is essential to obtain plausible imputations and that model mis‐specifications can bias parameter estimates and standard errors quite noticeably. Finally, the strengths and limitations of our procedures are discussed, and fruitful avenues for future theory and software development are outlined.

13.
Factor analysis models are used in data dimensionality reduction problems where the variability among observed variables can be described through a smaller number of unobserved latent variables. This approach is often used to estimate the multidimensionality of well-being. We employ factor analysis models and use multivariate empirical best linear unbiased predictor (EBLUP) under a unit-level small area estimation approach to predict a vector of means of factor scores representing well-being for small areas. We compare this approach with the standard approach whereby we use small area estimation (univariate and multivariate) to estimate a dashboard of EBLUPs of the means of the original variables, which are then averaged. Our simulation study shows that the use of factor scores provides estimates with lower variability than weighted and simple averages of standardised multivariate EBLUPs and univariate EBLUPs. Moreover, we find that when the correlation in the observed data is taken into account before small area estimates are computed, multivariate modelling does not provide large improvements in the precision of the estimates over the univariate modelling. We close with an application using the European Union Statistics on Income and Living Conditions data.

14.
This paper estimates food Engel curves using data from the first wave of the Survey on Health, Aging and Retirement in Europe (SHARE). Our statistical model simultaneously takes into account selectivity due to unit and item nonresponse, endogeneity problems, and issues related to flexible specification of the relationship of interest. We estimate both parametric and semiparametric specifications of the model. The parametric specification assumes that the unobservables in the model follow a multivariate Gaussian distribution, while the semiparametric specification avoids distributional assumptions about the unobservables. Copyright © 2011 John Wiley & Sons, Ltd.

15.
Receiver operating characteristic curves are widely used as a measure of accuracy of diagnostic tests and can be summarised using the area under the receiver operating characteristic curve (AUC). Often, it is useful to construct a confidence interval for the AUC; however, because a number of different methods have been proposed to measure the variance of the AUC, there are many different resulting methods for constructing these intervals. In this article, we compare different methods of constructing Wald‐type confidence intervals in the presence of missing data where the missingness mechanism is ignorable. We find that constructing confidence intervals using multiple imputation based on logistic regression gives the most robust coverage probability and the choice of confidence interval method is less important. However, when the missingness rate is less severe (e.g. less than 70%), we recommend using Newcombe's Wald method for constructing confidence intervals along with multiple imputation using predictive mean matching.

16.
Multiple imputation has become viewed as a general solution to missing data problems in statistics. However, in order to lead to consistent asymptotically normal estimators, correct variance estimators and valid tests, the imputations must be proper. So far it seems that only Bayesian multiple imputation, i.e. using a Bayesian predictive distribution to generate the imputations, or approximately Bayesian multiple imputation has been shown to lead to proper imputations in some settings. In this paper, we shall see that Bayesian multiple imputation does not generally lead to proper multiple imputations. Furthermore, it will be argued that for general statistical use, Bayesian multiple imputation is inefficient even when it is proper.
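For context on the "correct variance estimators" mentioned above: multiple imputation results are usually pooled with Rubin's combining rules, whose validity is what the properness condition is meant to guarantee. The abstract itself does not spell these rules out, so the sketch below is background rather than the paper's contribution. With m completed-data estimates and their variances, the pooled variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `rubin_combine` is an assumption for this example.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool m completed-data analyses with Rubin's rules.

    estimates : point estimates, one per imputed data set
    variances : the corresponding estimated variances
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


pooled_estimate, total_variance = rubin_combine([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
```

If the imputations are not proper, the total variance computed this way can misstate the true sampling variance, which is the concern the abstract raises.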

17.
Rising nonresponse rates in social surveys make the issue of nonresponse bias contentious. There are conflicting messages about the importance of high response rates and the hazards of low rates. Some articles (e.g. Groves and Peytcheva, 2008) suggest that the response rate is in general not a good predictor of survey quality. Equally, it is well known that nonresponse may induce bias and increase data collection costs. We go back in the history of the literature on nonresponse and suggest a possible reason for the notion that even a rather small nonresponse rate makes the quality of a survey debatable. We also explore the relationship between nonresponse rate and bias, assuming non-ignorable nonresponse and focusing on estimates of totals or means. We show that there is a ‘safe area’ enclosed by the response rate on the one hand and the correlation between the response propensity and the study variable on the other hand; in this area, (1) the response rate does not greatly affect the nonresponse bias, and (2) the nonresponse bias is small.
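The interplay between response rate and propensity correlation described above can be illustrated with a standard expression from the nonresponse literature, which is not quoted in the abstract and is used here only as an assumption: under a response propensity model, the bias of the respondent mean is approximately the covariance between propensity and study variable divided by the mean propensity. For the propensity-weighted mean over a finite population this relation is an exact identity, which the sketch below checks numerically. The function name and the toy population are made up for this example.

```python
def propensity_weighted_bias(y, rho):
    """Compare two expressions for the bias of the respondent mean.

    y   : study variable for every unit in a finite population
    rho : response propensity of every unit (0 < rho <= 1)
    Returns (direct bias, covariance(rho, y) / mean(rho)).
    """
    n = len(y)
    y_bar = sum(y) / n
    rho_bar = sum(rho) / n
    # Propensity-weighted mean: what respondents report on average.
    respondent_mean = sum(r * v for r, v in zip(rho, y)) / sum(rho)
    # Population covariance between propensity and study variable.
    cov = sum((r - rho_bar) * (v - y_bar) for r, v in zip(rho, y)) / n
    return respondent_mean - y_bar, cov / rho_bar


# Toy population in which units with larger y respond more often.
direct, via_covariance = propensity_weighted_bias(
    [1.0, 2.0, 3.0, 4.0], [0.2, 0.4, 0.6, 0.8]
)
```

Both values agree, and both shrink toward zero when the propensities are nearly constant or uncorrelated with the study variable, whatever the overall response rate.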

18.
Imputation procedures such as fully efficient fractional imputation (FEFI) or multiple imputation (MI) create multiple versions of the missing observations, thereby reflecting uncertainty about their true values. Multiple imputation generates a finite set of imputations through a posterior predictive distribution. Fractional imputation assigns weights to the observed data. The focus of this article is the development of FEFI for partially classified two-way contingency tables. Point estimators and variances of FEFI estimators of population proportions are derived. Simulation results, when data are missing completely at random or missing at random, show that FEFI is comparable in performance to maximum likelihood estimation and multiple imputation and superior to simple stochastic imputation and complete case analysis. Methods are illustrated with four data sets.

19.
Wangli Xu, Lixing Zhu. Metrika, 2013, 76(1): 53-69
In this paper, we investigate checking the adequacy of varying coefficient models with response missing at random. In doing so, we first construct two completed data sets based on imputation and marginal inverse probability weighted methods, respectively. The empirical process-based tests by using these two completed data sets are suggested and the asymptotic properties of the test statistics under the null and local alternative hypotheses are studied. Because the limiting null distribution is intractable, a Monte Carlo approach is applied to approximate the distribution to determine critical values. Simulation studies are carried out to examine the performance of our method, and a real data set from an environmental study is analyzed for illustration.

20.
Past approaches to correcting for unit nonresponse in sample surveys by re-weighting the data assume that the problem is ignorable within arbitrary subgroups of the population. Theory and evidence suggest that this assumption is unlikely to hold, and that household characteristics such as income systematically affect survey compliance. We show that this leaves a bias in the re-weighted data and we propose a method of correcting for this bias. The geographic structure of nonresponse rates allows us to identify a micro compliance function, which is then used to re-weight the unit-record data. An example is given for the US Current Population Surveys, 1998–2004. We find, and correct for, a strong household income effect on response probabilities.
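The re-weighting step described above can be sketched generically: once a response probability is available for each responding unit, its survey weight is divided by that probability. The sketch below assumes the probabilities are already given; it does not reproduce the paper's estimation of the compliance function from geographic nonresponse rates, and the function name and toy numbers are hypothetical.

```python
def reweighted_mean(values, base_weights, response_probs):
    """Weighted mean with inverse response-probability adjustment.

    values         : study variable for each responding unit
    base_weights   : design weights before nonresponse adjustment
    response_probs : estimated probability that each unit responds (assumed given)
    """
    numerator = 0.0
    denominator = 0.0
    for y, w, p in zip(values, base_weights, response_probs):
        if not 0.0 < p <= 1.0:
            raise ValueError("response probabilities must lie in (0, 1]")
        adjusted = w / p  # units less likely to respond count for more
        numerator += adjusted * y
        denominator += adjusted
    return numerator / denominator


# Toy case: the high-income household responds only half the time.
corrected_mean = reweighted_mean([10.0, 100.0], [1.0, 1.0], [1.0, 0.5])
```

In this toy case the unadjusted mean is 55 while the adjusted mean is 70, showing how an income effect on response probabilities pulls the uncorrected estimate downward.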
