Found 20 similar documents; search took 15 ms
1.
This paper discusses the importance of managing data quality in academic research in its relation to satisfying the customer. The focus is on the data completeness dimension of data quality in relation to recent advancements which have been made in the development of methods for analysing incomplete multivariate data. An overview and comparison of the traditional techniques with the recent advancements are provided. Multiple imputation is also discussed as a method of analysing incomplete multivariate data, which can potentially reduce some of the biases which can occur from using some of the traditional techniques. Despite these recent advancements in the analysis of incomplete multivariate data, evidence is presented which shows that researchers are not using these techniques to manage the data quality of their current research across a variety of academic disciplines. An analysis is then provided as to why these techniques have not been adopted along with suggestions to improve the frequency of their use in the future.
Source-Reference. The ideas for this paper originated from research work on David J. Fogarty's Ph.D. dissertation. The subject area is the use of advanced techniques for the imputation of incomplete multivariate data on corporate data warehouses.
2.
Jaap P.L. Brand Stef van Buuren Karin Groothuis-Oudshoorn Edzard S. Gelsema† 《Statistica Neerlandica》2003,57(1):36-45
This paper outlines a strategy to validate multiple imputation methods. Rubin's criteria for proper multiple imputation are the point of departure. We describe a simulation method that yields insight into various aspects of bias and efficiency of the imputation process. We propose a new method for creating incomplete data under a general Missing At Random (MAR) mechanism. Software implementing the validation strategy is available as a SAS/IML module. The method is applied to investigate the behavior of polytomous regression imputation for categorical data.
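The MAR-generation step described above can be sketched in a few lines (a generic Python illustration, not the authors' SAS/IML module; the variable names are hypothetical): the probability that y is deleted depends only on an always-observed covariate x, never on y itself.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x = rng.normal(size=n)             # always-observed covariate
y = 0.5 * x + rng.normal(size=n)   # variable that will receive missing values

# MAR mechanism: the deletion probability is a function of x alone,
# so given x the missingness carries no information about y itself
p_miss = 1.0 / (1.0 + np.exp(-(x - 0.5)))   # logistic in x
mask = rng.random(n) < p_miss
y_obs = np.where(mask, np.nan, y)

print(np.isnan(y_obs).mean())   # overall missing fraction
```

Replacing x with y in the deletion probability would instead produce a missing-not-at-random mechanism, which is what makes MAR the convenient benchmark case for validation studies.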
3.
Empirical count data are often zero-inflated and overdispersed. Currently, there is no software package that allows adequate imputation of these data. We present multiple-imputation routines for these kinds of count data based on a Bayesian regression approach or alternatively based on a bootstrap approach that work as add-ons for the popular multiple imputation by chained equations (mice) software in R (van Buuren and Groothuis-Oudshoorn, Journal of Statistical Software, vol. 45, 2011, p. 1). We demonstrate in a Monte Carlo simulation that our procedures are superior to currently available count data procedures. It is emphasized that thorough modeling is essential to obtain plausible imputations and that model mis-specifications can bias parameter estimates and standard errors quite noticeably. Finally, the strengths and limitations of our procedures are discussed, and fruitful avenues for future theory and software development are outlined.
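The bootstrap variant described in this abstract can be illustrated with a rough sketch (a hypothetical pure-NumPy analogue, not the authors' mice add-on): refit a Poisson regression on a bootstrap resample of the observed cases, then draw the missing counts from the refitted model so that parameter uncertainty enters the imputations.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_poisson(X, y, iters=25):
    """Poisson regression via Newton-Raphson (no regularization)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)
        beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
    return beta

# simulate a covariate and a count outcome with 25% missing values
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(1.0 + 0.4 * x)).astype(float)
miss = rng.random(n) < 0.25
y[miss] = np.nan

# bootstrap step: refit on a resample of the observed cases so that
# parameter uncertainty propagates into the imputed values
obs_idx = np.flatnonzero(~miss)
idx = rng.choice(obs_idx, size=obs_idx.size, replace=True)
beta_boot = fit_poisson(X[idx], y[idx])

# draw imputations from the refitted Poisson model
y_imp = y.copy()
y_imp[miss] = rng.poisson(np.exp(X[miss] @ beta_boot))
```

Repeating the resample-refit-draw cycle m times yields m completed datasets, which is the usual multiple-imputation setup; the abstract's point is that zero-inflation and overdispersion require richer models than this plain Poisson sketch.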
4.
Susanne Rässler 《Statistica Neerlandica》2003,57(1):58-74
Data fusion or statistical matching techniques merge datasets from different survey samples to achieve a complete but artificial data file which contains all variables of interest. The merging of datasets is usually done on the basis of variables common to all files, but traditional methods implicitly assume conditional independence between the variables never jointly observed given the common variables. Therefore we suggest using model based approaches tackling the data fusion task by more flexible procedures. By means of suitable multiple imputation techniques, the identification problem which is inherent in statistical matching is reflected. Here a non-iterative Bayesian version of Rubin's implicit regression model is presented and compared in a simulation study with imputations from a data augmentation algorithm as well as an iterative approach using chained equations.
5.
Hot deck imputation is a method for handling missing data in which each missing value is replaced with an observed response from a "similar" unit. Despite being used extensively in practice, the theory is not as well developed as that of other imputation methods. We have found that no consensus exists as to the best way to apply the hot deck and obtain inferences from the completed data set. Here we review different forms of the hot deck and existing research on its statistical properties. We describe applications of the hot deck currently in use, including the U.S. Census Bureau's hot deck for the Current Population Survey (CPS). We also provide an extended example of variations of the hot deck applied to the third National Health and Nutrition Examination Survey (NHANES III). Some potential areas for future research are highlighted.
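A minimal random hot deck within adjustment cells might look like this (a generic sketch, not the Census Bureau's CPS procedure; the `income`/`region` data are an invented example, with `region` playing the role of the "similarity" cell):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_hot_deck(values, cells, rng):
    """Replace each missing value with a randomly chosen observed
    donor from the same adjustment cell."""
    out = values.copy()
    for c in np.unique(cells):
        in_cell = cells == c
        donors = out[in_cell & ~np.isnan(out)]
        holes = in_cell & np.isnan(out)
        if donors.size and holes.any():
            out[holes] = rng.choice(donors, size=holes.sum(), replace=True)
    return out

# toy data: income with missing entries, region as the adjustment cell
income = np.array([30.0, np.nan, 35.0, 50.0, np.nan, 55.0])
region = np.array([0, 0, 0, 1, 1, 1])
completed = random_hot_deck(income, region, rng)
```

Because every imputed value is a real observed response, hot deck imputations are always plausible values; the open questions the abstract points to concern donor selection and how to get valid variance estimates from the completed data.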
6.
The missing data problem has been widely addressed in the literature. The traditional methods for handling missing data may be not suited to spatial data, which can exhibit distinctive structures of dependence and/or heterogeneity. As a possible solution to the spatial missing data problem, this paper proposes an approach that combines the Bayesian Interpolation method [Benedetti, R. & Palma, D. (1994) Markov random field-based image subsampling method, Journal of Applied Statistics, 21(5), 495–509] with a multiple imputation procedure. The method is developed in a univariate and a multivariate framework, and its performance is evaluated through an empirical illustration based on data related to labour productivity in European regions.
7.
Hakan Demirtas 《Statistica Neerlandica》2004,58(4):466-482
In this article, we demonstrate by simulations that rich imputation models for incomplete longitudinal datasets produce more calibrated estimates in terms of reduced bias and higher coverage rates without unduly deflating the efficiency. We argue that the use of supplementary variables that are thought to be potential causes or correlates of missingness or outcomes in the imputation process may lead to better inferential results in comparison to simpler imputation models. The liberal use of these variables is recommended as opposed to the conservative strategy.
8.
Since the work of Little and Rubin (1987), no substantial advances in the analysis of explanatory regression models for incomplete data with missing not at random have been achieved, mainly due to the difficulty of verifying the randomness of the unknown data. In practice, the analysis of nonrandom missing data is done with techniques designed for datasets with random or completely random missing data, such as complete case analysis, mean imputation, regression imputation, maximum likelihood or multiple imputation. However, the data conditions required to minimize the bias derived from an incorrect analysis have not been fully determined. In the present work, several Monte Carlo simulations have been carried out to establish the best strategy of analysis for random missing data applicable in datasets with nonrandom missing data. The factors involved in the simulations are sample size, percentage of missing data, predictive power of the imputation model and existence of interaction between predictors. The results show that the smallest bias is obtained with maximum likelihood and multiple imputation techniques, although with low percentages of missing data, absence of interaction and high predictive power of the imputation model (frequent data structures in research on child and adolescent psychopathology) acceptable results are obtained with the simplest regression imputation.
9.
Among the wide variety of procedures to handle missing data, imputing the missing values is a popular strategy to deal with missing item responses. In this paper some simple and easily implemented imputation techniques, like item and person mean substitution and some hot-deck procedures, are investigated. A simulation study was performed based on responses to items forming a scale to measure a latent trait of the respondents. The effects of different imputation procedures on the estimation of the latent ability of the respondents were investigated, as well as the effect on the estimation of Cronbach's alpha (indicating the reliability of the test) and Loevinger's H-coefficient (indicating scalability). The results indicate that procedures which use the relationships between items perform best, although they tend to overestimate the scale quality.
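Item and person mean substitution, the two simplest techniques mentioned above, can be sketched directly (hypothetical toy data; rows are respondents, columns are items):

```python
import numpy as np

def item_mean_substitution(resp):
    """Fill each missing entry with the mean of its item (column)."""
    out = resp.copy()
    item_means = np.nanmean(out, axis=0)
    holes = np.isnan(out)
    out[holes] = np.take(item_means, np.nonzero(holes)[1])
    return out

def person_mean_substitution(resp):
    """Fill each missing entry with the mean of that person's (row's)
    observed responses."""
    out = resp.copy()
    person_means = np.nanmean(out, axis=1)
    holes = np.isnan(out)
    out[holes] = np.take(person_means, np.nonzero(holes)[0])
    return out

# 3 respondents x 3 items, Likert-style scores with two gaps
resp = np.array([[4.0, 5.0, np.nan],
                 [2.0, np.nan, 3.0],
                 [4.0, 4.0, 5.0]])
completed_item = item_mean_substitution(resp)
completed_person = person_mean_substitution(resp)
```

Neither method uses the correlations between items, which is why the study finds that procedures exploiting those relationships (e.g. hot-deck matching on response patterns) perform better.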
10.
Martin Kroh 《Quality and Quantity》2006,40(2):225-244
Incomplete data is a common problem of survey research. Recent work on multiple imputation techniques has increased analysts’ awareness of the biasing effects of missing data and has also provided a convenient solution. Imputation methods replace non-response with estimates of the unobserved scores. In many instances, however, non-response to a stimulus does not result from measurement problems that inhibit accurate surveying of empirical reality, but from the inapplicability of the survey question. In such cases, existing imputation techniques replace valid non-response with counterfactual estimates of a situation in which the stimulus is applicable to all respondents. This paper suggests an alternative imputation procedure for incomplete data for which no true score exists: multiple complete random imputation, which overcomes the biasing effects of missing data and allows analysts to model respondents’ valid ‘I don’t know’ answers.
11.
A Random Effects Transition Model For Longitudinal Binary Data With Informative Missingness
Understanding the transitions between disease states is often the goal in studying chronic disease. These studies, however, are typically subject to a large amount of missingness either due to patient dropout or intermittent missed visits. The missing data is often informative since missingness and dropout are usually related to either an individual's underlying disease process or the actual value of the missed observation. Our motivating example is a study of opiate addiction that examined the effect of a new treatment on thrice-weekly binary urine tests to assess opiate use over follow-up. The interest in this opiate addiction clinical trial was to characterize the transition pattern of opiate use (in each treatment arm) as well as to compare both the marginal probability of a positive urine test over follow-up and the time until the first positive urine test between the treatment arms. We develop a shared random effects model that links together the propensity of transition between states and the probability of either an intermittent missed observation or dropout. This approach allows for heterogeneous transition and missing data patterns between individuals as well as incorporating informative intermittent missing data and dropout. We compare this new approach with other approaches proposed for the analysis of longitudinal binary data with informative missingness.
12.
It is shown that the classical taxonomy of missing data models, namely missing completely at random, missing at random and informative missingness, which has been developed almost exclusively within a selection modelling framework, can also be applied to pattern-mixture models. In particular, intuitively appealing identifying restrictions are proposed for a pattern-mixture MAR mechanism.
13.
Repeated measurements often are analyzed by multivariate analysis of variance (MANOVA). An alternative approach is provided by multilevel analysis, also called the hierarchical linear model (HLM), which makes use of random coefficient models. This paper is a tutorial which indicates that the HLM can be specified in many different ways, corresponding to different sets of assumptions about the covariance matrix of the repeated measurements. The possible assumptions range from the very restrictive compound symmetry model to the unrestricted multivariate model. Thus, the HLM can be used to steer a useful middle road between the two traditional methods for analyzing repeated measurements. Another important advantage of the multilevel approach to analyzing repeated measures is the fact that it can be easily used also if the data are incomplete. Thus it provides a way to achieve a fully multivariate analysis of repeated measures with incomplete data.
This revised version was published online in June 2006 with corrections to the Cover Date.
14.
Joseph L. Schafer 《Statistica Neerlandica》2003,57(1):19-35
Bayesian multiple imputation (MI) has become a highly useful paradigm for handling missing values in many settings. In this paper, I compare Bayesian MI with other methods, maximum likelihood in particular, and point out some of its unique features. One key aspect of MI, the separation of the imputation phase from the analysis phase, can be advantageous in settings where the models underlying the two phases do not agree.
15.
An important application of multiple regression is predictor selection. When there are no missing values in the data, information criteria can be used to select predictors. For example, one could apply the small-sample-size corrected version of the Akaike information criterion (AIC), the AICC. In this article, we discuss how information criteria should be calculated when the dependent variable and/or the predictors contain missing values. We then extensively discuss and evaluate three models that can be employed to deal with the missing data, that is, to predict the missing values. The most complex model, that is, the model with all available predictors, outperforms the other models. These results also apply to more general hypotheses than predictor selection and also to structural equation modeling (SEM) models.
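The small-sample correction behind the AICC has a closed form; a minimal sketch of the generic formula follows (this is the standard definition, not the article's missing-data extension):

```python
def aicc(log_lik, k, n):
    """AICC: the AIC plus the small-sample correction 2k(k+1)/(n-k-1),
    where k is the number of estimated parameters and n the sample size."""
    aic = -2.0 * log_lik + 2.0 * k
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

# with few observations per parameter the correction is substantial
print(aicc(log_lik=-100.0, k=3, n=30))   # noticeably above plain AIC = 206
```

As n grows the correction term vanishes and the AICC converges to the AIC, which is why the correction matters mainly in small samples; the article's question is how log_lik, k and n should be defined when parts of the data are imputed.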
16.
Receiver operating characteristic curves are widely used as a measure of accuracy of diagnostic tests and can be summarised using the area under the receiver operating characteristic curve (AUC). Often, it is useful to construct a confidence interval for the AUC; however, because there are a number of different proposed methods to measure variance of the AUC, there are thus many different resulting methods for constructing these intervals. In this article, we compare different methods of constructing Wald-type confidence interval in the presence of missing data where the missingness mechanism is ignorable. We find that constructing confidence intervals using multiple imputation based on logistic regression gives the most robust coverage probability and the choice of confidence interval method is less important. However, when missingness rate is less severe (e.g. less than 70%), we recommend using Newcombe's Wald method for constructing confidence intervals along with multiple imputation using predictive mean matching.
17.
In missing data problems, it is often the case that there is a natural test statistic for testing a statistical hypothesis had all the data been observed. A fuzzy p-value approach to hypothesis testing has recently been proposed which is implemented by imputing the missing values in the "complete data" test statistic by values simulated from the conditional null distribution given the observed data. We argue that imputing data in this way will inevitably lead to loss in power. For the case of a scalar parameter, we show that the asymptotic efficiency of the score test based on the imputed "complete data" relative to the score test based on the observed data is given by the ratio of the observed data information to the complete data information. Three examples involving probit regression, normal random effects model, and unidentified paired data are used for illustration. For testing linkage disequilibrium based on pooled genotype data, simulation results show that the imputed Neyman-Pearson and Fisher exact tests are less powerful than a Wald-type test based on the observed data maximum likelihood estimator. In conclusion, we caution against the routine use of the fuzzy p-value approach in latent variable or missing data problems and suggest some viable alternatives.
18.
Longitudinal data sets with the structure T (time points) × N (subjects) are often incomplete because of data missing for certain subjects at certain time points. The EM algorithm is applied in conjunction with the Kalman smoother for computing maximum likelihood estimates of longitudinal LISREL models from varying missing data patterns. The iterative procedure uses the LISREL program in the M-step and the Kalman smoother in the E-step. The application of the method is illustrated by simulating missing data on a data set from educational research.
19.