Similar Articles
1.
This paper discusses the importance of managing data quality in academic research in its relation to satisfying the customer. The focus is on the data completeness objective dimension of data quality, in relation to recent advancements in methods for analysing incomplete multivariate data. An overview and comparison of the traditional techniques with the recent advancements are provided. Multiple imputation is also discussed as a method of analysing incomplete multivariate data, one which can potentially reduce some of the biases that arise from using the traditional techniques. Despite these recent advancements, evidence is presented which shows that researchers across a variety of academic disciplines are not using these techniques to manage the data quality of their current research. An analysis is then provided of why these techniques have not been adopted, along with suggestions to improve the frequency of their use in the future. Source-Reference: The ideas for this paper originated from research work on David J. Fogarty's Ph.D. dissertation. The subject area is the use of advanced techniques for the imputation of incomplete multivariate data on corporate data warehouses.

2.
Incomplete data is a common problem of survey research. Recent work on multiple imputation techniques has increased analysts’ awareness of the biasing effects of missing data and has also provided a convenient solution. Imputation methods replace non-response with estimates of the unobserved scores. In many instances, however, non-response to a stimulus does not result from measurement problems that inhibit accurate surveying of empirical reality, but from the inapplicability of the survey question. In such cases, existing imputation techniques replace valid non-response with counterfactual estimates of a situation in which the stimulus is applicable to all respondents. This paper suggests an alternative imputation procedure for incomplete data for which no true score exists: multiple complete random imputation, which overcomes the biasing effects of missing data and allows analysts to model respondents’ valid ‘I don’t know’ answers.
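As a concrete illustration of the idea, the Python sketch below produces m completed datasets in which each missing item response is replaced by a draw made uniformly at random from the observed response values. The uniform-draw rule and all names are illustrative assumptions, not code from the paper.

```python
# A minimal sketch of multiple *complete random* imputation, assuming a
# uniform draw over observed response values stands in for a valid
# "don't know" answer. Hypothetical data and names throughout.
import numpy as np
import pandas as pd

def complete_random_imputations(series: pd.Series, m: int = 5, seed: int = 0):
    """Return m completed copies of `series`, filling each NaN with a value
    drawn uniformly at random from the set of observed responses."""
    rng = np.random.default_rng(seed)
    observed = series.dropna().unique()
    completed = []
    for _ in range(m):
        filled = series.copy()
        mask = filled.isna()
        filled[mask] = rng.choice(observed, size=mask.sum())
        completed.append(filled)
    return completed

responses = pd.Series([1, 2, np.nan, 4, np.nan, 3])  # toy 1-4 scale with DKs
datasets = complete_random_imputations(responses, m=3)
```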

3.
Since the work of Little and Rubin (1987), no substantial advances have been achieved in the analysis of explanatory regression models for incomplete data with missing not at random, mainly due to the difficulty of verifying the randomness of the unknown data. In practice, the analysis of nonrandom missing data is done with techniques designed for datasets with random or completely random missing data, such as complete case analysis, mean imputation, regression imputation, maximum likelihood or multiple imputation. However, the data conditions required to minimize the bias derived from an incorrect analysis have not been fully determined. In the present work, several Monte Carlo simulations have been carried out to establish the best strategy of analysis for random missing data applicable in datasets with nonrandom missing data. The factors involved in the simulations are sample size, percentage of missing data, predictive power of the imputation model and existence of interaction between predictors. The results show that the smallest bias is obtained with maximum likelihood and multiple imputation techniques, although with low percentages of missing data, absence of interaction and high predictive power of the imputation model (frequent data structures in research on child and adolescent psychopathology), acceptable results are obtained with the simplest regression imputation.
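For intuition about the simplest of the compared strategies, here is a toy Python sketch of deterministic regression imputation: fit a model on the complete cases and fill each missing outcome with its prediction. The data, coefficients and missingness rate are invented for illustration.

```python
# A toy sketch of simple regression imputation on hypothetical data:
# fit on complete cases, then fill each missing y with its fitted value.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)
y[rng.random(200) < 0.2] = np.nan              # roughly 20% missing outcomes

obs = ~np.isnan(y)
model = LinearRegression().fit(x[obs].reshape(-1, 1), y[obs])
y_imputed = y.copy()
y_imputed[~obs] = model.predict(x[~obs].reshape(-1, 1))  # deterministic fill-in
```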

4.
In this article, we demonstrate by simulations that rich imputation models for incomplete longitudinal datasets produce more calibrated estimates, in terms of reduced bias and higher coverage rates, without unduly deflating efficiency. We argue that the use of supplementary variables that are thought to be potential causes or correlates of missingness or outcomes in the imputation process may lead to better inferential results in comparison to simpler imputation models. The liberal use of these variables is recommended as opposed to the conservative strategy.

5.
Hot deck imputation is a method for handling missing data in which each missing value is replaced with an observed response from a "similar" unit. Despite being used extensively in practice, the theory is not as well developed as that of other imputation methods. We have found that no consensus exists as to the best way to apply the hot deck and obtain inferences from the completed data set. Here we review different forms of the hot deck and existing research on its statistical properties. We describe applications of the hot deck currently in use, including the U.S. Census Bureau's hot deck for the Current Population Survey (CPS). We also provide an extended example of variations of the hot deck applied to the third National Health and Nutrition Examination Survey (NHANES III). Some potential areas for future research are highlighted.
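The sketch below shows one common form, a random hot deck within adjustment cells: each recipient with a missing value borrows the observed response of a randomly chosen donor from the same cell. This is a generic illustration, not the CPS hot deck; the grouping variable and data are hypothetical.

```python
# A minimal random hot deck within adjustment cells, assuming donors and
# recipients are matched on one categorical grouping variable.
import numpy as np
import pandas as pd

def hot_deck(df: pd.DataFrame, target: str, cell: str, seed: int = 0) -> pd.Series:
    rng = np.random.default_rng(seed)
    out = df[target].copy()
    for _, idx in df.groupby(cell).groups.items():
        grp = out.loc[idx]
        donors = grp.dropna().to_numpy()       # observed responses in the cell
        if len(donors) and grp.isna().any():
            # each missing value gets a donor drawn at random from the cell
            out.loc[grp.index[grp.isna()]] = rng.choice(donors, size=grp.isna().sum())
    return out

df = pd.DataFrame({"region": ["N", "N", "S", "S", "S"],
                   "income": [50.0, np.nan, 30.0, 35.0, np.nan]})
df["income_imputed"] = hot_deck(df, "income", "region")
```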

6.
In many surveys, imputation procedures are used to account for non-response bias induced by either unit non-response or item non-response. Such procedures are optimised (in terms of reducing non-response bias) when the models include covariates that are highly predictive of both response and outcome variables. To achieve this, we propose a method for selecting sets of covariates used in regression imputation models or to determine imputation cells for one or more outcome variables, using the fraction of missing information (FMI) as obtained via a proxy pattern-mixture (PPM) model as the key metric. In our variable selection approach, we use the PPM model to obtain a maximum likelihood estimate of the FMI for separate sets of candidate imputation models and look for the point at which changes in the FMI level off and further auxiliary variables do not improve the imputation model. We illustrate our proposed approach using empirical data from the Ohio Medicaid Assessment Survey and from the Service Annual Survey.
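For readers unfamiliar with the FMI, the sketch below computes a standard multiple-imputation approximation of it from m point estimates and their within-imputation variances via Rubin's combining rules. This generic MI-based quantity is shown only for intuition; the paper instead estimates the FMI by maximum likelihood under the PPM model, which is not reproduced here.

```python
# A small sketch of the fraction of missing information (FMI), using the
# simple large-m approximation FMI ~ (1 + 1/m) * B / T from Rubin's rules.
# Inputs are hypothetical estimates from m = 5 imputed datasets.
import numpy as np

def fmi(estimates: np.ndarray, variances: np.ndarray) -> float:
    m = len(estimates)
    w = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)       # between-imputation variance
    t = w + (1 + 1 / m) * b         # total variance
    return (1 + 1 / m) * b / t

print(fmi(np.array([2.1, 2.4, 1.9, 2.2, 2.3]),
          np.array([0.30, 0.28, 0.33, 0.31, 0.29])))
```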

7.
In missing data problems, it is often the case that there is a natural test statistic for testing a statistical hypothesis had all the data been observed. A fuzzy p-value approach to hypothesis testing has recently been proposed, which is implemented by imputing the missing values in the "complete data" test statistic by values simulated from the conditional null distribution given the observed data. We argue that imputing data in this way will inevitably lead to loss in power. For the case of a scalar parameter, we show that the asymptotic efficiency of the score test based on the imputed "complete data" relative to the score test based on the observed data is given by the ratio of the observed data information to the complete data information. Three examples involving probit regression, a normal random effects model, and unidentified paired data are used for illustration. For testing linkage disequilibrium based on pooled genotype data, simulation results show that the imputed Neyman–Pearson and Fisher exact tests are less powerful than a Wald-type test based on the observed data maximum likelihood estimator. In conclusion, we caution against the routine use of the fuzzy p-value approach in latent variable or missing data problems and suggest some viable alternatives.
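To make the construction concrete, the toy sketch below applies the fuzzy p-value recipe in an invented one-sample setting: missing observations are repeatedly imputed from their null distribution, the "complete data" z-statistic is recomputed each time, and the result is a distribution of p-values rather than a single number. The test and data are our illustrative assumptions, not an example from the paper.

```python
# A toy fuzzy p-value: impute missing values under H0, recompute the
# complete-data test statistic, and collect the resulting p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(loc=0.3, size=40)
y[rng.random(40) < 0.25] = np.nan            # some observations missing

obs = y[~np.isnan(y)]
n_mis = np.isnan(y).sum()
p_values = []
for _ in range(1000):
    # under H0 (mean 0, sd 1) the missing values are independent N(0, 1),
    # so this draw IS the conditional null distribution in this toy setting
    y_imp = np.concatenate([obs, rng.normal(loc=0.0, size=n_mis)])
    z = y_imp.mean() / (y_imp.std(ddof=1) / np.sqrt(len(y_imp)))
    p_values.append(2 * stats.norm.sf(abs(z)))
fuzzy_p = np.array(p_values)                  # the fuzzy p-value distribution
```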

8.
Multiple imputation methods properly account for the uncertainty of missing data. One of those methods for creating multiple imputations is predictive mean matching (PMM), a general-purpose method. Little is known about the performance of PMM in imputing non-normal semicontinuous data (skewed data with a point mass at a certain value and otherwise continuously distributed). We investigate the performance of PMM as well as dedicated methods for imputing semicontinuous data by performing simulation studies under univariate and multivariate missingness mechanisms. We also investigate the performance on real-life datasets. We conclude that PMM performance is at least as good as the investigated dedicated methods for imputing semicontinuous data and, in contrast to the other methods, is the only method that yields plausible imputations and preserves the original data distributions.
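A bare-bones Python sketch of PMM follows: regress the target on covariates using the complete cases, then for each missing case draw a donor from the k observed cases whose predicted means are closest. The linear model, k = 5 and the simulated data are illustrative assumptions.

```python
# A minimal predictive mean matching (PMM) sketch on hypothetical data.
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute(x: np.ndarray, y: np.ndarray, k: int = 5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    fit = LinearRegression().fit(x[obs], y[obs])
    pred_obs = fit.predict(x[obs])          # predicted means for donors
    pred_mis = fit.predict(x[~obs])         # predicted means for recipients
    donors = y[obs]
    filled = []
    for p in pred_mis:
        nearest = np.argsort(np.abs(pred_obs - p))[:k]   # k closest donors
        filled.append(donors[rng.choice(nearest)])       # draw one at random
    y_out = y.copy()
    y_out[~obs] = filled
    return y_out

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=100)
y[rng.random(100) < 0.3] = np.nan
y_completed = pmm_impute(X, y)
```

Because imputed values are real observed responses, PMM cannot produce impossible values, which is one intuition for why it preserves the original data distribution.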

9.
Empirical count data are often zero-inflated and overdispersed. Currently, there is no software package that allows adequate imputation of these data. We present multiple-imputation routines for these kinds of count data, based on a Bayesian regression approach or alternatively on a bootstrap approach, that work as add-ons for the popular multiple imputation by chained equations (mice) software in R (van Buuren and Groothuis-Oudshoorn, Journal of Statistical Software, vol. 45, 2011, p. 1). We demonstrate in a Monte Carlo simulation that our procedures are superior to currently available count data procedures. It is emphasized that thorough modeling is essential to obtain plausible imputations and that model mis-specifications can bias parameter estimates and standard errors quite noticeably. Finally, the strengths and limitations of our procedures are discussed, and fruitful avenues for future theory and software development are outlined.
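The sketch below conveys the modeling point with a deliberately crude zero-inflated Poisson imputation: estimate a structural-zero probability and a Poisson rate from the observed values, then draw the missing counts from that mixture. It is a simplified stand-in for the Bayesian and bootstrap mice add-ons described in the abstract, not their implementation.

```python
# A rough zero-inflated Poisson (ZIP) imputation sketch. The moment-style
# estimates below are crude (e.g. the rate ignores zero truncation); a real
# routine would maximize the ZIP likelihood instead.
import numpy as np

def impute_zip(y: np.ndarray, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    obs = y[~np.isnan(y)]
    lam = obs[obs > 0].mean()                    # crude Poisson rate
    # P(zero) = pi + (1 - pi) * exp(-lam); solve for the inflation pi
    p_zero = (obs == 0).mean()
    pi = max(0.0, (p_zero - np.exp(-lam)) / (1 - np.exp(-lam)))
    n_mis = np.isnan(y).sum()
    structural_zero = rng.random(n_mis) < pi     # extra zeros from inflation
    draws = np.where(structural_zero, 0, rng.poisson(lam, size=n_mis))
    out = y.copy()
    out[np.isnan(y)] = draws
    return out

y = np.array([0, 0, 1, 3, 0, 2, np.nan, np.nan, 0, 5], dtype=float)
y_completed = impute_zip(y)
```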

10.
A basic concern in statistical disclosure limitation is the re-identification of individuals in anonymised microdata. Linking against a second dataset that contains identifying information can result in a breach of confidentiality. Almost all linkage approaches are based on comparing the values of variables that are common to both datasets. It is tempting to think that if datasets contain no common variables, then there can be no risk of re-identification. However, linkage has been attempted between such datasets via the extraction of structural information using ordered weighted averaging (OWA) operators. Although this approach has been shown to perform better than randomly pairing records, it is debatable whether it demonstrates a practically significant disclosure risk. This paper reviews some of the main aspects of statistical disclosure limitation. It then goes on to show that a relatively simple, supervised Bayesian approach can consistently outperform OWA linkage. Furthermore, the Bayesian approach demonstrates a significant risk of re-identification for the types of data considered in the OWA record linkage literature.

11.
This paper outlines a strategy to validate multiple imputation methods. Rubin's criteria for proper multiple imputation are the point of departure. We describe a simulation method that yields insight into various aspects of bias and efficiency of the imputation process. We propose a new method for creating incomplete data under a general Missing At Random (MAR) mechanism. Software implementing the validation strategy is available as a SAS/IML module. The method is applied to investigate the behavior of polytomous regression imputation for categorical data.
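As an example of the kind of mechanism involved, the snippet below generates incomplete data under MAR: the probability that y is missing depends only on the always-observed x, here through an invented logistic rule. The paper's own method for creating MAR data under a general mechanism is more elaborate and is not reproduced.

```python
# A small sketch of generating MAR missingness: deletion probability for y
# depends only on the fully observed covariate x (illustrative coefficients).
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)

p_miss = 1 / (1 + np.exp(-(x - 0.5)))   # missingness driven by observed x only
y_incomplete = y.copy()
y_incomplete[rng.random(n) < p_miss] = np.nan
```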

12.
Assuming that two-step monotone missing data are drawn from a multivariate normal population, this paper derives the Bartlett-type correction to the likelihood ratio test for missing completely at random (MCAR), which plays an important role in the statistical analysis of incomplete datasets. The advantages of our approach are confirmed in Monte Carlo simulations. Our correction drastically improved the accuracy of the type I error in Little's (1988, Journal of the American Statistical Association, 83, 1198–1202) test for MCAR and performed well even for moderate sample sizes.

13.
Huisman, Mark. Quality and Quantity, 2000, 34(4): 331–351.
Among the wide variety of procedures to handle missing data, imputing the missing values is a popular strategy to deal with missing item responses. In this paper some simple and easily implemented imputation techniques, like item and person mean substitution and some hot-deck procedures, are investigated. A simulation study was performed based on responses to items forming a scale to measure a latent trait of the respondents. The effects of different imputation procedures on the estimation of the latent ability of the respondents were investigated, as well as the effect on the estimation of Cronbach's alpha (indicating the reliability of the test) and Loevinger's H-coefficient (indicating scalability). The results indicate that procedures which use the relationships between items perform best, although they tend to overestimate the scale quality.
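The two simplest techniques mentioned, item mean and person mean substitution, can be sketched in a few lines for a respondent-by-item score matrix; the data here are hypothetical.

```python
# Item mean (column mean) vs. person mean (row mean) substitution
# on a toy respondent-by-item score matrix.
import numpy as np
import pandas as pd

scores = pd.DataFrame({"item1": [1, 2, np.nan, 4],
                       "item2": [2, np.nan, 3, 4],
                       "item3": [1, 2, 3, np.nan]})

item_mean = scores.fillna(scores.mean())                       # fill by column
person_mean = scores.apply(lambda row: row.fillna(row.mean()), axis=1)  # by row
```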

14.
Sensitivity Analysis of Continuous Incomplete Longitudinal Outcomes
Even though models for incomplete longitudinal data are in common use, they are surrounded with problems, largely due to the untestable nature of the assumptions one has to make regarding the missingness mechanism. Two extreme views on how to deal with this problem are (1) to avoid incomplete data altogether and (2) to construct ever more complicated joint models for the measurement and missingness processes. In this paper, it is argued that a more versatile approach is to embed the treatment of incomplete data within a sensitivity analysis. Several such sensitivity analysis routes are presented and applied to a case study, the milk protein trial analyzed before by Diggle and Kenward (1994). Apart from the use of local influence methods, some emphasis is put on pattern-mixture modeling. In the latter case, it is shown how multiple-imputation ideas can be used to define a practically feasible modeling strategy.

15.
One of the most difficult problems confronting investigators who analyze data from surveys is how to treat missing data. Many statistical procedures cannot be used immediately if any values are missing. This paper considers the problem of estimating the population mean using auxiliary information when some observations in the sample are missing and the population mean of the auxiliary variable is not available. We use tools of classical statistical estimation theory to find a suitable estimator. We study the model and design properties of the proposed estimator. We also report the results of a broad-based simulation study of the efficiency of the estimator, which reveals very promising results.
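As background, the snippet below illustrates a generic, textbook ratio-type estimator for this setting: the respondent means of y and x give the ratio, and the full-sample mean of x (available for all sampled units) replaces the unknown population mean. This is shown for orientation only and is not the estimator proposed in the paper.

```python
# A textbook ratio estimator of the population mean of y when y has
# nonresponse and only the sample mean of the auxiliary x is available.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=500)
y = 3.0 * x + rng.normal(scale=2.0, size=500)
y_obs = y.copy()
y_obs[rng.random(500) < 0.3] = np.nan        # 30% nonresponse on y

resp = ~np.isnan(y_obs)
ratio = y_obs[resp].mean() / x[resp].mean()  # ratio from respondents only
y_bar_hat = ratio * x.mean()                 # full-sample mean of x stands in
                                             # for the unknown population mean
```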

16.
The increasing penetration of intermittent renewable energy in power systems brings operational challenges. One way of addressing these challenges is to enhance the predictability of renewables through accurate forecasting. Convolutional Neural Networks (Convnets) provide a successful technique for processing space-structured multi-dimensional data. In our work, we propose the U-Convolutional model to predict hourly wind speeds for a single location, using spatio-temporal data with multiple explanatory variables as input. The U-Convolutional model is composed of a U-Net part, which synthesizes the input information, and a Convnet part, which maps the synthesized data into a single-site wind prediction. We compare our approach with advanced Convnets, a fully connected neural network, and univariate models. We use time series from the Climate Forecast System Reanalysis as datasets and select temperature and the u- and v-components of wind as explanatory variables. The proposed models are evaluated at multiple locations (totaling 181 target series) and multiple forecasting horizons. The results indicate that our proposal is promising for spatio-temporal wind speed prediction, showing competitive performance across time horizons for all datasets.
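A schematic PyTorch sketch of the two-part idea follows: a small U-Net-style encoder-decoder synthesizes the gridded input, and a convolutional head maps the result to a single-site wind speed. Depths, channel counts and grid sizes are invented for illustration and do not reproduce the authors' architecture.

```python
# A tiny U-Net-plus-Convnet-head sketch for single-site wind prediction
# from gridded inputs (e.g. temperature, u-wind, v-wind channels).
import torch
import torch.nn as nn

class TinyUConv(nn.Module):
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                                 nn.MaxPool2d(2))                 # downsample
        self.dec = nn.Sequential(nn.Upsample(scale_factor=2),     # upsample back
                                 nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(nn.Conv2d(16 + in_ch, 8, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(8, 1))                # one wind speed

    def forward(self, grid):                   # grid: (batch, channels, H, W)
        z = self.dec(self.enc(grid))
        z = torch.cat([z, grid], dim=1)        # skip connection, U-Net style
        return self.head(z)

model = TinyUConv()
pred = model(torch.randn(4, 3, 32, 32))        # 4 samples on a 32x32 grid
```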

17.
The missing data problem has been widely addressed in the literature. The traditional methods for handling missing data may not be suited to spatial data, which can exhibit distinctive structures of dependence and/or heterogeneity. As a possible solution to the spatial missing data problem, this paper proposes an approach that combines the Bayesian Interpolation method [Benedetti, R. & Palma, D. (1994) Markov random field-based image subsampling method, Journal of Applied Statistics, 21(5), 495–509] with a multiple imputation procedure. The method is developed in a univariate and a multivariate framework, and its performance is evaluated through an empirical illustration based on data related to labour productivity in European regions.

18.
A typical Business Register (BR) is mainly based on administrative data files provided by organisations that produce them as a by-product of their function. Such files do not necessarily yield a perfect Business Register. A good BR should have the following characteristics: (1) It should reflect the complex structures of businesses with multiple activities, in multiple locations or with multiple legal entities; (2) It should be free of duplication, extraneous or missing units; (3) It should be properly classified in terms of key stratification variables, including size, geography and industry; (4) It should be easily updateable to represent the "newer" business picture, and not lag too much behind it. In reality, not all these desirable features are fully satisfied, resulting in a universe that has missing units, inaccurate structures, as well as improper contact information, to name a few defects.
These defects can be compensated for by using sampling and estimation procedures. For example, coverage can be improved using multiple frame techniques, and the sample size can be increased to account for misclassification of units and deaths on the register. At the time of estimation, auxiliary information can be used in a variety of ways. It can be used to impute missing variables, to treat outliers, or to create synthetic variables obtained via modelling. Furthermore, time lags between the birth of units and the time that they are included on the register can be accounted for by appropriately inflating the design-based estimates.

19.
This paper proposes a general framework for the analysis of survey data with missing observations. The approach presented here treats missing data as an unavoidable feature of any survey of the human population and aims at incorporating the unobserved part of the data into the analysis rather than trying to avoid it or make up for it. To handle coverage error and unit non-response, the true distribution is modeled as a mixture of an observable and an unobservable component. Generally, for the unobserved component, neither its relative size (the no-observation rate) nor its distribution is known. It is assumed that the goal of the analysis is to assess the fit of a statistical model, and for this purpose the mixture index of fit is used. The mixture index of fit does not postulate that the statistical model of interest is able to account for the entire population; rather, the model may only describe a fraction of it. This leads to another mixture representation of the true distribution, with one component from the statistical model of interest and another, unrestricted, one. Inference with respect to the fit of the model, with missing data taken into account, is obtained by equating these two mixtures and asking, for different no-observation rates, for the largest fraction of the population on which the statistical model may hold. A statistical model is deemed relevant for the population if it can account for a large enough fraction of the population, assuming the true no-observation rate (if known), or a sufficiently small or realistic one.
