Similar Articles
20 similar articles found.
1.
Linkage errors can occur when probability‐based methods are used to link records from two distinct data sets corresponding to the same target population. Current approaches to modifying standard methods of regression analysis to allow for these errors only deal with the case of two linked data sets and assume that the linkage process is complete, that is, all records on the two data sets are linked. This study extends these ideas to accommodate the situation when more than two data sets are probabilistically linked and the linkage is incomplete.

2.
Record linkage is the act of bringing together records from two files that are believed to belong to the same unit (e.g., a person or business). It is a low‐cost way of increasing the set of variables available for analysis. Errors may arise in the linking process if an error‐free unit identifier is not available. Two types of linking error are an incorrect link (records belonging to two different units are linked) and a missed record (an unlinked record for which a correct link exists). Naively ignoring linkage errors may mean that analysis of the linked file is biased. This paper outlines a "weighting approach" to making correct inference about regression coefficients and population totals in the presence of such linkage errors. This approach is designed for analysts who do not have the expertise or time to use the specialist software required by other approaches but who are comfortable using weights in inference. The performance of the estimator is demonstrated in a simulation study.

3.
Probabilistic record linkage is the act of bringing together records that are believed to belong to the same unit (e.g., person or business) from two or more files. It is a common way to enhance dimensions of the data such as time coverage and the breadth or depth of detail. Probabilistic record linkage is not an error-free process and may link records that do not belong to the same unit. Naively treating such a linked file as if it were linked without errors can lead to biased inferences. This paper develops a method of making inference with estimating equations when records are linked using algorithms that are widely used in practice. Previous methods for dealing with this problem cannot accommodate such linking algorithms. This paper develops a parametric bootstrap approach to inference in which each bootstrap replicate involves applying the said linking algorithm. The effectiveness of the method is demonstrated in simulations and in real applications.
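
The bootstrap loop described in this abstract can be sketched schematically as follows. This is only an illustration of the general idea of re-applying the linking algorithm inside each parametric bootstrap replicate; `simulate_files`, `link_records` and `solve_estimating_equations` are hypothetical placeholders for the model simulator, the production linking algorithm and the estimating-equation solver, not functions from any existing package.

```python
# Schematic parametric bootstrap with re-linking in every replicate (sketch only).
import numpy as np

def parametric_bootstrap_se(theta_hat, n_boot, simulate_files, link_records,
                            solve_estimating_equations, rng=None):
    rng = rng or np.random.default_rng()
    reps = []
    for _ in range(n_boot):
        file_a, file_b = simulate_files(theta_hat, rng)   # draw files from the fitted model
        linked = link_records(file_a, file_b)             # same linking algorithm used in practice
        reps.append(solve_estimating_equations(linked))   # re-estimate on the linked replicate
    reps = np.asarray(reps)
    return reps.std(axis=0, ddof=1)                       # bootstrap standard errors
```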

4.
A basic concern in statistical disclosure limitation is the re-identification of individuals in anonymised microdata. Linking against a second dataset that contains identifying information can result in a breach of confidentiality. Almost all linkage approaches are based on comparing the values of variables that are common to both datasets. It is tempting to think that if datasets contain no common variables, then there can be no risk of re-identification. However, linkage has been attempted between such datasets via the extraction of structural information using ordered weighted averaging (OWA) operators. Although this approach has been shown to perform better than randomly pairing records, it is debatable whether it demonstrates a practically significant disclosure risk. This paper reviews some of the main aspects of statistical disclosure limitation. It then goes on to show that a relatively simple, supervised Bayesian approach can consistently outperform OWA linkage. Furthermore, the Bayesian approach demonstrates a significant risk of re-identification for the types of data considered in the OWA record linkage literature.
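
For readers unfamiliar with OWA operators, a minimal sketch of the aggregation step is given below: the inputs are sorted in decreasing order and combined with a fixed weight vector, so the weights attach to positions rather than to particular variables. The weight vector and record are hypothetical examples, not taken from the paper.

```python
# Minimal sketch of an ordered weighted averaging (OWA) operator.
import numpy as np

def owa(values, weights):
    values = np.sort(np.asarray(values, dtype=float))[::-1]   # descending order
    weights = np.asarray(weights, dtype=float)
    return float(values @ (weights / weights.sum()))          # position-based weighted average

record = [3.2, 7.5, 1.1, 4.8]
print(owa(record, weights=[0.4, 0.3, 0.2, 0.1]))   # weights emphasise the largest values
```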

5.
This paper addresses the problem of endogenous regressors that arises either from unobserved heterogeneity correlated with the regressors or from measurement errors in the regressors. A simple two‐stage testing procedure is proposed for identifying the underlying cause of the correlation between the regressors and the error term. The statistical performance of the resulting sequential test is assessed using simulated data.

6.
When two value estimates are about equally likely, conservatism dictates reporting the less optimistic one (e.g., Lower of Cost or Market). We use an analytical model to investigate the informational implications of this dictum and identify types of environments where the conservative accounting treatment is more informative than a predetermined choice. The bias induced by the conservative choice is found to be moderate, never excessive. It benefits users of the financial statements who take the reported figures at face value whenever upside errors are more costly (possibly only slightly more costly) than similar downside errors. Sophisticated users, who know how to give the reports the best possible interpretation, benefit from the lower variability, not from the bias. These latter benefits are least ambiguous when upside errors and downside errors are about equally costly.

7.
Statistical offices are responsible for publishing accurate statistical information about many different aspects of society. This task is complicated considerably by the fact that data collected by statistical offices generally contain errors. These errors have to be corrected before reliable statistical information can be published. This correction process is referred to as statistical data editing. Traditionally, data editing was mainly an interactive activity with the aim to correct all data in every detail. For that reason the data editing process was both expensive and time-consuming. To improve the efficiency of the editing process it can be partly automated. One often divides the statistical data editing process into the error localisation step and the imputation step. In this article we restrict ourselves to discussing the former step, and provide an assessment, based on personal experience, of several selected algorithms for automatically solving the error localisation problem for numerical (continuous) data. Our article can be seen as an extension of the overview article by Liepins, Garfinkel & Kunnathur (1982). All algorithms we discuss are based on the (generalised) Fellegi–Holt paradigm that says that the data of a record should be made to satisfy all edits by changing the fewest possible (weighted) number of fields. The error localisation problem may have several optimal solutions for a record. In contrast to what is common in the literature, most of the algorithms we describe aim to find all optimal solutions rather than just one. As numerical data mostly occur in business surveys, the described algorithms are mainly suitable for business surveys and less so for social surveys. For four algorithms we compare the computing times on six realistic data sets as well as their complexity.
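
To make the Fellegi–Holt paradigm concrete, the sketch below solves a tiny error localisation problem by brute force: subsets of fields are searched in order of increasing total weight, and a subset is accepted if freeing those fields makes the linear edits satisfiable. This is only an illustration of the paradigm, not one of the algorithms compared in the paper; the edits, weights and record are hypothetical.

```python
# Brute-force Fellegi-Holt error localisation for numerical data with linear edits.
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

# Edits of the form A x <= b for x = [turnover, costs, profit]:
# profit = turnover - costs (two inequalities) and all fields non-negative.
A = np.array([[-1.0,  1.0,  1.0],
              [ 1.0, -1.0, -1.0],
              [-1.0,  0.0,  0.0],
              [ 0.0, -1.0,  0.0],
              [ 0.0,  0.0, -1.0]])
b = np.zeros(5)

record  = np.array([100.0, 60.0, 15.0])   # violates profit = turnover - costs
weights = np.array([1.0, 1.0, 1.0])       # reliability weight per field

def feasible(free_idx):
    """Can the record satisfy A x <= b by changing only the fields in free_idx?"""
    if not free_idx:
        return bool(np.all(A @ record <= b + 1e-9))
    fixed = [j for j in range(len(record)) if j not in free_idx]
    b_adj = b - A[:, fixed] @ record[fixed] if fixed else b.copy()
    res = linprog(c=np.zeros(len(free_idx)), A_ub=A[:, list(free_idx)], b_ub=b_adj,
                  bounds=[(None, None)] * len(free_idx), method="highs")
    return res.status == 0

# Search subsets in order of increasing total weight and keep ALL optimal solutions.
subsets = sorted((frozenset(s) for r in range(len(record) + 1)
                  for s in combinations(range(len(record)), r)),
                 key=lambda s: sum(weights[j] for j in s))
best, solutions = None, []
for s in subsets:
    w = sum(weights[j] for j in s)
    if best is not None and w > best:
        break
    if feasible(s):
        best = w
        solutions.append(sorted(s))
print("optimal field sets to change:", solutions)   # several optima may exist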

8.
Microaggregation by individual ranking (IR) is an important technique for masking confidential econometric data. While being a successful method for controlling the disclosure risk of observations, IR also affects the results of statistical analyses. We conduct a theoretical analysis on the estimation of arbitrary moments from a data set that has been anonymized by means of the IR method. We show that classical moment estimators remain both consistent and asymptotically normal under weak assumptions. This theory provides the justification for applying standard statistical estimation techniques to the anonymized data without having to correct for a possible bias caused by anonymization.
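
A minimal sketch of the IR method itself is given below: each variable is anonymised separately by sorting its values, forming groups of k consecutive observations (with a possibly larger last group), and replacing each value with its group mean. The group size and data are hypothetical; column means are preserved exactly, and the paper's result is that higher-order moments estimated from the masked data remain consistent.

```python
# Microaggregation by individual ranking (IR), sketched for a small data matrix.
import numpy as np

def microaggregate_ir(X, k=3):
    """Return an IR-anonymised copy of X (n observations x p variables)."""
    X = np.asarray(X, dtype=float)
    Z = np.empty_like(X)
    n = X.shape[0]
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])          # rank the values of variable j
        col = X[order, j]
        agg = np.empty(n)
        for start in range(0, n, k):
            stop = min(start + k, n)
            if n - start < 2 * k and stop < n:   # merge a short tail into the last group
                stop = n
            agg[start:stop] = col[start:stop].mean()
            if stop == n:
                break
        Z[order, j] = agg                    # write group means back in original order
    return Z

rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=0.5, size=(20, 2))
Z = microaggregate_ir(X, k=3)
print(X.mean(axis=0), Z.mean(axis=0))        # column means are unchanged
```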

9.
Effective linkage detection and gene mapping requires analysis of data jointly on members of extended pedigrees, jointly at multiple genetic markers. Exact likelihood computation is then often infeasible, but Markov chain Monte Carlo (MCMC) methods permit estimation of posterior probabilities of genome sharing among relatives, conditional upon marker data. In principle, MCMC also permits estimation of linkage analysis location score curves, but in practice effective MCMC samplers are hard to find. Although the whole-meiosis Gibbs sampler (M-sampler) performs well in some cases, for extended pedigrees and tightly linked markers better samplers are needed. However, using the M-sampler as a proposal distribution in a Metropolis-Hastings algorithm does allow genetic interference to be incorporated into the analysis.

10.
孙成霖 《价值工程》2010,29(6):39-39
Hypothesis testing is one of the components of statistical inference, and statistical inference in turn plays an important role in sports statistics. Two types of errors can occur in hypothesis testing. In practice, attention is often paid only to controlling the Type I error, while the Type II error is frequently ignored; yet controlling the Type II error is also essential. This paper discusses the causes of the two types of errors and how to control the Type II error, and proposes some approaches for doing so.
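
As a simple worked illustration of the trade-off discussed in this abstract (not taken from the paper): for a one-sided z-test of H0: mu = 0 against the alternative mu = 0.5 with known sigma = 1, one can compute the Type II error rate for a given sample size and the sample size needed to keep it at 0.10 while holding the Type I error at 0.05.

```python
# Type II error (beta) and required sample size for a one-sided z-test.
from scipy.stats import norm

alpha, mu1, sigma = 0.05, 0.5, 1.0
z_alpha = norm.ppf(1 - alpha)

def beta(n):
    # Reject H0 when the sample mean exceeds z_alpha * sigma / sqrt(n);
    # beta is the probability of NOT rejecting when the true mean is mu1.
    crit = z_alpha * sigma / n ** 0.5
    return norm.cdf((crit - mu1) / (sigma / n ** 0.5))

print(beta(10))                              # about 0.53: a large Type II error at n = 10
z_beta = norm.ppf(1 - 0.10)
print(((z_alpha + z_beta) * sigma / mu1) ** 2)   # about 34 observations needed for power 0.90
```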

11.
Good statistical practice dictates that summaries in Monte Carlo studies should always be accompanied by standard errors. Those standard errors are easy to provide for summaries that are sample means over the replications of the Monte Carlo output: for example, bias estimates, power estimates for tests and mean squared error estimates. But often more complex summaries are of interest: medians (often displayed in boxplots), sample variances, ratios of sample variances and non‐normality measures such as skewness and kurtosis. In principle, standard errors for most of these latter summaries may be derived from the Delta Method, but that extra step is often a barrier for standard errors to be provided. Here, we highlight the simplicity of using the jackknife and bootstrap to compute these standard errors, even when the summaries are somewhat complicated.
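
The sketch below illustrates the idea on hypothetical Monte Carlo output: a bootstrap standard error for the median of the replicated estimates and a delete-one jackknife standard error for their sample variance. It is a minimal illustration, not the paper's code.

```python
# Bootstrap and jackknife standard errors for non-trivial Monte Carlo summaries.
import numpy as np

rng = np.random.default_rng(1)
theta_hat = rng.standard_t(df=5, size=1000)      # pretend: 1000 Monte Carlo replicates

# Bootstrap standard error of the median of the Monte Carlo output
B = 2000
boot = np.array([np.median(rng.choice(theta_hat, theta_hat.size, replace=True))
                 for _ in range(B)])
se_median = boot.std(ddof=1)

# Delete-one jackknife standard error of the Monte Carlo sample variance
n = theta_hat.size
jack = np.array([np.var(np.delete(theta_hat, i), ddof=1) for i in range(n)])
se_variance = np.sqrt((n - 1) / n * ((jack - jack.mean()) ** 2).sum())

print(se_median, se_variance)
```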

12.
13.
In this paper, we consider the use of auxiliary data and paradata for dealing with non‐response and measurement errors in household surveys. Three over‐arching purposes are distinguished: response enhancement, statistical adjustment, and bias exploration. Attention is given to the varying focus at the different phases of statistical production, from collection and processing to analysis, and to how to select and utilize the useful auxiliary data and paradata. Administrative register data provide the richest source of relevant auxiliary information, in addition to data collected in previous surveys and censuses. Given their importance in dealing effectively with non‐sampling errors, every effort should be made to increase their availability in the statistical system and, at the same time, to develop efficient statistical methods that capitalize on the combined data sources.

14.
Multicollinearity is one of the most important issues in regression analysis, as it produces unstable coefficient estimates and severely inflated standard errors. Regression theory is based on specific assumptions concerning the error random variables. In particular, when the errors are uncorrelated and have constant variance, the ordinary least squares estimator produces the best estimates among all linear estimators. If, as often happens in practice, these assumptions are not met, other methods may give more efficient estimates and their use is therefore recommendable. In this paper, after reviewing and briefly describing the salient features of the methods proposed in the literature to detect and address the multicollinearity problem, we introduce the Lpmin method, based on Lp-norm estimation, an adaptive robust procedure that is used when the residual distribution deviates from normality. The major advantage of this approach is that it produces more efficient estimates of the model parameters, for different degrees of multicollinearity, than those generated by the ordinary least squares method. A simulation study and a real-data application are also presented in order to show the better results provided by the Lpmin method in the presence of multicollinearity.
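
As background, a rough sketch of generic Lp-norm regression is shown below: the coefficients minimise the sum of absolute residuals raised to a chosen power p, with OLS used as the starting value. This is only an illustration of Lp-norm estimation, not the Lpmin procedure proposed in the paper; the choice of p, the data, and the optimiser are hypothetical.

```python
# Generic Lp-norm regression by direct numerical minimisation (sketch).
import numpy as np
from scipy.optimize import minimize

def lp_regression(X, y, p=1.5):
    X1 = np.column_stack([np.ones(len(y)), X])          # add an intercept column
    beta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)   # OLS starting values
    obj = lambda b: np.sum(np.abs(y - X1 @ b) ** p)     # Lp objective
    return minimize(obj, beta_ols, method="Nelder-Mead").x

# Two highly collinear regressors plus heavy-tailed errors
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)              # nearly collinear with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.standard_t(df=3, size=200)

print(lp_regression(np.column_stack([x1, x2]), y, p=1.5))
```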

15.
We propose a simple estimator for nonlinear method of moments models with measurement error of the classical type when no additional data, such as validation data or double measurements, are available. We assume that the marginal distributions of the measurement errors are Laplace (double exponential) with zero means and unknown variances, and that the measurement errors are independent of the latent variables and of each other. Under these assumptions, we derive simple revised moment conditions in terms of the observed variables. They are used to make inference about the model parameters and the variance of the measurement error. The results of this paper show that the distributional assumption on the measurement errors can be used to point identify the parameters of interest. Our estimator is a parametric method of moments estimator that uses the revised moment conditions and hence is simple to compute. Our estimation method is particularly useful in situations where no additional data are available, which is the case for many economic data sets. A simulation study demonstrates good finite sample properties of our proposed estimator. We also examine the performance of the estimator in the case where the error distribution is misspecified.

16.
This paper provides a review of common statistical disclosure control (SDC) methods implemented at statistical agencies for standard tabular outputs containing whole population counts from a census (either enumerated or based on a register). These methods include record swapping on the microdata prior to its tabulation and rounding of entries in the tables after they are produced. The approach for assessing SDC methods is based on a disclosure risk–data utility framework and the need to balance managing disclosure risk against maximizing the amount of information that can be released to users and ensuring high quality outputs. To carry out the analysis, quantitative measures of disclosure risk and data utility are defined and methods compared. Conclusions from the analysis show that record swapping as a sole SDC method leaves high probabilities of disclosure risk. Targeted record swapping lowers the disclosure risk, but there is more distortion of distributions. Small cell adjustments (rounding) give protection to census tables by eliminating small cells, but only one set of variables and geographies can be disseminated in order to avoid disclosure by differencing nested tables. Full random rounding offers more protection against disclosure by differencing, but margins are typically rounded separately from the internal cells and tables are not additive. Rounding procedures protect against the perception of disclosure risk compared to record swapping since no small cells appear in the tables. Combining rounding with record swapping raises the level of protection but increases the loss of utility of census tabular outputs. For some statistical analyses, the combination of record swapping and rounding balances to some degree the opposing effects that the methods have on the utility of the tables.
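
For concreteness, a minimal sketch of unbiased random rounding to base 3, one of the rounding methods discussed above, is shown below: each cell count is rounded down or up to a multiple of 3, with the upward probability equal to the remainder divided by 3, so counts of 1 and 2 never appear. The table is a hypothetical census-style count table; because internal cells and margins are rounded independently, the resulting table is generally not additive, which matches the point made in the abstract.

```python
# Unbiased random rounding of a count table to base 3 (sketch).
import numpy as np

def random_round(counts, base=3, rng=None):
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts)
    floor = (counts // base) * base
    remainder = counts - floor
    up = rng.random(counts.shape) < remainder / base   # round up with prob. remainder/base
    return floor + up * base

table = np.array([[1, 4, 12], [2, 0, 7]])
print(random_round(table, base=3, rng=np.random.default_rng(3)))
```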

17.
This paper develops a statistical method for defining housing submarkets. The method is applied using household survey data for Sydney and Melbourne, Australia. First, principal component analysis is used to extract a set of factors from the original variables for both local government area (LGA) data and a combined set of LGA and individual dwelling data. Second, factor scores are calculated and cluster analysis is used to determine the composition of housing submarkets. Third, hedonic price equations are estimated for each city as a whole, for a priori classifications of submarkets, and for submarkets defined by the cluster analysis. The weighted mean squared errors from the hedonic equations are used to compare alternative classifications of submarkets. In Melbourne, the classification derived from a K-means clustering procedure on individual dwelling data is significantly better than classifications derived from all other methods of constructing housing submarkets. In some other cases, the statistical analysis produces submarkets that are better than the a priori classification, but the improvement is not significant.
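
A rough sketch of the first two steps is given below: principal components are extracted from standardised dwelling/LGA attributes and the factor scores are then clustered with K-means to define candidate submarkets. The feature matrix, number of components and number of clusters are hypothetical, not the paper's specification.

```python
# PCA factor extraction followed by K-means clustering of the factor scores (sketch).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
features = rng.normal(size=(500, 8))          # stand-in for dwelling/LGA variables

scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(features))
submarket = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores)

# Hedonic price equations would then be estimated within each submarket and their
# weighted mean squared errors compared against an a priori classification.
print(np.bincount(submarket))                 # submarket sizes
```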

18.
Estimation of parameters of the Pareto income distribution is discussed for the situation when data are underreported, i.e., observed with negative random bias. Specifically it is proved that maximum likelihood estimates are consistent and asymptotically normal in large samples, and formulae for the large-sample standard errors are given. The large-sample theory illustrates how some important results from mathematical statistics apply to non-standard statistical models.
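
For reference, the classical (fully observed) baseline is sketched below: the maximum likelihood estimator of the Pareto shape parameter alpha and its large-sample standard error alpha_hat / sqrt(n). The underreporting model analysed in the paper modifies the likelihood and is not reproduced here; the simulated incomes are a hypothetical example.

```python
# Maximum likelihood estimation of the Pareto shape parameter with asymptotic SE.
import numpy as np

def pareto_mle(x, x_min=None):
    x = np.asarray(x, dtype=float)
    x_min = x_min if x_min is not None else x.min()
    alpha_hat = x.size / np.log(x / x_min).sum()     # MLE of the shape parameter
    se = alpha_hat / np.sqrt(x.size)                 # large-sample standard error
    return alpha_hat, se

rng = np.random.default_rng(5)
incomes = 20000 * (1 + rng.pareto(a=2.5, size=5000))   # Pareto with alpha = 2.5, scale 20000
print(pareto_mle(incomes, x_min=20000.0))
```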

19.
Surveys usually include questions where individuals must select one of a series of possible options that can be ordered. At the same time, multiple frame surveys have become a widely used method to decrease bias due to undercoverage of the target population. In this work, we propose statistical techniques for handling ordinal data coming from a multiple frame survey using complex sampling designs and auxiliary information. Our aim is to estimate proportions when the variable of interest has ordinal outcomes. Two estimators are constructed following model‐assisted generalised regression and model calibration techniques. Theoretical properties of these estimators are investigated. Simulation studies with different sampling procedures are conducted to evaluate the performance of the proposed estimators in finite-size samples. An application to a real survey on opinions towards immigration is also included.

20.
This study presents an error structure that offers a rich statistical framework for panel data analysis. It includes as special cases most of the error specifications found in longitudinal studies of wages and earnings. A general set of procedures for choosing a specification of this error structure and estimating its parameters appears in the first part of this study. The last section applies these procedures to fit an error structure for wages and earnings of prime-age males using data from the Michigan Panel of Income Dynamics.
