Improvements over the duplicate performance method for outlier detection in categorical multivariate surveys

López, Carlos

doi:10.1007/BF02589173

Published: August 1996

Volume 5, pages 211–228 (1996)

Journal of the Italian Statistical Society

Summary

The detection of errors and outliers is an important step in data processing, especially for errors arising from data entry operations, because these are entirely the responsibility of the data processing staff. The duplicate performance method is commonly used to detect this type of error. It typically involves typing the same data twice, in no particular order of precedence. If the errors are uniformly distributed among individuals, retyping a fraction of the records will typically remove the same fraction of the errors. A new method is presented that improves on this procedure by sorting the records so that the most unlikely ones come first. The performance of the methodology was tested by a Monte Carlo simulation using an existing database of categorical answers on housing characteristics in Uruguay. The database was first randomly contaminated, and the proposed procedure was then applied. The results show that if a partial retyping is done following the proposed order, about 50% of the errors can be removed while keeping the retyping effort between 4% and 14% of the dataset, whereas attaining a similar result with the standard methodology requires processing 50% of the database on average. The new ordering is based on the unrotated Principal Component Analysis (PCA) transformation of the previously coded data. No special shape of the multivariate distribution function is assumed or required.
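
The ordering idea in the summary can be illustrated with a minimal sketch. The summary does not state how the data are coded or which score ranks the records, so everything below is an assumption made for illustration: indicator (one-hot) coding of the categories, and a score equal to the sum of squared standardized principal component scores (a Mahalanobis-type distance in unrotated PCA space). The function names `one_hot`, `pca_unlikeliness` and `retyping_order` are hypothetical and are not taken from the article.

```python
import numpy as np

def one_hot(codes):
    """Indicator coding (assumed): one 0/1 column per (variable, category)."""
    codes = np.asarray(codes)
    blocks = []
    for j in range(codes.shape[1]):
        cats = np.unique(codes[:, j])
        blocks.append((codes[:, j][:, None] == cats[None, :]).astype(float))
    return np.hstack(blocks)

def pca_unlikeliness(codes, tol=1e-10):
    """Assumed score: sum of squared standardized PC scores per record."""
    X = one_hot(codes)
    Xc = X - X.mean(axis=0)              # center each indicator column
    # SVD of the centered data gives the unrotated principal axes (rows of Vt).
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > tol * s.max()             # drop numerically null components
    # U * sqrt(n - 1) equals each PC score divided by its standard deviation.
    Z = U[:, keep] * np.sqrt(X.shape[0] - 1)
    return (Z ** 2).sum(axis=1)

def retyping_order(codes):
    """Record indices sorted so the most unlikely records come first."""
    return np.argsort(-pca_unlikeliness(codes), kind="stable")

# Toy data: two variables that always agree, except one contaminated record.
codes = np.array([[i % 3, i % 3] for i in range(300)])
codes[7, 1] = (codes[7, 0] + 1) % 3      # simulated data entry error
order = retyping_order(codes)
print(order[:5])                         # record 7 is listed first
```

Under these assumptions, a partial retyping would process only the leading portion of `order`. Whether that recovers the error fractions reported in the summary depends on the actual coding and score used in the article, which this sketch does not reproduce.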

Author information

Carlos López
Present address: Ingenieros Consultores Asociados, Cerro Largo 1321, Montevideo, Uruguay

Authors and Affiliations

Centro de Cálculo, Facultad de Ingeniería, Montevideo, Uruguay
Carlos López

About this article

Cite this article

López, C. Improvements over the duplicate performance method for outlier detection in categorical multivariate surveys. J. It. Statist. Soc. 5, 211–228 (1996). https://doi.org/10.1007/BF02589173

Issue Date: August 1996
DOI: https://doi.org/10.1007/BF02589173
