Summary
Detecting errors and outliers is an important step in data processing, especially for errors arising from data entry operations, which are entirely the responsibility of the data processing staff. The duplicate performance method is commonly used to detect this type of error. It typically involves typing the same data twice, with the records in no particular order of priority. If the errors are uniformly distributed among individuals, retyping a fraction of the records will typically remove the same fraction of the errors. A new method is presented that improves on this procedure by sorting the records so that the most unlikely ones come first. The method has been tested by a Monte Carlo simulation using an existing database of categorical answers on housing characteristics in Uruguay. The database was first randomly contaminated, and the proposed procedure was then applied. The results show that if a partial retyping is done following the proposed order, about 50% of the errors can be removed while keeping the retyping effort between 4% and 14% of the dataset, whereas attaining a similar result with the standard methodology would require processing 50% of the database on average. The new ordering is based upon the unrotated Principal Component Analysis (PCA) transformation of the previously coded data. No special shape of the multivariate distribution function is assumed or required.
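The ordering idea can be illustrated with a minimal Python sketch on synthetic data. It is not the paper's implementation, and several choices here are assumptions because the summary does not specify them: the categorical answers are coded as 0/1 indicator columns, each record is scored by its squared distance from the subspace of the leading unrotated principal components, and the number of leading components (n_major) is set by hand. All function names are hypothetical, and the toy data stand in for the housing database, so the printed fractions are not the reported results.

```python
import numpy as np

def one_hot(codes, n_levels):
    """Indicator coding (assumed): one 0/1 column per (variable, category)."""
    n, p = codes.shape
    out = np.zeros((n, p * n_levels))
    for j in range(p):
        out[np.arange(n), j * n_levels + codes[:, j]] = 1.0
    return out

def pca_unlikeliness(X, n_major):
    """Squared distance of each row from the span of the n_major leading
    unrotated principal components (an assumed scoring rule)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    minor_scores = Xc @ Vt[n_major:].T
    return (minor_scores ** 2).sum(axis=1)

def retyping_order(X, n_major):
    """Record indices sorted so the most unlikely records come first."""
    return np.argsort(-pca_unlikeliness(X, n_major))

def fraction_errors_found(order, has_error, effort):
    """Share of erroneous records among the first `effort` share of `order`."""
    k = int(round(effort * len(order)))
    return has_error[order[:k]].sum() / has_error.sum()

# Synthetic survey: p answers per record, all driven by one latent class.
rng = np.random.default_rng(0)
n, p, levels = 2000, 8, 3
z = rng.integers(0, levels, size=n)
clean = np.where(rng.random((n, p)) < 0.98,
                 z[:, None],
                 rng.integers(0, levels, size=(n, p)))

# Random contamination: about 5% of records get one answer changed
# to a different category, mimicking a data entry error.
has_error = rng.random(n) < 0.05
dirty = clean.copy()
rows = np.flatnonzero(has_error)
cols = rng.integers(0, p, size=rows.size)
shift = rng.integers(1, levels, size=rows.size)
dirty[rows, cols] = (dirty[rows, cols] + shift) % levels

pca_order = retyping_order(one_hot(dirty, levels), n_major=2)
random_order = rng.permutation(n)  # duplicate performance with no priority

for effort in (0.05, 0.10, 0.15, 0.50):
    print(effort,
          fraction_errors_found(pca_order, has_error, effort),
          fraction_errors_found(random_order, has_error, effort))
```

Each printed row gives the retyping effort, then the share of erroneous records reached under the PCA-based order and under a random order. Repeating the contamination step with different seeds would give a small Monte Carlo experiment in the spirit of the one described above.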
Cite this article
López, C. Improvements over the duplicate performance method for outlier detection in categorical multivariate surveys. J. It. Statist. Soc. 5, 211–228 (1996). https://doi.org/10.1007/BF02589173
DOI: https://doi.org/10.1007/BF02589173