Improvements over the duplicate performance method for outlier detection in categorical multivariate surveys

Journal of the Italian Statistical Society

Summary

Detecting errors and outliers is an important step in data processing. Errors arising from data entry deserve particular attention because they are entirely the responsibility of the data processing staff. The duplicate performance method is commonly used to detect this type of error. It typically consists of typing the same data twice, with no particular ordering of the records. If the errors are uniformly distributed among individuals, retyping a given fraction of the records will typically remove the same fraction of the errors. A new method is presented that improves on this procedure by sorting the records so that the most unlikely ones come first. The methodology was tested by Monte Carlo simulation on an existing database of categorical answers on housing characteristics in Uruguay: the database was first randomly contaminated, and the proposed procedure was then applied. The results show that a partial retyping in the proposed order removes about 50% of the errors with a retyping effort of between 4% and 14% of the dataset, whereas the standard methodology requires processing, on average, 50% of the database to attain a similar result. The new ordering is based on the unrotated Principal Component Analysis (PCA) transformation of the previously coded data. No special shape of the multivariate distribution function is assumed or required.
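The idea of retyping records in order of unlikeliness can be illustrated with a minimal sketch. The summary does not state the exact scoring rule, so the code below assumes one plausible choice: the sum of squared principal component scores, each divided by the variance of its component (a Mahalanobis-type distance computed in the unrotated PCA basis). The function names `unlikeliness_order` and `simulate`, the synthetic data, and all parameter values are hypothetical. They stand in for the housing database and the simulation settings of the paper, which are not reproduced here.

```python
import numpy as np


def unlikeliness_order(coded, eps=1e-10):
    """Return record indices sorted from most to least unusual.

    Assumption: the score is the sum of squared PCA scores, each divided
    by the variance of its component. This is an illustrative rule, not
    necessarily the one used in the paper.
    """
    X = np.asarray(coded, dtype=float)
    X = X - X.mean(axis=0)
    sd = X.std(axis=0)
    X = X / np.where(sd > 0, sd, 1.0)      # standardize, guard constant columns
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    keep = s > eps * s[0]                  # drop numerically null components
    scores = U[:, keep] * s[keep]          # unrotated principal component scores
    var = s[keep] ** 2 / (X.shape[0] - 1)  # variance of each component
    d2 = (scores ** 2 / var).sum(axis=1)
    return np.argsort(-d2, kind="stable")


def simulate(n=2000, p=6, error_rate=0.05, effort=0.20, seed=0):
    """Contaminate synthetic coded data and report the share of erroneous
    records that fall inside the retyped prefix of the ordering."""
    rng = np.random.default_rng(seed)
    latent = rng.integers(0, 5, size=n)
    # p correlated answers coded 0..5: a shared latent level plus 0/1 noise
    clean = latent[:, None] + (rng.random((n, p)) < 0.2)
    dirty = clean.copy()
    bad = rng.choice(n, size=int(error_rate * n), replace=False)
    cols = rng.integers(0, p, size=bad.size)
    # a shift of 1..5 modulo 6 guarantees the mistyped code differs
    shift = rng.integers(1, 6, size=bad.size)
    dirty[bad, cols] = (dirty[bad, cols] + shift) % 6
    retyped = unlikeliness_order(dirty)[: int(effort * n)]
    return np.isin(bad, retyped).mean()
```

Calling `simulate()` returns the fraction of contaminated records that would be retyped at the chosen effort. Under uniformly distributed errors and no ordering, that fraction is expected to equal `effort`, which is the baseline the ordering is meant to beat.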

Cite this article

López, C. Improvements over the duplicate performance method for outlier detection in categorical multivariate surveys. J. It. Statist. Soc. 5, 211–228 (1996). https://doi.org/10.1007/BF02589173

  • DOI: https://doi.org/10.1007/BF02589173
