Outliers and the Simpson’s Paradox

  • Eduarda Portela
  • Rita P. Ribeiro
  • João Gama
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10632)

Abstract

There is no standard definition of outliers, but most authors agree that outliers are points far from other data points. Several outlier detection techniques have been developed, mainly with two different purposes. On the one hand, outliers are the interesting observations, as in fraud detection; on the other hand, outliers are regarded as measurement observations that should be removed from the analysis, as in robust statistics. In this work, we start from the observation that outliers are affected by the so-called Simpson’s paradox: a trend that appears in different groups of data but disappears or reverses when these groups are combined. Given a dataset, we learn a regression tree. The tree grows by partitioning the data into groups that are increasingly homogeneous with respect to the target variable. At each partition defined by the tree, we apply a box plot to the target variable to detect outliers. We would expect deeper nodes of the tree to contain fewer and fewer outliers. Instead, we observe that some points previously flagged as outliers are no longer flagged as such, while new outliers appear. The identification of outliers therefore depends on the context considered. Based on this observation, we propose a new method to quantify the level of outlierness of data points.
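The context dependence described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a single variance-minimising split stands in for the full regression tree, the box-plot rule is assumed to be Tukey's fences with the usual factor of 1.5 and inclusive quartiles, and the toy data and the helper names `boxplot_outliers` and `best_split` are hypothetical.

```python
from statistics import pvariance, quantiles


def boxplot_outliers(y, idx, k=1.5):
    """Indices in idx whose target lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles([y[i] for i in idx], n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return {i for i in idx if y[i] < low or y[i] > high}


def best_split(x, y):
    """Threshold t on x minimising the summed squared error of the two children."""
    def sse(values):
        return pvariance(values) * len(values)

    def cost(t):
        left = [b for a, b in zip(x, y) if a < t]
        right = [b for a, b in zip(x, y) if a >= t]
        return sse(left) + sse(right)

    # Skipping the smallest x keeps the left child non-empty for every candidate.
    return min(sorted(set(x))[1:], key=cost)


# Hypothetical toy data: one predictor x and a target y with two regimes.
x = list(range(1, 21))
y = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11,
     10, 10, 11, 9, 14, 30, 32, 34, 36, 38]

t = best_split(x, y)                      # one-split stand-in for the tree
root = list(range(len(y)))
left = [i for i in root if x[i] < t]
right = [i for i in root if x[i] >= t]

root_out = boxplot_outliers(y, root)      # outliers in the context of all data
left_out = boxplot_outliers(y, left)      # outliers within the left partition
right_out = boxplot_outliers(y, right)    # outliers within the right partition

print("split at x <", t)
print("root :", sorted(root_out))
print("left :", sorted(left_out))
print("right:", sorted(right_out))
```

On this toy data the four largest targets are flagged at the root but not inside the right partition, whereas the value 14 is flagged only inside the left partition, so the set of outliers changes with the context rather than simply shrinking.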

Notes

Acknowledgements

This work is financed by the European Regional Development Fund through the COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia as part of project UID/EEA/50014/2013.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Eduarda Portela (1)
  • Rita P. Ribeiro (1, 3)
  • João Gama (1, 2)

  1. LIAAD-INESC TEC, University of Porto, Porto, Portugal
  2. Faculty of Economics, University of Porto, Porto, Portugal
  3. Faculty of Sciences, University of Porto, Porto, Portugal
