Outliers and the Simpson’s Paradox

Portela, Eduarda; Ribeiro, Rita P.; Gama, João

doi:10.1007/978-3-030-02837-4_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10632)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

There is no standard definition of outliers, but most authors agree that outliers are points far from other data points. Several outlier detection techniques have been developed, mainly with two different purposes. On the one hand, outliers are the interesting observations, as in fraud detection; on the other hand, outliers are considered erroneous measurements that should be removed from the analysis, as in robust statistics. In this work, we start from the observation that outliers are affected by the so-called Simpson’s paradox: a trend that appears in different groups of data but disappears or reverses when these groups are combined. Given a dataset, we learn a regression tree. The tree grows by partitioning the data into groups that are increasingly homogeneous with respect to the target variable. At each partition defined by the tree, we apply a box plot to the target variable to detect outliers. We would expect deeper nodes of the tree to contain fewer and fewer outliers. Instead, we observe that some points previously signaled as outliers are no longer signaled as such, while new outliers appear. The identification of outliers therefore depends on the context considered. Based on this observation, we propose a new method to quantify the level of outlierness of data points.
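The context dependence described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are invented, a single variance-reducing split stands in for a full regression tree, and the helper names (`quantile`, `boxplot_outliers`, `best_split`) are hypothetical. The box-plot rule is taken to be the usual Tukey fences at 1.5 times the interquartile range, which is an assumption about the exact rule used.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def boxplot_outliers(rows, k=1.5):
    """Indices of rows whose target y lies outside the Tukey fences."""
    ys = sorted(y for _, _, y in rows)
    q1, q3 = quantile(ys, 0.25), quantile(ys, 0.75)
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return {i for i, _, y in rows if y < lower or y > upper}


def sse(rows):
    """Sum of squared deviations of y from its mean within a node."""
    mean = sum(y for _, _, y in rows) / len(rows)
    return sum((y - mean) ** 2 for _, _, y in rows)


def best_split(rows):
    """Threshold on x that minimises the total squared error of the two children."""
    xs = sorted({x for _, x, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(
        candidates,
        key=lambda t: sse([r for r in rows if r[1] <= t])
        + sse([r for r in rows if r[1] > t]),
    )


# Invented toy data: a large low-valued group and a small high-valued group.
xs = list(range(1, 13)) + [20, 21, 22]
ys = [10, 11, 12, 10, 11, 12, 11, 10, 12, 11, 11, 16, 50, 51, 49]
rows = list(zip(range(len(xs)), xs, ys))  # (index, x, y)

threshold = best_split(rows)
left = [r for r in rows if r[1] <= threshold]
right = [r for r in rows if r[1] > threshold]

root_flags = boxplot_outliers(rows)  # context: the whole dataset
leaf_flags = boxplot_outliers(left) | boxplot_outliers(right)  # context: each child

print("split at x <=", threshold)
print("flagged at the root:", sorted(root_flags))
print("flagged in the children:", sorted(leaf_flags))
```

On this toy data the three points of the small group are flagged at the root but not inside their own node, while a point that looked ordinary at the root is flagged once its node becomes more homogeneous.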

Acknowledgements

This work is financed by the European Regional Development Fund through the COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia as part of project UID/EEA/50014/2013.

Author information

Authors and Affiliations

LIAAD-INESC TEC, University of Porto, Porto, Portugal
Eduarda Portela, Rita P. Ribeiro & João Gama
Faculty of Economics, University of Porto, Porto, Portugal
João Gama
Faculty of Sciences, University of Porto, Porto, Portugal
Rita P. Ribeiro

Corresponding author

Correspondence to João Gama.

Editor information

Editors and Affiliations

Universidad Autónoma del Estado de Hidalgo, Pachuca, Mexico
Félix Castro
INFOTEC Aguascalientes, Aguascalientes, Mexico
Sabino Miranda-Jiménez
Tecnológico de Monterrey, Atizapán de Zaragoza, Mexico
Miguel González-Mendoza

About this paper

Cite this paper

Portela, E., Ribeiro, R.P., Gama, J. (2018). Outliers and the Simpson’s Paradox. In: Castro, F., Miranda-Jiménez, S., González-Mendoza, M. (eds) Advances in Soft Computing. MICAI 2017. Lecture Notes in Computer Science, vol 10632. Springer, Cham. https://doi.org/10.1007/978-3-030-02837-4_22

DOI: https://doi.org/10.1007/978-3-030-02837-4_22
Published: 01 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02836-7
Online ISBN: 978-3-030-02837-4
eBook Packages: Computer Science, Computer Science (R0)