Advances in Data Analysis and Classification, Volume 12, Issue 3, pp 785–822

Outlier detection in interval data

  • A. Pedro Duarte Silva
  • Peter Filzmoser
  • Paula Brito
Regular Article

Abstract

A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data.
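The idea summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each interval is represented by its midpoint and log-range, that these features are modelled by a single multivariate Gaussian, and that the trimmed likelihood is maximised by random restarts followed by concentration steps. The function names, the `keep_fraction` parameter and the Wilson-Hilferty approximation used for the chi-square cutoff are all assumptions introduced for illustration.

```python
import numpy as np
from statistics import NormalDist


def to_mid_logrange(lower, upper):
    """Stack interval midpoints and log-ranges into one feature matrix (assumed representation)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return np.hstack([(lower + upper) / 2.0, np.log(upper - lower)])


def sq_mahalanobis(X, mean, cov):
    """Squared Mahalanobis distance of every row of X from (mean, cov)."""
    diff = X - mean
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)


def trimmed_gaussian_fit(X, keep_fraction=0.75, n_starts=20, n_iter=50, seed=0):
    """Approximate trimmed ML fit of a Gaussian using the h best-fitting rows.

    For a Gaussian model, maximising the trimmed likelihood over subsets of
    size h amounts to minimising the determinant of the subset covariance,
    so the restarts are compared by log-determinant.
    """
    n, d = X.shape
    h = int(np.ceil(keep_fraction * n))
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        # Start from a small random subset, which is more likely to be clean.
        subset = rng.choice(n, size=2 * d, replace=False)
        for _ in range(n_iter):
            mean = X[subset].mean(axis=0)
            cov = np.cov(X[subset], rowvar=False, bias=True)
            # Concentration step: keep the h rows closest to the current fit.
            new_subset = np.argsort(sq_mahalanobis(X, mean, cov))[:h]
            if np.array_equal(np.sort(new_subset), np.sort(subset)):
                break
            subset = new_subset
        mean = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False, bias=True)
        logdet = np.linalg.slogdet(cov)[1]
        if best is None or logdet < best[0]:
            best = (logdet, mean, cov)
    return best[1], best[2]


def chi2_quantile_wh(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile (not exact)."""
    z = NormalDist().inv_cdf(p)
    return df * (1.0 - 2.0 / (9.0 * df) + z * np.sqrt(2.0 / (9.0 * df))) ** 3


def flag_outliers(lower, upper, level=0.975, **fit_kwargs):
    """Return (flags, squared robust distances) for interval-valued rows."""
    X = to_mid_logrange(lower, upper)
    mean, cov = trimmed_gaussian_fit(X, **fit_kwargs)
    d2 = sq_mahalanobis(X, mean, cov)
    return d2 > chi2_quantile_wh(level, X.shape[1]), d2
```

Calling `flag_outliers(lower, upper)` on two arrays of interval bounds with shape `(n, p)` returns a boolean flag and a squared robust distance per observation. The sketch applies no consistency or small-sample correction to the trimmed covariance, so the cutoff tends to flag somewhat more clean observations than the nominal level suggests.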

Keywords

Outliers · Robust statistics · Interval data · Mahalanobis distance

Mathematics Subject Classification

62-07 (Data Analysis) · 62F35 (Robustness and adaptive procedures) · 62H86 (Multivariate analysis and fuzziness)

Notes

Acknowledgements

This work is financed by the ERDF-European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation-COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT - Fundação para a Ciência e Tecnologia (Portuguese Foundation for Science and Technology) as part of projects UID/EEA/50014/2013 and UID/GES/00731/2013.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  • A. Pedro Duarte Silva (1)
  • Peter Filzmoser (2)
  • Paula Brito (3)
  1. Católica Porto Business School & CEGE, Universidade Catolica Portuguesa, Porto, Portugal
  2. Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria
  3. Faculdade de Economia & LIAAD-INESC TEC, Universidade do Porto, Porto, Portugal