# Outlier detection in interval data


## Abstract

A multivariate outlier detection method for interval data is proposed, based on a parametric model of the interval data. The trimmed maximum likelihood principle is adapted to estimate the model parameters robustly. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots give deeper insight into the structure of real-world interval data.
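The general idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: each interval is represented by its midpoint and log-range, these features are modelled as jointly Gaussian, the trimmed likelihood is maximised with simple concentration steps (for a Gaussian model this amounts to looking for the subset whose covariance has the smallest determinant), and observations are flagged by comparing squared Mahalanobis distances with a chi-square quantile. The function names, `keep_frac`, and the simulated data are all hypothetical.

```python
import numpy as np

def to_mid_logrange(lower, upper):
    """Map interval bounds to (midpoint, log-range) features (assumed parametrization)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return np.column_stack([(lower + upper) / 2.0, np.log(upper - lower)])

def mahalanobis_sq(X, mean, cov):
    """Squared Mahalanobis distance of each row of X from (mean, cov)."""
    diff = X - mean
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)

def gaussian_mle(X):
    """Gaussian maximum likelihood estimates, with a tiny ridge for stability."""
    cov = np.cov(X, rowvar=False, bias=True) + 1e-9 * np.eye(X.shape[1])
    return X.mean(axis=0), cov

def trimmed_gaussian_fit(X, keep_frac=0.75, n_starts=20, n_iter=50, seed=0):
    """Approximate trimmed ML fit: keep the h observations that fit best."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = int(np.ceil(keep_frac * n))
    best_logdet, best_fit = np.inf, None
    for _ in range(n_starts):
        subset = rng.choice(n, size=p + 1, replace=False)  # small random start
        for _ in range(n_iter):
            mean, cov = gaussian_mle(X[subset])
            new_subset = np.argsort(mahalanobis_sq(X, mean, cov))[:h]
            if set(new_subset) == set(subset):
                break  # concentration step no longer changes the subset
            subset = new_subset
        mean, cov = gaussian_mle(X[subset])
        # Smaller log-determinant means larger trimmed Gaussian likelihood.
        logdet = np.linalg.slogdet(cov)[1]
        if logdet < best_logdet:
            best_logdet, best_fit = logdet, (mean, cov)
    return best_fit

# Simulated data (hypothetical): 60 regular intervals followed by 5 atypical ones.
rng = np.random.default_rng(1)
mid = np.concatenate([rng.normal(10, 1, 60), rng.normal(20, 0.5, 5)])
width = np.concatenate([rng.lognormal(0, 0.2, 60), rng.lognormal(2, 0.2, 5)])
X = to_mid_logrange(mid - width / 2, mid + width / 2)

mean, cov = trimmed_gaussian_fit(X)
d2 = mahalanobis_sq(X, mean, cov)
# 97.5% chi-square quantile with 2 degrees of freedom has the closed form -2*ln(0.025).
cutoff = -2.0 * np.log(0.025)
flags = d2 > cutoff
print("flagged indices:", np.flatnonzero(flags))
```

This sketch applies no consistency or small-sample correction to the trimmed covariance, so the distances of regular observations are inflated and the cutoff is only indicative; some regular intervals may be flagged alongside the atypical ones.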

## Keywords

Outliers, Robust statistics, Interval data, Mahalanobis distance

## Mathematics Subject Classification

62-07 (Data Analysis), 62F35 (Robustness and adaptive procedures), 62H86 (Multivariate analysis and fuzziness)

## Notes

### Acknowledgements

This work is financed by the ERDF-European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation-COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT - Fundação para a Ciência e Tecnologia (Portuguese Foundation for Science and Technology) as part of projects UID/EEA/50014/2013 and UID/GES/00731/2013.
