Discussion of “The power of monitoring: how to make the most of a contaminated multivariate sample”
We contribute to the discussion of an article in which Andrea Cerioli, Marco Riani, Anthony Atkinson and Aldo Corbellini review the advantages of analyzing multivariate data by monitoring how the estimated model parameters change as the estimation parameters vary. The focus is on robust methods and their sensitivity to the nominal efficiency and breakdown point. We congratulate the authors on their clear and stimulating exposition, and we add an overview of our experience in applying monitoring in our application domain.
Keywords: Forward Search, MM-estimation, S-estimation, Density estimation, Thinning, International trade data
Andrea Cerioli, Marco Riani, Anthony Atkinson and Aldo Corbellini (hereafter CRAC) are passionate supporters of the data analysis approach proposed for this discussion (Cerioli et al. 2018), which consists in monitoring the model parameters estimated for a reasonable range of values of the key parameters of the estimation method, and selecting those producing the best results. Robust estimation algorithms depend on several tuning constants, producing effects that should be monitored. CRAC focus on the key parameters used to specify the maximum possible breakdown or efficiency to be achieved. We would like to complement their exposition with recent applications of the monitoring approach to datasets relevant for international trade analysis and anti-fraud, which bring new statistical challenges not yet fully addressed.
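To make the idea of monitoring concrete, the following is a minimal sketch, not the procedure used by CRAC. It assumes a simple regression y = a + b*x and swaps in a simplified least trimmed squares (LTS) fit with concentration steps in place of the S and MM estimators discussed here, because LTS can be written in a few lines. The function names `lts_fit` and `monitor_slope`, the number of random starts and the number of concentration steps are all illustrative assumptions.

```python
import numpy as np

def lts_fit(x, y, bdp, n_starts=50, n_csteps=20, seed=0):
    """Simplified LTS fit of y = a + b*x (illustrative stand-in for S/MM).

    bdp is the nominal breakdown point: h = floor(n * (1 - bdp)) units
    with the smallest squared residuals are kept in the fit.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    h = max(2, int(np.floor(n * (1.0 - bdp))))
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        # Start from a random pair of units (an elemental subset).
        subset = rng.choice(n, size=2, replace=False)
        for _ in range(n_csteps):
            # Concentration step: fit on the subset, then keep the h
            # units with the smallest squared residuals.
            beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
            subset = np.argsort((y - X @ beta) ** 2)[:h]
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta  # array [intercept, slope]

def monitor_slope(x, y, bdp_grid):
    """Estimated slope for each breakdown point in bdp_grid."""
    return np.array([lts_fit(x, y, bdp)[1] for bdp in bdp_grid])
```

Plotting the output of `monitor_slope` against a grid such as `np.arange(0.5, 0.0, -0.05)` gives a rudimentary monitoring plot: a sharp change in the slope along the grid suggests the breakdown point below which the outliers start to influence the fit.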
2 Monitoring trade data
CRAC introduced us to a particular monitoring instance, the Forward Search (FS, Atkinson and Riani 2000), more than ten years ago. Together we studied the application of monitoring to other established robust regression estimators (Riani et al. 2014). Currently, we use different forms of monitoring in the routine analysis of a large number of regression datasets relevant to European Union policies, such as international trade and anti-fraud. We have many more reasons to support the approach enthusiastically than drawbacks to report.
Every month we compute robust estimates of “fair prices” for goods imported into the European Union from third countries. The estimates are used by customs and anti-fraud services to combat illegal practices. The financial impact on the EU budget is considerable, and the fair prices must be in some way “certified”, in view of their use in Court cases. We are therefore studying appropriate statistics or indicators to summarize the sensitivity of the robust fair price estimate to the choice of the estimation method and the related parameters and tuning constants. To this end monitoring is a valuable instrument, although we face two main disadvantages: one is the substantial computation time (which increases with the sample size and the number of parameters monitored), and the other is the lack of clear instruments to summarize automatically, in a single statistic or indicator, the rich collection of monitored results.
3 The effect of concentrated non-contaminated observations
We introduce into the discussion another complication that occurs rather often in trade data: large proportions of non-contaminated observations falling in a small data region. To our knowledge, this problem was addressed in robust statistics only recently, with Heikkonen et al. (2013) and Cerioli and Perrotta (2014) showing that the effect of a high-density region can be so strong as to override the benefits of robust devices such as trimming methods for robust clustering. We show that the monitoring plots are no exception and become completely uninformative in the presence of highly concentrated data. The proposal of Cerioli and Perrotta (2014) in these cases is to sample a much smaller subset of observations which preserves the cluster structure and also retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set.
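The thinning idea can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm of Cerioli and Perrotta (2014): the density is estimated here with a plain two-dimensional histogram, the retention probability is taken to be inversely proportional to that density and capped at 1, and the names `thin_by_density`, `bins` and `target_fraction` are hypothetical.

```python
import numpy as np

def thin_by_density(x, y, bins=20, target_fraction=0.1, seed=0):
    """Return a boolean mask of the units retained after thinning.

    Assumed form: retention probability proportional to 1 / density,
    with density estimated by a 2-D histogram, and capped at 1.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Histogram cell of each unit (the last cell includes its right edge).
    ix = np.clip(np.searchsorted(xedges, x, side="right") - 1, 0, bins - 1)
    iy = np.clip(np.searchsorted(yedges, y, side="right") - 1, 0, bins - 1)
    density = np.maximum(counts[ix, iy], 1.0)
    prob = 1.0 / density
    # Rescale so that the expected retained share is at most
    # target_fraction; isolated units end up with probability 1.
    prob = np.minimum(1.0, prob * target_fraction * n / prob.sum())
    return rng.random(n) < prob
```

With this choice, units in the densely populated region are retained with small probability, while isolated units, which are the potential outliers, tend to be kept. A kernel density estimate could replace the histogram at a higher computational cost.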
Consider for example the datasets of Fig. 2, which for the sake of clarity will be called the “Books” and “Jewellery” datasets respectively. Both are characterized by a densely populated area in a “small trade” region of no practical interest in the anti-fraud context. In the case of the Books dataset, units are so concentrated that only \(0.02\%\) of the data is retained, while the general data pattern is preserved. Note that the initial size of these datasets can be so large as to make analyses computationally very demanding (the application of the FS to the 33304 books import flows ran out of memory after several hours on a 2.1 GHz Xeon processor with 16 GB of memory).
After thinning, when the same monitoring is applied to the retained units, the forward plots of the S estimator (right panel of Fig. 5) show that, when the breakdown point is chosen between 0.5 and 0.45, the outliers are very easy to identify. For smaller breakdown point values, on the contrary, masking occurs. By checking the id number of the units in the lower group of trajectories between 0.5 and 0.45 breakdown point (for this we used an interactive data tooltip of our FSDA toolbox), we could verify that they correspond to the group of outliers. The same information is more difficult to grasp in the forward plot of the MM residuals (left panel of Fig. 5), but the presence of structure in the data is now very clear in comparison with the flat plot of Fig. 4 obtained with the original dataset. Along the lines of CRAC, we provide the corresponding correlation forward plots in Fig. 6, where the structural change points are well identified.
Note, in both figures, the different scales of the monitored residuals in the original and thinned datasets. To understand the nature of this effect we have monitored the intercept and slope values estimated in the two cases. Figure 9, which refers to the Books dataset, shows that the intercept is close to 0 if all data are fitted, while with the retained units it is between 100 and 350, depending on the breakdown point, with an obvious inflation effect on the residuals. The corresponding slopes for a standard 0.5 breakdown point are around 3.5 and 2.7 respectively. We could verify that the most reasonable slope (obtained with a robust fit using a model without intercept, to estimate the import price of the books) is 2.8, which is very close to the S fit on the retained units. Finally, note that the monitoring of the estimated regression parameters also shows that something is occurring for a breakdown point approximately equal to 0.1.
For CRAC, monitoring is more than a particular way of dealing with data: they often like to state that it is a true data analysis philosophy, which comes from the belief that data can be completely understood only by appraising the effect on a fitted model of each statistical unit, or of sub-groups of units. In this discussion we have provided further evidence that monitoring is, at the very least, a powerful instrument to summarize a lot of information in a single plot.
- Cerioli A, Perrotta D (2014) Robust clustering around regression lines with high density regions. Adv Data Anal Classif 8(1):5–26. ISSN 1862-5355
- Cerioli A, Riani M, Atkinson AC, Corbellini A (2018) The power of monitoring: how to make the most of a contaminated multivariate sample. Stat Methods Appl (1). In press
- Heikkonen J, Perrotta D, Riani M, Torti F (2013) Issues on clustering and data gridding. Springer, Berlin, pp 37–44
- Riani M, Perrotta D, Cerioli A (2015) The forward search for very large datasets. J Stat Softw Code Snippets 67(1):1–20
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.