Abstract
Detecting database records that are exact or approximate duplicates, whether caused by data entry errors, differences in the detailed schemas of records drawn from multiple databases, or other reasons, is an important line of research. Yet no comprehensive comparative study has evaluated the effectiveness of the Silhouette width, the Calinski & Harabasz index (pseudo F-statistic) and the Baker & Hubert index (γ index) in the presence of exact and approximate duplicates. This chapter presents a comparative study of the effectiveness of these three cluster validation techniques, which involves measuring the stability of a partition of a data set in the presence of noise, in particular exact and approximate duplicates. The Silhouette width, Calinski & Harabasz index and Baker & Hubert index are calculated before and after exact and approximate duplicates are deliberately inserted into the data set. Comprehensive experiments on the glass, wine, iris and ruspini data sets confirm that the Baker & Hubert index is not stable in the presence of approximate duplicates. Moreover, the Silhouette width, Calinski & Harabasz index and Baker & Hubert index do not exceed their values on the original data in the presence of approximate duplicates.
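The before-and-after comparison described in the abstract can be illustrated with a minimal sketch. It is not the chapter's code: the abstract does not state the clustering algorithm, the software used, or how the duplicates were generated. The toy points, the fixed labels, the helper `with_duplicates`, the uniform-noise model for approximate duplicates, and the choice to give each duplicate the label of its source record are all assumptions made for illustration. The three indices follow their standard textbook definitions with Euclidean distance.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(pts):
    return [sum(col) / len(pts) for col in zip(*pts)]

def silhouette_width(points, labels):
    """Average of s(i) = (b - a) / max(a, b); assumes at least two clusters."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        mean_d = {}
        for c in clusters:
            ds = [dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            mean_d[c] = sum(ds) / len(ds) if ds else None
        a = mean_d[labels[i]]
        if a is None:          # singleton cluster: s(i) = 0 by convention
            continue
        b = min(v for c, v in mean_d.items() if c != labels[i])
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)

def calinski_harabasz(points, labels):
    """Pseudo F-statistic: (between / (k - 1)) / (within / (n - k))."""
    n, clusters = len(points), sorted(set(labels))
    k = len(clusters)
    overall = centroid(points)
    between = within = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        m = centroid(members)
        between += len(members) * dist(m, overall) ** 2
        within += sum(dist(p, m) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

def baker_hubert_gamma(points, labels):
    """Gamma = (S+ - S-) / (S+ + S-) over within/between distance pairs."""
    within, between = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = dist(points[i], points[j])
            (within if labels[i] == labels[j] else between).append(d)
    s_plus = sum(1 for w in within for b in between if w < b)
    s_minus = sum(1 for w in within for b in between if w > b)
    return (s_plus - s_minus) / (s_plus + s_minus)

def with_duplicates(points, labels, indices, noise=0.0, seed=0):
    """Hypothetical helper: append copies of the chosen records.

    noise == 0 gives exact duplicates; noise > 0 perturbs each coordinate
    uniformly in [-noise, noise] (an assumed model of approximate duplicates).
    Each copy keeps the cluster label of its source record (an assumption).
    """
    rng = random.Random(seed)
    new_pts, new_lbl = list(points), list(labels)
    for i in indices:
        new_pts.append([x + rng.uniform(-noise, noise) for x in points[i]])
        new_lbl.append(labels[i])
    return new_pts, new_lbl

# Toy data (made up for illustration): two well-separated clusters.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
labels = [0, 0, 0, 1, 1, 1]

variants = {
    "original": (points, labels),
    "exact duplicates": with_duplicates(points, labels, [0, 3]),
    "approximate duplicates": with_duplicates(points, labels, [0, 3], noise=0.05),
}
for name, (pts, lbl) in variants.items():
    print(name,
          round(silhouette_width(pts, lbl), 4),
          round(calinski_harabasz(pts, lbl), 4),
          round(baker_hubert_gamma(pts, lbl), 4))
```

Because the labels are held fixed here, the sketch only shows how the index values themselves shift when duplicates are added. A study of partition stability would typically re-run the clustering on the augmented data as well, which this example does not attempt.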
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Jain, R., Koronios, A. (2008). Cluster Validating Techniques in the Presence of Duplicates. In: Jain, L.C., Sato-Ilic, M., Virvou, M., Tsihrintzis, G.A., Balas, V.E., Abeynayake, C. (eds) Computational Intelligence Paradigms. Studies in Computational Intelligence, vol 137. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79474-5_9
DOI: https://doi.org/10.1007/978-3-540-79474-5_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79473-8
Online ISBN: 978-3-540-79474-5
eBook Packages: Engineering, Engineering (R0)