Abstract
This paper explores the application of inequality indices, a concept successfully applied in comparative software analysis and many other application domains, to finding the optimal value of k for k-means when clustering road traffic data. We demonstrate that traditional methods for identifying the optimal value of k (such as the gap statistic and Pham et al.’s method) are unable to produce meaningful values for k when applied to a real-world road traffic dataset. In contrast, a method based on inequality indices shows significant promise in producing much more sensible values for the number k of clusters to be used in k-means clustering of the same road network traffic dataset.
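The abstract does not spell out how the inequality indices are computed or how they are used to choose k, so the following is only a minimal sketch of two standard inequality indices that appear in the reference list, the Gini coefficient and the Theil index. The function names `gini` and `theil`, the restriction to non-negative inputs, and the idea of applying them to cluster sizes are illustrative assumptions, not the authors' procedure.

```python
import math
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = perfectly equal).

    Uses the rank-based closed form over the sorted values:
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    Assumption: inputs are non-negative; an empty or all-zero input yields 0.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def theil(values: Sequence[float]) -> float:
    """Theil index T = (1/n) * sum((x_i / mean) * ln(x_i / mean)).

    Zero entries contribute nothing (the usual 0 * ln 0 = 0 convention).
    Assumption: inputs are non-negative; an empty or all-zero input yields 0.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    mean = total / n
    return sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n


if __name__ == "__main__":
    # Hypothetical cluster sizes for one candidate k; not data from the paper.
    cluster_sizes = [250, 240, 10]
    print("Gini :", gini(cluster_sizes))
    print("Theil:", theil(cluster_sizes))
```

Both indices are 0 when all values are equal and grow as the distribution becomes more concentrated. For n values the Gini coefficient is bounded by (n - 1)/n and the Theil index by ln n.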
Notes
- 1.
- 2.
Here \(k_{max}\) has to be a reasonably large upper bound reflecting the specific characteristics of the dataset [13].
References
Ben-David, S., von Luxburg, U., Pál, D.: A sober look at clustering stability. In: Lugosi, G., Simon, H.U. (eds.) COLT 2006. LNCS (LNAI), vol. 4005, pp. 5–19. Springer, Heidelberg (2006)
Cowell, F.A., Jenkins, S.P.: How much inequality can we explain? A methodology and an application to the United States. Econ. J. 105(429), 412–430 (1995)
Färber, I., Günnemann, S., Kriegel, H.P., Kröger, P., Müller, E., Schubert, E., Seidl, T., Zimek, A.: On using class-labels in evaluation of clusterings. In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings held in conjunction with KDD, p. 1 (2010)
Goloshchapova, O., Lumpe, M.: On the application of inequality indices in comparative software analysis. In: Proceedings of 22nd Australian Software Engineering Conference (ASWEC 2013), pp. 117–126. IEEE Computer Society, Melbourne, June 2013
Hamerly, G., Elkan, C.: Learning the \(k\) in \(k\)-means. In: Thrun, S., Saul, L.K., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems 16, pp. 281–288. The MIT Press, Cambridge (2004)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2013)
Kasturi, J., Acharya, R., Ramanathan, M.: An information theoretic approach for analyzing temporal patterns of gene expression. Bioinformatics 19, 449–458 (2003)
Le, T., Vu, H.L., Nazarathy, Y., Vo, Q.B., Hoogendoorn, S.: Linear-quadratic model predictive control for urban traffic networks. J. Transp. Res. Part C: Emerg. Technol. 36, 498–512 (2013)
van Leeuwaarden, J.S.H., Lefeber, E., Nazarathy, Y., Rooda, J.E.: Model predictive control for the acquisition queue and related queueing networks. In: Proceedings of 5th International Conference on Queueing Theory and Network Applications (QTNA 2010), pp. 193–200. ACM, New York, July 2010
Lumpe, M.: Partition refinement of Component Interaction Automata. Sci. Comput. Program. 78, 27–45 (2012)
von Luxburg, U.: Clustering stability: an overview. Found. Trends Mach. Learn. 2(3), 235–274 (2010)
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge (2013)
Pham, D.T., Dimov, S.S., Nguyen, C.D.: Selection of K in K-means clustering. J. Mech. Eng. Sci. 219(Part C), 103–119 (2005)
Sen, A.K.: On Economic Inequality. Oxford University Press, Oxford (1973)
Serebrenik, A., van den Brand, M.: Theil index for aggregation of software metrics values. In: Proceedings of 26th IEEE International Conference on Software Maintenance (ICSM 2010), pp. 1–9. IEEE Computer Society, Timişoara, September 2010
Shannon, C.E.: A mathematical theory of communication. SIGMOBILE Mob. Comput. Commun. Rev. 5(1), 3–55 (2001)
Theil, H.: Economics and Information Theory. North-Holland Publishing Company, Amsterdam (1967)
Vasa, R., Lumpe, M., Branch, P., Nierstrasz, O.: Comparative analysis of evolving software systems using the Gini coefficient. In: Proceedings of 25th IEEE International Conference on Software Maintenance (ICSM 2009), pp. 179–188. IEEE Computer Society, Edmonton, September 2009
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Lumpe, M., Vo, Q.B. (2015). Finding the k in K-means Clustering: A Comparative Analysis Approach. In: Pfahringer, B., Renz, J. (eds) AI 2015: Advances in Artificial Intelligence. AI 2015. Lecture Notes in Computer Science, vol 9457. Springer, Cham. https://doi.org/10.1007/978-3-319-26350-2_31
DOI: https://doi.org/10.1007/978-3-319-26350-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26349-6
Online ISBN: 978-3-319-26350-2
eBook Packages: Computer Science, Computer Science (R0)