Finding the k in K-means Clustering: A Comparative Analysis Approach

Lumpe, Markus; Vo, Quoc Bao

doi:10.1007/978-3-319-26350-2_31

Markus Lumpe & Quoc Bao Vo

Conference paper
First Online: 22 November 2015

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9457)

Abstract

This paper explores the use of inequality indices, a concept applied successfully in comparative software analysis and many other domains, to find the optimal value of k for k-means when clustering road traffic data. We demonstrate that traditional methods for identifying the optimal k (such as the gap statistic and Pham et al.’s method) fail to produce meaningful values when applied to a real-world road traffic dataset. In contrast, a method based on inequality indices shows significant promise in producing much more sensible values for the number k of clusters in k-means clustering of the same road network traffic dataset.
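The abstract does not say which inequality index is used or how it is applied to candidate clusterings. Purely as an illustration, and not as the authors' procedure, the sketch below computes two standard inequality indices that appear in the reference list, the Gini coefficient and the Theil index, over a list of non-negative values. The function names and the example cluster sizes are hypothetical.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values: 0 means perfect equality."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the values sorted in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def theil(values):
    """Theil T index of non-negative values: 0 means perfect equality."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    # Zero entries contribute nothing (the limit of x * ln(x) as x -> 0 is 0).
    return sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n


# Hypothetical cluster sizes for two candidate clusterings (made-up numbers).
balanced = [250, 250, 250, 250]
skewed = [970, 10, 10, 10]

for name, sizes in (("balanced", balanced), ("skewed", skewed)):
    print(name, round(gini(sizes), 3), round(theil(sizes), 3))
```

Both indices are 0 for perfectly equal values and grow as the distribution concentrates. For n values, the Gini coefficient is bounded by (n - 1)/n and the Theil index by ln n.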

Notes

1. https://www.vicroads.vic.gov.au/business-and-industry/design-and-management/-traffic-signals-and-systems/traffic-signals-in-victoria.
2. Here \(k_{max}\) has to be a reasonably large upper bound reflecting the specific characteristics of the dataset [13].

References

1. Ben-David, S., von Luxburg, U., Pál, D.: A sober look at clustering stability. In: Lugosi, G., Simon, H.U. (eds.) COLT 2006. LNCS (LNAI), vol. 4005, pp. 5–19. Springer, Heidelberg (2006)
2. Cowell, F.A., Jenkins, S.P.: How much inequality can we explain? A methodology and an application to the United States. Econ. J. 105(429), 412–430 (1995)
3. Färber, I., Günnemann, S., Kriegel, H.P., Kröger, P., Müller, E., Schubert, E., Seidl, T., Zimek, A.: On using class-labels in evaluation of clusterings. In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings held in conjunction with KDD, p. 1 (2010)
4. Goloshchapova, O., Lumpe, M.: On the application of inequality indices in comparative software analysis. In: Proceedings of 22nd Australian Software Engineering Conference (ASWEC 2013), pp. 117–126. IEEE Computer Society, Melbourne, June 2013
5. Hamerly, G., Elkan, C.: Learning the \(k\) in \(k\)-means. In: Thrun, S., Saul, L.K., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems 16, pp. 281–288. The MIT Press, Cambridge (2004)
6. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2013)
7. Kasturi, J., Acharya, R., Ramanathan, M.: An information theoretic approach for analyzing temporal patterns of gene expression. Bioinformatics 19, 449–458 (2003)
8. Le, T., Vu, H.L., Nazarathy, Y., Vo, Q.B., Hoogendoorn, S.: Linear-quadratic model predictive control for urban traffic networks. J. Transp. Res. Part C: Emerg. Technol. 36, 498–512 (2013)
9. van Leeuwaarden, J.S.H., Lefeber, E., Nazarathy, Y., Rooda, J.E.: Model predictive control for the acquisition queue and related queueing networks. In: Proceedings of 5th International Conference on Queueing Theory and Network Applications (QTNA 2010), pp. 193–200. ACM, New York, July 2010
10. Lumpe, M.: Partition refinement of Component Interaction Automata. Sci. Comput. Program. 78, 27–45 (2012)
11. von Luxburg, U.: Clustering stability: an overview. Found. Trends Mach. Learn. 2(3), 235–274 (2010)
12. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge (2013)
13. Pham, D.T., Dimov, S.S., Nguyen, C.D.: Selection of K in K-means clustering. J. Mech. Eng. Sci. 219(Part C), 103–119 (2005)
14. Sen, A.K.: On Economic Inequality. Oxford University Press, Oxford (1973)
15. Serebrenik, A., van den Brand, M.: Theil index for aggregation of software metrics values. In: Proceedings of 26th IEEE International Conference on Software Maintenance (ICSM 2010), pp. 1–9. IEEE Computer Society, Timişoara, September 2010
16. Shannon, C.E.: A mathematical theory of communication. SIGMOBILE Mob. Comput. Commun. Rev. 5(1), 3–55 (2001)
17. Theil, H.: Economics and Information Theory. North-Holland Publishing Company, Amsterdam (1967)
18. Vasa, R., Lumpe, M., Branch, P., Nierstrasz, O.: Comparative analysis of evolving software systems using the Gini coefficient. In: Proceedings of 25th IEEE International Conference on Software Maintenance (ICSM 2009), pp. 179–188. IEEE Computer Society, Edmonton, September 2009

Author information

Authors and Affiliations

Faculty of Science, Engineering and Technology, Swinburne University of Technology, P.O. Box 218, Hawthorn, VIC, 3122, Australia
Markus Lumpe & Quoc Bao Vo

Corresponding author

Correspondence to Markus Lumpe.

Editor information

Editors and Affiliations

The University of Waikato, Hamilton, New Zealand
Bernhard Pfahringer
The Australian National University, Canberra, Australian Capital Territory, Australia
Jochen Renz

About this paper

Cite this paper

Lumpe, M., Vo, Q.B. (2015). Finding the k in K-means Clustering: A Comparative Analysis Approach. In: Pfahringer, B., Renz, J. (eds) AI 2015: Advances in Artificial Intelligence. AI 2015. Lecture Notes in Computer Science, vol 9457. Springer, Cham. https://doi.org/10.1007/978-3-319-26350-2_31

DOI: https://doi.org/10.1007/978-3-319-26350-2_31
Published: 22 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26349-6
Online ISBN: 978-3-319-26350-2
eBook Packages: Computer Science, Computer Science (R0)
