Finding the k in K-means Clustering: A Comparative Analysis Approach

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9457)

Abstract

This paper explores the application of inequality indices, a concept that has been applied successfully in comparative software analysis and many other domains, to finding the optimal value of k for k-means when clustering road traffic data. We demonstrate that traditional methods for identifying the optimal value of k (such as the gap statistic and Pham et al.'s method) are unable to produce meaningful values of k when applied to a real-world road traffic dataset. In contrast, a method based on inequality indices shows significant promise, producing much more sensible values for the number k of clusters to be used in k-means clustering for the same road network traffic dataset.
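As a rough illustration of what an inequality index measures, the sketch below computes the Gini coefficient, one widely used inequality index, over a list of cluster sizes. It is not the selection procedure developed in the paper, which the abstract does not spell out; the function name `gini` and the example cluster sizes are assumptions made purely for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    Returns 0.0 for a perfectly equal distribution and approaches 1.0
    as the total concentrates in a single element.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on ascending-sorted data (ranks i = 1..n):
    #   G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical cluster sizes for two candidate clusterings (illustrative only).
balanced_sizes = [25, 25, 25, 25]
skewed_sizes = [1, 2, 3, 94]

print("balanced:", gini(balanced_sizes))
print("skewed:  ", gini(skewed_sizes))
```

A low value indicates that the quantity being measured is spread evenly across clusters, while a value near 1 indicates that a single cluster dominates. How such an index is turned into a criterion for choosing k is the subject of the paper itself.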

Notes

  1. https://www.vicroads.vic.gov.au/business-and-industry/design-and-management/-traffic-signals-and-systems/traffic-signals-in-victoria.

  2. Here \(k_{max}\) has to be a reasonably large upper bound reflecting the specific characteristics of the dataset [13].

References

  1. Ben-David, S., von Luxburg, U., Pál, D.: A sober look at clustering stability. In: Lugosi, G., Simon, H.U. (eds.) COLT 2006. LNCS (LNAI), vol. 4005, pp. 5–19. Springer, Heidelberg (2006)

  2. Cowell, F.A., Jenkins, S.P.: How much inequality can we explain? A methodology and an application to the United States. Econ. J. 105(429), 412–430 (1995)

  3. Färber, I., Günnemann, S., Kriegel, H.P., Kröger, P., Müller, E., Schubert, E., Seidl, T., Zimek, A.: On using class-labels in evaluation of clusterings. In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings held in conjunction with KDD, p. 1 (2010)

  4. Goloshchapova, O., Lumpe, M.: On the application of inequality indices in comparative software analysis. In: Proceedings of 22nd Australian Software Engineering Conference (ASWEC 2013), pp. 117–126. IEEE Computer Society, Melbourne, June 2013

  5. Hamerly, G., Elkan, C.: Learning the \(k\) in \(k\)-means. In: Thrun, S., Saul, L.K., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems 16, pp. 281–288. The MIT Press, Cambridge (2004)

  6. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2013)

  7. Kasturi, J., Acharya, R., Ramanathan, M.: An information theoretic approach for analyzing temporal patterns of gene expression. Bioinformatics 19, 449–458 (2003)

  8. Le, T., Vu, H.L., Nazarathy, Y., Vo, Q.B., Hoogendoorn, S.: Linear-quadratic model predictive control for urban traffic networks. J. Transp. Res. Part C: Emerg. Technol. 36, 498–512 (2013)

  9. van Leeuwaarden, J.S.H., Lefeber, E., Nazarathy, Y., Rooda, J.E.: Model predictive control for the acquisition queue and related queueing networks. In: Proceedings of 5th International Conference on Queueing Theory and Network Applications (QTNA 2010), pp. 193–200. ACM, New York, July 2010

  10. Lumpe, M.: Partition refinement of Component Interaction Automata. Sci. Comput. Program. 78, 27–45 (2012)

  11. von Luxburg, U.: Clustering stability: an overview. Found. Trends Mach. Learn. 2(3), 235–274 (2010)

  12. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge (2013)

  13. Pham, D.T., Dimov, S.S., Nguyen, C.D.: Selection of K in K-means clustering. J. Mech. Eng. Sci. 219(Part C), 103–119 (2005)

  14. Sen, A.K.: On Economic Inequality. Oxford University Press, Oxford (1973)

  15. Serebrenik, A., van den Brand, M.: Theil index for aggregation of software metrics values. In: Proceedings of 26th IEEE International Conference on Software Maintenance (ICSM 2010), pp. 1–9. IEEE Computer Society, Timişoara, September 2010

  16. Shannon, C.E.: A mathematical theory of communication. SIGMOBILE Mob. Comput. Commun. Rev. 5(1), 3–55 (2001)

  17. Theil, H.: Economics and Information Theory. North-Holland Publishing Company, Amsterdam (1967)

  18. Vasa, R., Lumpe, M., Branch, P., Nierstrasz, O.: Comparative analysis of evolving software systems using the Gini coefficient. In: Proceedings of 25th IEEE International Conference on Software Maintenance (ICSM 2009), pp. 179–188. IEEE Computer Society, Edmonton, September 2009

Author information

Correspondence to Markus Lumpe.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Lumpe, M., Vo, Q.B. (2015). Finding the k in K-means Clustering: A Comparative Analysis Approach. In: Pfahringer, B., Renz, J. (eds) AI 2015: Advances in Artificial Intelligence. AI 2015. Lecture Notes in Computer Science, vol 9457. Springer, Cham. https://doi.org/10.1007/978-3-319-26350-2_31

  • DOI: https://doi.org/10.1007/978-3-319-26350-2_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26349-6

  • Online ISBN: 978-3-319-26350-2

  • eBook Packages: Computer Science, Computer Science (R0)
