Comparison of Distance Measures for Graph-Based Clustering of Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2726))

Abstract

In this paper we describe work on clustering document collections. We compare the conventional vector-model approach, using cosine similarity and Euclidean distance, with a novel method we have developed for clustering graph-based data with the standard k-means algorithm. The proposed method is evaluated using five different graph distance measures under three clustering performance indices. The experiments are performed on two separate document collections. The results show that the graph-based approach performs as well as vector-based methods, or even better when normalized graph distance measures are used.
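The two kinds of measure contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that documents are either term-weight vectors or graphs whose node labels are unique, so that the maximum common subgraph reduces to intersecting node and edge sets, and that graph size is counted as nodes plus edges. The graph distance shown is one plausible normalized measure based on the maximum common subgraph; the abstract does not say which five measures were used, and all function names here are hypothetical.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # convention chosen here for empty documents
    return 1.0 - dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def graph_size(nodes, edges):
    """Assumed size measure: number of nodes plus number of edges."""
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Normalized distance 1 - |mcs(g1, g2)| / max(|g1|, |g2|).

    Each graph is a (nodes, edges) pair of sets; edges are (source, target)
    tuples of node labels. With unique node labels (an assumption), the
    maximum common subgraph is the intersection of the node and edge sets.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    largest = max(graph_size(nodes1, edges1), graph_size(nodes2, edges2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    common = graph_size(nodes1 & nodes2, edges1 & edges2)
    return 1.0 - common / largest


# Two small term graphs sharing the nodes "a", "b" and the edge a -> b.
doc1 = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
doc2 = ({"a", "b", "d"}, {("a", "b"), ("b", "d")})
print(mcs_distance(doc1, doc2))
print(cosine_distance([1, 2], [2, 4]), euclidean_distance([0, 0], [3, 4]))
```

Because the graph distance is divided by the size of the larger graph, it stays in the range [0, 1] regardless of document length, which is the sense of "normalized" assumed in this sketch.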

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Schenker, A., Last, M., Bunke, H., Kandel, A. (2003). Comparison of Distance Measures for Graph-Based Clustering of Documents. In: Hancock, E., Vento, M. (eds) Graph Based Representations in Pattern Recognition. GbRPR 2003. Lecture Notes in Computer Science, vol 2726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45028-9_18

  • DOI: https://doi.org/10.1007/3-540-45028-9_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40452-1

  • Online ISBN: 978-3-540-45028-3

  • eBook Packages: Springer Book Archive
