Comparison of Distance Measures for Graph-Based Clustering of Documents

Schenker, Adam; Last, Mark; Bunke, Horst; Kandel, Abraham

doi:10.1007/3-540-45028-9_18

Conference paper
First Online: 01 January 2003

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2726)

Abstract

In this paper we describe work relating to clustering of document collections. We compare the conventional vector-model approach using cosine similarity and Euclidean distance to a novel method we have developed for clustering graph-based data with the standard k-means algorithm. The proposed method is evaluated using five different graph distance measures under three clustering performance indices. The experiments are performed on two separate document collections. The results show the graph-based approach performs as well as vector-based methods or even better when using normalized graph distance measures.
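
The abstract names the two vector-model baselines (cosine similarity and Euclidean distance) but not the five graph distance measures. The following is a minimal sketch of how such distances might be computed, not the authors' implementation. It assumes that each document graph has uniquely labelled nodes (one node per term), that graph size is counted as nodes plus edges, and that the normalized graph distance is the maximum-common-subgraph distance of Bunke and Shearer (1998); whether this is one of the five measures evaluated in the paper is an assumption. All function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Vector-model baseline: 1 minus the cosine similarity of two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for empty vectors (assumption)
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Vector-model baseline: straight-line distance between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def graph_size(graph):
    """Size of a graph given as (node_set, edge_set); nodes + edges is an assumption."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Normalized distance 1 - |mcs(g1, g2)| / max(|g1|, |g2|).

    With unique node labels, the maximum common subgraph reduces to the
    intersection of the node sets and of the (directed) edge sets.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    mcs_size = len(nodes1 & nodes2) + len(edges1 & edges2)
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - mcs_size / largest


# Two toy document graphs: nodes are terms, edges are ordered term adjacencies.
doc_a = ({"graph", "distance", "cluster"},
         {("graph", "distance"), ("distance", "cluster")})
doc_b = ({"graph", "distance", "vector"},
         {("graph", "distance")})

print(mcs_distance(doc_a, doc_b))
print(cosine_distance([1, 1, 1, 0], [1, 1, 0, 1]))
print(euclidean_distance([1, 1, 1, 0], [1, 1, 0, 1]))
```

Because `mcs_distance` always lies in [0, 1] regardless of graph size, it is a normalized measure in the sense the abstract refers to. A k-means variant over graphs would additionally need a graph-valued cluster representative, which this sketch does not attempt.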

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave. ENB 118, Tampa, FL, 33620, USA
Adam Schenker & Abraham Kandel
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Mark Last
Department of Computer Science, University of Bern, Neubrückstrasse 10, CH-3012, Bern, Switzerland
Horst Bunke

Editor information

Editors and Affiliations

Department of Computer Science, University of York, York, YO1 5DD, UK
Edwin Hancock
D.I.I.I.E., Università degli Studi di Salerno, Via Ponte don Melillo, 1, 84084, Fisciano (SA), Italy
Mario Vento

About this paper

Cite this paper

Schenker, A., Last, M., Bunke, H., Kandel, A. (2003). Comparison of Distance Measures for Graph-Based Clustering of Documents. In: Hancock, E., Vento, M. (eds) Graph Based Representations in Pattern Recognition. GbRPR 2003. Lecture Notes in Computer Science, vol 2726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45028-9_18

DOI: https://doi.org/10.1007/3-540-45028-9_18
Published: 24 June 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40452-1
Online ISBN: 978-3-540-45028-3
eBook Packages: Springer Book Archive
