Abstract
This paper addresses the clustering of document collections. We compare the conventional vector-model approach, using cosine similarity and Euclidean distance, with a novel method we have developed for clustering graph-based data with the standard k-means algorithm. The proposed method is evaluated with five different graph distance measures under three clustering performance indices. The experiments are performed on two separate document collections. The results show that the graph-based approach performs as well as, or even better than, the vector-based methods when normalized graph distance measures are used.
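To make the contrast between the two families of measures concrete, the following is a minimal sketch and not the authors' implementation. It computes the cosine and Euclidean distances on term-frequency vectors, and one well-known normalized graph distance, the maximum common subgraph distance d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|). The abstract does not name the five graph measures evaluated, so the choice of this one is an assumption. Two further assumptions are made for illustration: node labels are unique within a graph (so the maximum common subgraph reduces to set intersection), and graph size is counted as nodes plus edges. The names `Graph`, `graph_size`, and `mcs_distance` are hypothetical.

```python
import math
from collections import Counter
from typing import FrozenSet, NamedTuple, Tuple


class Graph(NamedTuple):
    """Hypothetical document graph: uniquely labelled nodes, directed edges."""
    nodes: FrozenSet[str]
    edges: FrozenSet[Tuple[str, str]]


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: Counter, b: Counter) -> float:
    """Euclidean distance over the union of terms (missing terms count as 0)."""
    terms = a.keys() | b.keys()
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))


def graph_size(g: Graph) -> int:
    # Assumption: size is the number of nodes plus the number of edges.
    return len(g.nodes) + len(g.edges)


def mcs_distance(g1: Graph, g2: Graph) -> float:
    """Normalized distance based on the maximum common subgraph.

    Assumes unique node labels, so the common subgraph is simply the
    intersection of the node sets and of the edge sets.
    """
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    common = len(g1.nodes & g2.nodes) + len(g1.edges & g2.edges)
    return 1.0 - common / largest


g1 = Graph(frozenset({"a", "b", "c"}), frozenset({("a", "b"), ("b", "c")}))
g2 = Graph(frozenset({"a", "b", "d"}), frozenset({("a", "b"), ("b", "d")}))
v1 = Counter({"a": 2, "b": 1, "c": 1})
v2 = Counter({"a": 2, "b": 1, "d": 1})

print(mcs_distance(g1, g2))
print(cosine_distance(v1, v2))
print(euclidean_distance(v1, v2))
```

Because the graph distance is normalized by the size of the larger graph, it stays in the range [0, 1] regardless of document length, which is the property the abstract associates with the best-performing graph measures.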
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schenker, A., Last, M., Bunke, H., Kandel, A. (2003). Comparison of Distance Measures for Graph-Based Clustering of Documents. In: Hancock, E., Vento, M. (eds) Graph Based Representations in Pattern Recognition. GbRPR 2003. Lecture Notes in Computer Science, vol 2726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45028-9_18
DOI: https://doi.org/10.1007/3-540-45028-9_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40452-1
Online ISBN: 978-3-540-45028-3
eBook Packages: Springer Book Archive