Document Clustering Using Incremental and Pairwise Approaches

Tran, Tien; Nayak, Richi; Bruza, Peter

doi:10.1007/978-3-540-85902-4_20

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4862)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

This paper presents the experiments and results of an approach for clustering the large Wikipedia dataset in the INEX 2007 Document Mining Challenge. The approach combines an incremental clustering method with a pairwise clustering method. The incremental method first reduces the dimension of the dataset by grouping the documents into a number of clusters that is not fixed in advance. The pairwise method then clusters this lower-dimension dataset into the required number of clusters. In this way, the large number of documents is clustered successfully and the accuracy of the clustering solution is achieved.
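
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the document representation, the similarity measure, or the merge criterion, so the cosine similarity, the centroid updates, the `threshold` parameter, and all function names below are assumptions chosen for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def incremental_pass(docs, threshold):
    """Stage 1 (assumed form): one pass over the documents.

    Each document joins the most similar existing cluster if the
    similarity to its centroid reaches `threshold`; otherwise it opens
    a new cluster. The number of clusters is not fixed in advance.
    """
    clusters, centroids = [], []
    for i, doc in enumerate(docs):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(doc, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            centroids.append(list(doc))
        else:
            clusters[best].append(i)
            centroids[best] = mean([docs[j] for j in clusters[best]])
    return clusters


def pairwise_merge(docs, clusters, k):
    """Stage 2 (assumed form): compare cluster centroids pairwise and
    merge the most similar pair until only `k` clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        centroids = [mean([docs[j] for j in c]) for c in clusters]
        best_pair, best_sim = None, -1.0
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(centroids[a], centroids[b])
                if sim > best_sim:
                    best_pair, best_sim = (a, b), sim
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


def two_stage_cluster(docs, threshold, k):
    """Run the incremental pass, then reduce its output to `k` clusters."""
    return pairwise_merge(docs, incremental_pass(docs, threshold), k)
```

The point of the design is cost: the pairwise stage is quadratic in the number of items it compares, so running it on the clusters from the first pass rather than on individual documents is what would make a large collection tractable. The clusters are returned as lists of document indices.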

Author information

Authors and Affiliations

Information Technology, Queensland University of Technology, Brisbane, Australia
Tien Tran, Richi Nayak & Peter Bruza

Editor information

Norbert Fuhr, Jaap Kamps, Mounia Lalmas, Andrew Trotman

About this paper

Cite this paper

Tran, T., Nayak, R., Bruza, P. (2008). Document Clustering Using Incremental and Pairwise Approaches. In: Fuhr, N., Kamps, J., Lalmas, M., Trotman, A. (eds) Focused Access to XML Documents. INEX 2007. Lecture Notes in Computer Science, vol 4862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85902-4_20

DOI: https://doi.org/10.1007/978-3-540-85902-4_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85901-7
Online ISBN: 978-3-540-85902-4
eBook Packages: Computer Science, Computer Science (R0)
