Clustering Large Collection of Biomedical Literature Based on Ontology-Enriched Bipartite Graph Representation and Mutual Refinement Strategy

Yoo, Illhoi; Hu, Xiaohua

doi:10.1007/11731139_36

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3918)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

In this paper we introduce a novel document clustering approach that addresses some major problems of traditional document clustering approaches. Instead of depending on the traditional vector space model, this approach represents a set of documents as bipartite graphs using domain knowledge from an ontology. In this representation, the concepts of the documents are grouped according to their relationships with documents, as reflected in the bipartite graph. Using the concept groups, documents are clustered based on the concepts' contribution to each document. Concept groups and document groups are then refined recursively through their mutual-refinement relationship. Our experimental results on MEDLINE articles show that our approach outperforms two leading document clustering algorithms: BiSecting K-means and CLUTO. Beyond its clustering performance, our approach provides a meaningful explanation for each document cluster by identifying its most contributing concepts, and thus helps users understand and interpret documents and clustering results.
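The abstract gives no formulas, so the following is only a minimal sketch of what a mutual-refinement loop over a document-concept bipartite graph could look like, not the authors' algorithm. Everything in it is an assumption made for illustration: the function names `mutual_refinement` and `top_concepts`, the edge weights, the "largest summed weight" assignment rule, the initial concept grouping, and the toy documents and concepts.

```python
from collections import defaultdict


def mutual_refinement(edges, initial_concept_group, max_iters=20):
    """Alternately refine document clusters and concept groups.

    edges: {doc: {concept: weight}}, the bipartite graph (hypothetical weights).
    initial_concept_group: {concept: group id}; must cover every concept in edges.
    Returns (doc_cluster, concept_group) as two dicts.
    """
    concept_group = dict(initial_concept_group)
    doc_cluster = {}
    for _ in range(max_iters):
        # Step 1: each document joins the group whose concepts contribute
        # the largest total weight to it (ties go to the smallest group id).
        new_doc_cluster = {}
        for doc, concepts in edges.items():
            score = defaultdict(float)
            for concept, weight in concepts.items():
                score[concept_group[concept]] += weight
            if score:  # skip documents with no concepts
                new_doc_cluster[doc] = max(sorted(score), key=score.get)

        # Step 2: each concept joins the cluster whose documents carry
        # the largest total weight of that concept.
        totals = defaultdict(lambda: defaultdict(float))
        for doc, concepts in edges.items():
            for concept, weight in concepts.items():
                totals[concept][new_doc_cluster[doc]] += weight
        new_concept_group = dict(concept_group)
        for concept, per_cluster in totals.items():
            new_concept_group[concept] = max(sorted(per_cluster), key=per_cluster.get)

        # Stop once neither side changes any more.
        if new_doc_cluster == doc_cluster and new_concept_group == concept_group:
            break
        doc_cluster, concept_group = new_doc_cluster, new_concept_group
    return doc_cluster, concept_group


def top_concepts(edges, doc_cluster, cluster, k=2):
    """Return the k concepts with the largest total weight inside one cluster."""
    totals = defaultdict(float)
    for doc, concepts in edges.items():
        if doc_cluster.get(doc) == cluster:
            for concept, weight in concepts.items():
                totals[concept] += weight
    return sorted(totals, key=lambda c: (-totals[c], c))[:k]


# Toy data, invented for this sketch only.
edges = {
    "d1": {"insulin": 2.0, "glucose": 1.0},
    "d2": {"glucose": 2.0, "diabetes": 1.0},
    "d3": {"tumor": 2.0, "chemotherapy": 1.0},
    "d4": {"chemotherapy": 2.0, "glucose": 0.5},
}
initial = {"insulin": 0, "glucose": 0, "diabetes": 1, "tumor": 1, "chemotherapy": 1}

doc_cluster, concept_group = mutual_refinement(edges, initial)
print(doc_cluster)
print(top_concepts(edges, doc_cluster, 0))
```

In this toy run the concept "diabetes" starts in the wrong group and is moved once the documents that mention it have been clustered, which is the kind of two-way correction the abstract calls mutual refinement. The `top_concepts` helper mirrors the idea of explaining a cluster by its most contributing concepts.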

This research work is supported in part by the NSF Career grant (NSF IIS 0448023), NSF CCF 0514679, and the PA Dept of Health Tobacco Settlement Formula Grant (#240205, 240196).


Author information

Authors and Affiliations

College of Information Science and Technology, Drexel University, Philadelphia, PA, 19104, USA
Illhoi Yoo & Xiaohua Hu

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore
Wee-Keong Ng
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Computer Engineering, Nanyang Technological University, 639798, Singapore
Kuiyu Chang

About this paper

Cite this paper

Yoo, I., Hu, X. (2006). Clustering Large Collection of Biomedical Literature Based on Ontology-Enriched Bipartite Graph Representation and Mutual Refinement Strategy. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_36

DOI: https://doi.org/10.1007/11731139_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science, Computer Science (R0)
