Clustering Large Collection of Biomedical Literature Based on Ontology-Enriched Bipartite Graph Representation and Mutual Refinement Strategy

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3918)

Abstract

In this paper we introduce a novel document clustering approach that solves some major problems of traditional document clustering approaches. Instead of depending on the traditional vector space model, this approach represents a set of documents as a bipartite graph using domain knowledge from an ontology. In this representation, the concepts of the documents are classified according to their relationships with documents, as reflected in the bipartite graph. Using the concept groups, documents are clustered based on the concepts’ contribution to each document. Through the mutual-refinement relationship between concept groups and document groups, the two sets of groups are recursively refined. Our experimental results on MEDLINE articles show that our approach outperforms two leading document clustering algorithms: BiSecting K-means and CLUTO. In addition to its decent performance, our approach provides a meaningful explanation for each document cluster by identifying its most contributing concepts, thus helping users to understand and interpret documents and clustering results.
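
The mutual-refinement idea described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm, whose details are not given here; it is a minimal illustration under stated assumptions: the bipartite graph is a dictionary mapping each document to weighted concepts, every document has at least one concept, every concept has an initial group, and "contribution" is taken to be a plain sum of edge weights. The names `mutual_refinement`, `top_concepts`, and the sample data are hypothetical.

```python
from collections import defaultdict


def mutual_refinement(doc_concepts, init_concept_groups, max_iter=20):
    """Alternately refine document clusters and concept groups.

    doc_concepts: {doc: {concept: weight}}, the document-concept bipartite graph.
    init_concept_groups: {concept: group_id}, an assumed starting grouping.
    """
    concept_group = dict(init_concept_groups)
    doc_cluster = {}
    for _ in range(max_iter):
        # Document step: each document joins the group whose concepts
        # contribute the most total weight to it (ties go to the lowest id).
        new_doc_cluster = {}
        for doc, weights in doc_concepts.items():
            score = defaultdict(float)
            for concept, w in weights.items():
                score[concept_group[concept]] += w
            new_doc_cluster[doc] = max(sorted(score), key=score.get)

        # Concept step: each concept joins the document cluster in which
        # its total edge weight is largest.
        concept_score = defaultdict(lambda: defaultdict(float))
        for doc, weights in doc_concepts.items():
            for concept, w in weights.items():
                concept_score[concept][new_doc_cluster[doc]] += w
        new_concept_group = {
            c: max(sorted(s), key=s.get) for c, s in concept_score.items()
        }

        # Stop once neither side changes any more.
        if new_doc_cluster == doc_cluster and new_concept_group == concept_group:
            break
        doc_cluster, concept_group = new_doc_cluster, new_concept_group
    return doc_cluster, concept_group


def top_concepts(doc_concepts, doc_cluster, cluster, k=2):
    """Return the k concepts with the largest total weight in one cluster."""
    totals = defaultdict(float)
    for doc, weights in doc_concepts.items():
        if doc_cluster[doc] == cluster:
            for concept, w in weights.items():
                totals[concept] += w
    return sorted(totals, key=lambda c: (-totals[c], c))[:k]


# Made-up example data, for illustration only.
docs = {
    "d1": {"diabetes": 2.0, "insulin": 1.0},
    "d2": {"insulin": 2.0, "glucose": 1.0},
    "d3": {"asthma": 2.0, "inhaler": 1.0},
    "d4": {"inhaler": 1.0, "asthma": 1.0, "glucose": 0.5},
}
init = {"diabetes": 0, "insulin": 0, "glucose": 1, "asthma": 1, "inhaler": 1}

clusters, groups = mutual_refinement(docs, init)
print(clusters)
print(top_concepts(docs, clusters, 0))
```

In this toy run the concept "glucose" starts in group 1 but moves to group 0 after the first pass, because most of its weight comes from a document clustered with the diabetes-related concepts. The `top_concepts` helper mirrors the abstract's point that each cluster can be explained by its most contributing concepts.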

This research work is supported in part by the NSF Career grant (NSF IIS 0448023), NSF CCF 0514679, and the PA Dept of Health Tobacco Settlement Formula Grant (#240205, 240196).

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yoo, I., Hu, X. (2006). Clustering Large Collection of Biomedical Literature Based on Ontology-Enriched Bipartite Graph Representation and Mutual Refinement Strategy. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_36

  • DOI: https://doi.org/10.1007/11731139_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33206-0

  • Online ISBN: 978-3-540-33207-7

  • eBook Packages: Computer Science, Computer Science (R0)
