Using Text Segmentation to Enhance the Cluster Hypothesis

  • Conference paper
Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5253)

Abstract

Passage Retrieval, an alternative approach to Information Retrieval, assesses text fragments independently rather than judging the global relevance of documents. In this setting, a document is not penalized when its relevant information is surrounded by text that deviates from the topic of interest. In this paper, we study the impact of considering such text fragments on a document clustering process. The use of clustering in Information Retrieval is mainly supported by the cluster hypothesis, which states that relevant documents tend to be more similar to one another than to non-relevant documents, so a clustering process is likely to group them together. Previous experiments have shown that clustering the first documents retrieved in response to a user’s query allows Information Retrieval systems to improve their effectiveness. In the clustering processes used in these studies, documents were considered as wholes. However, the assumption that a document can address more than one topic or concept may also affect the document clustering process. Considering the passages of the retrieved documents separately may make it possible to create clusters that are more representative of the topics addressed. Several approaches have been assessed, and the results show that using text fragments in the clustering process can indeed be worthwhile.
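
The intuition behind clustering passages rather than whole documents can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the blank-line passage splitter, the raw term-frequency vectors, the single-link grouping, and the threshold values are all assumptions chosen for brevity, and the toy documents are invented for the example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (hypothetical, no stemming or weighting)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_passages(doc):
    """Assumed segmentation: one passage per blank-line-separated block."""
    return [p.strip() for p in doc.split("\n\n") if p.strip()]


def single_link_clusters(texts, threshold):
    """Group items whose pairwise similarity reaches the threshold (single link)."""
    vectors = [vectorize(t) for t in texts]
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy collection: the first document covers two topics, the others one each.
docs = [
    "jaguar cars engine speed\n\njaguar cat jungle prey",
    "sports cars engine speed",
    "big cat jungle prey",
]
passages = [p for d in docs for p in split_passages(d)]

whole_doc_clusters = single_link_clusters(docs, threshold=0.5)
passage_clusters = single_link_clusters(passages, threshold=0.5)
```

In this toy setting the two-topic document is only moderately similar to each single-topic document, so at the whole-document level it either stays isolated (higher threshold) or links everything into one cluster (lower threshold). Its two passages, taken separately, each join the document that shares their topic.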

Editor information

Danail Dochev Marco Pistore Paolo Traverso

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lamprier, S., Amghar, T., Levrat, B., Saubion, F. (2008). Using Text Segmentation to Enhance the Cluster Hypothesis. In: Dochev, D., Pistore, M., Traverso, P. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2008. Lecture Notes in Computer Science, vol 5253. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85776-1_7

  • DOI: https://doi.org/10.1007/978-3-540-85776-1_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85775-4

  • Online ISBN: 978-3-540-85776-1

  • eBook Packages: Computer Science, Computer Science (R0)
