Using Text Segmentation to Enhance the Cluster Hypothesis

Lamprier, Sylvain; Amghar, Tassadit; Levrat, Bernard; Saubion, Frédéric

doi:10.1007/978-3-540-85776-1_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5253)

Included in the following conference series:

International Conference on Artificial Intelligence: Methodology, Systems, and Applications

Abstract

Passage Retrieval is an alternative approach to Information Retrieval that considers text fragments independently instead of assessing the global relevance of documents. In this setting, a document is not penalized when its relevant information is surrounded by text that deviates from the topic of interest. In this paper, we study the impact of considering such text fragments in a document clustering process. The use of clustering in Information Retrieval is mainly supported by the cluster hypothesis, which states that relevant documents tend to be more similar to one another than to non-relevant documents, and hence that a clustering process is likely to group them together. Previous experiments have shown that clustering the top-ranked documents retrieved in response to a user's query allows Information Retrieval systems to improve their effectiveness. In the clustering processes used in these studies, documents were considered as wholes. However, the fact that a document can address more than one topic or concept may also affect the document clustering process. Considering passages of the retrieved documents separately may make it possible to create clusters that are more representative of the topics addressed. Several approaches have been assessed, and the results show that using text fragments in the clustering process may indeed be worthwhile.
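
The intuition behind clustering passages rather than whole documents can be illustrated with a minimal sketch. It is not the segmentation or clustering method evaluated in the paper: the fixed-size word windows, the raw term-frequency vectors, the toy texts, and the helper names (bag_of_words, cosine, split_passages) are all assumptions made for illustration. The sketch only shows that a document mixing two topics can be less similar to a single-topic document than its best-matching passage is.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector from a naive whitespace tokenizer (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_passages(text, size):
    """Fixed-size word windows: a crude stand-in for a real text segmenter."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Toy data (hypothetical): one document covering two topics,
# and one document covering only the first of them.
mixed_doc = "jaguar cat jungle prey hunt jaguar car engine speed road"
cat_doc = "jaguar cat jungle prey"

# Similarity when the mixed document is treated as a whole.
whole_sim = cosine(bag_of_words(mixed_doc), bag_of_words(cat_doc))

# Similarity of each passage of the mixed document taken separately.
passage_sims = [
    cosine(bag_of_words(p), bag_of_words(cat_doc))
    for p in split_passages(mixed_doc, 5)
]
best_passage_sim = max(passage_sims)

print(f"whole document: {whole_sim:.3f}")
print(f"best passage:   {best_passage_sim:.3f}")
```

In this toy case the off-topic half of the mixed document lowers its whole-document similarity, while the on-topic passage alone matches the single-topic document more closely. A clustering step operating on passages could therefore place that passage with the documents on the same topic.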

Author information

Authors and Affiliations

LERIA - University of Angers, 2 Bd Lavoisier, 49000, Angers, France
Sylvain Lamprier, Tassadit Amghar, Bernard Levrat & Frédéric Saubion

Editor information

Danail Dochev, Marco Pistore, Paolo Traverso

About this paper

Cite this paper

Lamprier, S., Amghar, T., Levrat, B., Saubion, F. (2008). Using Text Segmentation to Enhance the Cluster Hypothesis. In: Dochev, D., Pistore, M., Traverso, P. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2008. Lecture Notes in Computer Science, vol 5253. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85776-1_7

DOI: https://doi.org/10.1007/978-3-540-85776-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85775-4
Online ISBN: 978-3-540-85776-1
eBook Packages: Computer Science (R0)
