Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering

Paliwal, Shashank; Pudi, Vikram

doi:10.1007/978-3-642-31537-4_43

Shashank Paliwal &
Vikram Pudi

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7376)

Abstract

Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods rely on representing text documents using the simple Bag-of-Words (BOW) model. A document, however, is an organized structure consisting of various text segments or passages. Single-term analysis of the text treats the whole document as a single semantic unit and thus ignores other semantic units such as sentences and passages. In this paper, we attempt to take advantage of the underlying subtopic structure of text documents and investigate whether clustering of text documents can be improved if the text segments of two documents are utilized when calculating the similarity between them. We concentrate on examining the effects of combining the suggested inter-document similarities (based on inter-passage similarities) with traditional inter-document similarities, following a simple approach for the same. Experimental results on standard data sets suggest an improvement in the clustering of text documents.
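
The idea of combining a traditional BOW similarity with a passage-based similarity can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the segmentation algorithm, the way inter-passage similarities are aggregated, or the combination rule. The sketch assumes passages are blank-line-separated blocks, aggregates by averaging each passage's best match in the other document, and mixes the two scores linearly with a hypothetical weight `alpha`. All function names are illustrative.

```python
import math
from collections import Counter


def bow(text):
    """Term-frequency vector from lowercased whitespace tokens (a simplification)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_passages(doc):
    """Assumed segmenter: passages are blocks separated by blank lines."""
    return [p for p in doc.split("\n\n") if p.strip()]


def passage_similarity(doc_a, doc_b):
    """Assumed aggregation: average best-match passage cosine, symmetrized."""
    pa = [bow(p) for p in split_passages(doc_a)]
    pb = [bow(p) for p in split_passages(doc_b)]
    if not pa or not pb:
        return 0.0
    a_to_b = sum(max(cosine(x, y) for y in pb) for x in pa) / len(pa)
    b_to_a = sum(max(cosine(y, x) for x in pa) for y in pb) / len(pb)
    return (a_to_b + b_to_a) / 2


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Linear mix of whole-document and passage-based similarity (alpha is hypothetical)."""
    doc_sim = cosine(bow(doc_a), bow(doc_b))
    return alpha * doc_sim + (1 - alpha) * passage_similarity(doc_a, doc_b)
```

With `alpha=1.0` the score reduces to the plain BOW cosine, so the passage-based term can be switched on gradually when comparing clustering quality.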

Author information

Authors and Affiliations

Center for Data Engineering, International Institute of Information Technology, Hyderabad, India
Shashank Paliwal & Vikram Pudi

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, IBaI, Kohlenstraße 2, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Paliwal, S., Pudi, V. (2012). Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2012. Lecture Notes in Computer Science, vol 7376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31537-4_43

DOI: https://doi.org/10.1007/978-3-642-31537-4_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31536-7
Online ISBN: 978-3-642-31537-4
eBook Packages: Computer Science, Computer Science (R0)
