Towards Efficient Similar Sentences Extraction

Gu, Yanhui; Yang, Zhenglu; Nakano, Miyuki; Kitsuregawa, Masaru

doi:10.1007/978-3-642-32639-4_33

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7435)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

Extracting similar sentences is an essential task in many applications, such as natural language processing, Web page retrieval, and question answering. Although many studies have explored this task, most focus on improving effectiveness. In this paper, we address efficiency: given a sentence collection, how can the top-k sentences that are semantically most similar to a query be discovered efficiently? This question matters in real applications because data collections have grown very large and existing state-of-the-art strategies cannot meet users' performance requirements. We propose efficient strategies to tackle the problem based on a general framework. Extensive experimental evaluations demonstrate that our proposal is more efficient than the state-of-the-art approach.
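
To make the problem statement concrete, here is a minimal brute-force sketch of top-k similar sentence retrieval. It is not the authors' method: the abstract does not describe their strategies or their similarity measure, so the `jaccard` token-overlap function, the `top_k_similar` helper, and the sample sentences are all assumptions chosen for illustration. The sketch scores every sentence in the collection, which is the kind of exhaustive scan an efficient approach would presumably try to avoid.

```python
import heapq


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in, NOT the paper's semantic measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def top_k_similar(query: str, sentences: list[str], k: int) -> list[tuple[float, str]]:
    """Score every sentence against the query and return the k best (brute force)."""
    scored = ((jaccard(query, s), s) for s in sentences)
    # heapq.nlargest consumes the generator and retains only k candidates.
    return heapq.nlargest(k, scored)


# Hypothetical sentence collection, for illustration only.
collection = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]
result = top_k_similar("the cat sat on the mat", collection, k=2)
for score, sentence in result:
    print(f"{score:.3f}  {sentence}")
```

The cost of this baseline grows linearly with the collection size for every query, which illustrates why efficiency becomes a concern as collections grow.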

Author information

Authors and Affiliations

Institute of Industrial Science, The University of Tokyo, Japan
Yanhui Gu, Zhenglu Yang, Miyuki Nakano & Masaru Kitsuregawa

Editor information

Editors and Affiliations

School of Electrical and Electronic Engineering, The University of Manchester, M13 9PL, Manchester, UK
Hujun Yin
Department of Electrical Engineering, Federal University of Rio Grande do Norte, Lagoa Nova, 59072-970, Natal, RN, Brazil
José A. F. Costa
Department of Teleinformatics Engineering, Federal University of Ceará, Campus of Pici, CP 6005, 60455-760, Fortaleza, CE, Brazil
Guilherme Barreto

About this paper

Cite this paper

Gu, Y., Yang, Z., Nakano, M., Kitsuregawa, M. (2012). Towards Efficient Similar Sentences Extraction. In: Yin, H., Costa, J.A.F., Barreto, G. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2012. IDEAL 2012. Lecture Notes in Computer Science, vol 7435. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32639-4_33

DOI: https://doi.org/10.1007/978-3-642-32639-4_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32638-7
Online ISBN: 978-3-642-32639-4
eBook Packages: Computer Science, Computer Science (R0)
