Abstract
Users face the Vocabulary Gap problem when attempting to retrieve relevant textual documents from small databases, especially when only a few documents are relevant, because queries and relevant documents are likely to use different terms to describe the same concept. To enable comparison of different approaches to semantic search in small textual databases, the PIKES team constructed an annotated test collection and Gold Standard comprising 35 search queries and 331 articles. We present two possible solutions. In the first, we index an unannotated version of the PIKES collection using Latent Semantic Analysis (LSA), retrieving relevant documents using a combination of query coordination and automatic relevance feedback. Although this approach outperforms prior work, it depends on the underlying collection and is not necessarily scalable. In the second approach, we use an LSA Model generated by SEMILAR from a Wikipedia dump to generate a Term Similarity Matrix (TSM). Queries are automatically expanded with related terms from the TSM and submitted to a term-by-document matrix Vector Space Model of the PIKES collection. Coupled with a combination of query coordination and automatic relevance feedback, this approach also outperforms prior work. Its advantage is that it is independent of the underlying document collection.
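The second approach can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the similarity scores in `TERM_SIMILARITY`, the `threshold` parameter, and the function names are all hypothetical, and plain term frequencies stand in for whatever term weighting the paper uses. The sketch expands a query with related terms, scores each document by cosine similarity against a term-frequency vector, and multiplies by a coordination factor (the fraction of query terms the document contains). Automatic relevance feedback is omitted.

```python
import math
from collections import Counter

# Toy collection (hypothetical; not the PIKES data).
DOCS = {
    "d1": "car engine repair manual",
    "d2": "automobile engine maintenance",
    "d3": "bicycle repair guide",
}

# Hypothetical similarity scores standing in for a Term Similarity Matrix.
TERM_SIMILARITY = {
    "car": {"automobile": 0.9, "bicycle": 0.3},
    "repair": {"maintenance": 0.8},
}


def expand_query(terms, threshold=0.5):
    """Map each term to a weight: 1.0 for original terms,
    the similarity score for related terms at or above the threshold."""
    weights = {t: 1.0 for t in terms}
    for t in terms:
        for related, sim in TERM_SIMILARITY.get(t, {}).items():
            if sim >= threshold:
                weights[related] = max(weights.get(related, 0.0), sim)
    return weights


def cosine(query_weights, doc_tf):
    """Cosine similarity between the weighted query and a term-frequency vector."""
    dot = sum(w * doc_tf.get(t, 0) for t, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(c * c for c in doc_tf.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


def search(query, threshold=0.5):
    """Rank documents by cosine similarity scaled by a coordination factor."""
    weights = expand_query(query.split(), threshold)
    results = []
    for doc_id, text in DOCS.items():
        tf = Counter(text.split())
        # Coordination: fraction of (expanded) query terms present in the document.
        matched = sum(1 for t in weights if t in tf)
        coord = matched / len(weights)
        results.append((doc_id, cosine(weights, tf) * coord))
    return sorted(results, key=lambda r: r[1], reverse=True)


ranking = search("car repair")
```

In this toy setup, document `d2` shares no term with the query "car repair" and receives a non-zero score only because expansion adds "automobile" and "maintenance". Because the coordination factor here is computed over the expanded query, a document that uses only one of two synonyms is penalised, which is one way to see why coordination may interact poorly with synonyms.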
Notes
1. The original and cleaned collections are available from http://pikes.fbk.eu/ke4ir.html.
2. Wiki 4 and other LSA Models are available from http://www.semanticsimilarity.org.
6. The stop word list is available at http://www.lextek.com/manuals/onix/stopwords1.html.
8. Concisely explained at https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html.
10. Available from http://pikes.fbk.eu/ke4ir.html.
11. https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html explains that query coordination may not be effective when the query contains synonyms.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Layfield, C., Azzopardi, J., Staff, C. (2017). Experiments with Document Retrieval from Small Text Collections Using Latent Semantic Analysis or Term Similarity with Query Coordination and Automatic Relevance Feedback. In: Calì, A., Gorgan, D., Ugarte, M. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2016. Lecture Notes in Computer Science, vol 10151. Springer, Cham. https://doi.org/10.1007/978-3-319-53640-8_3
DOI: https://doi.org/10.1007/978-3-319-53640-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53639-2
Online ISBN: 978-3-319-53640-8
eBook Packages: Computer Science, Computer Science (R0)