Experiments with Document Retrieval from Small Text Collections Using Latent Semantic Analysis or Term Similarity with Query Coordination and Automatic Relevance Feedback

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10151)

Abstract

Users face the Vocabulary Gap problem when attempting to retrieve relevant textual documents from small databases, especially when only a few documents are relevant, because queries and relevant documents are likely to use different terms to describe the same concept. To enable comparison of the results of different approaches to semantic search in small textual databases, the PIKES team constructed an annotated test collection and Gold Standard comprising 35 search queries and 331 articles. We present two possible solutions. In the first, we index an unannotated version of the PIKES collection using Latent Semantic Analysis (LSA) and retrieve relevant documents using a combination of query coordination and automatic relevance feedback. Although we outperform prior work, this approach depends on the underlying collection and is not necessarily scalable. In the second approach, we use an LSA Model generated by SEMILAR from a Wikipedia dump to generate a Term Similarity Matrix (TSM). Queries are automatically expanded with related terms from the TSM and submitted to a term-by-document matrix Vector Space Model of the PIKES collection. Coupled with a combination of query coordination and automatic relevance feedback, this approach also outperforms prior work. The advantage of the second approach is that it is independent of the underlying document collection.
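The second approach can be pictured with a small sketch. The code below is not the authors' implementation: the three-document collection, the `TSM` entries, the similarity threshold, and the Rocchio-style weights `alpha` and `beta` are all hypothetical values chosen for illustration. It only shows the general shape of expanding a query from a term similarity matrix, ranking documents in a tf-idf vector space, and then refining the query with automatic (pseudo) relevance feedback.

```python
import math
from collections import Counter

# Toy collection (hypothetical; not the PIKES data).
DOCS = {
    "d1": "car engine repair manual",
    "d2": "automobile engine maintenance guide",
    "d3": "garden flower planting guide",
}

# Hypothetical term similarity matrix: term -> {related term: similarity}.
TSM = {"car": {"automobile": 0.8}, "repair": {"maintenance": 0.6}}


def expand(query_terms, tsm, threshold=0.5):
    """Keep original terms at weight 1.0; add related terms weighted by similarity."""
    weights = {t: 1.0 for t in query_terms}
    for t in query_terms:
        for related, sim in tsm.get(t, {}).items():
            if sim >= threshold:
                weights[related] = max(weights.get(related, 0.0), sim)
    return weights


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (term -> weight) for each document."""
    tokenised = {d: text.split() for d, text in docs.items()}
    df = Counter(t for toks in tokenised.values() for t in set(toks))
    idf = {t: math.log(len(tokenised) / df[t]) + 1.0 for t in df}
    return {d: {t: tf * idf[t] for t, tf in Counter(toks).items()}
            for d, toks in tokenised.items()}


def cosine(q, d):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0


def rank(q, vectors):
    """Return document ids ordered by decreasing cosine similarity to q."""
    return sorted(vectors, key=lambda d: cosine(q, vectors[d]), reverse=True)


def feedback(q, vectors, ranking, top_k=1, alpha=1.0, beta=0.5):
    """Rocchio-style update: assume the top_k results are relevant and
    move the query towards their centroid."""
    new_q = {t: alpha * w for t, w in q.items()}
    for d in ranking[:top_k]:
        for t, w in vectors[d].items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / top_k
    return new_q


if __name__ == "__main__":
    vectors = tfidf_vectors(DOCS)
    expanded = expand(["car", "repair"], TSM)
    first_pass = rank(expanded, vectors)
    refined = feedback(expanded, vectors, first_pass)
    print(first_pass, rank(refined, vectors))
```

In this toy setting the unexpanded query `car repair` shares no term with `d2`, so `d2` scores zero; after expansion it is matched through `automobile` and `maintenance`, which is the vocabulary-gap effect the abstract describes.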

Notes

  1. The original and cleaned collections are available from http://pikes.fbk.eu/ke4ir.html.

  2. Wiki 4 and other LSA Models are available from http://www.semanticsimilarity.org.

  3. http://wordnet.princeton.edu.

  4. https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html.

  5. http://pikes.fbk.eu/ke4ir.html.

  6. The stop word list is available at http://www.lextek.com/manuals/onix/stopwords1.html.

  7. http://stanfordnlp.github.io/CoreNLP/.

  8. Concisely explained at https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html.

  9. https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html.

  10. Available from http://pikes.fbk.eu/ke4ir.html.

  11. https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html explains that query coordination may not be effective when the query contains synonyms.
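The synonym caveat in the last note can be illustrated with a minimal sketch. It assumes the usual definition of the coordination factor as the fraction of distinct query terms found in a document; the example terms are hypothetical, and `coord_grouped` is one possible workaround shown for illustration, not a method taken from the paper or from any library.

```python
def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that occur in the document."""
    q = set(query_terms)
    return len(q & set(doc_terms)) / len(q) if q else 0.0


def coord_grouped(synonym_groups, doc_terms):
    """Hypothetical variant: a group of synonyms counts as matched
    if any one of its members occurs in the document."""
    d = set(doc_terms)
    if not synonym_groups:
        return 0.0
    matched = sum(1 for group in synonym_groups if d & set(group))
    return matched / len(synonym_groups)


doc = "automobile engine maintenance guide".split()

plain = ["automobile", "engine"]
expanded = ["automobile", "car", "vehicle", "engine", "motor"]
groups = [["automobile", "car", "vehicle"], ["engine", "motor"]]

print(coord(plain, doc))            # 1.0: both query terms are present
print(coord(expanded, doc))         # 0.4: only 2 of 5 expanded terms are present
print(coord_grouped(groups, doc))   # 1.0: both concepts are covered
```

A document rarely contains every synonym of a concept, so a flat coordination factor penalises a document that covers all of the query's concepts once the query has been expanded; counting per concept avoids that penalty in this example.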

Author information

Corresponding author

Correspondence to Chris Staff.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Layfield, C., Azzopardi, J., Staff, C. (2017). Experiments with Document Retrieval from Small Text Collections Using Latent Semantic Analysis or Term Similarity with Query Coordination and Automatic Relevance Feedback. In: Calì, A., Gorgan, D., Ugarte, M. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2016. Lecture Notes in Computer Science, vol 10151. Springer, Cham. https://doi.org/10.1007/978-3-319-53640-8_3

  • DOI: https://doi.org/10.1007/978-3-319-53640-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-53639-2

  • Online ISBN: 978-3-319-53640-8

  • eBook Packages: Computer Science, Computer Science (R0)
