Indexing of Textual Databases Based on Lexical Resources: A Case Study for Serbian

  • Ranka Stanković
  • Cvetana Krstev
  • Ivan Obradović
  • Olivera Kitanović
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9398)

Abstract

In this paper we describe an approach to improving information retrieval results for large textual databases by pre-indexing documents using a bag of words and named entity recognition. The approach was applied to a database of geological projects that the Republic of Serbia has financed for several decades. Each document in this database is described by a summary report consisting of metadata on the geological project, such as title, domain, keywords, abstract, and geographical location. A bag of words was produced from these metadata with the help of morphological dictionaries and transducers, while named entities were recognized using a rule-based system. Both were then used to pre-index documents for information retrieval, where the ranking of retrieved documents was based on several \(tf\_idf\)-based measures. Ranked retrieval results obtained with pre-indexing were compared, using precision-recall curves, to results obtained by information retrieval without pre-indexing, showing a significant improvement in terms of the mean average precision measure.
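To make the ranking and evaluation terms concrete, the following is a minimal Python sketch of tf-idf ranking over lemmatized bags of words, together with mean average precision. It is an illustration only, not the authors' system: the abstract does not specify which \(tf\_idf\) variants were used, so the weighting here (raw term frequency times log(N/df)) is an assumption, and the function names and toy Serbian lemmas are hypothetical.

```python
import math
from collections import Counter


def tf_idf_index(docs):
    """Build tf-idf weights from bags of words.

    docs maps a document id to a list of lemmas (assumed to come from
    an earlier lemmatization step, which is not shown here).
    """
    n = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # count each term once per document
    idf = {term: math.log(n / count) for term, count in df.items()}
    index = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        index[doc_id] = {term: tf[term] * idf[term] for term in tf}
    return index


def rank(index, query_terms):
    """Return ids of documents with a positive score, best first."""
    scores = {
        doc_id: sum(weights.get(t, 0.0) for t in query_terms)
        for doc_id, weights in index.items()
    }
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))


def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant hit."""
    found, total = 0, 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            found += 1
            total += found / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked ids, set of relevant ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy collection of lemmatized metadata.
docs = {
    "d1": ["ugalj", "rudnik", "ugalj"],
    "d2": ["rudnik", "voda"],
    "d3": ["voda", "izvor"],
}
index = tf_idf_index(docs)
ranking = rank(index, ["ugalj", "rudnik"])
```

In this toy collection, `d1` ranks ahead of `d2` for the query because it contains the rarer lemma twice, while `d3` shares no query term and is not returned. Comparing `mean_average_precision` over the same set of queries for two differently built indexes is the kind of comparison the abstract reports.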

Notes

Acknowledgements

This research was supported by the Serbian Ministry of Education and Science under the grant #47003 and KEYSTONE COST Action IC1302. The authors would also like to thank the anonymous reviewers for their helpful and constructive comments.

Copyright information

© Springer International Publishing Switzerland 2015

Open Access This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  • Ranka Stanković (1)
  • Cvetana Krstev (2)
  • Ivan Obradović (1)
  • Olivera Kitanović (1)

  1. Faculty of Mining and Geology, University of Belgrade, Belgrade, Serbia
  2. Faculty of Philology, University of Belgrade, Belgrade, Serbia
