Indexing of Textual Databases Based on Lexical Resources: A Case Study for Serbian
In this paper we describe an approach to improving information retrieval results for large textual databases by pre-indexing documents using a bag of words and named entity recognition. The approach was applied to a database of geological projects that the Republic of Serbia has financed for several decades. Each document in this database is described by a summary report consisting of metadata on the geological project, such as title, domain, keywords, abstract, and geographical location. A bag of words was produced from these metadata with the help of morphological dictionaries and transducers, while named entities were recognized using a rule-based system. Both were then used to pre-index documents for information retrieval, where the ranking of retrieved documents was based on several \(tf\_idf\)-based measures. Ranked retrieval results obtained with pre-indexing were compared, using precision-recall curves, with results obtained by information retrieval without pre-indexing, and showed a significant improvement in terms of the mean average precision measure.
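To make the pipeline described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy lemma table (`LEMMAS`, hypothetical) in place of the morphological dictionaries and transducers, a plain term-frequency times log inverse-document-frequency weight as one possible \(tf\_idf\)-based measure, and the standard definition of average precision for a single query. Named entity recognition is omitted, and all function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy lemma table standing in for a morphological dictionary;
# it maps a few inflected Serbian forms to their lemmas.
LEMMAS = {"rude": "ruda", "rudu": "ruda", "bakra": "bakar", "bakru": "bakar"}


def bag_of_lemmas(text):
    """Lowercase, split on whitespace, and count lemmas instead of surface forms."""
    return Counter(LEMMAS.get(tok, tok) for tok in text.lower().split())


def tf_idf_rank(query, docs):
    """Rank document ids by the summed tf * log(N / df) weight of query lemmas."""
    bags = {doc_id: bag_of_lemmas(text) for doc_id, text in docs.items()}
    n_docs = len(bags)
    query_terms = bag_of_lemmas(query)
    scores = {}
    for doc_id, bag in bags.items():
        score = 0.0
        for term in query_terms:
            df = sum(1 for b in bags.values() if term in b)
            if df and term in bag:
                score += bag[term] * math.log(n_docs / df)
        scores[doc_id] = score
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranking, relevant):
    """Average of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

Mean average precision is then the mean of `average_precision` over a set of test queries. Because both queries and documents are reduced to lemmas, a query for "ruda bakar" can match a document that contains only the inflected forms "rude bakra", which is the effect that pre-indexing with lexical resources is meant to provide for a highly inflectional language.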
This research was supported by the Serbian Ministry of Education and Science under grant #47003 and by KEYSTONE COST Action IC1302. The authors would also like to thank the anonymous reviewers for their helpful and constructive comments.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.