Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study

Layfield, Colin; Ivanović, Dragan; Azzopardi, Joel

doi:10.1007/978-3-319-74497-1_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10546)

Included in the following conference series:

Semantic Keyword-Based Search on Structured Data Sources

Abstract

One of the challenges in information retrieval is searching a corpus of documents that may contain multiple languages. This exploratory study expands upon earlier research employing Latent Semantic Analysis (so-called Multi-Lingual Latent Semantic Indexing, or ML-LSI/LSA). We experiment with this approach, and a new one, in a multi-lingual context using two similar languages, namely Serbian and Croatian. Traditionally, an LSA approach needs a parallel corpus to train the system, with identical documents in two languages combined into one document. We repeat that approach and also experiment with creating a semantic space from the parallel corpus on its own, without merging the documents, to test the hypothesis that, with very similar languages, merging may not be required for good results.
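The two training set-ups described in the abstract can be illustrated with a minimal sketch. The toy sentence pairs, the whitespace tokenizer, and all function names below are assumptions made for illustration; they are not the authors' pipeline or data. The sketch only builds the raw term-document count matrices; weighting and the SVD step that produces the semantic space are omitted.

```python
from collections import Counter

# Hypothetical toy parallel corpus: each pair holds a Serbian document
# and its Croatian "mate" (Latin script, already lower-cased).
parallel_pairs = [
    ("voz stiže na stanicu", "vlak stiže na kolodvor"),
    ("svež hleb svako jutro", "svjež kruh svako jutro"),
]

def tokenize(text):
    """Whitespace tokenizer; a real pipeline would also stem and drop stop words."""
    return text.split()

def merged_corpus(pairs):
    """Traditional ML-LSA training set: one combined document per translation pair."""
    return [tokenize(sr) + tokenize(hr) for sr, hr in pairs]

def separate_corpus(pairs):
    """Alternative training set: every document is kept on its own."""
    docs = []
    for sr, hr in pairs:
        docs.append(tokenize(sr))
        docs.append(tokenize(hr))
    return docs

def term_document_counts(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

merged_vocab, merged_matrix = term_document_counts(merged_corpus(parallel_pairs))
separate_vocab, separate_matrix = term_document_counts(separate_corpus(parallel_pairs))
```

In the merged matrix, "voz" and "vlak" co-occur in the same column, which is what lets LSA associate translation equivalents. In the separate matrix they never share a column, so any association has to come from the vocabulary the two languages have in common ("stiže", "na"), which is the effect the hypothesis relies on.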

Notes

1. As a side effect, the XML turned out to be badly formed in places and needed to be fixed by hand.
2. http://snowballstem.org.
3. Diacritics are added to the top or bottom of a letter to indicate appropriate stress, special pronunciation, or unusual sounds not common in the Roman alphabet. In Serbian and Croatian, these markings indicate special pronunciation, like the difference between the pronunciation of C compared to Ć.
4. The stop word list is available at http://www.lextek.com/manuals/onix/stopwords1.html. Note that single character stop words were not included as it was found that many Serbian/Croatian documents were flagged as English when they were present in the list.
5. We discovered, serendipitously, that the results of using tf-idf and l-e were actually superior when the folded-in search queries were only weighted using raw term-frequency. This was unexpected and will be a topic of future research. The results reported here use the commonly accepted approach of weighting the query appropriately with the weighting method used for the creation of the semantic space.
6. The same similarity score is the cosine similarity between the two ‘mate’ documents.
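
The query fold-in and mate similarity mentioned in the notes above can be sketched as follows. This is a minimal illustration using the standard LSA fold-in formula (projecting a term vector v as vᵀ U_k Σ_k⁻¹), assuming NumPy is available; the matrix values, the choice of k, and the function names are hypothetical, and the vectors are assumed to be already weighted with whatever scheme was used for the training matrix.

```python
import numpy as np

def build_space(term_doc, k):
    """Truncated SVD of a (terms x documents) matrix, keeping k dimensions."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def fold_in(term_vector, u_k, s_k):
    """Project a term-space vector into the k-dimensional space: v^T U_k S_k^{-1}."""
    return (term_vector @ u_k) / s_k

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical weighted training matrix: 4 terms x 3 documents.
training = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])
u_k, s_k, vt_k = build_space(training, k=2)

# Two hypothetical 'mate' documents expressed over the same 4 terms.
mate_sr = np.array([1.0, 1.0, 0.0, 0.0])
mate_hr = np.array([1.0, 0.0, 1.0, 0.0])

# Similarity of the two mates after both are folded into the space.
mate_score = cosine(fold_in(mate_sr, u_k, s_k), fold_in(mate_hr, u_k, s_k))
```

Folding in a column of the training matrix reproduces that document's coordinates in the reduced space, which is a convenient sanity check for an implementation of this kind.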

Acknowledgements

This article is based upon work from COST Action KEYSTONE IC1302, supported by COST (European Cooperation in Science and Technology).

Author information

Authors and Affiliations

University of Malta, Msida, Malta
Colin Layfield & Joel Azzopardi
University of Novi Sad, Novi Sad, Serbia
Dragan Ivanović

Corresponding author

Correspondence to Colin Layfield.

Editor information

Editors and Affiliations

Gdańsk University of Technology, Gdańsk, Poland
Julian Szymański
Università degli Studi di Trento, Trento, Italy
Yannis Velegrakis

About this paper

Cite this paper

Layfield, C., Ivanović, D., Azzopardi, J. (2018). Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study. In: Szymański, J., Velegrakis, Y. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2017. Lecture Notes in Computer Science, vol 10546. Springer, Cham. https://doi.org/10.1007/978-3-319-74497-1_15

DOI: https://doi.org/10.1007/978-3-319-74497-1_15
Published: 08 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-74496-4
Online ISBN: 978-3-319-74497-1
eBook Packages: Computer Science, Computer Science (R0)
