Assessing the Efficiency of Suffix Stripping Approaches for Portuguese Stemming

Gomes Ferreira, Wadson; Antônio dos Santos, Willian; Macena Pereira de Souza, Breno; Matta Machado Zaidan, Tiago; Cardoso Brandão, Wladmir

doi:10.1007/978-3-319-23826-5_21

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Stemming is the process of reducing inflected words to their root form, the stem. Search engines use stemming algorithms to conflate words that share a stem, which reduces index size and improves recall. Suffix stripping is a strategy by which stemming algorithms reduce words to stems by applying suffix rules tailored to the constraints of each language. For Portuguese stemming, RSLP was the first suffix stripping algorithm proposed in the literature, and it is still widely used in commercial and open source search engines. Typically, the RSLP algorithm uses a list-based approach to process its suffix stripping rules. In this article, we introduce two alternative suffix stripping approaches for Portuguese stemming, a hash-based approach and an automata-based approach, and we assess their efficiency by contrasting them with the state-of-the-art list-based approach. Complexity analysis shows that the automata-based approach is the more time-efficient. In addition, experiments on two datasets attest to the efficiency of our approaches: the hash-based and the automata-based approaches outperform the list-based approach, reducing stemming time by up to 65.28% and 86.48%, respectively.
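
To make the three rule-processing strategies named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The rule set is a hypothetical four-rule toy, not the actual RSLP rules, and the names RULES, strip_list, strip_hash, build_trie and strip_trie are assumptions introduced only for illustration. Each rule is a triple (suffix, minimum stem length, replacement), and the automaton is approximated here by a trie over reversed suffixes.

```python
# Illustrative rules only: (suffix, minimum stem length, replacement).
# This is NOT the real RSLP rule set; it is a toy ordered longest-first.
RULES = [
    ("mente", 4, ""),
    ("inho", 3, ""),
    ("ns", 1, "m"),
    ("s", 2, ""),
]


def strip_list(word, rules=RULES):
    """List-based: scan the rules in order, apply the first that matches."""
    for suffix, min_stem, repl in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + repl
    return word


# Hash-based: index rules by suffix, then probe candidate suffixes of the
# word from the longest possible length down to a single character.
RULE_TABLE = {suffix: (min_stem, repl) for suffix, min_stem, repl in RULES}
MAX_SUFFIX = max(len(suffix) for suffix in RULE_TABLE)


def strip_hash(word):
    for k in range(min(MAX_SUFFIX, len(word)), 0, -1):
        rule = RULE_TABLE.get(word[-k:])
        if rule is not None and len(word) - k >= rule[0]:
            return word[:-k] + rule[1]
    return word


def build_trie(rules):
    """Build a trie over reversed suffixes; the None key marks acceptance."""
    root = {}
    for suffix, min_stem, repl in rules:
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[None] = (len(suffix), min_stem, repl)
    return root


TRIE = build_trie(RULES)


def strip_trie(word):
    """Automaton-style: read the word backwards, keep the longest valid rule."""
    node, best = TRIE, None
    for ch in reversed(word):
        node = node.get(ch)
        if node is None:
            break
        rule = node.get(None)
        if rule is not None and len(word) - rule[0] >= rule[1]:
            best = rule
    if best is None:
        return word
    return word[: len(word) - best[0]] + best[2]


if __name__ == "__main__":
    for w in ["rapidamente", "livrinho", "homens", "casas", "os"]:
        print(w, strip_list(w), strip_hash(w), strip_trie(w))
```

In this sketch the list scan may test every rule, the hash lookup makes at most one probe per candidate suffix length, and the trie walk reads at most as many characters as the longest suffix. The three functions agree only because the toy list is ordered longest-first; a real rule set with several ordered steps would need more care.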

Author information

Authors and Affiliations

Department of Computer Science, Pontifical Catholic University of Minas Gerais, Belo Horizonte, Brazil
Wadson Gomes Ferreira, Willian Antônio dos Santos, Breno Macena Pereira de Souza, Tiago Matta Machado Zaidan & Wladmir Cardoso Brandão

Corresponding author

Correspondence to Wadson Gomes Ferreira.

Editor information

Editors and Affiliations

King's College London, London, United Kingdom
Costas Iliopoulos
University of Helsinki, Helsinki, Finland
Simon Puglisi
University College London, London, United Kingdom
Emine Yilmaz

About this paper

Cite this paper

Gomes Ferreira, W., Antônio dos Santos, W., Macena Pereira de Souza, B., Matta Machado Zaidan, T., Cardoso Brandão, W. (2015). Assessing the Efficiency of Suffix Stripping Approaches for Portuguese Stemming. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_21

DOI: https://doi.org/10.1007/978-3-319-23826-5_21
Published: 05 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer Science, Computer Science (R0)
