Evaluating Phonetic Spellers for User-Generated Content in Brazilian Portuguese

  • Conference paper
Computational Processing of the Portuguese Language (PROPOR 2016)

Abstract

Recently, spell checking (or spelling correction) has regained attention due to the need to normalize user-generated content (UGC) on the web. UGC presents new challenges to spellers, as its register is much more informal and contains much more variability than traditional spelling correction systems can handle. This paper proposes two new approaches to spelling correction of UGC in Brazilian Portuguese (BP), both of which take phonetic errors into account. The first approach is based on three phonetic modules running in a pipeline. The second is based on machine learning with soft decision making and considers context-sensitive misspellings. We compared our methods with others on a human-annotated UGC corpus of product reviews. The machine learning approach surpassed all other methods, with a 78.0% correction rate, a very low false positive rate (0.7%), and a false negative rate of 21.9%.
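To make the notion of a phonetic error concrete, the following is a minimal sketch of phonetic-key candidate lookup. It is not the paper's pipeline or its phonetic modules: the rewrite rules in `RULES`, the tiny `LEXICON`, and the helper names `phonetic_key`, `build_index`, and `candidates` are all hypothetical and chosen only for illustration. The idea is that a misspelling such as "xuva" and the intended word "chuva" collapse to the same key, so the lexicon word can be retrieved as a correction candidate.

```python
import re
from collections import defaultdict

# Toy, hypothetical rewrite rules for illustration only (NOT the paper's rules).
# They are applied in order; "ç" is used as a temporary marker for a voiceless "s".
RULES = [
    (r"^h", ""),                          # initial "h" is silent: "hoje" ~ "oje"
    (r"ch", "x"),                         # "chuva" ~ "xuva"
    (r"c(?=[ei])", "ç"),                  # "c" before "e"/"i" sounds like "s"
    (r"(?<=[aeiou])s(?=[aeiou])", "z"),   # single "s" between vowels: "casa" ~ "caza"
    (r"ss|ç", "s"),                       # "ss" and "ç" both map to "s"
]

def phonetic_key(word: str) -> str:
    """Map a word to a rough phonetic key by applying RULES in order."""
    key = word.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key

def build_index(lexicon):
    """Group lexicon words by phonetic key."""
    index = defaultdict(list)
    for word in lexicon:
        index[phonetic_key(word)].append(word)
    return index

def candidates(token: str, index) -> list:
    """Return lexicon words that share the token's phonetic key."""
    return index.get(phonetic_key(token), [])

# Hypothetical mini-lexicon; a real speller would use a full dictionary.
LEXICON = ["chuva", "casa", "caça", "hoje", "passo"]
INDEX = build_index(LEXICON)

if __name__ == "__main__":
    for token in ["xuva", "caza", "oje", "cassa"]:
        print(token, "->", candidates(token, INDEX))
```

A key-based lookup like this only proposes candidates; choosing among them, and handling real-word errors in context, would need additional evidence such as the context-sensitive model described in the abstract.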

Notes

  1. The small benchmark of 120 tokens used in [16, 17] is not representative of our scenario.

  2. http://jaspell.sourceforge.net/.

  3. Currently, a BP version of the phonetic rules can be found at http://sourceforge.net/projects/metaphoneptbr/.

  4. http://www.nilc.icmc.usp.br/nilc/projects/unitex-pb/web/.

  5. http://corpusbrasileiro.pucsp.br/cb/.

  6. The dictionary is available upon request.

  7. http://www.buscape.com.br/.

  8. https://github.com/gustavoauma/propor_2016_speller.

References

  1. Duan, H., Hsu, B.P.: Online spelling correction for query completion. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, NY, USA, pp. 117–126. ACM (2011)

  2. Fossati, D., Di Eugenio, B.: A mixed trigrams approach for context sensitive spell checking. In: Gelbukh, A. (ed.) CICLing 2007. LNCS, vol. 4394, pp. 623–633. Springer, Heidelberg (2007)

  3. Fossati, D., Di Eugenio, B.: I saw TREE trees in the park: how to correct real-word spelling mistakes. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation LREC 2008 (2008)

  4. Mays, E., Damerau, F.J., Mercer, R.L.: Context based spelling correction. Inf. Process. Manage. 27(5), 517–522 (1991)

  5. Wilcox-O’Hearn, A., Hirst, G., Budanitsky, A.: Real-word spelling correction with trigrams: a reconsideration of the Mays, Damerau, and Mercer Model. In: Gelbukh, A. (ed.) CICLing 2008. LNCS, vol. 4919, pp. 605–616. Springer, Heidelberg (2008)

  6. Islam, A., Inkpen, D.: Real-word spelling correction using Google Web 1T n-gram data set. In: ACM International Conference on Information and Knowledge Management CIKM 2009, pp. 1689–1692 (2009)

  7. Sonmez, C., Ozgur, A.: A graph-based approach for contextual text normalization. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing EMNLP 2014, pp. 313–324 (2014)

  8. Hirst, G.: An evaluation of the contextual spelling checker of Microsoft Office Word 2007 (2008)

  9. Németh, L.: Hunspell (2010). Available at http://hunspell.sourceforge.net/ [01.10.2013]

  10. Zampieri, M., Amorim, R.: Between sound and spelling: combining phonetics and clustering algorithms to improve target word recovery. In: Proceedings of the 9th International Conference on Natural Language Processing PolTAL 2014, pp. 438–449 (2014)

  11. Russell, R.C.: US Patent 1261167, issued 1918-04-02 (1918)

  12. Duran, M., Avanço, L., Aluísio, S., Pardo, T., Nunes, M.G.V.: Some issues on the normalization of a corpus of products reviews in Portuguese. In: Proceedings of the 9th Web as Corpus Workshop WaC-9, Gothenburg, Sweden, pp. 22–28, April 2014

  13. De Clercq, O., Schulz, S., Desmet, B., Lefever, E., Hoste, V.: Normalization of Dutch user-generated content. In: Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pp. 179–188 (2013)

  14. Han, B., Cook, P., Baldwin, T.: Lexical normalization for social media text. ACM Trans. Intell. Syst. Technol. 4(1), 5:1–5:27 (2013)

  15. Andrade, G., Teixeira, F., Xavier, C., Oliveira, R., Rocha, L., Evsukoff, A.: HASCH: high performance automatic spell checker for Portuguese texts from the web. In: Proceedings of the International Conference on Computational Science, vol. 9, pp. 403–411 (2012)

  16. Martins, B., Silva, M.J.: Spelling correction for search engine queries. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds.) EsTAL 2004. LNCS (LNAI), vol. 3230, pp. 372–383. Springer, Heidelberg (2004)

  17. Ahmed, F., Luca, E.W.D., Nürnberger, A.: Revised N-Gram based Automatic spelling correction tool to improve retrieval effectiveness. Polibits 40, 39–48 (2009)

  18. Philips, L.: The double metaphone search algorithm. C/C++ Users J. 18(6), 38–43 (2000)

  19. Avanço, L., Duran, M., Nunes, M.G.V.: Towards a phonetic Brazilian Portuguese spell checker. In: Proceedings of ToRPorEsp Workshop PROPOR 2014, São Carlos, Brazil, pp. 24–31 (2014)

  20. Hartmann, N., Avanço, L., Balage, P., Duran, M., Nunes, M.G.V., Pardo, T., Aluísio, S.: A large corpus of product reviews in Portuguese: tackling out-of-vocabulary words. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation LREC 2014, pp. 3866–3871 (2014)

  21. Mendonça, G., Aluísio, S.: Using a hybrid approach to build a pronunciation dictionary for Brazilian Portuguese. In: Proceedings of the 15th Annual Conference of the International Speech Communication Association INTERSPEECH 2014, Singapore (2014)

  22. Toutanova, K., Moore, R.C.: Pronunciation modeling for improved spelling correction. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL 2002, pp. 144–151 (2002)

  23. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  24. Browne, K.: Snowball sampling: using social networks to research non-heterosexual women. Int. J. Soc. Res. Methodol. 8(1), 47–60 (2005)

  25. Carletta, J.: Assessing agreement on classification tasks: the kappa statistic. Comput. Linguist. 22(2), 249–254 (1996)

  26. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)

  27. Damerau, F.J.: A technique for computer detection and correction of spelling errors. Commun. ACM 7(3), 171–176 (1964)

  28. Brill, E., Moore, R.C.: An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL 2000, pp. 286–293 (2000)

  29. van Berkel, B., Smedt, K.D.: Triphone analysis: a combined method for the correction of orthographical and typographical errors. In: Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas, USA, pp. 77–83, February 1988

Acknowledgments

Part of the results presented in this paper were obtained through research activity in the project titled “Semantic Processing of Brazilian Portuguese Texts”, sponsored by Samsung Eletrônica da Amazônia Ltda. under the terms of Brazilian federal law number 8.248/91.

Author information

Corresponding author

Correspondence to Erick Rocha Fonseca.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

de Mendonça Almeida, G.A., Avanço, L., Duran, M.S., Fonseca, E.R., Nunes, M.d.G.V., Aluísio, S.M. (2016). Evaluating Phonetic Spellers for User-Generated Content in Brazilian Portuguese. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_37

  • DOI: https://doi.org/10.1007/978-3-319-41552-9_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41551-2

  • Online ISBN: 978-3-319-41552-9

  • eBook Packages: Computer Science, Computer Science (R0)
