Experiments on Sentence Boundary Detection in User-Generated Web Content

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9041)

Abstract

Sentence Boundary Detection (SBD) is an important prerequisite for proper sentence analysis in many Natural Language Processing tasks. In recent years, many SBD methods have been applied to transcriptions produced by Automatic Speech Recognition systems and to well-structured texts (e.g. news and scientific texts). However, little research has addressed SBD in informal user-generated content such as web reviews, comments, and posts, which are not necessarily well written or well structured. In this paper, we adapt and extend a well-known SBD method to the domain of opinionated texts on the web. In particular, we evaluate our proposal on a set of online product reviews and compare it with other traditional SBD methods. The experimental results show that our approach outperforms these methods.

Author information

Corresponding author

Correspondence to Roque López.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

López, R., Pardo, T.A.S. (2015). Experiments on Sentence Boundary Detection in User-Generated Web Content. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2015. Lecture Notes in Computer Science, vol 9041. Springer, Cham. https://doi.org/10.1007/978-3-319-18111-0_18

  • DOI: https://doi.org/10.1007/978-3-319-18111-0_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18110-3

  • Online ISBN: 978-3-319-18111-0

  • eBook Packages: Computer Science, Computer Science (R0)
