Including Social Media – A Very Dynamic Style – in the Corpora for Processing Romanian Language

Perez, Cenel-Augusto; Mărănduc, Cătălina; Simionescu, Radu

doi:10.1007/978-3-319-32942-0_10

Part of the book series: Communications in Computer and Information Science (CCIS, volume 588)

Included in the following conference series:

Workshop on Social Media and the Web of Linked Data

Abstract

This paper describes the process of adding a new sub-corpus, in a new style (social media), to our UAIC-Ro-Dependency-Treebank. Our aim is to enlarge the corpus so that it covers all the styles of the language. Unfortunately, the growth of the corpus is tied to the development of the syntactic parser. Covering all styles is a difficult target: when texts are parsed in a style for which the tools have not yet been trained, accuracy drops significantly. At least 1,000 sentences are needed for the first step of training the parser on a new style. We describe this first step, which involves introducing the social media style into the Treebank; we present a first series of orthographic, stylistic, pragmatic, lexical, semantic, syntactic, and discursive observations on this style; and we report the first statistical evaluation of the automatic annotation.

Notes

1. A parallel corpus for Romanian-English created at the HLT/NAACL 2003 workshop, titled “Building and Using Parallel Texts: Data Driven Machine Translation and Beyond”.
2. JRC-ACQUIS is the largest parallel corpus. It is composed of laws for the EU Member States, from 1958 to the present, translated and aligned for 23 languages.
3. UAIC Romanian dependency parser: http://nlptools.infoiasi.ro/WebFdgRo/

Acknowledgement

We are grateful to the members of the NLP group in our faculty for suggesting such an interesting research topic to us.

Author information

Authors and Affiliations

Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași, Iași, Romania
Cenel-Augusto Perez, Cătălina Mărănduc & Radu Simionescu
“Iorgu Iordan - Al. Rosetti” Institute of Linguistics of the Romanian Academy, Bucharest, Romania
Cătălina Mărănduc

Corresponding author

Correspondence to Cătălina Mărănduc.

Editor information

Editors and Affiliations

University “Alexandru Ioan Cuza”, Iaşi, Romania
Diana Trandabăţ
University “Alexandru Ioan Cuza”, Iaşi, Romania
Daniela Gîfu

Cite this paper

Perez, CA., Mărănduc, C., Simionescu, R. (2016). Including Social Media – A Very Dynamic Style – in the Corpora for Processing Romanian Language. In: Trandabăţ, D., Gîfu, D. (eds) Linguistic Linked Open Data. RUMOUR 2015. Communications in Computer and Information Science, vol 588. Springer, Cham. https://doi.org/10.1007/978-3-319-32942-0_10

DOI: https://doi.org/10.1007/978-3-319-32942-0_10
Published: 10 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32941-3
Online ISBN: 978-3-319-32942-0
eBook Packages: Computer Science, Computer Science (R0)
