Abstract
This paper describes the process of adding a new sub-corpus in a new style, social media, to our UAIC-Ro-Dependency-Treebank. Our purpose is to enlarge the corpus and to cover all the styles of the language. The growth of the corpus, however, is tied to the development of the syntactic parser. Covering all the styles is a very difficult target: when the parser is applied to texts in a style for which the tools are not yet trained, accuracy drops significantly. At least 1,000 sentences are needed for the first step of training the parser on a new style. We describe this first step, the introduction of the social media style into the treebank, present a first series of orthographic, stylistic, pragmatic, lexical, semantic, syntactic, and discursive observations on this style, and report the first statistical evaluation of the automatic annotation.
Notes
- 1.
A parallel corpus for Romanian-English created at the HLT/NAACL 2003 workshop, titled “Building and Using Parallel Texts: Data Driven Machine Translation and Beyond”.
- 2.
JRC-ACQUIS is the largest parallel corpus. It is composed of laws for the EU Member States, from 1958 to the present, translated and aligned for 23 languages.
- 3.
UAIC Romanian dependency parser http://nlptools.infoiasi.ro/WebFdgRo/.
Acknowledgement
We are grateful to the members of the NLP group in our faculty for suggesting such an interesting research topic to us.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Perez, CA., Mărănduc, C., Simionescu, R. (2016). Including Social Media – A Very Dynamic Style – in the Corpora for Processing Romanian Language. In: Trandabăţ, D., Gîfu, D. (eds) Linguistic Linked Open Data. RUMOUR 2015. Communications in Computer and Information Science, vol 588. Springer, Cham. https://doi.org/10.1007/978-3-319-32942-0_10
DOI: https://doi.org/10.1007/978-3-319-32942-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32941-3
Online ISBN: 978-3-319-32942-0
eBook Packages: Computer Science, Computer Science (R0)