Including Social Media – A Very Dynamic Style – in the Corpora for Processing Romanian Language

  • Conference paper
Linguistic Linked Open Data (RUMOUR 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 588)

Abstract

This paper describes the process of introducing a new sub-corpus, in a new style, social media, into our UAIC-Ro-Dependency-Treebank. Our purpose is to enlarge the corpus and to cover all the styles of the language. Unfortunately, the growth of the corpus is interrelated with the development of the syntactic parser. Covering all the styles is a very difficult target: when texts are parsed in a style for which the tools are not yet trained, the accuracy drops significantly. At least 1,000 sentences are needed for the first step of training the parser in a new style. We describe this first step, which involves introducing the social media style into the Treebank; we present a first series of orthographic, stylistic, pragmatic, lexical, semantic, syntactic, and discursive observations on this style of the language; and we report the first statistical evaluation of the automatic annotation.

Notes

  1. A parallel Romanian-English corpus created at the HLT/NAACL 2003 workshop “Building and Using Parallel Texts: Data Driven Machine Translation and Beyond”.

  2. JRC-ACQUIS is the largest parallel corpus. It is composed of laws for the EU Member States, from 1958 to the present, translated and aligned for 23 languages.

  3. UAIC Romanian dependency parser: http://nlptools.infoiasi.ro/WebFdgRo/.

Acknowledgement

We are grateful to the members of the NLP group in our faculty for suggesting such an interesting research topic to us.

Corresponding author

Correspondence to Cătălina Mărănduc.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Perez, CA., Mărănduc, C., Simionescu, R. (2016). Including Social Media – A Very Dynamic Style – in the Corpora for Processing Romanian Language. In: Trandabăţ, D., Gîfu, D. (eds) Linguistic Linked Open Data. RUMOUR 2015. Communications in Computer and Information Science, vol 588. Springer, Cham. https://doi.org/10.1007/978-3-319-32942-0_10

  • DOI: https://doi.org/10.1007/978-3-319-32942-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32941-3

  • Online ISBN: 978-3-319-32942-0

  • eBook Packages: Computer Science, Computer Science (R0)
