The Study of Informality as a Framework for Evaluating the Normalisation of Web 2.0 Texts

  • Alejandro Mosquera
  • Paloma Moreda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7337)


The language used in Web 2.0 applications such as blogging platforms, real-time chats, social networks or collaborative encyclopaedias differs markedly from that of traditional texts. Informal features such as emoticons, spelling errors or Internet-specific slang can lower the performance of Natural Language Processing applications. To overcome this problem, text normalisation approaches can provide a clean word or sentence by transforming all non-standard lexical or syntactic variations into their canonical forms. Nevertheless, because of the characteristics of each normalisation approach, different performance metrics and evaluation procedures exist. We hypothesise that the analysis of informality levels can be used to evaluate text normalisation techniques. Thus, in this study we propose a text normalisation evaluation framework based on informality levels and describe its application to Web 2.0 texts.


Keywords: Informality; Normalisation; Web 2.0





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Alejandro Mosquera (1)
  • Paloma Moreda (1)
  1. DLSI, University of Alicante, Alicante, Spain
