Online Expectation Maximization for Language Characterization of Streaming Text

  • Jonathan Wintrode (corresponding author)
  • Nhat Bui
  • Jan Stepinski
  • Chris Reed
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10544)

Abstract

This work examines online Expectation Maximization (EM) methods for learning the language distribution of unlabeled text data in a streaming environment and their impact on language identification (ID). We show that unsupervised estimation of the language distribution over the test environment reduces ID error by up to 40% relative to a mismatched-prior scenario. EM-based strategies also improve distribution estimates over a simple maximum likelihood baseline by up to 75% on our largest test set. With online approaches we achieve maximal ID performance after only a single pass over the data, and we obtain our best distribution estimate, relative to the batch approach, while processing no more than 25% of the data.
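The core idea, adapting the estimated language distribution to the test stream with online (stepwise) EM, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name online_em_language_prior, the posterior_stream interface, and the step-size exponent alpha are illustrative choices, and the per-document likelihoods are assumed to come from an arbitrary off-the-shelf language-ID classifier.

```python
import numpy as np

def online_em_language_prior(posterior_stream, num_langs, alpha=0.7):
    """Stepwise online EM sketch for estimating a language distribution.

    posterior_stream is assumed to yield, for each document, the classifier's
    likelihoods p(doc | language) as a length-num_langs array.
    Returns the running estimate of the language prior pi.
    """
    pi = np.full(num_langs, 1.0 / num_langs)   # start from a uniform prior
    s = pi.copy()                              # running sufficient statistics

    for n, likelihoods in enumerate(posterior_stream, start=1):
        # E-step: posterior over languages for this document under current prior
        post = np.asarray(likelihoods, dtype=float) * pi
        post /= post.sum()

        # Stepwise update of the sufficient statistics with step size n^-alpha
        eta = n ** (-alpha)
        s = (1.0 - eta) * s + eta * post

        # M-step: the prior estimate is the normalized statistics
        pi = s / s.sum()

    return pi
```

In a streaming setting like the one described above, the running estimate can also be fed back as the classifier's prior while documents are processed, which is how a better distribution estimate translates into lower ID error compared to a mismatched prior.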

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Jonathan Wintrode ¹ (corresponding author)
  • Nhat Bui ¹
  • Jan Stepinski ¹
  • Chris Reed ¹

  1. Raytheon Applied Signal Technology, Sunnyvale, USA