
Implementation of a large-scale language model adaptation in a cloud environment

Multimedia Tools and Applications

Abstract

This paper presents a system for large-scale language model (LM) adaptation on a daily generated, large-volume text corpus using MapReduce in a cloud environment. Our large-scale trigram language model, comprising 800 million trigram counts, was implemented with a new approach using a representative cloud service (Amazon EC2) and a representative distributed processing framework (Hadoop). The goal of our research is to find the optimal number of Amazon EC2 instances for LM adaptation under the time constraint that the daily generated Twitter texts must be processed within one day. Trigram count extraction and model update for LM adaptation were performed on 200 million daily generated Twitter texts. For trigram count extraction, we found that fewer than 3 h are required to process the daily generated Twitter texts when six instances are used. For model update, fewer than 20 h are required when 10 instances are used. Therefore, LM adaptation for 200 million daily generated Twitter texts can be completed within 24 h using at least 10 instances on Amazon EC2.
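The paper's implementation runs as MapReduce jobs on Hadoop over Amazon EC2. As a rough illustration only, and not the authors' code, the trigram count extraction step can be sketched as a Hadoop-Streaming-style mapper and reducer, emulated here in memory; the function names and toy corpus are assumptions for the sketch.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit (trigram, 1) for every trigram in every line,
    as a Hadoop Streaming mapper would write key/value pairs to stdout."""
    for line in lines:
        tokens = line.strip().split()
        for i in range(len(tokens) - 2):
            yield " ".join(tokens[i:i + 3]), 1

def reducer(pairs):
    """Reduce step: sum the counts per trigram key. Hadoop sorts the
    mapper output by key before the reduce phase; sorted() emulates
    that shuffle-and-sort here."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

# In-memory emulation of one MapReduce pass over a toy "daily" corpus.
corpus = ["the cat sat on the mat", "the cat sat down"]
counts = dict(reducer(mapper(corpus)))
# counts["the cat sat"] == 2; every other trigram appears once.
```

In the actual system, the mapper instances would be distributed across EC2 nodes and the per-trigram sums would be merged into the existing 800-million-entry count table during the model update phase.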




Acknowledgments

This work was supported by the Industrial Strategic Technology Development Program (10035252, Development of Dialog-based Spontaneous Speech Interface Technology on Mobile Platform) funded by the Ministry of Trade, Industry & Energy, Korea.


Corresponding author

Correspondence to Ji-Hwan Kim.

Additional information

This paper is an extended version of a paper presented at the 3rd International Conference on Intelligent Robotics, Automations, Telecommunication Facilities, and Applications (IRoA 2013).


About this article

Cite this article

Kim, KH., Jung, DY., Lee, D. et al. Implementation of a large-scale language model adaptation in a cloud environment. Multimed Tools Appl 75, 5029–5045 (2016). https://doi.org/10.1007/s11042-013-1787-z
