Cluster Computing, Volume 22, Supplement 1, pp 1911–1924

Word recommendation for English composition using big corpus data processing

  • Keon Myung Lee
  • Chan-Sik Han
  • Kwang-Il Kim
  • Sang Ho Lee


Writing essays and technical documents can be a challenging task for many people, especially for non-native speakers. Good content and ideas are important in writing, but clear and effective expressions that accurately convey the meaning of these ideas to readers are essential. Writers often have difficulty selecting the proper words to fit into their sentences. Proper words may be widely used words that appear in similar contexts. These can be identified by a statistical analysis of a corpus, which is a collection of a large number of sentences. This paper proposes a method that recommends suitable words based on word pattern queries, which are expressed as a combination of words, part-of-speech (POS) tags, and wild card words, such as ‘<verb> {1:2} idea’. The proposed method recommends words for the POS tags of a word pattern query, along with their popularity and example sentences from the corpus. To facilitate such query processing, the method first performs POS tagging on all the sentences in the corpus. From the tagged sentences, it generates 2-grams through 5-grams, which consist of words, POS tags, and the special wild card word symbol ‘*’. It then builds an inverted file-like data structure that keeps the relevant information for each potential word pattern from the n-grams. Because of the large number of word patterns and sentences, MapReduce algorithms are developed to realize the proposed method, and HBase is deployed to manage the inverted file-like data structure. Experimental results are presented to show the characteristics of the proposed method.
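The pipeline described in the abstract can be illustrated with a small in-memory sketch. It is not the authors' implementation: a plain Python dictionary stands in for the HBase-backed inverted file-like structure, a single loop stands in for the MapReduce jobs, and the toy corpus, the simplified tag names, and the function names (`patterns_for`, `build_index`, `expand`, `recommend`) are all hypothetical. The sketch also assumes that a range such as `{1:2}` means "one to two wild card words", which the abstract does not spell out.

```python
from collections import Counter, defaultdict
from itertools import product
import re

# Toy POS-tagged corpus (assumption: simplified tags, not a real tagger's output).
TAGGED = [
    [("we", "pron"), ("propose", "verb"), ("a", "det"), ("new", "adj"), ("idea", "noun")],
    [("they", "pron"), ("propose", "verb"), ("an", "det"), ("idea", "noun")],
    [("she", "pron"), ("rejected", "verb"), ("the", "det"), ("idea", "noun")],
]

def patterns_for(ngram):
    """Yield every pattern of an n-gram: each slot is the word, its <tag>, or '*'."""
    choices = [(word, f"<{tag}>", "*") for word, tag in ngram]
    return product(*choices)

def build_index(tagged_sentences, n_min=2, n_max=5):
    """Map each pattern to a Counter of the concrete word n-grams matching it."""
    index = defaultdict(Counter)
    for sent in tagged_sentences:
        for n in range(n_min, n_max + 1):
            for i in range(len(sent) - n + 1):
                ngram = sent[i:i + n]
                words = tuple(word for word, _ in ngram)
                for pattern in patterns_for(ngram):
                    index[pattern][words] += 1
    return index

def expand(query):
    """Expand '{a:b}' into a..b wild card slots (assumed meaning of the range)."""
    variants = [[]]
    for tok in query.split():
        m = re.fullmatch(r"\{(\d+):(\d+)\}", tok)
        if m:
            lo, hi = int(m.group(1)), int(m.group(2))
            variants = [v + ["*"] * k for v in variants for k in range(lo, hi + 1)]
        else:
            variants = [v + [tok] for v in variants]
    return [tuple(v) for v in variants]

def recommend(index, query):
    """Return (words filling the POS-tag slots, count) pairs, most frequent first."""
    counts = Counter()
    for pattern in expand(query):
        slots = [i for i, tok in enumerate(pattern) if tok.startswith("<")]
        for words, c in index.get(pattern, {}).items():
            counts[tuple(words[i] for i in slots)] += c
    return counts.most_common()

index = build_index(TAGGED)
print(recommend(index, "<verb> {1:2} idea"))
# → [(('propose',), 2), (('rejected',), 1)]
```

Each n-gram here produces 3^n patterns, which is manageable for a toy corpus but grows quickly with corpus size; that growth is the kind of volume the abstract gives as the reason for using MapReduce and HBase. The sketch stores only matching n-grams, whereas the described method also returns example sentences.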


Keywords

Big data, Pattern query, Word recommendation, MapReduce, HBase, Natural language processing



This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) (Grant No. 2015R1D1A1A01061062) and by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2017-2013-0-00881) supervised by the IITP (Institute for Information & communication Technology Promotion).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, Chungbuk National University, Cheongju, Korea
