Distributing N-Gram Graphs for Classification

  • Ioannis Kontopoulos
  • George Giannakopoulos
  • Iraklis Varlamis
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 767)

Abstract

N-gram models are an established choice for language modeling in machine translation, summarization and other tasks. More recently, n-gram graphs have been shown to capture significant language characteristics that go beyond mere vocabulary and grammar, for tasks such as text classification. This work proposes an efficient distributed implementation of the n-gram graph framework on Apache Spark, named ARGOT. The performance of the implementation is evaluated on a demanding text classification task, where n-gram graphs are used to extract features for a supervised classifier. An experimental study shows that the proposed implementation scales to large text corpora and can take advantage of a varying number of processing cores.
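
To make the idea of an n-gram graph concrete, the following is a minimal single-machine sketch in Python. It is not ARGOT and not the authors' distributed implementation. It assumes one common formulation of character n-gram graphs: nodes are character n-grams, and a directed edge links an n-gram to each n-gram that follows it within a fixed window, weighted by the number of co-occurrences. The function names `ngram_graph` and `containment_similarity`, the parameters `n` and `window`, and the choice of similarity measure are all illustrative assumptions.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph (illustrative sketch, not ARGOT).

    Returns a Counter that maps a directed edge (gram, neighbour)
    to the number of times the pair co-occurs within the window.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Link each n-gram to the n-grams that follow it within the window.
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return edges


def containment_similarity(g1, g2):
    """Share of the smaller graph's edges that also appear in the other graph."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


if __name__ == "__main__":
    doc = ngram_graph("distributed n-gram graphs")
    topic = ngram_graph("n-gram graphs for classification")
    # A similarity score like this could serve as one feature for a classifier.
    print(containment_similarity(doc, topic))
```

In a classification setting of the kind the abstract describes, similarity scores between a document graph and per-class graphs could be used as feature values for a supervised classifier. How ARGOT builds, merges, and compares graphs on Spark is not specified in this excerpt.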

Keywords

Distributed processing, N-gram graphs, Text classification

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Ioannis Kontopoulos (1), corresponding author
  • George Giannakopoulos (1)
  • Iraklis Varlamis (2)

  1. Institute of Informatics and Telecommunications, N.C.S.R. “Demokritos”, Agia Paraskevi, Greece
  2. Department of Informatics and Telematics, Harokopio University of Athens, Kallithea, Greece
