Distributing N-Gram Graphs for Classification

  • Conference paper
New Trends in Databases and Information Systems (ADBIS 2017)

Abstract

N-gram models have long been an established choice for language modeling in machine translation, summarization and other tasks. More recently, n-gram graphs have proven able to capture significant language characteristics that go beyond mere vocabulary and grammar, in tasks such as text classification. This work proposes ARGOT, an efficient distributed implementation of the n-gram graph framework on Apache Spark. Its performance is evaluated on a demanding text classification task, in which n-gram graphs are used to extract features for a supervised classifier. An experimental study shows that the proposed implementation scales to large text corpora and can take advantage of a varying number of processing cores.
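
To make the idea of using n-gram graphs as feature extractors more concrete, the following is a minimal single-machine sketch in Python. It is not the ARGOT implementation and does not use Spark. The function names, the default n-gram size and window, the choice of containment and value similarity as features, and the merging of class graphs by summing edge weights are all assumptions made for illustration.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of weighted edges.

    An edge (a, b) links n-gram a to every n-gram b that starts within
    `window` positions after it; the weight counts those co-occurrences.
    The defaults n=3 and window=3 are illustrative assumptions.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, a in enumerate(grams):
        for b in grams[i + 1:i + 1 + window]:
            edges[(a, b)] += 1
    return edges


def containment_similarity(g1, g2):
    """Share of common edges, relative to the smaller graph."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for e in g1 if e in g2)
    return shared / min(len(g1), len(g2))


def value_similarity(g1, g2):
    """Like containment, but each shared edge counts by its weight ratio."""
    if not g1 or not g2:
        return 0.0
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in g1 if e in g2)
    return total / max(len(g1), len(g2))


def class_graph(docs, n=3, window=3):
    """Merge document graphs by summing edge weights (a simplification)."""
    merged = Counter()
    for doc in docs:
        merged += ngram_graph(doc, n, window)
    return merged


def features(doc, class_graphs, n=3, window=3):
    """Feature vector: two similarities to each class graph, by sorted label."""
    g = ngram_graph(doc, n, window)
    vec = []
    for label in sorted(class_graphs):
        vec.append(containment_similarity(g, class_graphs[label]))
        vec.append(value_similarity(g, class_graphs[label]))
    return vec
```

Under these assumptions, each document is reduced to a short numeric vector (two values per class) that any supervised classifier can consume. In a distributed setting, the per-document graph construction and the similarity computations are the natural units of parallel work.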

Notes

  1. JInsect toolkit, found at https://github.com/ggianna/JInsect.

  2. This paper is supported by the project “Integrating Big Data, Software and Communities for Addressing Europe’s Societal Challenges – BigDataEurope”, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644564. https://www.big-data-europe.eu/.

  3. https://github.com/ioannis-kon/ARGOT.

  4. http://trec.nist.gov/data/reuters/reuters.html.

  5. Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz) and 96 GB of RAM, running Debian 64-bit with Linux Kernel 3.16.0-4-amd64 and Java OpenJDK 1.8.

  6. Intel(R) Core(TM) i5-3330S CPU @ 2.70 GHz) and 8 GB of RAM, totaling 24 cores and 48 GB of RAM, running OS X 10.10 (14A389) with Kernel Darwin 14.0.0 and Java OpenJDK 1.8, connected with 100-Mbit Ethernet links.

  7. Cluster nodes are connected with 100-Mbit Ethernet links.

Author information

Corresponding author

Correspondence to Ioannis Kontopoulos.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Kontopoulos, I., Giannakopoulos, G., Varlamis, I. (2017). Distributing N-Gram Graphs for Classification. In: Kirikova, M., et al. New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham. https://doi.org/10.1007/978-3-319-67162-8_1

  • DOI: https://doi.org/10.1007/978-3-319-67162-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67161-1

  • Online ISBN: 978-3-319-67162-8

  • eBook Packages: Computer Science, Computer Science (R0)
