Distributing N-Gram Graphs for Classification

Kontopoulos, Ioannis; Giannakopoulos, George; Varlamis, Iraklis

doi:10.1007/978-3-319-67162-8_1

Part of the book series: Communications in Computer and Information Science (CCIS, volume 767)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

N-gram models are an established choice for language modeling in machine translation, summarization, and other tasks. More recently, n-gram graphs have been used to capture significant language characteristics that go beyond mere vocabulary and grammar, for tasks such as text classification. This work proposes ARGOT, an efficient distributed implementation of the n-gram graph framework on Apache Spark. The performance of the implementation is evaluated on a demanding text classification task, in which n-gram graphs are used to extract features for a supervised classifier. An experimental study shows that the proposed implementation scales to large text corpora and can take advantage of a varying number of processing cores.
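
To make the idea of "n-gram graphs as feature extractors" concrete, the following is a minimal single-machine sketch in Python. It is not ARGOT's code and does not use Spark. The function names, the default parameters (`n=3`, `window=3`), the undirected edge representation, and the class-graph merge by summing edge weights are all assumptions made for illustration. The two similarity measures follow the general shape of containment and value similarity from the n-gram graph literature, simplified here.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of weighted edges.

    Nodes are character n-grams. An edge links two n-grams whose start
    positions are at most `window` apart; its weight counts how often
    that happens. Edges are stored as sorted pairs (undirected), which
    is a simplification chosen for this sketch.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + window + 1, len(grams))):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


def containment_similarity(g1, g2):
    """Share of common edges, relative to the smaller graph."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


def value_similarity(g1, g2):
    """Like containment, but each common edge counts by weight agreement."""
    if not g1 or not g2:
        return 0.0
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in g1 if e in g2)
    return total / max(len(g1), len(g2))


def class_graph(documents, n=3, window=3):
    """Merge the graphs of a class's training documents.

    Assumption: weights are simply summed; other merge rules exist.
    """
    merged = Counter()
    for doc in documents:
        merged.update(ngram_graph(doc, n, window))
    return merged


def features(document, class_graphs, n=3, window=3):
    """Similarity of one document to every class graph, as a flat vector."""
    graph = ngram_graph(document, n, window)
    vector = []
    for label in sorted(class_graphs):
        vector.append(containment_similarity(graph, class_graphs[label]))
        vector.append(value_similarity(graph, class_graphs[label]))
    return vector
```

In this sketch, a document is turned into a feature vector by calling `features(text, {"sports": class_graph(sports_docs), "finance": class_graph(finance_docs)})`, where the labels and document lists are hypothetical. The resulting vector can be passed to any supervised classifier. Building the per-document graphs and computing the similarities are independent per document, which is the kind of work a distributed engine can spread across cores.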

Notes

1. JInsect toolkit, found at https://github.com/ggianna/JInsect.
2. This paper is supported by the project “Integrating Big Data, Software and Communities for Addressing Europe’s Societal Challenges – BigDataEurope”, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644564. https://www.big-data-europe.eu/.
3. https://github.com/ioannis-kon/ARGOT.
4. http://trec.nist.gov/data/reuters/reuters.html.
5. Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz and 96 GB of RAM, running Debian 64-bit with Linux Kernel 3.16.0-4-amd64 and Java OpenJDK 1.8.
6. Intel(R) Core(TM) i5-3330S CPU @ 2.70 GHz and 8 GB of RAM, totaling 24 cores and 48 GB of RAM, running OS X 10.10 (14A389) with Kernel Darwin 14.0.0 and Java OpenJDK 1.8, connected with 100-Mbit Ethernet links.
7. Cluster nodes are connected with 100-Mbit Ethernet links.

Author information

Authors and Affiliations

Institute of Informatics and Telecommunications, N.C.S.R. “Demokritos”, Agia Paraskevi, Greece
Ioannis Kontopoulos & George Giannakopoulos
Department of Informatics and Telematics, Harokopio University of Athens, Kallithea, Greece
Iraklis Varlamis

Corresponding author

Correspondence to Ioannis Kontopoulos.

Editor information

Editors and Affiliations

Riga Technical University, Riga, Latvia
Mārīte Kirikova
Norwegian University of Science and Technology, Trondheim, Norway
Kjetil Nørvåg
University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos
Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Johann Gamper
Institute of Computing Science, Poznan University of Technology, Poznan, Poland
Robert Wrembel
Université Lumière Lyon 2, Lyon, France
Jérôme Darmont
University of Bologna, Bologna, Italy
Stefano Rizzi

About this paper

Cite this paper

Kontopoulos, I., Giannakopoulos, G., Varlamis, I. (2017). Distributing N-Gram Graphs for Classification. In: Kirikova, M., et al. New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham. https://doi.org/10.1007/978-3-319-67162-8_1

DOI: https://doi.org/10.1007/978-3-319-67162-8_1
Published: 09 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67161-1
Online ISBN: 978-3-319-67162-8
eBook Packages: Computer Science, Computer Science (R0)
