Abstract
N-gram models have been an established choice for language modeling in machine translation, summarization, and other tasks. More recently, n-gram graphs have been shown to capture significant language characteristics that go beyond mere vocabulary and grammar, in tasks such as text classification. This work proposes ARGOT, an efficient distributed implementation of the n-gram graph framework on Apache Spark. The performance of the implementation is evaluated on a demanding text classification task, in which n-gram graphs are used to extract features for a supervised classifier. An experimental study shows that the proposed implementation scales to large text corpora and can take advantage of a varying number of processing cores.
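To make the idea of an n-gram graph concrete, the following is a minimal single-machine sketch in Python. It is not ARGOT's code and does not use Spark. The function names, the default n-gram size and window, the choice to link each n-gram only to the n-grams that follow it, and the two similarity measures (written here in the spirit of the containment and value similarities from the n-gram graph literature) are all assumptions made for illustration.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph (hypothetical helper).

    The graph is stored as a Counter mapping a directed edge
    (ngram_a, ngram_b) to the number of times ngram_b appears
    within `window` positions after ngram_a.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        # Link each n-gram to the n-grams that follow it within the window.
        for h in grams[i + 1:i + 1 + window]:
            edges[(g, h)] += 1
    return edges


def containment_similarity(g1, g2):
    """Shared edges divided by the edge count of the smaller graph."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for e in g1 if e in g2)
    return shared / min(len(g1), len(g2))


def value_similarity(g1, g2):
    """Like containment, but each shared edge contributes the ratio of
    its smaller weight to its larger weight, normalized by the edge
    count of the larger graph."""
    if not g1 or not g2:
        return 0.0
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in g1 if e in g2)
    return total / max(len(g1), len(g2))
```

In a classification setting, one plausible use of such functions is to build one graph per class from training documents and describe each new document by its similarities to every class graph; those numbers would then serve as the feature vector for a supervised classifier.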
Notes
1. JInsect toolkit, found at https://github.com/ggianna/JInsect.
2. This paper is supported by the project “Integrating Big Data, Software and Communities for Addressing Europe’s Societal Challenges – BigDataEurope”, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644564. https://www.big-data-europe.eu/.
3.
4.
5. Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz) and 96 GB of RAM, running Debian 64-bit with Linux Kernel 3.16.0-4-amd64 and Java OpenJDK 1.8.
6. Intel(R) Core(TM) i5-3330S CPU @ 2.70 GHz) and 8 GB of RAM, totaling 24 cores and 48 GB of RAM, running OS X 10.10 (14A389) with Kernel Darwin 14.0.0 and Java OpenJDK 1.8, connected with 100-Mbit Ethernet links.
7. Cluster nodes are connected with 100-Mbit Ethernet links.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kontopoulos, I., Giannakopoulos, G., Varlamis, I. (2017). Distributing N-Gram Graphs for Classification. In: Kirikova, M., et al. New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham. https://doi.org/10.1007/978-3-319-67162-8_1
DOI: https://doi.org/10.1007/978-3-319-67162-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67161-1
Online ISBN: 978-3-319-67162-8
eBook Packages: Computer Science (R0)