A Distributed Support Vector Machine Using Apache Spark for Semi-supervised Classification with Data Augmentation

Blessy Trencia Lincy, S. S.; Nagarajan, Suresh Kumar

doi:10.1007/978-981-13-3393-4_41

S. S. Blessy Trencia Lincy &
Suresh Kumar Nagarajan

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 898)

Abstract

The support vector machine (SVM) is one of the most popular and widely used classification algorithms in data mining and machine learning. Traditionally, however, SVMs have been applied to small or, at most, medium-sized datasets. The need to keep pace with growing dataset sizes has drawn research attention to new SVM techniques and implementations that scale well to large datasets and tasks. Distributed SVMs have recently been studied, but data augmentation combined with semi-supervised classification using a distributed SVM has not yet been implemented. This paper introduces and analyzes a distributed implementation of the support vector machine with data augmentation on SparkR, a recent and effective platform for distributed computation. The framework, A Distributed Support Vector Machine under Apache Spark for Semi-supervised Classification with Smart Data Augmentation, is applied to a large-scale dataset with more than a million data points. The results and analysis show that the proposed approach greatly improves the performance of the method in terms of execution time and processing speed.
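
To make the general idea concrete, the following is a minimal sketch of semi-supervised training of a linear SVM with a map/reduce-style gradient computation. It is not the authors' SparkR implementation, and it does not reproduce their data augmentation scheme, which the abstract does not describe. Everything here is an assumption made for illustration: plain Python lists stand in for Spark partitions, self-training with pseudo-labels is swapped in as the semi-supervised step, and all names (`partition_subgradient`, `train_linear_svm`, `self_train`, `make_blobs`) and hyperparameters are hypothetical.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def partition_subgradient(w, b, part):
    """'Map' step: hinge-loss subgradient summed over one partition."""
    gw, gb = [0.0] * len(w), 0.0
    for x, y in part:
        if y * (dot(w, x) + b) < 1.0:  # point violates the margin
            gw = [g - y * xi for g, xi in zip(gw, x)]
            gb -= y
    return gw, gb

def train_linear_svm(data, n_parts=4, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss.
    The list slices below are an assumed stand-in for Spark partitions."""
    dim = len(data[0][0])
    parts = [data[i::n_parts] for i in range(n_parts)]
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grads = [partition_subgradient(w, b, p) for p in parts]  # map
        # 'Reduce' step: combine partition results into an average subgradient.
        gw = [sum(g[0][j] for g in grads) / len(data) for j in range(dim)]
        gb = sum(g[1] for g in grads) / len(data)
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def self_train(labeled, unlabeled, rounds=3, threshold=1.0):
    """Self-training (an assumed semi-supervised strategy): unlabeled points
    scored outside the margin are pseudo-labeled and added to the training set."""
    data, pool = list(labeled), list(unlabeled)
    w, b = train_linear_svm(data)
    for _ in range(rounds):
        scores = [(x, dot(w, x) + b) for x in pool]
        confident = [(x, 1 if s > 0 else -1) for x, s in scores if abs(s) >= threshold]
        if not confident:
            break
        pool = [x for x, s in scores if abs(s) < threshold]
        data.extend(confident)
        w, b = train_linear_svm(data)
    return w, b, len(data)

def make_blobs(n_per_class, rng, center=2.0, spread=0.5):
    """Synthetic two-class 2-D data, used only for this illustration."""
    pts = []
    for y in (1, -1):
        for _ in range(n_per_class):
            pts.append(([rng.gauss(y * center, spread), rng.gauss(y * center, spread)], y))
    return pts

if __name__ == "__main__":
    rng = random.Random(0)
    labeled = make_blobs(10, rng)
    unlabeled = [x for x, _ in make_blobs(200, rng)]
    w, b, n_used = self_train(labeled, unlabeled)
    print("weights:", w, "bias:", b, "training points used:", n_used)
```

In a real Spark deployment the per-partition subgradients would be computed on the workers and combined by the driver; the sketch keeps both steps in one process so that it can be run without a cluster.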

Author information

Authors and Affiliations

School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
S. S. Blessy Trencia Lincy & Suresh Kumar Nagarajan

Corresponding author

Correspondence to S. S. Blessy Trencia Lincy.

Editor information

Editors and Affiliations

Department of Computer Science and Software Engineering, Monmouth University, West Long Branch, NJ, USA
Jiacun Wang
Department of Information Technology, National Institute of Technology Karnataka, Surathkal, Mangaluru, Karnataka, India
G. Ram Mohana Reddy
Department of Computer Science and Engineering, JNTUH College of Engineering Hyderabad, Hyderabad, Telangana, India
V. Kamakshi Prasad
Department of Electronics and Communication Engineering, Malla Reddy College of Engineering & Technology, Secunderabad, Telangana, India
V. Sivakumar Reddy

About this paper

Cite this paper

Blessy Trencia Lincy, S.S., Nagarajan, S.K. (2019). A Distributed Support Vector Machine Using Apache Spark for Semi-supervised Classification with Data Augmentation. In: Wang, J., Reddy, G., Prasad, V., Reddy, V. (eds) Soft Computing and Signal Processing. Advances in Intelligent Systems and Computing, vol 898. Springer, Singapore. https://doi.org/10.1007/978-981-13-3393-4_41

DOI: https://doi.org/10.1007/978-981-13-3393-4_41
Published: 14 February 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-3392-7
Online ISBN: 978-981-13-3393-4
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)
