Abstract
The support vector machine (SVM) is one of the most popular and widely used classification algorithms in data mining and machine learning. Traditionally, however, SVMs have been applied to small or, at most, medium-sized datasets. The need to keep pace with growing dataset sizes has drawn research attention to new techniques and implementations that allow SVMs to scale to large datasets and tasks. Distributed SVMs have recently been studied, but data augmentation combined with semi-supervised classification using a distributed SVM has not yet been implemented. This paper introduces and analyzes a distributed implementation of the support vector machine with data augmentation on SparkR, the R interface to Apache Spark, a recent and effective platform for distributed computation. The framework, A Distributed Support Vector Machine under Apache Spark for Semi-supervised Classification with Smart Data Augmentation, is applied to a large-scale dataset with more than a million data points. The results and analysis show that the proposed approach greatly improves the performance of the method in terms of execution time and processing speed.
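The abstract does not spell out the algorithm, so the following is only a minimal single-process sketch of the general idea, not the authors' SparkR implementation. It assumes a linear SVM trained by subgradient descent on the hinge loss, with per-partition gradients summed in a map/reduce style to stand in for distributed computation; jitter-based augmentation and self-training (pseudo-labeling) are illustrative choices, and every function name and parameter here is hypothetical.

```python
import random

def hinge_subgradient(partition, w, b):
    """'Map' step: sum of hinge-loss subgradients over one data partition."""
    gw, gb = [0.0] * len(w), 0.0
    for x, y in partition:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin < 1.0:  # only margin violators contribute
            for i, xi in enumerate(x):
                gw[i] -= y * xi
            gb -= y
    return gw, gb

def train_linear_svm(partitions, dim, epochs=200, lr=0.1, reg=0.01):
    """'Reduce' step: add up partition gradients, then take one descent step."""
    w, b = [0.0] * dim, 0.0
    n = sum(len(p) for p in partitions)
    for _ in range(epochs):
        grads = [hinge_subgradient(p, w, b) for p in partitions]
        gw = [sum(g[0][i] for g in grads) / n + reg * w[i] for i in range(dim)]
        gb = sum(g[1] for g in grads) / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def augment(labeled, rng, copies=2, noise=0.05):
    """Assumed augmentation: jittered copies of each labeled point."""
    out = list(labeled)
    for x, y in labeled:
        for _ in range(copies):
            out.append(([xi + rng.gauss(0.0, noise) for xi in x], y))
    return out

def split(data, k):
    """Round-robin split into k partitions (stand-in for cluster partitions)."""
    return [data[i::k] for i in range(k)]

def self_train(labeled, unlabeled, dim, rng, rounds=3, threshold=1.0, k=4):
    """Semi-supervised loop: pseudo-label confident points, then retrain."""
    labeled, pool = list(labeled), list(unlabeled)
    w, b = train_linear_svm(split(augment(labeled, rng), k), dim)
    for _ in range(rounds):
        confident, rest = [], []
        for x in pool:
            s = decision(w, b, x)
            if abs(s) >= threshold:
                confident.append((x, 1 if s > 0 else -1))
            else:
                rest.append(x)
        if not confident:
            break
        labeled, pool = labeled + confident, rest
        w, b = train_linear_svm(split(augment(labeled, rng), k), dim)
    return w, b

def make_blobs(rng, n_per_class, spread=0.5):
    """Toy data: two Gaussian clusters centred at (2, 2) and (-2, -2)."""
    data = []
    for _ in range(n_per_class):
        data.append(([rng.gauss(2.0, spread), rng.gauss(2.0, spread)], 1))
        data.append(([rng.gauss(-2.0, spread), rng.gauss(-2.0, spread)], -1))
    return data

def demo(seed=0):
    """Train on 10 labeled and 100 unlabeled points; return test accuracy."""
    rng = random.Random(seed)
    labeled = make_blobs(rng, 5)
    unlabeled = [x for x, _ in make_blobs(rng, 50)]
    test = make_blobs(rng, 50)
    w, b = self_train(labeled, unlabeled, dim=2, rng=rng)
    hits = sum(1 for x, y in test if (1 if decision(w, b, x) > 0 else -1) == y)
    return hits / len(test)
```

Calling `demo()` trains on the toy clusters and returns the accuracy on a held-out sample. In a Spark setting, the list comprehension over partitions would correspond to a distributed map followed by a reduce, which is where the execution-time gains described in the abstract would be expected to come from.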
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Blessy Trencia Lincy, S.S., Nagarajan, S.K. (2019). A Distributed Support Vector Machine Using Apache Spark for Semi-supervised Classification with Data Augmentation. In: Wang, J., Reddy, G., Prasad, V., Reddy, V. (eds) Soft Computing and Signal Processing. Advances in Intelligent Systems and Computing, vol 898. Springer, Singapore. https://doi.org/10.1007/978-981-13-3393-4_41
DOI: https://doi.org/10.1007/978-981-13-3393-4_41
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-3392-7
Online ISBN: 978-981-13-3393-4
eBook Packages: Intelligent Technologies and Robotics (R0)