A Distributed Support Vector Machine Using Apache Spark for Semi-supervised Classification with Data Augmentation

  • S. S. Blessy Trencia Lincy
  • Suresh Kumar Nagarajan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 898)


The support vector machine (SVM) is one of the most popular and widely used classification algorithms in data mining and machine learning. Traditionally, however, SVMs have been applied to small or, at most, medium-sized datasets. The need to keep pace with growing dataset sizes has drawn research attention to new SVM techniques and implementations that scale well to large datasets and tasks. Distributed SVMs have recently been studied, but data augmentation with semi-supervised classification using a distributed SVM has not yet been implemented. This paper introduces and analyzes a distributed implementation of the support vector machine with data augmentation on SparkR, a recent and effective platform for distributed computation. The framework, A Distributed Support Vector Machine under Apache Spark for Semi-supervised Classification with Smart Data Augmentation, is implemented on a large-scale dataset with more than a million data points. The results and analysis show that the proposed approach greatly enhances the performance of the method, reducing execution time and speeding up processing.
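To make the idea of a distributed, semi-supervised SVM concrete, here is a minimal sketch in plain Python. It is not the authors' SparkR implementation, and it makes several assumptions: ordinary lists stand in for cluster partitions, the model is a linear SVM trained by averaged hinge-loss subgradients, and the semi-supervised step is simple self-training (pseudo-labeling confident unlabeled points), which may differ from the augmentation scheme used in the paper. All function and parameter names (`partition_subgradient`, `train_svm`, `self_train`, `threshold`, and so on) are hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def partition_subgradient(w, b, rows):
    """'Map' step: sum hinge-loss subgradients over one partition."""
    gw = [0.0] * len(w)
    gb = 0.0
    for x, y in rows:
        if y * (dot(w, x) + b) < 1.0:  # point violates the margin
            for j, xj in enumerate(x):
                gw[j] -= y * xj
            gb -= y
    return gw, gb, len(rows)

def train_svm(partitions, dim, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM: each epoch combines per-partition subgradients ('reduce')."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        parts = [partition_subgradient(w, b, p) for p in partitions]
        n = sum(count for _, _, count in parts)
        gw = [sum(p[0][j] for p in parts) / n + lam * w[j] for j in range(dim)]
        gb = sum(p[1] for p in parts) / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def self_train(labeled, unlabeled, dim, n_parts=2, rounds=3, threshold=1.0):
    """Assumed semi-supervised loop: pseudo-label confident points, retrain."""
    data, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        w, b = train_svm([data[i::n_parts] for i in range(n_parts)], dim)
        scores = [(x, dot(w, x) + b) for x in pool]
        confident = [(x, 1 if s > 0 else -1)
                     for x, s in scores if abs(s) >= threshold]
        if not confident:
            return w, b  # nothing new to add; model already fits current data
        data.extend(confident)
        pool = [x for x, s in scores if abs(s) < threshold]
    return train_svm([data[i::n_parts] for i in range(n_parts)], dim)

# Toy usage: labels are +1 / -1, features are 2-D points.
labeled = [([2.0, 2.0], 1), ([1.5, 2.5], 1),
           ([-2.0, -2.0], -1), ([-2.5, -1.5], -1)]
unlabeled = [[1.8, 2.2], [-1.9, -2.1], [2.4, 1.6], [-1.6, -2.4]]
w, b = self_train(labeled, unlabeled, dim=2)
```

On a real cluster, the list comprehension over `partitions` would correspond to a distributed map over data partitions, followed by a reduce that sums the subgradients; only the small weight vector would travel between the driver and the workers.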


Big data, SVM, Data augmentation, Apache Spark, Semi-supervised classification



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • S. S. Blessy Trencia Lincy (1)
  • Suresh Kumar Nagarajan (1)
  1. School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
