A Distributed Support Vector Machine Using Apache Spark for Semi-supervised Classification with Data Augmentation

  • Conference paper
Soft Computing and Signal Processing

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 898)

Abstract

The support vector machine (SVM) is one of the most popular and widely used classification algorithms in data mining and machine learning. Traditionally, however, SVMs have been applied to small or, at most, medium-sized datasets. The need to keep pace with growing dataset sizes has drawn research attention to new SVM techniques and implementations that scale well to large datasets and tasks. Distributed SVMs have recently been studied, but data augmentation with semi-supervised classification using a distributed SVM has not yet been implemented. This paper introduces and analyzes a distributed implementation of the support vector machine with data augmentation on SparkR, a recent and effective platform for distributed computation. The framework, A Distributed Support Vector Machine under Apache Spark for Semi-supervised Classification with Smart Data Augmentation, is applied to a large-scale dataset with more than a million data points. The results and analysis show that the proposed approach greatly enhances the predictive performance of the method in terms of execution time and processing speed.
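The abstract does not spell out the algorithm, so the following is only a rough, single-machine sketch of the general idea, not the authors' SparkR implementation. It assumes three stand-ins that are not taken from the paper: a Pegasos-style subgradient solver for a linear SVM, parameter averaging over data partitions in place of a real Spark cluster, and self-training (pseudo-labeling confident unlabeled points) in place of the paper's data augmentation scheme. All function names and parameters are hypothetical.

```python
import random

def train_linear_svm(points, labels, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent on the L2-regularised hinge loss.
    The bias is folded in as a constant feature (and is regularised too)."""
    rng = random.Random(seed)
    w = [0.0] * (len(points[0]) + 1)
    order = list(range(len(points)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            x = list(points[i]) + [1.0]
            margin = labels[i] * sum(a * b for a, b in zip(w, x))
            w = [(1.0 - eta * lam) * a for a in w]     # shrink (regulariser)
            if margin < 1.0:                           # hinge loss is active
                w = [a + eta * labels[i] * b for a, b in zip(w, x)]
    return w

def decision(w, x):
    """Signed score of point x under weights w (last weight is the bias)."""
    return sum(a * b for a, b in zip(w, list(x) + [1.0]))

def fit_partitioned(points, labels, n_parts=4):
    """Assumed stand-in for distributed training: fit one model per
    partition, then average the weight vectors."""
    models = [train_linear_svm(points[k::n_parts], labels[k::n_parts], seed=k)
              for k in range(n_parts)]
    return [sum(ws) / n_parts for ws in zip(*models)]

def self_train(points, labels, unlabeled, threshold=1.0, rounds=3):
    """Assumed stand-in for augmentation: move unlabeled points whose
    |score| >= threshold into the training set with their predicted label."""
    points, labels, pool = list(points), list(labels), list(unlabeled)
    w = fit_partitioned(points, labels)
    for _ in range(rounds):
        remaining = []
        for x in pool:
            s = decision(w, x)
            if abs(s) >= threshold:
                points.append(x)
                labels.append(1 if s > 0 else -1)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):                # nothing new was added
            break
        pool = remaining
        w = fit_partitioned(points, labels)
    return w, len(points)

def make_blobs(n, rng):
    """Synthetic two-class data: Gaussian blobs around (2, 2) and (-2, -2)."""
    pts, ys = [], []
    for _ in range(n):
        y = rng.choice([-1, 1])
        pts.append([rng.gauss(2.0 * y, 0.5), rng.gauss(2.0 * y, 0.5)])
        ys.append(y)
    return pts, ys

rng = random.Random(42)
labeled, labeled_y = make_blobs(40, rng)
unlabeled, _ = make_blobs(200, rng)                    # true labels discarded
test_x, test_y = make_blobs(200, rng)

w, n_train = self_train(labeled, labeled_y, unlabeled)
accuracy = sum((decision(w, x) > 0) == (y > 0)
               for x, y in zip(test_x, test_y)) / len(test_x)
print("training set size after self-training:", n_train)
print("held-out accuracy:", accuracy)
```

On a real cluster, the per-partition fits in `fit_partitioned` would run on separate workers; here they run sequentially, so the sketch shows only the data flow and says nothing about the execution-time gains the abstract reports.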

Corresponding author

Correspondence to S. S. Blessy Trencia Lincy.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Blessy Trencia Lincy, S.S., Nagarajan, S.K. (2019). A Distributed Support Vector Machine Using Apache Spark for Semi-supervised Classification with Data Augmentation. In: Wang, J., Reddy, G., Prasad, V., Reddy, V. (eds) Soft Computing and Signal Processing. Advances in Intelligent Systems and Computing, vol 898. Springer, Singapore. https://doi.org/10.1007/978-981-13-3393-4_41
