On Development of Data Science and Machine Learning Applications in Databricks

  • Wenhao Ruan
  • Yifan Chen
  • Babak Forouraghi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11517)


Databricks is a unified analytics engine that allows rapid development of data science applications using machine learning techniques such as classification, linear and nonlinear regression, and clustering. The myriad of sophisticated computational options, however, can overwhelm designers, because it is not always clear which choices will produce the best predictive model for a specific data set. Further, the sheer dimensionality of big data sets makes it difficult for data scientists to gain a deep understanding of the results a model produces.

This paper provides general guidelines for utilizing a variety of machine learning algorithms on the cloud computing platform Databricks. Because visualization is an important means for users to understand the significance of the underlying data, the paper also demonstrates how graphical tools such as Tableau can be used to efficiently examine the results of classification or clustering. Dimensionality reduction techniques such as Principal Component Analysis (PCA), which help reduce the number of features in a learning experiment, are also discussed.
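As a rough illustration of how PCA reduces the number of features, the following minimal sketch finds the first principal component of a tiny two-feature data set by power iteration on the covariance matrix. It is not the paper's implementation and does not use Databricks or Spark APIs; the function names, the toy `rows` values, and the choice of power iteration are all assumptions made for illustration.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (unit direction, explained-variance ratio) of the top component."""
    n, d = len(rows), len(rows[0])
    # Center each feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the all-ones start vector is not orthogonal to the top component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

def project(rows, direction):
    """Replace each centered row by its single coordinate along `direction`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [sum((r[j] - means[j]) * direction[j] for j in range(d)) for r in rows]

# Hypothetical toy data: the second feature is roughly twice the first.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, ratio = first_principal_component(rows)
projections = project(rows, direction)
```

Here two correlated features collapse into one coordinate per row while `ratio` reports how much of the total variance that single coordinate retains. On a real cluster one would normally rely on a library implementation rather than hand-written code.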

To demonstrate the utility of Databricks tools, two big data sets are used for clustering and classification. A variety of machine learning algorithms are applied to both data sets, and the paper shows how to obtain the most accurate learning models using appropriate evaluation methods.
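One common way to choose among candidate models is to compare their k-fold cross-validated accuracy. The sketch below is a minimal pure-Python illustration of that idea under stated assumptions: the two classifiers (nearest centroid and 1-nearest-neighbor), the toy data, and all function names are hypothetical stand-ins, not the algorithms, data sets, or evaluation procedure used in the paper.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_nearest_centroid(X, y):
    """Train: average the points of each class. Predict: closest class mean."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: [s / counts[label] for s in acc]
                 for label, acc in sums.items()}
    return lambda p: min(centroids,
                         key=lambda label: squared_distance(p, centroids[label]))

def fit_one_nearest_neighbor(X, y):
    """Train: memorize the data. Predict: label of the closest training point."""
    train = list(zip(X, y))
    return lambda p: min(train, key=lambda pair: squared_distance(p, pair[0]))[1]

def cross_val_accuracy(fit, X, y, k=5):
    """Accuracy over k interleaved folds; each point is held out exactly once."""
    folds = [list(range(i, len(X), k)) for i in range(k)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        predict = fit(train_X, train_y)
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return correct / len(X)

def select_best_model(candidates, X, y, k=5):
    scores = {name: cross_val_accuracy(fit, X, y, k)
              for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy data: class "outer" forms two groups on either side of "inner".
outer = [(-4.0, 0.0), (-4.2, 0.3), (-3.8, -0.2), (-4.1, -0.4), (-3.9, 0.5),
         (4.0, 0.1), (4.2, -0.3), (3.8, 0.2), (4.1, 0.4), (3.9, -0.5)]
inner = [(0.0, 0.0), (0.2, 0.3), (-0.2, -0.2), (0.1, -0.4), (-0.1, 0.5),
         (0.3, 0.1), (-0.3, -0.3), (0.2, -0.2), (-0.1, 0.4), (0.0, -0.5)]
X = outer + inner
y = ["outer"] * len(outer) + ["inner"] * len(inner)

candidates = {"nearest-centroid": fit_nearest_centroid,
              "1-nearest-neighbor": fit_one_nearest_neighbor}
best, scores = select_best_model(candidates, X, y)
```

The toy data is deliberately shaped so that the two class means nearly coincide, which hurts the centroid-based model. This illustrates why the best choice depends on the data set and has to be measured on held-out data rather than assumed.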


Keywords: Big data, Machine learning, Cloud computing, Classification, Data science, Clustering, Databricks



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Saint Joseph’s University, Philadelphia, USA
