Distributed and Parallel Databases, Volume 37, Issue 3, pp 329–350

Scalable machine learning computing a data summarization matrix with a parallel array DBMS

  • Carlos Ordonez (corresponding author)
  • Yiqun Zhang
  • S. Lennart Johnsson
Part of the following topical collections:
  1. Special Issue on Extending Data Warehouses to Big Data Analytics


Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising class of systems for manipulating large matrices. With that motivation in mind, we present a high-performance system that exploits a parallel array DBMS to evaluate a general, but compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g. R). We introduce a two-phase algorithm, which first computes a general data summary in parallel and then evaluates matrix equations with reduced intermediate matrices in main memory on one node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++ that work directly on the Unix file system, instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders-of-magnitude improvements in time. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
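The two-phase idea described above can be illustrated with a small sketch. The abstract does not define the summarization matrix, so this example assumes a summary of the form Γ = Σ z zᵀ with z = [1, x₁, …, x_d, y], which holds the row count, the linear sums and the quadratic sums of the data. Python threads stand in for the DBMS's parallel array operators, and all function names (`partial_gamma`, `summarize`, `linear_regression`, `covariance`) are hypothetical, not part of the authors' system.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_gamma(rows):
    """Phase 1, per partition: accumulate sum of z z^T with z = [1, x_1..x_d, y]."""
    k = len(rows[0]) + 1
    g = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0] + list(row)
        for i in range(k):
            for j in range(k):
                g[i][j] += z[i] * z[j]
    return g


def summarize(partitions):
    """Phase 1: compute partial summaries in parallel, then add them up."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_gamma, partitions))
    total = partials[0]
    for g in partials[1:]:
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, g)]
    return total


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def linear_regression(gamma):
    """Phase 2: least-squares coefficients [intercept, b_1..b_d] from the summary alone."""
    p = len(gamma) - 1                      # intercept plus d features
    xtx = [row[:p] for row in gamma[:p]]    # block holding X^T X
    xty = [row[p] for row in gamma[:p]]     # block holding X^T y
    return solve(xtx, xty)


def covariance(gamma):
    """Phase 2: feature covariance matrix (the input to PCA) from the summary alone."""
    n = gamma[0][0]
    d = len(gamma) - 2
    return [[gamma[i][j] / n - gamma[0][i] * gamma[0][j] / (n * n)
             for j in range(1, d + 1)] for i in range(1, d + 1)]


# Toy data generated from y = 2 + 3*x1 - x2, split into two partitions.
rows = [(x1, x2, 2 + 3 * x1 - x2) for x1 in range(4) for x2 in range(3)]
gamma = summarize([rows[:6], rows[6:]])
print(linear_regression(gamma))
print(covariance(gamma))
```

In this sketch only the small (d+2) × (d+2) summary leaves phase 1, so phase 2 runs in main memory on one node regardless of how many rows were scanned.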


Matrix Summarization Parallel DBMS Linear algebra 



This work concludes a long-term project, during which the first author visited MIT from 2013 to 2016. The first author thanks Michael Stonebraker for his guidance in moving away from relational DBMSs to compute machine learning models in a scalable manner, and in understanding SciDB storage and processing mechanisms for large matrices.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Carlos Ordonez (1), corresponding author
  • Yiqun Zhang (1)
  • S. Lennart Johnsson (1)
  1. Department of Computer Science, University of Houston, Houston, USA
