Integrating DBMS and Parallel Data Mining Algorithms for Modern Many-Core Processors

  • Conference paper
Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2017)

Abstract

Relational DBMSs (RDBMSs) remain the most popular tool for processing structured data in data-intensive domains. However, most stand-alone data mining packages process flat files outside an RDBMS. In-database data mining avoids the export-import bottleneck for data and results that stand-alone mining packages incur, and it keeps all the benefits provided by an RDBMS. The paper presents an approach to data mining inside an RDBMS based on a parallel implementation of user-defined functions (UDFs). The approach is implemented for PostgreSQL and the modern Intel MIC (Many Integrated Core) architecture. Each UDF performs a single mining task on data from a specified table and produces a resulting table. The UDF is organized as a wrapper around an appropriate mining algorithm, which is implemented in C and parallelized with OpenMP using thread-level parallelism. The heavy-weight parts of the algorithm are additionally vectorized by hand with intrinsic functions for MIC platforms to achieve efficient loop vectorization. The library of such UDFs supports a cache of precomputed mining structures to reduce the cost of further computations. In the experiments, the proposed approach shows good scalability and outperforms the R data mining package.
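
To make the abstract's ingredients concrete, the following is a minimal sketch, not the authors' code, of a precomputed mining structure that is built with OpenMP thread-level parallelism and then reused from a cache. The choice of a squared Euclidean distance matrix as the structure, the single-slot cache, and all names (`dist_cache_t`, `build_dist_matrix`, `get_dist_matrix`) are assumptions made for illustration. The PostgreSQL UDF wrapper and the MIC intrinsics are omitted, and an OpenMP `simd` hint stands in for manual vectorization.

```c
/* Illustrative sketch only: hypothetical names, not the paper's implementation. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed precomputed structure: n x n matrix of squared Euclidean distances. */
typedef struct {
    char   key[64];  /* hypothetical cache key, e.g. the source table name */
    size_t n;        /* number of points the matrix was built for */
    float *dist;     /* row-major n * n entries; NULL while the slot is empty */
} dist_cache_t;

static dist_cache_t cache = { "", 0, NULL };

/* Build the matrix for n points of dimension d stored row-major in data. */
static float *build_dist_matrix(const float *data, size_t n, size_t d)
{
    float *dist = malloc(n * n * sizeof *dist);
    if (dist == NULL)
        return NULL;

    /* Thread-level parallelism over rows; the pragma is ignored without OpenMP. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            /* Vectorization hint for the inner reduction (replaces intrinsics here). */
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < d; k++) {
                float diff = data[i * d + k] - data[j * d + k];
                sum += diff * diff;
            }
            dist[i * n + j] = sum;
        }
    }
    return dist;
}

/* Return the matrix for key, computing it only on a cache miss.
   If hit is non-NULL, *hit is set to 1 on reuse and 0 on recomputation.
   Keys longer than 63 characters are truncated when stored, so they never match. */
const float *get_dist_matrix(const char *key, const float *data,
                             size_t n, size_t d, int *hit)
{
    assert(key != NULL && data != NULL && n > 0 && d > 0);

    if (cache.dist != NULL && cache.n == n && strcmp(cache.key, key) == 0) {
        if (hit) *hit = 1;
        return cache.dist;
    }

    float *fresh = build_dist_matrix(data, n, d);
    if (fresh == NULL)
        return NULL;

    free(cache.dist);  /* single slot: evict whatever was cached before */
    strncpy(cache.key, key, sizeof cache.key - 1);
    cache.key[sizeof cache.key - 1] = '\0';
    cache.n = n;
    cache.dist = fresh;
    if (hit) *hit = 0;
    return cache.dist;
}
```

A second call with the same key and size returns the stored pointer without recomputation, which is the cost saving the abstract attributes to the cache. With GCC or Clang the pragmas take effect when the file is compiled with `-fopenmp`; without that flag the code still compiles and runs serially.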

Notes

  1. http://www.rscgroup.ru/en/

Acknowledgments

This work was financially supported by the Russian Foundation for Basic Research (grant No. 17-07-00463), by Act 211 of the Government of the Russian Federation (contract No. 02.A03.21.0011), and by the Ministry of Education and Science of the Russian Federation (government order 2.7905.2017/8.9). The authors thank RSC Group (Moscow, Russia) for providing computational resources.

Author information

Corresponding author

Correspondence to Mikhail Zymbler.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Rechkalov, T., Zymbler, M. (2018). Integrating DBMS and Parallel Data Mining Algorithms for Modern Many-Core Processors. In: Kalinichenko, L., Manolopoulos, Y., Malkov, O., Skvortsov, N., Stupnikov, S., Sukhomlin, V. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2017. Communications in Computer and Information Science, vol 822. Springer, Cham. https://doi.org/10.1007/978-3-319-96553-6_17

  • DOI: https://doi.org/10.1007/978-3-319-96553-6_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96552-9

  • Online ISBN: 978-3-319-96553-6

  • eBook Packages: Computer Science (R0)
