Integrating DBMS and Parallel Data Mining Algorithms for Modern Many-Core Processors

Rechkalov, Timofey; Zymbler, Mikhail

doi:10.1007/978-3-319-96553-6_17

Part of the book series: Communications in Computer and Information Science (CCIS, volume 822)

Included in the following conference series:

International Conference on Data Analytics and Management in Data Intensive Domains

Abstract

Relational DBMSs (RDBMSs) remain the most popular tool for processing structured data in data-intensive domains. However, most stand-alone data mining packages process flat files outside an RDBMS. In-database data mining avoids the bottleneck of exporting data and importing results that stand-alone mining packages incur, and it keeps all the benefits provided by an RDBMS. The paper presents an approach to data mining inside an RDBMS based on a parallel implementation of user-defined functions (UDFs). The approach is implemented for PostgreSQL and the modern Intel MIC (Many Integrated Core) architecture. Each UDF performs a single mining task on data from a specified table and produces a resulting table. The UDF is organized as a wrapper around an appropriate mining algorithm, which is implemented in C and parallelized at the thread level with OpenMP. The heavy-weight parts of the algorithm are additionally vectorized by hand with intrinsic functions for MIC platforms to reach optimal loop vectorization. The library of such UDFs supports a cache of precomputed mining structures to reduce the cost of further computations. In the experiments, the proposed approach shows good scalability and outperforms the R data mining package.
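The combination of thread-level parallelism, loop vectorization, and a cache of precomputed structures described in the abstract can be illustrated with a minimal C sketch. It is not the authors' implementation: the function names (`distance_matrix`, `get_distance_matrix`), the choice of a squared-Euclidean distance matrix as the cached structure, and the single-slot cache keyed by a table name are all assumptions made for illustration. The PostgreSQL UDF wrapper is omitted, and `#pragma omp simd` stands in for the hand-written MIC intrinsics mentioned in the abstract.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical kernel: fill an n x n matrix of squared Euclidean distances
 * between n points of dimension d (stored row-major in `points`). */
void distance_matrix(const float *points, size_t n, size_t d, float *dist)
{
    /* Thread-level parallelism: each row of the matrix is independent. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            /* Request vectorization of the inner loop from the compiler;
             * the paper instead vectorizes by hand with intrinsics. */
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < d; k++) {
                float diff = points[i * d + k] - points[j * d + k];
                sum += diff * diff;
            }
            dist[i * n + j] = sum;
        }
    }
}

/* Hypothetical single-slot cache, keyed by a table name (assumption). */
static char cached_key[64];
static float *cached_dist = NULL;
static int kernel_runs = 0; /* counts real computations, for illustration */

/* Return the distance matrix for `key`, computing it only on a cache miss.
 * Returns NULL if memory cannot be allocated. */
const float *get_distance_matrix(const char *key, const float *points,
                                 size_t n, size_t d)
{
    if (cached_dist != NULL && strcmp(cached_key, key) == 0)
        return cached_dist; /* cache hit: reuse the precomputed structure */

    float *fresh = malloc(n * n * sizeof *fresh);
    if (fresh == NULL)
        return NULL;
    distance_matrix(points, n, d, fresh);
    kernel_runs++;

    /* Replace the previously cached matrix with the new one. */
    free(cached_dist);
    cached_dist = fresh;
    strncpy(cached_key, key, sizeof cached_key - 1);
    cached_key[sizeof cached_key - 1] = '\0';
    return cached_dist;
}
```

Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially with the same results. A second call with the same key returns the cached matrix without recomputing it, which is the effect the abstract attributes to the cache of precomputed mining structures.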

Notes

1. http://www.rscgroup.ru/en/

Acknowledgments

This work was financially supported by the Russian Foundation for Basic Research (grant No. 17-07-00463), by Act 211 of the Government of the Russian Federation (contract No. 02.A03.21.0011), and by the Ministry of Education and Science of the Russian Federation (government order 2.7905.2017/8.9). The authors thank RSC Group (Moscow, Russia) for the provided computational resources.

Author information

Authors and Affiliations

South Ural State University, Chelyabinsk, Russia
Timofey Rechkalov & Mikhail Zymbler

Authors

Timofey Rechkalov
Mikhail Zymbler

Corresponding author

Correspondence to Mikhail Zymbler.

Editor information

Editors and Affiliations

Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow, Russia
Leonid Kalinichenko
Open University of Cyprus, Latsia, Cyprus
Yannis Manolopoulos
Institute of Astronomy, Russian Academy of Sciences, Moscow, Russia
Oleg Malkov
Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow, Russia
Nikolay Skvortsov
Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow, Russia
Sergey Stupnikov
Moscow State University, Moscow, Russia
Vladimir Sukhomlin

About this paper

Cite this paper

Rechkalov, T., Zymbler, M. (2018). Integrating DBMS and Parallel Data Mining Algorithms for Modern Many-Core Processors. In: Kalinichenko, L., Manolopoulos, Y., Malkov, O., Skvortsov, N., Stupnikov, S., Sukhomlin, V. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2017. Communications in Computer and Information Science, vol 822. Springer, Cham. https://doi.org/10.1007/978-3-319-96553-6_17

DOI: https://doi.org/10.1007/978-3-319-96553-6_17
Published: 13 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96552-9
Online ISBN: 978-3-319-96553-6
eBook Packages: Computer Science, Computer Science (R0)