Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics

Leclercq, Éric; Gillet, Annabelle; Grison, Thierry; Savonnet, Marinette

doi:10.1007/978-3-662-60531-8_3

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 11860)


Abstract

This paper presents a Tensor-based Data Model (TDM) for polystore systems, meant to address two major, closely related issues in big data analytics architectures: logical data independence and data impedance mismatch. The TDM is an expressive model that subsumes traditional data models, links the different data models of various data stores, and facilitates data transformations through operators with clearly defined semantics. Our contribution is twofold. First, we add the notion of a schema to the tensor mathematical object using typed associative arrays. Second, we define a set of operators to manipulate data through the TDM. To validate our approach, we first show how the TDM fits into a given polystore architecture. We then describe use cases of real analyses using the TDM and its operators in the context of the 2017 French presidential election.


Notes

1.
https://kafka.apache.org/.
2.
https://flink.apache.org/.
3.
Single Instruction Multiple Data.
4.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/FlexTables/FlexTableHandbook.htm.
5.
https://docs.arangodb.com/3.4/Manual/index.html.
6.
https://orientdb.com/graph-database/.
7.
https://bigdata.uni-saarland.de/projects/octopusdb.php.
8.
http://wp.sigmod.org/?p=1629.
9.
https://spark.apache.org/sql/.
10.
https://drill.apache.org/.
11.
http://forward.ucsd.edu/.
12.
https://hive.apache.org/.
13.
https://neo4j.com/developer/graph-algorithms/.
14.
https://www.vertica.com/product/database-machine-learning/.
15.
https://www.paradigm4.com/.
16.
https://www.tensorflow.org/.
17.
http://deeplearning.net/software/theano/.
18.
https://keras.io/.
19.
https://spark.apache.org/mllib/.
20.
https://amplab.cs.berkeley.edu/software/.
21.
http://www.alluxio.org/.
22.
https://azure.microsoft.com/en-us/services/data-lake-analytics/.
23.
https://www.ibm.com/analytics/data-lake.
24.
https://github.com/nicolewhite/RNeo4j.
25.
https://github.com/RevolutionAnalytics/rhbase.
26.
The notation | is the restriction applied to sets, \(A|B=A-(A-B)\).
27.
expr is a logical expression to compare values of \(\boldsymbol{\mathcal{X}}\) to constants. Its form is as follows:
expr ::= <condition> | <condition> <logical operator> <condition> | \(\lnot\)<condition> | (<condition>)
Logical operators are \(\{\wedge, \vee\}\).
<condition> ::= values of \(\boldsymbol{\mathcal{X}}\) (implicit) <comparison operator> constant
Comparison operators are \(\{<, \le, =, \ne, \ge, >\}\).
28.
expr compares keys of the dimensions with constants. Its form is the same as for the operator \(\sigma\), except that
<condition> ::= name of a dimension <comparison operator> constant.
29.
https://botometer.iuni.iu.edu/.
30.
https://github.com/ginestrab/Multiplex-PageRank.
31.
https://github.com/AnnabelleGillet/Multiplex-PageRank.
32.
https://github.com/scalanlp/breeze.
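
Notes 26 to 28 describe a set restriction and two kinds of condition, one on stored values and one on dimension keys. The operator definitions themselves are not part of this excerpt, so what follows is only a minimal sketch of how such conditions might be applied to a sparse tensor. The names `SparseTensor`, `select_values`, `restrict_keys` and `restrict` are hypothetical and the toy data is invented; none of this is the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Tuple

Key = Tuple[Hashable, ...]


def restrict(a: set, b: set) -> set:
    """Note 26: A|B = A - (A - B), i.e. the elements of A that are also in B."""
    return a - (a - b)


@dataclass
class SparseTensor:
    """Hypothetical stand-in for a tensor with named dimensions.

    `cells` maps one key per dimension to a stored value; absent keys
    are treated as empty cells.
    """
    dims: Tuple[str, ...]
    cells: Dict[Key, float]

    def select_values(self, cond: Callable[[float], bool]) -> "SparseTensor":
        """Note 27 style: keep cells whose value satisfies the condition."""
        kept = {k: v for k, v in self.cells.items() if cond(v)}
        return SparseTensor(self.dims, kept)

    def restrict_keys(self, dim: str, cond: Callable[[Hashable], bool]) -> "SparseTensor":
        """Note 28 style: keep cells whose key on `dim` satisfies the condition."""
        axis = self.dims.index(dim)
        kept = {k: v for k, v in self.cells.items() if cond(k[axis])}
        return SparseTensor(self.dims, kept)


# Toy data (invented for illustration): user x hashtag x week counts.
x = SparseTensor(
    dims=("user", "hashtag", "week"),
    cells={
        ("alice", "#vote", 14): 3.0,
        ("alice", "#debate", 15): 1.0,
        ("bob", "#vote", 15): 7.0,
    },
)

# expr ::= (value >= 2) AND NOT (value > 5)
frequent = x.select_values(lambda v: v >= 2 and not v > 5)

# expr ::= week = 15
week15 = x.restrict_keys("week", lambda k: k == 15)
```

Expressing conditions as plain predicates keeps the sketch short. A fuller version would presumably parse the expr grammar from the notes and check constants against the dimension types declared in the schema.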

Acknowledgement

This research was partially supported by the project I-SITE UBFC COCKTAIL. We thank George Becker for comments that have greatly improved the manuscript and Arnaud Da Costa for the maintenance of the server infrastructure.

Author information

Authors and Affiliations

LIB EA 7534 - University of Bourgogne, 21078, Dijon, France
Éric Leclercq, Annabelle Gillet, Thierry Grison & Marinette Savonnet

Corresponding author

Correspondence to Marinette Savonnet.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Roland Wagner

About this chapter

Cite this chapter

Leclercq, É., Gillet, A., Grison, T., Savonnet, M. (2019). Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. In: Hameurlain, A., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XLII. Lecture Notes in Computer Science, vol 11860. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-60531-8_3

DOI: https://doi.org/10.1007/978-3-662-60531-8_3
Published: 18 October 2019
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-60530-1
Online ISBN: 978-3-662-60531-8
eBook Packages: Computer Science, Computer Science (R0)
