Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics

Transactions on Large-Scale Data- and Knowledge-Centered Systems XLII

Abstract

This paper presents a Tensor-based Data Model (TDM) for polystore systems, meant to address two major, closely related issues in big data analytics architectures: logical data independence and data impedance mismatch. The TDM is an expressive model that subsumes traditional data models, links the different data models of various data stores, and facilitates data transformations through operators with clearly defined semantics. Our contribution is twofold. First, we add the notion of a schema to the tensor mathematical object using typed associative arrays. Second, we define a set of operators to manipulate data through the TDM. To validate our approach, we first show how the TDM fits into a given polystore architecture. We then describe use cases of real analyses using the TDM and its operators in the context of the 2017 French presidential election.
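
As a rough illustration of the schema idea in the abstract, the following Python sketch models a tensor whose dimensions are typed associative arrays mapping keys to integer indices. The names `Dimension` and `TypedTensor`, the sparse dictionary storage, and the sample data are hypothetical and are not taken from the chapter; this is a sketch under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    """A named, typed associative array mapping keys to integer indices."""
    name: str
    key_type: type
    index: dict = field(default_factory=dict)

    def index_of(self, key):
        # Reject keys that do not match the declared type of the dimension.
        if not isinstance(key, self.key_type):
            raise TypeError(f"{self.name}: expected {self.key_type.__name__}")
        # Assign the next free integer index the first time a key is seen.
        return self.index.setdefault(key, len(self.index))


@dataclass
class TypedTensor:
    """A sparse tensor whose schema is the ordered list of its dimensions."""
    dims: list
    values: dict = field(default_factory=dict)

    def set(self, keys, value):
        if len(keys) != len(self.dims):
            raise ValueError("one key per dimension is required")
        coords = tuple(d.index_of(k) for d, k in zip(self.dims, keys))
        self.values[coords] = value

    def get(self, keys, default=0):
        # Unknown keys map to None, which never matches stored coordinates.
        coords = tuple(d.index.get(k) for d, k in zip(self.dims, keys))
        return self.values.get(coords, default)


# Hypothetical sample: number of uses of a hashtag by a user in a given week.
tensor = TypedTensor([Dimension("user", str),
                      Dimension("hashtag", str),
                      Dimension("week", int)])
tensor.set(("alice", "#vote", 12), 3)
```

Keeping the key-to-index mapping inside each dimension means the numerical tensor only ever sees integer coordinates, while callers keep addressing cells with keys from the source data stores.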

Notes

  1. https://kafka.apache.org/.

  2. https://flink.apache.org/.

  3. Single Instruction Multiple Data.

  4. https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/FlexTables/FlexTableHandbook.htm.

  5. https://docs.arangodb.com/3.4/Manual/index.html.

  6. https://orientdb.com/graph-database/.

  7. https://bigdata.uni-saarland.de/projects/octopusdb.php.

  8. http://wp.sigmod.org/?p=1629.

  9. https://spark.apache.org/sql/.

  10. https://drill.apache.org/.

  11. http://forward.ucsd.edu/.

  12. https://hive.apache.org/.

  13. https://neo4j.com/developer/graph-algorithms/.

  14. https://www.vertica.com/product/database-machine-learning/.

  15. https://www.paradigm4.com/.

  16. https://www.tensorflow.org/.

  17. http://deeplearning.net/software/theano/.

  18. https://keras.io/.

  19. https://spark.apache.org/mllib/.

  20. https://amplab.cs.berkeley.edu/software/.

  21. http://www.alluxio.org/.

  22. https://azure.microsoft.com/en-us/services/data-lake-analytics/.

  23. https://www.ibm.com/analytics/data-lake.

  24. https://github.com/nicolewhite/RNeo4j.

  25. https://github.com/RevolutionAnalytics/rhbase.

  26. The notation | denotes restriction applied to sets: \(A|B=A-(A-B)\).

  27. expr is a logical expression that compares values of \(\varvec{\mathcal {X}}\) to constants. Its form is as follows:

    expr ::= <condition> | <condition> <logical operator> <condition> | \(\lnot \)<condition> | (<condition>)

    <condition> ::= values of \(\varvec{\mathcal {X}}\) (implicit) <comparison operator> constant

    Logical operators are \(\{\wedge , \vee \}\). Comparison operators are \(\{<,\le , =,\ne ,\ge ,>\}\).

  28. expr compares keys of the dimensions with constants. Its form is the same as for the operator \(\sigma \), except that

    <condition> ::= name of a dimension <comparison operator> constant.

  29. https://botometer.iuni.iu.edu/.

  30. https://github.com/ginestrab/Multiplex-PageRank.

  31. https://github.com/AnnabelleGillet/Multiplex-PageRank.

  32. https://github.com/scalanlp/breeze.
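
The set restriction of note 26 and the two kinds of selection conditions in notes 27 and 28 can be sketched in a few lines of Python. The sketch assumes a sparse tensor stored as a dictionary from key tuples to values, and it passes conditions as Python callables instead of parsing the expr grammar; the function names (`condition`, `restrict`, `select_values`, `select_keys`) and the sample data are hypothetical and do not come from the chapter.

```python
import operator

# Comparison operators listed in note 27.
COMPARE = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
           "!=": operator.ne, ">=": operator.ge, ">": operator.gt}


def condition(op, constant):
    """Build a predicate comparing an (implicit) operand with a constant."""
    compare = COMPARE[op]
    return lambda x: compare(x, constant)


def restrict(a, b):
    """Set restriction from note 26: A|B = A - (A - B)."""
    return a - (a - b)


def select_values(tensor, pred):
    """Keep the entries whose value satisfies pred (note 27)."""
    return {k: v for k, v in tensor.items() if pred(v)}


def select_keys(tensor, dim_names, dim, pred):
    """Keep the entries whose key on dimension dim satisfies pred (note 28)."""
    pos = dim_names.index(dim)
    return {k: v for k, v in tensor.items() if pred(k[pos])}


dims = ("user", "week")
tensor = {("alice", 12): 3, ("bob", 12): 7, ("alice", 13): 5}

# Conjunction of two conditions on values: 3 < value <= 7.
gt3, le7 = condition(">", 3), condition("<=", 7)
by_value = select_values(tensor, lambda v: gt3(v) and le7(v))

# Condition on the keys of one dimension: week = 12.
by_key = select_keys(tensor, dims, "week", condition("=", 12))
```

Note that `restrict` reduces to set intersection, since removing from A everything that is not in B leaves exactly the elements common to both.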

Acknowledgement

This research was partially supported by the project I-SITE UBFC COCKTAIL. We thank George Becker for comments that have greatly improved the manuscript and Arnaud Da Costa for the maintenance of the server infrastructure.

Author information

Correspondence to Marinette Savonnet.

Copyright information

© 2019 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Leclercq, É., Gillet, A., Grison, T., Savonnet, M. (2019). Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. In: Hameurlain, A., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XLII. Lecture Notes in Computer Science, vol 11860. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-60531-8_3

  • DOI: https://doi.org/10.1007/978-3-662-60531-8_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-60530-1

  • Online ISBN: 978-3-662-60531-8

  • eBook Packages: Computer Science (R0)
