Abstract
Current scientific applications must analyze enormous amounts of array data using complex mathematical data processing methods. This paper describes a distributed query processing framework for large-scale scientific data analysis that captures array-based computations using SQL-like queries and optimizes and evaluates these computations using state-of-the-art parallel processing algorithms. Instead of providing a library of concrete distributed algorithms that implement certain matrix operations efficiently, we generalize these algorithms by making them parametric in such a way that the same efficient implementations that apply to the concrete algorithms can also apply to their generic counterparts. By specifying matrix operations as generic algebraic operators, we are able to perform inter-operator optimizations, such as fusing matrix transpose with matrix multiplication, resulting to new instantiations of the generic algebraic operators, without having to introduce new efficient algorithms on the fly. We report on a prototype implementation of our framework on three Big Data platforms: Hadoop Map-Reduce, Apache Spark, and Apache Flink, using Apache MRQL, which is a query processing and optimization system for large-scale, distributed data analysis. Finally, we evaluate the effectiveness of our framework through experiments on three queries: a matrix multiplication query, a simple query that combines matrix multiplication with matrix transpose, and a complex iterative query for matrix factorization.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Armbrust, M., et al.: Spark SQL: relational data processing in spark. In: SIGMOD 2015 (2015)
Apache Flink (2018). http://flink.apache.org/
Apache Hadoop (2018). http://hadoop.apache.org/
Apache Hama (2018). http://hama.apache.org/
Apache Hive (2018). http://hive.apache.org/
Apache Giraph (2018). http://giraph.apache.org/
GraphX: Apache Spark’s API for Graphs and Graph-Parallel Computation (2018). https://spark.apache.org/graphx/
Apache MRQL (incubating) (2018). http://mrql.incubator.apache.org/
Apache Spark (2018). http://spark.apache.org/
Battre, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In: 1st ACM Symposium on Cloud computing (SOCC 2010), pp. 119–130 (2010)
Buck, J., et al.: SciHadoop: array-based query processing in hadoop. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2011)
Chaiken, R., et al.: SCOPE: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow. (PVLDB) 1(2), 1265–1276 (2008)
A. Das, F.N. Afrati, S. Salihoglu, and J.D. Ullman. Upper and lower bounds on the cost of a map-reduce computation. In VLDB 2013 (2013)
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: OSDI 2004 (2004)
Fan, J., et al.: The case against specialized graph analytics engines. In: CIDR (2015)
Fegaras, L.: A query processing framework for array-based computations. In: Hartmann, S., Ma, H. (eds.) DEXA 2016, Part I. LNCS, vol. 9827, pp. 240–254. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44403-1_15
Fegaras, L.: An Algebra for Distributed Big Data Analytics. Journal of Functional Programming, Special issue on Programming Languages for Big Data, Volume 27 (2017)
Fegaras, L., Li, C., Gupta, U.: An optimization framework for map-reduce queries. In: EDBT 2012 (2012)
Fegaras, L., Li, C., Gupta, U., Philip, J.J.: XML query optimization in map-reduce. In: International Workshop on the Web and Databases (WebDB) (2011)
Fegaras, L., Maier, D.: Towards an effective calculus for object query languages. In: International Conference on Management of Data (SIGMOD), pp. 47–58 (1995)
Fegaras, L., Maier, D.: Optimizing object queries using an effective calculus. ACM Trans. Database Syst. (TODS) 25(4), 457–516 (2000)
Folk, M., Heber, G., Koziol, Q., Pourmal, E., Robinson, D.: An overview of the HDF5 technology suite and its applications. In: EDBT/ICDT Workshop on Array Databases (2011)
Gates, A.F., et al.: Building a high-level dataflow system on top of map-reduce: the pig experience. Proc. VLDB Endow. (PVLDB) 2(2), 1414–1425 (2009)
Geijn, R.A., Watts, J.: SUMMA: scalable universal matrix multiplication algorithm. Concurr. Pract. Exp. 9(4), 255–274 (1997)
Geng, Y., Huang, X., Zhu, M., Ruan, H., Yang, G.: SciHive: array-based query processing with HiveQL. In: IEEE International Conference on Trust, Security and Privacy in Computing and Communications (Trustcom) (2013)
Jindal, A., et al.: Vertexica: your relational friend for graph analytics!. PVLDB 7(13), 1669–1672 (2014)
Ghoting, A., et al.: SystemML: declarative machine learning on mapreduce. In: IEEE International Conference on Data Engineering (ICDE) (2011)
Isard, M., Yu, Y.: Distributed data-parallel computing using a high-level programming language. In: ACM SIGMOD International Conference on Management of Data, pp. 987–994 (2009)
Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. In: IEEE Computer, August 2009
Kraska, T., Talwalkar, A., Duchi, J., Griffith, R., Franklin, M., Jordan, M.I.: MLbase: a distributed machine learning system. In: Conference on Innovative Data Systems Research (2013)
Lin, J., Dyer, C.: Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers, San Rafael (2010)
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning and data mining in the cloud. In: VLDB 2012 (2012)
Malewicz, G., et al.: Pregel: a system for large-scale graph processing. In: ACM SIGMOD International Conference on Management of Data, pp. 135–146 (2010)
Meng, X., Bradley, J., Yavuz, B., et al.: MLlib: machine learning in apache spark. J. Mach. Learn. Res. 17, 1–7 (2016)
NetCDF: Network Common Data Form. https://www.unidata.ucar.edu/software/netcdf/
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-Foreign language for data processing. In: ACM SIGMOD International Conference on Management of Data (2008)
Papadopoulos, S., Datta, K., Madden, S., Mattson, T.: The TileDB array data storage manager. PVLDB 10(4), 349–360 (2016)
Soroush, E., Balazinska, M., Wang, D.: ArrayStore: a storage manager for complex parallel array processing. In: ACM SIGMOD International Conference on Management of Data (2011)
Soroush, E., Balazinska, M., Krughoff, S., Connolly, A.: Efficient iterative processing in the SciDB parallel array engine. In: 27th International Conference on Scientific and Statistical Database Management (SSDBM) (2015)
Shinnar, A., Cunningham, D., Herta, B., Saraswat, B.: M3R: Increased performance for in-memory Hadoop jobs. In: VLDB 2012 (2012)
The SciDB Development Team. Overview of SciDB: large scale array storage, processing and analysis. In: ACM SIGMOD International Conference on Management of Data (2010)
Thusoo, A., et al.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. (PVLDB) 2(2), 1626–1629 (2009)
Thusoo, A., et al.: Hive: a petabyte scale data warehouse using hadoop. In: IEEE International Conference on Data Engineering (ICDE), pp. 996–1005 (2010)
Valiant, L.G.: A bridging model for parallel computation. CACM 33(8), 103–111 (1990)
Wang, Y., Jiang, W., Agrawal, G.: SciMATE: a novel MapReduce-like framework for multiple scientific data formats. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2012)
Yu, Y., et al.: DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: Symposium on Operating Systems Design and Implementation (OSDI) (2008)
Acknowledgments
Our performance evaluations were performed at the Chameleon cloud computing infrastructure, www.chameleoncloud.org, supported by NSF.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Fegaras, L. (2018). A Query Processing Framework for Large-Scale Scientific Data Analysis. In: Hameurlain, A., Wagner, R., Hartmann, S., Ma, H. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII. Lecture Notes in Computer Science(), vol 11250. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58384-5_5
Download citation
DOI: https://doi.org/10.1007/978-3-662-58384-5_5
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58383-8
Online ISBN: 978-3-662-58384-5
eBook Packages: Computer ScienceComputer Science (R0)