Abstract
This chapter covers advanced techniques in Big Data analytics and query processing. As data volumes grow and workloads and analytics become more complex, advances in big data applications are no longer hindered by the ability to collect or generate data, but by the ability to manage the available data efficiently and effectively. Numerous scalable and distributed infrastructures have therefore been proposed to manage big data. However, it is well known in the literature that scalability and distributed processing alone are not enough to achieve high performance; the underlying infrastructure must also be highly optimized for various types of workloads and query classes. These optimizations typically start at the lowest layer of the data management stack, the storage layer. We cover two well-known techniques for optimized storage and organization of data that strongly influence query performance, namely indexing and data layout techniques. For non-traditional workloads, where queries have special execution and data-access characteristics, the standard indexing and layout techniques may fall short of the desired performance goals, and further optimizations specific to the workload characteristics can be applied. We cover techniques addressing several of these non-traditional workloads in the context of big data. Some of these techniques rely on curating the data, the workflows, or both with useful metadata. This curation information can be very valuable for both query optimization and the business logic. We therefore also cover the curation and metadata management of big data in query optimization and in different systems. Throughout, we focus on MapReduce-like infrastructures, more specifically on Hadoop, the open-source implementation of MapReduce.
The chapter covers the state of the art in big data indexing techniques, as well as the data layout and organization strategies used to speed up queries. It also covers advanced techniques for enabling non-traditional workloads in Hadoop. Hadoop is primarily designed for workloads that are batch, offline, ad hoc, and disk-based. Yet this chapter covers recent projects and techniques targeting non-traditional workloads such as continuous query evaluation, main-memory processing, and recurring workloads. In addition, the chapter covers recent techniques proposed for data curation and efficient metadata management in Hadoop. These techniques range from semantics-specific ones, e.g., provenance tracking techniques, to generic frameworks for data curation and annotation.
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Eltabakh, M.Y. (2017). Data Organization and Curation in Big Data. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_5
DOI: https://doi.org/10.1007/978-3-319-49340-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science (R0)