Data Organization and Curation in Big Data

Eltabakh, Mohamed Y.

doi:10.1007/978-3-319-49340-4_5

Mohamed Y. Eltabakh

Abstract

This chapter covers advanced techniques in Big Data analytics and query processing. As data grows larger and workloads and analytics grow more complex, advances in big data applications are no longer hindered by the ability to collect or generate data, but by the ability to manage the available data efficiently and effectively. Numerous scalable and distributed infrastructures have therefore been proposed to manage big data. However, it is well known in the literature that scalability and distributed processing alone are not enough to achieve high performance; the underlying infrastructure must also be highly optimized for various types of workloads and query classes. These optimizations typically start from the lowest layer of the data management stack, the storage layer. We cover two well-known techniques for optimized storage and organization of data that have a large influence on query performance, namely indexing and data layout. For non-traditional workloads, where queries have special execution and data-access characteristics, the standard indexing and layout techniques may fall short of the desired performance goals, so further optimizations specific to the workload characteristics can be applied. We cover techniques addressing several of these non-traditional workloads in the context of big data. Some of these techniques rely on curating the data, the workflows, or both with useful metadata. This curation information can be very valuable for both query optimization and the business logic, and we cover how curation and metadata management of big data are used in query optimization and in different systems. Throughout, we focus on MapReduce-like infrastructures, and more specifically on Hadoop, the open-source implementation of MapReduce.
The chapter covers the state of the art in big data indexing techniques, along with the data layout and organization strategies used to speed up queries. It also covers advanced techniques for enabling non-traditional workloads in Hadoop. Hadoop is primarily designed for workloads that are batch, offline, ad hoc, and disk-based. Yet this chapter covers recent projects and techniques targeting non-traditional workloads such as continuous query evaluation, main-memory processing, and recurring workloads. In addition, the chapter covers recent techniques proposed for data curation and efficient metadata management in Hadoop. These techniques range from semantics-specific ones, e.g., provenance tracking techniques, to generic frameworks for data curation and annotation.

Author information

Authors and Affiliations

Computer Science Department, Worcester Polytechnic Institute, Worcester, MA, USA
Mohamed Y. Eltabakh

Corresponding author

Correspondence to Mohamed Y. Eltabakh.

Editor information

Editors and Affiliations

School of Information Technologies, The University of Sydney, Sydney, New South Wales, Australia
Albert Y. Zomaya
The School of Computer Science, The University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr

About this chapter

Cite this chapter

Eltabakh, M.Y. (2017). Data Organization and Curation in Big Data. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_5

DOI: https://doi.org/10.1007/978-3-319-49340-4_5
Published: 26 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science, Computer Science (R0)
