Data Organization and Curation in Big Data

  • Chapter
Handbook of Big Data Technologies

Abstract

This chapter covers advanced techniques in Big Data analytics and query processing. As data grows larger and workloads and analytics grow more complex, advances in big data applications are no longer limited by the ability to collect or generate data, but by the ability to manage the available data efficiently and effectively. Numerous scalable and distributed infrastructures have therefore been proposed to manage big data. However, it is well known in the literature that scalability and distributed processing alone are not enough to achieve high performance; the underlying infrastructure must also be highly optimized for various types of workloads and query classes. These optimizations typically start at the lowest layer of the data management stack, the storage layer. We first cover two well-known techniques for optimized storage and organization of data that have a large influence on query performance, namely indexing and data layout. For non-traditional workloads, where queries have special execution and data-access characteristics, standard indexing and layout techniques may fall short of the desired performance goals, so further optimizations specific to the workload characteristics can be applied. We cover techniques addressing several of these non-traditional workloads in the context of big data. Some of these techniques rely on curating the data, the workflows, or both with useful metadata. This curation information can be very valuable for both query optimization and the business logic, and we cover the curation and metadata management of big data in query optimization and in different systems. Throughout, we focus on MapReduce-like infrastructures, and more specifically on Hadoop, the open-source implementation of MapReduce.
The chapter covers the state of the art in big data indexing techniques and in data layout and organization strategies that speed up queries. It also covers advanced techniques for enabling non-traditional workloads in Hadoop. Hadoop is primarily designed for batch, offline, ad-hoc, and disk-based workloads. Yet this chapter covers recent projects and techniques targeting non-traditional workloads such as continuous query evaluation, main-memory processing, and recurring workloads. In addition, the chapter covers recent techniques proposed for data curation and efficient metadata management in Hadoop. These techniques range from semantics-specific ones, e.g., provenance tracking, to generic frameworks for data curation and annotation.

Author information

Corresponding author

Correspondence to Mohamed Y. Eltabakh.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Eltabakh, M.Y. (2017). Data Organization and Curation in Big Data. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_5

  • DOI: https://doi.org/10.1007/978-3-319-49340-4_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49339-8

  • Online ISBN: 978-3-319-49340-4

  • eBook Packages: Computer Science, Computer Science (R0)
