Overview
The era of big data is upon us. However, traditional data management and analysis systems, which are mainly based on relational database management systems (RDBMSs), may not be able to handle the ever-growing data volume. Therefore, it is important to design scalable system architectures that process big data efficiently and exploit its value. This chapter discusses various horizontally and vertically scaling big data platforms, focusing on their architectural principles for big data analysis applications such as machine learning and graph processing. It can help users select the right system architectures or platforms for their big data applications.
Introduction
This is an era of big data, evidenced by the sheer volume of data from a variety of sources and its growing rate of generation. According to a report from the International Data Corporation (IDC), the global data volume will grow by a factor of 300, from 130 exabytes (1 exabyte = 10^6 terabytes) to 40,000...
References
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga R, Moore S, Murray DG, Steiner B, Tucker P, Vasudevan V, Warden P, Wicke M, Yu Y, Zheng X (2016) Tensorflow: a system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI’16). USENIX Association, Savannah
Anderson MJ, Sundaram N, Satish N, Patwary MMA, Willke TL, Dubey P (2016) Graphpad: optimized graph primitives for parallel and distributed platforms. In: IPDPS. IEEE, pp 313–322
Bergstra J, Bastien F, Breuleux O, Lamblin P, Pascanu R, Delalleau O, Desjardins G, Warde-Farley D, Goodfellow I, Bergeron A et al (2011) Theano: deep learning on GPUs with python. In: NIPS 2011, BigLearning workshop, Granada
Beyer MA, Laney D (2012) The importance of big data: a definition. Gartner, Stamford, pp 2014–2018
Bu Y, Howe B, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endow 3(1–2):285–296
Bu Y, Borkar V, Jia J, Carey MJ, Condie T (2014) Pregelix: big(ger) graph analytics on a dataflow engine. Proc VLDB Endow 8(2):161–172
Buluç A, Gilbert JR (2011) The combinatorial blas: design, implementation, and applications. Int J High Perfor Comput Appl 25(4):496–509
Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19(2):171–209
Chen R, Shi J, Chen Y, Chen H (2015a) Powerlyra: differentiated graph computation and partitioning on skewed graphs. In: Proceedings of the tenth European conference on computer systems. ACM, p 1
Chen T, Li M, Li Y, Lin M, Wang N, Wang M, Xiao T, Xu B, Zhang C, Zhang Z (2015b) Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274
Chilimbi T, Suzue Y, Apacible J, Kalyanaraman K (2014) Project Adam: building an efficient and scalable deep learning training system. In: 11th USENIX symposium on operating systems design and implementation (OSDI’14), pp 571–582
Ching A, Edunov S, Kabiljo M, Logothetis D, Muthukrishnan S (2015) One trillion edges: graph processing at facebook-scale. Proc VLDB Endow 8(12):1804–1815
Dai G, Chi Y, Wang Y, Yang H (2016) FPGP: graph processing framework on FPGA a case study of breadth-first search. In: Proceedings of the 2016 ACM/SIGDA international symposium on field-programmable gate arrays. ACM, pp 105–110
Dayarathna M, Wen Y, Fan R (2016) Data center energy consumption modeling: a survey. IEEE Commun Surv Tutorials 18(1):732–794
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Ekanayake J, Li H, Zhang B, Gunarathne T, Bae SH, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818
Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A et al (2004) Open MPI: goals, concept, and design of a next generation MPI implementation. In: European parallel virtual machine/message passing interface users group meeting. Springer, pp 97–104
Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView IDC Analyze Future 2007(2012):1–16
Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) Powergraph: distributed graph-parallel computation on natural graphs. OSDI 12(1):2
Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) Graphx: graph processing in a distributed dataflow framework. In: 11th USENIX symposium on operating systems design and implementation (OSDI’14). USENIX Association, Broomfield, pp 599–613
Gropp W, Lusk E, Skjellum A (1999) Using MPI: portable parallel programming with the message-passing interface, vol 1. MIT press, Cambridge
Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz RH, Shenker S, Stoica I (2011) Mesos: a platform for fine-grained resource sharing in the data center. NSDI 11:22–22
Hu H, Wen Y, Chua TS, Li X (2014) Toward scalable systems for big data analytics: a technology tutorial. IEEE Access 2:652–687
Iosup A, Hegeman T, Ngai WL, Heldens S, Prat-Pérez A, Manhardt T, Chafi H, Capotă M, Sundaram N, Anderson M et al (2016) Ldbc graphalytics: a benchmark for large-scale graph analysis on parallel and distributed platforms. Proc VLDB Endow 9(13):1317–1328
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the ACM international conference on multimedia. ACM, pp 675–678
Karonis NT, Toonen B, Foster I (2003) Mpich-g2: a grid-enabled implementation of the message passing interface. J Parallel Distrib Comput 63(5):551–563
Khorasani F, Vora K, Gupta R, Bhuyan LN (2014) Cusha: vertex-centric graph processing on GPUs. In: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing. ACM, pp 239–252
Laney D (2001) 3D data management: controlling data volume, velocity and variety. META Group Res Note 6(70):1–4
Li M, Andersen DG, Park JW, Smola AJ, Ahmed A, Josifovski V, Long J, Shekita EJ, Su BY (2014) Scaling distributed machine learning with the parameter server. In: 11th USENIX symposium on operating systems design and implementation (OSDI’14). USENIX Association, Broomfield, pp 583–598
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146
Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S et al (2016) MLlib: machine learning in apache spark. J Mach Learn Res 17(1):1235–1241
Nurvitadhi E, Weisz G, Wang Y, Hurkat S, Nguyen M, Hoe JC, Martínez JF, Guestrin C (2014) Graphgen: an fpga framework for vertex-centric graph computation. In: IEEE 22nd annual international symposium on field-programmable custom computing machines (FCCM). IEEE, pp 25–28
Ovtcharov K, Ruwase O, Kim JY, Fowers J, Strauss K, Chung ES (2015) Accelerating deep convolutional neural networks using specialized hardware. Microsoft Res Whitepaper 2(11):1–4
Panda DK, Tomko K, Schulz K, Majumdar A (2013) The MVAPICH project: evolution and sustainability of an open source production quality MPI library for HPC. In: Workshop on sustainable software for science: practice and experiences, held in conjunction with international conference on supercomputing (WSSPE)
Qiu J, Wang J, Yao S, Guo K, Li B, Zhou E, Yu J, Tang T, Xu N, Song S et al (2016) Going deeper with embedded FPGA platform for convolutional neural network. In: Proceedings of the 2016 ACM/SIGDA international symposium on field-programmable gate arrays. ACM, pp 26–35
Roy A, Bindschaedler L, Malicevic J, Zwaenepoel W (2015) Chaos: scale-out graph processing from secondary storage. In: Proceedings of the 25th symposium on operating systems principles. ACM, pp 410–424
Salihoglu S, Widom J (2013) GPS: a graph processing system. In: Proceedings of the 25th international conference on scientific and statistical database management. ACM, p 22
Schelter S, Satuluri V, Zadeh R (2014) Factorbird-a parameter server approach to distributed matrix factorization. arXiv preprint arXiv:1411.0602
Shun J, Blelloch GE (2013) Ligra: a lightweight graph processing framework for shared memory. In: ACM SIGPLAN notices, vol 48(8). ACM, pp 135–146
Shun J, Dhulipala L, Blelloch GE (2015) Smaller and faster: parallel processing of compressed graphs with ligra+. In: Data compression conference (DCC). IEEE, pp 403–412
Wang W, Chen G, Dinh ATT, Gao J, Ooi BC, Tan KL, Wang S (2015) SINGA: putting deep learning in the hands of multimedia users. In: Proceedings of the 23rd ACM international conference on multimedia. ACM, pp 25–34
Wang Y, Davidson A, Pan Y, Wu Y, Riffel A, Owens JD (2016) Gunrock: a high-performance graph processing library on the GPU. In: ACM SIGPLAN notices 51(8). ACM, p 11
White T (2012) Hadoop: the definitive guide. O’Reilly Media, Inc., Sebastopol
Xing EP, Ho Q, Dai W, Kim JK, Wei J, Lee S, Zheng X, Xie P, Kumar A, Yu Y (2015) Petuum: a new platform for distributed machine learning on big data. IEEE Trans Big Data 1(2):49–67
Yan D, Cheng J, Xing K, Lu Y, Ng W, Bu Y (2014) Pregel algorithms for graph connectivity problems with performance guarantees. Proc VLDB Endow 7(14):1821–1832
Yan D, Huang Y, Liu M, Chen H, Cheng J, Wu H, Zhang C (2017) Graphd: distributed vertex-centric graph processing beyond the memory limit. IEEE Trans Parallel Distrib Syst 29(1):99–114
Yang F, Li J, Cheng J (2016) Husky: towards a more efficient and expressive distributed computing framework. Proc VLDB Endow 9(5):420–431
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles. ACM, pp 423–438
Zhang C, Li P, Sun G, Guan Y, Xiao B, Cong J (2015) Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of the 2015 ACM/SIGDA international symposium on field-programmable gate arrays. ACM, pp 161–170
Zhao Y, Yoshigoe K, Bian J, Xie M, Xue Z, Feng Y (2016) A distributed graph-parallel computing system with lightweight communication overhead. IEEE Trans Big Data 2(3):204–218
Zhong J, He B (2014) Medusa: simplified graph processing on GPUs. IEEE Trans Parallel Distrib Syst 25(6):1543–1552
Zhou C, Gao J, Sun B, Yu JX (2014) Mocgraph: scalable distributed graph processing using message online computing. Proc VLDB Endow 8(4):377–388
Copyright information
© 2019 Springer Nature Switzerland AG
About this entry
Cite this entry
Sun, P., Wen, Y. (2019). Scalable Architectures for Big Data Analysis. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_281
DOI: https://doi.org/10.1007/978-3-319-77525-8_281
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science, Reference Module Computer Science and Engineering