Abstract
Today’s lightning-fast generation of data from massive sources, together with advanced data analytics, has made it possible to mine information from big data. We have witnessed the success of many big data applications. For example, Amazon uses its massive historical shipment tracking data to recommend goods to targeted customers, and Google uses billions of search queries to predict flu trends, sometimes one week earlier than the Centers for Disease Control and Prevention (CDC).
Notes
1. Note that the value is not reported, so the information the master node receives for each item requires only a small amount of space.
2. Here we implicitly assume that each pair represents a workload of unit size, but our algorithm also works readily for variable integer workload weights.
© 2015 The Author(s)
About this chapter
Cite this chapter
Wang, D., Han, Z. (2015). Application on Big Data Processing. In: Sublinear Algorithms for Big Data Applications. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-20448-2_4
DOI: https://doi.org/10.1007/978-3-319-20448-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20447-5
Online ISBN: 978-3-319-20448-2
eBook Packages: Computer Science (R0)