A Sliding-Window Algorithm Implementation in MapReduce
A platform with limited resources may not be suited to processing a large volume of data. Distributed processing platforms address this problem by combining commodity hardware to process the data collaboratively. The MapReduce programming framework is one candidate for large-scale processing, and Hadoop is its open-source implementation. Hadoop consists of the Hadoop Distributed File System for storage and MapReduce for computation. However, the MapReduce framework does not allow computing nodes to share data during computation. In this paper, we present an implementation of a sliding-window algorithm that provides the data sharing needed for computations with data dependencies in MapReduce. The algorithm is designed to facilitate data processing in a sequential order, e.g., a moving average. It uses MapReduce job metadata, e.g., the input split size, to prepare the data shared between computing nodes without violating the MapReduce fault-tolerance mechanism.
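To make the idea concrete, the following is a minimal plain-Python sketch of a moving average computed over independent input splits. It is an illustration under stated assumptions, not the paper's implementation: it does not use the Hadoop API, the function names (`make_splits`, `map_split`, `moving_average_by_splits`) are hypothetical, and it assumes that each map task can read the `window - 1` records that precede its own split, with the split boundaries derived from the split size.

```python
from typing import List, Tuple


def make_splits(n_records: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) record offsets for each split, end exclusive."""
    return [(s, min(s + split_size, n_records))
            for s in range(0, n_records, split_size)]


def map_split(data: List[float], start: int, end: int,
              window: int) -> List[Tuple[int, float]]:
    """Simulate one map task (hypothetical helper, not a Hadoop call).

    The task reads its own split plus the (window - 1) records that
    precede it, so every window ending inside the split is complete.
    """
    lo = max(0, start - (window - 1))
    chunk = data[lo:end]  # the only data this task looks at
    results = []
    for i in range(start, end):
        j = i - lo  # position of record i inside the chunk
        if j < window - 1:
            continue  # not enough history (start of the dataset only)
        results.append((i, sum(chunk[j - window + 1:j + 1]) / window))
    return results


def moving_average_by_splits(data: List[float], split_size: int,
                             window: int) -> List[Tuple[int, float]]:
    """Run every simulated map task and concatenate the outputs in order."""
    output = []
    for start, end in make_splits(len(data), split_size):
        output.extend(map_split(data, start, end, window))
    return output


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    for index, average in moving_average_by_splits(values, split_size=4, window=3):
        print(index, average)
```

In this sketch each simulated task depends only on its own chunk of the input, so rerunning a single task reproduces the same output. That property is what would keep such a scheme compatible with task re-execution, although how the paper achieves it in Hadoop is not described in the abstract.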
Keywords: MapReduce, Hadoop, Data sharing, Moving average, Sequential algorithm
This work is supported and funded by Alberta Innovates Technology Futures (AITF), Calgary, AB, Canada. The authors would like to thank Alberta Health Services (AHS) and Calgary Laboratory Services (CLS), Calgary, Alberta, Canada, for endless logistics support.