Abstract
A platform with limited resources may be unable to process a large volume of data. Distributed processing platforms address this problem by combining commodity hardware to process the data collaboratively. MapReduce is one candidate programming framework for large-scale processing, and Hadoop is its open-source implementation. Hadoop consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computation. However, the MapReduce framework does not allow computing nodes to share data during computation. In this paper, we present an implementation of a sliding-window algorithm that shares data between computing nodes to satisfy computation dependencies in MapReduce. The algorithm is designed to facilitate data processing in a sequential order, e.g., a moving average. It uses MapReduce job metadata, e.g., the input split size, to prepare the shared data between the computing nodes without violating the MapReduce fault-tolerance mechanism.
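The idea in the abstract can be illustrated outside Hadoop with a minimal sketch. It assumes a simple moving average and models each input split as a slice of an in-memory array; the class and method names (`SlidingWindowSketch`, `movingAverage`, `movingAverageBySplits`) are hypothetical and do not come from the chapter. Each split after the first is extended with the last `window - 1` records that precede it, which is the data a computing node would otherwise be missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SlidingWindowSketch {

    /** Reference result: simple moving average computed sequentially over the whole series. */
    public static List<Double> movingAverage(double[] data, int window) {
        List<Double> out = new ArrayList<>();
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
            if (i >= window) {
                sum -= data[i - window]; // drop the record that left the window
            }
            if (i >= window - 1) {
                out.add(sum / window);   // one average per full window
            }
        }
        return out;
    }

    /**
     * Same computation, done split by split. Every split is processed on its own,
     * but it first receives the (window - 1) records that precede it, so windows
     * that straddle a split boundary can still be completed.
     */
    public static List<Double> movingAverageBySplits(double[] data, int window, int splitSize) {
        List<Double> out = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            int from = Math.max(0, start - (window - 1)); // shared records from earlier data
            double[] chunk = Arrays.copyOfRange(data, from, end);
            out.addAll(movingAverage(chunk, window));
        }
        return out;
    }

    public static void main(String[] args) {
        double[] series = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(movingAverage(series, 3));
        System.out.println(movingAverageBySplits(series, 3, 3));
    }
}
```

Both calls in `main` produce the same list, which is the property the record-sharing step is meant to preserve. Without the shared records, every window that crosses a split boundary would be lost.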
Acknowledgments
This work is supported and funded by Alberta Innovates Technology Futures (AITF), Calgary, AB, Canada. The authors would like to thank Alberta Health Services (AHS) and Calgary Laboratory Services (CLS), Calgary, Alberta, Canada, for endless logistics support.
Appendices
Appendix A: Java Class for Record Sharing—Mapper Class
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class RShareMap extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Job metadata: configured split size plus the length and byte offset of this split
        Configuration conf = context.getConfiguration();
        int SplitSize = Integer.parseInt(conf.get("SplitSize"));
        InputSplit split = context.getInputSplit();
        long SplitLength = ((FileSplit) split).getLength();
        long SplitOffset = ((FileSplit) split).getStart();

        // Split the lines so you can select the element you want to copy forward
        String line = value.toString();
        System.out.println(line.length());
        // Split up each element of the line
        String[] elements = line.split("\n");
        int window = 2; // number of records to share with the next map
        int size = 6;   // one character size
        Text outputValue = new Text();
        LongWritable OutKey = new LongWritable();

        long idx1 = SplitLength - (window - 1) * size; // used for first split only (SplitOffset = 0)
        long idx2 = idx1 + (window - 1) * size;        // used for first split only (SplitOffset = 0)

        // First split values
        if ((SplitOffset == 0) && (key.get() >= idx1) && (key.get() <= idx2)) {
            // Calculate the new space for the copied-forward element (first split only)
            int temp = (int) Math.log10(((window - 1) + SplitLength) * size);
            temp = temp + 1;
            int Space = (int) Math.pow(10, temp);
            outputValue.set(elements[0]);
            long nkey = (long) (SplitLength * Space);
            nkey = nkey + key.get();
            OutKey.set(nkey);
            context.write(OutKey, outputValue);
        } // End of first split values

        // Subsequent splits except the last one
        long nkey = key.get();
        int x = (int) Math.log10(((window - 1) + SplitLength) * size);
        x = x + 1;
        int NewSpace = (int) Math.pow(10, x);
        long SplitNew = SplitOffset * NewSpace;
        if (SplitOffset > 0) {
            long idx3 = SplitOffset + SplitLength - (window - 1) * size; // used for any split other than SplitOffset = 0
            long idx4 = idx3 + (window - 1) * size;                      // used for any split other than SplitOffset = 0
            // A record is copied forward only from full-sized splits, which excludes the last one
            boolean test = (nkey > idx3) && (nkey <= idx4) && (SplitLength == SplitSize);
            if (test) {
                outputValue.set(elements[0]);
                long tempkey = (SplitOffset + SplitLength) * NewSpace;
                long tKey = tempkey + nkey;
                OutKey.set(tKey);
                context.write(OutKey, outputValue);
            }
        }

        // Every record is also emitted under a key in its own split's key range
        long newKey = SplitNew + key.get();
        context.write(new LongWritable(newKey), value);
    }
}
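The key arithmetic in Appendix A can be isolated into a small sketch. The helper names (`keySpace`, `regularKey`, `sharedKey`) are hypothetical and introduced here only for illustration; the formulas are copied from the mapper, and the sample values (split length 12, window 2, size 6) are the constants that appear in the appendices. The sketch shows only how the keys are composed, not which records the mapper selects for sharing.

```java
public class KeySpaceSketch {

    /** Power of ten used as a multiplier so each split gets its own key range. */
    public static long keySpace(long splitLength, int window, int size) {
        int digits = (int) Math.log10(((window - 1) + splitLength) * size) + 1;
        return (long) Math.pow(10, digits);
    }

    /** Key of a record emitted in the key range of its own split. */
    public static long regularKey(long splitOffset, long space, long byteOffset) {
        return splitOffset * space + byteOffset;
    }

    /** Key of a record copied forward into the key range of the following split. */
    public static long sharedKey(long splitOffset, long splitLength, long space, long byteOffset) {
        return (splitOffset + splitLength) * space + byteOffset;
    }

    public static void main(String[] args) {
        long space = keySpace(12, 2, 6);
        System.out.println("key space multiplier: " + space);
        // A record at byte offset 18 that belongs to the split starting at offset 12
        System.out.println("regular key: " + regularKey(12, space, 18));
        // A record at byte offset 6 of the first split, copied forward to the next split
        System.out.println("shared key:  " + sharedKey(0, 12, space, 6));
    }
}
```

With these values the multiplier is 100, so the copied record receives key 1206 and falls into the same 12xx range as the records of the split that starts at offset 12 (for example 1218). After the shuffle sorts by key, the shared record therefore sits next to the records that depend on it.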
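Appendix B: Java Class for Record Sharing—Reducer Class
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RShareReduce extends Reducer<LongWritable, Text, LongWritable, Text> {

    // values must be Iterable<Text>; with a single Text parameter the method does not
    // override Reducer.reduce and Hadoop runs the default identity reducer instead.
    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long k1 = key.get() + 100;
        for (Text value : values) {
            context.write(new LongWritable(k1), value);
        }
    }
}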
Appendix C: Java Class for Record Sharing—Driver Class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class RShareDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use programArgs array to retrieve program arguments.
        String[] programArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // Split size in bytes, read back by the mapper
        conf.set("SplitSize", "12");

        Job job = Job.getInstance(conf); // replaces the deprecated new Job(conf)
        job.setJarByClass(RShareDriver.class);
        job.setMapperClass(RShareMap.class);
        job.setReducerClass(RShareReduce.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setMaxInputSplitSize(job, 12); // change the max split
        FileInputFormat.setMinInputSplitSize(job, 12); // change the min split

        // Input path of the MapReduce job
        FileInputFormat.addInputPath(job, new Path(programArgs[0]));
        // Output directory of the MapReduce job
        FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));

        // Submit the job and wait for it to finish.
        job.waitForCompletion(true);
        // Submit and return immediately:
        // job.submit();
    }
}
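The driver pins both the minimum and maximum split size to 12 bytes, and the mapper only copies records forward from splits whose length equals that size. The sketch below is a simplified model, not Hadoop code, of how a file could be cut into splits under that setting. It assumes the 10% slack rule that Hadoop's `FileInputFormat` applies to the final split, and it ignores block boundaries. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLayoutSketch {

    // Assumption: the last split may be up to 10% larger than the split size
    static final double SPLIT_SLOP = 1.1;

    /** Returns {start, length} pairs for a file of the given length in bytes. */
    public static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> out = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            out.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // Whatever is left becomes the last split, which may differ from splitSize
            out.add(new long[] {fileLength - remaining, remaining});
        }
        return out;
    }

    public static void main(String[] args) {
        for (long[] s : splits(30, 12)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

For a 30-byte file this model yields splits starting at 0, 12, and 24, the last with length 6. Only the last split has a length different from the configured size, which is consistent with the mapper using `SplitLength == SplitSize` to skip it when copying records forward.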
Copyright information
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Mohammed, E.A., Naugler, C.T., Far, B.H. (2018). A Sliding-Window Algorithm Implementation in MapReduce. In: Moshirpour, M., Far, B., Alhajj, R. (eds) Applications of Data Management and Analysis. Lecture Notes in Social Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-95810-1_6
DOI: https://doi.org/10.1007/978-3-319-95810-1_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-95809-5
Online ISBN: 978-3-319-95810-1
eBook Packages: Business and Management (R0)