Abstract
This paper presents and evaluates an alternative sorting component for Hadoop based on the replacement selection algorithm. Compared with the default quicksort-based implementation, replacement selection generates runs that are on average twice as large. This makes the merge phase more efficient, since the amount of data that can be merged in one pass also doubles on average. For almost-sorted inputs, replacement selection can often sort an arbitrarily large file in a single pass, eliminating the need for a merge phase. We implement replacement selection for MapReduce computations in the Hadoop framework and show that its performance is comparable to quicksort for random inputs, with substantial gains for inputs that are either almost sorted or require two merge passes with quicksort.
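The run-generation behaviour described in the abstract can be illustrated with a minimal sketch of textbook replacement selection. This is not the paper's Hadoop component: the function name `replacement_selection`, the parameter `memory_size` (a memory budget counted in records rather than bytes), and the use of plain integer keys are all assumptions made for illustration.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List

_END = object()  # sentinel marking the end of the input


def replacement_selection(records: Iterable[int], memory_size: int) -> Iterator[List[int]]:
    """Yield sorted runs while holding at most `memory_size` records in a heap.

    Illustrative sketch only; names and record-count memory model are assumptions.
    """
    if memory_size < 1:
        raise ValueError("memory_size must be at least 1")
    it = iter(records)
    # Heap entries are (run_number, key). A record tagged with a later run
    # number always sorts after every record of the current run.
    heap = [(0, key) for key in islice(it, memory_size)]
    heapq.heapify(heap)

    current_run, run = 0, []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no != current_run:
            # Only next-run records remain: close the current run.
            yield run
            current_run, run = run_no, []
        run.append(key)
        nxt = next(it, _END)
        if nxt is not _END:
            # A key smaller than the one just written cannot extend the
            # current run, so it is deferred to the next run.
            target = current_run if nxt >= key else current_run + 1
            heapq.heappush(heap, (target, nxt))
    if run:
        yield run
```

Under these assumptions, an already sorted input such as `range(100)` with `memory_size=4` comes out as a single run, which is the case where no merge phase is needed. A strictly descending input is the worst case: every run has exactly `memory_size` records, the same as sorting one memory load at a time.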
Acknowledgements
We thank Renata Galante for her helpful comments and suggestions on earlier revisions of this paper.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Dusso, P.M., Sauer, C., Härder, T. (2015). Optimizing Sort in Hadoop Using Replacement Selection. In: Tadeusz, M., Valduriez, P., Bellatreche, L. (eds) Advances in Databases and Information Systems. ADBIS 2015. Lecture Notes in Computer Science, vol 9282. Springer, Cham. https://doi.org/10.1007/978-3-319-23135-8_25
DOI: https://doi.org/10.1007/978-3-319-23135-8_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23134-1
Online ISBN: 978-3-319-23135-8
eBook Packages: Computer Science, Computer Science (R0)