Optimizing Sort in Hadoop Using Replacement Selection

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2015)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9282))

Abstract

This paper presents and evaluates an alternative sorting component for Hadoop based on the replacement selection algorithm. Compared with the default quicksort-based implementation, replacement selection generates runs that are on average twice as large. This makes the merge phase more efficient, since the amount of data that can be merged in one pass also increases on average by a factor of two. For almost-sorted inputs, replacement selection can often sort an arbitrarily large file in a single pass, eliminating the need for a merge phase. We evaluate an implementation of replacement selection for MapReduce computations in the Hadoop framework and show that its performance is comparable to quicksort for random inputs, with substantial gains for inputs that are either almost sorted or require two merge passes with quicksort.
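The run-generation behaviour described in the abstract can be illustrated with a minimal in-memory sketch. This is not the authors' Hadoop component: the function names `replacement_selection_runs` and `load_sort_store_runs`, the integer keys, and the use of Python lists in place of spill files are all assumptions made for illustration, and the second function merely stands in for a "fill the buffer, sort it, write it out" strategy such as the quicksort-based default.

```python
import heapq
import random
from itertools import islice
from typing import Iterable, Iterator, List

_END = object()  # sentinel marking an exhausted input (hypothetical helper)


def replacement_selection_runs(records: Iterable[int], memory_size: int) -> Iterator[List[int]]:
    """Yield sorted runs while holding at most `memory_size` records in a heap."""
    source = iter(records)
    # Heap entries are (run_number, key). A record tagged with a later run
    # number sorts after every record still belonging to the current run.
    heap = [(0, key) for key in islice(source, memory_size)]
    heapq.heapify(heap)
    current_run, run = 0, []
    while heap:
        run_number, key = heapq.heappop(heap)
        if run_number != current_run:
            # Only next-run records remain in memory: close the current run.
            yield run
            current_run, run = run_number, []
        run.append(key)
        incoming = next(source, _END)
        if incoming is not _END:
            # A record smaller than the key just written cannot join the
            # current run any more, so it is deferred to the next one.
            tag = current_run if incoming >= key else current_run + 1
            heapq.heappush(heap, (tag, incoming))
    if run:
        yield run


def load_sort_store_runs(records: Iterable[int], memory_size: int) -> Iterator[List[int]]:
    """Baseline: fill the buffer, sort it, emit it as one run, repeat."""
    source = iter(records)
    while chunk := sorted(islice(source, memory_size)):
        yield chunk


def main() -> None:
    rng = random.Random(0)
    n, memory_size = 20_000, 500

    random_input = [rng.randrange(1_000_000) for _ in range(n)]
    # Almost-sorted input: a sorted sequence with 200 adjacent pairs swapped.
    almost_sorted = list(range(n))
    for i in rng.sample(range(0, n - 1, 2), 200):
        almost_sorted[i], almost_sorted[i + 1] = almost_sorted[i + 1], almost_sorted[i]

    for name, data in (("random", random_input), ("almost sorted", almost_sorted)):
        baseline = list(load_sort_store_runs(data, memory_size))
        selection = list(replacement_selection_runs(data, memory_size))
        print(f"{name}: {len(baseline)} baseline runs, {len(selection)} replacement selection runs")


if __name__ == "__main__":
    main()
```

With these parameters the baseline always produces 40 runs of 500 records each. On the random input, replacement selection should produce roughly half as many runs, and on the almost-sorted input it produces a single run, because no record is displaced by more than the buffer can absorb. Fewer and longer runs are what allow the subsequent merge to finish in fewer passes.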

Notes

  1. http://bitbucket.org/pmdusso/hadoop-replacement-selection-sort

Acknowledgements

We thank Renata Galante for her helpful comments and suggestions on earlier revisions of this paper.

Author information

Corresponding author

Correspondence to Pedro Martins Dusso.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Dusso, P.M., Sauer, C., Härder, T. (2015). Optimizing Sort in Hadoop Using Replacement Selection. In: Tadeusz, M., Valduriez, P., Bellatreche, L. (eds) Advances in Databases and Information Systems. ADBIS 2015. Lecture Notes in Computer Science, vol 9282. Springer, Cham. https://doi.org/10.1007/978-3-319-23135-8_25

  • DOI: https://doi.org/10.1007/978-3-319-23135-8_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23134-1

  • Online ISBN: 978-3-319-23135-8

  • eBook Packages: Computer Science (R0)
