Traverse: Simplified Indexing on Large Map-Reduce-Merge Clusters

Yang, Hung-chih; Parker, D. Stott

doi:10.1007/978-3-642-00887-0_27

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5463)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

The search engines that index the World Wide Web today use access methods based primarily on scanning, sorting, hashing, and partitioning (SSHP) techniques. The MapReduce framework is a distinguished example. Unlike DBMS, this search engine infrastructure provides few general tools for indexing user datasets. In particular, it does not include order-preserving tree indexes, even though they might have been built using such indexing components. Thus, data processing on these infrastructures is linearly scalable at best, while index-based techniques can be logarithmically scalable. DBMS have been using indexes to improve performance, especially on low-selectivity queries and joins. Therefore, it is natural to incorporate indexing into search-engine infrastructure.
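The contrast between linear and logarithmic scalability can be made concrete with a small sketch. It is not taken from the paper: the function names and the in-memory sorted key list are assumptions chosen for illustration, and the binary search merely stands in for an order-preserving tree index.

```python
def scan_lookup(sorted_keys, target):
    """Scan-based access: examine keys one by one; return (found, comparisons)."""
    comparisons = 0
    for key in sorted_keys:
        comparisons += 1
        if key == target:
            return True, comparisons
    return False, comparisons


def index_lookup(sorted_keys, target):
    """Index-style access: binary search over sorted keys; return (found, probes)."""
    lo, hi = 0, len(sorted_keys)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return found, probes


# Hypothetical dataset: 1,000,000 sorted even keys; look up the last one.
keys = list(range(0, 2_000_000, 2))
scan_result = scan_lookup(keys, 1_999_998)    # touches every key
index_result = index_lookup(keys, 1_999_998)  # needs at most 20 probes
```

On this input the scan touches all one million keys, while the binary search needs no more than about log2(1,000,000), roughly 20, probes.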

Recently, we proposed an extension of MapReduce called Map-Reduce-Merge to efficiently join heterogeneous datasets and execute relational algebra operations. Its vision was to extend search engine infrastructure so as to permit generic relational operations, expanding the scope of analysis of search engine content.

In this paper we advocate incorporating yet another database primitive, indexing, into search engine data processing. We explore ways to build tree indexes using Hadoop MapReduce. We also incorporate a new primitive, Traverse, into the Map-Reduce-Merge framework. It can efficiently traverse index files, select data partitions, and limit the number of input partitions for a follow-up step of map, reduce, or merge.
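The partition-selection idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `INDEX` layout, the partition names, and the `traverse` signature are all hypothetical, and a flat one-level index over sorted, non-overlapping partitions stands in for a tree index.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index: one (min_key, max_key, partition_name) entry per data
# partition, sorted by key range, with no overlap between partitions.
INDEX = [
    (0, 99, "part-00000"),
    (100, 199, "part-00001"),
    (200, 299, "part-00002"),
    (300, 399, "part-00003"),
]


def traverse(index, low, high):
    """Return the partitions whose key range intersects [low, high]."""
    min_keys = [entry[0] for entry in index]
    max_keys = [entry[1] for entry in index]
    first = bisect_left(max_keys, low)    # first partition with max_key >= low
    last = bisect_right(min_keys, high)   # one past the last with min_key <= high
    return [name for _, _, name in index[first:last]]


# Only the selected partitions would be passed to a follow-up map, reduce,
# or merge step; the others are never read.
selected = traverse(INDEX, 150, 210)
```

For the range query [150, 210], only `part-00001` and `part-00002` are selected, so a follow-up step would read two of the four partitions.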

Author information

Authors and Affiliations

UCLA Computer Science Department, USA
Hung-chih Yang & D. Stott Parker

Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, Brisbane, Australia
Xiaofang Zhou & Ke Deng
Tokyo Institute of Technology, Graduate School of Information Science and Engineering, 2-12-1 Oh-Okayama Meguro-ku, 152-8552, Tokyo, Japan
Haruo Yokota
CSIRO, Castray Esplanade, TAS 7000, Hobart, Australia
Qing Liu

About this paper

Cite this paper

Yang, Hc., Parker, D.S. (2009). Traverse: Simplified Indexing on Large Map-Reduce-Merge Clusters. In: Zhou, X., Yokota, H., Deng, K., Liu, Q. (eds) Database Systems for Advanced Applications. DASFAA 2009. Lecture Notes in Computer Science, vol 5463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00887-0_27

DOI: https://doi.org/10.1007/978-3-642-00887-0_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00886-3
Online ISBN: 978-3-642-00887-0
eBook Packages: Computer Science, Computer Science (R0)
