An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce

Phan, Trong Nhan; Küng, Josef; Dang, Tran Khanh

doi:10.1007/978-3-319-10067-8_5

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8648)

Abstract

The explosive growth of data has ushered in the era of big data and poses greater challenges than ever to traditional similarity search, which is used in a wide range of applications. Processing data at this unprecedented scale may be infeasible, or may paralyse systems through slow performance and high overheads. Coping with such relentless data growth paves the way not only for the consolidation of similarity search but also for new trends in data-intensive applications. Aiming at scalability, we propose an elastic approximate similarity search that works efficiently on very large datasets. Our scheme adapts to the well-known similarity search variants: pairwise document search, pivot document search, range queries, and k-nearest neighbour queries. These methods, together with our filtering strategies, are implemented in Hadoop and verified by experiments on large real-world data collections, which show promising effectiveness and efficiency.

Author information

Authors and Affiliations

FAW Institute, Johannes Kepler University Linz, Austria
Trong Nhan Phan & Josef Küng
HCMC University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang

Editor information

Editors and Affiliations

Paul Sabatier University, IRIT, 118, route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain & Franck Morvan
HCMC University of Technology, 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
Tran Khanh Dang

About this paper

Cite this paper

Phan, T.N., Küng, J., Dang, T.K. (2014). An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce. In: Hameurlain, A., Dang, T.K., Morvan, F. (eds) Data Management in Cloud, Grid and P2P Systems. Globe 2014. Lecture Notes in Computer Science, vol 8648. Springer, Cham. https://doi.org/10.1007/978-3-319-10067-8_5

DOI: https://doi.org/10.1007/978-3-319-10067-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10066-1
Online ISBN: 978-3-319-10067-8
eBook Packages: Computer Science, Computer Science (R0)