An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8648)

Abstract

The explosion of data has ushered in the era of big data and poses greater challenges than ever to traditional similarity search, which is used in a wide range of applications. Moreover, processing data at an unprecedented scale may be infeasible, or may paralyse systems, because of slow performance and high overheads. Coping with such unstoppable data growth paves the way not only for consolidating similarity search but also for new trends in data-intensive applications. Aiming at scalability, we propose an elastic approximate similarity search that works efficiently on very large datasets. Moreover, the proposed scheme adapts effectively to the well-known similarity search cases: pairwise documents, pivot document, range query, and k-nearest neighbour query. Last but not least, these methods, together with our filtering strategies, are implemented and verified by experiments on large real-world data collections in Hadoop, which show promising effectiveness and efficiency.
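
To make the query types named in the abstract concrete, the following is a minimal in-memory sketch of a range query and a k-nearest neighbour query against a pivot document, organised as a map step, a reduce step, and a candidate filter. It is not the authors' scheme: the Jaccard measure, the inverted-index filter, and all function names (`map_phase`, `reduce_phase`, `candidates`, `range_query`, `knn_query`) are assumptions chosen for illustration, and the code runs in a single process rather than on Hadoop.

```python
from collections import defaultdict
import heapq


def tokens(text):
    """Represent a document as a set of lower-cased terms (an assumption)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two term sets (assumed measure, not the paper's)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def map_phase(docs):
    """Map: emit one (term, doc_id) pair per distinct term of each document."""
    for doc_id, text in docs.items():
        for term in tokens(text):
            yield term, doc_id


def reduce_phase(pairs):
    """Reduce: group document ids by term, giving an inverted index."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return index


def candidates(index, query_terms):
    """Filter: keep only documents sharing at least one term with the pivot."""
    found = set()
    for term in query_terms:
        found |= index.get(term, set())
    return found


def score(docs, index, pivot_text):
    """Score every candidate document against the pivot document."""
    q = tokens(pivot_text)
    return [(d, jaccard(q, tokens(docs[d]))) for d in candidates(index, q)]


def range_query(docs, index, pivot_text, threshold):
    """Range query: all candidates with similarity >= threshold, best first."""
    hits = [(d, s) for d, s in score(docs, index, pivot_text) if s >= threshold]
    return sorted(hits, key=lambda x: (-x[1], x[0]))


def knn_query(docs, index, pivot_text, k):
    """k-NN query: the k most similar candidates, best first."""
    return heapq.nsmallest(k, score(docs, index, pivot_text),
                           key=lambda x: (-x[1], x[0]))


# Hypothetical toy collection, used only to exercise the functions.
docs = {
    "d1": "big data similarity search",
    "d2": "similarity search with mapreduce",
    "d3": "cooking pasta at home",
}
index = reduce_phase(map_phase(docs))
in_range = range_query(docs, index, "similarity search mapreduce", 0.5)
nearest = knn_query(docs, index, "similarity search mapreduce", 2)
```

In this sketch the filter discards `d3` before any similarity is computed, because it shares no term with the pivot. Pruning candidates early is the general idea behind filtering strategies, although the specific strategies used in the paper are not described in the abstract.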


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Phan, T.N., Küng, J., Dang, T.K. (2014). An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce. In: Hameurlain, A., Dang, T.K., Morvan, F. (eds) Data Management in Cloud, Grid and P2P Systems. Globe 2014. Lecture Notes in Computer Science, vol 8648. Springer, Cham. https://doi.org/10.1007/978-3-319-10067-8_5

  • DOI: https://doi.org/10.1007/978-3-319-10067-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10066-1

  • Online ISBN: 978-3-319-10067-8

  • eBook Packages: Computer Science, Computer Science (R0)
