Abstract
The explosion of data has ushered in the era of big data and poses greater challenges than ever to traditional similarity search, which is used in a wide range of applications. Processing data at an unprecedented scale may be infeasible, or may paralyse systems, because of slow performance and high overheads. Coping with such unstoppable data growth paves the way not only for consolidating similarity search but also for new trends in data-intensive applications. Aiming at scalability, we propose an elastic approximate similarity search that works efficiently on very large datasets. Moreover, the proposed scheme adapts effectively to the well-known similarity search cases: pairwise documents, pivot document, range query, and k-nearest neighbour query. Finally, these methods, together with our filtering strategies, are implemented and verified by experiments on large real-world data collections in Hadoop, which show promising effectiveness and efficiency.
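To make the four query types named in the abstract concrete, the following is a minimal single-machine sketch. It is not the authors' method: it uses exact brute-force comparison with no MapReduce jobs and no filtering, and the Jaccard measure over whitespace tokens, the function names, and the toy corpus are all assumptions introduced for illustration only.

```python
from itertools import combinations


def tokenize(text):
    """Split text into a set of lowercase tokens (assumed representation)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise(docs, threshold):
    """Pairwise documents: every pair whose similarity reaches the threshold."""
    return [(i, j) for i, j in combinations(sorted(docs), 2)
            if jaccard(docs[i], docs[j]) >= threshold]


def pivot(docs, pivot_id, threshold):
    """Pivot document: documents similar to one document of the collection."""
    return [i for i in sorted(docs)
            if i != pivot_id and jaccard(docs[i], docs[pivot_id]) >= threshold]


def range_query(docs, query, threshold):
    """Range query: documents whose similarity to an external query reaches the threshold."""
    return [i for i in sorted(docs) if jaccard(docs[i], query) >= threshold]


def knn_query(docs, query, k):
    """k-nearest neighbour query: the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda i: (-jaccard(docs[i], query), i))
    return ranked[:k]


# Hypothetical toy corpus, for illustration only.
docs = {
    "d1": tokenize("big data similarity search"),
    "d2": tokenize("similarity search in big data"),
    "d3": tokenize("hadoop cluster setup"),
}
query = tokenize("similarity search")

print(pairwise(docs, 0.5))
print(pivot(docs, "d1", 0.5))
print(range_query(docs, query, 0.45))
print(knn_query(docs, query, 2))
```

The brute-force pairwise case compares every pair of documents, which is quadratic in the collection size. That cost is the kind of overhead the abstract refers to when it motivates a scalable, approximate scheme with filtering.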
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Phan, T.N., Küng, J., Dang, T.K. (2014). An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce. In: Hameurlain, A., Dang, T.K., Morvan, F. (eds) Data Management in Cloud, Grid and P2P Systems. Globe 2014. Lecture Notes in Computer Science, vol 8648. Springer, Cham. https://doi.org/10.1007/978-3-319-10067-8_5
DOI: https://doi.org/10.1007/978-3-319-10067-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10066-1
Online ISBN: 978-3-319-10067-8
eBook Packages: Computer Science, Computer Science (R0)