An Efficient Document Indexing-Based Similarity Search in Large Datasets

  • Trong Nhan Phan
  • Markus Jäger
  • Stefan Nadschläger
  • Josef Küng
  • Tran Khanh Dang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9446)


In this paper, we propose a novel MapReduce-based approach for efficient similarity search in big data. Specifically, we address the drawbacks of using an inverted index in similarity search with MapReduce and then propose a simple yet efficient redundancy-free MapReduce scheme, which not only improves on the baseline inverted index-based procedures but also adapts to various similarity measures and types of similarity search. Additionally, we present further strategies that help eliminate unnecessary data and computations. Finally, we conduct extensive empirical evaluations with real massive datasets and the Hadoop framework on a cluster of commodity machines to verify the proposed methods; the promising results show how beneficial they are when dealing with big data.
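
For orientation, the kind of baseline inverted index-based procedure the abstract refers to can be sketched as follows. This is a minimal single-process illustration written for this summary, not the scheme proposed in the paper: the function names (map_index, group_by_key, map_pairs, pairwise_scores), the whitespace tokenizer, and the use of a raw term-frequency dot product as the similarity score are all assumptions, and the shuffle step is simulated in memory rather than run on Hadoop.

```python
from collections import defaultdict
from itertools import combinations


def map_index(doc_id, text):
    """Job 1, map: emit (term, (doc_id, term_frequency)) for one document."""
    counts = defaultdict(int)
    for term in text.lower().split():  # assumed tokenizer: whitespace split
        counts[term] += 1
    return [(term, (doc_id, tf)) for term, tf in counts.items()]


def group_by_key(pairs):
    """In-memory stand-in for the MapReduce shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_pairs(postings):
    """Job 2, map: for one term's posting list, emit a partial score
    for every pair of documents that contain the term."""
    out = []
    for (d1, tf1), (d2, tf2) in combinations(sorted(postings), 2):
        out.append(((d1, d2), tf1 * tf2))
    return out


def pairwise_scores(docs):
    """Build the inverted index, then sum partial scores per document pair."""
    index = group_by_key(
        pair for doc_id, text in docs.items() for pair in map_index(doc_id, text)
    )
    partials = group_by_key(
        pair for postings in index.values() for pair in map_pairs(postings)
    )
    # Job 2, reduce: add up the partial scores of each document pair.
    return {pair: sum(values) for pair, values in partials.items()}


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "big data search",
    "d2": "similarity search in big data",
    "d3": "hadoop cluster",
}
scores = pairwise_scores(docs)
```

In this sketch a document pair is emitted once for every term the two documents share, so the same pair travels through the shuffle several times before its score is summed. That repeated work is one plausible reading of the redundancy a redundancy-free scheme would aim to remove.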


Similarity search, Efficiency, MapReduce, Large datasets, Clustering, Filtering, Redundancy-free capability, Document indexing



Our sincere thanks go to Faruk Kujundžić, Scientific Computing, Information Management team, Johannes Kepler University Linz, for his kind support with the Alex Cluster.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Trong Nhan Phan (1)
  • Markus Jäger (1)
  • Stefan Nadschläger (1)
  • Josef Küng (1)
  • Tran Khanh Dang (2)
  1. Institute for Application Oriented Knowledge Processing, Johannes Kepler University Linz, Linz, Austria
  2. Faculty of Computer Science and Engineering, HCMC University of Technology, Ho Chi Minh City, Vietnam
