Document Nearest Neighbors Query Based on Pairwise Similarity with MapReduce

  • Peipei Lv
  • Peng Yang
  • Yong-Qiang Dong
  • Liang Gu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11334)


With the continuous development of Web technology, many Internet issues have evolved into Big Data problems, characterized by volume, variety, velocity and variability. Among them, how to organize large numbers of web pages and retrieve the information needed is a critical one. An important notion is document classification, in which nearest neighbors query is the key issue to be solved. Most parallel nearest neighbors query methods compute the Cartesian product of the training set and the testing set, which results in poor time efficiency. In this paper, two methods are proposed for document nearest neighbors query based on pairwise similarity: brute-force and pre-filtering. The brute-force method consists of two phases (copying and filtering) and runs as a single MapReduce job. In order to obtain the nearest neighbors of each document, each document pair is copied twice and all generated records are shuffled. However, the time efficiency of the shuffle is sensitive to the number of intermediate results. To reduce the intermediate results, pre-filtering is proposed for nearest neighbors query based on pairwise similarity. Since only the top-k neighbors are output for each document, the number of records shuffled in pre-filtering stays at the same order of magnitude as the input size. Additionally, a detailed theoretical analysis is provided. The performance of the algorithms is demonstrated by experiments on a real-world dataset.
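The two methods summarized above can be illustrated with a small single-process simulation. This is only a sketch based on the abstract, not the authors' implementation: the function names (`copy_pairs`, `top_k`, `brute_force`, `pre_filtering`), the representation of the input as a list of splits holding `(doc_a, doc_b, similarity)` triples, and the assumption that each pair appears in exactly one split are all hypothetical. Pre-filtering is modeled here as keeping the local top-k per document inside each split before the shuffle, which is one plausible reading of the abstract.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str, float]            # (doc_a, doc_b, similarity), hypothetical layout
Record = Tuple[str, Tuple[str, float]]   # (doc, (neighbor, similarity))
Neighbors = Dict[str, List[Tuple[str, float]]]


def copy_pairs(pairs: Iterable[Pair]) -> List[Record]:
    """Copying phase: emit each pair twice, once keyed by each document."""
    records: List[Record] = []
    for a, b, sim in pairs:
        records.append((a, (b, sim)))
        records.append((b, (a, sim)))
    return records


def top_k(records: Iterable[Record], k: int) -> Neighbors:
    """Filtering phase: group by document and keep the k most similar neighbors."""
    groups = defaultdict(list)
    for doc, neighbor in records:
        groups[doc].append(neighbor)
    # Ties on similarity are broken by neighbor id so the result is deterministic.
    return {
        doc: heapq.nlargest(k, ns, key=lambda n: (n[1], n[0]))
        for doc, ns in groups.items()
    }


def brute_force(splits: List[List[Pair]], k: int) -> Tuple[Neighbors, int]:
    """Shuffle every copied record, then filter on the reduce side."""
    shuffled = [rec for split in splits for rec in copy_pairs(split)]
    return top_k(shuffled, k), len(shuffled)


def pre_filtering(splits: List[List[Pair]], k: int) -> Tuple[Neighbors, int]:
    """Keep only the local top-k per document in each split before shuffling."""
    shuffled: List[Record] = []
    for split in splits:
        local = top_k(copy_pairs(split), k)
        for doc, neighbors in local.items():
            shuffled.extend((doc, n) for n in neighbors)
    return top_k(shuffled, k), len(shuffled)


if __name__ == "__main__":
    splits = [
        [("d1", "d2", 0.9), ("d1", "d3", 0.4), ("d1", "d4", 0.7)],
        [("d2", "d3", 0.8), ("d2", "d4", 0.1), ("d3", "d4", 0.5)],
    ]
    print(brute_force(splits, k=1))
    print(pre_filtering(splits, k=1))
```

Under these assumptions both functions return the same neighbors, because any global top-k neighbor of a document is also among the top-k of the split that contains it; the only difference is how many records cross the simulated shuffle.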


Keywords: Nearest neighbors query, Pairwise similarity, Time efficiency



This work is supported by the National Science Foundation of China under grants No. 61472080, No. 61672155, and No. 61272532; the Consulting Project of the Chinese Academy of Engineering under grant 2018-XY-07; the National High Technology Research and Development Program (863 Program) of China under grant No. 2013AA013503; and the Collaborative Innovation Center of Novel Software Technology and Industrialization.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Peipei Lv (1, 2)
  • Peng Yang (1, 2, corresponding author)
  • Yong-Qiang Dong (1, 2)
  • Liang Gu (1, 2)
  1. School of Computer Science and Engineering, Southeast University, Nanjing, China
  2. Key Laboratory of Computer Network and Information Integration, Southeast University, Ministry of Education, Nanjing, China
