Similarity Search on Massive Data Based on FPGA

  • Yanzheng Wang
  • Hong Gao
  • Shengfei Shi
  • Hongzhi Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9645)


Data quality is a central concern in massive data processing. When distilling valuable knowledge from a large dataset, the key question is whether the dataset is clean, so the data should be cleaned before any useful information is extracted from it. Similarity search is an important method in data cleaning. In our data cleaning system, similarity search is performed with MapReduce, but its efficiency is very low: when massive data stored in HDFS is processed with the MapReduce programming model, every part of the dataset is scanned, which is very time-consuming, especially for large-scale datasets. In this paper, we filter the original data in hardware, on an FPGA, before applying similarity search for data cleaning.
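The filter-then-search idea can be illustrated with a minimal software sketch. It is not the paper's FPGA design: the filter predicate (a keyword match), the Jaccard similarity measure, and the names `prefilter`, `similar_pairs`, `tokens`, and `jaccard` are all assumptions made for illustration. The sketch only shows why discarding irrelevant records in a cheap first pass reduces the work left for the more expensive all-pairs similarity search.

```python
from itertools import combinations


def tokens(record: str) -> frozenset:
    """Split a record into a set of lowercase word tokens."""
    return frozenset(record.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefilter(records, keyword):
    """Cheap single-pass filter: keep only records containing `keyword`.

    Stands in for the hardware filter stage; the predicate is an assumption.
    """
    needle = keyword.lower()
    return [r for r in records if needle in r.lower()]


def similar_pairs(records, threshold):
    """Brute-force all-pairs similarity search over the surviving records."""
    sets = [tokens(r) for r in records]
    return [
        (records[i], records[j])
        for i, j in combinations(range(len(records)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


if __name__ == "__main__":
    data = [
        "Harbin Institute of Technology",
        "harbin institute technology",
        "MIT",
        "Harbin Engineering University",
    ]
    kept = prefilter(data, "harbin")  # drops records that cannot be relevant
    print(similar_pairs(kept, 0.6))   # near-duplicate candidates for cleaning
```

Because the all-pairs step is quadratic in the number of records it receives, every record removed by the filter saves a comparison against every remaining record.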


Data cleaning, FPGA, Similarity search, MapReduce



This paper was partially supported by National Sci-Tech Support Plan 2015BAH10F01 and NSFC grants U1509216, 61472099, and 61133002.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Yanzheng Wang (1)
  • Hong Gao (1)
  • Shengfei Shi (1)
  • Hongzhi Wang (1)
  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
