
Exploring the Hamming Distance in Distributed Infrastructures for Similarity Search

  • Rodolfo da Silva Villaça (email author)
  • Rafael Pasquini
  • Luciano Bernardes de Paula
  • Maurício Ferreira Magalhães
Part of the Modeling and Optimization in Science and Technologies book series (MOST, volume 4)

Abstract

The amount of data available on the Internet now exceeds the zettabyte (ZB) scale, a condition known in the literature as Big Data. Although traditional databases are very efficient at finding and retrieving specific content, they are inefficient in Big Data scenarios, since the great majority of such data is unstructured and scattered across the Internet. New databases are therefore required to support similarity search. To handle this challenging scenario, this chapter proposes exploring the Hamming similarity that exists between content identifiers generated with the Random Hyperplane Hashing function. Such identifiers provide the basis for building distributed infrastructures that facilitate similarity search. We present two different approaches: a P2P solution (Hamming DHT) and a Data Center solution (HCube). The evaluations presented indicate that both are capable of improving recall in a similarity search.
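To make the core idea concrete, the following is a minimal sketch of how a Random Hyperplane Hashing identifier can be derived from a content vector and how two identifiers can be compared by Hamming distance. It is an illustration under stated assumptions, not the chapter's implementation: the function names (`make_hyperplanes`, `rhh_identifier`, `hamming_distance`, `hamming_similarity`), the 64-bit identifier length, the Gaussian sampling of hyperplanes, and the sample vectors are all hypothetical choices made for this example.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions.

    Assumption: components are sampled from a standard Gaussian, a common
    choice for random hyperplane hashing; the seed is fixed for repeatability.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def rhh_identifier(vector, hyperplanes):
    """Build an integer identifier with one bit per hyperplane.

    Each bit records on which side of the hyperplane the vector lies
    (sign of the dot product).
    """
    ident = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        ident = (ident << 1) | (1 if dot >= 0 else 0)
    return ident


def hamming_distance(a, b):
    """Number of bit positions in which two identifiers differ."""
    return bin(a ^ b).count("1")


def hamming_similarity(a, b, num_bits):
    """Fraction of bit positions in which two identifiers agree."""
    return 1.0 - hamming_distance(a, b) / num_bits


# Hypothetical example: two nearby vectors and one pointing elsewhere.
NUM_BITS = 64
planes = make_hyperplanes(NUM_BITS, dim=3)
id_a = rhh_identifier([1.0, 2.0, 3.0], planes)
id_b = rhh_identifier([1.1, 2.0, 2.9], planes)
id_c = rhh_identifier([-3.0, 1.0, -2.0], planes)
print(hamming_similarity(id_a, id_b, NUM_BITS))
print(hamming_similarity(id_a, id_c, NUM_BITS))
```

Vectors separated by a small angle tend to fall on the same side of most hyperplanes, so their identifiers differ in few bits. This is the property that lets a distributed infrastructure place similar content under identifiers that are close in Hamming distance.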

Keywords

Cosine Similarity; Distribute Hash Table; Vector Space Model; Gray Code; Space Fill Curve
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Rodolfo da Silva Villaça (1), email author
  • Rafael Pasquini (2)
  • Luciano Bernardes de Paula (3)
  • Maurício Ferreira Magalhães (4)

  1. Department of Computing and Electronics (DCEL), Federal University of Espírito Santo (UFES), São Mateus/ES, Brazil
  2. Faculty of Computing (FACOM), Federal University of Uberlândia (UFU), Uberlândia/MG, Brazil
  3. Federal Institute of Education, Science and Technology of São Paulo (IFSP), Bragança Paulista/SP, Brazil
  4. School of Computing and Electrical Engineering (FEEC), State University of Campinas (UNICAMP), Campinas/SP, Brazil
