A Bloom Filter-Based Data Deduplication for Big Data

Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 38)

Abstract

Big data is growing at an unprecedented rate, with text data making up a large share, and redundancy is a technique used to ensure the availability of this data. The rapid growth of unstructured text data hinders the primary purpose of big data by making the data difficult to store and search. Data compression is one way to optimize the use of storage space for big data, and deduplication is one of the most useful compression techniques. This paper proposes a two-phase data deduplication mechanism for text data. In the syntactic phase, a combination of clustering and a Bloom filter is used. In the semantic phase, a combination of singular value decomposition (SVD) and WordNet synsets is employed. Experimental results show the efficacy of the proposed system.
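
To illustrate the Bloom filter part of the syntactic phase, the following is a minimal Python sketch of exact-duplicate detection over normalized text. It is not the authors' implementation: the class `BloomFilter`, the helpers `normalize` and `deduplicate`, the filter size, the number of hash functions, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration, and the clustering step and the semantic phase (SVD and WordNet) are omitted.

```python
import hashlib


class BloomFilter:
    """Hypothetical fixed-size Bloom filter (sizes are illustrative assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants share one key."""
    return " ".join(text.lower().split())


def deduplicate(documents, bloom=None):
    """Keep the first occurrence of each normalized document."""
    if bloom is None:
        bloom = BloomFilter()
    unique = []
    for doc in documents:
        key = normalize(doc)
        if key in bloom:
            continue  # probable duplicate; a false positive is possible
        bloom.add(key)
        unique.append(doc)
    return unique


docs = [
    "Big data is growing.",
    "big   data is GROWING.",
    "Deduplication saves space.",
]
unique_docs = deduplicate(docs)
```

Because a Bloom filter can report false positives but never false negatives, a sketch like this may occasionally discard a document that is not a duplicate. A real system would typically confirm a hit with an exact comparison before dropping anything.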

Keywords

Deduplication, Bloom Filter, Clustering, SVD, WordNet

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Institute of Engineering and Management, Kolkata, India
  2. DIST, CEG, Anna University, Chennai, India
