Detecting Duplicates in Real-Time Data Warehouse Using Bloom Filter-Based Approach

  • Conference paper
Intelligent Technologies and Applications (INTAP 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1198)

Abstract

Data warehousing has been a topic of intense research for the past few years. A data warehouse serves primarily as a central repository that collects data from disparate sources. Generally, fresh data is loaded into the central repository in disconnected mode through batch processing, so the data available in the warehouse may not be real-time. Such stale data is of little use to most commercial real-time applications, such as real-time transport monitoring, smart cities, the semantic web, online transaction processing and sensor networks. To fully realize these applications, fresh data needs to be readily available for critical decision making; in particular, they demand real-time, rapid accumulation of data from diverse sources into the main data warehouse. This paper focuses on maintaining consistency and providing real-time data updates in a data warehouse. In particular, it targets the detection of duplicates in a streaming environment with a limited amount of memory. For this purpose, it employs a Bloom filter, which sets bits in a bit array when information is added to the data warehouse. The paper reports that this technique gives nearly 100% accurate results with almost no false positives; the error rate in the worst-case scenario is 0.01%. For implementation, a data structure called the time frame Bloom filter (TBF), which is essentially a bit map of the information, is used. Using this method, one can insert, update, delete and search message data in the data warehouse very quickly. To make the Bloom filter scalable, one can also add more than one Bloom filter to address inconsistency issues.
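
To make the bit-setting idea in the abstract concrete, the following is a minimal sketch of a plain Bloom filter used to drop duplicate keys from a stream. It is not the paper's time frame Bloom filter (TBF): the class name, the `deduplicate` helper, the SHA-256 double-hashing scheme and the sizing inputs are all assumptions made for illustration. A plain Bloom filter of this kind supports only insert and search, not the update and delete operations the abstract attributes to the TBF.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter (illustrative sketch, not the paper's TBF)."""

    def __init__(self, expected_items: int, error_rate: float) -> None:
        # Standard sizing formulas for n items and target false-positive rate p:
        #   m = -n * ln(p) / (ln 2)^2   bits
        #   k = (m / n) * ln 2          hash functions
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(error_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(
            1, round(self.num_bits / expected_items * math.log(2))
        )
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing (an assumed scheme): derive k bit positions from
        # two 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        # Set the bit at every position derived from the key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True only if every derived bit is set. This can be a false
        # positive, but an added key is never reported as absent.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def deduplicate(stream, bloom: BloomFilter):
    """Yield each key the filter has not (probably) seen before."""
    for key in stream:
        if key in bloom:
            continue  # probably a duplicate; may rarely be a false positive
        bloom.add(key)
        yield key


if __name__ == "__main__":
    # 0.0001 corresponds to the 0.01% worst-case error rate quoted above.
    bloom = BloomFilter(expected_items=1000, error_rate=0.0001)
    print(list(deduplicate(["a", "b", "a", "c", "b"], bloom)))
```

Under these assumptions, sizing for 1,000 keys at a 0.01% target rate needs roughly 19,000 bits (about 2.4 KB), which illustrates why the structure suits a memory-constrained streaming setting.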

Author information

Correspondence to Noman Islam.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Rizwan, S., Adil, S.H., Islam, N. (2020). Detecting Duplicates in Real-Time Data Warehouse Using Bloom Filter-Based Approach. In: Bajwa, I., Sibalija, T., Jawawi, D. (eds) Intelligent Technologies and Applications. INTAP 2019. Communications in Computer and Information Science, vol 1198. Springer, Singapore. https://doi.org/10.1007/978-981-15-5232-8_65

  • DOI: https://doi.org/10.1007/978-981-15-5232-8_65

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-5231-1

  • Online ISBN: 978-981-15-5232-8

  • eBook Packages: Computer Science (R0)
