Abstract
Data warehousing has been a topic of intense research for the past few years. A data warehouse primarily serves as a central repository for data arriving from disparate sources. Generally, fresh data is loaded into the central repository in disconnected mode through batch processing. Hence, the data available in the central warehouse may not be current. Such stale data is of little use to most commercial real-time applications, such as real-time transport monitoring, smart cities, the semantic web, online transaction processing, and sensor networks. To fully realize these applications, fresh data needs to be readily available for critical decision making. In particular, they demand fast, real-time accumulation of data from diverse sources into the main data warehouse. This paper focuses on maintaining consistency and providing real-time data updates in the data warehouse. Specifically, it targets the detection of duplicates in a streaming environment with a limited amount of memory. For this purpose, it employs a probabilistic data structure called the Bloom filter. The Bloom filter sets bits in a bit array when information is added to the data warehouse. This technique gives nearly 100% accurate results with almost no false positives; the worst-case error rate is 0.01%. For the implementation, a data structure called the time frame Bloom filter (TBF), which is essentially a bitmap of information, is used. Using this method, one can insert, update, delete, and search message data in the data warehouse very quickly. To make the approach scalable, more than one Bloom filter can be added to address inconsistency issues.
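As an illustration of the mechanism the abstract describes (setting bits on insertion and testing them to detect duplicates in a stream), the following is a minimal sketch of a plain Bloom filter. It is not the paper's time frame Bloom filter, whose details are not given here. The names `BloomFilter` and `drop_duplicates`, the use of SHA-256 with double hashing, and the sample records are assumptions made for the example. The `error_rate` of 0.0001 simply mirrors the 0.01% figure quoted above, and the sizing uses the standard Bloom filter formulas.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter (illustrative only, not the paper's TBF)."""

    def __init__(self, expected_items: int, error_rate: float) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = math.ceil(
            -expected_items * math.log(error_rate) / math.log(2) ** 2
        )
        self.num_hashes = max(
            1, round(self.num_bits / expected_items * math.log(2))
        )
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing (an assumed choice): derive k bit positions
        # from two 64-bit slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True only if all derived bits are set ("possibly seen before").
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def drop_duplicates(stream, bloom):
    """Yield records the filter has not seen yet, marking each as seen."""
    for record in stream:
        if record not in bloom:
            bloom.add(record)
            yield record


# Hypothetical message identifiers arriving from source systems.
bloom = BloomFilter(expected_items=1000, error_rate=0.0001)
incoming = ["msg-1", "msg-2", "msg-1", "msg-3", "msg-2"]
unique = list(drop_duplicates(incoming, bloom))
```

A membership test on a plain Bloom filter never misses an item that was added, but it can occasionally report an unseen item as present, so a record flagged as a duplicate is only probably one. A plain filter also cannot remove items; deletion generally requires a variant that keeps counters instead of single bits.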
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Rizwan, S., Adil, S.H., Islam, N. (2020). Detecting Duplicates in Real-Time Data Warehouse Using Bloom Filter-Based Approach. In: Bajwa, I., Sibalija, T., Jawawi, D. (eds) Intelligent Technologies and Applications. INTAP 2019. Communications in Computer and Information Science, vol 1198. Springer, Singapore. https://doi.org/10.1007/978-981-15-5232-8_65
DOI: https://doi.org/10.1007/978-981-15-5232-8_65
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-5231-1
Online ISBN: 978-981-15-5232-8
eBook Packages: Computer Science, Computer Science (R0)