Abstract
As data volumes grow every day, managing storage devices to keep pace with this explosive growth of digital data has become very difficult. Deduplication plays a crucial role in removing redundancy from large-scale cluster storage. Existing deduplication approaches based on overlapping algorithms are inefficient in many situations: they consume a large amount of memory and processing time. Real-time data is frequently incomplete or inconsistent, lacks certain behaviors or trends, and often contains significant errors. In the deduplication process, data pre-processing transforms raw data into a comprehensible format that is easy to analyze for duplicates. Data deduplication clusters have therefore been adopted in data storage systems for records and data backup. Most researchers in this field focus on deduplication clusters that reduce replicated data in order to free server memory, and the pattern-matching deduplication clustering process is especially popular. This chapter discusses the overlapping algorithm and how the proposed multi-level pattern-matching algorithm (MLPMA) performs deduplication on large amounts of data with higher efficiency. The technique combines similarity with locality by applying a Bloom filter to the deduplication cluster, which exploits data redundancy to remove duplicates efficiently. As a result, the technique improves both the data deduplication ratio and throughput. The evaluations show that the proposed deduplication method achieves excellent performance.
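The abstract does not give the details of MLPMA, so the following is not the authors' algorithm. It is only a minimal, generic sketch of how a Bloom filter can sit in front of a fingerprint index during deduplication, so that chunks the filter reports as definitely new skip the index lookup. The names `BloomFilter` and `deduplicate`, the filter size, the number of hash functions, and the use of SHA-256 fingerprints are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter (illustrative; sizes are arbitrary assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # False means "definitely not added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def deduplicate(chunks):
    """Store each distinct chunk once and return (store, recipe).

    store  maps fingerprint -> chunk bytes (stands in for the chunk index).
    recipe lists fingerprints in input order, enough to rebuild the input.
    """
    seen = BloomFilter()
    store = {}
    recipe = []
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk).digest()
        # The index is consulted only when the filter says "possibly seen";
        # a Bloom filter has false positives but no false negatives.
        is_duplicate = fingerprint in seen and fingerprint in store
        if not is_duplicate:
            seen.add(fingerprint)
            store[fingerprint] = chunk
        recipe.append(fingerprint)
    return store, recipe


if __name__ == "__main__":
    data = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
    store, recipe = deduplicate(data)
    # Deduplication ratio: logical chunks divided by chunks actually stored.
    print(len(recipe) / len(store))
```

In this sketch the dictionary stands in for an on-disk index; the benefit of the filter comes from avoiding that lookup for chunks that have never been stored.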
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sahaya Jenitha, A., Sinthu Janita Prakash, V. (2019). An Effective Content-Based Strategy Analysis for Large-Scale Deduplication Using a Multi-level Pattern-Matching Algorithm. In: Satapathy, S., Joshi, A. (eds) Information and Communication Technology for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 107. Springer, Singapore. https://doi.org/10.1007/978-981-13-1747-7_23
DOI: https://doi.org/10.1007/978-981-13-1747-7_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1746-0
Online ISBN: 978-981-13-1747-7
eBook Packages: Intelligent Technologies and Robotics (R0)