1 Introduction

Distributed storage system can store huge amounts of data. Due to its characteristics of high performance-cost ratio and good scalability, it is widely used in cloud storage, disaster recovery, backup and other purposes. But distributed storage system has a lot of duplicate data, which brings a waste of storage space and adds a lot of unnecessary network traffic. So it is quite necessary to employ deduplication technology for distributed storage.

By the difference of detection granularity, deduplication can process in file-level and block-level or byte/bit-level. The smaller detection granularity, the more detected redundant data will be, but the complexity and consume will correspondingly increase. The Content-Defined Chunking (CDC), a classic algorithm for deduplication, employs Sliding Window technology to determine the split point of chunking, has been applied in LBFS, [1] Pastiche backup system [2] and Deep Store archive storage system [3] and the like. The CDC is suitable for frequently updated data set and has more advantages of reducing the occupying of storage space over the Fixed-Sized Block method. [1] However, in the extreme case, if the entire data stream does not find the matching boundary point, the CDC will degenerate to the FSB. To process the blocks beyond a certain length, Eshghi and Tang [4] proposed a Two Thresholds Two Divisors (TTTD) algorithm supporting the size of chunk which bears two expectations and two thresholds. Lu et al. [5] proposed the solution of counting the frequency of occurrences of block to improve the effect of the block and raise the rate of deduplication. Zhang et al. [6] proposed an asymmetric extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication. Yu et al. [7] present a leap-based CDC algorithm which provides significant improvement in deduplication performance without compromising the deduplication ratio.

The chunks produced by CDC are varied with length; thereby more complexity of management appears. The Sliding Block (SB) algorithm [8] combines the advantages of fixed-size and variable-size, mainly produces fix-size chunks. The SB algorithm is efficient to solve the insert and delete problem. Wang et al. [9] employ backtracking sub-blocks strategy in SB algorithm, via backtracking the left/right 1/4 and 1/2 sub-blocks in matching-failed segments, improves the duplicate detection precision compared with the traditional SB algorithm and CDC algorithm. Zhu et al. [10] designed and implemented a backup system with intelligent data deduplication, which named Backup Ded up including four deduplication strategies, that is SIS, FSB, CDC and SB.

In this paper, a Self-Matching Sliding Block (SMSB) Algorithm applied to deduplication in distributed storage system is proposed. The SMSB creates a matching table in deduplication processing node and processes the self-matching on the fingerprint and checksum of sliding block. It ignores a large number of data blocks which do not meet the matching conditions, so as to improve the efficiency of deduplication. It is suitable for distributed environment.

2 Deduplication Problem in Distributed Storage

Distributed storage system, which is generally consisted of a single management node that can practice unified allocation and management of resources to all storage nodes and a plurality of data storage nodes, may be visited by multiple clients simultaneously. The management node saves data block information, the storage nodes store all data blocks. If need to perform deduplication in the distributed storage system, the task can be assigned to minor storage nodes or data sender. The transfer for duplicate data wastes so much bandwidth; therefore, the processing of eliminating redundant done before the transfer of data is a very appropriate option.

The deduplication rate of the SB algorithm ranks top in all types of the same data detecting method, but this algorithm needs to query checksum for matching the block when the sliding window move one byte forward. Since the metadata of distributed storage system is stored in one or more nodes, and the nodes performing task of matching cannot save the massive checksum and data, they can only complete the matching task by visiting metadata nodes via network. The processing speed is too slow! The CDC algorithm can take advantage of the internal characteristics of the data for deduplication without having to query the metadata node in the sub-block process, but compared with the SB algorithm, this algorithm has lower deduplication rate and disk space utilization.

3 SMSB Algorithm

In order to solve the deduplication problem in distributed storage, we proposed a novel Self-Matching Sliding Block (SMSB) algorithm. The SMSB uses a matching table as repetitive testing cache, and each node performing deduplication updates timely matching table data by querying metadata node in the process.

3.1 Description

The process of SMSB algorithm is shown in Fig. 1. Firstly, calculating the fingerprint and checksum of the data block in sliding window, if successfully matched, it shows that the block is likely to be repeated; and then calculate the hash value for further examination. If the match is not successful, it is probably not repeated block, and then the sliding window continues to move forward. When the sliding window has been moved to a block size distance, the block in it matching no blocks is likely a new one. The hash value of the block is calculated and stored, and later used as a comparison target block.

Fig. 1.
figure 1

Self-Matching Sliding Block method

3.2 Matching Table

The matching table consists of fingerprint and checksum. For a continuous string [X1, X2,…, XW, XW+1, XW+2,…] (each character has 8 bits), any Xi represents polynomial Xi(t) Select a irreducible polynomial P(t) on Galois field GF(2n), the Rabin fingerprint [11] \( R_{1} \left( t \right) \) and Adler-32 checksum [12] \( S_{1} \) of string [X1, X2, …, XW] are calculated as follows:

$$ R_{1} \left( t \right) = \sum\nolimits_{j = 1}^{W} {X_{j} \left( t \right)t^{{8\left( {W - j} \right)}} \bmod \;P\left( t \right)} $$
(1)
$$ \left\{ {\begin{array}{*{20}l} {A_{1} = 1 + \sum\nolimits_{j = 1}^{W} {X_{j} \bmod 65521} } \hfill \\ {B_{1} = \sum\nolimits_{j = 1}^{W} {\left[ {\left( {W - j + 1} \right) \times X_{j} } \right] + W\bmod 65521} } \hfill \\ {S_{1} = B_{i} \ll 16 + A_{i} } \hfill \\ \end{array} } \right. $$
(2)

For a string [X1+1, X1+2,…, XW, XW+1](i ≥ 1), its fingerprint \( R_{i + 1} \left( t \right) \) and checksum \( S_{i + 1} \) are calculated as follows:

$$ R_{i + 1} \left( t \right) = R_{i} \left( t \right)t^{8} - X_{1} \left( t \right)t^{8w} + X_{1 + w} \left( t \right)\quad mod P(t) $$
(3)
$$ \left\{ {\begin{array}{*{20}l} {A_{i + 1} = A_{i} - X_{i} + X_{i + W} \bmod 65521} \hfill \\ {B_{i + 1} = B_{i} + A_{i + 1} - WX_{i - 1} \,\bmod 65521} \hfill \\ {S_{i + 1} = B_{i + 1} \ll 16 + A_{i + 1} } \hfill \\ \end{array} } \right. $$
(4)

Equations (3) and (4) have the computation time complexity of O(1).

3.3 Create and Maintain Matching Table

A matching table is created in memory, in order to achieve fast matching. Matching table is actually a simple index table stored in memory as an array, using the fingerprint of the data block as the array subscript (index), checksum as the value of the array element (index results). Matching table is updated by the latest block-priority strategy. When storing a new data block, according to the Rabin fingerprint(rabin) and Adler-32 checksum(adler) of the block, update Match[rabin] as adler.

3.4 Self-matching

Assume that storage space has processed a certain number of data blocks, in the matching table established based on this, checksum and fingerprint has a corresponding relationship as follows: (R1, S1), (R2, S2), (R3, S3), (R4, S4). When the sliding window moves forward, the fingerprint and checksum of each block corresponding to it will be calculated, and as a matching pair. The course of their work is shown in Fig. 2.

Fig. 2.
figure 2

Self-matching process

4 Experimental Results

The experiments use the actual data of the ordinary computer to extract three types of files (Word documents, source code, and BMP images) in the hard disk, less duplicate or identical files. The collected data sets are listed in Table 1.

Table 1. Testing data set

In the experiments, the data sets on Table 1 were partitioned by using 64, 256, 512, 1024, 4096 bytes as a unit block. In CDC algorithm, the minimum block size is set to be a half of the maximum value.

4.1 Deduplication Rate Test

The deduplication rates and comparison results are shown in Fig. 3.

Fig. 3.
figure 3

Compare with different chunk size

4.2 Disk Space Utilization Test

The rates of disk space utilization and comparison results are shown in Fig. 4.

Fig. 4.
figure 4

Compare with different chunk size

4.3 The Data Processing Rate Test

The data processing rates and comparison results are shown in Fig. 5.

Fig. 5.
figure 5

Compare with different chunk size

4.4 Summary of Results

The comparison results of three algorithms by each evaluation criteria are summarized in Table 2. SMSB algorithm is better than the CDC algorithm in deduplication rate, disk space utilization, but it is lower in the data processing rate. The deduplication rate and the disk space utilization rate of SMSB algorithm is respectively 2.03 times and 1.28 times of the CDC algorithm and that the data processing speed is 0.83 times of the CDC algorithm. We are confident that the SMSB algorithm is more suitable than the CDC algorithm when used in a distributed storage system.

Table 2. Self-Matching Sliding Block

5 Conclusion

Due to the limitations of the experimental conditions, the program has only been tested on small data sets, in the actual environment, the distributed storage system data block size will be bigger, generally between 4 KB to 128 KB, the proportion of duplicate data in the system will also be larger. The main purpose of this paper is to compare the processing performance of different algorithms, so as to find better deduplication method. Therefore, the work in this paper has a practical guiding significance.

For the distributed storage system with the feature of the Locality Preserved Caching (LPC), such as disaster recovery system and the backup system, we can modify the maintenance rules of the matching table according to this feature thus to identify the successor of duplicate blocks in the storage system, thus the coming block will be able to be predicted. We are ready to further improve and enhance the performance of SMSB algorithm for disaster recovery, backup storage system.