Adaptive deduplication of virtual machine images using AKKA stream to accelerate live migration process in cloud environment
Cloud Computing is a paradigm which provides resources to users from its pool based on demand to satisfy their requirements. During this process, many servers are overloaded and underloaded in the cloud environment. Thus, power consumption and load balancing are the major problems and are resolved by live virtual machine (VM) migration. Load balancing is addressed by moving virtual machines from overloaded host to under loaded host and from under loaded host to any other host which is not overloaded called VM migration. If this process is done without power off (Live) the virtual machines then it is called live VM migration. By this process, the issue of power consumption by physical hosts is also resolved. Migrating virtual machines involves virtualized components like storage disks, memory, CPU and networking, the entire state of VM is captured as a collection of data files. These data files are virtual disk files, configuration files, and log files. The virtual disk files take larger memory and take more migration time and hence the downtime. These disk files include many zero pages, similar and redundant pages. Many techniques such as compression, deduplication is used reduce the size of VM disk image file. Compression techniques are not widely used, due to the disadvantage of compression ratio and decompression time. Many researchers hence used deduplication techniques for reducing the VM disk image file in the live migration process. The significance of the research work is to design an adaptive deduplication mechanism for reducing VM disk image file size by performing fixed length and variable length block-level deduplication processes. The Rabin-Karp rolling hash algorithm is used in variable length block-level deduplication. Akka stream is used for streaming the VM disk image files as it is the bulk volume of live data transfer. To reduce the time of the deduplication process, many researchers used multithreading and multi-core technologies. We use multithreading in Akka framework to run the deduplication process concurrently without OutofMemory errors. The experiment results show that we achieved a maximum of 83% overall reduction in image storage space and 89.76% reduction in total migration time are achieved by adaptive deduplication method. 3% improvement in deduplication rate when compared with the existing image management system. The results are significant because when we apply this in the storage of data centres, there are much space savings. The reduction in size is dependent on the dataset was taken and the applications running on the VM.
KeywordsVirtualization Cloud computing Pre-copy technique Virtual machine migration Post copy technique Chunking strategies Streaming analytics Akka stream
Network attached storage
Network File System
Virtual machine migration
Figure 1(b) shows different types of hardware virtualization. In hardware-assisted virtualization the hypervisor is supported by the hardware without modifying the guest software. In full virtualization, all guest operating systems are abstracted from the hardware and are communicated with VMM using binary translation. Paravirtualization  is where the guest OS is modified, and hyper calls are issued to the host OS instead of issuing to hardware.
Virtualization provides and manages the data centre’s infrastructure dynamically. By multiplexing many virtual machines (VMs) on a single physical host, the utilization of resources is improved. Based on demand, these VMs scale up and down. Hence some hosts are heavily loaded, and some are under-utilized. So, effective virtual machine management is critical. A strong tool for managing virtual machines is VM Migration. Load balancing, power management, fault tolerance, and system maintenance are the goals of VM migration.
Virtual Machine Migration
Non-live migration or cold migration
Live migration or hot migration
Post copy approach
In Fig. 2(b) pre-copy approach , the minimal state of the processor is transferred to the target and followed by iterative push phase where the dirty (modified) pages are pushed to destination iteratively until the dirty page rate is less than the pages transferred rate. Then a small stop and copy phase is followed. In pre-copy approach, during iterative push phase, many zero pages which contains all zeros, identical pages (80% above similar) and similar pages (60% to 80% similar) are transferred to the target host. These pages are not required for VM to resume  at the destination machine.
In the process of VM migration, the entire virtual machine state is to be transferred. It contains connected device states which are of less size and sent to target host easily. Disk state information is not required to transfer as it is provided by network attached storage (NAS), memory state information of size gigabytes need to be transferred. VM’s CPU state information is the small amount and not having much effect in live migration time and downtime . Memory state contains guest OS memory state and all information about processes running within the VM. This is the vast amount of information which badly effects migration time and downtime.
VM configured memory
VM used memory
The physical memory of the underlying hardware that is allocated to the guest VM. This is the memory that is actively used by the VM from the hypervisor perspective.
The relationship between these memories according to their sizes is given as in Fig. 2(c).
The configured memory is a considerable size and contains VM image files. VM images consist more than 80% of zero pages and duplicate contents which, not only occupies more storage but also increases pressure on network transmission  especially in the live migration process. These pages are not required by the VM to resume at the target host. We can reduce the size of this memory by compression or deduplication techniques. As decompression time is the drawback of many compression algorithms, deduplication is preferable.
Data based deduplication
This is sub-file level deduplication where the file is divided into chunks called blocks. Based on the size of blocks in the chunking process, block-level deduplication is divided into fixed length block-level deduplication and variable length block-level deduplication.
Fixed length block-level deduplication
In this deduplication, the file is divided into fixed-length blocks. Identical blocks are identified and are removed to minimize the size of the file. It is simple and fast.
Variable length block-level deduplication
Client-based deduplication where redundant data is removed and unique data is transferred to the backup device at the client as in Fig. 2(f).
Target-based deduplication: After receiving all the redundant data at the target, the deduplication process is carried out, and unique data is transferred to the backup device as in Fig. 2(g). If the deduplication is done immediately while receiving, then it is called in-line deduplication. After writing to the disk, if deduplication is done, then it is called post-process deduplication.
The objective of the work is to reduce the total migration time and downtime in live VM migration process by reducing the amount of VM memory transferred from the source host to destination host during the process of migration.
Adaptive deduplication to reduce the VM disk image file size by detecting, and removing the zero, similar and redundant memory copies. Adaptive deduplication uses both fixed and variable size chunking (content defined chunking) strategies.
Analysis of deduplication effects on different formats of virtual machine disk images, as these files take more memory in VM and time in the migration process.
Akka framework is introduced, by which a high volume of reactive streams (live data) and back pressure mechanism are handled. As the variable length block-level deduplication process takes much time, multithreading concept is used to accelerate the process.
The remaining paper organized as, the work related to the research discussed in section "Related work". Section 3 explains our motivation for the proposed algorithm. Section "Motivation" explains the research work experimental setup. Results are discussed in section "Experimental setup". The paper is concluded in section "Results and Discussion".
In the characteristic-based compression (CBC) algorithm [10, 11], where different compression techniques used before the transfer of VM memory pages. Based on the type of pages, different techniques such as Wkdm for high similarity ratio pages, LZO for pages with low similarity were used to reduce their size. An improved algorithm was presented in  where memory-to-disk mappings were sent to the target instead of memory pages. From NAS the target host fetches the memory pages after deduplication with the help of NFS fetch queue. MDD (Migration with Data Deduplication)  was introduced in live migration for data deduplication of run-time memory image. Zero pages, similar pages were identified using hash-based fingerprints and were eliminated using RLE (Run Length Encode). To cut down the deduplication time multithreading was used by MDD. Extreme binning  in which Rabin fingerprints were used to identify the duplicated blocks. MD5 and SHA collision resistant algorithms were used to find the fingerprints. VM scheduling strategies  combined with VM replication strategies were introduced to reduce migration latencies associated with live migration of VMs across WAN. Scalability and storage issues of VM images and were resolved by using a liquid distributed file system . Gang Migration using global deduplication (GMGD)  was used to detect duplicates in VM memory pages and avoids their retransmission to target host in a cluster. Thus, reducing network overhead while running on several hosts. 42% reduction  in migration time was achieved. An optimized Incremental Modulo-K(INC-K) algorithm , modulo arithmetic in nature was used for deduplication and achieved 66% better deduplication ratio. In , Inner VM and cross-VM duplicates were removed by using a multi-level selective deduplication method. Parallelism was also increased by using local and global deduplication and a high deduplication ratio achieved. Different chunking strategies  were applied on various VM disk image dataset and proved that compression rate varied for different operating system versions, software configuration and achieved more savings in storage by identifying zero-filled blocks. In  heuristic prediction was used to determine non-duplicates by using adaptive block skipping, implemented on disk images of size 1 TB which leads to maximum 40% of space savings. Rapid Asymmetric Maximum (RAM) , a modified content defined chunking was used, which increases more throughput with less hash-based computations.
The main objective of the research work is to reduce the number of pages transferred from source machine to destination machine in live migration process, thus achieving the reduction in migration time and downtime. Hence, in the proposed algorithm, deduplication process on VM disk images applied twice to reduce the duplicates effectively. In the proposed Adaptive deduplication algorithm both fixed size and variable size chunking strategies are implemented to identify the redundant data in VM image files by using Akka stream in the live migration of virtual machines. To accelerate the deduplication process, multithreading in Akka stream framework is used. Akka streams are mainly meant for reactive streams such as live data where handling the non-predetermined volume of data is difficult. For parallel programming, and delay minimization in live data streaming it is better to use actor model. Akka Stream is an actor programming model. In parallel programming, it matches the speed between producers and consumer messages by avoiding avoid back-pressure mechanism. Back pressure means if one component is under stress to handle the incoming messages and subsequently drops the messages. Most of the researchers use the Akka model for Big-data analytics. In this work, Akka is used for streaming of live virtual machine data. As variable length block-level deduplication takes more time Akka with multithreading is used to accelerate the process. In this paper, the results after adaptive deduplication with Akka stream compared with existing deduplication techniques. Akka framework is used here to complete the process without getting OutofMemory Errors. The VM image data set was collected from an image registry of OpenStack, created when launching VMs with the configuration of 2 GB RAM and 10 GB hard disk assuming no applications running on the VM. Each VM is loaded with several guest operating systems like Centos7, Centos 6.9, Ubuntu 12.04 Ubuntu 14.04 and Ubuntu 11.10 versions. Several formats of virtual disk images VDI, QCOW2, VMDK, and VHD of created VMs were taken as input data set to our research work. The proposed adaptive deduplication with Akka stream is as follows.
Number of pages transferred: The amount of VM memory or pages transferred during the migration. It also includes zero pages, duplicates, and similar pages.
Total Migration time: The total time required for a VM on source host to migrate and to start on a destination. The sum of all time that is preparation time, page transfer time, down time and resume time.
Application degradation: The performance of the host decreases as the migration is in process.
Preparation time: When migration is initiated, the time for transferring minimal state of the CPU to the destination. In the pre-copy approach, pages become dirty while VM on source host is running. This time includes the entire process of iteratively pushing the dirty pages to the destination host.
Resume time: The time taken by the migrated VM to resume its operations at the target host.
Network traffic overhead: Overhead is the extra operations that are imposed by the virtual machine migration technique. It shows the impact on application performance.
Downtime: The duration of time that a VM is suspended (out of service) before it resumes on the target host.
To evaluate the deduplication rate of VM image data set using adaptive deduplication techniques and compare with existing fixed length, variable length block deduplication techniques.
To carry out the process of live migration of VMs where VMs size dynamically allocated after adaptive deduplication using Akka Streams.
To accelerate the adaptive deduplication process by using multithreading.
Processor: Intel® Core (TM) i5-8250U
8th Generation CPU 1.8 GHz DDR4
OS: Windows 10
Memory: 1 TB disk
System Type: 64-bit OS, × 64-based
Software: JDK 1.8 and Oxygen 2
Simulator: Cloud Sim 3.03
Fixed length and variable length block-level deduplication techniques are implemented in Java and Akka framework. This code returns the size of VM virtual disk image after eliminating the duplicates. CloudSim live migration code is invoked using IqrMu.main(args) which outputs the total migration time.
CloudSim set up
Different formats of VM images like VDI, VMDK, QCOW2, VHD files are provided by glance component of OpenStack image registry. These images have much impact on deduplication rate  and are discussed by many researchers. In our work VM Image dataset are taken from OpenStack image registry by creating VMs with a standard configuration of 2GB memory and 10GB hard disk.
Virtual Machine disk images in the data set (Bytes)
Results and discussion
In Fixed length block-level deduplication technique any format of virtual disk image file is taken as input and the file is split into small chunks each of size 4 KB. Using the SHA algorithm, the hash code of each chunk is calculated. The hash code and chunk name are stored in the hash table. If any further in coming chunk having the same hash code, the chunk is not stored in the hash table. Thus, the duplicates are eliminated. As the file size is small, there is no significant impact on live migration performance.
From Fig. 5(b) and (c) it is observed that Ubuntu OS versions don’t have many duplicates except for that lighter versions. 89.87% is the maximum reduction in the size of Ubuntu OS disk image files. 19.77% is the percentage of size reduction for Ubuntu14.04 VMDK format which is the minimum among all available Ubuntu versions. 91.6% reduction in total migration time is achieved for Centos6.9 disk image file. Consequently, the downtime, application performance degradation will be reduced reasonably.
Figure 5(d). shows that adaptive deduplication shows the better deduplication rate for all the input files compared to the fixed and variable length block deduplication techniques. The better the deduplication rate, the more the VM file size is reduced. It means the number of VM pages to be transferred is reduced. We also observed that small VM image files which are lighter versions have many duplicates when compared to large VM image files.
Figure 5(e) and (f) show that adaptive deduplication technique gives better performance when compared with fixed and variable deduplication techniques in terms of deduplication rate and migration time. 92.66% for fixed, 94.35% for variable and 95.53% for adaptive deduplication rate achieved regarding size. 91.4% for fixed, 91.6% for variable and 92.5% better reduction achieved regarding total migration time. Minimum percentage of deduplication rate 6.61%, 19.77%, 56.29% achieved by using fixed length, variable length, and adaptive duplication techniques respectively. 5%, 14.1%, 41.5% total migration time is reduced by using fixed length, variable length, and adaptive deduplication techniques respectively.
Figure 5(g) show the comparison of all techniques regarding storage. By using the proposed technique 6.61% minimum for Ubuntu files and 95.5% maximum deduplication rate is obtained for centos files. The minimum of 5% and 92.5% maximum reduction in total migration time is obtained. IM-Dedup  uses static (fixed length) chunking procedure and it achieves 80% reduction in overall image storage. We achieved a 83% reduction in the overall storage of images. 3% improvement is significant because this algorithm is used in data centres for optimal storage of VM disk images. As large VM disk images have peta bytes of size, 3% reduction in storage give a significant memory savings. 89.76% reduction in migration time by using adaptive deduplication. The number of duplicates present in VM disk images is based on the data set taken, the type of disk image, and the applications running on the VM. In this paper, we didn’t discuss the time for deduplication process since the adaptive deduplication process if it carried out by high-end servers of cloud, the deduplication was completed within no time and there is no significant impact on migration time and downtime.
In this research the existing deduplication techniques fixed length, variable length block deduplication techniques are implemented and compared with proposed adaptive deduplication techniques on standard VM images dataset which is taken from open stack image registry when created with each VM configuration of RAM 2GB and hard disk of 10GB assuming no applications running on any VM. The Rabin-Karp rolling hash algorithm is used for variable length block deduplication. IM-Dedup uses static (fixed length) chunking procedure, and it achieves 80% reduction in overall image storage. We achieved an 83% reduction in the overall storage of images and 89.76% overall reduction in migration time by using adaptive deduplication. 3% improvement in deduplication rate by the proposed method. As the migration time is reduced obviously, downtime and application performance degradation also reduced.
We are thankful for the people who gave the technical assistance in getting the data set of VM images.
Availability of data and materials
NM did a literature survey on live migration techniques, deduplication techniques. She proposed and implemented adaptive deduplication of virtual machine images by using fixed length and variable length block deduplication techniques. VG guided choosing of parameters for evaluating various techniques. She approved the final version of the paper to be submitted. Both authors read and approved the final manuscript.
TYJNaga Malleswari, G.Vadivu declares that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1.Naga Malleswari TYJ, Rajeswari D, Senthil J (2012) A survey of cloud computing architecture and services provided by various cloud service providers. In: Proceedings of 2nd international conference on demand computing, the Oxford College of Engineering, November 15–16, BangaloreGoogle Scholar
- 2.Venkatesha, Sharath & Sadhu, Shatrugna & Kintali, Sridhar. (2009). Survey of Virtual Machine Migration TechniquesGoogle Scholar
- 5.Diego Perez-Botero, “A Brief Tutorial on Live Virtual Machine Migration from a Security Perspective”Google Scholar
- 6.X Zhang, Z Huo, J Ma and D Meng, "Exploiting Data Deduplication to Accelerate Live Virtual Machine Migration," 2010 IEEE international conference on cluster computing, 2010, pp. 88–96. https://doi.org/10.1109/CLUSTER.2010.17
- 7.S. Sharma and M. Chawla, "A technical review for efficient virtual machine migration," 2013 International Conference on Cloud & Ubiquitous Computing & Emerging Technologies (CUBE), Pune, India, 2014, pp. 20–25. doi: https://doi.org/10.1109/CUBE.2013.14
- 8.Hu, Wenjin, Hicks, Andrew, Zhang, Long, Dow Eli, Soni, Vinay, Jiang, Hao, Bull, Ronny, Matthews, Jeanna. (2013). A quantitative study of virtual machine live migration. Proceedings of the 2013 ACM Cloud and Autonomic Computing conference. https://doi.org/10.1145/2494621.2494622
- 9.Naga Malleswari TY, Malathi D, Vadivu G. Deduplication Techniques: A Technical Survey. International J Innovative Res Sci Technol|, 2014, 1, (7) pp. 318–325Google Scholar
- 10.Jin Hai & Deng, Li & Wu, Song, Shi, Xuanhua, Pan, Xiaodong. (2009). Live virtual machine migration with adaptive, memory compression. Proceedings - IEEE International Conference on Cluster Computing, ICCC. 1–10. 10.1109/CLUSTR.2009.5289170Google Scholar
- 13.D. Bhagwat, K. Eshghi, D. D. E. Long and M. Lillibridge, "extreme binning: scalable, parallel deduplication for chunk-based file backup," 2009 IEEE international symposium on modeling, Analysis & Simulation of Computer and Telecommunication Systems, London, 2009, pp. 1–9. doi: https://doi.org/10.1109/MASCOT.2009.5366623
- 14.Jinhua H, Jianhua G, Sun G, Zhao T (2010) A scheduling strategy on load balancing of virtual machine resources in cloud computing environment, 3rd international symposium on parallel architectures. Algorithms and Programming:89–96Google Scholar
- 15.X. Zhao, Y. Zhang, Y. Wu, K. Chen, J. Jiang and K. Li, "Liquid: a scalable deduplication file system for virtual machine images," in IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 5, pp. 1257–1266, 2014.doi: https://doi.org/10.1109/TPDS.2013.173
- 16.U. Deshpande, B. Schlinker, E. Adler and K. Gopalan, "Gang Migration of Virtual Machines Using Cluster-wide Deduplication," 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, Delft, 2013, pp. 394–401.doi: 10.1109/CCGrid.2013.39Google Scholar
- 17.H. Liu, H. Jin, X. Liao, C. Yu and C. Z. Xu, "Live virtual machine migration via asynchronous replication and state synchronization," in IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 12, pp. 1986–1999, Dec. 2011. doi: https://doi.org/10.1109/TPDS.2011.86
- 18.J. Min, D. Yoon and Y. Won, "Efficient deduplication techniques for modern backup operation," in IEEE Transactions on Computers, vol. 60, no. 6, pp. 824–840, 2011. doi: https://doi.org/10.1109/TC.2010.263
- 19.W. Zhang, H. Tang, H. Jiang, T. Yang, X. Li and Y. Zeng, "Multi-level Selective Deduplication for VM Snapshots in Cloud Storage," 2012 IEEE Fifth International Conference on Cloud Computing, Honolulu, HI, 2012, pp. 550–557. doi: https://doi.org/10.1109/CLOUD.2012.78
- 20.Jin Keren and Ethan L. Miller. “The effectiveness of deduplication on virtual machine disk images.” SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference SYSTOR (2009). Article 7Google Scholar
- 22.N.S. Widodo, Ryan & Lim, Hyotaek & Atiquzzaman, Mohammed. (2017). A new content-defined chunking algorithm for data deduplication in cloud storage. Futur Gener Comput Syst 71. https://doi.org/10.1016/j.future.2017.02.013
- 23.Naga Malleswari TYJ, Vadivu G, Malathi D (2015) Live virtual machine migration techniques-a technical survey. In: 2nd international conference on intelligent computing, communication and devices intelligent computing august 2015, intelligent computing, communication and devices, volume 1. Springer, pp 303–309Google Scholar
- 24.Jilin Zhang, Shuting Han, Jian Wan, Baojin Zhu, Li Zhou, Yongjian Ren, and Wei Zhang (2013) IM-Dedup: an image management system based on deduplication applied in DWSNs, international journal of distributed sensor networks July 16, 2013, doi.org/ https://doi.org/10.1155/2013/625070
- 25.Kim D., Song S., Choi BY. (2017) Existing deduplication techniques. In: Data Deduplication for Data Optimization for Storage and Network Systems. Springer, ChamGoogle Scholar
- 26.TYJ Naga Malleswari, Vadivu G (2017) Deduplication of VM Memory Pages Using Mapreduce In Live Migration, ARPN Journal Of Engineering And Applied Sciences, 2017, Vol 12, No. 6, Pp. 1890–1894Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.