A Multi-Optimization Technique for Improvement of Hadoop Performance with a Dynamic Job Execution Method Based on Artificial Neural Network

Abstract

The improvement of Hadoop performance has received considerable attention from researchers in the cloud computing field. Most studies have focused on improving the performance of a Hadoop cluster. Notably, many parameters are required to configure Hadoop and must be adjusted to improve performance. This paper proposes a mechanism to improve Hadoop by scheduling jobs and by allocating and utilizing resources. Specifically, we present an improved ant colony optimization method to schedule jobs according to the job size and the expected execution time. Priority is given to the job with the minimum data size and minimum response time. The resource usage and running jobs of each data node are predicted using an artificial neural network, and job activity and resource usage are monitored using the resource manager. Moreover, we enhance Hadoop name node performance by adding an aggregator node to the default HDFS framework architecture. The changes involve four entities: the name node, secondary name node, aggregator nodes, and data nodes, where the aggregator nodes are responsible for assigning jobs among the data nodes and the name node tracks only the aggregator nodes. We test the overall scheme on Amazon EC2 and S3 and report the throughput and CPU response time for different data sizes. Finally, we show that the proposed approach achieves a significant improvement compared to native Hadoop and other approaches.

Introduction

The process of improving Hadoop performance by enhancing Hadoop distributed file system (HDFS; name node and data nodes) capabilities is a critical task. Hadoop is a collection of open-source utilities developed for writing distributed applications [1] that process structured and unstructured data stored in HDFS. MapReduce was developed to organize large volumes of data using a cluster of commodity hardware. The primary challenge observed in using MapReduce is poor performance in map and reduce tasks, which is negatively affected by the large number of configuration parameters: there are approximately 190 parameters in Hadoop.

Any user who submits a job must adjust these configuration parameters. Most users lack the practical knowledge needed to properly configure that many parameters, which could explain the notable performance degradation. Job scheduling is an important strategy in Hadoop processing systems. A user's job is divided into smaller sub-tasks and distributed among several data blocks. In such cases, the name node maintains a file index system that stores the data block information. Several schemes can be used to schedule tasks, many of which have been proposed to improve the performance of job scheduling [2]. Job schedulers in Hadoop can be classified according to several criteria, including the environment, resources (CPU time, free slots, I/O utilization, and disk space), priority, and time. The main objective of a job scheduler is to reduce overhead and computational resource usage and to increase throughput by allocating jobs to processors and reducing the job completion time. The major types of job schedulers in Hadoop are static schedulers, dynamic schedulers, time-based schedulers, FIFO (first in, first out) schedulers, and delay schedulers [3]. According to a previous study [4], the performance of Apache Spark is better than that of Hadoop: Spark can perform streaming, batch processing, and machine learning tasks on a single cluster and is faster than Hadoop, although this advantage may disappear if the system memory is full. According to other research, Hadoop performs better than Spark in terms of both computational time and cost. Currently, Hadoop can support client tasks with high throughput, high fault tolerance, high flexibility, and high availability. In this paper, the importance of the configuration parameters in Hadoop is analyzed.

Background and Motivation

This section briefly describes Hadoop and the motivations of this paper, which focuses on the improvement of Hadoop performance in dynamic job execution. Hadoop clusters are gaining more popularity daily because of their efficient computational times and high cost-effectiveness. There are two major components of Hadoop: (1) an HDFS and (2) MapReduce. The HDFS provides users with distributed storage access, and MapReduce provides distributed processing [5].

An HDFS contains two parts: the name node (master node) and the data nodes (slave nodes). These components are used for the efficient management of distributed environments and storage. Generally, the Hadoop framework provides users with the ability to create clusters of commodity servers. This framework is based on a write-once, read-many approach to data file processing. In each cluster, the master node is responsible for overall file storage, such as saving data and directing jobs to the respective slave nodes; thus, it can scale to thousands of nodes. Moreover, the master node stores large data files in blocks. Each data block is managed by a different node in the cluster, and a data block can be replicated on multiple nodes [6].

MapReduce then runs tasks on the cluster, which permits data management over the distributed data storage system. MapReduce splits the input dataset into a number of blocks and stores them in the HDFS; each block is 64 or 128 MB in size. A common Hadoop package comprises the necessary Java archive (JAR) scripts and the files required in a large cluster. The MapReduce component uses two functions.

  • Mapper The map operation runs for each block, isolated from other blocks, on the data node where the data are actually stored. In the mapping phase, this operation emits a < key, value > pair for each term, such as < Seoul, 1 > .

  • Reducer The reducer operation combines the results from various mappers and produces a single final output.
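To make the two functions concrete, the following is a minimal word-count sketch written in the style of Hadoop Streaming. It is a hypothetical illustration rather than the workload used in this paper: the mapper emits < term, 1 > pairs such as < Seoul, 1 >, and the reducer sums the counts for each key after the framework's shuffle and sort.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style word count (illustrative sketch only).
import sys

def run_mapper():
    """Map phase: emit <term, 1> for every term in each line of the input block."""
    for line in sys.stdin:
        for term in line.strip().split():
            # Hadoop Streaming expects tab-separated key/value pairs on stdout.
            print(f"{term}\t1")

def run_reducer():
    """Reduce phase: combine the sorted mapper output into one count per key."""
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key == current_key:
            count += int(value)
        else:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    # Run as "mapper.py map" for the map phase, otherwise act as the reducer.
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        run_mapper()
    else:
        run_reducer()
```

With Hadoop Streaming, such scripts would be supplied through the -mapper and -reducer options, and the framework performs the shuffle and sort between the two phases.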

Figure 1 shows the native Hadoop MapReduce paradigm. However, the built-in MapReduce component can lead to poor performance in the mapper and reducer functions for certain tasks [7]. These limitations are described in detail in the following “Related Work” section.

Fig. 1 Native Hadoop MapReduce paradigm

Paper Structure

The remainder of the paper is structured as follows. The “Related Work” section reviews previous work on HDFSs and MapReduce for improving Hadoop performance; its “Problem Statement” subsection summarizes the limitations of that work. The proposed framework and the contributions of this paper are described in the “Proposed Work” section. The experiments involving the proposed Hadoop framework are implemented and discussed in the “Experimental Results” section, where the results of the proposed scheme are also compared with those of previous works. Finally, the conclusions of the study and some future research directions are given in the “Conclusion and Future Work” section.

Related Work

Literature Review

Several studies have focused on improving Hadoop performance [8,9,10]. Aydin et al. [11] developed a Hadoop program for distributed log analysis on a small cloud. The distributed log analysis uses several cluster nodes and splits large log files in an HDFS using the MapReduce programming model, automatically resizing clusters based on the data analysis requirements. The results indicated that the Hadoop-based framework performed as well as the more recent Spark framework for the given tasks. Yao et al. [12] addressed the high response time problem under different workloads. Specifically, the authors designed LsPS, a scheduler that reduces the average job response time by leveraging job size patterns and adapting the scheduling schemes among users. In [13], the authors described the prominent role of machine learning (ML)-based methods and algorithms, with their latest versions, in big data processing and analytics. A genetic algorithm-based job scheduling scheme was proposed in [14] for big data analytics to improve the efficiency of analysis; the job scheduling reduced the time and cost of analysis.

The authors of [15] proposed metaheuristic optimization and ensemble modeling approaches for Hadoop configuration tuning (H-Tune). A non-intrusive performance profiler was used to acquire the runtime details of MapReduce applications with a runtime overhead of less than 2%. An ensemble modeling approach considered the input data size and Hadoop configuration of each application, and a metaheuristic-based configuration optimizer was used to determine the optimal configuration based on the performance of a given application.

A native Hadoop MapReduce model was proposed for applications in data mining, namely, frequent item set mining. Moreover, an improved weighted HashT Apriori algorithm was proposed using the Hadoop MapReduce model. This model also employed a transaction filtering technique to remove infrequent transactions from the dataset because the native MapReduce model requires considerable memory for frequent item set mining [16]. A MapReduce-based Apriori algorithm was also proposed for performance optimization [17]. Mbarka et al. [18] proposed a failure- and dynamically aware framework that can be integrated into Hadoop to schedule tasks and change the scheduling decisions based on information collected from the Hadoop environment. The authors of [19] presented a new concept in MapReduce, a reconfigurable shuffle component added to the default MapReduce pipeline that reduces the required operations. This method provides a flexible operating sequence and identifies the most important factors that affect shuffle-phase performance, including the number of keys, the number of spilled files, and the variance of the intermediate results. The paper therefore focuses on dynamically adjusting the order of operations in the shuffle to improve the performance of MapReduce applications, since the map-side shuffle may suffer from high I/O utilization and require a long execution time. Brahmwar et al. [22] proposed a new scheduling algorithm for the Hadoop cluster called Tolhit. The algorithm is based on effective resource utilization and the identification of the nodes at which jobs are scheduled: according to information from the Hadoop cluster, the best node is selected to schedule the tasks. The experimental results of Tolhit displayed a 27% improvement in Hadoop performance in terms of makespan (the completion time of a task). This algorithm performs better than the Hadoop fair scheduler, but it is only suitable for slow tasks. The authors of [20] presented a Hadoop job scheduler that satisfies given deadlines. User jobs are scheduled in FIFO order, and an optimization criterion was established for job scheduling based on user-specified constraints (deadlines). A deadline constraint scheduler was used that maintains a priority queue of jobs. A two-phase computation scheme (map and reduce) was proposed in which key-value pair tuples are shuffled across different reduce nodes. Task assignment is completed based on a heartbeat interval (every 3 s), which increases the run time of MapReduce tasks. Guo et al. [21] applied an improved job scheduling algorithm to the Hadoop framework in which scheduling is based on Bayes classification. In a job queue, the jobs were classified as good or bad jobs, and a task tracker selected good jobs and then allocated the appropriate resources. At time t, Bayes classification was used to execute the most appropriate jobs selected by the task tracker. However, the Bayes classifier introduces some unnecessary error and computational burden. Gu et al. [23] proposed the SHadoop method, which improves the Hadoop MapReduce framework through task/job execution optimization. The setup and cleanup tasks of a MapReduce job are optimized to reduce the total computational time, and an instant messaging mechanism was proposed to accelerate the scheduling and execution of performance-sensitive tasks. The setup tasks launched from the job tracker to the task tracker consist of state information reports, and the task tracker forwards regular heartbeat messages once the task is completed.
However, high overhead occurs in the Hadoop cluster, which limits the dynamic scheduling of time slots for resource utilization and workload distribution. The authors of [24] proposed the H2Hadoop architecture to improve Hadoop performance for metadata-related jobs. In this work, the name node is responsible for receiving jobs from a client, dividing the jobs into tasks, and assigning the tasks. Hence, the name node directly forwards jobs to a particular data node without knowledge of the entire cluster. Jeon et al. [25] showed the effect of MapReduce parameters on the distributed processing of machine learning programs. Chung and Nah [26] showed how different virtualization methods affect the distributed processing of a massive volume of data in terms of processing performance. The results showed that Docker-based virtual clusters are usually faster than Xen-based virtual clusters; however, in some cases, Xen performs faster than Docker depending on the established parameters, such as the block size and the number of virtual nodes.

Problem Statement

Today, large volumes of data are generated from different sources around the world. With the rapid growth in dataset sizes, it has become difficult to process data in a reasonable amount of time, and vast computational power and resources are required to overcome this issue. Thus, this paper focuses on enhancing Hadoop by improving the abilities of the HDFS. Based on the memory requirements, a metaheuristic algorithm called the genetic simulated annealing algorithm was proposed to adjust the major configuration parameters of Hadoop; the drawback of this optimization method is the high processing time, which is largely due to the slow convergence from the initial guess [9]. A heterogeneous job allocation scheduler with a dynamic grouping integrated neighboring search algorithm was proposed in [27]. This algorithm processes tasks based on (1) job classification (the job type, such as CPU bound or I/O bound), (2) a ratio table (a capability ratio table created for task trackers and data nodes), (3) data block allocation and grouping (grouping with CPU slot numbers), and (4) neighboring searches (CPU task allocation and I/O task allocation). When the dataset size is sufficiently large, the name node processes can lead to failure, or high overhead may be accrued. To overcome these issues of the conventional fair scheduler in Hadoop, the authors of [28] proposed an improved fair scheduling algorithm for clustering user jobs. The advantage of the improved fair scheduling scheme is its efficiency in producing throughput for datasets of variable size; the disadvantages are that long jobs can slow the algorithm and cause overloading at a node. The authors of [29] proposed a data-locality-aware enhanced task scheduling algorithm to improve the job completion time. When an input split consists of multiple data blocks that are distributed and stored on different nodes, conventional data-locality methods fail to cope with the degradation in processing performance caused by the increased frequency of data block copying. To solve this issue, the authors proposed a task scheduling algorithm that defines a method to classify data locality by taking into account the locations of all data blocks that comprise an input split, categorizes tasks based on the defined method, and sequentially assigns tasks according to a given priority.

Motivation of this Paper

  • Most of the existing schedulers, such as LATE and FCFS, provide poor performance, including unbalanced resource utilization, because they neglect the workload of a job, which is the main reason for imbalanced resource allocation. Based on previous research, our motivation in this paper is to improve Hadoop performance by proposing new job scheduling and resource utilization and allocation methods. We use Amazon EC2 nodes in which one node is established as the master node and the others are slave nodes. In this HDFS cluster, each master node is associated with a number of aggregator nodes, and each slave node runs three phases: map, shuffle, and reduce. The primary contributions of this paper are as follows.

  • We propose an improvement to Hadoop to overcome the issues of the native Hadoop. An improved ant colony optimization (ACO) algorithm is proposed for job scheduling based on the job size and execution time.

  • We design an enriched name node by enhancing its capabilities. Three different entities are used in the HDFS: the name node, data nodes, and aggregator nodes. User jobs are first scheduled with the improved ACO algorithm based on the job size and execution time.

  • After job scheduling is complete, resources are allocated for each job using an artificial neural network (ANN), which predicts the current usage of each node.

  • The proposed scheme is evaluated based on the throughput and response time results, and the results of the scheme are compared to those of previous works to validate the proposed method.

Proposed Work

System Overview

An overview of the proposed system is shown in Fig. 2. The following subsections describe each component.

Fig. 2 System overview

User Job Scheduling Based on the Improved Ant Colony Optimization Algorithm

The native job scheduling problem in Hadoop is solved using an improved ACO algorithm. Job scheduling has played a vital role in recent years due to the rapid growth of big data.

The problem of job scheduling can be formulated as follows.

  • Assume that n users assign N jobs that must be processed with M data nodes.

  • The data nodes are denoted as D, D = {D1, D2,…, Dm}.

  • Each user job j (1 ≤ j ≤ N) comprises a set of tasks Tj1, Tj2, …, Tjn.

  • The job schedule for job j of user U requires one data node out of the set of data nodes Dij ∈ D.

The problem is that a job is assigned based on the job size and expected execution time. The major objective in this paper is to assign a sequence of user jobs to data nodes and reduce the overall response time and increase throughput. The assumptions of job scheduling in this paper are as follows:

  • User jobs/tasks are independent of each other;

  • Data nodes vary based on their current usage;

  • More than one job per user is not permitted at time t;

  • There are no constraints among the jobs of different users; and

  • In a given time, t, a name node can execute at most one job.

ACO algorithms are commonly used for solving optimization problems. A set of ants is established to explore the solution space. At each iteration, the ants search for paths (sets of the least loaded nodes) and leave pheromones along their routes. The pheromones along a route deliver a message to ants at adjacent nodes and evaporate with a variable velocity, whereas traditional ACO uses a fixed velocity: when a new ant moves, the pheromone level increases at a constant speed. An ant chooses the next node based on two factors, the available pheromone information and the heuristic information. Thus, the major objective of the ACO algorithm is to find the data node with the smallest load.

For example, suppose there are four users who submit four jobs, and each job is split into a fixed number of tasks. The processing of each job is based on two constraints: the minimal job size and the minimal expected execution time. According to these constraints, jobs are sequentially executed in the available space of the data nodes; in this example, the artificial ants execute three jobs, or a total of 9 tasks, in a sequential manner. Generally, the steps in traditional ACO are based on the computed transition probability, visibility, and pheromone level. The goal is to select the data nodes chosen by the ants based on the pheromone level and visibility. At time t, the probability that ant a selects the path from one point to another is computed as follows:

$$p_{ij}^{a} (t) = \left\{ {\begin{array}{*{20}l} {\frac{{\mu_{ij}^{\beta } (t)\delta_{ij}^{\gamma } (t)}}{{\sum\nolimits_{{S \in P_{a} }} {\mu_{ij}^{\beta } (t)\delta_{ij}^{\gamma } (t)} }}} \hfill &\quad {{\text{if}}\;j \in P_{a} } \hfill \\ 0 \hfill &\quad {\text{otherwise}} \hfill \\ \end{array} } \right.$$
(1)

where \(P_{a}\) represents the set of candidate data nodes, provided by the name node, from which ant a can select, β and γ are the pheromone and expectation factors, respectively, \(\mu_{ij}^{\beta }(t)\) is the pheromone level, and \(\delta_{ij}^{\gamma }(t)\) is a heuristic (visibility) factor. The pheromone level and visibility together determine the probability of an ant selecting a path. In addition, the fitness function f(x) of ant a is used to choose a path and is expressed as follows.

$$f\left( x \right) = w_{1} \times S + w_{2} \times {\text{EET}}$$
(2)

Here, S is the job size and EET is the expected execution time; the user job with the minimum job size and execution time is given first priority, and \(w_{1}\) and \(w_{2}\) represent the weighting constants. The job execution strategy of an ant is shown in Fig. 3.

Fig. 3 Ant execution job strategy based on data nodes

Algorithm 1 below explains the improved ant colony optimization procedure step by step.

Algorithm 1 Algorithm for the improved ant colony optimization
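As an illustration of the selection rule in Eqs. (1) and (2), the following is a minimal sketch of the job-to-node assignment loop. It is not the authors' exact Algorithm 1: the load-based visibility (the inverse of the current node load), the pheromone update rule, and the weight values are illustrative assumptions.

```python
import random

def fitness(job, w1=0.5, w2=0.5):
    """Eq. (2): jobs with smaller size S and expected execution time EET rank first."""
    return w1 * job["size"] + w2 * job["eet"]

def select_node(pheromone, load, beta=1.0, gamma=2.0):
    """Eq. (1): roulette-wheel selection over candidate data nodes,
    weighting pheromone level against a load-based visibility term."""
    weights = [(pheromone[d] ** beta) * ((1.0 / (1.0 + load[d])) ** gamma)
               for d in range(len(load))]
    total = sum(weights)
    r, acc = random.uniform(0, total), 0.0
    for d, w in enumerate(weights):
        acc += w
        if acc >= r:
            return d
    return len(weights) - 1

def schedule(jobs, num_nodes, evaporation=0.1, deposit=1.0):
    pheromone = [1.0] * num_nodes          # initial pheromone per data node
    load = [0.0] * num_nodes               # current (predicted) node usage
    assignment = {}
    for job in sorted(jobs, key=fitness):  # minimum size/EET gets first priority
        node = select_node(pheromone, load)
        assignment[job["id"]] = node
        load[node] += job["size"]
        # evaporate all trails, then reinforce the chosen (least loaded) path
        pheromone = [(1 - evaporation) * p for p in pheromone]
        pheromone[node] += deposit / (1.0 + load[node])
    return assignment

# Hypothetical jobs: size in MB and expected execution time in seconds.
jobs = [{"id": "J1", "size": 256, "eet": 40},
        {"id": "J2", "size": 128, "eet": 20},
        {"id": "J3", "size": 512, "eet": 90}]
print(schedule(jobs, num_nodes=3))
```

In this sketch, the least loaded nodes accumulate more pheromone and therefore attract subsequent jobs, which mirrors the objective stated above of directing each job to the data node with the smallest load.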

Artificial Neural Network for Node Usage Prediction

This section explains how an ANN is used to compute the node load. The proposed ANN is based on the name node and aggregator node information. We consider a set of input variables and compute weights for the input variables to obtain the output variable. This approach can be represented as a set of inputs, an activation function, and an output. The links between the inputs and the hidden layer are called weights; the weights define the connection strengths between neurons, and the output value is produced at the output layer, as shown in Fig. 4.

Fig. 4 Artificial neural network

In the hidden layer, each neuron receives weighted inputs and bias from other neurons in the input layer as follows:

$$Z_{i} = \left( {\sum\limits_{K = 1}^{{N_{j} - 1}} {x_{k}^{j - 1} w_{k,i} - b_{k} } } \right)$$
(3)

where wk,i is the weight value of the connection between node K of the previous layer and node i, \(x_{k}^{j - 1}\) is the input from the Kth node in layer j − 1, Nj−1 is the total number of nodes in layer j − 1, and bk is the bias of the node.

The weighted sum is then passed through the activation function to compute the output of the node as follows.

$$y_{i} = f\left( {Z_{i} } \right)$$
(4)

The sigmoid function can be used for the activation function and is formulated as follows.

$$f\left( {Z_{i} } \right) = \frac{1}{{1 + e^{{ - Z_{i} }} }}$$
(5)

This equation serves to model the relevant nonlinear behaviors.
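To make Eqs. (3)-(5) concrete, the following is a minimal forward-pass sketch. The layer sizes, weight and bias values, and the choice of node-usage features (normalized CPU, memory, and task counts) are illustrative assumptions, not the trained network used in the experiments.

```python
import math

def sigmoid(z):
    """Eq. (5): nonlinear activation."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, weights, biases):
    """Eqs. (3)-(4): weighted sum of the previous layer minus the bias,
    passed through the activation to give each neuron's output y_i."""
    outputs = []
    for i in range(len(biases)):
        z_i = sum(x_k * weights[k][i] for k, x_k in enumerate(inputs)) - biases[i]
        outputs.append(sigmoid(z_i))
    return outputs

# Example: 3 input features -> 2 hidden neurons -> 1 predicted usage value.
x = [0.6, 0.3, 0.8]                        # e.g., normalized CPU, memory, running tasks
w_hidden = [[0.2, -0.4], [0.7, 0.1], [-0.3, 0.5]]
b_hidden = [0.1, -0.2]
w_out = [[0.6], [-0.8]]
b_out = [0.05]

hidden = forward(x, w_hidden, b_hidden)
usage = forward(hidden, w_out, b_out)[0]   # predicted node usage in [0, 1]
print(round(usage, 3))
```

In training, the weights would be adjusted from observed node usage; here they are fixed only to show how a usage prediction is produced for one data node.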

The memory requirement of a data node can be calculated as follows. For a data node with a process memory of 4–8 GB, a task tracker memory of 4–8 GB, an OS memory of 4–8 GB, and 4 CPU cores with 4–8 GB of memory per core, the total memory of the data node is (at least) 4 × 4 + 4 + 4 + 4 = 28 GB.
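The same lower bound can be checked with a short calculation, using the minimum values quoted above:

```python
# Minimum memory budget for one data node, using the lower bounds quoted above.
cores, mem_per_core = 4, 4                        # 4 CPU cores, 4 GB per core
process_mem, tasktracker_mem, os_mem = 4, 4, 4    # GB each (lower bounds)

total_gb = cores * mem_per_core + process_mem + tasktracker_mem + os_mem
print(total_gb)   # 4*4 + 4 + 4 + 4 = 28 GB
```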

Experimental Results

This section describes the evaluation of the proposed Improved Hadoop on the Amazon cloud. An AWS account was created to use Amazon EC2 and S3. We then compare the performance evaluation results of our algorithm with those of previous Hadoop improvement algorithms, including DGNS [27]. The experimental results are used to validate the performance of the proposed system relative to that of the native Hadoop configuration.

Experimental Environment

The experimental environment included a separate Improved Hadoop model for each given application. The software description for Improved Hadoop is given in Table 1. Each Hadoop cluster node included a Pentium Dual-Core CPU E5700 @ 3.00 GHz and 2.00 GB of installed memory on a 32-bit operating system. Apache Hadoop release 2.7.2, a stable version, was installed on the Hadoop cluster, which is composed of one name node and multiple data nodes. The name node runs the resource manager for Map-Shuffle-Reduce tasks. Each data node runs the Map-Shuffle-Reduce tasks, and the default block size is 128 MB. We conduct diverse experiments using the default, or native, Hadoop configuration to obtain the parameters for the Improved Hadoop configuration. The system configuration is shown in Tables 1 and 2.

Table 1 System configuration
Table 2 Hadoop configuration parameters

In our Improved Hadoop configuration, the input data block size is set to 128 MB, but the input data size can be varied up to 2 GB. There are 12 Hadoop configuration parameters considered in the experiments, and the selection of these parameters is described in Table 2. In our implementation, the Amazon EC2 Hadoop cluster nodes are as follows:

  • Master node (name node),

  • Secondary master node (secondary name node),

  • Slave nodes (data nodes), and

  • Aggregator nodes.

Configuration of Amazon S3

A standard scheme to store, share, and retrieve large quantities of input data in Amazon Web Services is the Amazon S3 object storage service. In this paper, we recommend that this service be used to keep all the data required to run the Hadoop workload clusters. The features of Amazon S3 are as follows: it can store long-term data; it supports objects up to 5 TB, with many petabytes of data allowed in a single bucket; it is easily writable and readable from Apache Hadoop; and it is available in all the AWS regions where instances are launched, based on the storage options suggested in Amazon S3. S3 provides high durability for input data, is a cheap option for reducing storage redundancy, and offers 99.99% availability over a given year. The performance of the HDFS is high because data storage and processing occur on the same cluster, which improves the processing speed and reduces latency. Table 2 describes the core Hadoop configuration parameters.
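As an example of staging input data in S3 before running a workload, a minimal boto3 sketch is shown below. The bucket and object names are hypothetical, and AWS credentials are assumed to be available through the environment or an IAM role attached to the EC2 instances.

```python
import boto3

# Hypothetical bucket/key names; credentials are assumed to be configured
# in the environment or via an IAM role on the EC2 cluster nodes.
s3 = boto3.client("s3")
bucket, key = "improved-hadoop-input", "terasort/input-2gb.dat"

# Stage the input data set in S3 so every cluster node can read it.
s3.upload_file("input-2gb.dat", bucket, key)

# The object can then be referenced from the Hadoop cluster, for example
# through the s3a:// connector (s3a://improved-hadoop-input/terasort/input-2gb.dat).
print(s3.head_object(Bucket=bucket, Key=key)["ContentLength"], "bytes stored")
```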

Terasort Benchmark Description

TeraSort is used to test our proposed work. It is a very popular benchmark that estimates the time required to sort one terabyte of randomly distributed data on the configured system, and it is commonly used to estimate the MapReduce performance of a Hadoop cluster. TeraSort has been recorded sorting 1 TB of data in 209 s on a Hadoop cluster of 910 nodes. The main goal of TeraSort is to sort 1 TB of data as quickly as possible, and it jointly tests the MapReduce and HDFS layers of a Hadoop cluster; it can also be used to check whether MapReduce slot assignments are sound. A full TeraSort run follows three steps: (1) generate the input data with TeraGen, (2) run the actual TeraSort over the input data, and (3) validate the output data using TeraValidate.
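The three steps can be driven, for instance, with a short script such as the sketch below. The examples-jar path and HDFS directories are assumptions matching a default Hadoop 2.7.2 installation, and the row count (10,000,000 rows of 100 bytes, roughly 1 GB) is scaled down from the full 1 TB benchmark.

```python
import os
import subprocess

# Hedged driver for the three TeraSort phases (TeraGen -> TeraSort -> TeraValidate).
# The jar path and HDFS directories are assumptions for a default Hadoop 2.7.2 setup.
hadoop_home = os.environ["HADOOP_HOME"]
jar = f"{hadoop_home}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar"

def run(*args):
    # Invoke "hadoop jar <examples-jar> <benchmark> <arguments...>" and fail fast on errors.
    subprocess.run(["hadoop", "jar", jar, *args], check=True)

run("teragen", "10000000", "/terasort/input")                 # (1) generate input data
run("terasort", "/terasort/input", "/terasort/output")        # (2) sort the input
run("teravalidate", "/terasort/output", "/terasort/report")   # (3) validate the output
```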

Results and Discussion

In this section, the acceleration of Hadoop and the improvements in Hadoop performance are discussed. In addition, we present designs that take advantage of Map-Shuffle-Reduce and the enhanced name node capabilities. The proposed system results are promising, and various open research issues are addressed. Finally, we show that the proposed approach is fast and scalable compared to previous methods.

Definitions of Performance Measures

  • Throughput is the total number of bytes processed per unit of time. This metric is computed as follows.

    $${\text{Total}}\;{\text{throughput}}\;\left[ {{\text{MB}}/{\text{s}}} \right] = \frac{{{\text{Total}}\;{\text{bytes}}\;{\text{processed}}}}{{{\text{Test}}\;{\text{Execution}}\;{\text{Time}}}}$$
    (6)

To obtain accurate throughput, the number of map slots associated with a given cluster should be determined.

  • The response time is the time required for processing in Hadoop, which involves hashing for mappers and reducers and inputting and outputting HDFS data blocks.

We compute the response time for a given task as follows:

$${\text{Response}}\;{\text{Time}} = \frac{M}{\text{HT}} + \frac{R}{S}$$
(7)
$$= \left( {{\text{BI}} + H + O} \right) + \left( {{\text{Sh}} + {\text{So}} + {\text{BO}}} \right)$$
(8)

where M represents the map time, HT represents the hash time, R represents the reduce time, and S is the sort time. In Eq. (8), BI represents the block input time, H represents hashing, O represents the output time, Sh represents the shuffle time, So represents the sort time, and BO represents the block output time.
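As a concrete illustration of Eqs. (6)-(8), the two measures can be computed from logged task timings as follows; the timing and size values are placeholders, not measurements from the experiments.

```python
# Illustrative computation of Eqs. (6)-(8); the values below are placeholders.

def throughput_mb_s(total_bytes, execution_time_s):
    """Eq. (6): total bytes processed per unit of test execution time (MB/s)."""
    return (total_bytes / 1e6) / execution_time_s

def response_time_s(block_in, hashing, output, shuffle, sort, block_out):
    """Eq. (8): map-side time (BI + H + O) plus reduce-side time (Sh + So + BO)."""
    return (block_in + hashing + output) + (shuffle + sort + block_out)

print(throughput_mb_s(total_bytes=2 * 1024**3, execution_time_s=95.0))  # ~22.6 MB/s
print(response_time_s(3.1, 1.4, 2.0, 4.2, 2.5, 1.8))                    # 15.0 s
```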

Effectiveness of the Proposed Approach Versus Those in Previous Works

In this paper, we first obtain throughput results for the native and improved versions of Hadoop. A study of data representations in Hadoop is presented to optimize data storage and retrieval. In the HDFS, a client job is divided into a number of tasks that are allocated among different data blocks. Through the write-once, read-many option, high-throughput data access is supported. The HDFS works on the data locality principle, in which the computation moves to the data instead of the data to the computation, which reduces network congestion. Owing to the benefits stated above, the overall system throughput is improved by the proposed method. We test the proposed approach against the original Hadoop, and Fig. 5 shows the throughput evaluation results for variable data sizes.

Fig. 5 Throughput results

As shown in Fig. 5, Improved Hadoop displays better performance than Hadoop because the improved ACO is used to schedule client jobs. Next, the proposed approach is compared with state-of-the-art approaches, including DGNS [27] and iShuffle [30].

Figure 6 illustrates the throughput performance of DGNS [27] and Improved Hadoop. The number of tasks is varied from 500 to 2000, and the throughput increased for each task run. DGNS was proposed for a heterogeneous computing environment with almost 10 GB of data and 100 nodes; however, the CPU time and memory requirements of its master node are high compared to those of Improved Hadoop. With iShuffle [30], the Hadoop cluster yielded less throughput due to longer completion times, as shown in Fig. 6 (Table 3).

Fig. 6 Comparison of throughput

Table 3 Throughput for Hadoop versus improved Hadoop
Response time: The HDFS configuration parameters and design factors affect the performance of Hadoop frameworks. Such factors include the block size, the number of data nodes used, and the number of clients (Table 4).

    Table 4 Throughput for Hadoop versus improved Hadoop

The results of user job scheduling and the resource utilization responses are shown in Fig. 7. The results show that native Hadoop has a high response time, whereas Improved Hadoop has a lower response time because user jobs are distributed and executed using Map-Shuffle-Reduce tasks in the Hadoop cluster. Furthermore, the native Hadoop cluster uses default schedulers, including static (FIFO, fair, capacity, delay, and matchmaking), dynamic (resource-aware and deadline-constraint), resource-based (delay, matchmaking, and resource-aware) and time-based (delay and deadline-constraint) schedulers.

Fig. 7 Response time results

Different factors are considered in native Hadoop, and we investigate these factors and compare them with those in Improved Hadoop. Figure 7 shows the response time results when the number of tasks is varied from 1000 to 5000. In [30], the authors evaluated iShuffle operations in a multiuser Hadoop environment and ran multiple jobs; furthermore, they used a modified Hadoop fair scheduler to support two different MapReduce jobs at a time. This reduced the overall job completion time by 30.2%, but small jobs have long wait times, which decreases the overall system performance. In our proposed version of Improved Hadoop, the optimization problem is solved using an improved ACO, and the usage of data nodes is determined and periodically updated for task execution (refer to Fig. 8).

Fig. 8 Comparison of response time

In contrast to iShuffle, new user tasks can be dynamically added to the scheduler without affecting the system performance. In summary, we have presented various ways of enhancing the capabilities of native Hadoop, used aggregator nodes to initiate HDFS jobs, and analyzed the results in terms of throughput and response time.

Conclusion and Future Work

Rising interest in the improvement of Hadoop performance by tuning the relevant configuration parameters is evident in the recent increase in research. In this paper, we have integrated multiple optimization techniques to improve Hadoop performance in a cloud environment. We presented an improved ant colony optimization method to schedule jobs according to the job size and the expected execution time. We also proposed an artificial neural network to predict the number of running tasks and the resource usage of the data nodes. Moreover, we enhanced the Hadoop name node capabilities by adding an aggregator node to the default HDFS framework architecture. The experimental results showed a significant improvement in Hadoop performance in terms of maximized throughput and reduced CPU response time. The results illustrate the improvements offered by the proposed approach and indicate that it performs better than the other approaches. In the future, we will perform additional experiments considering more parameters. We also look forward to investigating the configuration factors that influence the performance and speed of Hadoop.

References

  1. Wang T, Wang J, Nguyen SN, Yang Z, Mi N, Sheng B. Ea2s2: an efficient application-aware storage system for big data processing in heterogeneous clusters. In: 2017 26th international conference on computer communication and networks (ICCCN). IEEE; 2017. p. 1–9.

  2. Subrahmanyam K, Thanekar SA, Bagwan A. Improving Hadoop performance by enhancing name node capabilities. J Soc Technol Environ Sci. 2017;6(2):1–8.

  3. Usama M, Liu M, Chen M. Job schedulers for big data processing in Hadoop environment: testing real-life schedulers using benchmark programs. Digit Commun Netw. 2017;3(4):260–73.

  4. Han S, Choi W, Muwafiq R, Nah Y. Impact of memory size on bigdata processing based on Hadoop and Spark. In: Proceedings of the international conference on research in adaptive and convergent systems. ACM; 2017. p. 275–80.

  5. Nghiem PP, Figueira SM. Towards efficient resource provisioning in MapReduce. J Parallel Distrib Comput. 2016;95:29–41.

  6. Wang K, Yang Y, Qiu X, Gao Z. MOSM: an approach for efficient storing massive small files on Hadoop. In: 2017 IEEE 2nd international conference on big data analysis (ICBDA). IEEE; 2017. p. 397–401.

  7. Kim H-G. Effects of design factors of HDFS on I/O performance. J Comput Sci. 2018;14:304–9.

  8. Nazini H, Sasikala T. Simulating aircraft landing and take off scheduling in distributed framework environment using Hadoop file system. Cluster Comput. 2018;22:1–9.

  9. Luo X, Fu X. Configuration optimization method of Hadoop system performance based on genetic simulated annealing algorithm. Cluster Comput. 2018;22:1–9.

  10. Guo M. Design and realization of bank history data management system based on Hadoop 2.0. Cluster Comput. 2018;22:1–7.

  11. Aydin G, Hallac IR. Distributed log analysis on the cloud using MapReduce. arXiv preprint arXiv:1802.03589. 2018.

  12. Yao Y, Tai J, Sheng B, Mi N. LsPS: a job size-based scheduler for efficient task assignments in Hadoop. IEEE Trans Cloud Comput. 2015;3(4):411–24.

  13. Bhatnagar R. Machine learning and big data processing: a technological perspective and review. In: International conference on advanced machine learning technologies and applications. Springer; 2018. p. 468–78.

  14. Lu Q, Li S, Zhang W, Zhang L. A genetic algorithm-based job scheduling model for big data analytics. EURASIP J Wirel Commun Netw. 2016;2016(1):152.

  15. Hua X, Huang MC, Liu P. Hadoop configuration tuning with ensemble modeling and metaheuristic optimization. IEEE Access. 2018;6:44161–74.

  16. Ba-Alwi FM, Ammar SM. Improved FTWeighted HashT Apriori algorithm for big data using Hadoop MapReduce model. J Adv Math Comput Sci. 2018;27(1):1–11.

  17. Singh S, Garg R, Mishra P. Performance optimization of MapReduce-based Apriori algorithm on Hadoop cluster. Comput Electr Eng. 2018;67:348–64.

  18. Soualhia M, Khomh F, Tahar S. A dynamic and failure-aware task scheduling framework for Hadoop. IEEE Trans Cloud Comput. 2018. https://doi.org/10.1109/TCC.2018.2805812.

  19. Wang J, Qiu M, Guo B, Zong Z. Phase-reconfigurable shuffle optimization for Hadoop MapReduce. IEEE Trans Cloud Comput. 2015. https://doi.org/10.1109/TCC.2015.2459707.

  20. Kc K, Anyanwu K. Scheduling Hadoop jobs to meet deadlines. In: 2010 IEEE second international conference on cloud computing technology and science (CloudCom). IEEE; 2010. p. 388–92.

  21. Guo Y, Wu L, Yu W, Wu B, Wang X. The improved job scheduling algorithm of Hadoop platform. arXiv preprint arXiv:1506.03004. 2015.

  22. Brahmwar M, Kumar M, Sikka G. Tolhit: a scheduling algorithm for Hadoop cluster. Proc Comput Sci. 2016;89:203–8.

  23. Gu R, Yang X, Yan J, Sun Y, Wang B, Yuan C, Huang Y. SHadoop: improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters. J Parallel Distrib Comput. 2014;74(3):2166–79.

  24. Alshammari H, Lee J, Bajwa H. H2Hadoop: improving Hadoop performance using the metadata of related jobs. IEEE Trans Cloud Comput. 2016;6:1031–40.

  25. Jeon S, Chung H, Choi W, Shin H, Chun J, Kim JT, Nah Y. MapReduce tuning to improve distributed machine learning performance. In: 2018 IEEE first international conference on artificial intelligence and knowledge engineering (AIKE). IEEE; 2018. p. 198–200.

  26. Chung H, Nah Y. Performance comparison of distributed processing of large volume of data on top of Xen and Docker-based virtual clusters. In: International conference on database systems for advanced applications. Springer; 2017. p. 103–13.

  27. Chen C-T, Hung L-J, Hsieh S-Y, Buyya R, Zomaya AY. Heterogeneous job allocation scheduler for Hadoop MapReduce using dynamic grouping integrated neighboring search. IEEE Trans Cloud Comput. 2017. https://doi.org/10.1109/TCC.2017.2748586.

  28. Sneha S, Sebastian S. Improved fair scheduling algorithm for Hadoop clustering. Oriental J Comput Sci Technol. 2017;10:194–200.

  29. Choi D, Jeon M, Kim N, Lee B-D. An enhanced data-locality-aware task scheduling algorithm for Hadoop applications. IEEE Syst J. 2017;99:1–12.

  30. Guo Y, Rao J, Cheng D, Zhou X. iShuffle: improving Hadoop performance with shuffle-on-write. IEEE Trans Parallel Distrib Syst. 2017;28(6):1649–62.


Acknowledgements

This research was supported by the MIST (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW supervised by the IITP (Institute for Information & communications Technology Promotion) (2017-0-00091). This work was supported by “Human Resources Program in Energy Technology” of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea. (No. 20174030201740).

Author information


Corresponding author

Correspondence to Rayan Alanazi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Cite this article

Alanazi, R., Alhazmi, F., Chung, H. et al. A Multi-Optimization Technique for Improvement of Hadoop Performance with a Dynamic Job Execution Method Based on Artificial Neural Network. SN COMPUT. SCI. 1, 184 (2020). https://doi.org/10.1007/s42979-020-00182-3


Keywords

  • Hadoop performance enhancement
  • Job scheduling
  • Artificial neural network
  • Improved ant colony optimization
  • Map-shuffle-reduce