1 Introduction

The greatest advantage of the Hadoop platform is that developers do not need to understand its internal implementation details: the underlying framework automatically encapsulates the complex parallelization machinery. Technicians therefore do not need to write complex cloud computing code; they only need to call the corresponding interfaces. Although Hadoop has many advantages, there is still considerable room for improvement because the platform is still relatively young, and in practice some implementation details of the built-in MapReduce framework are not yet mature. In particular, Hadoop's task scheduling implicitly assumes that all nodes have the same computing power, whereas heterogeneous clusters are very common in reality. Assigning more work to nodes with more computing power can improve both resource utilization and platform execution efficiency. It is therefore worthwhile to study data locality and differences in node computing power and to improve how Hadoop handles them. This paper thus examines and analyses the optimization of computing task scheduling performance on the Hadoop big data platform and is expected to make a contribution to optimizing task scheduling performance.

This paper examines and analyses the optimization of computing task scheduling performance on the Hadoop big data platform. Compared with Hadoop's default task scheduling algorithm, the task scheduling algorithm for data localization within Hadoop jobs proposed in this paper achieves better data locality across the four test jobs: more tasks run on nodes that already store their data, the expected goal is achieved, the effectiveness of the algorithm is verified, and performance is improved by 30%. The reason is that during task scheduling we compute a data localization saturation value for each node; low-saturation nodes are scheduled first, so high-saturation nodes do not preempt tasks whose data blocks are stored on low-saturation nodes.

The innovation of this paper is reflected in the following: (1) This paper analyses and discusses Hadoop big data processing and simultaneously studies the improvement in the Hadoop resource scheduling algorithm. (2) This paper conducts experimental research on the performance optimization of computing task scheduling within Hadoop jobs and analyses the results.

2 Related work

According to research progress abroad, different researchers have conducted corresponding studies on task scheduling. Lu P conducted a comparative study of DoT scheduling in fixed- and flexible-grid ML-IDCONs and proposed a DoT scheduling algorithm that works well for both types of networks [1]. Zheng proposed a novel ensemble method that schedules task sets by assigning each task's locked cache contents to the local cache and the second-level cache [2]. Ahmad presented a comparative analysis of existing simulators and visualization techniques, which is helpful for the feasibility analysis of real-time input tasks for IoT embedded applications [3]. Considering the influence of observation time on imaging quality, Li proposed agile satellite task scheduling based on a constraint satisfaction model and discrete differential evolution combined with variable neighbourhood search [4]. However, these scholars' explorations of task scheduling lack certain technical demonstrations. Better studies exist on task scheduling based on Hadoop big data, so we also reviewed the relevant literature on Hadoop big data.

Some scholars have also conducted research on Hadoop big data. The main aim of Agarwal was to describe the concepts of big data, cloud computing, big data as a service and Hadoop as a service on cloud platforms [5]. To improve the efficiency of Hadoop clusters in big data collection and analysis, Dadheech proposed an algorithmic system that can meet the needs of protected discriminative data in Hadoop clusters and improve performance [6]. Based on a Hadoop big data platform, Zheng determined parameters such as the state vector, the distance matching principle, and the neighbourhood value through numerical tests and constructed a short-term traffic flow prediction method on the Hadoop big data platform to realize short-term prediction of road traffic flow [7]. However, these scholars did not examine and analyse the optimization of computing task scheduling performance on the Hadoop big data platform; they only discussed its significance from a single perspective.

3 Hadoop big data platform

3.1 Hadoop big data processing

Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed software without knowing the details of the underlying distributed infrastructure, giving full play to the advantages of the cluster for fast computing and storage. Hadoop is an open-source Java implementation of the core components of Google's cloud computing system and includes HDFS and MapReduce [8]. It has high reliability, high scalability, high efficiency, high fault tolerance, and low cost, and many scholars use Hadoop as their research platform when working in the field of cloud computing [9]. Hadoop is widely used in big data processing mainly because of its capacity for data extraction, transformation and loading (ETL). Hadoop's distributed architecture places the big data processing engine as close to the stored data as possible, which suits batch operations such as ETL, because batch results can be sent directly to storage. Figure 1 shows the scope of cloud computing services.

Fig. 1 Scope of cloud computing services

One of the most common applications of Hadoop is web search. It is not only a software framework but also a parallel data processing engine, and its performance is very good. Hadoop can distribute the computing process to a large number of cheap servers in the cluster for execution, and its low cost and low service price make it easy for users to adopt [10]. Unlike some computing frameworks that "move data to tasks", Hadoop adopts a "move programs to data" model: during execution, Hadoop replicates the program and its dependent libraries to all task nodes, which improves cluster performance and further reduces the I/O load [11]. Once a failed node is detected, Hadoop retries execution on a different node, so the failure of a single compute node will not cause cluster inconsistencies. In addition, Hadoop has high scalability, and computing nodes can be added at any time to participate in the cluster through simple configuration [12]. Figure 2 shows the Hadoop cluster distribution environment. The core of the Hadoop distributed computing platform includes the distributed file system HDFS, the MapReduce processing framework, the data warehouse tool Hive, and the distributed database HBase.

Fig. 2 Hadoop cluster

HDFS realizes the distributed storage of data and solves many of the problems of earlier distributed file systems, including the amount of data that can be stored (terabyte or petabyte level), the reliability of storage, and integration with Hadoop's MapReduce framework [13].

As shown in Fig. 3, HDFS has a typical master/slave structure, with the roles of master and slave played by the name node and the data nodes, respectively [14]. The name node is responsible for managing the HDFS file system, including the namespace and the file blocks. When saving a file, HDFS divides it into several blocks of a fixed size and can replicate the blocks according to a configurable replication factor. These blocks are distributed across different data nodes, which greatly increases the storage capacity and availability of HDFS [15]. Each data node periodically sends heartbeats and block reports to the name node, which allow the name node to check the status of the data node and whether the file data on that node are consistent with its own records [16].

Fig. 3 HDFS architecture

Because the name node is a single point of failure, a backup name node is provided that can take over the name node's work when it fails [17].
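As a simple illustration of the block mechanism described above, the following Python sketch computes how a file is split into blocks and how many block copies are stored; the 64 MB block size and replication factor of 3 are common HDFS defaults assumed here for illustration, not values taken from the cluster used later in this paper.

import math

def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Number of HDFS blocks for a file and the total block copies stored."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

# A 1 GB (1024 MB) file: 16 blocks, 48 stored copies spread over the data nodes
print(hdfs_blocks(1024))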

3.2 Improvement of the Hadoop resource scheduling algorithm

In the Hadoop platform, cluster resources are uniformly managed and allocated by the resource scheduler, which is a pluggable component of the ResourceManager. To facilitate extension, the system predefines interfaces through which users can implement resource scheduling policies themselves [18].

Aiming at the problems of low resource utilization and unbalanced load in cloud computing environments, a hybrid method combining the genetic algorithm and the particle swarm algorithm is proposed and applied to the scheduling of cloud computing resources to improve resource utilization [19, 20].

(1) Traditional optimization algorithm

Intelligent optimization algorithms mainly include genetic algorithms, simulated annealing algorithms, differential evolution algorithms, and particle swarm algorithms; they are usually designed for specific problems, such as job scheduling, machine learning, and control systems [21]. Next, we introduce the particle swarm algorithm and the genetic algorithm in detail.

Particle swarm optimization (PSO) is an evolutionary algorithm whose main idea is to model the seemingly irregular behaviour of a bird flock and thereby simulate the foraging process [22,23,24]. As shown in Fig. 4, PSO simulates the foraging behaviour of birds: a flock searches an area haphazardly in which only a single piece of food exists. No bird knows where the food is, but every bird knows its distance from the food, so the simplest and most effective strategy is to search the neighbourhood of the bird currently closest to the food. Foraging is an instinctive behaviour: when animals are hungry or thirsty, they naturally look for food and water to meet their physiological needs and survive, and animals that cannot find food die. The particle swarm algorithm maps the area in which the birds move to the solution space of the problem. Each individual guides its movement using information from neighbouring individuals and, after many iterations, the swarm converges towards the optimal solution.

Fig. 4 Particle swarm optimization

The method operates on a population and moves each individual to a better position by evaluating the fitness of its surroundings. The particle velocity and particle position are updated by the following formulas:

$$u_{nt}^{h} = r*u_{nt}^{h - 1} + s_{1} *e_{1} *\left( {ibest_{nt} - a_{nt}^{h - 1} } \right) + s_{2} *e_{2} *\left( {ibest_{t} - a_{nt}^{h - 1} } \right)$$
(1)
$$a_{nt}^{h} = a_{nt}^{h - 1} + u_{nt}^{h - 1}$$
(2)

where \(t = 1,2,3, \ldots ,T\), with T the dimension of the search space; \(u_{nt}^{h}\) is the t-th component of the velocity vector of particle n at iteration h; \(a_{nt}^{h}\) is the t-th component of the position vector of particle n at iteration h; and r is the inertia weight that adjusts the search range in the solution space. \(ibest_{nt}\) and \(ibest_{t}\) denote the best position found so far by particle n and by the whole swarm, respectively, \(s_{1}\) and \(s_{2}\) are learning factors, and \(e_{1}\) and \(e_{2}\) are random numbers in [0, 1].

During the iterations, particle velocities are bounded within the search space; the bounds define an interval that prevents particles from leaving the solution space. In general, the maximum velocity \(V_{\max }\) and the minimum velocity \(V_{\min }\) are constants defined by the user. After each velocity update, the velocity is clamped to this interval as follows:

$$V_{nt} = V_{\min } ,\quad {\text{if}}\; V_{nt} < V_{\min }$$
(3)
$$V_{nt} = V_{\max } ,\quad {\text{if}}\;V_{nt} > V_{\max }$$
(4)

In the iterative process of particle swarm optimization, the combination of inertia, self-cognition and social experience allows the population to have a good convergence speed in the solution space.
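As an illustration of Eqs. (1)-(4), the minimal Python sketch below performs one PSO iteration; the parameter values (inertia weight r, learning factors s1 and s2, velocity bounds) are illustrative assumptions, and the updated velocity is applied to the position immediately, as in the standard PSO formulation.

import numpy as np

def pso_step(pos, vel, pbest, gbest, r=0.7, s1=2.0, s2=2.0, v_min=-1.0, v_max=1.0):
    """One iteration following Eqs. (1)-(4): velocity update, clamping, position update."""
    n, t = pos.shape                                   # n particles, t dimensions
    e1, e2 = np.random.rand(n, t), np.random.rand(n, t)
    # Eq. (1): inertia + cognitive (personal best) + social (global best) terms
    vel = r * vel + s1 * e1 * (pbest - pos) + s2 * e2 * (gbest - pos)
    # Eqs. (3)-(4): clamp velocities to [v_min, v_max]
    vel = np.clip(vel, v_min, v_max)
    # Eq. (2): move each particle along its clamped velocity
    pos = pos + vel
    return pos, vel

# Hypothetical usage: 10 particles searching a 3-dimensional space
pos = np.random.rand(10, 3); vel = np.zeros((10, 3))
pos, vel = pso_step(pos, vel, pbest=pos.copy(), gbest=pos[0])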

The genetic algorithm is an adaptive, global, probabilistic search method based on the theory of genetic evolution. Like natural evolution, it applies survival of the fittest: through continuous evolution it finds better-adapted populations, from which the best individual is selected. During evolution, individuals evolve according to a fitness criterion and move in the direction that is optimal for the individual; candidate solutions are generated through corresponding coding rules, so increasingly reasonable solutions can be obtained. The genetic algorithm searches for the global optimal solution iteratively until the termination condition is met. When a genetic algorithm is used to solve a problem, every possible solution is encoded as a "chromosome", that is, an individual, and several individuals constitute a population.

(2) Cloud computing resource scheduling objective function

Resource scheduling refers to the way in which resource users configure resources according to certain usage principles in a specific resource environment. The MapReduce programming framework involves concepts such as jobs, resources, storage devices, and scheduling processes. In what follows, we model the computation of the cloud computing programming model and derive the objective function of resource scheduling; the metric is the cluster execution time. This function can be used as the fitness function of the genetic algorithm to make the computation more rigorous. There are generally two ways to schedule resources for computing tasks: adjusting the resources allocated to a computer or offloading computing work to another computer. For edge computing resource allocation and task scheduling problems, exact methods based on an analysis of the system characteristics have been proposed, but their algorithmic complexity is high and they consume substantial computing resources, so they are not suitable for large-scale problems. Most work uses heuristic strategies to allocate resources and schedule tasks; even for large-scale tasks and heterogeneous resources, such methods are easy to design and implement and do not occupy too many computing resources, but it is difficult for them to achieve good overall optimization results. Intelligent algorithms are applicable to complex problems with strong constraints and multiple objectives and have strong scalability, but they are difficult to apply in scenarios with high real-time requirements, such as distributed online problems.

The scheduling of cloud resources involves three levels: task-level resource scheduling, virtual resource scheduling (such as virtual machines), and physical resource scheduling. In cloud computing, tasks and resources are not in one-to-one correspondence; tasks are first mapped to virtual resources, and these resources are then mapped to the corresponding physical devices to implement job scheduling. In short, a resource scheduling process can be described by the following 5-tuple:

$$C = \left\{ {V,U,B,Q_{vu} ,Q_{ub} } \right\}$$
(5)

In this 5-tuple, V is the set of tasks, U the set of virtual resources, and B the set of physical (storage) devices; the mapping \(Q_{vu}\) assigns the system's tasks to virtual resources, and \(Q_{ub}\) assigns those resources to the corresponding physical devices. The key to resource scheduling is therefore how tasks are ultimately arranged on the physical devices. The assignment matrix of the q tasks in V to the p physical devices in B is denoted as:

$${\text{ETD}}_{qp} = {\text{ETD}}\left( {v_{n} Q_{vu} ,b_{h} } \right)\quad 1 \le n \le q,1 \le h \le p$$
(6)

The earliest completion time of task \(v_{n}\) on the physical device \(b_{h}\) through the mapping relationship \(Q_{vu}\) is recorded as:

$${\text{Finish}}\left( {v_{n} Q_{vu} ,b_{h} } \right) = {\text{Start}}\left( {b_{h} } \right) + {\text{ETD}}\left( {v_{n} Q_{vu} ,b_{h} } \right)$$
(7)

The total time taken by all tasks assigned to physical device \(b_{h}\) can be expressed as:

$${\text{Sum}}\left( {b_{h} } \right) = \mathop \sum \limits_{n = 1}^{q} s_{nh} {\text{Finish}}\left( {v_{n} Q_{vu} ,b_{h} } \right)$$
(8)
$$s_{nh} = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {v_{n} Q_{vu} = b_{h} } \hfill \\ {0,} \hfill & {v_{n} Q_{vu} \ne b_{h} } \hfill \\ \end{array} } \right.$$
(9)

Therefore, the total time for all tasks to execute through the resource scheduler is:

$${\text{total}}\left( V \right) = \mathop \sum \limits_{h = 1}^{p} {\text{Sum}}\left( {b_{h} } \right)$$
(10)

The goal of the resource scheduling algorithm is to make the total running time of the tasks as short as possible, i.e., to minimize the quantity in the above formula:

$${\text{Goal}}\left( V \right) = {\text{min}}\mathop \sum \limits_{h = 1}^{p} {\text{Sum}}\left( {b_{h} } \right)$$
(11)
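To make Eqs. (6)-(11) concrete, the short sketch below evaluates the scheduling objective for a given task-to-device assignment; the ETD matrix, device start times and assignment used at the end are illustrative assumptions rather than values from the paper.

import numpy as np

def total_time(etd, start, assign):
    """Objective of Eqs. (8)-(11): summed finish times over all devices.

    etd[n, h]  -- execution time of task n on device h, Eq. (6)
    start[h]   -- time at which device h becomes available
    assign[n]  -- index of the device that task n is mapped to (s_nh = 1)
    """
    q, p = etd.shape
    per_device = np.zeros(p)
    for n in range(q):
        h = assign[n]
        # Eq. (7): finish time = device start time + execution time
        per_device[h] += start[h] + etd[n, h]
    # Eq. (10): total over all devices; the scheduler seeks its minimum, Eq. (11)
    return per_device.sum()

# Hypothetical example: 4 tasks, 2 devices
etd = np.array([[3.0, 5.0], [2.0, 2.0], [4.0, 1.0], [6.0, 3.0]])
print(total_time(etd, start=np.array([0.0, 1.0]), assign=[0, 1, 1, 0]))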
(3) Improved genetic algorithm

The traditional genetic algorithm is a typical optimization method. Building on it, this paper exploits the characteristics of resource allocation in cloud computing to make the algorithm converge more quickly; on this basis, an improved fitness function is introduced to address the shortcomings of the traditional genetic algorithm.

The standard genetic algorithm uses binary encoding over the character set {0, 1}, with the genes forming binary-encoded strings. The problem addressed in this paper, however, is continuous in nature, since tasks must be ordered and assigned as a continuous sequence, and binary encoding turns a continuous function into a discrete one. The main drawbacks of binary encoding are that the Hamming distance between neighbouring values can be large, which makes it hard for crossover and mutation to move between them, and that the precision is limited. Real-number coding, which works with Euclidean distance, keeps the genetic algorithm closer to the problem space, so we adopt real coding.

The genetic algorithm's search simulates the principle of survival of the fittest in nature, and the fitness function is the criterion for judging the quality of individuals in the population. In cloud computing, task completion time is an important indicator of the merit of a resource allocation scheme: the shorter the completion time, the greater the fitness and the better the individual. This can be expressed as follows:

$$g = \frac{1}{{\min \mathop \sum \nolimits_{h = 1}^{p} {\text{Sum}}\left( {b_{h} } \right)}}$$
(12)

A dynamic penalty function is introduced into the fitness function so that individuals that do not satisfy the constraints receive a smaller fitness value. The penalty function can be expressed as:

$$i = \left( {sv} \right)^{\beta } \mathop \sum \limits_{h = 1}^{p} j_{h}^{\theta }$$
(13)

In the formula, v represents the number of generations; generally, s = 0.5 and \(\beta = \theta = 2\). Combining the fitness function with the dynamic penalty function gives:

$$G = \frac{1}{{\min \mathop \sum \nolimits_{h = 1}^{p} {\text{Sum}}\left( {b_{h} } \right)}} + \left( {sv} \right)^{\beta } \mathop \sum \limits_{h = 1}^{p} j_{h}^{\theta }$$
(14)
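A minimal sketch of the fitness computation in Eqs. (12)-(14): the total schedule time is assumed to come from a routine such as total_time above, the constraint-violation amounts j_h are illustrative placeholders, and the penalty term is combined with the base fitness exactly as written in Eq. (14).

def fitness(total, generation, violations, s=0.5, beta=2.0, theta=2.0):
    """Fitness with dynamic penalty, following Eqs. (12)-(14).

    total       -- total execution time of the schedule, Eq. (10)
    generation  -- current generation count v
    violations  -- per-device constraint-violation amounts j_h (assumed inputs)
    """
    base = 1.0 / total                                                        # Eq. (12)
    penalty = (s * generation) ** beta * sum(j ** theta for j in violations)  # Eq. (13)
    return base + penalty                                                     # Eq. (14)

# Hypothetical usage: schedule time 16.0 at generation 10 with two small violations
print(fitness(16.0, 10, [0.2, 0.1]))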

Before the crossover operation, individuals are usually matched randomly: if a population contains Q individuals, \(Q{/}2\) pairs are formed at random, and crossover is then applied to the two individuals of each pair.

Since real-number coding is adopted, arithmetic crossover can be applied to \(A_{n}\) and \(A_{m}\) to obtain the offspring \(A_{n}^{\prime }\) and \(A_{m}^{\prime }\), where \(\varphi \in \left[ {0,1} \right]\) is the crossover coefficient. The crossover operation can be expressed as follows:

$$A_{n}^{\prime } = \varphi A_{n} + \left( {1 - \varphi } \right)A_{m}$$
(15)
$$A_{m}^{\prime } = \varphi A_{m} + \left( {1 - \varphi } \right)A_{n}$$
(16)

Assume that the individual to be mutated is \(Q = \left[ {q_{1} ,q_{2} , \ldots ,q_{h} , \ldots q_{K} } \right]^{v}\) and that the mutation probability is \(I_{q}\). The gene \(q_{h}\) selected for mutation is updated as:

$$q_{h + 1} = \left\{ {\begin{array}{*{20}l} {q_{h} + \Delta \left( {v,q_{h\max } - q_{h} } \right),} \hfill & {\beta = 0} \hfill \\ {q_{h} - \Delta \left( {v,q_{h} - q_{h\min } } \right),} \hfill & {\beta = 1} \hfill \\ \end{array} } \right.$$
(17)

\(\Delta \left( {v,a} \right)\) represents a nonuniformly distributed random number in the interval [0, a], and its expression can be described as follows:

$$\Delta \left( {v,a} \right) = a\left( {1 - \theta^{{\left( {1 - K/V} \right)^{d} }} } \right)$$
(18)

In the formula, d represents the parameter that determines nonuniformity, and \(\theta\) is a random number in the interval [0,1].
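The sketch below illustrates the real-coded operators of Eqs. (15)-(18); the crossover coefficient phi, the mutation probability, and the gene bounds are illustrative assumptions, the exponent of Eq. (18) is interpreted as (1 - current generation / maximum generations)^d, and the second mutation branch moves a gene towards its lower bound as in standard nonuniform mutation.

import random

def arithmetic_crossover(a_n, a_m, phi=0.6):
    """Arithmetic crossover on two real-coded parents, Eqs. (15)-(16)."""
    child_n = [phi * x + (1 - phi) * y for x, y in zip(a_n, a_m)]
    child_m = [phi * y + (1 - phi) * x for x, y in zip(a_n, a_m)]
    return child_n, child_m

def delta(v, a, max_gen, d=2.0):
    """Nonuniform step Delta(v, a) in [0, a], Eq. (18); theta is random in [0, 1]."""
    theta = random.random()
    return a * (1 - theta ** ((1 - v / max_gen) ** d))

def nonuniform_mutation(ind, v, max_gen, lo, hi, p_mut=0.05):
    """Nonuniform mutation, Eq. (17): shift a gene towards its upper or lower bound."""
    out = list(ind)
    for h, q in enumerate(out):
        if random.random() < p_mut:
            if random.random() < 0.5:                   # beta = 0: towards the upper bound
                out[h] = q + delta(v, hi[h] - q, max_gen)
            else:                                       # beta = 1: towards the lower bound
                out[h] = q - delta(v, q - lo[h], max_gen)
    return out

# Hypothetical usage on 3-gene individuals bounded by [0, 10]
c1, c2 = arithmetic_crossover([1.0, 4.0, 7.0], [2.0, 3.0, 9.0])
mutant = nonuniform_mutation(c1, v=5, max_gen=100, lo=[0.0] * 3, hi=[10.0] * 3)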

The population obtained after evolution by the genetic algorithm is used as the initial swarm, and the optimal resource allocation scheme is obtained by combining the global search ability of the genetic algorithm with the local search ability of the PSO. Figure 5 shows the cloud computing resource scheduling process.
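The following self-contained toy sketch outlines the hybrid flow described above: a real-coded GA phase evolves a population, which then becomes the initial PSO swarm. The quadratic objective, population sizes, and operator settings are illustrative assumptions standing in for the resource scheduling objective of Eq. (11), not the paper's actual configuration.

import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    """Toy stand-in for the total-time objective of Eq. (11); lower is better."""
    return np.sum(x ** 2, axis=-1)

def hybrid_ga_pso(dim=5, pop=30, ga_gens=40, pso_iters=40, lo=-5.0, hi=5.0):
    # GA phase: keep the fitter half, create children by arithmetic crossover, Eq. (15)
    X = rng.uniform(lo, hi, (pop, dim))
    for _ in range(ga_gens):
        order = np.argsort(objective(X))
        parents = X[order[: pop // 2]]
        phi = rng.uniform(size=(pop // 2, 1))
        mates = parents[rng.permutation(pop // 2)]
        children = phi * parents + (1 - phi) * mates
        children += rng.normal(0.0, 0.1, children.shape)   # simple mutation step
        X = np.vstack([parents, np.clip(children, lo, hi)])
    # PSO phase: the evolved population is used as the initial swarm
    V = np.zeros_like(X)
    pbest, pbest_f = X.copy(), objective(X)
    gbest = pbest[np.argmin(pbest_f)]
    for _ in range(pso_iters):
        e1, e2 = rng.random(X.shape), rng.random(X.shape)
        V = 0.7 * V + 2.0 * e1 * (pbest - X) + 2.0 * e2 * (gbest - X)   # Eq. (1)
        X = np.clip(X + np.clip(V, -1.0, 1.0), lo, hi)                  # Eqs. (2)-(4)
        f = objective(X)
        better = f < pbest_f
        pbest[better], pbest_f[better] = X[better], f[better]
        gbest = pbest[np.argmin(pbest_f)]
    return gbest, pbest_f.min()

print(hybrid_ga_pso())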

Fig. 5 Cloud computing resource scheduling

4 Experimental results of computing task scheduling performance optimization

4.1 Cluster configuration

The Hadoop cluster in this experiment consists of 6 PCs. One serves as the master node and runs the JobTracker, the name node and the secondary name node; the remaining machines run TaskTracker and also act as data nodes. The PCs are connected by a 100 Mbps network. The hardware configurations of the PCs, such as memory size and CPU performance, differ, i.e., the cluster is heterogeneous. The operating system used in this cluster is Ubuntu 14.04, JDK 1.7 is installed uniformly, and the Hadoop version is 1.2.0.

An overview of the composition of this cluster is given in Table 1, and the hardware configuration of the master node is given in Table 2. Since the slave nodes are heterogeneous and their configurations differ, a configuration parameter table for the slave nodes is not given.

Table 1 Hadoop cluster composition overview table
Table 2 Master node hardware configuration

4.2 Experimental data

This study uses the word count program WordCount that comes with Hadoop as the test program to evaluate the performance of the scheduler. The program counts the number of occurrences of each word in one or more files. A batch of log files downloaded from the internet is used as input; from small to large, the data sizes in this experiment are 512 MB, 1 GB, 1.5 GB, and 2 GB. Table 3 shows the experimental data.

Table 3 Experimental data situation

This experiment focuses on the proportion of data localization computing tasks within a job and on the job execution time when job files of the same size are processed with different task scheduling algorithms on the Hadoop cluster. These results can be viewed through the monitoring webpage provided by Hadoop. The counter "Data-local map tasks" shown on the webpage gives the number of data localization tasks among the job's map tasks, and the proportion can be calculated from the total number of map tasks. The job execution time is obtained by subtracting the job start time from its completion time.
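For clarity, the locality ratio and execution time derived from these webpage counters can be computed as follows; the numbers in the example are hypothetical, not measured values from the experiment.

def locality_ratio(data_local_map_tasks, total_map_tasks):
    """Proportion of map tasks whose input block resides on the executing node."""
    return data_local_map_tasks / total_map_tasks

def execution_minutes(start_ms, finish_ms):
    """Job execution time from start/finish timestamps in milliseconds."""
    return (finish_ms - start_ms) / 60000.0

# Hypothetical job: 28 of 32 map tasks were data-local and the run took 6.5 minutes
print(locality_ratio(28, 32))        # 0.875
print(execution_minutes(0, 390000))  # 6.5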

The specific experimental process is as follows: Step 1: Configure the Hadoop cluster environment and perform different configurations according to the task scheduler used. Step 2: Upload the file to be calculated to HDFS. Step 3: Under different task schedulers, submit log files of the same size for calculation in the same way. Step 4: View the running results of each job under different task schedulers through the results provided by the webpage and calculate statistics. Step 5: Compare the experimental results, that is, the data locality and execution time of the job, to draw conclusions about the experiment. We use the current Hadoop default task scheduling algorithm and the data localization task scheduling algorithm based on Hadoop jobs proposed in this paper to conduct the experiments.

4.3 Experimental results

(1) Data locality analysis

The data locality of a job is an important indicator of how efficiently the job executes on a Hadoop cluster and is reflected by the proportion of data localization tasks in the job. A data localization task is one for which the required data blocks are already stored on the node that executes it, so the data do not need to be transmitted from other nodes over the network and can be read and processed locally. The larger the proportion of data localization tasks, the fewer tasks in the job need network transmission, which reduces both the average task execution time and the overall job execution time. Statistics about the tasks executed by a job in Hadoop can be viewed through the web page, and the proportion of data localization tasks can be obtained from the number of data localization computing tasks and the total number of computing tasks.

The line graph in Fig. 6 shows how the proportion of data localization tasks in a job changes as the size of the job file increases. At the beginning, the job file to be calculated is relatively small and is divided into few tasks, so each node only needs to process the tasks corresponding to one or two data blocks; the probability of nonlocal tasks is therefore low, and a 100% data localization ratio occurs easily. As the size of the job to be calculated grows, the number of tasks increases, and the proportion of data localization tasks generally no longer reaches 100% but only approaches it. The figure also shows that, as the job files and the number of map tasks grow, the proportion of data localization tasks in the job generally shows an increasing trend, which indicates that large files achieve better data locality than small files. The reason is that the Hadoop cluster splits input according to a fixed block size, so a large file is divided into more data shards than a small file, and these shards are distributed across the data nodes. The more nodes a file's blocks are stored on, the easier it is to assign each node work that is local to it, which also indicates that Hadoop clusters are better suited to handling large files.

Fig. 6 Data localization comparison trend chart

The bar chart in Fig. 7 compares the proportion of data localization tasks in jobs under the two task scheduling algorithms for files of the same size. We can clearly see that the task scheduling algorithm for data localization within Hadoop jobs proposed in this paper achieves better data locality than Hadoop's default task scheduling algorithm across the four jobs: there are more data localization tasks, the expected goal is achieved, the effectiveness of the algorithm is verified, and its performance is improved by 30%.

(2) Analysis of job execution time

Fig. 7 Data localization comparison stage diagram

The above experimental results show that the improved algorithm given in this paper achieves an obvious improvement in the data locality of jobs. Next, we analyse and compare the two algorithms in terms of job execution time, an important indicator of the processing efficiency of Hadoop. A shorter job execution time means that the cluster can process more data per unit time, i.e., the throughput of the entire cluster system increases. Moreover, job execution time is often the most intuitive concern for developers.

To this end, line graphs of the job execution time under the two task scheduling algorithms are given; the vertical axis represents the execution time in minutes.

As shown in Fig. 8, the performance improvement becomes more pronounced as the amount of data to be calculated increases. The data in this experiment are only gigabyte-level files, whereas the amounts of data processed by mainstream commercial companies on the Hadoop platform reach the petabyte and exabyte levels, at which scale the improvement will be even more obvious. At the same time, large-scale Hadoop clusters are generally deployed in a multirack structure across multiple data centres. Since this experiment could not set up a cluster spanning multiple data centres, the improvement of the proposed algorithm in reducing remote network transmission cannot be fully reflected. In addition, as mentioned earlier, the Hadoop platforms of large companies process hundreds of thousands of jobs per day, so any improvement that increases cluster efficiency is significant.

Fig. 8 Comparison trend of job execution time

Figure 9 shows that the execution time of each job is shortened under the task scheduling algorithm proposed in this paper, and the time reduction is more obvious with increasing data size. The reason is mainly the increased proportion of localized computing tasks as the amount of job data processed increases. The greater the total number of computing tasks is, the greater the number of data localization computing tasks is, and the reduction in the average execution time of each task leads to a reduction in the overall job execution time. Therefore, in the case of large-scale data computing, the improved algorithm proposed in this paper will perform better. At the same time, the experimental data show that the improved task scheduling algorithm obviously improves the data locality of the job and shortens the execution time of the job. It can thus improve the execution efficiency of the Hadoop platform and achieve the expected goal.

Fig. 9 Job execution time comparison stage diagram

5 Conclusion

This paper uses six PCs to build a small heterogeneous Hadoop cluster as the experimental environment and prepares log files with sizes of 512 MB, 1 GB, 1.5 GB, and 2 GB. The WordCount program that comes with Hadoop is used as the test program. In the experiment, log files of the same size are processed under different task scheduling algorithms, the results are collected, and the data locality and execution time of the jobs are analysed separately. On the one hand, the improved algorithm proposed in this paper shows an obvious improvement in data locality compared with the original task scheduling algorithm. The reason is that we calculate the data localization saturation of each node when scheduling tasks: nodes with low saturation are served first, so the tasks corresponding to the data blocks stored on low-saturation nodes are not preempted by nodes with high saturation. In addition, the larger the log file used in the experiment, the better the data locality. The reason lies in Hadoop's slicing mechanism: larger files are split into more tasks, which also indicates that Hadoop clusters are well suited to computing on large files. On the other hand, in terms of job execution time, the improved algorithm proposed in this paper also achieves a significant reduction. The reason is the increase in data localization computing tasks, which reduces the network transmission overhead and the overall running time of the entire job. From these two aspects, the improved algorithm proposed in this paper achieves the expected goal, which demonstrates its validity and reliability. In short, both cloud computing and Hadoop are in a period of rapid development, and many problems will arise that wait to be solved. We need to use the knowledge we have to devise new solutions to these problems and contribute to the development of the internet.