1 Introduction

Reusing source code can reduce the development time of a software system. But if a developer, instead of writing new code, simply copies and pastes existing code, either verbatim or with some modification, it can increase the cost of maintaining the software system [1, 2]. The hours saved during development by reusing existing code do not always pay off, since the clients' requirements can change, which in turn requires changes to the code. If any of the copied blocks is left unchanged, it can heavily cost the development of the software. Code cloning is therefore defined as an improper programming practice in which programmers reuse existing code with or without adding or deleting statements; the resulting code is called a code clone. In non-object-oriented languages such as C, which do not provide encapsulation, cloning can also be beneficial. For example, when the same code is required multiple times, reusing similar code can reduce the development time [3].

Many researchers have studied the impact of function clones [3, 4] and proposed different approaches for detecting function- or method-level code clones in software systems. The proposed algorithms fall into metrics-based, token-based, tree-based, and graph-based approaches. As computer processors become faster day by day, they can perform many more calculations simultaneously, which makes metrics-based approaches fast compared to token-based, tree-based, and graph-based approaches.

In this paper, we use a multi-threaded metrics-based approach for detecting function clones in software systems. For the comparison, we use eight metrics derived from each function of the software system. Further, to reduce false positives, a Cosine-similarity based [5] textual analysis is performed in addition to the metric comparison. Using multi-threading improves the execution time of the detection process. We also provide an introductory background on function clones, evaluate the proposed approach on three open source projects, and compare the results with CloneManager [3].

The rest of the paper is organized as follows: The motivation behind the work is explained in Sect. 2. Section 3 discusses related works. After the background knowledge about cloning is discussed in Sect. 4, a detailed description of the proposed methodology is provided in Sect. 5. Implementation details and results are presented in Sect. 6, and the conclusion is discussed in Sect. 7.

2 Motivation

Maintenance is one of the lengthiest phases of the software development life cycle [6]. It involves various tasks of improving and modifying the code once the software is deployed on the target system. If a software developer followed copy-paste techniques during the development phase in order to meet a deadline, then he/she needs to pay the price for it later. The greater the scattering of function clones, the greater the effort needed to make a modification or remove a bug [7]. Code cloning not only affects the quality of the code but also increases the overall cost of the software system. The factors that strengthen the importance of detecting function clones are:

Easy Propagation of Bugs. During the development phase, due to time limitations and other factors, developers often reuse code fragments. But uncontrolled reuse can scatter the same code fragment across multiple files and directories. If the original code contains a bug or needs even a one-line modification, the scattered clones become a serious problem for the software, as developers have to manually scan all the files to fix every copy. This is a very costly and time-consuming activity.

A Bug Prone Practice. A change in the requirements can require changing a complete code fragment or, sometimes, modifying an old fragment. Copying the structure of an old code fragment can introduce new errors or make future development error-prone. Proper use of abstraction and inheritance can improve software quality and protect the software system from errors and bugs.

Software Design Defects. Code clones sometimes result in improper abstraction and a poorly maintained inheritance hierarchy. Developers sometimes skip these important development principles, which later becomes a severe problem. In the middle of developing a large system, improper copying and pasting of code fragments can lead to deviation from the proposed software design, and the incorrect use of design principles makes it difficult for the developer to follow that design.

High Maintenance Cost. During development, it often happens that developers misunderstand the requirements and develop code accordingly. If clones of such a code fragment are scattered around the software system, then any modification needed in the maintenance phase becomes a very costly and time-consuming task. One has to manually find all the clones of the particular code fragment and apply the modification to each of them.

Resources Over-utilization. The incorrect practice of code duplication makes the software system larger, which leads to many performance issues. If one copies and pastes code fragments instead of using the advantages of inheritance, the size of the software system increases, which puts a strain on the compiler and interpreter during execution.

3 Related Works

Recent studies by various researchers targeting clone detection claim that large software systems and frameworks contain 9–17% copied code [1]. Some researchers specifically targeted function clones in scripting and object-oriented languages and proposed techniques for their detection.

Kodhai and Kanmani [3] proposed a metrics-based approach for detecting function-level clones. Their detection approach is a three-stage process. Preprocessing is the first stage, in which they applied a transformation to the source code to make it suitable for the detection algorithm. At the second stage, they calculated 12 metrics and detected the function clones. At the third stage, they post-processed the detected clone pairs. They produced high precision and recall values but did not mention the approach for textual comparison.

Roy and Cordy [4] proposed a function-clone detection approach combining AST-based and text-based algorithms. They selected 15 open source projects written in Java and C. The outcome of their work is a benchmark that other researchers can use to verify their tools, as the authors manually verified each project individually.

Yang et al. [8] proposed an AST-based function clone detection technique using the Smith-Waterman algorithm for textual analysis. They carried out their study on five open source Java projects and obtained high precision and recall values. Basit and Jarzabek [9] proposed a data-mining approach for detecting higher-level clones. They proposed a tool named Clone Miner, which detects simple clones and uses frequent closed itemsets from data mining to detect higher-level clones. However, they did not report the precision and recall of their tool.

Lanubile and Mallardo [10] targeted scripting languages used in HTML webpages. They proposed a two-stage semi-automated function clone detection process in which potential clones are selected at the first stage and visually verified at the second stage. Lague et al. [11] explained the benefits of incorporating the detection of function clones into software development. They introduced two changes in the design process and found that the growth rate of the total number of function clones was lower in the projects. Mayrand et al. [12] carried out their experiment on large software systems, targeting exact and near-miss function clones. Using the Datrix tool, they calculated 21 function metrics, which they compared at four different stages to determine the cloning level. They found that semantically similar clones have a higher rate of false positives than exact clones.

4 Background

Based on the syntactic or semantic similarities between two functions, one can classify function clones into two categories. Syntactic function clones are those which are based on textual/syntax based similarity among the functions. Semantic function clones are similarly based on the meaning/semantics of the functions [3].

Syntactic refers to the structure or arrangement of the code; it does not deal with the working of the function. When two different functions have a similar or dislocated code structure, they are called syntax-based function clones. They can be detected by textual comparison of the functions. But this is time-consuming, and certain transformations can be applied to reduce the time taken for clone detection. Based on syntactic similarity, function clones can be further divided into three types: Exact clones or Type-I clones, Renamed clones or Type-II clones, and Near-Miss clones or Type-III clones.

While writing code, a programmer inserts comments, proper indentation, and white-spaces. These improve the comprehension of the code when it is revised or modified. When two different function bodies differ only in comments and white-spaces, they are called exact or Type-I clones. A software system can contain packages, classes, functions, and variables within functions. Since every programmer uses different naming conventions, two functions that are similar and differ only in the names of variables, functions, etc. are called renamed or Type-II clones. To incorporate new changes and requirements, new code can be inserted or existing code deleted. When two different functions have similar code with some gaps of statements, they are called near-miss or Type-III function clones. Table 1 presents an illustrative example of the different types of syntactically similar function clones.

Table 1 An example illustrating different types of syntactically similar function clones. Functions in columns A and B, A and C, A and D form Type-I, Type-II, Type-III clones respectively

Semantic clones generally deal with the meaning of the function; the syntax may not be the same. There can be a set of functions which perform similar tasks but have different syntactic structures (Table 2). These functions form a function clone because their bodies are semantically similar. Such clones cannot be detected by simply comparing the text or body of the functions. Instead, some form of dependency graph needs to be drawn to study the pattern of the algorithm and detect the function clones.

Table 2 An example of semantically similar function clone

5 Proposed Methodology

Figure 1 shows the presented detection process of function clones using the metrics-based approach. It consists of three major activities: Code Pre-processing, Metric Calculation, Clone Detection.

Fig. 1 An overview of the proposed function clone detection methodology

5.1 Code Pre-processing

Since a C source project can contain various types of files, the first step selects all the C source files in which the algorithm will detect function clones. After collecting all the C source code files, the algorithm extracts all the functions from them.
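The file-selection step can be sketched as follows. This is a minimal illustration in Python, not the authors' Java implementation; the function name `collect_c_files` is hypothetical, and the sketch assumes that C source files are identified purely by the `.c` extension. Function extraction is not shown.

```python
from pathlib import Path


def collect_c_files(project_root):
    """Return every regular file ending in .c under project_root, sorted by path.

    Assumption: a C source file is recognized only by its .c extension.
    """
    root = Path(project_root)
    return sorted(p for p in root.rglob("*.c") if p.is_file())
```

Header files and other project files are skipped, since only `.c` files take part in the detection.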

During the development of software systems, developers use comments, whitespaces, and naming conventions for better comprehension and readability; these have nothing to do with the working of the software. Such extra text and whitespaces are removed from the source code, which is converted to a proper format so that the detection process can be applied to it. We therefore perform various transformations to make the code suitable for detecting function clones, including the removal of comments, modifiers, string literals, macros, include statements, and the parameters of if, while, and for blocks.
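A subset of these transformations can be sketched with regular expressions. The Python sketch below is only illustrative and rests on several assumptions: the name `preprocess` is hypothetical, string literals are emptied to `""` rather than deleted so that the statement stays syntactically intact, and the removal of modifiers and of the parameters of if, while, and for blocks is not shown.

```python
import re

# One pass over comments, string literals, and character literals, so that
# comment markers inside a string (e.g. "a // b") are not mistaken for comments.
_PATTERN = re.compile(
    r'(?P<comment>//[^\n]*|/\*.*?\*/)'
    r'|(?P<string>"(?:\\.|[^"\\\n])*")'
    r"|(?P<char>'(?:\\.|[^'\\\n])*')",
    re.DOTALL,
)


def _replace(match):
    if match.lastgroup == "comment":
        return " "            # drop the comment
    if match.lastgroup == "string":
        return '""'           # assumption: keep an empty literal in place
    return match.group(0)     # character literals are left untouched


def preprocess(source):
    """Strip comments, string contents, preprocessor lines, and extra whitespace."""
    text = source.replace("\\\n", " ")          # join continued lines (multi-line macros)
    text = _PATTERN.sub(_replace, text)
    # Remove #include, #define, and other preprocessor directives.
    text = re.sub(r"^[ \t]*#[^\n]*$", "", text, flags=re.MULTILINE)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

A regular-expression pass of this kind is not a full C lexer; a production tool would more likely rely on a proper tokenizer or parser.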

5.2 Metric Calculation

Once the source code is preprocessed, it no longer contains noise or unwanted code, so eight metrics are calculated for each function to detect the metrics-based function clones. Each metric value is the count of the corresponding keyword or construct in the function. The eight metrics extracted from each function are:

  1. Count of Conditions [CC]
  2. Count of Iterations [IC]
  3. Count of Inputs Taken [INC]
  4. Count of Outputs Produced [OC]
  5. Count of Selection Statements [SC]
  6. Count of Return Keyword [RC]
  7. Count of Assignment [AC]
  8. Effective Line of Code (excluding white-spaces, comments, macros) [eLOC].
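The eight counts listed above could be computed along the following lines. The paper does not list the exact tokens counted for each metric, so the keyword sets in this Python sketch are assumptions made for illustration (for example, `if` for CC, `switch` for SC, and a handful of standard C I/O functions for INC and OC); the name `compute_metrics` is likewise hypothetical. The input is assumed to be an already preprocessed function.

```python
import re

# Hypothetical keyword sets: the exact tokens behind each metric are assumed.
_KEYWORD_METRICS = {
    "CC": ("if",),
    "IC": ("for", "while"),
    "INC": ("scanf", "gets", "fgets", "getchar"),
    "OC": ("printf", "puts", "putchar", "fprintf"),
    "SC": ("switch",),
    "RC": ("return",),
}

# '=' that is not part of ==, !=, <=, or >= (compound forms such as += are counted).
_ASSIGN = re.compile(r"(?<![=!<>])=(?!=)")


def compute_metrics(body):
    """Return the eight metric counts for one preprocessed function body."""
    words = re.findall(r"[A-Za-z_]\w*", body)
    metrics = {
        name: sum(words.count(keyword) for keyword in keywords)
        for name, keywords in _KEYWORD_METRICS.items()
    }
    metrics["AC"] = len(_ASSIGN.findall(body))
    # eLOC: non-blank lines of the already preprocessed function.
    metrics["eLOC"] = sum(1 for line in body.splitlines() if line.strip())
    return metrics
```

Summing the returned values with `sum(compute_metrics(body).values())` gives a single number per function, which is the quantity compared in the detection step.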


5.3 Detection

After the metrics are calculated, the sum of the metric values is compared for all possible pairs of functions. Two different functions can have a similar sum of metric values without forming a function clone. Therefore, if the sums of the metrics of two functions are the same, or their ratio is greater than a threshold value specified by the user, the algorithm further performs a textual analysis to curb false positives and confirm the function clones. It uses Cosine-similarity [5] to calculate the similarity between the two functions: it computes a frequency vector for each function and derives the similarity value from these vectors. The greater the value of the Cosine-similarity index, the greater the similarity. If the value is greater than a pre-specified Cosine-Similarity Threshold (\(C_{ST}\)), the algorithm reports the corresponding functions as a function clone.
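The textual check can be illustrated with a short Python sketch of Cosine-similarity over token frequency vectors. The tokenization (identifiers and numbers as whole tokens, every other non-space character as its own token) is an assumption, since the paper does not state how the frequency vectors are built, and the name `cosine_similarity` is hypothetical.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the token frequency vectors of two function bodies."""
    # Assumed tokenization: words/numbers, plus single punctuation characters.
    vec_a = Counter(re.findall(r"\w+|[^\w\s]", text_a))
    vec_b = Counter(re.findall(r"\w+|[^\w\s]", text_b))
    dot = sum(vec_a[token] * vec_b[token] for token in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty body is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

The result lies between 0 and 1 for such non-negative vectors, so it can be compared directly against \(C_{ST}\).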

Instead of comparing each metric value of two functions, the algorithm considers the sum of the metrics, because comparing a single sum is easier than comparing each metric. Moreover, if each metric were compared individually, finding a threshold for each metric value could be problematic for source programs of varying size. Further, two different functions whose individual metric values are all similar can still be textually different. In that case the complexity would increase, as the individual comparison of metrics would still have to be followed by the textual analysis to confirm the Type-I and Type-II function clones.
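The sum-based filter can be sketched as follows in Python. The paper does not spell out how the ratio is formed, so this sketch assumes it is the smaller sum divided by the larger one; the function names are hypothetical.

```python
def metric_sum_ratio(metrics_a, metrics_b):
    """Ratio of the two metric sums, assumed here to be smaller / larger."""
    sum_a = sum(metrics_a.values())
    sum_b = sum(metrics_b.values())
    if sum_a == sum_b:
        return 1.0  # also covers the case where both sums are zero
    return min(sum_a, sum_b) / max(sum_a, sum_b)


def passes_metric_filter(metrics_a, metrics_b, threshold):
    """True if the sums are equal or their ratio exceeds the user threshold."""
    ratio = metric_sum_ratio(metrics_a, metrics_b)
    return ratio == 1.0 or ratio > threshold
```

For instance, the hypothetical functions with metrics `{"CC": 2, "AC": 1}` and `{"CC": 1, "AC": 2}` have the same sum and pass this filter although their individual metrics differ, which is the kind of candidate pair that the subsequent textual analysis is meant to confirm or reject.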

Algorithm 1 describes the proposed function clone detection methodology. Lines 1–6 extract all files with the extension .c from the subject program (SubProg). Then, a list of all functions is extracted from these files (lines 7–9). Each function is assigned a unique id, starting from zero, to distinguish one function from another. For each of these extracted functions, the eight metrics specified above are calculated. The proposed algorithm uses multithreading to reduce execution time. Based on the number of functions in the subject program (\(C_{fn}\)) and the number of functions per thread (\(N_{FP}\)), the thread count for multithreading is calculated (lines 13–15). A set of \(N_{FP}\) functions is assigned to each thread (lines 17–18) in increasing order of their unique ids. For each function pair, if the ratio of the sums of metrics is greater than a prespecified threshold (line 23) and the cosine similarity is greater than or equal to the prespecified cosine-similarity threshold (\(C_{ST}\)) (line 24), then the two functions are candidates for a function clone pair (line 25). At the end, based on the equivalence of clones, these clone pairs are used to find function clone classes (line 29).
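The thread partitioning and the final grouping into clone classes can be sketched in Python as below. Several details are assumptions rather than statements from the paper: the thread count is taken as \(\lceil C_{fn} / N_{FP} \rceil\), each thread compares its own functions against all functions with a higher id, and clone classes are formed as connected groups of clone pairs using a union-find structure. The names `detect_clone_classes` and `is_clone` are hypothetical; `is_clone` stands for the combined metric and cosine-similarity test.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def detect_clone_classes(functions, is_clone, funcs_per_thread):
    """Group functions (indexed by unique id) into clone classes.

    is_clone(a, b) is a caller-supplied predicate; funcs_per_thread plays
    the role of N_FP and must be positive.
    """
    n = len(functions)
    if n < 2:
        return []
    thread_count = math.ceil(n / funcs_per_thread)  # assumed formula
    # Contiguous id ranges, one per thread, in increasing order of id.
    chunks = [range(start, min(start + funcs_per_thread, n))
              for start in range(0, n, funcs_per_thread)]

    def compare_chunk(ids):
        # Assumption: a thread compares its functions with all higher ids.
        return [(i, j) for i in ids for j in range(i + 1, n)
                if is_clone(functions[i], functions[j])]

    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        pairs = [pair for result in pool.map(compare_chunk, chunks) for pair in result]

    parent = list(range(n))  # union-find over function ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    classes = {}
    for pair in pairs:
        for k in pair:
            classes.setdefault(find(k), set()).add(k)
    return sorted(sorted(members) for members in classes.values())
```

In the standard CPython build, threads do not execute Python bytecode in parallel, so this sketch illustrates only how the work is partitioned; the speed-up reported in the paper comes from the authors' Java implementation.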

6 Implementation and Results

We have implemented the proposed approach in Java using NetBeans IDE 8.0.2. Java Swing has been used for designing the interface. To control the performance, the user can provide input values, i.e., the metrics similarity threshold, the cosine-similarity threshold, the function coverage threshold, and the number of functions per thread.

We have selected three open source C projects for the evaluation of the proposed methodology. The projects are listed in Table 3 with their number of files, effective lines of code, and number of functions. The number of files detected by CloneManager [3] differs from that detected by our algorithm because the source website was accessed on different dates.

Table 3 Details of the subject programs used for evaluation

In Table 4, we compare the results of our algorithm and CloneManager. From the results, we conclude that our algorithm detects a larger number of function clone pairs. Further, incorporating Cosine-similarity to confirm the results of the metrics-based approach improves the performance of the tool.

Table 4 The result produced by our approach

Table 5 presents the precision of our algorithm. We have manually verified the calculated metric values for each function pair that makes up a detected clone pair. We searched for the functions in the project source code and manually analyzed them to verify the clones.
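For reference, precision is conventionally defined as the fraction of reported clone pairs that are true clones. The symbols \(TP\) (detected pairs confirmed as clones by manual verification) and \(FP\) (detected pairs rejected by manual verification) are introduced here for illustration only and do not appear in the original text:

\[
\text{Precision} = \frac{TP}{TP + FP}
\]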

Table 5 The precision of the proposed approach

In Table 6, we compare the time consumed, in minutes, by our algorithm and CloneManager. Our tool successfully exploits the computing capacity of fast computer processors through a multi-threading based function clone detection approach. As the number of files and functions increases, the detection time increases, because the number of comparisons among function pairs grows. Because the numbers of files and functions detected by our tool and by CloneManager differ, the time consumed for detection has slightly increased. For example, CloneManager detected 496 files in Apache-httpd-2.2.8 [14], while our algorithm detected 539 C code files.

Table 6 Comparison of execution time with CloneManager

7 Conclusions

In the recent past, many researchers have proposed different function clone detection techniques. In this paper, we have proposed a metrics-based algorithm that uses textual analysis to detect Type-I and Type-II function clones. We have used three open source C projects to obtain the detection results, and we have validated and compared the results with the CloneManager tool. We found high recall and precision for the approach. As we have only targeted Type-I and Type-II function clones, in the future we plan to target other types of function clones by combining the approach with other detection algorithms to improve its performance.