1 Introduction

Software is now ubiquitous in organizations, whatever their sector of activity, and its quality has become a critical issue. The main objective is to meet the stated and implied needs (expectations) of users when the software is used under specified conditions. The ISO 25010 standard defines a quality model that classifies and structures the quality characteristics/attributes of a software product. Maintainability, as a quality characteristic of a software product, relates to the ease with which the software can accommodate changing requirements [1].

Several studies have been published on software product maintainability prediction (SPMP) techniques. Some have empirically validated their proposed techniques, such as [2, 3]; others have conducted SLRs to obtain an overview of the existing literature, such as [4,5,6,7]; and others have empirically compared their proposed SPMP techniques with existing ones, such as [8,9,10]. However, to the best of our knowledge, no study has been published that reviews and analyzes these SPMP comparative studies with respect to their accuracy.

This paper analyzes a set of 29 empirical studies, published between 2000 and 2017, that proposed and/or evaluated and compared different SPMP techniques. The studies were identified from 8 digital libraries (Science Direct, Springer Link, Ebsco, ACM Digital Library, Google Scholar, Scopus, and Jstore). Since each SPMP technique was built in a particular experimental context, the review and analysis are organized around five research questions (RQs) covering: SPMP techniques, metrics, datasets, accuracy criteria, and validation procedures.

This paper is structured as follows. Section 2 presents the method used to collect the studies. Section 3 analyzes and discusses the review results and answers the five RQs. Section 4 provides an accuracy comparison of the SPMP techniques reported as superior in the literature. Conclusions and future work are presented in Sect. 5.

2 Method

This paper provides the results of an accuracy comparison of software product maintainability prediction techniques based on a literature review. A systematic review was conducted across 8 databases: Science Direct, Springer Link, Ebsco, ACM Digital Library, Google Scholar, Scopus, and Jstore. The search string was built using the following keywords: (maintainability OR analyzability OR modifiability OR testability OR portability) AND (compar* OR empirical* OR evaluation* OR validation* OR experiment* OR control experiment OR case study OR survey) AND (software* OR application OR system) AND (predict* OR evaluat* OR assess* OR estimat* OR measur*) AND (method* OR technique*). This combination of keywords did not match well in some databases, so the search string had to be adapted for each database.
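As an illustration only (not the authors' actual tooling), the following sketch shows how such a boolean search string can be assembled from its keyword groups, which makes it easy to re-adapt for each digital library; the grouping simply mirrors the keywords listed above.

```python
# Illustrative sketch: build the boolean search string from its keyword groups.
# This is not the authors' tooling; it only mirrors the keywords listed above.
groups = [
    ["maintainability", "analyzability", "modifiability", "testability", "portability"],
    ["compar*", "empirical*", "evaluation*", "validation*", "experiment*",
     "control experiment", "case study", "survey"],
    ["software*", "application", "system"],
    ["predict*", "evaluat*", "assess*", "estimat*", "measur*"],
    ["method*", "technique*"],
]

def build_query(keyword_groups):
    """Join each group with OR, then join the groups with AND."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in keyword_groups)

print(build_query(groups))
```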

To reduce the large number of papers returned, the search was limited to papers published between 2000 and 2017. To retain only relevant papers, inclusion and exclusion criteria were applied. Papers were included if they conducted empirical studies on SPMP and/or provided a comparison of SPMP techniques. Papers falling outside the topic of maintainability, not including an empirical comparison, or written in a language other than English were excluded. This task was performed by two authors. The decision to include or exclude a paper was made by first reading the abstract and then reading the full paper. As a result of this search process, a set of 29 relevant papers was selected; they are grouped by year of publication in Table 1.

Table 1. Selected studies per year of publication

This set of 29 selected papers was used to answer the following five research questions (RQs):

1. What kind of techniques were used for SPMP?

2. What datasets were used for SPMP?

3. What types of metrics were used for SPMP?

4. What accuracy criteria and validation procedures were used for SPMP?

5. What SPMP techniques were reported superior in the literature?

3 Analysis and Discussion

Building and evaluating software maintainability prediction techniques relies mainly on datasets and evaluation methods [37, 38]. This section presents and discusses the results obtained from the review of the 29 primary studies and provides answers to the five RQs.

3.1 Techniques Used for SPMP (RQ1)

Based on the 29 selected studies, the SPMP techniques used for predicting software product maintainability fall into two categories: machine learning and statistical techniques. The machine learning techniques were classified according to [25, 39] as follows: Decision trees (DT), Neural networks (NN), Bayesian learners (BL), Rule based learning (RBL), Ensemble learners (EL), Evolutionary algorithms (EA), Clustering (CL), Fuzzy & Neuro Fuzzy based (NF), Instance based (IB), Inductive Rule Based (IR), Support Vector Machine (SVM), and Miscellaneous techniques.

Acronyms:

Ward Neural Network (WNN), General Regression Neural Network (GRNN), Bayesian Network (BN), Regression Tree (RT), Backward Elimination (BE), Stepwise Selection (SS), Multiple Adaptive Regression Splines (MARS), Multiple Linear Regression (MLR), Support Vector Regression (SVR), Gaussian Mixture Model (GMM), Aggregating One-Dependence Estimators (AODE), k Nearest Neighbor (KNN), Naïve Bayes (NB), Random Forest (RF), Radial Basis Function Network (RBF), Projection Pursuit Regression (PPR), Feed Forward Neural Network (FFNN), Fuzzy Inference Systems (FIS), Adaptive Neuro-Fuzzy Inference Systems (ANFIS), Extreme Learning Machine (ELM), Multilayer Perceptron (MLP), Group Method of Data Handling (GMDH), Genetic Algorithms (GA), Probabilistic Neural Network (PNN), Linear Regression (LR), Multiple Classifiers Combination (MCC), Back Propagation Neural Network (BPNN), Mamdani Fuzzy Logic (MFL), Sensitivity Based Linear Learning Method (SBLLM), Least Median of Squares Regression (LMSR), Reduced Error Pruned Tree (REPTree), Locally Weighted Learning (LWL), Conjunctive Rule Learner (CR), Decision Table (DTable), M5 Rules (M5R), Pace Regression (PR), Isotonic Regression (IR), Regression By Discretization (RegByDisc), Additive Regression (AR), Ensemble Selection (ES), Gaussian Process Regression (GPR), Fuzzy Subtractive Clustering (FSC), Decision Stump (DS), K-means clustering (KMC), X-means clustering (XMC), Feed Forward 3-Layer Back Propagation Network (FF3LBPN), Logistic Regression (LogR), Kohonen Network (KN), Gene Expression Programming (GEP), Hybrid approach of neural network and genetic algorithm (Neuro-GA), Type-2 fuzzy logic systems (T2FLS), Hybrid approach of functional link artificial neural network with GA (FGA, AFGA), Particle Swarm Optimization with GA (FPSO, MFPSO), Clonal Selection Algorithm with GA (FCSA), SVR with Radial Kernel Function (SVR-RK).

Figure 1 depicts the classification of the SPMP techniques used in the 29 studies. The analysis shows that machine learning techniques were the most used, appearing in 83% of the studies (24 papers), compared to statistical techniques in 41% (12 papers). Note that some studies evaluated several techniques, while others focused on only one. Within the machine learning categories, NN techniques were used in 16 studies, SVM in 7 studies, DT in 5 studies, EA in 5 studies, NF in 4 studies, BL in 3 studies, IB and CL in 2 studies each, and RBL, EL, and IR in 1 study each. The statistical category includes techniques such as PR, MLR, MARS, RegByDisc, GMM, LogR, SS, BE, LR, PPR, SVR, AR, GPR, SVR-RK, RT, LMSR, and IR. The MLR technique was the most used (4 studies), followed by MARS (3 studies), and then SS, BE, PPR, LogR, SVR, and RT (2 studies each).

Fig. 1. Classification of SPMP techniques.
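To make the technique families above more concrete, the following minimal sketch (Python with scikit-learn) compares representatives of four of them, NN (MLP), SVM (SVR), DT, and MLR, on a Li and Henry style dataset; the file name and column names are hypothetical placeholders and the sketch does not reproduce any particular study's experimental setup.

```python
# Minimal illustrative sketch, assuming a Li & Henry-style CSV (e.g. UIMS/QUES)
# with OO metric columns and a CHANGE column (maintenance effort proxy).
# File name and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("uims.csv")                      # hypothetical file
X = data[["WMC", "DIT", "NOC", "RFC", "LCOM", "MPC", "DAC", "NOM", "SIZE1", "SIZE2"]]
y = data["CHANGE"]                                  # maintainability target

models = {
    "MLP (NN)":  make_pipeline(StandardScaler(), MLPRegressor(max_iter=5000, random_state=0)),
    "SVR (SVM)": make_pipeline(StandardScaler(), SVR()),
    "DT":        DecisionTreeRegressor(random_state=0),
    "MLR":       LinearRegression(),
}

for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=10)    # 10-fold cross-validation
    mre = abs(pred - y) / y                         # magnitude of relative error
    print(f"{name}: MMRE={mre.mean():.2f}  Pred(0.30)={(mre <= 0.30).mean():.2f}")
```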

3.2 Datasets Used for SPMP (RQ2)

Several datasets were used in the 29 selected SPMP studies. Figure 2 reports the number of studies per dataset. According to the figure, UIMS and QUES were the most commonly used, in 19 and 18 studies respectively; these datasets were provided by Li and Henry [40]. FLMS (File Letter Monitoring System), EASY (Classes Online Services collection), Lucene (Apache Lucene), JEdit (Java text Editor), and MIS (Medical Imaging System) are open source (OSS) datasets and were used in two studies each. The remaining datasets were used only once.

Fig. 2. Number of studies per dataset

3.3 Types of Metrics Used for SPMP (RQ3)

Many metrics have been proposed in the literature to evaluate software product design as well as source code. The types of metrics most used in the 29 selected SPMP studies are presented in Table 2. As shown in the table, the C&K and Size metrics are the most used (90% each), compared to the L&H, McCabe, MI, and Halstead metrics. It should be noted that the selected studies used one or several types of metrics depending on the study purpose. For instance, studies [22, 28, 29] used C&K, L&H, and Size metrics, while studies [11, 12] used Halstead, McCabe, and Size metrics.

Table 2. Distribution of metric types per study

Moreover, within the set of OO metrics (OOM), 9 out of 10 were frequently used in the selected studies. Figure 3 presents the distribution of studies per OOM. As shown, the Lack of Cohesion in Methods (LCOM), Depth of Inheritance Tree (DIT), Number of Children (NOC), Weighted Methods per Class (WMC), Response For a Class (RFC), Message Passing Coupling (MPC), Data Abstraction Coupling (DAC), Number Of local Methods (NOM), and Size metrics were each used in more than 20 studies, while Coupling Between Objects (CBO) was used in only six studies.

Fig. 3. Number of OO metrics per study
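As a concrete illustration of how the Halstead, McCabe, and size metric types listed in Table 2 can be combined into a single maintainability indicator, the sketch below computes one widely cited Maintainability Index (MI) variant; the input values are made up and serve only as an example.

```python
import math

def maintainability_index(halstead_volume, cyclomatic_complexity, loc):
    """Classic three-metric MI variant; higher values suggest better maintainability."""
    return (171
            - 5.2 * math.log(halstead_volume)   # Halstead volume
            - 0.23 * cyclomatic_complexity      # McCabe cyclomatic complexity
            - 16.2 * math.log(loc))             # size (lines of code)

# Made-up example values for a single module.
print(round(maintainability_index(halstead_volume=1200.0,
                                  cyclomatic_complexity=14,
                                  loc=350), 1))
```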

3.4 Accuracy Criteria and Validation Procedures Used for SPMP (RQ4)

In the 29 selected studies, various accuracy criteria were used to compare the accuracy of SPMP techniques (see Fig. 4). From this figure, it appears that MMRE is the most frequently used accuracy criterion, adopted by 52% of the studies (15 papers), followed by Pred (Pred(0.25), Pred(0.30), and Pred(0.75)) with 45% (13 papers), and Max MRE with 28% (8 papers).

Fig. 4. Number of studies per accuracy criterion
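For reference, the two most used criteria can be stated directly: MMRE is the mean of the magnitude of relative error, MRE = |actual - predicted| / actual, and Pred(q) is the proportion of predictions whose MRE is at most q. A minimal sketch, with made-up values:

```python
import numpy as np

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: lower is better."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual)

def pred(actual, predicted, q=0.25):
    """Pred(q): fraction of predictions whose MRE is at most q; higher is better."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    mre = np.abs(actual - predicted) / actual
    return np.mean(mre <= q)

# Made-up example values.
actual    = [10, 40, 25, 60]
predicted = [12, 35, 30, 58]
print(mmre(actual, predicted), pred(actual, predicted, 0.25), pred(actual, predicted, 0.30))
```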

In addition, to evaluate the accuracy of their proposed SPMP techniques, the selected studies used different validation methods. Figure 5 reports the cross-validation (CV) methods per study. According to the figure, k-FCV (k-fold cross-validation), and in particular the 10-FCV method, was the most used in the selected studies, followed by LOOCV (leave-one-out cross-validation), CV, and lastly 5-FCV.

Fig. 5. Cross-validation methods per study.
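A minimal sketch of the two most common validation procedures above (10-FCV and LOOCV), using a generic regressor and randomly generated placeholder data rather than any of the reviewed datasets:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((30, 5))          # placeholder metric values
y = rng.random(30) * 100 + 1     # placeholder maintainability target (kept > 0)

def mmre_cv(splitter):
    """Fit on each training split, predict the held-out split, and pool the MREs."""
    errors = []
    for train_idx, test_idx in splitter.split(X):
        model = DecisionTreeRegressor(random_state=0).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        errors.extend(np.abs(y[test_idx] - pred) / y[test_idx])
    return np.mean(errors)

print("10-FCV MMRE:", round(mmre_cv(KFold(n_splits=10, shuffle=True, random_state=0)), 2))
print("LOOCV MMRE:", round(mmre_cv(LeaveOneOut()), 2))
```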

3.5 SPMP Techniques Reported Superior in the Literature (RQ5)

The 29 selected studies investigated in this paper compared the accuracy of various SPMP techniques (new models and/or previously published ones). From the reported results, we can observe that no technique is definitively better than the others. For instance:

  • MLP technique was reported superior in 4 studies,

  • SVM, DT, ELM, GMDH and MARS techniques were reported superior in 2 studies each, and

  • the remaining techniques were reported superior only once.

Moreover, the MARS technique was reported superior in two studies [10, 36] but not in 9 others [13, 14, 18, 23, 25, 32,33,34,35], and the SVM technique was reported accurate in 2 studies [21, 31] but not in [11, 12, 23, 35, 36]. Therefore, the choice of the best technique to predict maintainability is not obvious, since every technique has its advantages and drawbacks.

4 SPMP Techniques Accuracy Comparison

The purpose of this section is to compare the techniques reported to be superior that used the same datasets (UIMS and QUES), metrics (L&H and C&K), accuracy criteria (MMRE, Pred(0.25), and Pred(0.30)), and OO software applications. From the 29 selected studies, 7 studies (8 experiments) were selected for the UIMS dataset and 7 studies (9 experiments) for QUES. Using MMRE and Pred as the accuracy criteria for comparison, note that a low MMRE value or a high Pred(0.25) value indicates good accuracy [41, 42].

Table 3 shows that T2FLS achieved significantly better prediction accuracy than the other techniques, with an MMRE value of 0.00007, a Pred(0.25) value of 0.86, and a Pred(0.30) value of 0.92 on the UIMS dataset. Therefore, the T2FLS technique can predict the maintainability of the UIMS dataset better than BN, MARS, TreeNet, ELM, MFL, FSC, and K*.

Table 3. Prediction accuracy for UIMS

For the QUES dataset, Table 4 shows that MFL, K*, and KNN achieved the same MMRE value of 0.27. Moreover, they are roughly equal in terms of Pred: Pred(0.25) = 0.52 and Pred(0.30) = 0.62 for MFL, Pred(0.25) = 0.62 and Pred(0.30) = 0.65 for KNN, and Pred(0.25) = 0.56 and Pred(0.30) = 0.66 for K*. Thus, overall, the MFL, K*, and KNN techniques perform better than BN, MARS, TreeNet, ELM, SBLLM, and PR.

Table 4. Prediction accuracy for QUES

5 Conclusion and Future Work

In this paper, we investigated the empirical studies, published between 2000 and 2017, that compared the accuracy of SPMP models. The results of the discussion show that the NN, SVM, DT, EA, and MLR techniques and the DIT, LCOM, RFC, and WMC measures were the most commonly used to predict maintainability. The UIMS and QUES datasets, the MMRE and Pred(0.25)/Pred(0.30) accuracy criteria, and the 10-fold cross-validation method were the most frequently used in the selected studies. The results also show that no single technique is accurate in all contexts. As future work, we plan to conduct more empirical studies to better predict software maintainability.