Introduction

Along with the fourth industrial revolution, artificial intelligence, big data, Internet of Things, and cloud computing are emerging as cutting-edge technologies globally. In particular, artificial intelligence has unlimited potential to further improve the quality of human life and can solve several difficult engineering problems [1,2,3,4,5,6,7,8,9,10,11,12]. Moreover, this technology provides basic ideas to derive successful solutions to numerous problems encountered in the software development field.

As software grows in size and complexity, defects become inevitable. A software defect is an error, flaw, mistake, or fault in a computer program or system that produces incorrect or unexpected results [13]. These defects inconvenience users by causing malfunctions, i.e., software defects continuously decrease the quality of the software until they are fixed. Thus, defects are a significant issue that must be resolved to improve software quality. Various methods for effectively detecting, fixing, and patching bugs have been investigated by software developers [14, 15]. For software projects, the ratio of software maintenance cost to total project cost exceeds 50% [16,17,18,19]. Corrective maintenance addressing software defects accounts for 20% of all maintenance activities [20,21,22], and improving the efficiency of fixing defects therefore directly reduces software development and maintenance costs. These issues are significant for software development companies.

During software development, bug reports are written to effectively manage and fix software bugs detected throughout the software life cycle. Bug reports are documents, exchanged between reporters and developers, that detail the occurrence of defects in a specific format. In general, information regarding the reporter, the environment, and other data, including the priority and severity used in triage, is recorded in bug reports. Developers rely substantially on bug reports to fix bugs and to improve communication with users or the quality assurance (QA) team.

When defects occur during software development and maintenance, a software development manager typically follows the defect life cycle shown in Fig. 1, which summarizes its stages. The straight lines indicate steps performed manually by developers, and the dotted lines indicate steps that can be automated by a system. First, once defects are detected, bug reports are written with the initial state “NEW,” and the manager analyzes whether each bug is valid and whether it duplicates an existing report. If the manager finds no such problem, the report is sent to a developer and the state is changed to “OPEN.” Second, depending on the outcome of the developer’s activity, the bug is marked “CLOSED” or “REOPENED.” As shown in Fig. 1, several stages must be completed before developers start to modify the code. An important stage for the bug report manager is classifying the bug reports. The classification is divided into two classes: (1) textual classification and (2) triage. Textual classification is based on text, such as the title or body; triage is based not on text but on the priority or severity of the defects. Through textual classification, bug reports are grouped with similar reports and assigned to the developers responsible for the modules in which the defects occurred. Textual classification is also used to identify duplicate bug reports, which account for 30% of all bug reports [24]. Because developers cannot address all bug reports, triage is essential for attaining the best maintenance efficiency within a limited period of time.

Fig. 1 Defect life cycle [23]
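For illustration, the life cycle in Fig. 1 can be modeled as a small state machine. The following is a minimal sketch assuming the simplified state set described above; the transition table is illustrative, and real trackers such as Bugzilla define additional states.

# Minimal sketch of the defect life cycle in Fig. 1 (state and transition
# names are illustrative; real trackers define more states).
VALID_TRANSITIONS = {
    "NEW": {"OPEN", "CLOSED"},        # manager validates or rejects duplicates
    "OPEN": {"CLOSED", "REOPENED"},   # developer fixes or reworks the bug
    "REOPENED": {"OPEN", "CLOSED"},   # re-validated reports return to a developer
    "CLOSED": {"REOPENED"},           # a regression reopens the report
}

def transition(state, new_state):
    """Move a bug report to new_state, enforcing the life-cycle rules."""
    if new_state not in VALID_TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state

state = "NEW"
state = transition(state, "OPEN")    # manager assigns the report
state = transition(state, "CLOSED")  # developer fixes the bug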

If these defect classification processes are applied accurately and smoothly, no problems occur; if not, the efficiency of fixing software defects and of maintenance decreases. An incorrect textual classification causes a bug report to be reassigned to other developers, and the bug cannot be fixed until the reassignment is complete; thus, incorrect textual classification decreases maintenance efficiency. A mis-triage creates an even more critical problem: because unimportant problems are processed first, urgent defects can be delayed. After an incorrect textual classification, a developer can request that the bug report be reassigned and then move on to another job. Conversely, the cost of fixing unimportant defects first, caused by a mis-triage, is irreversible. Thus, accurate bug report classification and assignment are directly connected to software maintenance efficiency, and because this efficiency is connected to the costs incurred by a company, it is extremely important. Owing to the importance of accurate triage, bug report triage is currently performed manually. For example, in the case of Eclipse, developers may spend up to 2 h classifying bug reports every day.

To resolve these problems, artificial intelligence techniques are now being actively studied and have shown better classification accuracy than traditional (non-artificial-intelligence-based) methods [25,26,27,28,29,30,31,32,33,34,35,36]. Such techniques are a promising key to solving many of the current problems in this area. Thus, various attempts have been made to overcome the weaknesses of traditional methods by combining them with artificial intelligence in hybrid approaches.

To reduce the effort required in this regard, studies have proposed the application of state-of-the-art automation methods for bug report classification [25,26,27,28,29]. In particular, latent Dirichlet allocation (LDA)-based classification methods are common because they are well suited to bug reports, which contain text-based data. Although these methods are excellent in terms of textual classification, their triage accuracy is unsatisfactory. To improve on LDA-based methods, software engineers have proposed new methods that combine LDA with other approaches, such as the k-nearest neighbor (KNN) algorithm and the support vector machine (SVM) [30,31,32,33,34,35]. However, combining LDA with other methods to improve the accuracy of bug report classification is risky because the combined method does not work well when compatibility issues (e.g., a correlation or a difference in the input data between the methods) arise between LDA and the other approaches, and the risk grows with each additional method in the combination. Thus, instead of combining LDA with other methods, the performance of LDA itself should be improved.

To improve bug report triage performance, in this study, we focus on improving LDA itself and propose a new method based on multiple LDA and backpropagation techniques. The proposed method aims to improve the quality of the topic set produced through LDA classification. It builds additional topic sets that complement the original topic set obtained from a typical use of LDA, and classifies and analyzes them to support the original topic set and thereby improve the accuracy of bug report classification. To evaluate the proposed method, we use bug reports from Bugzilla [37] along with Android bug reports from Mining Software Repositories (MSR) [38, 39]. Any method that fails to classify a significant number of bug reports is useless; we therefore verified that the proposed method is able to classify a significant number of bug reports across repository platforms. We also verified and quantified the efficiency of the method for bug triage. To determine the difference between the original LDA classification and the proposed method, we statistically verified the method using a paired t-test.

The main contributions of this study are as follows:

  • A new method is proposed to improve the accuracy of bug report triage using multiple LDA and backpropagation techniques.

  • The proposed method is able to maintain compatibility with the existing hybrid LDA methods through a design of the necessary conditions.

  • Factors hindering the accuracy of the triage are identified through a detailed analysis.

  • Our experiments were conducted based on bug reports for actual software used in practice.

  • The superiority of the proposed method was validated through a statistical evaluation.

The remainder of this paper is organized as follows: Related studies are introduced in Sect. “Related work”. Section “Background” provides the background information. Section “Approach” presents our method for improving bug report triage performance while avoiding conflicts with existing LDA-based triage methods. Section “Evaluation” evaluates the proposed method, and Sect. “Discussion” provides a detailed analysis of it. The paper is concluded, and future research is discussed, in Sect. “Conclusions”.

Related work

Bug report deduplication

Bug report deduplication is the process of removing duplicate bug reports. Duplicate bug reports inflate the apparent number of bug reports and increase the costs required to process them. Thus, studies on bug report deduplication greatly help reduce the workload.

Alipour et al. [40, 41] used textual information (e.g., the title, abstract, or body text) to reduce bug report duplication. They proposed a BM25F-based method that automatically extracts the implications of a bug report and builds a dictionary (a set of words). The researchers referred to Android layered-architecture words [42], software non-functional requirement words [43], Android topic words obtained using LDA [44], Android topic words obtained using labeled LDA [44], and random words from the English dictionary. As the dictionary sources suggest, the method is applied to Android bug reports, and an 11.55% performance improvement is achieved compared with REP [45]. A similar study [46] uses word embedding.

Aggarwal et al. [47, 48] improved on the method of Alipour et al. [40] and proposed an approach based on the software engineering literature that reduces the manual effort of deduplication with a minimal loss of triage accuracy. Their study shows that their method outperforms that of Alipour et al. on Eclipse, Mozilla, and OpenOffice.

Campbell et al. [49] focused on off-the-shelf information retrieval techniques. Although these techniques were not designed for bug reports, they outperformed other approaches in terms of crash bucketing (i.e., bug report grouping) at an industrial scale. The authors used more than 30,000 reports from the Ubuntu repository and Mozilla’s own automated system. Finally, they demonstrated that bug report deduplication still has significant room for improvement, particularly in identifier tokenization through term frequency–inverse document frequency (TF–IDF).

Hindle et al. [50] proposed a method for preventing duplicate bug reports before they are submitted. The method finds duplicate or related bug reports in the bug database using their text, and this simple approach can also serve as a baseline for evaluating new deduplication methods. It was evaluated using bug reports from the Android, Eclipse, Mozilla, and OpenOffice projects.

Nguyen et al. [51] proposed DBTM, which has the advantage of combining two types of features: those based on topic modeling and those based on information retrieval (IR). The method shows a 20% performance improvement over the Relational Topic Model (RTM) [52] and REP [45] on Eclipse, Mozilla, and OpenOffice.

Tian et al. [53] improved on the study of Jarbert [54] and introduced three approaches. The first does not use term-appearance measures (e.g., TF-IDF) but applies BM25, the best method according to their literature search. The second uses “product” as metadata, i.e., it exploits the notion that bug reports with different products are not duplicates. The third compares the top-k similar bug reports instead of only the most similar one. Their method increases true positives while maintaining low false negatives compared with Jarbert’s study of Mozilla projects.

Other machine-learning methods [55, 56], such as hidden Markov models (HMMs) and deep networks, have also been proposed; they build and utilize a model that identifies the features of duplicate bug reports. A multi-factor analysis method [55] that employs LDA, LNG, and n-grams has also been proposed.

Bug report triage

Bug report triage is a type of classification process. Because the developer’s workload is limited, critical bug reports should be processed earlier. Thus, a bug report triage is a classification process based on “priority.”

Tamrawi et al. [57] proposed Bugzie, which recommends suitable developers for bug reports. Bugzie builds fuzzy sets based on words extracted from the title and description. Bugzie outperforms naïve Bayes, C4.5 (a decision tree), and SVM with regard to temporal efficiency on Eclipse.

Wang et al. [58] proposed FixerCache, an unsupervised bug triage method. FixerCache overcomes the limits of supervised classification by relying on the activities of developers. It uses term frequencies (TF) extracted from the title and description of bug reports and outperforms naïve Bayes and SVM in terms of classification accuracy.

Wen et al. [59] proposed Configuration Bug Learner Uncovers Approved options (CoLUA). CoLUA is a two-phase method that utilizes machine learning, IR and natural language processing (NLP) to resolve communication problems between developers and reporters. In the first phase, CoLUA determines what the bug report intends to convey based on its text information. In the second phase, CoLUA identifies the options that affect the communication in the labeled bug reports. The researchers evaluated CoLUA; their findings indicate that CoLUA has a better F-measure than the ZeroR classifier.

Zhang et al. [60] proposed KSAP, a method based on k-NN search and heterogeneous proximity. KSAP employs the heterogeneous network of the bug report repository and historical bug reports to improve the automatic allocation of bug reports. It is a two-phase method: first, KSAP retrieves historically similar bug reports; second, it ranks the contributions of developers by heterogeneous proximity. The authors evaluated KSAP using Eclipse, Mozilla, Apache Ant, and Apache Tomcat 6. KSAP shows a performance improvement of 7.5–32.25% compared with ML-KNN [61, 62], DREX [63], DRETOM [64], Bugzie [57], and DevRec [62].

Many bug report triage methods [65,66,67,68,69,70] use data reduction. To achieve it, these methods apply KNN, naïve Bayes, and clustering to perform feature selection and instance selection based on representative and statistical values of these methods, or they newly define “module selection.”

Machine-learning-based methods applied to bug triage have also been studied frequently. Florea et al. [71] proposed an SVM-based bug report assignment recommender implemented on a cloud platform that achieves better results than other SVM-based bug report assignment recommender systems; they evaluated their method using actual datasets from the NetBeans, Eclipse, and Mozilla projects. Deep-learning-based methods, a popular class of machine-learning approaches, have also been proposed recently: one uses two deep-learning classifiers, namely convolutional and recurrent neural networks, in a parallel and extendable recommender system [72], and another uses a convolutional neural network with word embeddings for automated bug triage [73]. These studies use actual open-source datasets and demonstrate a higher accuracy than existing machine-learning-based methods.

Background

Bug report

In a modern environment, bug reports are managed as part of community-based issue (bug) tracking systems; i.e., bug reports are viewed not only by the developers or report managers in charge but also by all related people, and are even used as public data. Thus, the bug report process requires accurate classification to satisfy the needs of numerous people. A distinct difference from common documents is the metadata of bug reports. In particular, priority and severity are important metadata because they are used for bug report triage. Through the triage process, developers can be informed of important and critical bug reports before processing them. Figures 2 and 3 show examples of bug reports. Bug reports in Bugzilla are known to carry a substantial amount of metadata: Bugzilla even supports “importance (priority and severity)” and “triage owner”, which are related to the triage process, along with common data such as “reporter”, “product”, and “status.” Figure 2 shows a bug report in Bugzilla. The report describes unlimited page loading and information leaks. The priority “P2” in the bug report enables developers to begin fixing the bug as soon as possible, even before reading it closely (Bugzilla uses stages P1–P5 as priorities, where P1 is the highest). Bug reports in GitHub contain a substantial amount of information about the environment in which the bug appears; issues that are not bug reports (e.g., “enhancement”, “discussion”, and “question”) are usually not uploaded as such. Figure 3 shows a bug report in GitHub. It describes a segmentation fault related to disconnected monitors; the bug occurred in iOS version 12, and the report documents the environment in which the bug occurs by showing the code. This study uses bug reports from Bugzilla and MSR, which support bug reports for Firefox and Eclipse.

Fig. 2 Bug report in Bugzilla [74]

Fig. 3 Bug report in GitHub [75]
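For illustration, the triage-relevant metadata described above can be represented as a simple record. The following is a minimal sketch; the field set and values are hypothetical, modeled loosely on the Bugzilla report in Fig. 2.

from dataclasses import dataclass

# Minimal bug report record carrying the metadata used for triage
# (hypothetical fields; Bugzilla's "importance" combines priority and severity).
@dataclass
class BugReport:
    title: str
    description: str
    reporter: str
    product: str
    status: str
    priority: str   # "P1" (highest) to "P5" (lowest) in Bugzilla
    severity: str   # e.g., "blocker", "critical", "normal", "minor"

report = BugReport(
    title="Page loads indefinitely and leaks information",
    description="The page keeps reloading and session data is exposed.",
    reporter="reporter@example.org",
    product="Firefox",
    status="NEW",
    priority="P2",
    severity="critical",
)
print(report.priority)  # consulted first during the triage process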

LDA

LDA is a probabilistic model in which each document of a given document collection (corpus) is assumed to contain a mixture of topics. Using LDA, users can estimate the distribution of words per topic and the distribution of topics per document. In LDA, documents consist of topics, and topics generate words according to probability distributions. When data are input, LDA traces this generative process backward to reconstruct how each document was created. Let T denote a topic variable, D a document variable, and W a word variable. The traceback process is described as follows.

Figure 4 shows an example of the generative assumption used to create a document in LDA. If a machine knows the distribution of topics in a document, it can generate the document under this assumption. In the chart in the figure, the distribution over topics 1–4 is 0.15, 0.2, 0.35, and 0.3, respectively. The machine stochastically selects a topic; in the figure, topic 1 is selected with a 15% probability. The machine then selects a word belonging to topic 1 (every topic consists of words that are well matched with it); in the figure, “basic” is selected. Next, topic 2 is selected with a 20% probability, and “function” is selected. By repeating this routine, the machine completes the document.

Fig. 4 Example of the generative assumption for creating a document in LDA
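The generation routine of Fig. 4 can be sketched in a few lines of Python. This is a minimal illustration assuming the topic distribution (0.15, 0.2, 0.35, 0.3) from the figure; the per-topic word distributions are hypothetical.

import random

# Topic distribution for one document, as in Fig. 4.
topic_dist = {"topic1": 0.15, "topic2": 0.20, "topic3": 0.35, "topic4": 0.30}

# Per-topic word distributions (hypothetical words for illustration).
word_dist = {
    "topic1": {"basic": 0.6, "core": 0.4},
    "topic2": {"function": 0.7, "call": 0.3},
    "topic3": {"crash": 0.5, "error": 0.5},
    "topic4": {"image": 0.5, "render": 0.5},
}

def generate_document(length):
    """Generate a document by repeatedly sampling a topic, then a word."""
    words = []
    for _ in range(length):
        topic = random.choices(list(topic_dist), weights=topic_dist.values())[0]
        vocab = word_dist[topic]
        words.append(random.choices(list(vocab), weights=vocab.values())[0])
    return words

print(generate_document(10))  # e.g., ['crash', 'basic', 'function', ...]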

Figure 5 shows an example of the traceback in LDA, in which LDA builds the distribution of topics by tracing back the generative assumption of Fig. 4. First, the machine randomly assigns all words to topics. As shown in Fig. 5, assume that only two topics (A and B) and two documents (Doc 1 and Doc 2) exist. For W (the third word, “Apple,” in Doc 1), the machine determines the distribution of topics within the same document, P(T | Doc 1). Because A and B each appear 50% of the time in Doc 1, the topic of W cannot be determined from this distribution alone.

Fig. 5 Example of traceback in LDA

The machine then determines the distribution of topics for the same word, P(T | “Apple”). In the figure, it obtains the distribution of “Apple” across Doc 1 and Doc 2; because the proportion assigned to B is larger, it determines that the topic of W is B. This study aims to improve triage accuracy by employing multiple LDA while remaining compatible with state-of-the-art studies.
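This traceback step can be sketched as scoring each topic by the product of the two distributions, P(T | Doc 1) and P(T | “Apple”). The following is a simplified single-word illustration (not the full iterative algorithm), with hypothetical probabilities following the example in Fig. 5.

# Re-assigning one word occurrence W = "Apple" in Doc 1 (values hypothetical).
topics = ["A", "B"]

# P(T | Doc 1): both topics appear equally often in Doc 1.
p_topic_given_doc = {"A": 0.5, "B": 0.5}

# P(T | "Apple"): across Doc 1 and Doc 2, "Apple" is assigned to B more often.
p_topic_given_word = {"A": 0.25, "B": 0.75}

# Score each topic by the product of the two distributions and pick the best.
scores = {t: p_topic_given_doc[t] * p_topic_given_word[t] for t in topics}
best = max(scores, key=scores.get)
print(best)  # "B": the topic of W is re-assigned to B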

Approach

This section describes how the proposed method operates. Figure 6 shows the existing bug report triage process using LDA, and Fig. 7 presents the overall approach of the proposed method.

Fig. 6 Existing LDA classification method

Fig. 7 Overall approach

Applying LDA to bug report classification

The existing bug report classification applies LDA to a bug report base (dataset), and the machine classifies the bug reports based on the resulting topic set (referred to here as the union topic set (UTS) to distinguish it from the other topic sets introduced later). This process is part of the proposed method. The existing method is suitable for textual classification but achieves poor triage performance; one reason is the common elements that occur in different topic sets. Figure 8 shows an example of a bug report mis-triage. A “crashed image” is a bug in which an image is not displayed on the page where it should be. A bug report for a crashed image caused by an incorrect extension or an image loader error should be triaged as priority “P1.” Two priorities, P1 and P3, exist in the correct triage model (the topics are listed in order of their influence). In the figure, “crash” appears first in both P1 and P3, and the same situation holds in the bug report. Thus, the machine must triage using topics with low influence, and even minor errors will cause a mis-triage.

Fig. 8 Example of a mis-triage caused by common elements
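As a minimal sketch of the UTS-building step described above, the following uses gensim (the topic-modeling library employed in our implementation) on hypothetical, already-preprocessed bug report texts; the corpus and parameters are illustrative.

from gensim import corpora, models

# Hypothetical preprocessed bug report texts (tokenized, stop words removed).
reports = [
    ["crash", "image", "loader", "extension"],
    ["button", "click", "freeze", "ui"],
    ["crash", "render", "page", "image"],
]

# Build the dictionary and bag-of-words corpus, then train LDA.
dictionary = corpora.Dictionary(reports)
corpus = [dictionary.doc2bow(r) for r in reports]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# The union topic set (UTS): the top-ranked words of every topic.
uts = {word for t in range(lda.num_topics)
       for word, _ in lda.show_topic(t, topn=5)}
print(uts)

# Classify a new bug report by its dominant topic.
new_report = dictionary.doc2bow(["crash", "image"])
print(sorted(lda[new_report], key=lambda x: -x[1])[0])  # (topic_id, probability)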

Identifying mis-triaged bug reports

To improve the UTS, which includes the common elements, the proposed method builds additional topic sets. One is the partial topic set (PTS). The existing LDA classification cannot determine the priority or severity fields of the UTS; thus, the method must identify them along with the mis-triaged bug reports, and the PTS assumes this role. The PTS-building process is similar to that of the UTS: the proposed method classifies the bug reports in the training set by priority and severity, and a PTS representing each field is obtained by applying LDA to that field. Figure 9 visualizes the building of the PTS and the process of identifying mis-triaged bug reports. The most popular field of the UTS can be determined by comparing the UTS with each PTS, and the method estimates that bug reports inconsistent with the most popular field are mis-triaged. The common element problem could be resolved by correctly reclassifying mis-triaged bug reports based on the PTS alone; however, whereas the existing method uses only the UTS for bug report classification, this approach must traverse the UTS and the PTSs of all fields. From a temporal aspect, this step is highly inefficient, and the search space of the bug triage should therefore be reduced. This problem is resolved in the next step.

Fig. 9 Process for building the PTS and identifying mis-triaged reports
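The identification step can be sketched as follows, assuming the UTS and the per-field PTSs are available as word sets; the sets shown and the overlap criterion are hypothetical simplifications.

# Hypothetical topic sets: the UTS and one PTS per priority field.
uts = {"crash", "image", "loader", "freeze", "ui"}
pts = {
    "P1": {"crash", "image", "loader"},
    "P2": {"freeze", "ui", "button"},
}

def most_popular_field(report_words):
    """Estimate the field whose PTS overlaps most with the report's words."""
    overlap = {field: len(report_words & words) for field, words in pts.items()}
    return max(overlap, key=overlap.get)

# A report triaged as P2 whose UTS words match the P1 PTS is flagged.
report = {"crash", "image", "extension"}
assigned = "P2"
if most_popular_field(report & uts) != assigned:
    print("candidate mis-triage: reclassify based on the PTS")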

Analyzing mis-triaged bug reports by building a feature topic set

To overcome the temporal limit of the PTS-based method, a method for reducing the search space of the bug report triage using the feature topic set (FTS) is proposed in this study. This method does not traverse all PTSs; instead, it traverses only the FTSs corresponding to the common elements in the bug reports. Figure 10 shows the process of building the FTS and analyzing its features. From the initial results, the proposed method collects the mis-triaged bug reports identified by the PTS and obtains the FTS by applying LDA to them. The FTS can be constructed in two ways: the first builds an FTS for the correct destination of the mis-triaged bug reports, and the second builds an FTS for both the current location and the correct destination of the bug reports. The latter has the advantage of a smaller search space because it designates the current location; however, it should be employed only when a massive training set is available because the number of topics per FTS decreases. The topics of the FTS are divided into four parts based on their ranking in the UTS. Table 1 lists the terms used in this step.

Fig. 10 Process of building the FTS and features of FTS topics

Table 1 Term definitions for analyzing mis-triaged bug reports

In Fig. 10, the FTS has several topics; in particular, “red,” “correct,” and “hash” appear in the original UTS numerous times. The proposed method obtains the ranks of the topics in both the major and minor fields (taking the average rank when two or more minor fields exist) and assigns each topic a factor based on these ranks. Words that do not appear in the UTS are classified under the NN factor. There are two methods for addressing common elements. With the first, common elements are removed from their minor field. Depending on the factor of the removed common elements, the performance of the UTS is influenced, but N-clsf commonly decreases; if none of the words in a bug report remain in the UTS, the proposed method cannot classify the bug report at all, i.e., deleting common elements is inefficient. If the removed common elements belong to the HH factor, acc-major improves but acc-minor decreases. If they belong to the HL factor, acc-major increases and acc-minor is less affected. In the worst case, if they belong to the LH factor, acc-major is unaffected and acc-minor declines.

The other method for addressing common elements is to leave them in place and, whenever a bug report is classified by common elements, reclassify the report by matching it against the corresponding FTS. Although this approach increases the temporal cost of classifying bug reports through the FTS, it prevents the decrease in N-clsf caused by removing common elements. In particular, when the NN factor is employed, the process is quick because, unlike the other factors, it does not require checking common elements in the FTS. This study uses this second method.
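The factor assignment can be sketched as follows, assuming each word’s rank in the major and minor fields is available from the topic sets; the rank tables and the high/low threshold are hypothetical.

# Hypothetical word ranks per field (a lower rank means higher influence).
rank_major = {"crash": 1, "red": 2, "hash": 9}
rank_minor = {"crash": 2, "red": 15, "hash": 3}
HIGH = 5  # hypothetical threshold separating high (H) from low (L) ranks

def factor(word):
    """Assign H/L per field; NN if the word does not appear in the UTS."""
    if word not in rank_major or word not in rank_minor:
        return "NN"
    major = "H" if rank_major[word] <= HIGH else "L"
    minor = "H" if rank_minor[word] <= HIGH else "L"
    return major + minor

print(factor("crash"))  # "HH": highly ranked in both the major and minor fields
print(factor("red"))    # "HL": high in the major field, low in the minor field
print(factor("hash"))   # "LH": low in the major field, high in the minor field
print(factor("blue"))   # "NN": not present in the UTS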

Re-classifying bug reports by improving UTS using FTS

To improve the UTS using the FTS, the proposed method builds a factor parser consisting of the common elements in the UTS. The parser enables fast lookups using a hash table or trie (because all common elements are words) and returns the addresses of the FTSs that correspond to a common element. Figure 11 shows an example in which the UTS is improved using the FTS. The proposed method classifies the bug reports through the UTS, as in the existing LDA classification. If a bug report includes common elements (particularly those with HH, HL, or LH factors), the method calls the FTSs that correspond to the common elements identified by the factor parser. The NN factor exists only in the FTS; because the other factors are already identified, the NN factor best represents its major field. Thus, constructing the FTS using only NN-factor words yields a more accurate and faster environment. The method then compares the report with the FTS to check whether the classification is correct.

Fig. 11 Example of improving the UTS using the FTS
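A minimal sketch of this reclassification step follows, assuming the factor parser is a hash map from each common element to its factor and the corresponding FTS; the data and the overlap-based check are hypothetical.

# Factor parser: maps each common element to its factor and the FTSs to
# consult (a hash map suffices because all common elements are single words).
factor_parser = {
    "crash": ("HH", {"P1": {"loader", "extension"}, "P3": {"button", "ui"}}),
}

def classify(report_words, uts_label):
    """Classify by the UTS first; re-check via the FTS on common elements."""
    for word in report_words:
        if word in factor_parser:
            fact, fts = factor_parser[word]
            if fact in {"HH", "HL", "LH"}:  # factors that trigger an FTS lookup
                overlap = {f: len(report_words & ws) for f, ws in fts.items()}
                best = max(overlap, key=overlap.get)
                if overlap[best] > 0 and best != uts_label:
                    return best  # corrected triage
    return uts_label  # the UTS classification stands

print(classify({"crash", "loader", "image"}, "P3"))  # "P1"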

Evaluation

Experiment design

To design an experiment verifying the proposed method, we collected multiple bug reports. Our dataset consists of 3362 bug reports from Bugzilla, which supports various types of bug reports and metadata, and 41,229 bug reports from MSR, which supports numerous types of bug reports. The bug reports from Bugzilla support both severity and priority as metadata useful for triage, whereas those from MSR support only priority; the Bugzilla reports are therefore classified based on both priority and severity. To conduct practical experiments, a total of 231 Bugzilla bug reports related to Git, actually used in group development, were also included. Git is an open-source distributed version-control system for tracking changes in source code during software development [76].

We then fit a model to verify the improvement in UTS accuracy achieved by the proposed method. The collected bug reports were divided into three groups: Bugzilla, MSR, and a combination of the two. To build the training and test sets, we employed tenfold cross-validation. We implemented the proposed method in Python 3, which supports various libraries for NLP and topic modeling; we used nltk (NLP), stop-words, and gensim (topic modeling).
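The evaluation loop can be sketched as follows. The fold-splitting helper is written in plain Python as a hypothetical stand-in for the cross-validation utility actually used, and the per-fold training step is left as a placeholder.

import random

def tenfold(reports, seed=0):
    """Yield (training set, test set) pairs for tenfold cross-validation."""
    shuffled = reports[:]
    random.Random(seed).shuffle(shuffled)
    fold_size = len(shuffled) // 10
    for i in range(10):
        test = shuffled[i * fold_size:(i + 1) * fold_size]
        train = shuffled[:i * fold_size] + shuffled[(i + 1) * fold_size:]
        yield train, test

# Hypothetical usage with placeholder identifiers for the Bugzilla dataset.
reports = [f"report_{i}" for i in range(3362)]
for fold, (train, test) in enumerate(tenfold(reports), start=1):
    pass  # build the UTS/PTS/FTS on `train`, then triage `test` and score it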

Experiment results

Table 2 shows the percentage of bug reports classified by the proposed method in each fold, where a fold is one of the divided parts in the cross-validation. “Bugzilla,” “MSR,” “Bugzilla (Git),” and “Integrated” denote the bug report datasets, and the percentages indicate the proportion of bug reports classified in each dataset. Bugzilla (Git) consists of the bug reports related to Git in the Bugzilla dataset, and the “Integrated” dataset combines “Bugzilla” and “MSR.” As shown in Table 2, the method achieves a classification rate greater than 97% for all folds of the Bugzilla reports, with an average rate of 98.311%. For MSR, the classification rate is lower than that of Bugzilla, with an average of 94.993%. For Bugzilla (Git), the classification rate is lower than that of the entire Bugzilla dataset, as with MSR, and the average rate is 93.91%. In the integrated environment, the average rate is 96.199%.

Table 2 Percentage of bug reports classified by the proposed method (%)

Table 3 shows, for each fold, the accuracy of the bug reports classified based on severity using LDA in Bugzilla, and Table 4 shows the corresponding accuracy when using the proposed method. The first row of Tables 3 and 4 lists the severity levels of the bug reports, and the numbers denote the classification accuracy for each severity level. Numbers in italics represent the maximum values, and numbers in round brackets represent the minimum values. The proposed method achieves an accuracy improvement of 25% over the original LDA across the seven severity fields; “Block” in particular accounts for 35%. The proposed method shows a classification accuracy of 85% for all fields except “Block.” The reason for the low effectiveness for “Block” is the small number of such bug reports: because overlapping contexts are rare compared with the other fields, the method cannot obtain high-weight topics.

Table 3 Triage accuracy of bug reports classified by LDA (severity) (%)
Table 4 Triage accuracy of bug reports classified by the proposed method (severity) (%)

Table 5 shows, for each fold, the accuracy of the bug reports classified based on priority using LDA, and Table 6 shows the corresponding accuracy when using the proposed method. The first row of Tables 5 and 6 lists the priority levels used in the classification.

Table 5 Triage accuracy of bug reports classified by LDA (priority) (%)
Table 6 Triage accuracy of bug reports classified by the proposed method (priority) (%)

The proposed method achieves an accuracy improvement of 24% over the original LDA across the five priority fields. In particular, the method performs excellently for “P2” and “P4.”

Discussion

Compatibility with original LDA in the existing studies

This section discusses how the proposed method is compatible with (substitutable for) the LDA component of the existing combined methods (LDA with other methods). Zou et al. [62] defined two constraints on generating a combined method. First, the base techniques should use the same information source; if they use different sources, data conversion is necessary, and when the user cannot develop a data converter, the combined method cannot be built correctly. Even if a data converter is developed, it can introduce other defects. Second, the correlation between the combined base techniques should be low. Zou et al. [77] categorized fault-localization (FL) techniques into seven FL families and identified sets of techniques with weak correlations with each other.

Regarding the first condition, the proposed method uses a bug report dataset, which is the same information source as LDA. Regarding the second condition, the existing studies combined LDA with techniques of low correlation; because the proposed method is based on multiple LDA and uses the same information source as LDA, it likewise ensures a low correlation with the techniques used in the existing studies. Thus, the proposed method can substitute for the original LDA in the combined LDA methods of the existing studies.

Statistical comparison between the proposed method and the original LDA

Table 7 shows the results of the paired t-test for each field in Tables 3 and 4, and Table 8 shows the results for each field in Tables 5 and 6. A paired t-test compares the means of two samples drawn from the same subjects, typically before-and-after observations [78, 79]. The t-statistic is the value obtained by the t-test for each field, and the p-value is the statistical value used to compare the proposed method and the original LDA. The null hypothesis (H0) states that no statistically significant difference exists between the proposed method and the original LDA; the alternative hypothesis (H1) states that such a difference does exist. In Tables 7 and 8, all p-values are less than 0.05; we therefore reject H0 and adopt H1. Thus, the proposed method differs from the original LDA in a statistically significant manner, i.e., the proposed method is better than the original LDA classification.

Table 7 Paired t-test results for severity accuracy
Table 8 Paired t-test results for priority accuracy
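The statistical comparison can be reproduced with scipy’s paired t-test, as in the following minimal sketch; the per-fold accuracy values shown are hypothetical placeholders, not the values from Tables 3–6.

from scipy import stats

# Hypothetical per-fold accuracies (%) for one severity field.
lda_acc      = [61.2, 58.7, 63.1, 60.4, 59.8, 62.5, 61.9, 60.1, 58.3, 62.0]
proposed_acc = [85.4, 86.1, 84.7, 85.9, 86.3, 85.0, 84.8, 86.5, 85.2, 85.7]

# Paired t-test: the samples are before/after measurements on the same folds.
t_stat, p_value = stats.ttest_rel(proposed_acc, lda_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")

# Reject H0 at the 5% level when p < 0.05.
if p_value < 0.05:
    print("statistically significant difference between the two methods")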

Conclusions

In this paper, a method for improving LDA, which is commonly employed in bug report triage, was proposed. To improve the classification accuracy of the topic set used with LDA, the proposed method builds additional topic sets and uses them to improve the original set. To validate the proposed method, we used bug report platforms applied in practice, namely Bugzilla, MSR, Bugzilla (Git), and an integrated platform. The experimental results demonstrate that the method classifies bug reports accurately, and a statistical paired t-test shows that it outperforms the original LDA.

Traditional bug triage methods try to cover their weaknesses by combining LDA with other methods. However, such hybrid methods may have compatibility problems, such as a correlation or a difference in the input data between the combined techniques. Because the proposed method focuses on upgrading LDA itself, these compatibility issues do not occur. In addition, the proposed method provides a basis for developing further improved hybrid methods.

Although this study improves triage accuracy, we will pursue further research on the following issues. Many related studies use bug reports to help fix bugs, but studies that address fixing bugs using comments are lacking. We intend to study the relation between comments and bug reports and to identify bugs based on this relation. We will also study the automation of bug identification using the results of these studies.