1 Introduction

Coronary artery disease (CAD) is caused by stenosis of the vessels that supply oxygenated blood to the heart, and it results in severe heart problems such as angina and heart attack [10, 11]. CAD is one of the leading causes of death worldwide, including in Nigeria: it killed more than 7.4 million people around the world in 2012 and 53,836 people in Nigeria as of 2014 [20, 21]. It is also estimated that 1 in 7 people in the United States has CAD [3]. The disease is also one of the leading causes of death in women, killing more people than cancers [3]. CAD causes more deaths and disabilities in developed nations than developing ones [20] [28].

In Nigeria, CAD causes more deaths due to insufficient knowledge of the negative impact of the disease [5] [23, 24]. CAD-related deaths in Nigeria reached 53,836, which amounts to 2.82% of all deaths that occurred in 2014 [20] [24]. Most victims of the disease tend to ignore the early symptoms and avoid consulting health workers until their condition is severe. Therefore, most of these patients die before receiving appropriate medication or medical attention [5]. There is a huge burden of CAD in most West African countries due to limited resources for providing comprehensive health care to CAD patients and inadequate awareness campaigns about the disease. Therefore, early detection and diagnosis of CAD, currently one of the deadliest diseases in Nigeria, could help significantly in fighting the disease [8, 9].

Medical diagnosis is the process of identifying a disease by assessing specific symptoms and signs [11, 12] [22]. The patient describes the symptoms of the disease to the medical doctor, while the signs of the disease are observed by the doctor. However, patients may not always describe their symptoms accurately, and physicians may not always be sure of the signs of a disease, because of uncertainty and vagueness in the course of diagnostic decision making [11, 21, 22]. Therefore, various uncertainties and vagueness affect the diagnostic process, and they must be carefully dealt with [7]. Physicians' decisions also often vary because of the uncertainty and vagueness of the information at their disposal [18, 24, 25]. The uncertainties, complexities and vagueness involved in the diagnostic decision-making process have to be addressed. In this regard, fuzzy expert systems are being developed to mimic human specialists' diagnostic decision-making processes in order to address the uncertainties, complexities and vagueness often associated with decision making [11, 12, 26]. A fuzzy-based expert system is an advanced artificial intelligence system that uses unconventional (non-binary) reasoning to reduce the uncertainty that is often associated with the diagnosis of diseases [1, 17] [21] [26]. In this study, a fuzzy expert system for the diagnosis of coronary artery disease is built with MATLAB, which can easily be integrated into an electronic health record system.

2 Related works

In [14], a fuzzy expert system for diagnosing heart disease based on medical records in Jordan was developed using Visual Studio, and the system is able to identify CAD patients. In [19], a clinical support system for treating chronic heart disease was developed using risk factors, most of which are clinical. The C4.5 algorithm was used to generate the system's production rules from the Cleveland heart disease database, and the system was reported to be very efficient and effective. In [16], a web-based system for diagnosing cardiac disease was developed with PHP, HTML, JavaScript and MySQL; the system used 15 input variables with seven diagnostic rules. In [1], an expert system for diagnosing cardiovascular disease was developed with MATLAB, and the system has 94% accuracy. In [13], an expert system for cardiovascular disease was developed, and its production rules were derived from the Cleveland Clinic Foundation dataset in the UCI Repository of Machine Learning Databases. In [17], a CAD screening system using clinical parameters was developed, and a questionnaire was designed under a medical team's guidance to collect information about patients' clinical parameters based on the risk factors of CAD. The system was implemented using an object-oriented technique with one demographic risk factor and eleven clinical risk factors of CAD.

Many scholars have developed expert systems for diagnosing CAD. Still, the production rules of most of these systems were generated from a repository dataset of CAD, such as the Cleveland Heart Disease Database. Only a few generated datasets from available medical records. Therefore, a fuzzy-based expert system for the diagnosis of CAD that uses datasets generated from the medical records of CAD patients in Nigeria is required.

3 Materials and methods

The fuzzy-based expert system developed in this work has four (4) major components: knowledge discovery (data mining), fuzzification, knowledge inference and defuzzification. Figure 1 shows the methods and materials of the study.

Fig. 1 Methodology of the study

3.1 Dataset

Diagnostic data of patients who were suffering from CAD, and of those who were suspected of having it, were collected at General Hospitals in Kano State, Nigeria. The data collection was approved by the Kano State Ministry of Health, Kano, Nigeria.

Data preparation.

The collected dataset was prepared and cleansed, after which 506 diagnostic cases remained. The dataset has twelve (12) attributes: age, glucose, blood pressure, chest pain, triglycerides, high-density lipoprotein (HDL), cholesterol, low-density lipoprotein (LDL), body mass index, creatinine, heart rate, and diagnostic result. Table 1 shows the units, range, and data type of each attribute of the dataset.

Table 1 Description of the Attributes of the dataset
Table 2 Performance Evaluation Result

3.2 Knowledge discovery (Data mining process)

The prepared and cleaned dataset was transformed into the Weka-readable Attribute-Relation File Format (ARFF). Weka is open-source machine learning software used to uncover useful knowledge from a dataset [25]. An improved C4.5 classification algorithm proposed by [4] was coded into Weka to generate the production rules used in the knowledge base of the system. The improvement employs L'Hôpital's rule, and the algorithm uses the average of the information gain and the information gain ratio, rather than just the information gain ratio used by the C4.5 algorithm, as the criterion for selecting the candidate attribute for the root of the decision tree [27]. Let S be the dataset and B an attribute of the dataset. The information gain of attribute B is computed using Eq. 1 as follows

$$\mathrm{Gain}\left(\mathrm{B}\right)=\mathrm{I}\left(\mathrm{p},\mathrm{n}\right)-\mathrm{E}(\mathrm{B})$$
(1)

The gain ratio of B, where I(B) is the split information of B, is expressed in Eq. 2 as follows

$$\mathrm{Gain\text{-}Ratio}\left(\mathrm{B}\right)=\frac{\mathrm{Gain}(\mathrm{B})}{\mathrm{I}(\mathrm{B})}=\frac{\mathrm{I}\left(\mathrm{p},\mathrm{n}\right)-\mathrm{E}(\mathrm{B})}{\mathrm{I}(\mathrm{B})}$$
(2)

The algorithm computes the average of the information gain and the information gain ratio using Eq. 3, where splitinfo(B) is the split information of B.

$${\mathrm{GGR}}_{\mathrm{av}}(\mathrm{B})=\frac{\mathrm{Gain}\left(\mathrm{B}\right)+\mathrm{Gain\text{-}Ratio}(\mathrm{B})}{2}=\frac{\mathrm{Gain}\left(\mathrm{B}\right)+ \frac{\mathrm{Gain}(\mathrm{B})}{\mathrm{splitinfo}(\mathrm{B})}}{2}=\frac{\mathrm{Gain}\left(\mathrm{B}\right)* \left(1+ \frac{1}{\mathrm{splitinfo}(\mathrm{B})}\right)}{2}$$
(3)
$${\mathrm{GGR}}_{\mathrm{av}}(\mathrm{B})= \frac{\left\{\frac{\mathrm{pn}}{\mathrm{N}}- \left[\left[\frac{{\mathrm{B}}_{11}* {\mathrm{B}}_{12}}{{\mathrm{B}}_{1}}\right]+ \left[\frac{{\mathrm{B}}_{21}* {\mathrm{B}}_{22}}{{\mathrm{B}}_{2}}\right]\right]\right\}* \left(1+ \frac{{\mathrm{B}}_{1}{\mathrm{B}}_{2}}{\mathrm{N}}\right)}{2\frac{{\mathrm{B}}_{1}{\mathrm{B}}_{2}}{\mathrm{N}}}$$
(4)

where

B1 is the set of positive samples in B.

B2 is the set of negative samples in B.

B11 is the set of positive samples in B with a positive value of the attribute.

B12 is the set of positive samples in B with a negative value of the attribute.

B21 is the set of negative samples in B with a positive value of the attribute.

B22 is the set of negative samples in B with a negative value of the attribute.
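
The selection criterion in Eqs. 1 to 3 can be illustrated with a minimal Python sketch. It assumes the standard C4.5 definitions of the entropy I(p, n), the expected information E(B) and the split information, which the text does not spell out; the function names and the toy counts are hypothetical, and the L'Hôpital-based approximation of Eq. 4 is not reproduced here.

```python
from math import log2


def entropy(p, n):
    """I(p, n): entropy of a set with p positive and n negative samples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # the term 0 * log2(0) is taken as 0
            frac = count / total
            result -= frac * log2(frac)
    return result


def averaged_gain(partitions):
    """Return (gain, gain ratio, their average) for one attribute B.

    partitions: one (positives, negatives) pair per value of B.
    """
    p = sum(pos for pos, _ in partitions)
    n = sum(neg for _, neg in partitions)
    total = p + n
    # E(B): entropy of each partition, weighted by its share of the samples
    expected = sum((pos + neg) / total * entropy(pos, neg)
                   for pos, neg in partitions)
    gain = entropy(p, n) - expected                          # Eq. 1
    split_info = -sum((pos + neg) / total * log2((pos + neg) / total)
                      for pos, neg in partitions if pos + neg)
    gain_ratio = gain / split_info if split_info else 0.0    # Eq. 2
    return gain, gain_ratio, (gain + gain_ratio) / 2         # Eq. 3


# Hypothetical attribute that separates 4 positive from 4 negative samples.
print(averaged_gain([(4, 0), (0, 4)]))  # (1.0, 1.0, 1.0)
```

The attribute with the largest averaged value would be chosen as the split at each node under this criterion.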

The improved C4.5 algorithm was run in Weka alongside the C4.5 and Random Tree algorithms. The performance results of the algorithms are shown in Table 2. The improved algorithm has the highest accuracy among the three algorithms, at 86.56%.

The decision tree generated using the improved C4.5 algorithm was converted into crisp rules. Table 3 shows some of the crisp rules generated from the decision tree.

Table 3 Sample of Diagnostic Crisp Rules
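
Converting a decision tree into crisp rules amounts to emitting one IF-THEN rule per root-to-leaf path. The sketch below illustrates that idea in Python under stated assumptions: the nested-dictionary tree layout, the function name and the thresholds are hypothetical and are not taken from the rules in Table 3.

```python
def tree_to_rules(node, conditions=()):
    """Walk a decision tree and return one IF-THEN rule per leaf."""
    if "label" in node:  # leaf: the accumulated path becomes the antecedent
        antecedent = " AND ".join(conditions) or "TRUE"
        return [f"IF {antecedent} THEN diagnosis = {node['label']}"]
    rules = []
    for test, child in node["branches"].items():
        condition = f"{node['attribute']} {test}"
        rules += tree_to_rules(child, conditions + (condition,))
    return rules


# Hypothetical toy tree; attributes and thresholds are illustrative only.
toy_tree = {
    "attribute": "cholesterol",
    "branches": {
        "<= 200": {"label": "healthy"},
        "> 200": {
            "attribute": "age",
            "branches": {
                "<= 45": {"label": "mild"},
                "> 45": {"label": "severe"},
            },
        },
    },
}

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each printed rule has crisp thresholds in its antecedent, which is the form that the later fuzzification step replaces with linguistic values.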

3.3 Rule selection

The Rule Selection Technique (RST) proposed by [2] was adopted to select among the crisp rules generated using the improved C4.5 algorithm. The rule selection is based on the notion of an importance measure and supports filtering of the rules; therefore, the rules were converted into a decision table. The filtering technique is applied first to reduce the number of rules, and the importance measure is then applied to select the most important ones [2].

3.4 Fuzzification

Fuzzification is the process of fuzzifying the crisp set of rules generated using the improved C4.5 algorithm. The fuzzification is carried out using fuzzy logic. Unlike traditional logic, which has only the values 0 and 1, fuzzy logic admits infinitely many values from 0 to 1. Fuzzy logic is called multi-valued logic, unlike conventional set logic, where an element either belongs entirely to a set or does not belong to it at all [6, 17]. In fuzzy theory, a fuzzy set A in X is defined as a set of ordered pairs A = {(x, μA(x)) | x ∈ X}

where μA(x) is called the membership function of set A

$${\upmu }_{\text{A}}(\text{x}) :\text{ X }\to [0, 1],\text{ where }{\upmu }_{\text{A}}(\text{x}) = 1\text{ if x is totally in A};\ {\upmu }_{\text{A}}(\text{x}) = 0\text{ if x is not in A};\ 0 < {\upmu }_{\text{A}}(\text{x}) < 1\text{ if x is partly in A}$$
(5)

Fuzzy sets allow a continuum of possible choices. For any element x of the universe X, the membership function μA(x) is equal to the degree to which x is an element of set A [27]. This value, which lies between 0 and 1, is the degree of membership [27]. It is also known as the membership value of the element x in set A. Fuzzy logic is a way of expressing ambiguity and uncertainty. The advantage of fuzzy sets is that they can overlap and so avoid the problem of sharp boundaries [15]. Therefore, the attributes of the dataset, which are the system's inputs and output, were fuzzified in order to address the inaccuracies, ambiguities, and uncertainties associated with diagnostic decision making for CAD patients [17].

The system's input parameters are age, blood pressure, glucose, cholesterol, triglycerides, HDL, LDL, creatinine, body mass index, heart rate, and chest pain. Every input parameter except chest pain is defined with three fuzzy linguistic values, while chest pain and the output variable (diagnosis) each have four. The output variable (diagnosis) has the fuzzy linguistic values healthy, mild, moderate, and severe, while chest pain has the values typical angina, atypical angina, non-angina, and asymptomatic. However, there is no ambiguity or overlap in chest pain, since a patient has only one type of chest pain at a time. The value of a fuzzy variable is defined by its fuzzy membership grade, which is determined by the membership function. A trapezoidal membership function was used for all input variables, while a triangular membership function was used for the output variable. A trapezoidal membership function is represented as Trapezoidal (x; a, b, c, d). The membership function values at x = a, x = b, x = c and x = d are set equal to 0.0, 1.0, 1.0 and 0.0, respectively. The trapezoidal membership function is expressed in Eq. (6) below

$$trapezoid \space \left( x;a, b, c, d\right)=\mathrm{max}\left(\mathrm{min}\left(\frac{x-a}{b-a} , 1,\frac{d-x}{d-c} \right),0\right).$$
(6)

The triangular membership function is denoted by Triangle (x; a, b, c). The membership function values at x = a, x = b and x = c are set equal to 0.0, 1.0 and 0.0, respectively. The triangular membership function is expressed in Eq. (7) below

$$triangle \ \left( x;\ a, b, c\right)=\mathrm{max}\left(\mathrm{min}\left(\frac{x-a}{b-a} , 1,\frac{c-x}{c-b} \right),0\right)$$
(7)
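
Eqs. (6) and (7) translate directly into code. The following Python sketch is only an illustration of the two formulas; the function names mirror the equations, and the breakpoints used in the example calls are hypothetical rather than the ones used in the system.

```python
def trapezoid(x, a, b, c, d):
    """Eq. (6): 0 at a and d, 1 on [b, c]. Assumes a < b <= c < d."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)


def triangle(x, a, b, c):
    """Eq. (7): 0 at a and c, 1 at b. Assumes a < b < c."""
    return max(min((x - a) / (b - a), 1.0, (c - x) / (c - b)), 0.0)


# Hypothetical "middle-aged" set with breakpoints 30, 40, 50 and 60 years.
print(trapezoid(35, 30, 40, 50, 60))  # 0.5: halfway up the rising edge
print(trapezoid(45, 30, 40, 50, 60))  # 1.0: on the plateau
print(triangle(10, 0, 10, 20))        # 1.0: at the peak
```

Because both functions assume strictly increasing breakpoints on their sloped edges, a set with a vertical edge (for example a = b) would need a separate guard against division by zero.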

The linguistic variables and membership functions of each attribute of the dataset were determined, calculated and visualized using MATLAB. Once each attribute's linguistic variables had been determined and each crisp value had been converted into a fuzzy value, all the crisp rules generated using the improved C4.5 decision tree algorithm were transformed into the corresponding fuzzy rules. Table 4 shows a sample of the fuzzy rules, Fig. 2 shows the membership functions of the linguistic variables of age, Fig. 3 shows the membership functions of the linguistic variables of chest pain and Fig. 4 shows the membership functions of the linguistic variables of diagnosis.

Table 4 Sample of Fuzzy Rules
Fig. 2 Membership functions of the linguistic variables of age

Fig. 3 Membership functions of the linguistic variables of chest pain

Fig. 4 Membership functions of the linguistic variables of diagnosis

4 Fuzzy based expert system

The fuzzy-based expert system for the diagnosis of CAD has three major components: the knowledge base, the inference engine, and defuzzification (user interface).

4.1 Knowledge base

The knowledge base was developed from historical data and the experience of cardiologists. Cardiologists were consulted and involved in the stages of data collection, cleaning, interpretation and knowledge generation. They verified each rule generated with the improved data mining algorithm, and all conflicts were resolved. The system employs production rules for knowledge representation. The production rules are written in the format <IF (condition) THEN (conclusion)>. In the present fuzzy system, the condition and the conclusion are fuzzy variables. These rules are diagnostic rules and are selected by the inference engine of the system. MATLAB is used to implement the system, which has 87 rules. The knowledge base rules are shown in Fig. 5.

Fig. 5 Knowledge base rules

4.2 Knowledge inference

Knowledge inference is the mechanism for inferring new knowledge from the existing fuzzy rules in the system's knowledge base, so that new information and conclusions can be deduced from them. The Mamdani inference technique is used in this work to simulate expert physicians' reasoning in diagnosing CAD. The Mamdani fuzzy inference system is widely used because it provides good results with a relatively simple structure. It was originally used to create a control system by synthesizing a set of linguistic production rules obtained from experienced human operators [17]. The minimum operator is used throughout: the conjunction operator is MIN and the t-norm of the compositional rule is MIN, while the MAX operator is used to aggregate the rules. Figure 6 shows the Graphical User Interface (GUI) of the system inference with the Mamdani technique.

Fig. 6 GUI of System Inference with Mamdani technique
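
The MIN/MAX scheme described for the Mamdani technique can be sketched in a few lines of Python. This is a simplified illustration rather than the MATLAB implementation: the two rules, the fuzzy sets, their breakpoints and the 0-to-1 output scale are all hypothetical, and triangular sets are used for the inputs only to keep the example short.

```python
def tri(x, a, b, c):
    """Triangular membership function (assumes a < b < c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


# Hypothetical fuzzy sets; the breakpoints are illustrative only.
INPUT_SETS = {
    ("age", "old"): lambda x: tri(x, 40, 70, 100),
    ("cholesterol", "high"): lambda x: tri(x, 180, 260, 340),
    ("cholesterol", "low"): lambda x: tri(x, 100, 160, 220),
}
OUTPUT_SETS = {
    "mild": lambda z: tri(z, 0.0, 0.25, 0.5),
    "severe": lambda z: tri(z, 0.5, 0.75, 1.0),
}
# Each rule: (list of (variable, term) antecedents joined by AND, consequent)
RULES = [
    ([("age", "old"), ("cholesterol", "high")], "severe"),
    ([("cholesterol", "low")], "mild"),
]


def mamdani(sample, zs):
    """Return the aggregated output membership at each point in zs."""
    aggregated = [0.0] * len(zs)
    for antecedents, consequent in RULES:
        # Conjunction (AND) of the antecedents uses MIN
        strength = min(INPUT_SETS[var, term](sample[var])
                       for var, term in antecedents)
        for i, z in enumerate(zs):
            clipped = min(strength, OUTPUT_SETS[consequent](z))  # MIN clipping
            aggregated[i] = max(aggregated[i], clipped)          # MAX aggregation
    return aggregated


print(mamdani({"age": 70, "cholesterol": 220}, [0.25, 0.75]))  # [0.0, 0.5]
```

In the example, only the first rule fires (with strength 0.5), so the aggregated set is the "severe" set clipped at 0.5.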

4.3 Defuzzification

Defuzzification transforms the output of the inference engine (fuzzy values) into crisp values. The centroid method, also called the center of area or center of gravity, is employed in this work for defuzzification, as given in Eq. 8, where z is the output variable and μA(z) is the membership function of the aggregated fuzzy set A with respect to z. The centroid method converts the system's fuzzy diagnosis result, which is the output of the system, into a crisp value.

$${Z}_{COA}= \frac{{\int }_{Z}{\mu }_{A}\left(z\right)\, z\, dz}{{\int }_{Z}{\mu }_{A}\left(z\right)\, dz}$$
(8)
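
In practice the integrals in Eq. 8 are approximated numerically. The Python sketch below shows one such approximation, assuming the aggregated membership function has been sampled at evenly spaced points so that the step dz cancels between numerator and denominator; the function name and the sample values are hypothetical.

```python
def centroid(zs, memberships):
    """Discrete approximation of Eq. 8 over evenly spaced points zs."""
    area = sum(memberships)
    if area == 0:
        raise ValueError("aggregated fuzzy set is empty; no rule fired")
    return sum(z * m for z, m in zip(zs, memberships)) / area


# A symmetric aggregated set has its centroid at the point of symmetry.
print(centroid([0.0, 0.5, 1.0], [0.0, 1.0, 0.0]))  # 0.5
```

A finer sampling grid gives a closer approximation to the continuous centroid, at the cost of more evaluations of the membership function.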

Figure 7 shows the GUI of Rule Viewer of the system while Fig. 8 shows the Surface Viewer of the system.

Fig. 7 GUI of Rule Viewer

Fig. 8 Surface Viewer

5 Performance evaluation of the system

The expert system was applied to the diagnostic data of 100 people (Healthy = 21%, Mild = 23%, Moderate = 31% and Severe = 25%) who came to the Specialist Hospital in Kano, Nigeria for a CAD checkup. Information on one demographic risk factor and eleven clinical risk factors was collected from them, and the cases were labelled by a cardiologist. The system was then applied to predict the risk class of each person. To evaluate the performance of the system, the predicted outputs were compared with the results given by the cardiologist. Table 5 shows the check-up results for each class of patients with healthy, mild, moderate, and severe cases.

Table 5 Checkup Result

The system was used to diagnose CAD patients based on one demographic CAD risk factor and eleven clinical risk factors. The following metrics were used to evaluate the performance of the system:

  i. Accuracy: the percentage of patients who were correctly diagnosed by the system.

  ii. Sensitivity: the percentage of abnormal patients who were correctly diagnosed by the system.

  iii. Specificity: the percentage of normal patients who were correctly diagnosed by the system.

  iv. Receiver Operating Characteristic (ROC) curve: shows the relationship between the specificity and sensitivity of the system.
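
The first three metrics can be computed from the counts of a confusion matrix. The Python sketch below assumes that the four diagnosis classes are collapsed into a binary outcome (for example, healthy as negative and any CAD grade as positive), which is an assumption made here for illustration; the function name and the counts in the example are hypothetical and are not the study's results.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: abnormal cases diagnosed correctly/incorrectly.
    tn/fp: normal cases diagnosed correctly/incorrectly.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for 20 cases (10 abnormal, 10 normal).
print(diagnostic_metrics(tp=8, tn=9, fp=1, fn=2))
```

With these counts the sketch yields an accuracy of 0.85, a sensitivity of 0.8 and a specificity of 0.9.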

Table 6 and Fig. 9 show the system's performance evaluation results: accuracy, sensitivity and specificity of 94.55%, 95.35%, and 95.00%, respectively. The ROC curve shows the relationship between the specificity and sensitivity of the system. The results indicate that the system is reliable and can diagnose both negative and positive CAD patients effectively.

Table 6 Performance Evaluation Result
Fig. 9 System Performance

The x-axis of the ROC curve shows specificity while the y-axis shows sensitivity, as shown in Fig. 10. The curve shows the relationship between specificity and sensitivity, and it indicates the diagnostic ability of the system as its discrimination threshold is varied. The curve shows that the system can diagnose positive cases of CAD more efficiently than negative cases.

Fig. 10 Receiver Operating Characteristic Curve
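
Varying the discrimination threshold, as the ROC curve does, can be sketched as follows. The Python example assumes that each case has a crisp score (such as a defuzzified output) and a binary label, and that both classes are present; the function name, the scores and the labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) per distinct threshold.

    A case is predicted positive when its score is >= the threshold.
    Assumes labels are 0/1 and that both classes occur at least once.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        tn = sum(1 for p, y in zip(predicted, labels) if not p and not y)
        points.append((threshold, tp / positives, tn / negatives))
    return points


# Hypothetical scores for two normal (0) and two abnormal (1) cases.
for point in roc_points([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1]):
    print(point)
```

Raising the threshold trades sensitivity for specificity, and plotting the resulting pairs gives a curve of the kind described above.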

6 Conclusion

CAD is no longer one of the deadliest diseases only in developed nations but also in developing countries like Nigeria. Therefore, CAD is a worldwide phenomenon. In this study, a fuzzy-based expert system for CAD diagnosis has been designed to assist health workers in diagnosing CAD. The improved C4.5 data mining algorithm is used to transfer human knowledge to the system's knowledge base instead of conventional techniques such as interviews and questionnaires. A performance evaluation of the system was carried out, and the system has 94.55% accuracy, 95.35% sensitivity, and 95.00% specificity. This shows that the system has a high capability of detecting both healthy and unhealthy patients and that it can be relied upon.