1 Introduction

Current BPMD techniques face great challenges when mining models from real-life event logs. Such logs often contain complex trace behaviours that may be far beyond the expressive power of existing BPMD approaches [9]. Figure 1 shows an example event log \(L_1\) and the model generated by applying the Heuristics Miner (HM) [3] to this log. The mined model for \(L_1\) has a relatively low fitness (0.7752) because \(L_1\) contains process behaviours that HM cannot express. Several pioneering approaches [69] have been proposed to solve this problem. These methods help mine high-fitness models expressed as Petri nets [1]. However, few efforts have been made to help the HM (which expresses process models as Heuristics nets) obtain better mining results. According to [10], the HM is one of the most important and widely used BPMD tools in the ProM framework [1] for dealing with real-life event logs. Developing an auxiliary method to help it mine high-fitness process models is far from trivial.

Fig. 1. The process model mined by executing Heuristics Miner on an example log \(L_1\).

In this paper, we put forward a novel heuristic method named HIF which transforms the fitness improvement problem for the non-fitting models mined by HM into the problem of locating the inexpressible process behaviours of HM in event logs and converting these found behaviours into expressible behaviours. As shown in Fig. 2, an element \(B_k\) in the process behaviour space (PBS) represents a kind of process behaviour extracted from a specific event log L. Afterwards, it is assessed whether \(B_k\) can be expressed by HM. If \(B_k\) cannot be expressed then all the process behaviours that pertain to \(B_k\) in L will be converted into expressible behaviours. Given an event log, how to build the PBS relevant to this log and how to locate the inexpressible behaviours in the PBS and then transform them into expressible behaviours for HM are the main problems that this paper is going to solve.

Fig. 2. The process for dealing with the inexpressible process behaviours in event logs.

2 Approach Design

In this section, some important basic notations and concepts are introduced in Subsect. 2.1. The details of the proposed method are elaborated in Subsects. 2.2, 2.3, 2.4 and 2.5.

2.1 Notation

Let \(L^{+}\) be the set of event logs, \(\varOmega : L^{+} \rightarrow M^{+}\) be a BPMD algorithm, where \(M^{+}\) is the set of process models. \(\varPhi : (M^{+},L^{+}) \rightarrow V^{+}\) represents a process model fitness evaluation mechanism which gets a process model together with its relevant event log as input and creates an assessed value from \(V^{+}\) (the set of all possible values output by \(\varPhi \)) as output.

Definition 1

(Direct Activity Relation). Let \(SA_L\) be the set of activities from an event log L. Symbol \(\succ _L\) represents a direct relation between two activities from \(SA_L\). For two activities a and b from \(SA_L\), \(a\! \succ _L{\!} b = true\) if \(|a{\!} \succ _L{\!} b| > 0\), where \(|a{\!} \succ _L{\!} b|\) is the number of times that a is directly followed by b in L.
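As a minimal sketch of Definition 1, the direct relation can be computed by counting adjacent pairs of activities. The representation of a log as a list of traces (each a sequence of activity labels), the function names and the toy log are assumptions made for illustration only.

```python
from collections import Counter


def direct_follow_counts(log):
    """Count |a >_L b|: how often a is directly followed by b in the log.

    Assumption: `log` is a list of traces, each a sequence of activity labels.
    """
    counts = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def directly_follows(counts, a, b):
    """a >_L b is true when a is directly followed by b at least once."""
    return counts[(a, b)] > 0


# Toy log (hypothetical, not the log L_1 of Fig. 1).
toy_log = [list("ABCD"), list("ACBD"), list("ABCD")]
counts = direct_follow_counts(toy_log)
```

Here `counts[("A", "B")]` is 2 because A is directly followed by B in two of the three toy traces, so `directly_follows(counts, "A", "B")` holds.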

2.2 Build Process Behaviour Space (PBS)

How to effectively extract and organise process behaviours is the first challenge encountered by our approach. In this subsection, we present a method for collecting and structuring the process behaviours from event logs based on two concepts: behaviour-related activity and behaviour-related sub-trace.

Definition 2

(Behaviour-Related Activity). Let \(SA_L\) be the set of activities for event log L. Symbol \(\Rightarrow _L\) represents a behaviour-based relation between any two activities from \(SA_L\). For two activities \(a, b \in SA_L\), \(a \Rightarrow _L b = true\) if \(a \succ _L b=true\) or \(b \succ _L a=true\); in this case, b is called a behaviour-related activity (BRA) of a.

Definition 3

(Behaviour-Related Sub-trace). Let \(SA_L\) be the set of activities for event log L. Let t be a trace from L, \(st \sqsubseteq t\) be a sub-trace of t and \(SA_{st}\) be the set of activities for st. Given an activity \(a \in SA_L\), st is a behaviour-related sub-trace (BRST) of a if \(a \in SA_{st}\) and \(a \Rightarrow _L b\) holds for every \(b \in SA_{st}\) with \(b \ne a\). Furthermore, st is a maximal behaviour-related sub-trace (MRST) of a if \(\not \exists st{'}\) such that \(st{'}\) is a BRST of a and \(st \sqsubset st{'}\).

Let’s take event log \(L_1\) depicted in Fig. 1 as an example. According to Definition 2, activity F has six BRAs: activities D, E, H, I, G and C. According to Definition 3, seven kinds of MRSTs for activity F can be discovered from \(L_1\) (as shown in Fig. 3). Every MRST of F contains activity F, and all the other activities in the MRST are BRAs of F.
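Definitions 2 and 3 can be illustrated with the following sketch. It rests on two assumptions that the definitions do not state explicitly: a sub-trace is taken to be a contiguous segment of a trace, and maximality is checked within the trace the segment comes from. The function names and the toy log are hypothetical.

```python
def behaviour_related_activities(log, a):
    """Return the BRAs of `a`: activities that directly precede or follow `a`.

    The activity `a` itself is left out of the returned set.
    """
    bras = set()
    for trace in log:
        for x, y in zip(trace, trace[1:]):
            if x == a and y != a:
                bras.add(y)
            if y == a and x != a:
                bras.add(x)
    return bras


def maximal_brsts(log, a):
    """Return the distinct maximal behaviour-related sub-traces of `a`.

    Assumption: sub-traces are contiguous, so an MRST is a longest run of
    activities drawn from {a} and the BRAs of a that contains a at least once.
    """
    allowed = behaviour_related_activities(log, a) | {a}
    found = set()
    for trace in log:
        segment = []
        for act in list(trace) + [None]:  # None closes the final segment
            if act in allowed:
                segment.append(act)
            else:
                if a in segment:
                    found.add(tuple(segment))
                segment = []
    return found


# Toy log (hypothetical, not the log L_1 of Fig. 1).
toy_log = ["ABCDE", "ACBDE"]
```

In this toy log the only BRA of E is D, so the single MRST of E is the segment (D, E), whereas B is related to A, C and D and therefore has two MRSTs, one per trace.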

Fig. 3. The PBS built for the example event log \(L_1\).

In our technique, the process behaviours recorded in an event log are divided into several groups where each group is relevant to a single activity from this log and the process behaviours for a group are stored in the MRSTs of its related activity. For instance, the PBS for log \(L_1\) consists of nine sets of MRSTs where each set of MRSTs is relevant to a specific activity from \(L_1\) (as shown in Fig. 3). Our technique is devised to detect each set of MRSTs stored in PBS iteratively for finding and converting inexpressible process behaviours for HM.

2.3 Activity Ranking

In this subsection, an activity ranking method is put forward in which the MRSTs related to the higher-ranked activities will be handled before the MRSTs relevant to the lower-ranked activities. The proposed activity ranking method is based on two concepts: behaviour-related activity weight (BAW) and activity ranking weight (ARW). Given an activity a from event log L, the BAW of a is defined as:

$$\begin{aligned} BAW_a = |\bullet \succ _L a|+|a \succ _L \bullet |. \end{aligned} \tag{1}$$

In Eq. 1, \(|\bullet \succ _L a|\) represents the total number of activities from L that are directly followed by a at least once and \(|a \succ _L \bullet |\) represents the total number of activities which directly follow a in L at least once.
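Eq. 1 amounts to counting the distinct direct predecessors and the distinct direct successors of an activity. The sketch below shows one way to compute it, assuming a log is a list of traces of activity labels; the function name `baw` and the toy log are illustrative assumptions.

```python
def baw(log, a):
    """Behaviour-related activity weight of `a` as in Eq. 1.

    Counts the distinct activities directly followed by `a` plus the
    distinct activities that directly follow `a`.
    """
    predecessors, successors = set(), set()
    for trace in log:
        for x, y in zip(trace, trace[1:]):
            if y == a:
                predecessors.add(x)
            if x == a:
                successors.add(y)
    return len(predecessors) + len(successors)


# Toy log (hypothetical, not the log L_1 of Fig. 1).
toy_log = ["ABCDE", "ACBDE"]
```

For the toy log, B has predecessors {A, C} and successors {C, D}, which gives a BAW of 4, while E is only ever preceded by D and has a BAW of 1.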

Axiom 1

The larger the BAW of an activity from an event log L, the more likely this activity is to be the main factor leading to the inexpressible process behaviours in L.

According to Axiom 1, the BAW is employed by our technique to quantify the complexity induced by an activity on its related process behaviours (i.e. the MRSTs of this activity) recorded in the relevant event log. However, Axiom 1 might not be applicable in all situations, e.g. an activity that only joins a concurrent behaviour may also have a large BAW but it will not cause any inexpressible process behaviour as long as the utilised BPMD algorithm can model concurrency. These additional situations are also considered in our approach proposed in the next subsection.

Let a be an activity from event log L. The ARW (activity ranking weight) of a is defined as:

$$\begin{aligned} ARW_a = \frac{BAW_a}{BAW_{max}} \times \frac{|a|}{OF_{max}}, \end{aligned} \tag{2}$$

where \(BAW_{max}\) stands for the largest BAW among the activities in L, |a| stands for the occurrence frequency of activity a in log L and \(OF_{max}\) represents the largest occurrence frequency among the activities in L. According to Eq. 2, the ARW of an activity consists of two parts. The first part is based on the BAW of this activity, while the second part reflects the influence of the activity on the fitness of the final mined model. In our method, the larger the ARW of an activity is, the higher the activity is ranked.
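The ranking induced by Eq. 2 can be sketched as follows. The function name `rank_activities`, the tie-breaking by label and the toy log are assumptions for illustration; the sketch also assumes a non-empty log.

```python
from collections import Counter


def rank_activities(log):
    """Return (arw, ranking): ARW per activity (Eq. 2) and activities sorted
    from highest to lowest ARW. Ties are broken by label, which is an
    assumption that Eq. 2 does not prescribe."""
    predecessors, successors, freq = {}, {}, Counter()
    for trace in log:
        freq.update(trace)  # occurrence frequency |a|
        for x, y in zip(trace, trace[1:]):
            successors.setdefault(x, set()).add(y)
            predecessors.setdefault(y, set()).add(x)
    baw = {a: len(predecessors.get(a, ())) + len(successors.get(a, ()))
           for a in freq}
    baw_max = max(baw.values()) or 1  # guard against logs with no pairs
    of_max = max(freq.values())
    arw = {a: (baw[a] / baw_max) * (freq[a] / of_max) for a in freq}
    ranking = sorted(arw, key=lambda a: (-arw[a], a))
    return arw, ranking


# Toy log (hypothetical, not the log L_1 of Fig. 1).
toy_log = ["ABCDE", "ACBDE", "ABDE"]
arw, ranking = rank_activities(toy_log)
```

In the toy log, B and C share the largest BAW (4), but C occurs in only two of the three traces, so its ARW drops to 2/3 and it is ranked below D, whose ARW is 0.75.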

2.4 Detection and Conversion of Inexpressible Process Behaviours

In this subsection, we first formalise a new concept called environment item. Then, a new method named DCIB is proposed.

Definition 4

(Environment Item). Let \(SA_L\) be the set of activities from event log L and let a, b and c be three activities from \(SA_L\). The tuple (b, c) is an environment item (EI) of activity a if \(\exists t \in L\) such that \(\langle b,\langle a\ldots \rangle ,c\rangle \sqsubseteq t\), where t stands for a trace from L and \(\langle a\ldots \rangle \) represents a sub-trace that consists only of activity a (one or more occurrences).

According to Definition 4, the activity F in the example log \(L_1\) has six EIs: (D,E), (H,I), (E,H), (I,G), (E,G) and (C,D).
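Definition 4 can be illustrated by scanning each trace for runs of the activity and recording the activities on either side. The sketch assumes that runs are taken to be maximal and that a run at the very start or end of a trace yields no EI, since one side of the environment is then missing; the function name and toy log are hypothetical.

```python
def environment_items(log, a):
    """Return the set of environment items (b, c) of activity `a`.

    Assumptions: runs of `a` are maximal, and a run that touches the start
    or the end of a trace contributes no environment item.
    """
    items = set()
    for trace in log:
        i, n = 0, len(trace)
        while i < n:
            if trace[i] != a:
                i += 1
                continue
            j = i
            while j < n and trace[j] == a:  # extend over the whole run of a
                j += 1
            if i > 0 and j < n:             # both neighbours exist
                items.add((trace[i - 1], trace[j]))
            i = j
    return items


# Toy log (hypothetical, not the log L_1 of Fig. 1).
toy_log = [["D", "F", "E"], ["H", "F", "F", "I"], ["F", "E"]]
```

The repeated F in the second toy trace is treated as one run, so it contributes the single EI (H, I), and the third trace contributes nothing because F has no left neighbour.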

Axiom 2

Converting an activity into a new activity under an appropriate environment item helps reduce the complex process behaviours caused by this activity.

Figure 4 shows an event log \(L_2\) generated by converting activity F under environment (D,E) into a new activity F1 and converting F under environment (H,I) into a new activity F0 in log \(L_1\). As illustrated in Fig. 4, the process model mined from the newly created log \(L_2\) has a much higher fitness than the model mined from \(L_1\). The main reason for such an improvement on fitness is that the conversion of activity F under environment (D,E) and (H,I) transforms the inexpressible (complex) process behaviours (related to F) for HM into expressible (simple) process behaviours.
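The conversion step itself can be sketched as a relabelling of the activity wherever it occurs under the chosen environment. The function name `convert_under_environment`, the maximal-run assumption and the toy log are illustrative; the new label is supplied by the caller.

```python
def convert_under_environment(log, a, ei, new_label):
    """Return a copy of `log` in which every run of `a` that sits directly
    between ei[0] and ei[1] is relabelled as `new_label`.

    Assumption: runs of `a` are maximal, as in the reading of Definition 4
    used for this sketch.
    """
    b, c = ei
    new_log = []
    for trace in log:
        out = list(trace)
        i, n = 0, len(trace)
        while i < n:
            if trace[i] != a:
                i += 1
                continue
            j = i
            while j < n and trace[j] == a:
                j += 1
            if i > 0 and j < n and trace[i - 1] == b and trace[j] == c:
                for k in range(i, j):
                    out[k] = new_label
            i = j
        new_log.append(out)
    return new_log


# Toy log (hypothetical, not the log L_1 of Fig. 1).
toy_log = [["D", "F", "E"], ["H", "F", "I"], ["C", "F", "D"]]
converted = convert_under_environment(toy_log, "F", ("D", "E"), "F1")
```

Only the occurrence of F between D and E is renamed to F1; the occurrences of F under the other environments are left untouched, and the input log is not modified.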

Fig. 4. The process model mined from newly generated log \(L_2\).

Fig. 5. The basic procedure for technique DCIB.

The DCIB method proposed in this subsection detects suitable EIs for a specific activity, i.e. EIs under which transforming the activity into new activities helps simplify the complex process behaviours caused by this activity. The storage structure of process behaviours in the PBS provides the basis for this function: each time it runs, DCIB discovers the qualified EIs for a certain activity by examining its MRSTs stored in the relevant PBS. Let’s take the activity F from log \(L_1\) as an example to explain the primary procedure of DCIB. Let \(\mathrm {S_F}\) stand for the set of MRSTs for activity F (the details of \(\mathrm {S_F}\) are exhibited in Fig. 3), \(\mathrm {v_1}\) represent the fitness value of the process model mined from \(\mathrm {S_F}\), SEI be a set of EIs, \(\alpha \) be a target fitness and \(\beta \) be a minimum fitness improvement threshold. As illustrated in part A of Fig. 5, DCIB contains three stages and two modules. In stage 1, it checks whether \(\mathrm {v_1}\) is less than the target fitness \(\alpha \). If it is not, DCIB stops because the negative influence of the inexpressible process behaviours related to activity F is acceptable. In stages 2 and 3 (which belong to \(\mathrm {Module{\!}{\!}-{\!}{\!}1}\)), DCIB searches for the best EI of activity F ((D,E) in our example) among all its EIs listed above. Converting F into a new activity under this EI generates a new set of MRSTs \(\mathrm {NS_F}\) for F such that the model mined from \(\mathrm {NS_F}\) has the largest fitness (i.e. \(\mathrm {v_2}\)) compared with the models mined from the sets of MRSTs generated by transforming F under its other EIs. Then the found EI (D,E) is removed from the original set of EIs for activity F. Part B of Fig. 5 shows the details of stage 2 of DCIB.

In \(\mathrm {Module{\!}{\!}-{\!}{\!}2}\), DCIB checks whether \(\mathrm {v_2}<\alpha \) and \(\mathrm {v_2}-\mathrm {v_1}\ge \beta \). If so, it puts the found EI (D,E) in SEI, replaces \(\mathrm {S_F}\) with \(\mathrm {NS_F}\) and continues running \(\mathrm {Module{\!}{\!}-{\!}{\!}1}\). If \(\mathrm {v_2}\ge \alpha \), DCIB stops because the EIs found so far are enough to reduce the negative influence of the complex process behaviours caused by F to the extent indicated by \(\alpha \). DCIB also stops if \(\mathrm {v_2}-\mathrm {v_1}<\beta \), because adding new activities helps improve the accuracy of the potential model but may also increase its complexity at the same time; it is not worth adding new activities if the model fitness cannot be improved by at least \(\beta \).
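The control flow of DCIB can be sketched as a greedy loop. This is one reading of the description, not the authors' implementation: mining and fitness evaluation are passed in as the hypothetical callbacks `convert` and `fitness`, an EI that reaches the target fitness with an improvement of at least \(\beta \) is assumed to be accepted before stopping, and \(\mathrm {v_1}\) is assumed to be updated to \(\mathrm {v_2}\) after each accepted EI. The gain values in the usage example are arbitrary placeholders, not results from the evaluation.

```python
def dcib(s_a, eis, convert, fitness, alpha, beta):
    """Greedy selection of environment items (one reading of DCIB).

    s_a     : current set of MRSTs of the activity (opaque to this function)
    eis     : candidate environment items of the activity
    convert : hypothetical callback, convert(s_a, ei) -> new set of MRSTs
    fitness : hypothetical callback, fitness(s_a) -> fitness of mined model
    """
    selected = []                 # plays the role of SEI
    remaining = list(eis)
    v1 = fitness(s_a)
    while v1 < alpha and remaining:          # stage 1 / stop once alpha is met
        # Module 1: find the EI whose conversion yields the highest fitness.
        scored = [(fitness(convert(s_a, ei)), ei) for ei in remaining]
        v2, best = max(scored, key=lambda pair: pair[0])
        remaining.remove(best)
        # Module 2: give up when the improvement is below beta.
        if v2 - v1 < beta:
            break
        selected.append(best)
        s_a = convert(s_a, best)
        v1 = v2
    return selected, s_a


# Stand-in callbacks with made-up gains, for illustration only.
gains = {("p", "q"): 0.15, ("r", "s"): 0.10, ("u", "v"): 0.01}
fitness = lambda applied: 0.70 + sum(gains[ei] for ei in applied)
convert = lambda applied, ei: applied | {ei}
selected, final_state = dcib(frozenset(), list(gains), convert, fitness,
                             alpha=1.0, beta=0.03)
```

With these placeholder gains the loop accepts two EIs and rejects the third, whose improvement of roughly 0.01 is below \(\beta = 0.03\).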

2.5 A Heuristic Method for Improving the Fitness of Mined Business Process Models (HIF)

In this subsection, we propose a heuristic method named HIF based on the discussions in the former subsections for improving the fitness of process models mined through HM. The details about HIF are shown in Algorithm 1.

figure a

Firstly, the number of activities in the given log L is stored in variable x (step 1). Then, HIF creates the PBS for log L (step 2) and a ranking list LRA for the activities in L (step 3) according to the method proposed in Subsect. 2.3. Next, HIF chooses the activity a with the highest ranking in LRA and removes a from LRA (step 5). The inexpressible process behaviours caused by a are handled first, i.e. HIF always gives priority to the activity that is expected to contribute most to the problem. Then, HIF searches for the qualified EIs for activity a through technique DCIB (introduced in Subsect. 2.4) and puts the found EIs in set SEI (step 6). Afterwards, for each environment item \(ei \in SEI\), HIF changes the activity a into a new activity under environment ei in log L (this action helps improve the fitness of the model mined from L, as demonstrated in the last subsection), removes ei from SEI and puts the newly generated activity in the set of activities \(SA_L\) for log L (steps 7 to 11). In HIF, a threshold \(\mu \) is used to limit the number of newly added activities because adding too many new activities might increase the complexity of the final model. If the number of newly added activities is larger than \(\mu \times x\), HIF stops (steps 11 and 12). The activity ranking procedure in step 3 ensures that the accuracy of the mined model is improved as much as possible under the limitation given by \(\mu \). Furthermore, if the fitness of the model mined from L is larger than or equal to the given target fitness \(\alpha \), HIF also stops (steps 11 and 12). Finally, HIF outputs a process model M with a higher fitness value.
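The outer loop of HIF, as described in the text, can be sketched as follows. This is an interpretation under stated assumptions rather than a transcription of Algorithm 1: the ranking is passed in precomputed, DCIB, the conversion and the fitness evaluation are the hypothetical callbacks `find_eis`, `convert` and `fitness`, new labels are formed by appending an index to the activity name, and the two stopping conditions are checked after every conversion.

```python
def hif(log, ranked_activities, find_eis, convert, fitness, alpha, mu):
    """One reading of the HIF loop.

    ranked_activities : activities sorted by ARW, best first (list LRA)
    find_eis          : hypothetical callback standing in for DCIB
    convert           : hypothetical callback, convert(log, a, ei, new_label)
    fitness           : hypothetical callback, fitness of model mined from log
    """
    x = len(ranked_activities)        # number of activities in the input log
    queue = list(ranked_activities)
    added = 0                         # number of newly created activities
    while queue:
        a = queue.pop(0)              # highest-ranked remaining activity
        for k, ei in enumerate(find_eis(log, a)):
            log = convert(log, a, ei, f"{a}{k}")
            added += 1
            # Stop when too many activities were added or alpha is reached.
            if added > mu * x or fitness(log) >= alpha:
                return log, added
    return log, added


# Stand-in callbacks; the "log" here is just the list of new labels.
stub_eis = {"X": [("p", "q"), ("r", "s")], "Y": [("u", "v")], "Z": []}
find_eis = lambda log, a: stub_eis[a]
convert = lambda log, a, ei, new_label: log + [new_label]
fitness = lambda log: 0.70 + 0.05 * len(log)
result, added = hif([], ["X", "Y", "Z"], find_eis, convert, fitness,
                    alpha=1.0, mu=0.5)
```

With three activities and \(\mu = 0.5\), the limit \(\mu \times x\) is 1.5, so the stub run stops after the second new activity has been added and never reaches the lower-ranked activities.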

3 Evaluation

In our experiment, the ICS fitness [4] is used for evaluating the accuracy of the mined models. The Extended Cardoso Metric (E-Cardoso) [5] and the Place/Transition Connection Degree (PT-CD) [2] are employed for evaluating the impact of our method on the complexity of the mined models. We tested the effectiveness of HIF on four real-life event logs: the repair log (Repair) from [1], the log of the loan and overdraft approvals process (LOA) from the Business Process Intelligence Challenge (BPIC) 2012, the log of Volvo IT incident and problem management (VIPM) from BPIC 2013 and the log of a CRM process (MCRM) from [2].

Table 1. Evaluation results for the proposed model fitness improvement method HIF.

In the experiment for HIF on the four logs, the target fitness \(\alpha \) is set to 1, the model fitness improvement threshold \(\beta \) is set to 0.03 and the threshold for the number of newly added activities \(\mu \) is set to 0.3. Table 1 shows the evaluation results. In Table 1, \(M{\!}{\!}-{\!}{\!}Repair\) represents the model generated by directly mining the original event log Repair and \(M{\!}{\!}-{\!}{\!}Repair_N\) stands for the model output by HIF (the same applies to the other models). It can be seen that HIF improves the fitness of the mined models to a large extent, while for most of the models output by HIF the precision and complexity stay within an acceptable range compared with the original models. The model \(M{\!}{\!}-{\!}{\!}Repair_N\) has only four more activities than the model \(M{\!}{\!}-{\!}{\!}Repair\), but its fitness is greatly improved (the same holds for the model \(M{\!}{\!}-{\!}{\!}VIPM_N\)). This benefit stems from the activity ranking method presented in Subsect. 2.3.

4 Conclusion

In this paper we proposed the technique HIF for helping improve the fitness of the models mined from event logs. The proposed technique is able to detect the inexpressible process behaviours recorded in event logs for HM and transform the found behaviours into expressible behaviours. As a result, more fitting process models can be generated. Through the evaluation results from Sect. 3 we demonstrated the effectiveness of HIF. Our future work will mainly be focused on adapting our method HIF to other BPMD techniques so as to help them mine better process models. In the meantime, we will also validate HIF on some other real-life cases.