
1 Introduction

Modern organizations are centered on the processes needed to deliver products and services in an efficient and effective manner. Organizations that operate at a higher process maturity level use formal/semiformal models (e.g., UML, EPC, BPMN and YAWL models) to document their processes. In some cases, these models are used to configure process-aware information systems (e.g., WFM or BPM systems). However, in most organizations process models are not used to enforce a particular way of working. Instead, process models are used for discussion, performance analysis (e.g., simulation), certification, process improvement, etc. Reality, however, may deviate from such models: people tend to focus on idealized process models that have little to do with reality. This illustrates the importance of conformance checking [1, 2, 12].

Conformance checking aims to verify whether the observed behavior recorded in an event log matches the intended behavior represented as a process model. The notion of alignments [2] provides a robust approach to conformance checking, which makes it possible to pinpoint the deviations causing nonconformity. An alignment between a recorded process execution and a process model is a pairwise matching between activities recorded in the log and activities allowed by the model. Sometimes, activities as recorded in the event log (events) cannot be matched to any of the activities allowed by the model (process activities). For instance, an activity is executed when not allowed. In this case, we match the event with a special null activity (hereafter, denoted as \(\gg \)), thus resulting in a so-called move on log. Other times, an activity should have been executed but is not observed in the event log. This results in a process activity that is matched to a \(\gg \) event, thus resulting in a so-called move on model.

Alignments are powerful artifacts to detect nonconformity between the observed behavior recorded in the event log and the prescribed behavior represented by process models. When an alignment between a log trace and a process model contains at least one move on log or on model, the log trace does not conform to the model. Moreover, moves on log/model indicate where the execution deviates, by pinpointing the deviations that caused the nonconformity.

In general, a large number of possible alignments exist between a process model and a log trace, since there may be many explanations of why a trace does not conform. Clearly, one is interested in finding what really happened. Adriansyah et al. [4] have proposed an approach based on the principle of Occam’s razor: the simplest and most parsimonious explanation is preferable. Therefore, one should not settle for an arbitrary alignment but find one of the alignments with the least expensive deviations (a so-called optimal alignment), according to some function assigning costs to deviations.

Existing alignment-based conformance checking techniques (e.g., [2, 4]) require process analysts to manually define a cost function based on their background knowledge and beliefs. The definition of such a cost function is fully based on human judgment and, thus, prone to imperfections. These imperfections ultimately lead to alignments that are optimal according to the provided cost function, but that do not provide an explanation of what really happened.

In this paper, we propose an alternative way to define a cost function, where human judgment is put aside and only objective factors are considered. The cost function is automatically constructed by looking at the logging data and, more specifically, at the past process executions that are compliant with the process model. The intuition behind this is that one should look at the past history of process executions and learn from it the probable explanations of nonconformity. In particular, probable explanations of nonconformity for a certain process execution can be obtained by analyzing the behavior observed for that execution in each and every state and the behavior observed for other conforming traces when they were in the same state. Our approach gives a potentially different cost to each move on model and on log (depending on the current state), leading to the definition of a more sensitive cost function.

The approach has been fully implemented as a software plug-in for the open-source process-mining framework ProM. To assess the practical relevance of our approach, we performed an evaluation using both synthetic and real event logs and process models. In particular, we tested it on a real-life case study about the management of road-traffic fines by an Italian town. The results show that our approach significantly improves the accuracy in determining probable explanations of nonconformity compared to existing techniques. Moreover, an analysis of the computation time shows the practical feasibility of our approach.

The paper is organized as follows. Section 2 introduces preliminary concepts. Section 3 provides the motivations for this work, discussing how the construction of optimal alignments should be kept independent of the reason why such alignments are constructed. Section 4 presents our approach for constructing optimal alignments. Section 5 presents experiment results, which are discussed in Sect. 6. Finally, Sect. 7 discusses related work and concludes the paper providing directions for future work.

2 Preliminaries

This section introduces the notation and preliminaries for our work.

2.1 Labeled Petri Nets, Event Logs, and Alignments

Process models describe how processes should be carried out. Many languages exist to model processes. Here, we use a simple formalism, which suffices for the purpose of this work:

Definition 1

(Labeled Petri Net). A Labeled Petri net is a tuple \((P,T,F,A,\ell ,m_i,m_f)\) where

  • P is a set of places;

  • T is a set of transitions;

  • \(F\subseteq (P\times T) \cup (T\times P)\) is the flow relation between places and transitions (and between transitions and places);

  • A is the set of labels for transitions;

  • \(\ell : T \rightarrow A\) is a function that associates a label with every transition in T;

  • \(m_i\) is the initial marking;

  • \(m_f\) is the final marking.

Hereafter, the simpler term Petri net is used to refer to Labeled Petri nets. The label of a transition identifies the activity represented by that transition. Multiple transitions can be associated with the same activity label, i.e. the same activity can be represented by multiple transitions; this is typically done to make the model simpler. Some transitions can be invisible. Invisible transitions do not correspond to actual activities but are necessary for routing purposes and, as such, their execution is never recorded in event logs. Given a Petri net N, \(\textsf {Inv}_N \subseteq A\) indicates the set of labels associated with invisible transitions. Note that invisible transitions are also associated with labels, although these labels do not represent activities. We assume that a label associated with a visible transition cannot also be associated with invisible ones and vice versa.

The state of a Petri net is represented by a marking, i.e. a multiset of tokens on the places of the net. A Petri net has an initial marking \(m_i\) and a final marking \(m_f\). When a transition is executed (i.e., fired), a token is taken from each of its input places and a token is added to each of its output places. A sequence of transitions \(\sigma _M\) leading from the initial to the final marking is a complete process trace. Given a Petri net N, \(\Gamma _N\) indicates the set of all complete process traces allowed by N.
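As a minimal sketch of Definition 1 and the firing rule (the data-structure names are illustrative and not taken from any process-mining library), markings can be represented as multisets of tokens:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PetriNet:
    places: set       # P
    transitions: set  # T
    flow: set         # F: pairs (place, transition) or (transition, place)
    labels: dict      # ell: transition -> activity label
    m_init: Counter   # initial marking m_i
    m_final: Counter  # final marking m_f

    def pre(self, t):
        """Input places of transition t."""
        return [x for (x, y) in self.flow if y == t and x in self.places]

    def post(self, t):
        """Output places of transition t."""
        return [y for (x, y) in self.flow if x == t and y in self.places]

    def enabled(self, marking, t):
        """t is enabled iff every input place holds at least one token."""
        return all(marking[p] >= 1 for p in self.pre(t))

    def fire(self, marking, t):
        """Fire t: consume one token per input place, produce one per output place."""
        assert self.enabled(marking, t), "transition not enabled"
        m = Counter(marking)
        for p in self.pre(t):
            m[p] -= 1
        for p in self.post(t):
            m[p] += 1
        return +m  # unary + drops zero counts
```

A complete process trace is then any sequence of transitions that, fired from m_init, ends in m_final.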

Example 1

Figure 1 shows a normative process, expressed as a Petri net, which encodes the Italian laws and procedures to manage road traffic fines [19]. A process execution starts by recording a traffic fine in the system and sending it to Italian residents. Traffic fines might be paid before or after they are sent out by police or received by the offenders. Offenders are allowed to pay the due amount in partial payments. If the total amount of the fine is not paid within 180 days, a penalty is added. Offenders may appeal against a fine to the prefecture and/or a judge. If an appeal is accepted, the fine management is closed. On the other hand, if the fine is not paid by the offender (and no appeal has been accepted), the process eventually terminates by handing over the case for credit collection.

Fig. 1. A process model for managing road traffic fines. The green boxes represent the transitions that are associated with process activities while the black boxes represent invisible transitions. The text below the transitions represents the label, which is shortened with a single letter as indicated inside the transitions (Color figure online).

Given a Petri net \(N=(P,T,F,A,\ell ,m_i,m_f)\), a log trace \(\sigma _L\in A^*\) is a sequence of events where each event records the firing of a transition. In particular, each event records the label of the transition that has fired. An event log \(\mathcal {L}\in \mathbb {B}(A^*)\) is a multiset of log traces, where \(\mathbb {B}(X)\) denotes the set of all multisets over X. Here we assume that the log contains no events for activities not in A; in practice, such events can occur, in which case they are filtered out before the event log is taken into consideration.

Not all log traces can be reproduced by a Petri net, i.e. not all log traces perfectly fit the process description. If a log trace perfectly fits the net, each “move” in the log trace, i.e. an event observed in the trace, can be mimicked by a “move” in the model, i.e. a transition fired in the net. After all events in the log trace are mimicked, the net reaches its final marking. In cases where deviations occur, some moves in the log trace cannot be mimicked by the net or vice versa. We explicitly denote “no move” by \(\gg \).

Definition 2

(Legal move). Let \(N=(P,T,F,A,\ell ,m_i,m_f)\) be a Petri net. Let \(S_L= (A \setminus \textsf {Inv}_N) \cup \{\gg \}\) and \(S_M=A\cup \{\gg \}\). A legal move is a pair \((m_{L},m_{M})\in (S_L\times S_M)\setminus \{(\gg ,\gg )\}\) such that

  • \((m_{L},m_{M})\) is a synchronous move if \(m_{L}\in S_L\), \(m_{M}\in S_M\) and \(m_{L}=m_{M}\),

  • \((m_{L},m_{M})\) is a move on log if \(m_{L}\in S_L\) and \(m_{M}=\gg \),

  • \((m_{L},m_{M})\) is a move on model if \(m_{L}=\gg \) and \(m_{M}\in S_M\).

\(\Sigma _N\) denotes the set of legal moves for a Petri net N.

In the remainder, we indicate that a sequence \(\sigma ^{\prime }\) is a prefix of a sequence \(\sigma ^{\prime \prime }\), denoted with \(\sigma '\in \mathsf {prefix}(\sigma '')\), if there exists a sequence \(\sigma '''\) such that \(\sigma ''=\sigma ' \oplus \sigma '''\), where \(\oplus \) denotes the concatenation operator.

Definition 3

(Alignment). Let \(\Sigma _N\) be the set of legal moves for a Petri net \(N=(P,T,F,A,\ell ,m_i,m_f)\). An alignment of a log trace \(\sigma _L\) and N is a sequence \(\gamma \in \Sigma _N^*\) such that, ignoring all occurrences of \(\gg \), the projection on the first element yields \(\sigma _L\) and the projection on the second element yields a sequence \(\langle a_1,\ldots ,a_n \rangle \) such that there exists a sequence \(\sigma '_P=\langle t_1,\ldots ,t_n \rangle \in \mathsf {prefix}(\sigma _P)\) for some \(\sigma _P \in \Gamma _N\) where, for each \(1 \le i \le n\), \(\ell (t_i)=a_i\). If \(\sigma '_P \in \Gamma _N\), \(\gamma \) is called a complete alignment of \(\sigma _L\) and N.

Figure 2 shows three possible complete alignments of a log trace \(\sigma _{1}=\langle c, s, n, t, o\rangle \) and the net in Fig. 1. The top row of an alignment shows the sequence of events in the log trace, and the bottom row shows the sequence of activities in the net (both ignoring \(\gg \)). Hereafter, we denote by \(\mid _L\) the projection of an alignment over the log trace and by \(\mid _P\) the projection over the net.

Fig. 2. Alignments of \(\sigma _{1}=\langle c, s, n, t, o\rangle \) and the process model in Fig. 1.

As shown in Fig. 2, there can be multiple possible alignments for a given log trace and process model. The quality of an alignment is measured based on a provided cost function \(K : \Sigma _N^* \rightarrow \mathbb {R}^+_0\), which assigns a cost to each alignment \(\gamma \in \Sigma _N^*\). Typically, the cost of an alignment is defined as the sum of the costs of the individual moves in the alignment. An optimal alignment of a log trace and a process model is one of the alignments with the lowest cost according to the provided cost function.

As an example, consider a cost function that assigns to any alignment a cost equal to the number of moves on log and model for visible transitions. If moves on model for invisible transitions \(i_k\) are ignored, \(\gamma _1\) has two moves on model, \(\gamma _2\) has one move on model and one move on log, and \(\gamma _3\) has one move on model and two moves on log. Thus, according to the cost function, \(\gamma _1\) and \(\gamma _2\) are two optimal alignments of \(\sigma _{1}\) and the process model in Fig. 1.
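To make this concrete, the following small Python sketch (illustrative, not the paper's implementation) encodes an alignment as a sequence of (log, model) move pairs and computes the unit-cost function of the example; the invisible-transition labels are placeholders:

```python
# An alignment is a list of (m_log, m_model) pairs; ">>" denotes "no move".
SKIP = ">>"
INVISIBLE = {"i1", "i2"}  # placeholder labels for invisible transitions

def unit_cost(alignment):
    """Counts moves on log and moves on model for visible transitions;
    synchronous moves and invisible model moves are free."""
    cost = 0
    for m_log, m_model in alignment:
        if m_log == m_model:
            continue                                # synchronous move
        if m_log == SKIP and m_model in INVISIBLE:
            continue                                # invisible transition
        cost += 1                                   # move on log or on model
    return cost

# A hypothetical alignment of sigma_1 with one move on model and one
# move on log (as gamma_2 in Fig. 2) costs 2:
gamma = [("c", "c"), ("s", "s"), ("n", "n"), (SKIP, "p"), ("t", SKIP), ("o", "o")]
assert unit_cost(gamma) == 2
```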

2.2 State Representation

At any point in time, the execution of a sequence of activities leads to some state, and this state depends on which activities have been performed and in which order. Accordingly, any process execution can be mapped onto a state. As discussed in [3], a state representation function takes care of this mapping:

Definition 4

(State Representation). Let A be a set of activity labels and R the set of possible state representations of the sequences in \(A^*\). A state representation function \(\mathsf {abst}: A^* \rightarrow R\) produces a state representation \(\mathsf {abst}(\sigma )\) for each (partial) process trace \(\sigma \in A^*\).

Several state-representation functions can be defined. Each function leads to a different abstraction, meaning that multiple different traces can be mapped onto the same state, thus abstracting away certain characteristics of the traces. Next, we provide some examples of state-representation functions:

  • Sequence abstraction. It is a trivial mapping where the abstraction preserves the order of activities. Each trace is mapped onto a state that is the trace itself, i.e. for each \(\sigma \in A^*\), \(\mathsf {abst}(\sigma )=\sigma \).

  • Multi-set abstraction. The abstraction preserves the number of times each activity is executed. This means that, for each \(\sigma \in A^*\), \(\mathsf {abst}(\sigma )=M \in \mathbb {B}(A)\) such that, for each \(a \in A\), M contains all instances of a in \(\sigma \).

  • Set abstraction. The abstraction preserves whether each activity has been executed or not. This means that, for each \(\sigma \in A^*\), \(\mathsf {abst}(\sigma )=M \subseteq A\) such that, for each \(a \in A\), M contains a if it ever occurs in \(\sigma \).

Example 2

Table 1 shows the state representation of some process traces of the net in Fig. 1 using different abstractions. For instance, trace \(\langle c,p,p,s,n\rangle \) can be represented as the trace itself using the sequence abstraction, as state \(\{c(1),p(2),s(1),n(1)\}\) using the multi-set abstraction (in parentheses, the number of occurrences of each activity in the trace), and as \(\{c,p,s,n\}\) using the set abstraction. Traces \(\langle c, p, s, n\rangle \) and \(\langle c, p, p, s, n, p\rangle \) are also mapped to state \(\{c,p,s,n\}\) using the set abstraction.

Table 1. Examples of state representation using different abstractions
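As a minimal sketch (assuming traces are given as tuples of activity labels), the three abstractions can be written as follows; the final assertions replay the collapse observed in Example 2:

```python
from collections import Counter

def abst_sequence(trace):
    """Sequence abstraction: the state is the trace itself."""
    return tuple(trace)

def abst_multiset(trace):
    """Multi-set abstraction: how many times each activity occurred."""
    return frozenset(Counter(trace).items())

def abst_set(trace):
    """Set abstraction: which activities occurred at all."""
    return frozenset(trace)

# The traces of Example 2 collapse onto the same state under the set
# abstraction but remain distinct under the other two abstractions.
t1, t2 = ("c", "p", "s", "n"), ("c", "p", "p", "s", "n", "p")
assert abst_set(t1) == abst_set(t2) == frozenset({"c", "p", "s", "n"})
assert abst_multiset(t1) != abst_multiset(t2)
assert abst_sequence(t1) != abst_sequence(t2)
```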

3 Constructing Optimal Alignments Is Purpose Independent

As discussed in Sect. 2.1, the quality of an alignment is determined with respect to a cost function. An optimal alignment provides the simplest and most parsimonious explanation with respect to the used cost function. Therefore, the choice of the cost function has a significant impact on the computation of optimal alignments.

Typically, process analysts define a cost function based on the context of use and the purpose of the analysis. For instance, Adriansyah et al. [7] study various ratios between the cost of moves on model and moves on log, and analyze their influence on the fitness of a trace with respect to a process model. The work in [5, 6] uses alignments to identify nonconforming user behavior and quantify it with respect to a security perspective. In particular, the cost of deviations is determined in terms of which activity was executed, which user executed the activity along with its role, and which data have been accessed.

Existing alignment-based techniques make the implicit assumption that the obtained optimal alignments represent the most plausible explanations of what actually happened. However, they do not account for the fact that the use of different cost functions can yield different optimal alignments, thus resulting in inconsistent diagnostic information. The following example provides a concrete illustration of this issue.

Example 3

Consider the fine management process presented in Fig. 1 and the log trace \(\sigma _2=\langle c,s,a,d\rangle \). Suppose an analyst has to analyze \(\sigma _2\) with respect to both fitness, in order to verify to what extent log traces comply with the behavior prescribed by the process model, and the information provided to citizens, in order to minimize the number of complaints and legal disputes. To this end, the analyst defines two cost functions, presented in Fig. 3a. Cost function \(c_1\) defines the cost of deviations in terms of fitness. In particular, we use the cost function presented in [7] which defines a ratio between the cost of moves on log and the cost of moves on model equal to 5:1 for all activities. On the other hand, cost function \(c_2\) defines the cost of deviations in terms of user satisfaction. Here, deviations concerning payment have low cost. On the other hand, the missed delivery of the fine or notification has a high cost. The optimal alignments obtained using cost functions \(c_1\) and \(c_2\) are given in Fig. 3b and c respectively.

Fig. 3. Inconsistent explanations of nonconformity due to the use of different cost functions.

Based on the example above, an interesting question arises: which alignments should the analyst take as a plausible explanation of what happened? The alignments in Fig. 3b and c are supposed to be equally plausible explanations, but with respect to different criteria. Our claim is that, although alignments provide a robust approach to conformance checking, it is necessary to rethink how cost functions are defined and, in general, how alignment-based techniques should be applied in practice.

This paper starts from the belief that the construction of an optimal alignment is independent of the purpose of the analysis. An optimal alignment should provide probable explanations of nonconformity, independently of why we are interested in knowing them. Therefore, first, an alignment providing probable explanations of what actually happened has to be constructed (hereafter, we refer to such an alignment as a probable alignment). Later, this alignment is analyzed according to the purpose of the analysis.

This separation of concerns can be achieved by employing two cost functions: a first cost function to find probable alignments, and a second cost function, customized according to the purpose of use, to quantify the severity of the deviations in the computed alignments. In the remainder of this paper, we discuss how to construct a cost function that provides probable explanations of what actually happened. The second, purpose-dependent cost function is beyond the scope of this paper.

4 History-Based Construction of Probable Alignments

This section presents our approach to construct alignments that give probable explanations of deviations based on objective facts, i.e. the historical logging data, rather than on subjective cost functions manually defined by process analysts. To construct an optimal alignment between a process model and an event log, we use the A-star algorithm [13], analogously to what is proposed in [4].

Section 4.1 discusses how the cost of an alignment is computed, and Sect. 4.2 briefly reports on the use of A-star to compute probable alignments.

4.1 Definition of Cost Functions

The computation of probable alignments relies on a cost function that accounts for the probability that an activity is executed in a certain state. The definition of such a cost function requires an analysis of the past history recorded in the event log, to compute the probability that an activity immediately occurs, or never eventually occurs, when the process execution is in a certain state.

The A-star algorithm [13] finds an optimal path from a source node to a target node, where optimal means minimal cost. In our context, moves associated with activities whose execution is probable in a given state should have a low cost, whereas moves associated with activities whose execution is unlikely in that state should have a high cost. Therefore, probabilities cannot be used directly as costs of moves. For this purpose, we introduce a class of functions \(\mathcal {F} \subseteq [0,1] \rightarrow \mathbb {R}^+\) that map probabilities to costs of moves. Based on the restrictions imposed by the A-star algorithm on the choice of the cost function, a function f belongs to \(\mathcal {F}\) if and only if \(f(0)=\infty \) and f is monotonically decreasing between 0 and 1 (with \(f(1)>0\)). Hereafter, these functions are called cost profiles. Intuitively, a cost profile is used to compute the cost of a legal move based on the probability that a given activity occurs when the process execution is in a given state. Below, we provide some examples of cost profiles:

$$\begin{aligned} \begin{array}{l} f(p)=\frac{1}{p} \quad \quad f(p)=\frac{1}{\sqrt{p}} \quad \quad f(p)=1+\log \left( \frac{1}{p}\right) \end{array} \end{aligned}$$
(1)

The choice of the cost profile has a significant impact on the computation of alignments (see Sect. 6). For instance, the first cost profile in Eq. 1 favors alignments with more frequent traces, whereas the last cost profile is more sensitive to the number of deviations in the computed alignments. In Sect. 5, we evaluate these sample cost profiles with different combinations of event logs and process models. The purpose is to verify whether one cost profile universally works better than the others.
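For concreteness, the three sample cost profiles of Eq. 1 can be coded as below; this is a sketch, and the logarithm base (not fixed by Eq. 1) is assumed to be natural:

```python
import math

def profile_inverse(p):        # f(p) = 1/p
    return math.inf if p == 0 else 1.0 / p

def profile_inverse_sqrt(p):   # f(p) = 1/sqrt(p)
    return math.inf if p == 0 else 1.0 / math.sqrt(p)

def profile_log(p):            # f(p) = 1 + log(1/p)
    return math.inf if p == 0 else 1.0 + math.log(1.0 / p)

# For a rare activity (p = 0.01) the three profiles charge 100, 10 and
# about 5.6 respectively: they penalize improbable moves to very
# different degrees (cf. Sect. 6).
```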

Similarly to what is proposed in [4], the cost of an alignment move depends on the move type and the activity involved in the move. However, differently from [4], it also depends on the position in which the move is inserted:

Definition 5

(Cost of an alignment move). Let \(\Sigma _N\) be the set of legal moves for a Petri net N. Let \(\gamma \in \Sigma _N^*\) be a sequence of legal moves for N and \(f \in \mathcal {F}\) a cost profile. The cost of appending a legal move \((m_L,m_M) \in \Sigma _N\) to \(\gamma \) with state-representation function \(\mathsf {abst}\) is:

$$\begin{aligned} \begin{array}{l} \kappa _{\mathsf {abst}}((m_L,m_M),\gamma )=\\ \quad \left\{ \begin{array}{ll} 0 &{} m_L = m_M \\ 0 &{} m_L = \gg \text{ and } m_M \in \textsf {Inv}_N \\ f\big (P_{\mathsf {abst}}(m_M \text { occurs after } \gamma \mid _P)\big ) &{} m_L = \gg \text{ and } m_M \not \in \textsf {Inv}_N \\ f\big (P_{\mathsf {abst}}(m_L \text { never eventually occurs after } \gamma \mid _P)\big ) &{} m_M = \gg \\ \end{array} \right. \end{array} \end{aligned}$$
(2)

Readers can observe that the cost of a move on log \((m_L,\gg )\) is not simply based on the probability of not executing activity \(m_L\) immediately after \(\gamma \mid _P\); rather, it is based on the probability of never observing activity \(m_L\) at any moment in the future of that execution. This is motivated by the fact that a move on log \((m_L,\gg )\) indicates that \(m_L\) is not expected to ever occur in the future. Conversely, if it were expected, a number of moves on model would be introduced until the process model, modeled as a Petri net, reaches a marking that allows \(m_L\) to occur (and, thus, a synchronous move could be appended).

For a reliable computation of probabilities, we only use the subset of traces \(\mathcal {L}_{fit}\) of the original event log \(\mathcal {L}\) that fit the process model. We believe that, in many process analyses, it is realistic to assume that a substantial number of traces are compliant. For instance, this is the case for the real-life process about road-traffic fine management discussed in Sect. 5.2.

One may argue that some paths in the process model can be more prone to compliance errors than other paths; eliminating all non-fitting traces from the log would then lead to underestimating the probability of executing activities on such paths. We argue that the reasons for nonconformity should be carefully investigated. For instance, frequent cases of nonconformity on a certain path may indicate that the process model does not reflect reality [15, 24]. Ideally, an analyst should revise the process model and then use the new model to identify the set of fitting traces. This problem, however, is orthogonal to the current work and can be addressed using techniques for process repairing [14]. In this work, we assume that the process model is complete and accurately defines the business process. If, on the other hand, the process model correctly reflects reality, it is not obvious that non-fitting traces should be used to compute the cost function: the resulting cost function would be biased by behavior that should not be permitted. Moreover, using error correction methods may lead to the problem of overfitting the training set [16]. Based on these considerations, we only use fitting traces as historical logging data.

The following two definitions describe how to compute the probabilities required by Definition 5.

Definition 6

(Probability that an activity occurs). Let \(\mathcal {L}\) be an event log and \(\mathcal {L}_{fit}\subseteq \mathcal {L}\) the subset of traces that comply with a given process model represented by a Petri net \(N=(P,T,F,A,\ell ,m_i,m_f)\). The probability that an activity \(a \in A\) occurs after executing \(\sigma \) with state-representation function \(\mathsf {abst}\) is the ratio between the number of traces in \(\mathcal {L}_{fit}\) in which activity a is executed after reaching state \(\mathsf {abst}(\sigma )\) and the total number of traces in \(\mathcal {L}_{fit}\) that reach state \(\mathsf {abst}(\sigma )\):

$$\begin{aligned} \begin{array}{l} P_{\mathsf {abst}}(a \text { occurs after } \sigma ) = \frac{|\{\sigma ' \in \mathcal {L}_{fit}\;:\;\exists \sigma '' \in \mathsf {prefix}(\sigma ').\; \mathsf {abst}(\sigma '')=\mathsf {abst}(\sigma ) \wedge \sigma ''\oplus \langle a \rangle \in \mathsf {prefix}(\sigma ') \}|}{|\{\sigma '\in \mathcal {L}_{fit}\;:\; \exists \sigma '' \in \mathsf {prefix}(\sigma ').\; \mathsf {abst}(\sigma '')=\mathsf {abst}(\sigma )\}|} \end{array} \end{aligned}$$
(3)

Definition 7

(Probability that an activity never eventually occurs). Let \(\mathcal {L}\) be an event log and \(\mathcal {L}_{fit}\subseteq \mathcal {L}\) the subset of traces that comply with a given process model represented by a Petri net \(N=(P,T,F,A,\ell ,m_i,m_f)\). The probability that an activity \(a \in A\) will never eventually occur in a process execution after executing \(\sigma \in A^*\) with state-representation function \(\mathsf {abst}\) is the ratio between the number of traces in \(\mathcal {L}_{fit}\) in which a is never eventually executed after reaching state \(\mathsf {abst}(\sigma )\) and the total number of traces in \(\mathcal {L}_{fit}\) that reach state \(\mathsf {abst}(\sigma )\):

$$\begin{aligned} \begin{array}{l} P_{\mathsf {abst}}(a \text { never eventually occurs after } \sigma ) =\\ \quad \quad \quad \ \frac{|\{\sigma ' \in \mathcal {L}_{fit}\;:\;\exists \sigma '' \in \mathsf {prefix}(\sigma ').\; \mathsf {abst}(\sigma '')=\mathsf {abst}(\sigma ) \wedge \lnot \exists \sigma '''.\; \sigma ''\oplus \sigma '''\oplus \langle a \rangle \in \mathsf {prefix}(\sigma ') \}|}{|\{\sigma '\in \mathcal {L}_{fit}\;:\; \exists \sigma '' \in \mathsf {prefix}(\sigma ').\; \mathsf {abst}(\sigma '')=\mathsf {abst}(\sigma )\}|} \end{array} \end{aligned}$$
(4)

Intuitively, \(P_{\mathsf {abst}}(a\ \text {occurs after } \sigma )\) and \(P_{\mathsf {abst}}(a\ \text {never eventually occurs after } \sigma )\) are conditional probabilities. Given two events A and B, the conditional probability of A given B is defined as the quotient of the probability of the conjunction of events A and B, and the probability of B:

$$\begin{aligned} P(A|B)=\frac{P(A\cap B)}{P(B)} \end{aligned}$$
(5)

It is easy to verify that Eq. 3 coincides with Eq. 5 where A represents that activity a is executed, B that trace \(\sigma \) is executed, and \(A\cap B\) that \(\sigma \oplus \langle a\rangle \) is executed. Similar observations hold for Eq. 4.
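A direct, unoptimized reading of Definitions 6 and 7 is sketched below. It assumes \(\mathcal {L}_{fit}\) is given as a list of fitting traces (tuples of activity labels, repeated according to their frequency) and abst is one of the state-representation functions sketched in Sect. 2.2:

```python
def prefixes(trace):
    """All prefixes of a trace, from the empty one to the trace itself."""
    return [trace[:i] for i in range(len(trace) + 1)]

def p_occurs_after(a, sigma, log_fit, abst):
    """Eq. 3: among traces reaching state abst(sigma), the fraction in
    which activity a occurs immediately after reaching that state."""
    target = abst(sigma)
    reach = hit = 0
    for trace in log_fit:
        matching = [p for p in prefixes(trace) if abst(p) == target]
        if not matching:
            continue
        reach += 1
        if any(len(p) < len(trace) and trace[len(p)] == a for p in matching):
            hit += 1
    return hit / reach if reach else 0.0  # unseen state: probability 0

def p_never_after(a, sigma, log_fit, abst):
    """Eq. 4: among traces reaching state abst(sigma), the fraction in
    which activity a never occurs after reaching that state."""
    target = abst(sigma)
    reach = never = 0
    for trace in log_fit:
        matching = [p for p in prefixes(trace) if abst(p) == target]
        if not matching:
            continue
        reach += 1
        if any(a not in trace[len(p):] for p in matching):
            never += 1
    return never / reach if reach else 0.0  # unseen state: probability 0
```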

The cost of an alignment is the sum of the cost of all moves in the alignment, which are computed as described in Definition 5:

Definition 8

(Cost of an alignment). Let \(\Sigma _N\) be the set of legal moves for a Petri net N. The cost of alignment \(\gamma \in \Sigma _N^*\) with state-representation function \(\mathsf {abst}\) is computed as follows:

$$\begin{aligned} K_{\mathsf {abst}}(\gamma \oplus (m_L,m_M))= \left\{ \begin{array}{ll} \kappa _{\mathsf {abst}}((m_L,m_M),\langle \rangle )\quad \quad &{} \gamma =\langle \rangle \\ \kappa _{\mathsf {abst}}((m_L,m_M),\gamma )+K_{\mathsf {abst}}(\gamma )\quad \quad &{} \text {otherwise} \\ \end{array} \right. \end{aligned}$$
(6)

Hereafter, the term probable alignment is used to denote any of the optimal alignments (i.e., alignments with the lowest cost) according to the cost function given in Definition 8.
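Combining Definitions 5 and 8, a sketch of the resulting cost computation follows; it reuses the probability functions and cost profiles sketched above, the SKIP and INVISIBLE placeholders from Sect. 2.1, and a cost profile f:

```python
def model_projection(alignment):
    """gamma|_P: the model side of an alignment, ignoring >>."""
    return tuple(m for (_, m) in alignment if m != SKIP)

def move_cost(move, gamma, log_fit, abst, f):
    """Eq. 2: the cost of appending one legal move to alignment gamma."""
    m_log, m_model = move
    sigma = model_projection(gamma)            # the state reached so far
    if m_log == m_model:
        return 0.0                             # synchronous move
    if m_log == SKIP and m_model in INVISIBLE:
        return 0.0                             # invisible model move
    if m_log == SKIP:                          # move on model
        return f(p_occurs_after(m_model, sigma, log_fit, abst))
    return f(p_never_after(m_log, sigma, log_fit, abst))  # move on log

def alignment_cost(alignment, log_fit, abst, f):
    """Eq. 6, unrolled: each move is charged given the prefix before it."""
    return sum(move_cost(move, alignment[:i], log_fit, abst, f)
               for i, move in enumerate(alignment))
```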

4.2 The Use of the A-Star Algorithm to Construct Alignments

The A-star algorithm [13] aims to find a path in a graph V from a given source node \(v_0\) to any node \(v \in V\) in a target set. Every node v of graph V is associated with a cost determined by an evaluation function \(f(v) = g(v) + h(v)\), where

  • \(g : V \rightarrow \mathbb {R}^+_0\) is a function that returns the cost of the smallest path from \(v_0\) to v;

  • \(h : V \rightarrow \mathbb {R}^+_0\) is a heuristic function that estimates the cost of the path from v to its preferred target node.

Function h is said to be admissible if it returns a value that underestimates the distance of a path from a node \(v'\) to its preferred target node \(v''\), i.e. \(g(v')+h(v') \le g(v'')\). If h is admissible, A-star finds a path that is guaranteed to have the overall lowest cost.

The A-star algorithm keeps a priority queue of nodes to be visited: higher priority is given to nodes with lower costs. The algorithm works iteratively: at each step, the node v with lowest cost is taken from the priority queue. If v belongs to the target set, the algorithm ends returning node v. Otherwise, v is expanded: every successor \(v'\) is added to the priority queue with a cost \(f(v')\).

We employ A-star to find any of the optimal alignments between a log trace \(\sigma _L \in \mathcal {L}\) and a Petri net N. In order to apply A-star, a suitable search space needs to be defined. Every node \(\gamma \) of the search space V is associated with a different alignment that is a prefix of some complete alignment of \(\sigma _L\) and N. Since this association between search-space nodes and alignments is one-to-one, we use the alignment to refer to the associated node. The source node is the empty alignment \(\gamma _0 = \langle \rangle \) and the set of target nodes includes every complete alignment of \(\sigma _L\) and N.

Let us denote the length of a sequence \(\sigma \) with \(\Vert \sigma \Vert \). Given a node/alignment \(\gamma \in V\), the search-space successors of \(\gamma \) include all alignments \(\gamma ' \in V\) obtained from \(\gamma \) by concatenating exactly one move. Given an alignment \(\gamma \in V\), the cost of the path from the initial node to node \(\gamma \in V\) is:

$$\begin{aligned} g(\gamma ) = \Vert \gamma \mid _L \Vert + K(\gamma ). \end{aligned}$$

where \(K(\gamma )\) is the cost of alignment \(\gamma \) according to Definition 8. It is easy to check that, given two complete alignments \(\gamma '_C\) and \(\gamma ''_C\), \(K(\gamma '_C) < K(\gamma ''_C)\) iff \(g(\gamma '_C) < g(\gamma ''_C)\) and \(K(\gamma '_C) = K(\gamma ''_C)\) iff \(g(\gamma '_C) = g(\gamma ''_C)\). Therefore, an optimal solution returned by A-star coincides with an optimal alignment.

The time complexity of A-star depends on the heuristic used to find an optimal solution. In this work, we use the term \(\Vert \sigma _L \Vert \) to define an admissible heuristic, which therefore does not affect the optimality of the returned solutions. Given an alignment \(\gamma \in V\), we employ the heuristic:

$$\begin{aligned} h(\gamma ) = \Vert \sigma _L\Vert - \Vert \gamma \mid _L \Vert . \end{aligned}$$

For alignment \(\gamma \), the number of steps to add in order to reach a complete alignment is lower bounded by the number of execution steps of trace \(\sigma _L\) that have not been included yet in the alignment, i.e. \(\Vert \sigma _L\Vert - \Vert \gamma \mid _L \Vert \). Since the additional cost to traverse a single node is at least 1, the cost to reach a target node is at least \(h(\gamma )\), corresponding to the case where the part of the log trace that still needs to be included in the alignment perfectly fits.
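The sketch below assembles the search, under two simplifying assumptions: the model is represented by its (finite) set of complete process traces \(\Gamma _N\) rather than by a Petri net, so the enabled activities are found by a prefix lookup, and invisible transitions are ignored. The functions g and h follow the formulas above; cost_of_move is expected to wrap the move-cost function of Definition 5, e.g. lambda m, g: move_cost(m, g, log_fit, abst_sequence, profile_log):

```python
import heapq, itertools

def next_activities(model_prefix, Gamma_N):
    """Activities that can extend the current model projection."""
    n = len(model_prefix)
    return {t[n] for t in Gamma_N if len(t) > n and t[:n] == model_prefix}

def probable_alignment(sigma_L, Gamma_N, cost_of_move):
    tie = itertools.count()  # tie-breaker so the heap never compares alignments
    heap = [(len(sigma_L), next(tie), 0.0, ())]   # (g + h, tie, K, gamma)
    while heap:
        _, _, K, gamma = heapq.heappop(heap)
        logged = sum(1 for (l, _) in gamma if l != SKIP)   # ||gamma|_L||
        model = model_projection(gamma)
        if logged == len(sigma_L) and model in Gamma_N:
            return gamma                        # complete alignment reached
        moves = []
        if logged < len(sigma_L):
            e = sigma_L[logged]                 # next log event to explain
            if e in next_activities(model, Gamma_N):
                moves.append((e, e))            # synchronous move
            moves.append((e, SKIP))             # move on log
        for a in next_activities(model, Gamma_N):
            moves.append((SKIP, a))             # move on model
        for move in moves:
            K2 = K + cost_of_move(move, gamma)
            logged2 = logged + (move[0] != SKIP)
            g = logged2 + K2                    # g = ||gamma|_L|| + K(gamma)
            h = len(sigma_L) - logged2          # admissible heuristic
            heapq.heappush(heap, (g + h, next(tie), K2, gamma + (move,)))
    return None  # no complete alignment exists (e.g., empty Gamma_N)
```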

Fig. 4. Construction of the alignment of log trace \(\sigma _{3}=\langle c, s, n, l,o\rangle \) and the net in Fig. 1. Costs of moves are computed with the sequence state-representation function, cost profile , and \(\mathcal {L}_{fit}\) in Table 1.

Example 4

Consider a log trace \(\sigma _{3}=\langle c, s, n, l,o\rangle \) and the net N in Fig. 1. An analyst wants to determine probable explanations of nonconformity by constructing probable alignments of \(\sigma _{3}\) and N, based on historical logging data. In particular, \(\mathcal {L}_{fit}\) consists of the traces in Table 1 (the first column shows the traces, and the second the number of occurrences of each trace in the history). Assume that the A-star algorithm has constructed an optimal alignment \(\gamma \) of trace \(\langle c, s, n\rangle \in \mathsf {prefix}(\sigma _{3})\) and N (left part of Fig. 4). The next event in the log trace (i.e., l) cannot be replayed in the net. Therefore, the algorithm should determine which move is the most likely to have occurred. Different moves are possible; for instance, a move on log for l, a move on model for p, a move on model for t, etc. The algorithm computes the cost of these moves using Eq. 2 (right part of Fig. 4). As move on model \((\gg , p)\) is the move with the least cost (and no other alignments have lower cost), alignment \(\gamma ' = \gamma \oplus (\gg , p)\) is selected for the next iteration. It is worth noting that activity d never occurs after \(\langle c, s, n\rangle \) in \(\mathcal {L}_{fit}\); consequently, the cost of move \((\gg ,d)\) is equal to \(\infty \).

5 Implementation and Experiments

We have implemented our approach for history-based construction of alignments as a plug-in of the nightly-build version of the ProM framework (http://www.promtools.org/prom6/nightly/). The plug-in takes as input a process model and two event logs. It computes probable alignments for each trace in the first event log with respect to the process model based on the frequency of the traces in the second event log (the historical logging data). The output of the plug-in is a set of alignments and can be used by other plug-ins for further analysis. A screenshot of the plug-in is shown in Fig. 5. In particular, the figure shows the result of aligning a few sample event traces with the net in Fig. 1.

Fig. 5. Screenshot of the implemented approach in ProM, showing the probable alignments constructed between log traces and the process model in Fig. 1.

To assess the practical feasibility and accuracy of the approach, we performed a number of experiments using both synthetic and real-life logs. In the experiments with synthetic logs, we assumed that the execution of an activity depends on the activities that were performed in the past. In the experiments with real-life logs, we tested whether this assumption holds in real applications. Accordingly, the real-life logs were used as historical logging data. To evaluate the approach, we artificially added noise to the traces used for the experiments. This was necessary to assess the ability of the approach to reconstruct the original traces. The experiments were performed using a machine with a 3.4 GHz Intel Core i7 processor and 16 GB of memory.

5.1 Synthetic Data

For the experiments with synthetic data, we used the process for handling credit requests in [19]. Based on this model, we generated 10000 traces consisting of 69504 events using CPN Tools (http://cpntools.org). To assess the accuracy of the approach, we manipulated 20 % of these traces by introducing different percentages of noise. In particular, given a trace, we added and removed a number of activities to/from the trace equal to the same percentage of the trace length. The other traces were used as historical logging data. We computed probable alignments of the manipulated traces and the process model, and evaluated the ability of the approach to reconstruct the original traces. To this end, we measured the percentage of correct alignments (i.e., the cases where the projection of an alignment over the process coincides with the original trace) and computed the overall Levenshtein distance [17] between the original traces and the projection of the computed alignments over the process. The Levenshtein distance is a string metric that measures the distance between two sequences, i.e. the minimal number of changes required to transform one sequence into the other. In our setting, it provides an indication of how close the projection of the computed alignments over the process is to the original traces.
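As an illustration of this evaluation step (the randomization details are assumptions, since the text does not fix where activities are added or removed), a sketch of the noise injection and the Levenshtein metric follows:

```python
import random

def add_noise(trace, pct, alphabet, rng=random):
    """Remove and then insert round(pct * len(trace)) activities at
    random positions; insertions draw uniformly from the alphabet."""
    k = round(len(trace) * pct)
    t = list(trace)
    for _ in range(k):
        if t:
            t.pop(rng.randrange(len(t)))
    for _ in range(k):
        t.insert(rng.randrange(len(t) + 1), rng.choice(sorted(alphabet)))
    return tuple(t)

def levenshtein(s, t):
    """Minimal number of insertions, deletions and substitutions turning
    s into t (classic two-row dynamic program)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute a by b
        prev = cur
    return prev[-1]

# E.g. dropping one activity yields distance 1:
assert levenshtein(("c", "s", "n", "t", "o"), ("c", "s", "n", "o")) == 1
```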

We tested our approach with different amounts of noise (i.e., 10 %, 20 %, 30 % and 40 % of the trace length), with the three cost profiles in Eq. 1, and with different state-representation functions (i.e., sequence, multi-set, and set). Moreover, we compared our approach with existing alignment-based conformance checking techniques. In particular, we used the standard cost function introduced in [4]. We repeated each experiment five times. Table 2 shows the results, where every entry reports the average over the five runs.

Table 2. Results of experiments on synthetic data. CA indicates the percentage of correct alignments, and LD indicates the overall Levenshtein distance between the original traces and the projection of the alignments over the process. For comparison with existing approaches, the standard cost function as defined in [4] was used. The best results for each amount of noise are highlighted in bold.

The results show that cost profiles in combination with sequence and multi-set abstractions are able to better identify what really happened, i.e. they align the manipulated traces with the corresponding original traces in more cases (CA). In all cases, cost profile with sequence state-representation function provides more accurate diagnostics (LD): even if log traces are not aligned to the original traces, the projections over the process of alignments constructed using this cost profile and abstraction are closer to the original traces. Compared to the cost function used in [4], our approach computed the correct alignment for 4.4 % more traces when cost profile and sequence state-representation function are used. In particular, our approach correctly reconstructed the original trace for 18.4 % of the traces that were not correctly reconstructed using the cost function of [4]. Moreover, an analysis of LD shows that, on average, the traces reconstructed using our approach have 0.37 deviations (compared to the original traces), while the traces reconstructed using the cost function of [4] have 0.45 deviations. This corresponds to an improvement of LD of about 15.2 %.

5.2 Real-Life Logs

To evaluate the applicability of our approach to real-life scenarios, we used an event log obtained from a fine management system of the Italian police [19]. The process model, in the form of a Petri net, is presented in Fig. 1. We extracted a log consisting of 142408 traces and 527549 events, in which all traces conform to the net. To these traces, we applied the same methodology used for the experiments reported in Sect. 5.1. We repeated the experiments five times. Table 3 shows the results, where every entry reports the average over the five runs.

Table 3. Results of experiments on real-life data. Notation analogous to Table 2.

The results confirm that cost profiles in combination with sequence and multi-set state-representation functions provide the most accurate diagnostics (both CA and LD). Moreover, the results show that our approach (regardless of the cost profile and state-representation function used) performs better than the cost function in [4] on real-life logs. In particular, using the sequence state-representation function and cost profile , our approach computed the correct alignment for 1.8 % more traces than the cost function in [4]. Although this may not seem a significant improvement, it is worth noting that the cost function in [4] already reconstructs most of the traces (98 % and 97 % of the traces for 10 % and 20 % noise, respectively). Nonetheless, our approach correctly reconstructed the original trace for 19.3 % of the traces that were not correctly reconstructed using the cost function of [4]. Moreover, our approach improves LD by 21.1 % compared to the cost function of [4]. Such an improvement shows that, when the original trace is not reconstructed correctly, our approach returns an explanation that is significantly closer to what actually happened.

5.3 Complexity Analysis

In the previous sections, we have analyzed the accuracy of our approach for the computation of probable alignments. In this section, we analyze its computation time. In the worst case, the problem is clearly exponential in the length of the log traces and the number of process activities. However, in this paper, we advocate the use of the A-star algorithm since it can drastically reduce the execution time in the average case. To illustrate this, we report the computation time for the loan process and the fine-management process for different amounts of noise.

Fig. 6. Distribution of the computation time required to construct probable alignments for different amounts of noise. The computation time is grouped into 1 ms intervals in Fig. 6a and 0.3 ms intervals in Fig. 6b. The y-axis values are shown on a logarithmic scale.

Figure 6 shows the distribution of the computation time for the traces used in the experiments. In particular, Fig. 6a shows that, in the experiments of Sect. 5.1 (loan process), the construction of alignments required less than 1 ms for most of the traces. On the other hand, the construction of probable alignments for the fine management process required less than 0.3 ms for most of the traces (Fig. 6b). Table 4 reports the mean and standard deviation of computation time required to construct probable alignments for different levels of noise. The results show that, in both experiments, the time needed to construct probable alignments increases with increasing amounts of noise.

Based on the results presented in this section, we can conclude that, for both synthetic and real-life processes, our approach can construct probable alignments for a trace in the order of magnitude of milliseconds, which shows its practical feasibility.

Table 4. Mean and standard deviation of computation time required to construct probable alignments for different amounts of noise.

6 Discussion

The A-star algorithm requires a cost function to penalize nonconformity. In our experiments, we have considered a number of cost profiles to compute the cost of moves on log/model based on the probability of a given activity to occur in historical logging data. The selection of the cost profile has a significant impact on the results as they penalize deviations differently. For instance, cost profile penalizes less probable moves much more than . To illustrate this, consider a trace \(\sigma =\langle x,y\rangle \) and the process model in Fig. 7a. Two possible alignments, namely \(\gamma _1\) and \(\gamma _2\), are conceivable (Fig. 7b). \(\gamma _1\) contains a large number of deviations compared to \(\gamma _2\) (50 moves on log vs. 1 move on log). The use of cost profile yields \(\gamma _1\) as the optimal alignment, while the use of cost profile yields \(\gamma _2\) as the optimal alignment. Tables 2 and 3 show that cost profile usually provides more accurate results. Cost profile penalizes less probable moves excessively, and thus tends to construct alignments with more frequent traces in the historical logging data even if those alignments contain a significantly larger number of deviations. Our experiments suggest that the construction of probable alignments requires a trade-off between the frequency of the traces in historical logging data and the number of deviations in alignments, which is better captured by cost profile .

Fig. 7. Process model including two paths formed by a (sub)sequence of 50 activities and 1 activity, respectively. The first path is executed in \(99\,\%\) of the cases; the second in \(1\,\%\) of the cases. \(\gamma _1\) and \(\gamma _2\) are two possible alignments of trace \(\sigma =\langle x,y\rangle \) and the process model.

Different state-representation functions can be used to characterize the state of a process execution. In this work, we have considered three state-representation functions: sequence, multi-set, and set. The experiments show that, in general, the sequence abstraction produces more accurate results than the other abstractions. The set abstraction provides the least accurate results, especially when applied to the process for handling credit requests (Table 2). The main reason is that this abstraction is not able to accurately characterize the state of the process, especially in the presence of loops: after each loop iteration the process execution yields the same state. Therefore, the cost function constructed using the set abstraction cannot account for the fact that the probability of executing certain activities can increase after every loop iteration, thus leading to alignments in which loops are not captured properly.

The experiments show that our technique tends to build alignments that provide better explanations of deviations. It is easy to see that, when nonconformity is injected into fitting traces and alignments are subsequently built, the resulting alignments yield perfect explanations if their process projections coincide with the respective fitting traces before the injection of nonconformity. Tables 2 and 3 show that, by basing the construction of the cost function on the analysis of historical logging data, our technique tends to build alignments whose process projection is closer to the original fitting traces and, hence, whose explanations of deviations are closer to the correct ones.

7 Related Work and Conclusions

In process mining, a number of approaches have been proposed to check the conformance of process models against the actual behavior recorded in event logs. Some approaches [10, 11, 18, 21, 22] check conformance by verifying whether traces satisfy rules encoding properties expected from the process. Petković et al. [23] verify whether a log trace is a valid trace of the transition system generated by the process model. Rozinat et al. [24] propose a token-based approach for checking conformance of an event log and a Petri net: the number of missing and added tokens after replaying traces is used to measure the conformance between the log and the net. Banescu et al. [9] extend the work in [24] to identify and classify deviations by analyzing the configuration of missing and added tokens using deviation patterns. The genetic mining algorithm in [20] uses a similar replay technique to measure the quality of process models with respect to given executions. However, these approaches only give a Boolean answer diagnosing whether traces conform to a process model or not. When they are able to provide diagnostic information, such information is often imprecise. For instance, token-based approaches may allow behavior that is not permitted by the model, due to the heuristics used, and thus may provide incorrect diagnostic information.

Recently, the construction of alignments has been proposed as a robust approach for checking the conformance of event logs with a given process model [4]. Alignments have proven to be powerful artifacts to perform conformance checking. By constructing alignments, analysts can be provided with richer and more accurate diagnostic information. In fact, alignments are also used as the main enablers for a number of techniques for process analytics, auditing, and process improvement, such as for performance analysis [2], privacy compliance [5, 6] and process-model repairing [14].

To our knowledge, the main problem of existing techniques for constructing optimal alignments is that process analysts need to provide a function that associates a cost with every possible deviation. These cost functions are only based on human judgment and, hence, prone to imperfections. If alignment-based techniques are fed with imprecise cost functions, they create imperfect alignments, which ultimately leads to unlikely or even incorrect diagnostics.

In this paper, we have proposed a different approach where the cost function is automatically computed from real facts: the historical logging data recorded in event logs. In particular, the cost function is computed based on the probability that activities are executed (or not) in a certain state (representing which activities have been executed and in which order). Experiments have shown that, indeed, our approach can provide more accurate explanations of nonconformity of process executions compared to existing techniques.

We acknowledge that the evaluation is far from complete. We aim to perform more extensive experiments to verify whether certain cost-profile functions provide more probable alignments than others or, at least, to give some guidelines for determining in which settings a given cost-profile function is preferable.

In this paper, we only considered the control flow, i.e. the names of the activities and their ordering, to construct the cost function and, hence, to compute probable alignments. However, choices in a process execution are often driven by other aspects. For instance, when instances are running late, the execution of certain fast activities is more probable; or, if a certain process attribute takes on a given value, certain activities are more likely to be executed. We expect that our approach can be significantly improved if other business process perspectives (e.g., data, time and resources) are taken into account.