Abstract
Process mining, as a well-established research area, uses algorithms for process-oriented data analysis. Similar to other types of data analysis, the existence of quality issues in input data will lead to unreliable analysis results (garbage in, garbage out). An important input for process mining is an event log, which is a record of events related to a business process as it is performed through the use of an information system. While addressing quality issues in event logs is necessary, it is usually an ad-hoc and tiresome task. In this paper, we propose an automatic approach for detecting two types of data quality issues related to activities, both critical for the success of process mining studies: synonymous labels (same semantics with different syntax) and polluted labels (same semantics and same label structures). We propose the use of activity context, i.e. control flow, resource, time, and data attributes, to detect semantically identical activity labels. We have implemented our approach and validated it using real-life logs from two hospitals and an insurance company, and have achieved promising results in detecting frequent imperfect activity labels.
Notes
- 1.
The Manhattan distance between any two PDFs \(p\) and \(q\) lies in the interval \([0,2]\), because in the best case, \(p\) and \(q\) are identical, then \(M(p,q) = 0\), and in the worst case \(\exists i,j \mid 1\leqslant i,j\leqslant n, i\ne j \) such that \( p_i = 1 \) and \(q_j = 1\), then \(M(p,q) = 2\).
- 2.
However, it may not be helpful if activities are performed in batch processing mode.
- 3.
A part of a day is a 4-hour period of the day.
- 4.
However, this principle may not hold for data attributes that take a wide range of values. One may be able to distinguish such attributes and informative ones via data-aware process mining [18]. The most informative attributes that indicate similarity or difference between activities are probably those involved in decision points.
- 5.
\(O(m+n^2)\) for the footprint matrix and \(O(n^3)\) for the distance measure.
- 6.
\(O(m)\) for the resource multi-sets and \(O(n^2\times r)\) for the distance measure.
- 7.
\(O(m)\) for the duration multi-sets and \(O(n^2\times d)\) for the distance measure.
- 8.
\(O(m)\) for the time multi-sets and \(O(n^2)\) for the distance measure.
- 9.
\(O(m\times k)\) for the data multi-sets and \(O(n^2\times v)\) for the distance measure.
- 10.
\(O(n^2\log n)\) for the Kruskal algorithm and \(O(n^3)\) for silhouette analysis.
- 11.
We used the bin width of 1 min for duration binning, the number 2 for null-valued distances, and uniform weights for measures within the temporal dimension.
- 12.
- 13.
- 14.
To access the logs, ground truths, and results, refer to https://s3-ap-southeast-2.amazonaws.com/event-log-quality/CoopIS2019/ReadMe.docx.
- 15.
- 16.
We cannot release the log due to a non-disclosure agreement with the organization.
- 17.
Of course, where domain knowledge is available, it can guide the user in setting the weights. However, we want our approach to be applicable even when domain knowledge is not available, by finding the best weights automatically.
- 18.
We tried each of the weights 1, 2, 3, 4, and 5 for each of the four dimensions and picked the first weight combination that maximizes the F-score.
- 19.
We did not include the Hospital Billing logs in this experiment because their activity labels were artificially renamed to arbitrary names and therefore applying label similarity on those names would not result in meaningful outcomes.
- 20.
We used a final similarity threshold \(\theta = 0.7\) and a distance threshold \(\theta _l = 0.3\) for the computations of Table 4 to compare methods under the same conditions.
- 21.
In combining our approach with label similarity we still select the best weights, i.e. from 1 to 5 for each dimension as well as the label similarity measure.
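The bounds stated in Note 1 can be checked numerically. The following is a minimal sketch, assuming each PDF is represented as a list of bin probabilities that sum to 1; the function name `manhattan_distance` and the example distributions are illustrative and not taken from the paper.

```python
from typing import Sequence


def manhattan_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute differences between two equal-length discrete PDFs."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    return sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))


# Best case: identical distributions give distance 0.
identical = manhattan_distance([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])

# Worst case: all mass of p in one bin, all mass of q in a different bin.
disjoint = manhattan_distance([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])

# Intermediate case: the distributions overlap in one bin only.
partial = manhattan_distance([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])

print(identical, disjoint, partial)
```

Any other pair of discrete distributions over the same bins yields a value between these two extremes.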
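The weight selection described in Notes 18 and 21 reads as an exhaustive search over weight combinations. The sketch below shows one way such a search could look, assuming weights 1 to 5 per dimension and a caller-supplied scoring function. The names `first_best_weights` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for the F-score computed against a ground truth; it is not the authors' implementation.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Weights = Tuple[int, ...]


def first_best_weights(
    score: Callable[[Weights], float],
    n_dimensions: int = 4,
    candidates: Sequence[int] = (1, 2, 3, 4, 5),
) -> Tuple[Weights, float]:
    """Return the first weight combination (in enumeration order) with the top score."""
    if not candidates or n_dimensions < 1:
        raise ValueError("need at least one candidate weight and one dimension")
    combinations = product(candidates, repeat=n_dimensions)
    best_weights = next(combinations)
    best_score = score(best_weights)
    for weights in combinations:
        current = score(weights)
        # Strict comparison keeps the earliest maximizer when scores tie.
        if current > best_score:
            best_weights, best_score = weights, current
    return best_weights, best_score


def toy_score(weights: Weights) -> float:
    """Hypothetical stand-in for an F-score; favors dimension 0 over dimension 1."""
    return 1.0 if weights[0] > weights[1] else 0.5


best, best_value = first_best_weights(toy_score)
print(best, best_value)
```

With four dimensions this enumerates 5^4 = 625 combinations; adding label similarity as a fifth weighted measure (Note 21) would correspond to `n_dimensions=5` and 3125 combinations.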
References
van der Aa, H., Gal, A., Leopold, H., Reijers, H.A., Sagi, T., Shraga, R.: Instance-based process matching using event-log information. In: Dubois, E., Pohl, K. (eds.) CAiSE 2017. LNCS, vol. 10253, pp. 283–297. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59536-8_18
van der Aalst, W.M.P.: Process Mining: Data Science in Action, 2nd edn. Springer, Heidelberg (2016)
van der Aalst, W., et al.: Process mining manifesto. In: Daniel, F., Barkaoui, K., Dustdar, S. (eds.) BPM 2011. LNBIP, vol. 99, pp. 169–194. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28108-2_19
van der Aalst, W.M.P., Dustdar, S.: Process mining put into context. IEEE Internet Comput. 16(1), 82–86 (2012)
Becker, M., Laue, R.: A comparative survey of business process similarity measures. Comput. Ind. 63(2), 148–167 (2012)
Bose, R.J.C., Mans, R.S., van der Aalst, W.M.P.: Wanna Improve Process Mining Results - It’s High Time We Consider Data Quality Issues Seriously. Technical Report BPM-13-02, BPM Center (2013)
Bose, R.J.C., Mans, R.S., van der Aalst, W.M.P.: Wanna improve process mining results - it’s high time we consider data quality issues seriously. In: Computational Intelligence and Data Mining Symposium, pp. 127–134. IEEE (2013)
Cairns, A.H., et al.: Using semantic lifting for improving educational process models discovery and analysis. In: Symposium on Data-driven Process Discovery and Analysis. CEUR, vol. 1293, pp. 150–161 (2014)
Celino, I., de Medeiros, A.K.A., Zeissler, G., et al.: Semantic business process analysis. In: Workshop on Semantic Business Process and Product Lifecycle Management. CEUR, vol. 251, pp. 44–47. CEUR-WS (2007)
Conforti, R., La Rosa, M., ter Hofstede, A.H.M.: Timestamp Repair for Business Process Event Logs. Technical report, The University of Melbourne (2018)
Craw, S.: Manhattan distance. In: Shekhar, S., Xiong, H., Zhou, X. (eds.) Encyclopedia of Machine Learning and Data Mining, pp. 790–791. Springer, Cham (2017)
Dijkman, R., Dumas, M., van Dongen, B., et al.: Similarity of business process models: metrics and evaluation. Inf. Syst. 36(2), 498–516 (2011)
Dixit, P.M., et al.: Detection and interactive repair of event ordering imperfection in process logs. In: Krogstie, J., Reijers, H.A. (eds.) CAiSE 2018. LNCS, vol. 10816, pp. 274–290. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91563-0_17
Günther, C.W.: Process Mining in Flexible Environments. Ph.D. thesis, Eindhoven University of Technology (2009)
Klinkmüller, C., Weber, I., Mendling, J., Leopold, H., Ludwig, A.: Increasing recall of process model matching by improved activity label matching. In: Daniel, F., Wang, J., Weber, B. (eds.) BPM 2013. LNCS, vol. 8094, pp. 211–218. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40176-3_17
Koschmider, A., Ullrich, M., Heine, A., Oberweis, A.: Revising the vocabulary of business process element labels. In: Zdravkovic, J., Kirikova, M., Johannesson, P. (eds.) CAiSE 2015. LNCS, vol. 9097, pp. 69–83. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19069-3_5
Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956)
de Leoni, M., van der Aalst, W.M.P.: Data-aware process mining: discovering decisions in processes using alignments. In: SAC, pp. 1454–1461. ACM (2013)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
Lu, X., Fahland, D.: A conceptual framework for understanding event data quality for behavior analysis. In: ZEUS. CEUR, vol. 1826, pp. 11–14 (2017)
Lu, X., et al.: Semi-supervised log pattern detection and exploration using event concurrence and contextual information. In: Panetto, H., et al. (eds.) OTM 2017. LNCS, vol. 10573, pp. 154–174. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69462-7_11
Lu, X., Fahland, D., van den Biggelaar, F.J.H.M., van der Aalst, W.M.P.: Handling duplicated tasks in process discovery by refining event labels. In: La Rosa, M., Loos, P., Pastor, O. (eds.) BPM 2016. LNCS, vol. 9850, pp. 90–107. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45348-4_6
Mannhardt, F., Blinde, D.: Analyzing the trajectories of patients with sepsis using process mining. In: CAiSE. CEUR, vol. 1859, pp. 72–80 (2017)
Mans, R.S., van der Aalst, W.M.P., Vanwersch, R.J.B., Moleman, A.J.: Process mining in healthcare: data challenges when answering frequently posed questions. In: Lenz, R., Miksch, S., Peleg, M., Reichert, M., Riaño, D., ten Teije, A. (eds.) KR4HC/ProHealth 2012. LNCS (LNAI), vol. 7738, pp. 140–153. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36438-9_10
Massey Jr., F.J.: The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46(253), 68–78 (1951)
de Medeiros, A.K.A., et al.: An outlook on semantic business process mining and monitoring. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM 2007. LNCS, vol. 4806, pp. 1244–1255. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76890-6_52
Rosemann, M., Recker, J., Flender, C.: Contextualisation of business processes. Int. J. Bus. Process Integr. Manage. 3(1), 47–60 (2008)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Suriadi, S., Andrews, R., ter Hofstede, A.H.M., Wynn, M.T.: Event log imperfection patterns for process mining: towards a systematic approach to cleaning event logs. Inf. Syst. 64, 132–150 (2017)
Tan, P.N., Steinbach, M., Kumar, V.: Cluster analysis: additional issues and algorithms. In: Introduction to Data Mining, pp. 569–650. Pearson (2005)
Tax, N., Alasgarov, E., Sidorova, N., et al.: Generating Time-Based Label Refinements to Discover More Precise Process Models. Technical report, Eindhoven University of Technology (2017)
Verhulst, R.: Evaluating Quality of Event Data within Event Logs: An Extensible Framework. Master’s thesis, Eindhoven University of Technology (2016)
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sadeghianasl, S., ter Hofstede, A.H.M., Wynn, M.T., Suriadi, S. (2019). A Contextual Approach to Detecting Synonymous and Polluted Activity Labels in Process Event Logs. In: Panetto, H., Debruyne, C., Hepp, M., Lewis, D., Ardagna, C., Meersman, R. (eds) On the Move to Meaningful Internet Systems: OTM 2019 Conferences. OTM 2019. Lecture Notes in Computer Science, vol 11877. Springer, Cham. https://doi.org/10.1007/978-3-030-33246-4_5
DOI: https://doi.org/10.1007/978-3-030-33246-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33245-7
Online ISBN: 978-3-030-33246-4
eBook Packages: Computer Science, Computer Science (R0)