A Contextual Approach to Detecting Synonymous and Polluted Activity Labels in Process Event Logs

Sadeghianasl, Sareh; ter Hofstede, Arthur H. M.; Wynn, Moe T.; Suriadi, Suriadi

doi:10.1007/978-3-030-33246-4_5

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11877)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Process mining, as a well-established research area, uses algorithms for process-oriented data analysis. As with other types of data analysis, quality issues in the input data lead to unreliable analysis results (garbage in, garbage out). An important input for process mining is an event log, which is a record of events related to a business process as it is performed through the use of an information system. While addressing quality issues in event logs is necessary, it is usually an ad-hoc and tiresome task. In this paper, we propose an automatic approach for detecting two types of data quality issues related to activities, both critical for the success of process mining studies: synonymous labels (same semantics with different syntax) and polluted labels (same semantics and same label structures). We propose the use of activity context, i.e. control flow, resource, time, and data attributes, to detect semantically identical activity labels. We have implemented our approach and validated it using real-life logs from two hospitals and an insurance company, and have achieved promising results in detecting frequent imperfect activity labels.

Notes

1.
The Manhattan distance between any two PDFs \(p\) and \(q\) lies in the interval \([0,2]\): in the best case \(p\) and \(q\) are identical, so \(M(p,q) = 0\); in the worst case \(\exists i,j \mid 1\leqslant i,j\leqslant n, i\ne j\) such that \(p_i = 1\) and \(q_j = 1\), so \(M(p,q) = 2\).
2.
However, it may not be helpful if activities are performed in batch processing mode.
3.
A part of a day is a 4-hour period.
4.
However, this principle may not hold for data attributes that take a wide range of values. One may be able to distinguish such attributes and informative ones via data-aware process mining [18]. The most informative attributes that indicate similarity or difference between activities are probably those involved in decision points.
5.
\(O(m+n^2)\) for the footprint matrix and \(O(n^3)\) for the distance measure.
6.
\(O(m)\) for the resource multi-sets and \(O(n^2\times r)\) for the distance measure.
7.
\(O(m)\) for the duration multi-sets and \(O(n^2\times d)\) for the distance measure.
8.
\(O(m)\) for the time multi-sets and \(O(n^2)\) for the distance measure.
9.
\(O(m\times k)\) for the data multi-sets and \(O(n^2\times v)\) for the distance measure.
10.
\(O(n^2\log n)\) for the Kruskal algorithm and \(O(n^3)\) for silhouette analysis.
11.
We used a bin width of 1 min for duration binning, the value 2 for null-valued distances, and uniform weights for the measures within the temporal dimension.
12.
https://svn.win.tue.nl/repos/prom/Packages/SynonymousLabelRepair.
13.
https://data.4tu.nl/repository/uuid:76c46b83-c930-4798-a1c9-4be94dfeb741.
14.
To access the logs, ground truths, and results, refer to https://s3-ap-southeast-2.amazonaws.com/event-log-quality/CoopIS2019/ReadMe.docx.
15.
https://data.4tu.nl/repository/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460.
16.
We cannot release the log due to a non-disclosure agreement with the organization.
17.
Of course, where domain knowledge is available, it can guide the user in setting the weights, but we want our approach to remain applicable when domain knowledge is not available, by finding the best weights automatically.
18.
We tried the weights 1, 2, 3, 4, and 5 for each of the four dimensions and picked the first combination that maximizes the F-score.
19.
We did not include the Hospital Billing logs in this experiment because their activity labels were artificially renamed to arbitrary names and therefore applying label similarity on those names would not result in meaningful outcomes.
20.
We used a final similarity threshold \(\theta = 0.7\) and a distance threshold \(\theta _l = 0.3\) for the computations of Table 4, so that the methods are compared under the same conditions.
21.
In combining our approach with label similarity we still select the best weights, i.e. from 1 to 5 for each dimension as well as the label similarity measure.
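
As an illustration of Notes 1, 3, and 11, the following Python sketch bins event timestamps into 4-hour parts of a day, turns the counts into a discrete PDF, and computes the Manhattan distance between two such PDFs. It is only a minimal sketch under stated assumptions, not the authors' implementation: the names part_of_day_pdf, manhattan, PART_HOURS, and NULL_DISTANCE are hypothetical, and treating a missing distribution as distance 2 is one possible reading of "null-valued distances" in Note 11.

```python
from collections import Counter
from datetime import datetime

PART_HOURS = 4       # Note 3: a part of a day is a 4-hour period
NULL_DISTANCE = 2.0  # Assumed reading of Note 11: distance when a PDF is missing


def part_of_day_pdf(timestamps):
    """Relative frequency of events in each 4-hour part of a day.

    Returns None when there are no timestamps (a "null" distribution).
    """
    if not timestamps:
        return None
    counts = Counter(ts.hour // PART_HOURS for ts in timestamps)
    total = len(timestamps)
    return [counts.get(part, 0) / total for part in range(24 // PART_HOURS)]


def manhattan(p, q):
    """Manhattan (L1) distance between two discrete PDFs over the same bins."""
    if p is None or q is None:
        return NULL_DISTANCE
    if len(p) != len(q):
        raise ValueError("PDFs must be defined over the same bins")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Hypothetical timestamps for two activity labels.
    morning = [datetime(2019, 1, 1, 9), datetime(2019, 1, 2, 10)]
    night = [datetime(2019, 1, 1, 23), datetime(2019, 1, 2, 22)]

    same = manhattan(part_of_day_pdf(morning), part_of_day_pdf(morning))
    disjoint = manhattan(part_of_day_pdf(morning), part_of_day_pdf(night))
    print(same)      # lower bound of the interval in Note 1
    print(disjoint)  # upper bound of the interval in Note 1
```

The two calls at the bottom exercise the two extremes described in Note 1: identical distributions and distributions that put all their mass on different bins.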

References

van der Aa, H., Gal, A., Leopold, H., Reijers, H.A., Sagi, T., Shraga, R.: Instance-based process matching using event-log information. In: Dubois, E., Pohl, K. (eds.) CAiSE 2017. LNCS, vol. 10253, pp. 283–297. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59536-8_18
van der Aalst, W.M.P.: Process Mining: Data Science in Action, 2nd edn. Springer, Heidelberg (2016)
van der Aalst, W., et al.: Process mining manifesto. In: Daniel, F., Barkaoui, K., Dustdar, S. (eds.) BPM 2011. LNBIP, vol. 99, pp. 169–194. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28108-2_19
van der Aalst, W.M.P., Dustdar, S.: Process mining put into context. IEEE Internet Comput. 16(1), 82–86 (2012)
Becker, M., Laue, R.: A comparative survey of business process similarity measures. Comput. Ind. 63(2), 148–167 (2012)
Bose, R.J.C., Mans, R.S., van der Aalst, W.M.P.: Wanna Improve Process Mining Results - It’s High Time We Consider Data Quality Issues Seriously. Technical Report BPM-13-02, BPM Center (2013)
Bose, R.J.C., Mans, R.S., van der Aalst, W.M.P.: Wanna improve process mining results - it’s high time we consider data quality issues seriously. In: Computational Intelligence and Data Mining Symposium, pp. 127–134. IEEE (2013)
Cairns, A.H., et al.: Using semantic lifting for improving educational process models discovery and analysis. In: Symposium on Data-driven Process Discovery and Analysis. CEUR, vol. 1293, pp. 150–161 (2014)
Celino, I., de Medeiros, A.K.A., Zeissler, G., et al.: Semantic business process analysis. In: Workshop on Semantic Business Process and Product Lifecycle Management. CEUR, vol. 251, pp. 44–47. CEUR-WS (2007)
Conforti, R., La Rosa, M., ter Hofstede, A.H.M.: Timestamp Repair for Business Process Event Logs. Technical report, The University of Melbourne (2018)
Craw, S.: Manhattan distance. In: Shekhar, S., Xiong, H., Zhou, X. (eds.) Encyclopedia of Machine Learning and Data Mining, pp. 790–791. Springer, Cham (2017)
Dijkman, R., Dumas, M., van Dongen, B., et al.: Similarity of business process models: metrics and evaluation. Inf. Syst. 36(2), 498–516 (2011)
Dixit, P.M., et al.: Detection and interactive repair of event ordering imperfection in process logs. In: Krogstie, J., Reijers, H.A. (eds.) CAiSE 2018. LNCS, vol. 10816, pp. 274–290. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91563-0_17
Günther, C.W.: Process Mining in Flexible Environments. Ph.D. thesis, Eindhoven University of Technology (2009)
Klinkmüller, C., Weber, I., Mendling, J., Leopold, H., Ludwig, A.: Increasing recall of process model matching by improved activity label matching. In: Daniel, F., Wang, J., Weber, B. (eds.) BPM 2013. LNCS, vol. 8094, pp. 211–218. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40176-3_17
Koschmider, A., Ullrich, M., Heine, A., Oberweis, A.: Revising the vocabulary of business process element labels. In: Zdravkovic, J., Kirikova, M., Johannesson, P. (eds.) CAiSE 2015. LNCS, vol. 9097, pp. 69–83. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19069-3_5
Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956)
Leoni, M.D., van der Aalst, W.M.P.: Data-aware process mining: discovering decisions in processes using alignments. In: SAC, pp. 1454–1461. ACM (2013)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady 10(8), 707–710 (1966)
Lu, X., Fahland, D.: A conceptual framework for understanding event data quality for behavior analysis. In: ZEUS. CEUR, vol. 1826, pp. 11–14 (2017)
Lu, X., et al.: Semi-supervised log pattern detection and exploration using event concurrence and contextual information. In: Panetto, H., et al. (eds.) OTM 2017. LNCS, vol. 10573, pp. 154–174. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69462-7_11
Lu, X., Fahland, D., van den Biggelaar, F.J.H.M., van der Aalst, W.M.P.: Handling duplicated tasks in process discovery by refining event labels. In: La Rosa, M., Loos, P., Pastor, O. (eds.) BPM 2016. LNCS, vol. 9850, pp. 90–107. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45348-4_6
Mannhardt, F., Blinde, D.: Analyzing the trajectories of patients with sepsis using process mining. In: CAiSE. CEUR, vol. 1859, pp. 72–80 (2017)
Mans, R.S., van der Aalst, W.M.P., Vanwersch, R.J.B., Moleman, A.J.: Process mining in healthcare: data challenges when answering frequently posed questions. In: Lenz, R., Miksch, S., Peleg, M., Reichert, M., Riaño, D., ten Teije, A. (eds.) KR4HC/ProHealth 2012. LNCS (LNAI), vol. 7738, pp. 140–153. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36438-9_10
Massey Jr., F.J.: The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46(253), 68–78 (1951)
de Medeiros, A.K.A., et al.: An outlook on semantic business process mining and monitoring. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM 2007. LNCS, vol. 4806, pp. 1244–1255. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76890-6_52
Rosemann, M., Recker, J., Flender, C.: Contextualisation of business processes. Int. J. Bus. Process Integr. Manage. 3(1), 47–60 (2008)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Suriadi, S., Andrews, R., ter Hofstede, A.H.M., Wynn, M.T.: Event log imperfection patterns for process mining: towards a systematic approach to cleaning event logs. Inf. Syst. 64, 132–150 (2017)
Tan, P.N., Steinbach, M., Kumar, V.: Cluster analysis: additional issues and algorithms. In: Introduction to Data Mining, pp. 569–650. Pearson (2005)
Tax, N., Alasgarov, E., Sidorova, N., et al.: Generating Time-Based Label Refinements to Discover More Precise Process Models. Technical report, Eindhoven University of Technology (2017)
Verhulst, R.: Evaluating Quality of Event Data within Event Logs: An Extensible Framework. Master’s thesis, Eindhoven University of Technology (2016)

Author information

Authors and Affiliations

Queensland University of Technology, Brisbane, Australia
Sareh Sadeghianasl, Arthur H. M. ter Hofstede, Moe T. Wynn & Suriadi Suriadi

Corresponding author

Correspondence to Sareh Sadeghianasl.

Editor information

Editors and Affiliations

University of Lorraine, Vandoeuvre Les Nancy Cedex, France
Hervé Panetto
Trinity College Dublin, Dublin, Ireland
Christophe Debruyne
Universität der Bundeswehr München, Munich, Germany
Martin Hepp
Trinity College Dublin, Dublin, Ireland
Dave Lewis
Università degli Studi di Milano Crema, Crema, Italy
Claudio Agostino Ardagna
TU Graz, Graz, Austria
Robert Meersman

About this paper

Cite this paper

Sadeghianasl, S., ter Hofstede, A.H.M., Wynn, M.T., Suriadi, S. (2019). A Contextual Approach to Detecting Synonymous and Polluted Activity Labels in Process Event Logs. In: Panetto, H., Debruyne, C., Hepp, M., Lewis, D., Ardagna, C., Meersman, R. (eds) On the Move to Meaningful Internet Systems: OTM 2019 Conferences. OTM 2019. Lecture Notes in Computer Science, vol 11877. Springer, Cham. https://doi.org/10.1007/978-3-030-33246-4_5

DOI: https://doi.org/10.1007/978-3-030-33246-4_5
Published: 11 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33245-7
Online ISBN: 978-3-030-33246-4
eBook Packages: Computer Science, Computer Science (R0)
