A Contextual Approach to Detecting Synonymous and Polluted Activity Labels in Process Event Logs

  • Conference paper
On the Move to Meaningful Internet Systems: OTM 2019 Conferences (OTM 2019)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11877)

Abstract

Process mining, a well-established research area, uses algorithms for process-oriented data analysis. As with other types of data analysis, quality issues in the input data lead to unreliable results (garbage in, garbage out). An important input for process mining is an event log, which is a record of events related to a business process as it is performed through an information system. While addressing quality issues in event logs is necessary, it is usually an ad hoc and tiresome task. In this paper, we propose an automatic approach for detecting two types of data quality issues related to activities, both critical for the success of process mining studies: synonymous labels (same semantics with different syntax) and polluted labels (same semantics and same label structures). We propose using activity context, i.e., control flow, resource, time, and data attributes, to detect semantically identical activity labels. We have implemented our approach and validated it on real-life logs from two hospitals and an insurance company, achieving promising results in detecting frequent imperfect activity labels.

Notes

  1. The Manhattan distance between any two PDFs \(p\) and \(q\) lies in the interval \([0,2]\): in the best case \(p\) and \(q\) are identical, so \(M(p,q) = 0\); in the worst case \(\exists i,j \mid 1\leqslant i,j\leqslant n, i\ne j\) such that \(p_i = 1\) and \(q_j = 1\), so \(M(p,q) = 2\).

  2. However, it may not be helpful if activities are performed in batch processing mode.

  3. A part of a day is a 4-hour period of the day.

  4. However, this principle may not hold for data attributes that take a wide range of values. One may be able to distinguish such attributes from informative ones via data-aware process mining [18]. The most informative attributes for indicating similarity or difference between activities are probably those involved in decision points.

  5. \(O(m+n^2)\) for the footprint matrix and \(O(n^3)\) for the distance measure.

  6. \(O(m)\) for the resource multi-sets and \(O(n^2\times r)\) for the distance measure.

  7. \(O(m)\) for the duration multi-sets and \(O(n^2\times d)\) for the distance measure.

  8. \(O(m)\) for the time multi-sets and \(O(n^2)\) for the distance measure.

  9. \(O(m\times k)\) for the data multi-sets and \(O(n^2\times v)\) for the distance measure.

  10. \(O(n^2\log n)\) for the Kruskal algorithm and \(O(n^3)\) for silhouette analysis.

  11. We used a bin width of 1 min for duration binning, the value 2 for null-valued distances, and uniform weights for measures within the temporal dimension.

  12. https://svn.win.tue.nl/repos/prom/Packages/SynonymousLabelRepair.

  13. https://data.4tu.nl/repository/uuid:76c46b83-c930-4798-a1c9-4be94dfeb741.

  14. To access the logs, ground truths, and results, refer to https://s3-ap-southeast-2.amazonaws.com/event-log-quality/CoopIS2019/ReadMe.docx.

  15. https://data.4tu.nl/repository/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460.

  16. We cannot release the log due to a non-disclosure agreement with the organization.

  17. Of course, where domain knowledge is available, it can guide the user in setting the weights. However, we want our approach to remain applicable when domain knowledge is unavailable, by finding the best weights automatically.

  18. We tried the weights 1, 2, 3, 4, and 5 for each of the four dimensions and picked the first combination that maximizes the F-score.

  19. We did not include the Hospital Billing logs in this experiment because their activity labels were artificially renamed to arbitrary names, so applying label similarity to those names would not produce meaningful outcomes.

  20. We used a final similarity threshold \(\theta = 0.7\) and a distance threshold \(\theta _l = 0.3\) for the computations of Table 4, so that the methods are compared under the same conditions.

  21. In combining our approach with label similarity, we still select the best weights, i.e., from 1 to 5 for each dimension as well as for the label similarity measure.
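Notes 1 and 3 can be illustrated with a short sketch. It is not the authors' implementation (that is the ProM package linked in note 12). It assumes that the "PDFs" of note 1 are discrete relative-frequency distributions built from multi-sets of observed values, and that the six 4-hour parts of a day start at midnight. The function names `to_distribution`, `manhattan_distance`, and `part_of_day` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Mapping


def to_distribution(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Turn a multi-set of observed values into relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from no observations")
    return {value: count / total for value, count in counts.items()}


def manhattan_distance(p: Mapping[Hashable, float],
                       q: Mapping[Hashable, float]) -> float:
    """Sum of absolute differences over the union of both supports.

    For two probability distributions the result lies in [0, 2].
    """
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def part_of_day(hour: int) -> int:
    """Map an hour (0-23) to one of six 4-hour parts of a day.

    Assumption: part 0 covers 00:00-03:59, part 1 covers 04:00-07:59, etc.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return hour // 4


# Made-up resource multi-sets for two activity labels.
identical = manhattan_distance(
    to_distribution(["alice", "bob"]),
    to_distribution(["bob", "alice"]),
)
disjoint = manhattan_distance(
    to_distribution(["alice"]),
    to_distribution(["bob"]),
)
overlapping = manhattan_distance(
    to_distribution(["alice", "alice", "bob", "carol"]),
    to_distribution(["alice", "bob", "bob", "bob"]),
)
print(identical, disjoint, overlapping)  # 0.0 2.0 1.0
```

The two extremes match the bounds in note 1: identical distributions give 0, and distributions that place all their mass on different values give 2. Two labels executed by overlapping sets of resources fall somewhere in between.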

References

  1. van der Aa, H., Gal, A., Leopold, H., Reijers, H.A., Sagi, T., Shraga, R.: Instance-based process matching using event-log information. In: Dubois, E., Pohl, K. (eds.) CAiSE 2017. LNCS, vol. 10253, pp. 283–297. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59536-8_18

  2. van der Aalst, W.M.P.: Process Mining: Data Science in Action, 2nd edn. Springer, Heidelberg (2016)

  3. van der Aalst, W., et al.: Process mining manifesto. In: Daniel, F., Barkaoui, K., Dustdar, S. (eds.) BPM 2011. LNBIP, vol. 99, pp. 169–194. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28108-2_19

  4. van der Aalst, W.M.P., Dustdar, S.: Process mining put into context. IEEE Internet Comput. 16(1), 82–86 (2012)

  5. Becker, M., Laue, R.: A comparative survey of business process similarity measures. Comput. Ind. 63(2), 148–167 (2012)

  6. Bose, R.J.C., Mans, R.S., van der Aalst, W.M.P.: Wanna Improve Process Mining Results - It’s High Time We Consider Data Quality Issues Seriously. Technical Report BPM-13-02, BPM Center (2013)

  7. Bose, R.J.C., Mans, R.S., van der Aalst, W.M.P.: Wanna improve process mining results - it’s high time we consider data quality issues seriously. In: Computational Intelligence and Data Mining Symposium, pp. 127–134. IEEE (2013)

  8. Cairns, A.H., et al.: Using semantic lifting for improving educational process models discovery and analysis. In: Symposium on Data-driven Process Discovery and Analysis. CEUR, vol. 1293, pp. 150–161 (2014)

  9. Celino, I., de Medeiros, A.K.A., Zeissler, G., et al.: Semantic business process analysis. In: Workshop on Semantic Business Process and Product Lifecycle Management. CEUR, vol. 251, pp. 44–47. CEUR-WS (2007)

  10. Conforti, R., La Rosa, M., ter Hofstede, A.H.M.: Timestamp Repair for Business Process Event Logs. Technical report, The University of Melbourne (2018)

  11. Craw, S.: Manhattan distance. In: Shekhar, S., Xiong, H., Zhou, X. (eds.) Encyclopedia of Machine Learning and Data Mining, pp. 790–791. Springer, Cham (2017)

  12. Dijkman, R., Dumas, M., van Dongen, B., et al.: Similarity of business process models: metrics and evaluation. Inf. Syst. 36(2), 498–516 (2011)

  13. Dixit, P.M., et al.: Detection and interactive repair of event ordering imperfection in process logs. In: Krogstie, J., Reijers, H.A. (eds.) CAiSE 2018. LNCS, vol. 10816, pp. 274–290. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91563-0_17

  14. Günther, C.W.: Process Mining in Flexible Environments. Ph.D. thesis, Eindhoven University of Technology (2009)

  15. Klinkmüller, C., Weber, I., Mendling, J., Leopold, H., Ludwig, A.: Increasing recall of process model matching by improved activity label matching. In: Daniel, F., Wang, J., Weber, B. (eds.) BPM 2013. LNCS, vol. 8094, pp. 211–218. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40176-3_17

  16. Koschmider, A., Ullrich, M., Heine, A., Oberweis, A.: Revising the vocabulary of business process element labels. In: Zdravkovic, J., Kirikova, M., Johannesson, P. (eds.) CAiSE 2015. LNCS, vol. 9097, pp. 69–83. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19069-3_5

  17. Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956)

  18. de Leoni, M., van der Aalst, W.M.P.: Data-aware process mining: discovering decisions in processes using alignments. In: SAC, pp. 1454–1461. ACM (2013)

  19. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)

  20. Lu, X., Fahland, D.: A conceptual framework for understanding event data quality for behavior analysis. In: ZEUS. CEUR, vol. 1826, pp. 11–14 (2017)

  21. Lu, X., et al.: Semi-supervised log pattern detection and exploration using event concurrence and contextual information. In: Panetto, H., et al. (eds.) OTM 2017. LNCS, vol. 10573, pp. 154–174. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69462-7_11

  22. Lu, X., Fahland, D., van den Biggelaar, F.J.H.M., van der Aalst, W.M.P.: Handling duplicated tasks in process discovery by refining event labels. In: La Rosa, M., Loos, P., Pastor, O. (eds.) BPM 2016. LNCS, vol. 9850, pp. 90–107. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45348-4_6

  23. Mannhardt, F., Blinde, D.: Analyzing the trajectories of patients with sepsis using process mining. In: CAiSE. CEUR, vol. 1859, pp. 72–80 (2017)

  24. Mans, R.S., van der Aalst, W.M.P., Vanwersch, R.J.B., Moleman, A.J.: Process mining in healthcare: data challenges when answering frequently posed questions. In: Lenz, R., Miksch, S., Peleg, M., Reichert, M., Riaño, D., ten Teije, A. (eds.) KR4HC/ProHealth 2012. LNCS (LNAI), vol. 7738, pp. 140–153. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36438-9_10

  25. Massey Jr., F.J.: The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46(253), 68–78 (1951)

  26. de Medeiros, A.K.A., et al.: An outlook on semantic business process mining and monitoring. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM 2007. LNCS, vol. 4806, pp. 1244–1255. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76890-6_52

  27. Rosemann, M., Recker, J., Flender, C.: Contextualisation of business processes. Int. J. Bus. Process Integr. Manage. 3(1), 47–60 (2008)

  28. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  29. Suriadi, S., Andrews, R., ter Hofstede, A.H.M., Wynn, M.T.: Event log imperfection patterns for process mining: towards a systematic approach to cleaning event logs. Inf. Syst. 64, 132–150 (2017)

  30. Tan, P.N., Steinbach, M., Kumar, V.: Cluster analysis: additional issues and algorithms. In: Introduction to Data Mining, pp. 569–650. Pearson (2005)

  31. Tax, N., Alasgarov, E., Sidorova, N., et al.: Generating Time-Based Label Refinements to Discover More Precise Process Models. Technical report, Eindhoven University of Technology (2017)

  32. Verhulst, R.: Evaluating Quality of Event Data within Event Logs: An Extensible Framework. Master’s thesis, Eindhoven University of Technology (2016)

Author information

Corresponding author

Correspondence to Sareh Sadeghianasl.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Sadeghianasl, S., ter Hofstede, A.H.M., Wynn, M.T., Suriadi, S. (2019). A Contextual Approach to Detecting Synonymous and Polluted Activity Labels in Process Event Logs. In: Panetto, H., Debruyne, C., Hepp, M., Lewis, D., Ardagna, C., Meersman, R. (eds) On the Move to Meaningful Internet Systems: OTM 2019 Conferences. OTM 2019. Lecture Notes in Computer Science, vol 11877. Springer, Cham. https://doi.org/10.1007/978-3-030-33246-4_5

  • DOI: https://doi.org/10.1007/978-3-030-33246-4_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33245-7

  • Online ISBN: 978-3-030-33246-4

  • eBook Packages: Computer Science, Computer Science (R0)
