1 Introduction

Since the mid-1990s, there has been considerable work in the field of process mining. A number of techniques and algorithms enable the discovery and reenactment of process models from event logs [21]. As the field matures and achieves critical success in process modelling, we suggest applying such techniques and algorithms to software process modelling in order to document and explain the activities involved in software development. A practical example would be process mining Software Configuration Management (SCM) systems, such as CVS or Subversion, in order to model software development processes. These systems are popular in the world of Free/Libre Open Source Software (FLOSS). FLOSS repositories store massive volumes of data about software development activities, and applying process mining to these data carries considerable potential for uncovering the patterns they contain.

However, there have been limited efforts in applying process mining to the analysis of data in FLOSS environments. The only attempt we are aware of combines a number of software repositories in order to generate a log for process mining and analysis [12]. That work exemplifies how process mining can be applied to understand software development processes based on the audit trail documents recorded by the SCM during the development cycle.

The objective of our work is to open the discussion and pave a way towards introducing and adopting process mining as a viable alternative for analysing and discovering workflow models from the email discussions, code comments, bug reviews and reports that are widely found in FLOSS environments. Our discussion is predicated on the assumption that, by looking at some of the existing techniques in mining software repositories, benchmarks and guidelines can be defined to explore similar questions through process mining and to assess its potential in doing so.

In this paper we review some state-of-the-art techniques and activities for mining software repositories. We refer the reader to a similar endeavor by Kagdi, Collard and Maletic [10] for a detailed report in this regard; their survey is quite expressive of the critical milestones reached in mining software repositories. Here we succinctly select and present some of these mining perspectives insofar as they converge with the objectives of our endeavor. We consider these approaches in terms of the type of software repositories to be mined, the expected results guiding the mining process, and the methodology and techniques used.

The remainder of the paper is structured as follows. In Sect. 2 we discuss some leading factors taken into account while mining repositories. In Sect. 3 selected mining techniques are described. Section 4 gives a condensed overview of some tools developed over the years to mine software repositories. In Sect. 5 we describe process mining as related to the previous sections. Finally, Sect. 6 concludes our work with the prospects of process mining FLOSS repositories as well as directions for future work.

2 Mining Software Repositories: Leading Factors

The analysis of software repositories is driven by a large variety of factors. We consider four factors outlined by Kagdi, Collard and Maletic [10]: information sources, the purpose of investigation, the methodology and the quality of the output.

The first factor, information sources, refers to the repositories storing the data to be mined. There is a wide literature on mining software repositories [7, 8, 13]; source-control systems, defect-tracking systems and archived communications stand out as the main sources of data utilised while conducting investigations in FLOSS [7, 10]. Source-control systems are repositories for storing and managing source code files in FLOSS. Defect-tracking systems, as the name suggests, manage bug and change reporting. Archived communications encompass messages exchanged via email, discussion groups and forums between FLOSS participants.

The next critical element at the heart of mining software repositories is the purpose. It stands at the start of any research endeavor: it defines the objectives and produces the questions whose answers are sought during the investigation, and it determines what the output of mining should be. After identifying the sources and determining the purpose, the methodology for mining the data and answering the questions still has to be decided. Given the investigative nature of such questions, the approaches available in the literature revolve around defining metrics that are verified against the extracted data. For example, metrics for assessing software complexity, such as extensibility and defect density, can be computed on different versions of software committed to SVN over a period of time in order to deduce properties that explain aspects of software evolution.

The last factor paramount to the investigation of FLOSS repositories is evaluation, that is, the evaluation of the hypotheses formulated according to the objectives of the investigation. In the context of software evolution, two assessment metrics are borrowed from the area of information retrieval: precision and recall on the amount of information used as well as its relevance. In our case, the plan is to produce models, process models primarily, which are to be evaluated and validated in a number of appropriate ways.

3 Mining Techniques: Selected, Relevant Approaches

3.1 Bug Fixing Analysis

The first relevant attempt in mining software repositories pertains to analysing bug fixing in FLOSS. Śliwerski, Zimmermann and Zeller [18] present results of their investigation into how bugs are fixed through changes introduced in FLOSS. The main repositories they used are CVS and Bugzilla, along with the relevant metadata. While the purpose of their work was to locate changes that induce fixes by coupling CVS to Bugzilla, our interest is in the methodology they used to investigate these repositories. It can be summarized in three steps:

  1. Starting with a bug report in the bug database, indicating a fixed problem.

  2. Extracting the associated change from the version archive; this indicates the location of the fix.

  3. Determining the earlier change at this location that was applied before the bug was reported.

Step 1 is to identify fixes. This is done on two levels: syntactic and semantic levels. At the syntactic level, the objective is to infer links from a CVS log to a bug report while at the semantic level the goal is to validate a link using the data from the bug report [18]. In practice, this is carried out as follows.

Syntactically, log messages are split into a stream of tokens in order to identify the link to Bugzilla. The split generates one of the following items as a token:

  • a bug number, if it matches one of the following regular expressions (given in FLEX syntax):

    • bug[# \t]*[0-9]+,

    • pr[# \t]*[0-9]+,

    • show_bug\.cgi\?id=[0-9]+,

    • \[[0-9]+\];

  • a plain number, if it is a string of digits [0-9]+;

  • a keyword, if it matches the following regular expression:

    • fix(e[ds])?|bugs?|defects?|patch;

  • a word, if it is a string of alphanumeric characters.

Each link is assigned an initial syntactic confidence syn of zero, which is raised by one if the number is a bug number, and by one if the log message contains a keyword or contains only plain or bug numbers. For example, consider the following log messages (a small sketch of this tokenisation and scoring is given after the examples):

  • Fixed bug 53784: .class file missing from jar file export

    The link to the bug number 53784 gets a syntactic confidence of 2 because it matches the regular expression for bug and contains the keyword fixed.

  • 52264,51529

    The links to bugs 52264 and 51529 have syntactic confidence 1 because the log message contains only numbers.
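To make the syntactic step concrete, the following Python fragment is a minimal sketch of this tokenisation and scoring, reconstructed from the description above rather than taken from the authors' implementation; the regular expressions are the ones listed earlier, rewritten in Python's re syntax.

    import re

    # Regular expressions from the list above, rewritten for Python's re module.
    BUG_REF = re.compile(
        r'bug[# \t]*([0-9]+)|pr[# \t]*([0-9]+)|show_bug\.cgi\?id=([0-9]+)|\[([0-9]+)\]',
        re.IGNORECASE)
    KEYWORD = re.compile(r'fix(e[ds])?|bugs?|defects?|patch', re.IGNORECASE)
    PLAIN_NUMBER = re.compile(r'\b[0-9]+\b')

    def syntactic_links(log_message):
        """Return {number: syntactic confidence} for every number found in a log message."""
        bug_numbers = {next(g for g in m.groups() if g) for m in BUG_REF.finditer(log_message)}
        plain_numbers = set(PLAIN_NUMBER.findall(log_message))
        has_keyword = KEYWORD.search(log_message) is not None
        # "contains only plain or bug numbers": besides digits, only separators remain
        only_numbers = re.fullmatch(r'[0-9,;:\s#\[\]]*', log_message) is not None

        links = {}
        for number in bug_numbers | plain_numbers:
            syn = 0
            if number in bug_numbers:
                syn += 1                 # the number is a bug number
            if has_keyword or only_numbers:
                syn += 1                 # keyword present, or nothing but numbers
            links[number] = syn
        return links

    print(syntactic_links("Fixed bug 53784: .class file missing from jar file export"))
    # bug 53784 gets confidence 2, matching the first example above
    print(syntactic_links("52264,51529"))
    # both numbers get confidence 1, matching the second example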

Furthermore, the role of the semantic level in Step 1 is to validate a link \((t,b)\) by taking information about its transaction \(t\) and checking it against information about its bug report \(b\). A semantic confidence is assigned to the link based on the outcome; it is incremented by 1 for each of a number of conditions that hold, such as “the bug \(b\) has been resolved as FIXED at least once” or “the short description of the bug report \(b\) is contained in the log message of the transaction \(t\)”. Two examples from ECLIPSE are as follows:

  • Updated copyrights to 2004

The potential bug report number “2004” is marked as invalid and thus the semantic confidence of the link is zero.

  • Support expression like (i)+= 3; and new int[] 1[0] + syntax error improvement

    1 and 3 are (mistakenly) interpreted as bug report numbers here. Since the bug reports 1 and 3 have been fixed, these links both get a semantic confidence of 1.

The rest of the process (Steps 2 and 3) is performed manually. Returned links are inspected manually in order to eliminate those that do not satisfy the following condition:

$$sem > 1 \vee (sem = 1 \wedge syn > 0)$$
Fig. 1. Manual inspection of selected links

As shown in Fig. 1, the process involves rigorous manual inspection of randomly selected links that are to be verified based on the above condition.
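A corresponding sketch of the semantic check and of the filtering condition is given below; the bug-report fields used here (resolution history, short description) are illustrative stand-ins for the Bugzilla data, not the authors' actual schema, and only two of the conditions from [18] are shown.

    def semantic_confidence(log_message, bug_report):
        """Assumes bug_report is a dict with illustrative keys 'resolutions' and 'short_desc'."""
        sem = 0
        if "FIXED" in bug_report["resolutions"]:              # resolved as FIXED at least once
            sem += 1
        if bug_report["short_desc"].lower() in log_message.lower():
            sem += 1                                          # short description quoted in the log
        # further conditions from [18] (e.g. assignment of the bug to the committer) go here
        return sem

    def keep_link(sem, syn):
        """Acceptance condition used during inspection: sem > 1 or (sem = 1 and syn > 0)."""
        return sem > 1 or (sem == 1 and syn > 0)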

After applying this approach to ECLIPSE and MOZILLA, with respectively 78,954 and 109,658 transactions for changes made until January 20, 2005, the authors present results, in line with their objectives, for 278,010 and 392,972 individual revisions in these projects respectively. Some of these results concern the average size of fix transactions in both projects and the days of the week on which most changes are projected to occur.

3.2 Software Evolution Analysis

The second approach was conducted by German [5] to present the characteristics of the different types of changes that occur in FLOSS. German used CVS and its related metadata as information sources. The collective nature of software development in FLOSS environments allows for incremental changes and modifications to software projects. These progressive changes can be retrieved from version control systems such as CVS or SVN and parsed for analysis. In his approach, German investigated the changes made to files as well as the developers who most frequently commit these changes over a period of time. His argument also suggests that analysing the changes provides clarification on the development stages of a project in light of the addition and update of features [5].

The projects considered for this purpose include PostgreSQL, Apache, Mozilla, GNU gcc, and Evolution. Using a CVS analysis tool called softChange, CVS logs and metadata were retrieved from these projects for investigation. German proposes an algorithm to rebuild Modification Records (MRs) and claims that it provides a fine-grained view of the evolution of a software product. Noticeable in this work is the methodology used for mining the chosen repositories: the first step is to retrieve the historical files from CVS and rebuild the Modification Records from this information, as they are not stored explicitly in CVS. softChange, through its file-revision component, uses a sliding-window heuristic (shown in Fig. 2) to organise this information.

Fig. 2. Pseudocode for the Modification Records (MRs) algorithm

Briefly explained, the algorithm takes two parameters (\(\delta _{max}\) and \(T_{max}\)) as input. Parameter \(\delta _{max}\) is the maximum length of time that an MR may last, while \(T_{max}\) is the maximum distance in time between two file revisions. A file revision is included in a given MR on the basis of the following conditions (a minimal sketch of this grouping follows the list):

  • all file revisions in the MR and the candidate file revision were created by the same author and have the same log (a comment added by the developer when the file revisions are committed);

  • the candidate file revision is at most \(T_{max}\) seconds apart from at least one file revision in the MR;

  • the addition of the candidate file revision to the MR keeps the MR at most \(\delta _{max}\) seconds long.
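The sketch below approximates this sliding-window grouping in Python; it is a reconstruction of the conditions listed above, not German's softChange code, and it assumes that a revision is given as a tuple (file, author, log, timestamp in seconds).

    def build_mrs(revisions, t_max, delta_max):
        """Group individual CVS file revisions into Modification Records (MRs)."""
        mrs = []                                          # each MR is a list of revision tuples
        for rev in sorted(revisions, key=lambda r: r[3]): # process revisions in time order
            _file, author, log, time = rev
            for mr in mrs:
                same_author_and_log = all(r[1] == author and r[2] == log for r in mr)
                close_to_one = any(abs(time - r[3]) <= t_max for r in mr)
                times = [r[3] for r in mr] + [time]
                span_ok = max(times) - min(times) <= delta_max
                if same_author_and_log and close_to_one and span_ok:
                    mr.append(rev)                        # the candidate joins this MR
                    break
            else:
                mrs.append([rev])                         # no MR accepts it: start a new one
        return mrs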

In order to conduct the analysis, knowledge of the nature and structure of code MRs is required. Hence, the investigation is premised on the assumption that there exist six types of code MRs reflecting different activities undertaken by FLOSS developers: modifying code for functionality improvement (addition of new features), defect fixing, architectural evolution and refactoring (a major change in APIs or a reorganisation of the code base), relocating code, documentation (changes to the comments within files) and branch merging (e.g. code is merged from or into a branch).

Rysselberghe and Demeyer [17] investigate FLOSS repositories using clone detection methods. In their approach, the source code in CVS as well as its metadata are examined in order to analyse frequently occurring changes in source files. The idea is to document changes occurring in FLOSS using a technique tailored after the standard concept of frequently asked questions (FAQs). The rationale of FAQs is to gather basic questions and answers that are representative of frequently recurring questions, so as to reduce the continual posting of the same basic questions. Similarly, Rysselberghe and Demeyer apply this concept to frequent changes occurring in FLOSS: the objective is to identify frequently applied changes (FACs), since these changes record general solutions to frequent and recurring problems. Using the appropriate CVS commands, such as cvs log and cvs diff, change data can be extracted from CVS. These data include the difference in code before and after the change, the date and time of the change, and the file involved. Once such information is obtained, the next step is to parse it and identify FACs. Locating FACs implies locating similar code fragments, which can be done by applying clone detection techniques.

Clone detection methods were developed to identify duplicated or cloned code fragments in a program's source code. In this process, a tool called CCFinder was used to analyse text files containing the code fragments retrieved as FACs. Based on threshold values, the study asserts that high threshold values allow the identification of recurring, product-specific changes, while low threshold values lead to the identification of frequently applied generic changes. Using Tomcat as a case study, observations drawn from the initial experiment include, for instance, that FACs identified with a high threshold are specific to one product and can be used to study and understand the motivation and success behind an applied change; moreover, the removal of a recently added code fragment may give an indication of the reasons behind the success or failure of changes in general. On the other hand, FACs identified with a low threshold can help in deriving low maintenance strategies automatically.
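The following toy Python fragment illustrates the underlying idea of thresholded similarity between change fragments; CCFinder itself is token-based and far more sophisticated, so this is only a sketch, and the change fragments are assumed to have been extracted beforehand with cvs diff.

    from difflib import SequenceMatcher
    from itertools import combinations

    def similar_change_pairs(change_fragments, threshold=0.8):
        """Return index pairs of change fragments whose textual similarity exceeds the threshold."""
        pairs = []
        for (i, a), (j, b) in combinations(enumerate(change_fragments), 2):
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((i, j))
        return pairs

    # A high threshold surfaces near-identical, product-specific changes; a lower threshold
    # surfaces more generic, frequently applied edits, as discussed above.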

3.3 Identification of Developers' Identities

The next case of FLOSS investigation concerns the identification of developers' identities in FLOSS repositories. Given that developers often adopt different identities in distinct FLOSS projects, the task of identification becomes cumbersome. Nevertheless, one solution in this regard has been to integrate data from the multiple repositories to which developers contribute. Sowe and Cerone [19], using repositories from the FLOSSMetrics project, proposed a methodology to identify developers who make contributions both by committing code to SVN and by posting messages to mailing lists.

Robles and Gonzalez-Barahona [14] conducted a similar study, based on the application of heuristics, to identify the many identities used by developers. Their methodology was applied to the GNOME project: 464,953 messages from 36,399 distinct e-mail addresses were fetched and analysed, 123,739 bug reports from 41,835 reporters and 382,271 comments from 10,257 posters were retrieved from the bug tracking system, and around 2,000,000 commits, made by 1,067 different committers, were found in the CVS repository. The results showed that 108,170 distinct identities could be extracted; for those identities, 47,262 matches were found, of which 40,003 were distinct (the Matches table containing that number of entries). Using the information in the Matches table, 34,648 unique persons were identified.
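A heavily simplified Python sketch of this kind of identity-matching heuristic is shown below; the real study applies several more heuristics plus manual verification, and the identities used here are invented for illustration.

    import re
    from collections import defaultdict

    def normalise(name):
        """Lower-case a real name and strip punctuation, keeping only letters and spaces."""
        return re.sub(r'[^a-z ]', '', name.lower()).strip()

    def candidate_matches(identities):
        """Group (name, email) identities sharing an e-mail local part or a normalised name."""
        groups = defaultdict(set)
        for name, email in identities:
            groups[email.split('@')[0].lower()].add((name, email))
            groups[normalise(name)].add((name, email))
        return {key: ids for key, ids in groups.items() if len(ids) > 1}

    print(candidate_matches([
        ("Jane Doe", "jdoe@example.org"),           # invented identities for illustration
        ("J. Doe", "jane.doe@example.org"),
        ("Jane Doe", "jane@lists.example.org"),
    ]))
    # the two "Jane Doe" entries are grouped as candidates for the same person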

3.4 Source Code Investigation

Yao's work [25] aims to search through the source code in CVS and its related metadata in order to find lines of code in specific files. This is done through a tool called CVSSearch (see Sect. 4). The technique used to analyse CVS comments automatically finds an explicit mapping between a commit comment and the lines of code it refers to. This is useful because CVS comments provide information that cannot be found in code comments; for instance, when a bug is fixed, the relevant information is typically not present in code comments but can be found in CVS. Moreover, as part of FLOSS investigation, one can search for code that is bug-prone or bug-free based on the CVS comments in which these lines of code are referenced.

Hence, Yao’s technique entails searching for lines of code by their CVS comments, producing a mapping between the comments and the lines of code to which they refer [25]. Unlike the CVS annotate command, which shows only the last revision of modification for each line, the algorithm used here records all revisions of modification for each line. The algorithm is highlighted as follows [25] (a minimal sketch is given after the list):

  • Consider a file \(f\) at version \(i\) which is then modified and committed into the CVS repository yielding version \(i+1\).

  • Also, suppose the user entered a comment \(C\) which is associated with the triple \((f,i, i+1)\).

  • By performing a diff between versions \(i\) and \(i+1\) of \(f\), it is possible to determine lines that have been modified or inserted in version \(i+1\), the comment \(C\) is thus associated with such lines.

  • Additionally, in order to search the most recent version of each file, a propagation phase is performed during which the comments associated with version \(i+1\) of \(f\) are “propagated” to the corresponding lines in the most recent version of \(f\), say \(j \ge i+1\). This is done by performing diff on successive versions of \(f\) to track the movement of these lines across versions until version \(j\) is reached.
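A minimal Python reconstruction of the mapping step is given below; it is not Yao's CVSSearch code, but it shows how a plain line diff between versions \(i\) and \(i+1\) associates the commit comment \(C\) with the lines it touches.

    import difflib

    def associate_comment(old_lines, new_lines, comment):
        """Map line numbers in version i+1 that were modified or inserted to the commit comment."""
        mapping = {}
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
            if tag in ("replace", "insert"):           # lines changed or added in version i+1
                for line_no in range(j1, j2):
                    mapping.setdefault(line_no, []).append(comment)
        return mapping

    # The propagation phase would then diff successive versions i+1, i+2, ..., j and carry
    # each line's comments forward whenever the line survives unchanged.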

Ying, Wright and Abrams [26] take a different perspective on investigating source code. Using the source code in CVS, the authors propose an approach to study communication through source code comments, with Eclipse as a case study. This is premised on a principle of good programming which asserts that comments should “aid the understanding of a program by briefly pointing out salient details or by providing a larger-scale view of the proceedings” [26]. As part of understanding FLOSS activities, it has been found that comments in these environments are sometimes used for communication purposes. For example, a comment such as “Joan, please fix this method” addresses a direct message to other programmers about a piece of code, whereas such communication is usually located in a separate archive (e.g. CVS).

3.5 Supporting Developers and Analysing Their Contributions

Another approach to mining FLOSS repositories is about providing adequate information to new developers in FLOSS. Given the dynamic mode of operation in FLOSS, it is quite difficult for newcomers who join a project to come up to speed with the large volume of data concerning that project. Hence, a tool called Hipikat was introduced [2, 3] to this end. The idea is that Hipikat can recommend to newcomers key artifacts from the project archives. The tool forms an implicit group memory from the information stored in a project's archives and, based on this information, gives a new developer information that may be related to the task the newcomer is trying to perform [3]. The Eclipse open-source project is used as a case study in applying this approach.

The building blocks of this approach are twofold. First, an implicit group memory is formed from the artifacts and communications stored in a project's history. Second, the tool presents to the new developer artifacts selected from this memory as relevant to the task being performed. A group memory can be understood as a repository used in a FLOSS work group to solve present needs based on historical experience. In essence, the purpose of Hipikat is to allow newcomers to learn from the past by recommending items from the project memory, made of source code, problem reports and newsgroup articles, that are relevant to their tasks [2].

The model comprises four types of artifacts representing the four main objects found in FLOSS projects, as shown in Fig. 3: change tasks (tracking and reporting bugs, as in Bugzilla), source file versions (as recorded in CVS), mailing lists (messages posted on developer forums) and other project documents such as requirements specifications and design documents. An additional entity, Person, represents the authors of the artifacts.
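As a rough illustration of artifact recommendation in the spirit of Hipikat (and not Hipikat's actual implementation), the Python sketch below ranks archived artifacts by TF-IDF cosine similarity to a textual description of the newcomer's task; the artifact ids and texts are assumed inputs.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def recommend(task_description, artifacts, top_k=3):
        """artifacts: list of (artifact_id, text) pairs; returns the top_k most similar ids."""
        texts = [text for _id, text in artifacts]
        vectorizer = TfidfVectorizer(stop_words="english")
        matrix = vectorizer.fit_transform(texts + [task_description])
        scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
        ranked = sorted(zip((a_id for a_id, _ in artifacts), scores),
                        key=lambda pair: pair[1], reverse=True)
        return ranked[:top_k]

    # e.g. recommend("jar export misses .class files",
    #                [("bug-1", "..."), ("cvs-log-2", "..."), ("news-3", "...")])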

Finally, Huang and Liu [9] analyse developer roles and contributions. Like numerous other studies in the literature, this work takes a quantitative approach to analysing FLOSS data. Using CVS as the experimental repository, a network analysis is performed in order to construct social network graphs representing the links between developers and the different parts of a project. Standard graph properties are computed on the constructed networks, giving an overview of developers' activities that explains the fluctuations between developers with lower and higher degree.
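A minimal sketch of such a network analysis is given below, using the networkx library with invented commit records; a real study would extract the developer-module links from CVS logs.

    import networkx as nx

    # invented (developer, module) links for illustration
    commits = [("alice", "parser/"), ("alice", "ui/"), ("bob", "parser/"), ("carol", "docs/")]

    G = nx.Graph()
    for developer, module in commits:
        G.add_edge(developer, module)

    developers = {dev for dev, _ in commits}
    # developers touching many modules (higher degree) are the more central contributors
    degrees = sorted(((dev, G.degree(dev)) for dev in developers),
                     key=lambda pair: pair[1], reverse=True)
    print(degrees)   # [('alice', 2), ('bob', 1), ('carol', 1)]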

Fig. 3. Hipikat architectural model

4 Tools

Central to the sheer volume of work done on mining software repositories are tools. A number of tools have been developed over the years, and we look at a few of them to show which aspects of software repositories can be mined with such tools.

  • CVSSearch. Used for mining CVS comments, this tool takes advantage of two characteristics of CVS comments [25]: first, a CVS comment typically describes the lines of code involved in the commit; second, the description given in the comment often remains valid for many future versions. In other words, CVSSearch allows one to better search the most recent version of the code by looking at previous versions in order to better understand the current one. The tool implements Yao's algorithm highlighted in Sect. 3.

  • CVSgrab. The objective of this tool is to visualise large software projects during their evolution. CVS query mechanisms are embedded in the tool to access CVS repositories both locally and over the internet. Using a number of metrics, CVSgrab is able to detect and cluster files with similar evolution patterns [23]. One of its key features is its ability to interactively show the evolution of huge projects on a single screen, with minimal browsing. The tool's architectural pipeline is given in Fig. 4. As output, CVSgrab uses a simple 2D layout where each file is drawn as a horizontal strip made of several segments. The \(x\)-axis encodes time, so each segment corresponds to a given version of its file. Colour encodes version attributes such as author, type, size, release, or the presence of a given word in the version's CVS comment. On top of colour, texture may be used to indicate the presence of a specific attribute for a version. File strips can be sorted along the \(y\)-axis in several ways, thereby addressing various user questions [23].

  • SoftChange. The purpose of this tool is to help understand the process of software evolution. Based on analysing historical data, SoftChange allows one to query who made a given change to a software project (authorship), when (chronology) and, whenever available, the reason for the change (rationale). Three basic repositories are used with SoftChange for analysis: CVS, bug tracking system (Bugzilla) and the software releases [6].

  • MLStats. This is a tool for mailing-list analysis. Its purpose is to extract details of the emails held in a repository. The data extracted from messages range from senders and receivers to message topics and the timestamps associated with the exchanged emails [1, 15]. The tool relies on the email headers to derive its analysis (a minimal header-extraction sketch is given after Fig. 4).

  • CVSAnalY. This is a CVS and Subversion repository analyser that extracts information from a repository. It is equipped with a web interface through which the analysis results and figures can be browsed [16]. Specifically, CVSAnalY analyses CVS log entries, which record the committer's name, the date of the commit, the committed file, the revision number, the lines added and removed, and an explanatory comment introduced by the committer. The tool provides statistical information about the database, computes several inequality and concentration indices, and generates graphs of the evolution over time of parameters such as the number of commits and the number of committers.

Fig. 4. CVSgrab architectural pipeline
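As an illustration of the kind of header extraction MLStats performs, the fragment below uses only Python's standard library; the mbox file name is hypothetical, and malformed Date headers would need extra handling in practice.

    import mailbox
    from email.utils import parseaddr, parsedate_to_datetime

    def extract_headers(mbox_path):
        """Collect sender, subject and timestamp for every message in an mbox archive."""
        rows = []
        for message in mailbox.mbox(mbox_path):
            _name, sender = parseaddr(message.get("From", ""))
            date = message.get("Date")
            rows.append({
                "sender": sender,
                "subject": message.get("Subject", ""),
                "timestamp": parsedate_to_datetime(date) if date else None,
            })
        return rows

    headers = extract_headers("project-devel.mbox")   # hypothetical mailing-list archive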

5 Process Mining for Knowledge Discovery in Event Logs

Process mining is used as a method for reconstructing processes as executed from event logs [24]. Such logs are generated from process-aware information systems such as Enterprise Resource Planning (ERP), Workflow Management (WFM), Customer Relationship Management (CRM), Supply Chain Management (SCM) and Product Data Management (PDM) [20]. The logs contain records of events such as activities being executed or messages being exchanged on which process mining techniques can be applied in order to discover, analyse, diagnose and improve processes, organisational, social and data structures [4].

Van der Aalst et al. [20] describe the goal of process mining to be the extraction of information on the process from event logs using a family of a posteriori analysis techniques. Such techniques enable the identification of sequentially recorded events where each event refers to an activity and is related to a particular case (i.e. a process instance). They also can help identify the performer or originator of the event (i.e. the person/resource executing or initiating the activity), the timestamp of the event, or data elements recorded with the event.

Current process mining techniques evolved from Weijters and Van der Aalst's work [24], whose purpose was to generate a workflow design from information recorded on workflow processes as they take place. It is assumed that, in the event logs, each event refers to a task (a well-defined step in the workflow), each task refers to a case (a workflow instance), and events are recorded in a certain order. Weijters and Van der Aalst [24] combine techniques from machine learning and workflow nets in order to construct Petri nets, which provide a graphical but formal language for modeling concurrency, as seen in Fig. 5.

Fig. 5. Example of a workflow process modeled as a Petri net

The preliminaries of process mining can be explained starting with the following \(\alpha \)-algorithm. Let \(W\) be a workflow log over \(T\) and \(\alpha (W)\) be defined as follows.

  1. \(T_W = \{ t \in T \mid \exists \sigma \in W.\ t \in \sigma \}\)

  2. \(T_I = \{ t \in T \mid \exists \sigma \in W.\ t = first(\sigma ) \}\)

  3. \(T_O = \{ t \in T \mid \exists \sigma \in W.\ t = last(\sigma ) \}\)

  4. \(X_W = \{ (A,B) \mid A \subseteq T_W \ \wedge \ B \subseteq T_W \ \wedge \ \forall a \in A\ \forall b \in B.\ a \rightarrow _W b \ \wedge \ \forall a_1, a_2 \in A.\ a_1 \#_W a_2 \ \wedge \ \forall b_1, b_2 \in B.\ b_1 \#_W b_2 \}\)

  5. \(Y_W = \{ (A,B) \in X_W \mid \forall (A',B') \in X_W.\ A \subseteq A' \ \wedge \ B \subseteq B' \Longrightarrow (A,B) = (A',B') \}\)

  6. \(P_W = \{ p_{(A,B)} \mid (A,B) \in Y_W \} \ \cup \ \{ i_W,o_W \}\)

  7. \(F_W = \{ (a,p_{(A,B)}) \mid (A,B) \in Y_W \ \wedge \ a \in A \} \ \cup \ \{ (p_{(A,B)},b) \mid (A,B) \in Y_W \ \wedge \ b \in B \} \ \cup \ \{ (i_W,t) \mid t \in T_I \} \ \cup \ \{ (t,o_W) \mid t \in T_O \}\)

  8. \(\alpha (W) = (P_W,T_W,F_W)\).

The \(\alpha \)-algorithm proceeds as follows [4]: the log traces are examined and the algorithm creates the set of transitions (\(T_W\)) in the workflow (Step 1), the set of output transitions (\(T_I\)) of the source place (Step 2), and the set of input transitions (\(T_O\)) of the sink place (Step 3). Then the algorithm creates \(X_W\) (Step 4) and \(Y_W\) (Step 5), used to define the places of the mined workflow net. In Step 4, it discovers which transitions are causally related: for each tuple \((A, B) \in X_W\), each transition in set \(A\) causally relates to all transitions in set \(B\), while no transitions within \(A\) (or within \(B\)) follow each other in any firing sequence. Note that the OR-split/join requires the fusion of places. In Step 5, the algorithm refines \(X_W\) by keeping only the largest elements with respect to set inclusion; in fact, Step 5 establishes the exact number of places of the mined net (excluding the source place \(i_W\) and the sink place \(o_W\)). The places are created in Step 6 and connected to their respective input/output transitions in Step 7. The mined workflow net is returned in Step 8 [4].
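A brute-force Python sketch of Steps 1-8 is shown below. It enumerates all candidate pairs of non-empty task sets, so it is only suitable for tiny logs, and it derives the ordering relations \(\rightarrow _W\) and \(\#_W\) from direct succession as defined formally later in this section; it is an illustration of the algorithm, not a production miner.

    from itertools import chain, combinations

    def non_empty_subsets(items):
        items = sorted(items)
        return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))

    def alpha(log):
        """log: list of traces, each trace a sequence of task names (e.g. the string 'ABCD')."""
        TW = {t for trace in log for t in trace}                                # Step 1
        TI = {trace[0] for trace in log}                                        # Step 2
        TO = {trace[-1] for trace in log}                                       # Step 3
        direct = {(tr[i], tr[i + 1]) for tr in log for i in range(len(tr) - 1)}     # >_W
        causal = lambda a, b: (a, b) in direct and (b, a) not in direct             # a ->_W b
        unrelated = lambda a, b: (a, b) not in direct and (b, a) not in direct      # a #_W b

        XW = [(set(A), set(B))                                                  # Step 4
              for A in non_empty_subsets(TW) for B in non_empty_subsets(TW)
              if all(causal(a, b) for a in A for b in B)
              and all(unrelated(a1, a2) for a1 in A for a2 in A)
              and all(unrelated(b1, b2) for b1 in B for b2 in B)]
        YW = [(A, B) for A, B in XW                                             # Step 5: maximal pairs
              if not any(A <= A2 and B <= B2 and (A, B) != (A2, B2) for A2, B2 in XW)]

        place = lambda A, B: ("p", frozenset(A), frozenset(B))
        PW = [place(A, B) for A, B in YW] + ["i_W", "o_W"]                      # Step 6
        FW = ([(a, place(A, B)) for A, B in YW for a in A]                      # Step 7
              + [(place(A, B), b) for A, B in YW for b in B]
              + [("i_W", t) for t in TI] + [(t, "o_W") for t in TO])
        return PW, TW, FW                                                       # Step 8

Running alpha(["ABCD", "ACBD", "AED"]) on the example log used later in this section yields four internal places, one for each of the pairs ({A},{B,E}), ({A},{C,E}), ({B,E},{D}) and ({C,E},{D}), plus the source and sink places.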

Fig. 6. A view of modeled activities in order and purchasing processes

Fig. 7. Process model produced as a result of process mining

From a workflow log, four important relations are derived upon which the algorithm is based. These are \(>_W\), \(\rightarrow _W\), \(\#_W\) and \(\parallel _W\) [4].

In order to construct a model such as the one in Fig. 5 on the basis of a workflow log, the workflow log has to be analysed for causal dependencies [22]. For this purpose, the log-based ordering relation notation is introduced: Let \(W\) be a workflow log over \(T\), i.e. \(W \in P(T*)\). Let \(a, b \in T\). Then

  • \(a >_W b\) if and only if there are a trace \(\sigma = t_1 t_2 t_3 \dots t_{n-1}\) and an integer \(i \in \{1,\dots ,n-2 \}\) such that \(\sigma \in W\), \(t_i = a\) and \(t_{i+1} = b\);

  • \(a \rightarrow _W b\) if and only if \(a >_W b\) and \(b \not>_W a\);

  • \(a \#_W b\) if and only if \(a \not>_W b\) and \(b \not>_W a\);

  • \(a \parallel _W b\) if and only if \(a >_W b\) and \(b >_W a\).

Consider the workflow log \(W = \{ABCD, ACBD, AED \}\). Relation \(>_W\) describes which tasks appeared in sequence (one directly following the other): \(A >_W B\), \(A >_W C\), \(A >_W E\), \(B >_W C\), \(B >_W D\), \(C >_W B\), \(C >_W D\) and \(E >_W D\). Relation \(\rightarrow _W\) can be computed from \(>_W\) and is referred to as the (direct) causal relation derived from workflow log \(W\): \(A \rightarrow _W B\), \(A \rightarrow _W C\), \(A \rightarrow _W E\), \(B \rightarrow _W D\), \(C \rightarrow _W D\) and \(E \rightarrow _W D\). Note that \(B \rightarrow _W C\) does not hold, because \(C >_W B\) also holds. Relation \(\parallel _W\) suggests potential parallelism; here \(B \parallel _W C\).
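The following short Python fragment spells out these relations for the example log; it reuses the direct-succession idea from the α-algorithm sketch above and simply prints the derived sets.

    from itertools import product

    log = ["ABCD", "ACBD", "AED"]
    tasks = {t for trace in log for t in trace}

    direct = {(trace[i], trace[i + 1]) for trace in log for i in range(len(trace) - 1)}   # >_W
    causal = {(a, b) for a, b in direct if (b, a) not in direct}                          # ->_W
    parallel = {(a, b) for a, b in direct if (b, a) in direct}                            # ||_W
    choice = {(a, b) for a, b in product(tasks, tasks)
              if (a, b) not in direct and (b, a) not in direct}                           # #_W

    print(sorted(causal))    # the causal pairs listed above: A->B, A->C, A->E, B->D, C->D, E->D
    print(sorted(parallel))  # B and C appear in both orders, so they are potentially parallel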

In practice, process mining can produce a visualisation of models, as seen in Figs. 6 and 7, based on the available data (event logs), the purpose of the investigation, the methodology, and the expected output. We consider a simple example of a log of ordering and purchasing operations in an enterprise. The core advantage is the ability to track the activities as they are performed, the actors executing these activities, and the duration of the activities within the overall process model. Additional statistical information can also be provided about the activities in the model, as required and determined by the goals of the analysis.

Details of events and activities are given in Fig. 6. Specifically, the user is presented with a list of activities, the corresponding timestamps, and the authors of these activities over a given period of time. The duration of every single activity is also included in the final report, as is the frequency of occurrence of these activities. A similar analysis conducted on FLOSS data promises to uncover hidden patterns or enhance the visibility of predicted occurrences. In Fig. 7, a graphical representation of the flow of activities is constructed; this can be referred to as a process model. It is a reenactment of all selected activities as they occur according to a particular workflow.

6 Conclusion

FLOSS repositories store a sheer volume of data about participants' activities. A number of these repositories have been mined using the techniques and tools we have discussed in this paper. However, to date there has been no concrete investigation into how logs from FLOSS repositories can be process mined for analysis. This may be attributed to two apparent factors: firstly, researchers interested in mining software repositories have not come across process mining, and thus its value remains unexploited; secondly, the format of the data recorded in FLOSS repositories poses a challenge for constructing event logs. Nevertheless, after reviewing existing mining techniques and the analyses they provide on the data, one can infer the type of input data and the expected output, and thus construct logs that can be analysed with any of the recognised process mining tools, such as the ProM framework or Disco. The example presented previously was carried out using Disco as the visualisation tool. This approach can bring additional flair and extensively enrich data analysis and visualisation in the realm of FLOSS data. In our future work, we plan to produce tangible examples of process models reconstructed from logs of FLOSS members' daily activities. These logs can be built from mailing archives, CVS data and bug reports. Our data source is OpenStack [11], an environment that brings together thousands of developers and users, as well as more than 180 participating organizations, working on a number of projects and components for open source cloud operating systems. We make use of dumps of data from this platform to produce empirical evidence of learning processes using process mining techniques. With a clearly defined objective and the type of data needed, process mining promises to be a powerful technique for providing empirical evidence in software repositories.