1 Introduction

A number of studies currently provide evidence of learning opportunities in FLOSS environments [1, 10, 12–17, 22]. As part of this substantiation, FLOSS communities have been established as environments where successful collaborative and participatory learning between participants occurs [14, 16, 17].

Moreover, the level of interest in, as well as the attention drawn to, the occurrence of learning within FLOSS has attracted practitioners in tertiary education to consider incorporating participation in FLOSS projects as a requirement for some Software Engineering courses [12, 14, 24]. A number of pilot studies have been conducted in order to evaluate the effectiveness of such an approach in traditional learning settings [10–13, 20]. To aid in this endeavor, our previous work emphasized how learning occurs in terms of phases [2, 19]. To this end, it has been proposed that a typical learning process in FLOSS occurs in three main phases: Initiation, Progression and Maturation. In each phase, a number of activities are executed through interactions between Novices and Experts. A Novice is any participant in search of knowledge, while the knowledge provider is referred to as the Expert. Figure 1 depicts the categorization of the learning phases, with the Initiation Phase corresponding to understanding on the x axis as a learning stage, while Progression and Maturation correspond to practicing and developing respectively. The gray area in Fig. 1 represents the progression of users as they gradually perform the types of activities on the y axis.

Fig. 1. Learning stages and participants’ learning progression in OSS communities [2]

In this paper, we present an approach for mining these learning phases from FLOSS data. For illustrative purposes, we detail our approach and present the results for the understanding (Initiation) phase, which is at the bottom of the scale in Fig. 1. In this phase, FLOSS participants get involved in the projects by reviewing and communicating with the purpose of understanding contents, without producing any tangible contributions. Initiation is a critical stage in which the participant accesses project repositories, exchanges emails and posts messages seeking information or making requests. Figure 1 also shows how, in the practicing and developing phases, the participants’ activities gradually move from simply using to posting and making significant contributions through commits [2].

FLOSS repositories, such as CVS, bug reports, mailing archives, Internet relay chats etc., contain all traces of participants’ activities as they work in these environments. Singh et al. [6] argue that the FLOSS environment typically includes discussion forums or mailing lists to which users can post questions and get help from developers or other users. These forums are unrestricted and act as a learning environment for novices and experts alike. While many studies have provided invaluable insights in this direction, their results are mostly based on surveys and observation reports [7, 8, 21–23, 24]. Our paper contributes in this context by studying learning activities from FLOSS repositories using process mining. In particular, the paper focuses on tracing and visualizing the learning activities, as well as their flow of occurrence, as collected in the mailing archives of a FLOSS platform called OpenStack [25].

Our major contributions are the approach used in analyzing the data, the application of process mining and, most importantly, the discovered empirical evidence of traces of learning activities. The rest of the paper is structured as follows. Section 2 provides preliminary details on mining the data and constructing the log, and succinctly describes the Initiation Phase of the learning process. In Sect. 3 we discuss the data collection and analysis and then present the empirical results. Section 4 concludes the paper.

2 Preliminaries: Mining Data and Catalog of Key Phrases

In order to identify activities and construct the event logs needed for our analysis, we undertake a number of tasks. The first task is analyzing the contents of emails. Text mining appears to be the most direct solution for this task, as we need to analyze the contents of a post or email and deduce a corresponding activity. Current text mining tools such as Carrot2, GATE, OpenNLP, RapidMiner and KH Coder appear not to be appropriate for the kind of analysis we want to conduct. Tracing learning activities requires semantic interpretation of email contents, and this could not be achieved with any of these tools.

Therefore, we considered making use of semantic search with MS SQL as a fitting alternative. Semantic search improves search by understanding the contextual meaning of the terms and tries to provide the most accurate answer for a given text document. However, this also requires the use of key phrases to steer the search [18]. Our choice of key phrases is based on a number of studies conducted in FLOSS with regard to the kinds of questions and answers that are exchanged in FLOSS communication environments [3–5, 27]. We start from this categorization of questions and responses and then derive a number of key phrases. We aim to include all the identified key phrases and expressions within the context of identifying learning activities and establishing the learning process across its three phases, although in this paper we only present the details of the first phase.

We make use of previous findings [3, 4], a formal model of learning activities in FLOSS communities [19], as well as lexical semantics to draw up a catalog of key phrases for this endeavor. Lexical semantics builds from synonyms of terms and their homonyms to derive the meaning of words in specific contexts. Hence, making use of semantic search is paramount, as it promises to capture the meaning of message contents as closely as possible when identifying activities. Figure 2 presents a catalog that contains the key phrases that semantically identify activities, categorized according to the participants’ roles in the Initiation Phase of the learning process.

Fig. 2. Catalog of key phrases for the Initiation Phase

Principal activities in the Initiation Phase of the learning process gravitate around observing and making contacts [19]. Ideally, this step constitutes an opportunity for the Novice to ask questions and get help with specific requests, while the Expert intervenes at this point to respond to such requests.

On the one hand, a Novice seeking help can execute a number of activities. These include FormulateQuestion, IdentifyExpert, PostQuestion, CommentPost or PostMessage, ContactExpert and SendDetailedRequest. On the other hand, the main activities undertaken by the Expert during the same period include ReadMessages on the mailing lists or chat messages, ReadPost from forums, ReadSourceCode, as any participant commits code to the project, as well as CommentPost, ContactNovice and GiveFeedback.
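To make the mapping concrete, the following minimal Python sketch shows how such a catalog fragment could be encoded and looked up. The key phrases are illustrative assumptions rather than the actual entries of Fig. 2; only the activity and role names come from the list above.

```python
# Minimal sketch of an Initiation-Phase catalog fragment (cf. Fig. 2).
# The key phrases below are illustrative placeholders, not the real catalog.
INITIATION_CATALOG = [
    # (lc_key phrase,              activity,              role)
    ("how do i",                   "FormulateQuestion",   "Novice"),
    ("does anyone know",           "PostQuestion",        "Novice"),
    ("you seem to know",           "IdentifyExpert",      "Novice"),
    ("could you help me with",     "ContactExpert",       "Novice"),
    ("here are the details",       "SendDetailedRequest", "Novice"),
    ("have you tried",             "CommentPost",         "Expert"),
    ("can you give more details",  "ContactNovice",       "Expert"),
    ("this should solve",          "GiveFeedback",        "Expert"),
]

def match_activities(message_text):
    """Return the (activity, role) pairs whose key phrase occurs in the text."""
    text = message_text.lower()
    return [(activity, role)
            for phrase, activity, role in INITIATION_CATALOG
            if phrase in text]
```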

In order to conduct our analysis, we need to identify the most appropriate repository. The main criterion for this decision is the existence of some form of communication exchange between FLOSS members in the candidate repository. Mailing archives contain email messages between FLOSS members discussing topics relevant to the community. Some of these topics involve general questions or specific requests about files, pieces of code or even the use of new plug-ins. Hence, the mailing archives provide adequate details to track activities and explain their flow of occurrence in the Initiation Phase. Moreover, it is worth noting that the same approach can be applied to mine the remaining phases from other repositories such as source code or commits.

3 Data Collection and Analysis

The FLOSS platform used in our analysis is OpenStack [25]. According to Wikipedia, “OpenStack is a free and open-source cloud computing software platform. Users primarily deploy it as an infrastructure as a service (IaaS) solution. The technology consists of a series of interrelated projects that control pools of processing, storage, and networking resources throughout a data center—which users manage through a web-based dashboard, command-line tools, or a RESTful API. It is released under the terms of the Apache License” [25].

We considered this platform mainly due to the availability of data on its email archives and because it is still an active platform. The mailing archives database is made up of 7 tables that store data pertaining to compressed files (source code files, bugs), the mailing lists for group discussions and topics of interest, the number of messages exchanged, as well as details of the individuals involved in these exchanges, as shown in Table 1.

Table 1. Details of mailing archives elements from OpenStack

This repository contains exactly 54762 emails exchanged between 3117 people registered on 15 distinct mailing lists. These emails were sent over a period spanning from 2010 to 2014. The messages are of typical email length, with an average of 3261 characters; the longest email contains 65535 characters and the shortest is a single character.
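These figures can be reproduced with a short script once the archive is exported to a flat table; the file name and column names below (body, sender_id, mailing_list) are assumptions about such an export, not the actual schema of the OpenStack dump.

```python
import pandas as pd

# Assumed export: one row per email, with columns 'body', 'sender_id'
# and 'mailing_list' (names are placeholders, not the real schema).
messages = pd.read_csv("openstack_mailing_archives.csv")

lengths = messages["body"].str.len()
print("emails:         ", len(messages))                      # 54762 reported
print("senders:        ", messages["sender_id"].nunique())    # 3117 reported
print("mailing lists:  ", messages["mailing_list"].nunique()) # 15 reported
print("average length: ", round(lengths.mean()))              # 3261 reported
print("longest email:  ", lengths.max())                      # 65535 reported
print("shortest email: ", lengths.min())                      # 1 reported
```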

In order to analyze this data set, we make use of process mining techniques. The key in process mining is to identify events. An event is essentially a tuple made up of a case ID, a performer, an activity and any relevant attributes needed for the analysis. In our case, we include the phase of the learning process, the date, as well as the role (Novice or Expert). Other key components are the catalog of key phrases shown in Fig. 2 and the data set. Based on all these elements, we generated our event log, which is the set of all identified events.

An event E is a sextuple (t, a, p, d, s, r) such that: t is the case in the event and can be either a topic on emails or an issue number on code and bug reports; a is the activity; p is the participant; d is the relevant date of occurrence; s is the state of the learning process; and r is the participant’s role in the process.
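Written down directly, the sextuple corresponds to a small record type; the following is only a sketch of that structure, not code from the study.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Event:
    """One entry of the event log: the sextuple (t, a, p, d, s, r)."""
    t: str    # case: an email topic, or an issue number for code and bug reports
    a: str    # activity, e.g. "PostQuestion"
    p: str    # participant (email sender)
    d: date   # date of occurrence
    s: str    # state of the learning process, e.g. "Initiation"
    r: str    # role of the participant: "Novice" or "Expert"
```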

Moreover, we refer to the catalog introduced earlier to retrieve the mappings between key phrases, activities, states and participants. Let c1, c2 and c3 be the catalogs for Initiation, Progression and Maturation respectively. We distinguish between key phrases for states and key phrases for activities. We refer to key phrases for states as gl_key (global keys), while the key phrases that help distinguish activities are referred to as lc_key (local keys). Let C be the set of all our catalogs; we define a catalog entry as a sextuple (ci, gl_key, state, lc_key, activity, role) such that: ci ∈ C is a single catalog, gl_key is the key phrase for the identification of a state, state is the state as it appears in the catalog, lc_key is the key phrase used to identify an activity, activity is the corresponding activity in the catalog, and role is the role as it appears in the catalog. Using this information, we generate the event log to be analyzed through process mining.
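Combining the catalog lookup and the event structure, the log construction can be sketched as follows. The sketch reuses the messages table, match_activities and Event from the previous sketches, and assumes additional topic and sent_date columns as well as an illustrative list of global key phrases; none of these names come from the actual implementation.

```python
import pandas as pd

# Illustrative global key phrases flagging the Initiation state.
GL_KEYS = ["i am new to", "getting started", "first time using"]

def build_event_log(messages):
    """Scan each email and emit one Event per matched (lc_key) activity."""
    log = []
    for _, msg in messages.iterrows():          # assumed columns: body, topic,
        text = msg["body"].lower()              # sender_id, sent_date
        if not any(key in text for key in GL_KEYS):
            continue                            # state (gl_key) not matched
        for activity, role in match_activities(msg["body"]):
            log.append(Event(t=msg["topic"],
                             a=activity,
                             p=msg["sender_id"],
                             d=pd.to_datetime(msg["sent_date"]).date(),
                             s="Initiation",
                             r=role))
    return log

event_log = build_event_log(messages)
```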

3.1 Process Mining Mailing Archives

In order to process mine these records, we choose Disco (Discover Your Processes) [26], an appropriate tool for analyzing the identified events and providing efficient visualizations to demonstrate the workflow of occurrence of activities in these processes. Disco is a toolkit for process mining that enables the user to provide a preprocessed log specifying case, activities, originator and any other needed attributes. The tool performs automatic process discovery from the log and outputs process models (maps) as well as relevant statistical data.

In essence, Disco applies process mining techniques in order to construct process models based on available logging data that is organized into an event log. This logging data comprises the details about transactions that can be found in log files or transaction databases. Therefore, an event log can take a tabular structure containing all recorded events that relate to executed business activities [26].
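In practice, Disco can import such a tabular log from a CSV file and lets the user map its columns to case ID, activity, timestamp and resource. A sketch of exporting the log built above follows; the column names are chosen here for illustration, not prescribed by Disco.

```python
import csv

# Flatten the event log into a tabular form for import; in Disco the
# columns are then mapped to Case ID, Activity, Timestamp and Resource,
# with 'state' and 'role' kept as additional attributes.
with open("initiation_event_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["case", "activity", "timestamp", "resource", "state", "role"])
    for e in event_log:
        writer.writerow([e.t, e.a, e.d.isoformat(), e.p, e.s, e.r])
```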

Making use of Disco, we produced the Process Models representing the occurrence of learning activities as documented by their corresponding email messages.

For simplicity, we choose to represent the models through a graphical workflow as well as the statistical information as provided by Disco. Disco offers the possibility for a process model to be represented with frequency metrics that explain the flow of occurrence of events.

The main objective of the frequency metrics is to depict how often certain parts of the processes have been executed. We can distinguish three such metrics: absolute frequency, case frequency and maximum repetitions. We use these metrics to model the learning activities executed by both Novices and Experts.
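To illustrate what these metrics measure, the following sketch computes them directly from the exported tabular log: absolute frequency counts every occurrence of an activity, case frequency counts the cases in which the activity appears at least once, and maximum repetitions is the largest number of occurrences of the activity within a single case.

```python
import pandas as pd

log = pd.read_csv("initiation_event_log.csv")   # as exported above

absolute_freq = log.groupby("activity").size()
case_freq = log.groupby("activity")["case"].nunique()
max_repetitions = (log.groupby(["activity", "case"]).size()
                      .groupby(level="activity").max())

print(pd.DataFrame({"absolute frequency": absolute_freq,
                    "case frequency": case_freq,
                    "max repetitions": max_repetitions}))
```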

Additional details regarding the numerical measures such as events over time, active cases during this given period of time, case variants, the number of events per case as well as case duration could be plotted as needed. However, for simplicity and effectiveness, we represent only major statistical details that are most representative of the presence, impact and occurrence of learning activities in FLOSS over the chosen period of time.

3.2 Empirical Results

Before we unpack the details of the process models for the Novice and the Expert, we give some crucial details about the overall Initiation Phase. It should be noted that, in Figs. 3 and 4, the numbers, the thickness of the arcs or edges, and the coloring in the model illustrate how frequently each activity or path has been performed. For the purpose of this paper, we retained the topic of the emails, the messages themselves, the people involved in exchanging these emails, the resulting activities and the classification of where such activities fall in our defined learning curve in order to build events and produce the event log used for model extraction.

Fig. 3. Process model for the Novice, per frequency (Initiation Phase), in mailing lists

Fig. 4. Process model for the Expert, per frequency (Initiation Phase), in mailing lists

The analysis of the Initiation Phase of the learning process is carried out on data covering the period between the 11th of November 2010 and the 6th of May 2014. During this time, a total of 123401 events were generated. An event represents a tuple made up of the case (in this context, the discussion topic), the email sender, as well as the relevant learning activity. With about 565 cases, a total of 14 activities are executed, with an average case duration of 69.9 days and a median duration of 57.8 days.

We can also point out that participants in search of knowledge account for the majority of activities, with a total of 122838, amounting to 99.54 % of all executed activities at this point. Experts intervene at a much lower rate of 0.36 %, with 440 activities, ahead of participants doing something other than exchanging knowledge, who account for 123 activities.

The process model depicted in Fig. 3 represents the workflow of all the activities performed by the Novice during the first phase of the learning process. The numbers, the thickness of the arcs or edges, and the coloring in the model indicate how often an activity has been executed by the Novice, as well as how often one activity links to another (a path). We note that the Novice in OpenStack has engaged in a number of learning activities throughout this period of time. Figure 3 shows that in 51 cases the process starts from formulating a question, posting the question, commenting on a post (which has occurred about 27 times), posting a message, which indicates that an expert has been identified, contacting that expert and sending a detailed request to the expert by commenting on a post. The number attached to a transition from one activity to another indicates how many times on average this transition occurred.

Moreover, Fig. 3 also shows that in 502 cases, a Novice starts by commenting on a post first, then formulates a question and follows the process as explained above. Sometimes (101 times in the depicted process map), after identifying an expert, a Novice could go back to formulating another question (or follow-up questions) or even go back to just commenting on the post as part of the interaction.

The process model depicted in Fig. 4 represents the workflow of all the activities performed by the Expert during the first phase of the learning process. One should note that 6 main activities are undertaken by the Expert. In some cases, the Expert contacts the Novice by commenting on a post or giving feedback regarding a request from the Novice, then reads messages and posts, comments on these posts, and reads source code, especially if the Novice’s requests have to do with source code. We refer to this participant as the Expert because of the reactive nature of these activities: every time an Expert comments, it is in response to a Novice’s request or to ask for further details on an already posted question.

In some instances, the Expert would go back to reading messages after commenting on a post or reading source code, or sometimes contacting the Novice again after commenting on a post involved in the exchange. In some instances, on average in 9 cases, the Expert goes back to contacting the Novice after providing feedback. This is where further details about the request or clarification might be requested.

4 Conclusion

FLOSS environments appear to provide learning opportunities for participants. While this aspect has been previously investigated [7, 8], we believe that there is a need for empirical support for such findings. An important remark [9] emphasizes that in such previous work, content data had been collected using surveys and questionnaires or through reports from observers who have been part of the community for a defined period of time. Research suggests that a growing number of participants are highly motivated to engage in these platforms through discussions and email exchange. These interactions produce massive volumes of data that contain evidence of learning. Since these learning activities are not directly observable in the repositories, we made use of semantic search in MS SQL to identify activities from message texts and constructed the event log that we then analyzed with the help of the Disco process mining tool.

Using a combination of text mining techniques (semantic search), key phrases and a set of rules, we believe that our approach has provided insights on how to find traces of interaction and learning activities in FLOSS email messages. We can thus say that it is feasible to trace these activities and that process mining can play a catalytic role in identifying them in FLOSS. Our work aimed to provide evidence of the existence of learning processes in FLOSS environments through the empirical analysis of the OpenStack mailing archives.

Our results demonstrate how these learning processes are extracted and how each activity fits within the global picture. For a Novice, the process in some cases spans from formulating a question, posting the question, commenting on a post, posting a message, which indicates that an expert has been identified, contacting that expert and sending a detailed request to the expert by commenting on a post. In most cases, a Novice starts by commenting on a post first, and then formulates a question before following the process as described above. Six main activities are performed by the Expert, starting from contacting the Novice by commenting on a post or giving feedback regarding a request from the Novice. The Expert then reads messages and posts, comments on these posts, and reads source code, especially if the Novice’s requests have to do with source code. In some instances, the Expert goes back to reading messages after commenting on a post or reading source code, or contacts the Novice again after commenting on a post involved in the exchange. In other instances, the Expert goes back to contacting the Novice after providing feedback; this is where further details about the request, or clarification, might be requested.

Finally, more experiments using this approach can be conducted in order to trace activities in the remaining two phases, Progression and Maturation. Figure 1 indicates how FLOSS contributors start by just observing, in the Initiation Phase, and gradually evolve towards more tangible activities such as posting and committing software artifacts and source code [2]. The types of activities in the Progression Phase include Revert, Post and Apply. After the Expert makes contact, the Novice provides additional details about the previous request through the activity Revert. The Expert gets further involved by providing guidance and the required help to the Novice through Post, while the Novice applies the newly acquired knowledge through Apply. During the Maturation Phase, activities include Analyze, Commit, Develop, Review and Revert. In this phase, the Novice’s acquired skills are expressed through the execution of advanced activities such as producing new code, reviewing new commits and providing assistance during discussions, thus gradually performing the transition from Novice to Expert [19]. The Expert performs the same activities, with an emphasis on transferring knowledge and providing help when needed rather than applying new skills.