Keywords

1 Introduction

Voluntary contributions play an important role in maintaining Open Source Software (OSS) projects active [29]. This is because OSS projects work in a symbiotic way. While communities need to motivate, engage, and retain new developers to remain sustainable [19], a large, globally distributed community of developers wants to contribute for a variety of reasons, including learning, the necessity to fix a bug, and reputation [13, 24, 33].

However, as shown before in several studies [28,29,30], the newcomers face several barriers while joining to OSS projects. This can lead to demotivation and, ultimately, dropouts. Given the importance of the newcomers to the projects and the barriers they face, it is important to study the different aspects of the joining process. As previously stated by Steinmacher et al. [30], joining a project is a complex process composed of different stages and a set of forces that push newcomers towards (motivation and attractiveness) or away (onboarding barriers) from the project. While motivation is something that is usually inherent to the developers, attractiveness is a force that—to some extent—can be managed by the projects.

Some previous studies analyzed project attractiveness by analyzing its relationship with license [22], source code attributes [15, 16], and code base [4]. However, the existing literature does not analyze the temporal aspect of the newcomers’ joining, nor consider the recent phenomenon of social coding environments and their characteristics, which introduced a more standardized way to contribute [9]. This perspective is important, since, according to Capiluppi and Michlmayr [3] “the success of a project is often related to the number of developers it can attract”.

In this paper, we start filling this gap by investigating which projects’ characteristics are related to the increase of newcomers growth temporally in OSS projects maintained on GitHub. To achieve that, we selected a set of factors inherently from the OSS projects that might explain the increase of newcomers. We place these factors in context, measuring their effects on 72 weeks of growth of newcomers in 450 OSS projects. Our approach included clustering the OSS projects in terms of newcomers’ joining growth, aiming to identify different growth patterns. Based on the patterns identified, we leveraged the Random Forest [2] classifier to measure the effects of the projects’ characteristics aiming to explain each pattern.

The contributions of this study include: (i) empirical evidence of different patterns of newcomers growth in OSS projects, adopting a time series analysis; and (ii) identifying the factors that could potentially lead to attraction of newcomers in OSS projects. Ultimately, the results of this work may benefit project maintainers who can get acquainted with ways to make their project’s more attractive, creating a more welcoming environment for newcomers.

2 Methodology

To guide our research towards this goal, we designed the following research questions:

  • RQ1. What are the newcomer joining rates in OSS projects? The answer to this question is relevant to understand if projects receive constant rates of newcomers temporally, or if there are different trends for different projects. In case we find different trends of temporal joining rates, it is worth understanding how these different trends can be classified. By having a comprehensive classification of newcomers joining rate per time, it is possible to go in-depth and explore the reasons for the differences.

  • RQ2. What are the project’ factors that may influence the joining rate? The answer to this question aims to explain what are the factors that may influence different rates of newcomer joining, which may bring to light potential ways to attract more developers to OSS projects.

2.1 Curating the Corpus of OSS Projects

To explore the attractiveness of OSS projects, we retrieved data from 450 OSS projects hosted on the GitHub coding platform. To improve the variety of projects, we first selected the fifteen most popular programming languages used on GitHub [20]. The set of programming languages was composed of C, Clojure, CoffeeScript, Erlang, Go, Haskell, Java, JavaScript, Scala, Objective-C, Perl, PHP, Python, Ruby, and TypeScript. To avoid unmaintained projects, for each selected programming language, we filtered the 30 most starred OSS projects. We followed recent related work that considers stars as a measure of attractiveness in OSS projects [1]. We ended up with 450 OSS projects, written by 15 popular programming languages. The set of sampled projects included:

  • scala/scala: The Scala programming language was released in 2001 and has received more than 31 K commits. Mostly written in Scala.

  • django/django: High-level web application framework. Release in 2005, it has received more than 26 K commits. Mostly written in Python.

  • vim/vim: Highly configurable text editor. Released in 1991, it has received more than 9 K commits. Mostly written in C.

Fig. 1.
figure 1

Stars per project

As one can see, Fig. 1 suggests that even selecting projects with the highest number of stars, the distribution of stars varies significantly in the dataset, ranging from 6 to 39,990 stars (Q1: 3,648; Median: 10,470; Q3: 18,900). The data was collected using the GitHub API, and it was conducted in October 2018. To access the complete dataset and the source code of the tools used in this research are publicly available in our repositoryFootnote 1.

In our analysis, we collected a set of factors from the selected projects’ repositories, including project popularity, maturity, receptivity/welcoming features. We used these factors in our model, and their descriptions are presented in Table 1. We also calculated the Spearman correlation coefficients [12] on the factors in order to remove the strongly correlated ones (\(\rho > 0.7\)). The number of forks was removed from the list of factors since it has a high correlation with the number of stars.

Table 1. Factors extracted from repositories

It is important to note that some of these factors are not straightforwardly available in the GitHub repository, which is the case of the domain of the repositories. For this particular factor, we followed the Borges et al. methodology to define projects domain [1], and manually added it by doing a qualitative analysis over the website and documentation of the projects. As an example, we manually evaluate the Linux websiteFootnote 2 to define it as part of the system software domain.

To characterize the project attractiveness temporally, we collected the newcomers’ growth rate for each project, considering one week as the observation unit. The growth of newcomers is represented by a time series, which associates the evolution of the number of newcomers with the number of weeks existent in each project (newcomers per week). To this end, we define a newcomer as any contributor that submitted their very first contribution (commit) to the master branch. It is worth mentioning that each contributor was considered a newcomer only once, in the particular week that they submitted their first commit.

2.2 Identifying Growth Patterns

To identify the different newcomers’ growth patterns, we clustered the sampled OSS projects growth according to the time series mentioned before. We used the K-Spectral Centroid (KSC) clustering algorithm [32] to create the clusters. The KSC algorithm finds clusters of time series that share distinct temporal patterns, following a similar approach as the used in the classical K-means clustering algorithm [11]. We chose to apply the KSC clustering algorithm because the clustering is performed independently of shifts (i.e., dates) and scale (i.e., volume), focusing rather on the overall shape of the time series [7]. Moreover, the KSC algorithm was applied in well-established papers that follow a similar approach in different contexts [1] and domains [8].

The KSC algorithm requires that all the time series used during the clusterization have the same length. For this reason, we used only the time frame comprising the last 72 weeks of newcomers inflow for each project, considering the date of the dataset creation (October 2018). We use the length of 72 weeks because all projects are at least 72 weeks old. The KSC algorithm also requires the definition of a specific number of k clusters. To decide the best number of k clusters, we used the \(\beta _{CV}\) [17] heuristic. The \(\beta _{CV}\) heuristic is defined as the ratio of two coefficients: variation of the intracluster distances and variation of the intercluster distances. The smallest value of k after which the \(\beta _{CV}\) ratio remains roughly stable should be selected, as a stable \(\beta _{CV}\) implies that new splits affect only marginally the variations of intracluster and intercluster distances [8].

Figure 2 presents an association between the \(\beta _{CV}\) ratios and the k clusters for the newcomers time series. When considering the \(\beta _{CV}\) ratios and the number of projects per cluster, we decided to use \(k = 3\) clusters in the K-SC algorithm. Note that \(k = 4\) clusters could represent a better number of clusters in terms of ratio stability. By testing the clustering algorithm for \(k = 3\) and \(k = 4\), we found that the results were similar, and two groups could clearly be merged. Moreover, having a small number of projects per cluster would affect the future classification of patterns presented in Sect. 2.3. Thus, we kept with \(k = 3\) clusters.

Fig. 2.
figure 2

The \(\beta _{CV}\) ratios

2.3 Identifying Explanations to the Growth Patterns

Our next step was to explore the different newcomers’ growth patterns. To achieve this goal, we used a Random Forest [2] classifier to measure the effects of the independent variables (factors) in the explaining of the dependent variable (growth patterns). We selected Random Forest because it is fast [23], robust in the presence of noise and outliers [21], and has a great performance with numerical and categorical data [25].

To develop our model, we used the RandomForestClassifier class from the scikit-learn framework. The predictors used were the factors defined in Table 1, and the target variables were the growth patterns found. Using a measure called Mean Decrease Impurity (MDI) [14], RandomForestClassifier also provides a ranking of the most important features in the prediction of the target variables. The higher the score of the feature, the greater is its importance. We used the scores obtained from the classifier to understand the relationship between the factors and the growth patterns. Finally, we measured the effectiveness of the classifier using three commonly used metrics of Machine Learning: Precision, Recall, and F-measure. The code related to the clusterization and classification of the growth patterns are publicly available in our repository (See footnote 1).

Table 2. Description of the clusters
Fig. 3.
figure 3

Growth patterns clusters and centroids.

3 RQ1. On the Newcomer Joining Rates in OSS Projects

Figure 3 presents the growth patterns found, which we named as logarithmic, exponential and linear growth of newcomers. We chose these names based on the centroids’ trends, defined as the average growth of the clusters (also presented in Fig. 3). In Table 2 we present a clusters overview, including the number of repositories per cluster and the percentage increase of newcomers obtained from the centroids in 72 weeks. Linear growth includes the highest number of repositories (71.5%), with a percentage increase of newcomers of 19.9%. The Logarithmic growth is the second one with most repositories (15.7%), but different from the Linear growth, it has the lowest increase of newcomers (12.6%). The Exponential growth is the less representative cluster, represented by only 12.6% of the repositories. On the other hand, its percentage increase of newcomers is the highest one (27.6%).

To put these results in a better context, we investigated OSS projects for each one of the growth patterns. In the linear growth, we perceived that projects definitelytyped/definitelytyped, rails/rails, and symfony/symfony, follow this tendency from the very beginning. We perceived a similar trend when analyzing the projects that fit on the logarithm growth. The projects alpaca-lang/alpaca and chaplinjs/chaplin are interesting samples because both are active projects that received their first newcomer only after the first two months of analysis. On the other hand, the projects ionic-team/ionic, synrc/n2o, and DrKLO/Telegram are the ones who fit in the exponential growth. In particular, project Ionic grew from 25 to 45 newcomers in only ten weeks of activity.

figure a

4 RQ2: On the Factors that May Influence the Joining Rate

After running the Random Forest classifier and calculating the MDI for the factors in our prediction model, we rank the projects’ factors. Table 3 presents factors ordered by importance in predicting of newcomers growth patterns. As one could see, the highest score is the number of stars. This finding is particularly interesting because, contrary to well-known beliefs that suggest that newcomers may be more tempted to contribute to OSS projects that are written in a programming language that they are more familiar with, than a popular one. Time to merge appears next in the top factors explaining the newcomers’ growth rate. This is interesting, because it shows a relationship between a newcomers onboarding and the good practice of giving timely review, feedback, and closing pull requests. Completing the list of the top factors (with scores higher than 0.10), we have factors that are intrinsically related to the project, such as age and the number of programming languages used.

Moreover, the presence of text files such as the CONTRIBUTING file, the LICENSE file, and the CODE OF CONDUCT file, which are even recommended as community best practicesFootnote 3, are among the worst ranked factors in our model (they scored 0.0396, 0.0236, and 0.0118, respectively). Still, having issues and pull requests templates, which are also recommended to welcome newcomers, presented very low scores (0.0260 and 0.0183, respectively).

Table 3. Ranking of the most important factors

We also investigated the effectiveness of the classifier in predicting the growth patterns. We used three metrics from scikit-learnFootnote 4 to investigate the effectiveness: Precision, Recall, and F-measure [26]. Precision measures the correctness of the classifier in predicting growth patterns. Recall measures the effectiveness of the classifier in identifying the growth patterns. F-measure is the harmonic mean of precision and recall. Table 4 shows the metrics’ results divided by cluster, along with an overall result based on the micro-average of each metric [27].

Table 4. Precision, recall, and f-measure of the classification model. Divided by clusters, and an overall.

In general, the Random Forest classifier obtained a significant performance, with a micro-average of 72% for precision, recall, and F-measure. The Linear growth group presented the highest results, and almost all its instances were correctly classified (Precision: 75%, Recall: 95%, F-measure: 84%). On the other hand, we could not observe good results for the Logarithmic and Exponential groups. Only 6% of the Exponential instances were identified correctly (Precision: 33%, Recall: 6%, F-measure: 10%). The bad results may be justified by the number of instances analyzed (We only used 450 projects), the number of instances per growth group (since 71.5% of the instances belong to the Linear pattern), and the characteristics used during the prediction (Other characteristics may also affect the distinction of patterns). However, to understand these differences accurately, a broader study is needed.

figure b

5 Related Work

Several studies focus on how newcomers join OSS projects [18, 30, 31]. Von Krogh et al. [31] propose a joining script for developers who want to take part in a project. Similarly, Nakakoji et al. [18] proposed an onion based structure (the onion patch) to explain the OSS joining process. Steinmacher et al. [30] proposed a joining model, in which they represent motivation and attractiveness as forces that influence outsiders to become newcomers to OSS project. In this study, we focus on the characteristics of the project that may explain the attraction of newcomers in terms of temporal onboarding growth.

The attractiveness topic had also been previously studied. For example, Santos et al. [22] defined a theoretical cause-effect model for attractiveness to OSS projects, proposing its typical causes (license type, intended audience, type of project, development status), indicators (hits, downloads, members), and consequences (number of open tasks, time for task completion). They found that projects for end-users and developers have higher attractiveness, that application domain impacts attractiveness, and that projects licensed under most restrictive licenses tend to be less attractive. These results contradict Colazo and Fang’s [5] results, which analyzed 62 projects from SourceForge and found that restrictively licensed projects are more attractive to volunteer OSS developers.

From a different perspective, Meirelles et al. [16] applied the same model as Santos [22], inserting source code metrics as a typical attractiveness causes. They observed that structural complexity and software size (lines of code and number of modules), indicating that structural complexity negatively influences attractiveness, whereas software size positively influences it. Chengalur-Smith et al. [4] analyzed whether codebase size, project age, and niche size (a measure borrowed from ecology) influenced project attractiveness, finding that these three characteristics indeed influence the project’s ability to attract and retain developers.

Although the attractiveness topic has been explored from different perspectives, the studies mentioned to not consider temporal growth trends in newcomer’s onboarding. Moreover, the aforementioned studies do not analyze the attractiveness after the social coding environments become commonplace in OSS development. One exception is the paper by Gupta et al. [10], who analyzed how the adoption of continuous integration impacts developer attraction. However, they analyzed this single intervention, without considering any other project characteristic.

6 Limitations

In a study such as this, there are always many limitations and threats to validity. First, we considered only a limited number of attributes to explain the phenomenon of newcomers onboarding growth. Although we understand that different attributes and different ways to compute the factors should be used, we focused on factors that cover project popularity, maturity, skills required, receptivity/welcoming features.

Second, we focused only on GitHub OSS projects—the largest OSS hosting environment to date —, which means that our findings may not generalize to other platforms with different contributing characteristics. Moreover, we diversified our sample including 30 projects for each of the 15 most popular programming languages in GitHub, which naturally increases the diversity of our sample. However, our focus on the most popular OSS projects may have influenced factors such as the “# of Stars”. Nevertheless, as depicted at Fig. 1, the selected OSS are greatly diverse, when it comes to the number of stars. Still, although large, our dataset clearly does not comprehend the whole universe of OSS projects available. Also, we did not distinguish spare time contributors from employees hired by a software company to contribute to OSS. We are aware that they may have different contributing behaviors [6], but a comprehensive analysis of their joining rate is left this for future work.

Third, the decisions regarding the observation unit (in our case, one week) can also be seen as a limiting factor. However, it is important to note that, when exploring our data, and we found similar behavior for higher time windows. Similarly, we also considered that the use of 72 weeks to represent the joining rate of newcomers may not be sufficient. However, this is a limitation of the KSC algorithm, which requires an equivalent time window for all analyzed series.

Finally, we understand that only three clusters may not represent the diversity of joining rates found in OSS projects. However, we tried to overcome this limitation by using the \(\beta _{CV}\) heuristic. Yet, one might argue that our cluster may not strictly follow the function curves we indicated (exponential, linear, and logarithm). However, to make sense of this, we investigated the coefficients that indicate a strong correlation with the centroids and the growth patterns. Figure 4 shows this relationship. In each figure, there is a black line (the centroid line) and a red dotted line (the trend line that represents the function curve). As indicated, both lines follow roughly the same curve (the closer R2 is to 1 indicates the stronger is the correlation).

Fig. 4.
figure 4

Comparing the growth patterns with the default curves. (Color figure online)

7 Conclusion

In this paper, we investigated the projects’ characteristics that may explain the different patterns of newcomers growth in OSS projects. Through a sequence of quantitative and statistical analyses based on data mined from 450 OSS projects, we were able to uncover several so far unknown behaviors of OSS projects. For instance, we perceived that there are three main onboard growth patterns (namely a logarithm growth pattern, a linear growth pattern, and an exponential growth pattern). Moreover, we also shed some light on the factors that might encourage newcomers to onboard on the OSS projects. The Top-3 ranked factors were: the number of stars, the time to merge a pull-request, and the number of programming languages used. For future work, we plan to leverage qualitative analysis to better understand what external reasons (e.g., new release, recently open sourced, the first page on HackerNews, etc.) might lead some projects to an exponential growth of newcomers.