1 Introduction

Contributor disengagement in open source is widely known as a costly and critical issue [9, 19, 49], as it can directly affect the sustainability of projects. For example, in a recent study Coelho et al. reported that 41% of failed open source projects cited a reason involving the developer team, such as lack of interest or time of the main contributor [9]. Such local (project-level) sustainability issues in open source can have cascading effects on the entire ecosystem because of project interdependencies [12, 53]. So-called “core”, i.e., established, contributors are particularly critical for the sustainability of open source projects [19, 57].

There are many reasons why established contributors disengage. Some may be unavoidable, whereas others could perhaps be prevented through interventions or by providing better community support. Likely there are various dynamics in play, including the role of volunteers as compared to corporate employees [44], the role of external events such as family planning and job changes, and the role of perceived purpose, community support, and stress. Effects might include abruptly leaving the project, but also slow disengagement, or causing rippling frustrations through delays or cynicism.

The goal of our research is to better understand disengagement factors and which established contributors are at risk and when; this will enable us to build and validate a conceptual framework and theory. Moreover, we pursue a data-driven approach, operationalizing uncovered factors based on publicly available trace data. This way, we can identify at-risk open source contributors and communities, and help guide resources (e.g., volunteers, sponsors) toward projects and contributors in need, enhancing the sustainability of the overall ecosystem.

We identify potential disengagement factors from literature on turnover and open source retention, cross-validate them with results from a survey among contributors who recently stopped all open source activities on GitHub, operationalize select factors with public trace data, and finally conduct survival modeling among a set of 206 GitHub users to triangulate the survey results.

Among others, we identify the degree to which contributors work outside of typical office hours and to what degree they engage in support activities as important moderating factors. According to Claes et al. [8], 33% of open source contributors do not follow typical working hours, but instead work nights and weekends. Our survey shows that contributors who work nights and weekends proportionally tend to disengage for different reasons than those working regular hours. In addition, our survey reveals that the most common reasons for complete disengagement relate to transitions in employment, such as graduating from academia, changing employers, and changing roles.

To validate disengagement factors beyond our survey, we model to what degree hypothesized factors—such as working hours, engagement in support activities, and team size, which can be measured in public trace data of contributor activities—can predict the later disengagement of those contributors. To that end, we use the quantitative statistical method of survival modeling. As a key factor in our model, derived from our survey results, we incorporate transitions identified from public CVs of developers. Specifically, we analyze which contributor populations are more resilient to transitions such as job changes.

We find that working predominantly during office hours and experiencing a transition both increase a contributor’s risk of disengagement. Conversely, we find that increased levels of activity and working on more popular projects both decrease a contributor’s risk of disengagement.

In summary, we contribute (1) a survey revealing the reasons behind contributor disengagement; (2) a comparison between different groups of contributors; (3) measures to differentiate between groups, which could be used to help identify at-risk groups and better target support interventions; (4) a novel operationalization of transition data; and (5) a survival model demonstrating which factors are able to predict contributor disengagement.

2 Related Work

Turnover. Prior work has shown that the turnover rate of a project profoundly affects its survival probability [33, 46] and code quality [21]. Approximately 80% of open source projects fail due to contributor turnover related issues [46]. Even within projects that do not outright fail, contributor turnover has a significant adverse effect on software quality [21]. On a project level, contributor disengagement results in knowledge loss, which is a particularly expensive issue [33].

Employee turnover and retention have been broadly studied across many fields [31, 35]. In professional settings, early turnover research has focused often on personal characteristics (e.g., ability, age) and employee satisfaction, measured with hiring tests and surveys, whereas later research has explored many more nuanced factors, such as labor market (e.g., job opportunities), non-work values, and organizational commitment [31]. Research has shown that, while far from all turnover can be explained by dissatisfaction and similar factors [38], there are positive and negative factors that can buffer against shocks such as external job offers [6, 20]. Turnover among volunteers is less explored: Although some research suggests that similar personal and environmental factors influence their decisions to quit [41], other researchers point out that satisfaction and achievement, compatible working hours, training, challenging work, and role identity may play particularly strong roles [25, 34, 50].

Whereas reasons for joining open source [5, 24, 37, 44, 48, 54, 55] and interventions to improve the onboarding experience for new developers [7, 18, 30, 52] have been studied in depth, studies of contributor retention are rarer. Prior research has focused primarily on testing basic attributes [11, 39, 40, 46, 49, 53, 58]. For example, these studies have shown that retention is higher for contributors who have participated longer [39, 49], contributed more code changes [11, 39], and communicated more [11]. A limited amount of prior research has also explored more nuanced factors, such as whether a developer’s gender and social network affect their risk of disengagement [43]. Using surveys, researchers further associated ratings of general dissatisfaction and lack of community identification with higher perceived turnover and turnover intentions [32, 56]. Zhou et al.’s case study of three projects further suggests that commercial participation can crowd out volunteers [58].

Long working hours, lack of sleep, and lack of recovery on weekends are often discussed as stressors. Many studies confirm the importance of “mentally switching off” [1, 4, 51]. In software engineering, several studies have shown the influence of time-related factors: late-night commits and long working sessions are more likely to contain bugs [17, 45], sleep deprivation reduces code quality [22], Monday commit comments use more negative language [29], and time pressure is often seen as an important stressor [36].

Open Source Practitioners Reporting Stress. In addition to the academic literature, open source practitioners also spoke out about frustrations, funding concerns, stress, and even burnout. Often, there are high expectations and copious amounts of pressure placed on established open source contributors.

Many blog posts from maintainers who disengaged share a similar narrative, describing the growing pressures and responsibilities that led to their disengagement. One such blog post describes how “as [my project’s] popularity rose and rose, my drive to continue to create new projects, fell. All while the burden of supporting the needs of the massive user bases of my successful projects and the pressure of maintaining those projects grew.” (Footnote 1)

In addition to blog posts, participants in our survey also explicitly cited a lack of support as a reason for their disengagement. For example: “[The open source project] is increasingly depended upon by other projects, but very few external developers are interested/willing enough to [understand the company] let alone contribute improvements/fixes. The support burden is a good problem to have (people are finding [the project] useful), but it does impose a productivity (and sometimes a motivation) burden.” (P35)

Contributors are broadly expected to maintain their projects. Having a seemingly never-ending list of tasks is another commonly cited reason for disengagement among the aforementioned blog posts and survey respondents. As described in a blog post by a now-retired developer, “working long hours for endless months” was a critical reason for their disengagement. (Footnote 2)

3 Overview: Mixed-Method Research

Our mixed-method empirical study follows a sequential exploratory design [14], combining qualitative and quantitative analysis of survey and GitHub trace data.

Step 1: Survey (Sect. 4). Although the turnover literature (Sect. 2) provides several starting points for potential disengagement factors, there has been only limited research on the actual reasons why open source contributors disengage. Therefore, we decided to ground our research by conducting an open-ended survey among developers who recently disengaged from all public GitHub activities. We furthermore analyze the frequency of self-reported reasons for disengagement to examine whether different populations disengage for different reasons.

Step 2: Survival analysis (Sect. 5). We test to what degree the potential disengagement factors identified statistically explain disengagement. To that end, we operationalize several disengagement factors, including when and what contributors worked on as well as job transitions in historic trace data and public CVs, and use survival modeling [42] to test their significance.

4 Self-reported Reasons for Disengagement (Survey)

4.1 Survey Methodology

To ground our analyses, we surveyed a sample of open source contributors who recently disengaged from all public GitHub activities, asking about their reasons.

Recently Disengaged Established Contributors. We invited open source contributors who stopped all public activity on GitHub after being active for at least 18 months. We identified such contributors from GHTorrent [26] trace data (version 2018-08). We then constructed six-month panels aggregating contributions (commits and issue/pull request events) per person, and selected those contributors who contributed at least 100 commits per six-month period for three consecutive periods, but at most 5 commits in the following period (the five-commit threshold allows for some residual activity). This way, we identified a total of 702 contributors who disengaged (i.e., stopped contributing publicly) within the last year and had public email addresses listed on their GitHub profile pages.
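The selection rule above can be illustrated with a minimal sketch. It assumes commit dates per contributor have already been extracted from the trace data; the function names, the panel origin, and the choice to ignore the day of month when assigning panels are illustrative assumptions, not the pipeline used in the study.

```python
from datetime import date

ACTIVE_MIN = 100    # minimum commits per six-month period while established
ACTIVE_PERIODS = 3  # consecutive active periods required (18 months)
RESIDUAL_MAX = 5    # residual commits tolerated in the following period


def period_index(day: date, origin: date) -> int:
    """Map a date to a six-month panel index, counted in whole months from `origin`.

    Assumption: `origin` is the first day of a month, so the day of month is ignored.
    """
    months = (day.year - origin.year) * 12 + (day.month - origin.month)
    return months // 6


def commits_per_period(commit_dates, origin, n_periods):
    """Count commits per six-month panel; panels without commits stay at 0."""
    counts = [0] * n_periods
    for day in commit_dates:
        idx = period_index(day, origin)
        if 0 <= idx < n_periods:
            counts[idx] += 1
    return counts


def disengagement_period(counts):
    """Return the index of the first panel with at most RESIDUAL_MAX commits
    that directly follows ACTIVE_PERIODS panels with at least ACTIVE_MIN
    commits each, or None if the pattern never occurs."""
    for i in range(ACTIVE_PERIODS, len(counts)):
        window = counts[i - ACTIVE_PERIODS:i]
        if all(c >= ACTIVE_MIN for c in window) and counts[i] <= RESIDUAL_MAX:
            return i
    return None
```

For instance, a contributor with panel counts `[120, 150, 101, 3]` matches the pattern in the fourth panel, whereas `[120, 150, 101, 90]` does not match at all.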

We specifically sampled only previously active contributors with at least 100 commits per period across all of GitHub. Previous research has shown that within a single project, there are many different kinds of contributors, with one of the most popular models being the onion model [15]. With our threshold we target contributors who are likely very active in at least one project, rather than more peripheral or episodic contributors, who may have different motivations [2].

Survey Design. We designed a simple, single-question, open-ended survey, asking “Could you help us understand your reasons for reducing your contributions to GitHub projects?” We chose the open-ended format to avoid priming the participants to ensure organic but relevant responses. We use the single-question format without external survey software, because it reduces the barrier to participation. We invited all 702 identified candidates and received 151 valid answers (21.5% response rate). Our response rate is in line with other GitHub surveys, e.g., [27].

Card Sorting Analysis. We used card sorting, a qualitative content analysis [47] method, to analyze the survey answers. Two researchers reviewed the cards and organized them into mutually agreed upon categories using a ground-up process resulting in 17 subgroups. These subgroups were then further grouped into three overarching themes: Technical, Social, and Occupational. Note that many participants cited multiple reasons, resulting in 239 reasons from 151 responses.

Quantitative Analysis. In addition to identifying common self-reported reasons for disengagement from the survey responses, we explore whether different populations report different reasons. Based on the literature and reports from open source practitioners (cf. Sect. 2), we specifically investigate whether contributors (a) working mostly “regular” office hours or (b) performing more support activities report disengaging for different reasons.

Working Hours: Analyzing GitHub data, we measure what percentage of contributions are made between 7am and 7pm local time, Monday through Friday, captured as indexWorkHours (the slightly wider interval than the traditional 9am to 5pm increases robustness to daylight saving time [8]). To detect the contributor’s local time, we adjusted the UTC times in GHTorrent with the average time zone offset for each developer, collected from a small random sample of their commits after cloning repositories locally. We then separate our survey participants into two groups, Office Hours (more likely paid contributors) and Nights and Weekends (more likely volunteers), based on whether the share of their contributions falling in the office-hour window described above is above or below the average (average \(indexWorkHours = 0.6\); design following prior research [39]).
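To make the indexWorkHours measure concrete, the following sketch computes it from UTC timestamps and a per-developer average offset. The function names are hypothetical, the window is read as 07:00 (inclusive) to 19:00 (exclusive), and assigning contributors exactly at the average to Nights and Weekends is an assumption the text does not specify.

```python
from datetime import datetime, timedelta


def is_office_hours(utc_time: datetime, utc_offset_hours: float) -> bool:
    """True if the event falls on Monday-Friday between 7am and 7pm local time."""
    local = utc_time + timedelta(hours=utc_offset_hours)
    return local.weekday() < 5 and 7 <= local.hour < 19


def index_work_hours(utc_times, utc_offset_hours):
    """Share of contributions made inside the office-hour window (None if no data)."""
    if not utc_times:
        return None
    inside = sum(is_office_hours(t, utc_offset_hours) for t in utc_times)
    return inside / len(utc_times)


def work_hours_group(index, average=0.6):
    """Assign the group label relative to the sample average (0.6 in the survey sample)."""
    return "Office Hours" if index > average else "Nights and Weekends"
```

A contributor with an average offset of -5 hours whose UTC timestamps land half inside and half outside the local window would receive an index of 0.5 and fall into the Nights and Weekends group.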

Support Activity: We also measured indexSupport as the percentage of support activities among all activities, i.e., all non-commit GHTorrent events related to managing issues and pull requests. We distinguish between High Support Work and Low Support Work relative to the mean (\(indexSupport = 0.2\)).
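A comparable sketch for indexSupport is shown below. The event labels are hypothetical placeholders rather than actual GHTorrent table or event names, and treating values exactly at the mean as Low Support Work is an assumption.

```python
from statistics import mean

# Hypothetical labels for non-commit events related to issues and pull requests.
SUPPORT_EVENTS = {
    "issue_event",
    "issue_comment",
    "pull_request_event",
    "pull_request_comment",
}


def index_support(event_types):
    """Share of support events among all recorded activities (None if no data)."""
    if not event_types:
        return None
    support = sum(1 for e in event_types if e in SUPPORT_EVENTS)
    return support / len(event_types)


def split_by_mean(indices):
    """Label each contributor High or Low Support Work relative to the sample mean.

    `indices` maps a contributor identifier to that contributor's indexSupport.
    """
    threshold = mean(indices.values())
    return {
        who: "High Support Work" if value > threshold else "Low Support Work"
        for who, value in indices.items()
    }
```

With eight commits and two support events, `index_support` yields 0.2, which matches the sample mean reported above.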

Note that given the different ways in which we aggregate the survey responses and the relatively small sample size overall, we cannot draw sound statistical conclusions about differences between the (sub)groups. While we report exact numbers, readers should focus on qualitative differences.

Table 1. Self-reported reasons for disengagement in survey

Threats to Survey Validity. As usual for surveys, our results may be affected by a selection bias: contributors who did not answer may have had different reasons for disengaging. To identify contributors who had disengaged, we used public GitHub data, which covers much but not all open source activities, as also visible in 10 (of 151) survey responses that indicate changing platforms. Deriving the survival model data from survey participants enabled modeling only contributors confirmed to have disengaged. Note that we still consider moving to private repositories (12 answers) as disengagement from public open source activities. Furthermore, our approach to identifying disengagement looks for sudden disengagement (within a six-month window), and results may not generalize to contributors who disengage more gradually. Contributors may also deliberately or unconsciously self-censor in their answers, providing socially acceptable reasons rather than the real ones, a common concern in turnover research [31]. Note, however, that our survival model (discussed later) is built entirely on historic trace data rather than self-reported answers, and thus reduces this threat.

4.2 Results from Survey

In Table 1, we show the survey results. The most common self-reported reason for disengagement was changing to a job that does not support open source work, and occupational reasons were generally the most frequent.

Furthermore, we observe differences across populations. Contributors who work nights and weekends tend to disengage for different reasons than those who work during office hours: contributors who worked nights and weekends most commonly cited social reasons, whereas those who worked during office hours most commonly cited occupational reasons. The largest difference between the two groups is in the reason Left job where they contributed to OSS, with 19% and 0% citing it respectively.

Next, we turn to the aggregation by type of work. Contributors who do less support work tend to disengage for different reasons than those who do more: in particular, only 67% of the More Support Work group cited at least one Occupational reason, compared to 72% of the Less Support Work group. One possible explanation for the difference between these two groups is that contributors who are less stressed when major life changes occur (e.g., getting a new job or leaving school) are better able to cope with transitions.

Finally, we emphasize a surprising result. For all contributors, occupational reasons such as major life changes (e.g., getting a new job or leaving school) were the most cited (with 106 citations), considerably more than lacking peer support or losing interest, which are more commonly discussed in the literature. This motivated us to consider transitions explicitly in our survival analysis below.

5 Modeling Disengagement Factors (Survival Analysis)

5.1 Survival Model Methodology

We use survival analysis to triangulate the survey results and model the relative strengths of the effects of the three main factors emerging from the survey analysis on the risk of disengagement from public GitHub activity (Office Hours vs Nights and Weekends; High Support Work vs Low Support Work; and Job Transitions). Survival analysis is a statistical modeling technique that specializes in time-to-event data [42] and is particularly suited for modeling right-censored data. In our study, the event is public GitHub disengagement; right censoring can occur for contributors whose last recorded event is very close to the end of the observation period, for whom it is not clear whether they will return to contribute more. In particular, we use a Cox Proportional Hazards regression model [13]. The exponentiated regression coefficients give each variable’s hazard ratio (\(\text {HR}\)), which is analogous to an odds ratio in multiple logistic regression analysis. Briefly, an \(\text {HR} > 1\) indicates an increased risk of observing the event, and an \(\text {HR} < 1\) indicates a decreased risk, relative to a one unit change in a predictor variable (or flipping the value, in case of binary variables), while holding all other predictors constant.
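For reference, the Cox Proportional Hazards model is commonly written in the following textbook form. The symbols \(h_0\), \(\beta_j\), and \(x_j\) are notation introduced here for illustration and do not appear elsewhere in the text.

\[
h(t \mid x_1, \dots, x_p) = h_0(t)\,\exp\left(\beta_1 x_1 + \dots + \beta_p x_p\right), \qquad \text{HR}_j = \exp(\beta_j)
\]

Here \(h_0(t)\) is an unspecified baseline hazard shared by all contributors, \(x_j\) are the predictor variables, and \(\beta_j\) are the estimated coefficients. Increasing \(x_j\) by one unit while holding the other predictors constant multiplies the hazard by \(\text{HR}_j\) at every time \(t\), which is the proportional hazards assumption.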

Data. We collect GitHub data on several variables for the open source contributors who disengaged and responded to our survey (the ‘treatment’ group), as well as for an equal-sized ‘control’ group of contributors who did not disengage. With this design, a survival model estimates which factors are statistically useful for distinguishing the two groups.

For job transition data, we collect publicly available CV data from contributors by following links on their GitHub profiles. Since our data collection is not yet fully automated, we can currently only assemble a dataset of moderate size. We therefore collected data only for our survey participants (plus the control group), because their survey answers validate that they actually disengaged. For non-CV data, we use GHTorrent (Sect. 4). We discard 34 participants for whom we cannot find CVs or similar information from which we can deduce past transitions, leaving us with a dataset of 206 contributors, of whom 103 disengaged. By construction, both groups contributed actively for 18 months (at least 100 commits per six-month period for three consecutive periods; Sect. 4); the ‘control’ group contributors then remained active for at least another six months at similar levels or higher, while the ‘treatment’ group contributors made at most five commits in the following period, i.e., they disengaged.

Model Factors and Operationalization. We compute:

  • Activity level: Prior work has shown that more active contributors are less likely to disengage [11], hence we control for the average quarterly activity level by counting all activities (commits and support) per person.

  • Working hours and support: We use the two factors indexWorkHours and indexSupport as introduced in Sect. 4.1 to characterize the degree of work outside regular working hours (more likely volunteers) and the degree of support activities, both identified as stressors by practitioners (cf. Sect. 2). We compute dummy variables indicating being above or below the mean.

  • Organizational affiliation: Previous research has shown that on a project scale, having an organizational affiliation can help increase developer retention rates [58]. We test whether organizational affiliation has the same effect on engagement on an individual scale as it does on a project scale. Using GHTorrent, we record whether contributors had an Organizational Affiliation listed on their GitHub public profile.

  • Team size: Turnover research regularly reveals social embedding in a team as an antidote to turnover [19]. We operationalize this as the number of contributors per project. Since a contributor may be part of multiple projects, we consider only their main projects (for a contributor, taking all projects with the highest number of contributions that together constitute at least 50% of all contributions) and record the average team size among those projects. ‘Teams’ comprise everyone who authored at least one commit.

  • Project popularity: To control for whether contributors are more likely to disengage from small or very popular projects, we use the number of stars a project has on GitHub as a proxy for its popularity (standard measure in GitHub research [16]). We model popularity in addition to activity level because previous research has shown that the popularity of a project influences its survival probability [53], and we are interested in whether the popularity of a project also affects the survival probability of its contributors on an individual level. For contributors working on multiple projects, we consider the max popularity of the contributor’s active projects (see team size).

  • Transition found: Finally, to operationalize a contributor’s transition data, identified as very important in our survey, we went to their linked publicly available CV and created a binary variable that recorded whether there was a transition present in the last year or not. We considered a transition to be either the stopping or starting of a job or educational program.
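The main-project selection that underlies the team size and popularity factors can be sketched as follows. This is one possible reading of the description, assuming per-project contribution counts, team sizes, and star counts are already available as dictionaries; the function names, the tie-breaking order among equally active projects, and the use of the main projects for the popularity maximum are assumptions.

```python
def main_projects(contributions):
    """Pick the most-contributed-to projects that together cover at least 50%
    of a contributor's activity.

    `contributions` maps a project name to this contributor's contribution count.
    """
    total = sum(contributions.values())
    selected, covered = [], 0
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    for project, count in ranked:
        if covered * 2 >= total:  # stop once half of all activity is covered
            break
        selected.append(project)
        covered += count
    return selected


def team_size_and_popularity(contributions, team_sizes, stars):
    """Average team size and maximum star count over the main projects.

    `team_sizes` maps a project to its number of commit authors;
    `stars` maps a project to its GitHub star count.
    """
    projects = main_projects(contributions)
    if not projects:
        return None, None
    avg_team = sum(team_sizes[p] for p in projects) / len(projects)
    max_stars = max(stars[p] for p in projects)
    return avg_team, max_stars
```

A contributor with 60, 30, and 10 contributions across three projects has a single main project, while one with 40, 35, and 25 has two, because the top project alone covers less than half of the activity.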

Model Diagnostics. We performed the standard model diagnostics: We log transformed variables with highly skewed distributions, as necessary, to reduce heteroscedasticity [23]. We tested for multicollinearity using the variance inflation factor (VIF < 3) [10]. We also inspected Schoenfeld residual plots to graphically diagnose Cox regression modeling assumptions [28].
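As a reminder of what the multicollinearity check measures, the variance inflation factor of predictor \(j\) is conventionally defined as below. The symbol \(R_j^2\) is notation introduced here for illustration.

\[
\text{VIF}_j = \frac{1}{1 - R_j^{2}}
\]

Here \(R_j^{2}\) is the coefficient of determination obtained when predictor \(j\) is regressed on all other predictors. Under this definition, a threshold of \(\text{VIF} < 3\) corresponds to \(R_j^{2} < 2/3\), i.e., less than two thirds of the variance of any predictor is explained by the remaining predictors.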

Threats to Model Validity. Regarding the survival model, statistical power is limited by the small sample size, which is limited by our design of modeling only survey participants with public CV data (due to confirming disengagement with the survey and manual effort required, as discussed). Since our treatment group was limited to the survey respondents, our survival model also has the risk of suffering from selection bias. As usual, our operationalization of factors in our survival model can only capture part of the concept to be measured. While we experimented with different operationalizations of our factors to ensure construct validity and robustness, one needs to be careful in generalizing our results beyond our specific operationalizations.

5.2 Results from Survival Modeling

Table 2 presents the results from the two survival models we created: a base model without the novel transition found variable, and a full model with it.

Table 2. Survival models for contributor disengagement.

The base model had a goodness of fit of \(\mathrm{R}^{2} = 0.21\). The controls behave as expected. Total activity had a hazard ratio of 0.36, meaning it decreases a contributor’s risk of disengaging by a factor of 0.38. Similarly, contributors who work on more popular projects are less likely to disengage (Max number of stars has a hazard ratio of 0.85).

As predicted based on previous research, the workHours dummy affects a contributor’s risk of disengaging, having a hazard ratio of 1.56. This suggests that working during business hours more than the average contributor increases the risk of disengaging by a factor of 1.56. Surprisingly, we do not observe any statistically significant effects of doing more support work than average (the highSupportWork dummy), perhaps due to our operationalization or relatively small sample size.

The full model fits the data better (\(\mathrm{R}^{2} = 0.25\)), meaning that adding in the jobTransition variable helped increase the explanatory power of the model. The jobTransition variable has a hazard ratio of 2.48, meaning, as suggested by the survey results, that experiencing a transition significantly increases a contributor’s risk of disengagement by a factor of 2.48.
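To help read the hazard ratios reported above, the short sketch below converts a ratio into a percentage change in hazard and into the corresponding Cox coefficient. The helper names are hypothetical. The interpretation is per unit change of the predictor as it enters the model, so for any log-transformed predictor the unit is on the log scale.

```python
import math


def hazard_change_percent(hazard_ratio: float) -> float:
    """Percentage change in hazard per one-unit increase of the predictor."""
    return (hazard_ratio - 1.0) * 100.0


def cox_coefficient(hazard_ratio: float) -> float:
    """Regression coefficient corresponding to a hazard ratio (HR = exp(beta))."""
    return math.log(hazard_ratio)


# Hazard ratios quoted in the text above.
reported = {
    "workHours": 1.56,
    "jobTransition": 2.48,
    "max number of stars": 0.85,
}

for name, hr in reported.items():
    print(f"{name}: HR={hr:.2f}, "
          f"change in hazard={hazard_change_percent(hr):+.0f}%, "
          f"beta={cox_coefficient(hr):+.2f}")
```

Read this way, a hazard ratio of 2.48 corresponds to a 148% higher hazard of disengagement for contributors with a recent transition, all other predictors held constant.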

6 Discussion and Conclusions

In this research, we have looked at the reasons why established open source contributors disengage, using a survey with 151 responses and a survival model to quantify factors that predict disengagement. From the grouped analysis of survey results, we learned that the Nights and Weekends and Office Hours groups tend to cite different reasons for their disengagement, and so do the Less Support Work and More Support Work groups.

Importantly, our study shows that operationalizations of different disengagement risk factors using publicly observable trace data are plausible. For example, since occupational reasons were the most commonly cited, we used online public CVs to operationalize the jobTransition variable; however, other commonly cited reasons from the survey may also be operationalizable. Another commonly cited reason was ‘no time, personal circumstance’; more specifically, people often cited having children or getting married. Such circumstances may be observable on social networking platforms. This suggests that data-driven systems could be developed to help identify at-risk groups on a significantly larger scale, instead of having to rely on relatively expensive survey data. This information could be useful to different stakeholders, such as open source foundations and other funding agencies, looking to target support interventions. Overall, support interventions targeted more appropriately could significantly increase the sustainability of open source ecosystems.

We aim to work on these extensions of the research and more, to better understand the reasons why different kinds of established contributors disengage, since defining the problem is the first step to solving it [3].