1 Introduction

The emphasis on delivering business value was one of the leading driving forces behind the adoption of most agile software development approaches. This goal also motivated the introduction of lean thinking into software development processes, with the elimination of waste as a core principle, along with continuous learning through short cycles and frequent builds and the promotion of late changes and fast iterations (Poppendieck and Cusumano 2012).

These movements took place in the context of a business shift toward the digital transformation era, in which disruptive business processes and models are seen as necessary paths to competitiveness, mainly for those companies that are not willing to give significant control of their processes to big software vendors (Andriole 2017). Together with technological evolution (e.g., cloud computing), this compelled the software industry to recall the programming-in-the-small principles (DeRemer and Kron 1976) and to revisit the overwhelming technical complexity and inflexibility of huge, standardized software systems and processes (Andriole 2017). Hence the need to eliminate the waste caused by unnecessary complexity and to promote late changes that keep software systems up to date with changes in business processes.

Despite their origins in manufacturing, lean principles are continuously being explored in new industries, even those involving intensive knowledge work, as is the case of Software Engineering (SE). Staats et al. (2011) state that “knowledge work not only has a context separate from manufacturing but also differs fundamentally in structure, calling into question lean principles’ universal applicability.” On the one hand, agile software development approaches have succeeded in attaining goals such as adaptability and iterative processes (Abrantes and Travassos 2013), although they are very developer-centric and relatively opaque to management regarding effort estimation, duration, and development costs (Maglyas et al. 2012; Fitzgerald et al. 2014). Lean approaches, on the other hand, are more geared towards quantitative measurement and evidence-based decision making (Fitzgerald et al. 2014).

One of the leading lean approaches used in SE is Kanban, which has been increasingly adopted by software organizations (Versionone 2017). Given its rising popularity, researchers are paying increasing attention to this theme, as can be seen in the four secondary studies analyzing different perspectives regarding Kanban (Corona and Pani 2013; Ahmad et al. 2013; Al-Baik and Miller 2015; Ahmad et al. 2018), which cover over 20 primary studies. The technical literature is quite comprehensive in reporting evidence regarding the benefits expected from Kanban and the challenges involved in its utilization.

However, despite the essential efforts in organizing a body of knowledge as observed in these four secondary studies (systematic reviews and mappings), there is still a lack of synthesis of the benefits and challenges of Kanban. Research syntheses are essential to provide a summarization, integration, combination, and comparison of findings from different studies. They are proposed on the premise that single studies are limited in the extent to which they may be generalized (Cruzes and Dybå 2011). Thus, a research synthesis represents a vital knowledge tool employed to manage and put scientific findings to use (Santos and Travassos 2016).

The primary goal of this paper is to investigate and identify the benefits and challenges of using Kanban in SE evidenced in the technical literature. It is a fundamental step in organizing an empirically grounded reference for supporting decision-making on this subject in SE. The aggregated evidence presented in this paper also aims to help software practitioners understand and analyze the benefits and challenges of adopting and using Kanban in their software projects. In addition, it aims to support SE researchers in identifying areas where further research is needed to consolidate, understand, or evolve the current knowledge regarding the use of Kanban in SE.

This paper is organized as follows. The next section briefly presents the basic concepts of Kanban and lean thinking. In Section 3, the study methodology is detailed, showing how primary studies were selected and aggregated using the Structured Synthesis Method (SSM). Section 4 describes how the primary studies were analyzed before aggregation. In the SSM, the primary studies have to be translated into diagrammatic representations used to aggregate the studies’ outcomes. In Section 5, the aggregation process and results are presented. The main benefits and challenges are explained and detailed. Then, Section 6 examines the results in the light of the existing body of evidence and discusses what can be learned from the aggregation. The threats to the validity of this synthesis study are explored in Section 7, and Section 8 concludes this paper.

2 Background

The Lean Methodology, also known as the Toyota Production System, is a production management process developed by Taiichi Ohno in the 1940s in the context of the Japanese manufacturing industry. In its conception, the term “Lean” was primarily associated with the reduction of costs (Ohno and Bodek 1988) through the “elimination of waste” or “doing more with less” (Conboy 2009). Over time, it also became focused on value for the customer and flow of work. It is common to refer to the lean concept as “lean thinking,” meaning that it is a mental model of how the world works (Poppendieck and Poppendieck 2013). In Womack and Jones (1997) the core lean principles are defined as follows:

  • Value: can only be defined by the customer. Also, it must be expressed regarding a specific product;

  • Value stream: the course of action through which a specific product must go to make it available;

  • Flow: the pursuit of continuous production keeping interruptions at a minimal level;

  • Pull: the customer pulls the product from the producer when it is needed rather than pushing the products, often unwanted, onto the customer;

  • Perfection: a virtuous cycle is created by the interaction of the previous four principles. “Organizations begin to accurately specify value, identify the entire value stream, make the value-creating steps for specific products flow continuously, and let customers pull value from the enterprise.”

The Lean philosophy uses a number of tools to support the management of its operation. One of these tools is called kanban (based on the Toyota Production System). In contrast, there is an adaptation of kanban, made by Anderson (2010), which is written with a capital K (Kanban). In this paper, our focus is on the latter, the capital-K Kanban used in the software development context. Kanban is an approach to visualize the workflow of a production system. It makes use of queueing theory to control and improve the value stream by focusing on the production flow. In SE, David Anderson was the first to use Kanban, in 2004, with a software development team at Microsoft. According to Anderson (2010), Kanban has five principles:

  • Visualize workflow: The board is the primary tool used to visualize and coordinate teamwork. Its columns show a sequence of activities, where the cards represent the features under work;

  • Limit work in progress: a work-in-progress (WIP) limit is a way to manage and restrict the amount of work in progress. There should always be a way to limit the work and to signal when a new task can be pulled;

  • Measure and manage flow: different statistics and diagrams can be used to monitor the Kanban process such as cycle/lead time, queue size, and cumulative flow diagrams;

  • Make process policies explicit: policies are an essential part of assuring that the flow is achieved. They establish the conditions to make the pull system work. They include, for instance, how to assign tasks and activities to developers and when a work item can be pulled from one state to another;

  • Use models to recognize improvement opportunities: three models are suggested: (i) the Theory of Constraints, (ii) a subset of ideas from Lean Thinking that identifies wasteful activities as economic costs, and (iii) some variants that focus on understanding and reducing variability.

In SE, a Kanban system is usually implemented as a board on a wall with columns representing the different development process stages, i.e., the value stream (Poppendieck and Cusumano 2012). Cards are used to describe pieces of work or tasks, which are moved through the chart columns. A typical configuration of a Kanban chart in the software context contains at least columns for the stages of specification, development, test, and deployment (Corona and Pani 2013). For each column, a limit for the work in progress is determined. As a result, flow and bottlenecks are usually the main issues addressed in daily meetings and play a crucial role in identifying improvement opportunities. Furthermore, as a visual tool, the chart stimulates the evaluation of the value stream materialized in it, prompting improvement not only of the process itself but also of the defined policies supporting it.
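The board mechanics described above can be illustrated with a minimal sketch. It assumes a simple pull rule in which a card moves only when the receiving column is below its WIP limit; the class names, the limits, and the card names are hypothetical and are not taken from any of the cited studies.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    wip_limit: int                                  # hypothetical limit, for illustration only
    cards: list = field(default_factory=list)

    def has_capacity(self) -> bool:
        return len(self.cards) < self.wip_limit


class Board:
    def __init__(self, columns):
        self.columns = columns

    def add(self, card: str) -> bool:
        """Place a new card in the first column if its WIP limit allows it."""
        first = self.columns[0]
        if not first.has_capacity():
            return False
        first.cards.append(card)
        return True

    def pull(self, target_index: int) -> bool:
        """Pull the oldest card of the previous column into the target column."""
        if target_index <= 0 or target_index >= len(self.columns):
            return False
        source = self.columns[target_index - 1]
        target = self.columns[target_index]
        if not source.cards or not target.has_capacity():
            return False  # nothing to pull, or the WIP limit blocks the pull
        target.cards.append(source.cards.pop(0))
        return True


# Columns follow the typical configuration mentioned in the text.
board = Board([
    Column("specification", 2),
    Column("development", 1),
    Column("test", 1),
    Column("deployment", 1),
])
board.add("feature A")
board.add("feature B")
first_pull = board.pull(1)   # development has capacity, so "feature A" moves
second_pull = board.pull(1)  # development is at its limit, so the pull is refused
```

In this sketch a refused pull is what makes a bottleneck visible: "feature B" stays in the specification column until the development column frees capacity.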

3 Study method

The fundamental research question of our work is “What are the trends observed in empirical studies available in the technical literature regarding the benefits and challenges of using Kanban in software organizations?” To answer this research question, we aggregated the results of primary studies regarding Kanban using the SSM. The SSM allows the aggregation of qualitative and quantitative evidence through the use of diagrammatic models (Santos and Travassos 2013). As both a qualitative and a quantitative research synthesis method, the SSM briefly depicts the essential contextual aspects and informs the trend of the effects (e.g., positive or negative), as well as a certainty estimation about them. Therefore, the SSM provides balanced information regarding the phenomena, aggregating neither precise quantitative findings nor rich qualitative descriptions.

This blend of integrative and interpretive synthesis (Cruzes and Dybå 2011) was the primary reason for deciding to use the SSM as our research method, since the primary studies regarding the use of Kanban in SE report both quantitative and qualitative evidence. In the SSM, interpretive synthesis aspects are concerned with the organization and development of concepts to describe contextual aspects of evidence, whereas integrative features are focused on pooling data about cause-effect or moderation relations, taking into account the uncertainty estimated for each piece of evidence. Besides, the SSM offers tool support to model and synthesize evidence (Santos et al. 2015; Santos and Travassos 2017a), including facilities for graphical modeling, evidence search, and support for the synthesis. Another essential functionality is the evidence model comparison used to aggregate evidence, which has mechanisms for ‘conflict resolution’ between the models. The Evidence Factory tool, including all the results of the synthesis presented in this paper, can be accessed at http://evidencefactory.lens-ese.cos.ufrj.br/synthesis/editor/80416.

In general terms, an SSM synthesis study follows three steps: (i) the selection of primary studies, (ii) the analysis and representation of the evidence acquired by such studies, and (iii) evidence synthesis. The basic idea behind these three steps is to collect evidence and then represent it from the same perspective so that the results can be consolidated and synthesized. It is similar to statistical meta-analysis studies – which are a kind of integrative synthesis – where the effect size is used to get a uniform view of the studies’ outcomes and is also used in their aggregation (Borenstein et al. 2009).

Next, we describe how we applied the SSM to synthesize the research on the benefits and challenges of using Kanban in SE. We also provide some descriptions of the SSM definition and utilization necessary for understanding how this synthesis study was conducted. We refer the reader to Santos and Travassos (2013) for further details regarding the method, and to Martinez-Fernandez et al. (2015), Chapetta (2016), and Santos and Travassos (2017b) for examples of its application.

3.1 SSM step 1: Selecting primary studies

As there are four secondary studies regarding Kanban including one (Ahmad et al. 2018) that has been recently published, there is no reason to perform some of the typical procedures involved in this step, such as defining a search string and selecting the studies based on inclusion and exclusion criteria. Instead, we used the datasets from these secondary studies to form the set of primary studies to be aggregated. We have taken the primary studies from the two most recent secondary studies (Al-Baik and Miller 2015; Ahmad et al. 2018). Regarding the other two, one (Ahmad et al. 2013) is updated by Ahmad et al. (2018), and the other (Corona and Pani 2013) is focused on the tools available for Kanban boards in software development. Only the papers reporting results from primary studies (i.e., case study, survey, controlled experiment, or simulation study) on using Kanban in SE were selected from the two secondary studies. Grey literature and experience reports that were considered in these two secondary studies were excluded from the synthesis.

Ahmad et al. (2018) enumerate 23 technical papers as primary studies (and another 23 as experience reports). However, we found that three of them (Corona and Pani 2013; Ahmad et al. 2013; Al-Baik and Miller 2015) were, in fact, secondary studies. One additional paper (Heikkilä et al. 2016) was also excluded, since the only challenge it reported, called “Setting up and maintaining Kanban”, was not translated because it did not represent a moderator. More details about the benefits and challenges of using Kanban in SE considered in each primary study are shown in Table 1 (Section 4). Thus, of the 23 primary studies, 19 were included in the synthesis.

Al-Baik and Miller (2015) enumerate 37 papers as studies, of which we could classify six as primary studies. Only one of these six primary studies was not included in Ahmad et al. (2018) and, thus, it was added to the synthesis. The remaining 31 papers were excluded for the following reasons: (i) grey literature (19 papers); (ii) experience reports (six papers – all included in Ahmad et al. (2018)); (iii) not an empirical study (five papers); and (iv) not reporting results describing the benefits and challenges of Kanban use (one paper). Appendix B lists all the primary studies. It should also be noted that all included papers are from the Software Engineering realm. That is, they investigate Kanban as a software technology (i.e., a set of techniques and tools employed in software development).

3.2 SSM step 2: Analysis and evidence representation

In this step, the goal is to put the selected primary studies under the same perspective so they can be aggregated. The idea is similar to statistical meta-analysis, in which all primary studies are represented by a numerical value called effect size and then aggregated by combining those values (Borenstein et al. 2009). In the case of the SSM, each primary study is represented by an evidence model, which is called a theoretical structure. The evidence models describe the primary studies’ contextual aspects and the effects/moderators expected from the object of study – Kanban, in this synthesis. These descriptions are used as input for determining the evidence compatibility and for the aggregation itself.

As 20 primary studies were selected, we had to create 20 theoretical structures. The name theoretical structure is related to the origin of the model constructs, which were taken from a representation created for theory building (Sjøberg et al. 2008). Since the model was adapted for the purpose of research synthesis, we use the name theoretical structure to bring attention to the model structure instead of the epistemological aspects related to theory building. In fact, this emphasis is also reflected in the method name, Structured Synthesis Method. In the following paragraphs, we describe the evidence model constructs.

The ten semantic constructs used in the theoretical structures are shown in Fig. 2. There are three possible types of structural relationships in the representation: is a, part of and property of. All of them have counterparts in UML, respectively: generalization, composition and class attributes. The is a and part of relationships use the same UML notation for generalization and composition. Dashed connections denote properties. The relationships are used to link two types of concepts – value and variable.

A value concept represents a particular variable value, usually of an independent variable. Rectangles represent value concepts. They are classified into archetypes (the root of each hierarchy), causes (indicated by the use of bold font and a ‘C1’ following the name, denoting that it is ‘cause 1’, e.g., ‘Kanban’), and contextual aspects (e.g., ‘Distributed Project’). The four archetypes – activity, actor, system, and technology – were suggested by Sjøberg et al. (2008) in an attempt to capture the typical scenario in SE, described by an actor applying a technology to perform activities in a software system.

A variable concept focuses on value variations, usually associated with a dependent variable. Variable concepts are represented by ellipses or parallelograms, symbolizing effects (e.g., ‘Work Visibility’) and moderators (e.g., ‘Training’), respectively. Effects are not connected to the cause by lines, as these connections are assumed to exist when reading the diagram. Lines are also omitted in the link between moderators and the (moderated) effects. In this case, a textual hint (e.g., ‘M1’) is shown beside both the moderated effect and the moderator. Both relationships, cause-effect and moderation, are called influence relationships.

A seven-point Likert scale is used to indicate an effect size. The scale ranges from strongly negative to strongly positive and is indicated by arrows above the ellipse. For example, the arrows above ‘Collaboration’ indicate that it is between weakly positively and positively affected by ‘Kanban’: the number of arrows indicates the value on the scale, the maximum number of arrows represents strongly negative or strongly positive, and half arrows indicate a range, as in the case of ‘Collaboration’. The other type of variable concept, the moderator, indicates that some positive or negative effect is moderated (i.e., reduced) when it increases or decreases. It has a scale with three values indicating the moderation direction: inversely proportional, indifferent, and directly proportional. For instance, the moderator ‘Training’ has an inversely proportional influence on ‘Collaboration,’ which means that the more it is present, the less it exerts a moderation influence. The last aspect related to variable concepts is the association of a belief value (ranging from 0% to 100%, or 0 to 1) to estimate the confidence in the observed effects and moderations. The bar under each element represents the belief value, e.g., ‘Flow of work’ has a belief value of 47%.

3.2.1 Extracting information to build evidence models

In order to create the evidence models, it is necessary to extract information from the primary studies. The goal is to determine and define the concepts (contextual aspects, moderators, and effects) that will form the evidence model, and to estimate the confidence (i.e., belief value) in the variable concepts (moderators and effects). This is usually performed in two stages: one for determining the concepts and the other for estimating the confidence. These two stages are described next.

In the first stage, the procedures are analogous to the coding process (Auerbach and Silverstein 2003), but with the specific goal of developing concepts and relating them according to the diagrammatic model definitions given earlier. Hence, the coding in the SSM does not necessarily need to go through a continuous and iterative process of small steps, as is usually indicated for coding, but can be focused on the elements of the theoretical structures. There are several recommendations for performing this coding process in the SSM. For instance, one of the recommendations is the translation procedure (Britten et al. 2002). In the SSM, as the goal is to aggregate evidence by combining the compatible theoretical structures, the translation procedure can support the identification of concepts that at first glance are not comparable, but become comparable when translated to the proper concept. One example in the software context would be translating Understandability and Learnability into a more generic concept, for instance, Usability (Footnote 1). This kind of generalization is not free from threats and should be considered on a case-by-case basis according to the researchers’ interpretations. Readers interested in a detailed view of the recommendations and heuristics for the coding process in the SSM can find it in Santos (2015).

During the coding, besides the evidence model concepts, it is also necessary to determine the effects’ intensity and the moderators’ direction. For qualitative studies, the adverbs and adjectives used to qualify the reported outcomes are translated to the seven-point Likert scale describing the effect size or intensity. When there was no indication of the effect intensity, we were conservative and defined a range of values to represent the imprecision regarding the intensity, e.g., ‘between weakly positive and positive.’ For the quantitative studies, on the other hand, we need to stipulate ranges of values, using the domain of the dependent variable scale as input, to be able to translate it to the seven-point Likert scale. For instance, in Fitzgerald et al. (2014) “the overall cycle time was reduced from almost 100 days to just over 60 days, a significant improvement” – the authors qualified the difference of 40 days as a significant improvement, which was used to determine this effect as strongly positive in the Likert scale.

In the second stage, with the concepts and their relationships defined, the SSM needs further definitions to determine the confidence (i.e., the belief value) related to the effects and moderators. Two inputs are used to that end. One is the type of study from which the evidence was acquired. The SSM uses the GRADE evidence hierarchy (Atkins et al. 2004) to split the 0–1 belief value range into four subranges: unsystematic observations [0.00, 0.25]; observational studies [0.25, 0.50]; quasi-experiments [0.50, 0.75]; and randomized controlled studies [0.75, 1.00]. The second input is the quality assessment, which is translated into a position within the 0.25-wide subrange. The SSM proposes to use two checklists to assess the quality of each study, which are explained in Santos and Travassos (2013). Based on this, the belief values listed in Table 5 (Appendix A) are calculated, e.g., the study P1 was observational (0.25), and in the quality assessment performed using the checklists, it got 0.17 out of 0.25. As one can see, the estimation procedure gives lower belief values for less reliable studies and higher values for the more reliable ones. Thus, the basic idea is to reflect the reliability of the evidence represented by a theoretical structure. Details regarding the quality assessment for each study can be found in the Evidence Factory tool.
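The belief value estimation described above can be sketched as follows. The sketch assumes that the two inputs are simply added, i.e., the lower bound of the study-type subrange plus the quality score obtained within the 0.25-wide subrange; this additive reading, as well as the function and dictionary names, is our illustration and not the SSM tool's actual implementation.

```python
# Lower bounds of the four GRADE-based subranges listed in the text.
STUDY_TYPE_BASE = {
    "unsystematic observation": 0.00,
    "observational study": 0.25,
    "quasi-experiment": 0.50,
    "randomized controlled study": 0.75,
}
SUBRANGE_WIDTH = 0.25


def belief_value(study_type: str, quality_score: float) -> float:
    """Assumed additive rule: subrange lower bound + quality score in [0, 0.25]."""
    if not 0.0 <= quality_score <= SUBRANGE_WIDTH:
        raise ValueError("quality score must lie in [0, 0.25]")
    return round(STUDY_TYPE_BASE[study_type] + quality_score, 2)


# Study P1 from the text: observational (0.25) with a quality score of 0.17.
p1_belief = belief_value("observational study", 0.17)
```

Under this assumption, P1 would receive a belief value of 0.42, which stays inside the observational subrange [0.25, 0.50].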

To perform these two stages of translating evidence from the primary studies into the diagrammatic evidence models, three researchers – the first three authors – organized the tasks in the following manner.

First, the 20 papers were evenly distributed among the researchers. Each of them thoroughly read the papers and extracted the relevant information to create the evidence models. The benefits and challenges enumerated in the secondary study by Ahmad et al. (2018) were used as the primary source for identifying the effects (i.e., benefits) and moderators (i.e., challenges) in the primary studies’ reports. It should be noted that this process is usually performed inductively based on the primary studies’ textual reports, but in the specific case of this study we used the work of Ahmad et al. (2018), as its benefits and challenges represent the codes extracted from the primary studies. Still, we were not able to find all the benefits and challenges in the papers as indicated in Ahmad et al. (2018) – the differences are presented in Section 4. On the other hand, contextual aspects were identified using the different SSM recommendations and heuristics.

Second, after the evidence models’ creation, the researchers discussed the models together. Each researcher summarized their papers and presented the models to the other two. During this process, three main aspects were addressed: (i) assessing the understanding of the primary studies’ context and outcomes, (ii) indicating the trace between the theoretical structures’ concepts and the excerpts from which they were extracted, and (iii) reaching a consensus regarding the definition of the theoretical structures’ concepts (e.g., guided by the reciprocal translation procedure indicated in the SSM – adapted from Meta-Ethnography (Da Silva et al. 2013)).

3.3 SSM step 3: Evidence synthesis

In this step, the evidence extracted from the primary studies is aggregated based on the evidence models. Therefore, it is essential to define what makes theoretical structures match, i.e., what makes them compatible and allows their evidence to be aggregated. The SSM defines two theoretical structures as compatible when their value concepts, which include the cause, archetypes, and contextual aspects, are the same or have the same meaning. Once the researcher determines that the theoretical structures are compatible, their effects and moderators are combined according to their directions and intensities.

Pair-by-pair comparisons determine the compatibility among theoretical structures. When a pair is found to be compatible, the combined theoretical structure is formed by the common value concepts of both theoretical structures being compared and by the variable concepts present in at least one of the two structures. Archetypes and contextual aspects, represented by value concepts, describe the conditions under which the aggregation is valid. For instance, for an evidence model to be compatible with the one shown in Fig. 1, it must have the same value concepts, namely: ‘portfolio management,’ ‘software development process,’ ‘software project,’ ‘distributed project,’ ‘software team,’ ‘medium-scale system,’ and ‘Kanban.’
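The pairwise comparison can be sketched as set operations. The sketch below is a simplification: it treats two structures as compatible only when their value concepts are literally identical, whereas in the SSM the researcher also judges whether differently named concepts have the same meaning. The class, the function names, and the second study (`other`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TheoreticalStructure:
    value_concepts: frozenset      # cause, archetypes, and contextual aspects
    variable_concepts: frozenset   # effects and moderators


def compatible(a: TheoreticalStructure, b: TheoreticalStructure) -> bool:
    """Simplified check: identical value concepts (no semantic translation)."""
    return a.value_concepts == b.value_concepts


def combine(a: TheoreticalStructure, b: TheoreticalStructure) -> Optional[TheoreticalStructure]:
    """Common value concepts plus the variable concepts found in either structure."""
    if not compatible(a, b):
        return None
    return TheoreticalStructure(
        value_concepts=a.value_concepts & b.value_concepts,
        variable_concepts=a.variable_concepts | b.variable_concepts,
    )


# Value concepts of the model in Fig. 1, as listed in the text.
context = frozenset({
    "portfolio management", "software development process", "software project",
    "distributed project", "software team", "medium-scale system", "Kanban",
})
p10 = TheoreticalStructure(context, frozenset({
    "time-to-market", "control of project activities and tasks", "continuous learning",
}))
# Hypothetical second study sharing the same context.
other = TheoreticalStructure(context, frozenset({"time-to-market", "work visibility"}))
# Hypothetical study without the 'distributed project' contextual aspect.
co_located = TheoreticalStructure(context - {"distributed project"},
                                  frozenset({"work visibility"}))

merged = combine(p10, other)
rejected = combine(p10, co_located)
```

Here `merged` carries four variable concepts, while `rejected` is `None` because the contexts differ. The pooling of intensities and belief values for shared concepts such as ‘time-to-market’ is a separate step, handled by the uncertainty formalism.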

After identifying compatibility based on the value concepts, the intensity (e.g., positive or negative) and uncertainty (i.e., belief value) of the variable concepts (i.e., effects and moderators) are pooled in such a way that their intensity reflects the resulting agreement in the combined evidence. To that end, an uncertainty formalism is necessary to combine the results – otherwise, a simple vote counting strategy would be used. In the SSM, the Mathematical Theory of Evidence (Shafer 1976) (also known as Dempster-Shafer theory, DST) is the mathematical formalism that enables obtaining the pooled outcomes. The DST uses two primary inputs to combine two pieces of evidence. One is the set of hypotheses believed to have a chance of being true – those with a belief value greater than zero – and the other is the belief values themselves. Hypotheses are defined as elements of the power set of the frame of discernment (i.e., subsets of the frame), whereas the belief value is estimated based on the procedures described in the previous step.

In order to perform the aggregation in the SSM using the DST formalism, the different intensity values that an effect can assume are represented as the frame of discernment. Since the intensity of an effect uses a seven-point Likert scale, the corresponding frame of discernment in the DST is defined as Θ = {SN, NE, WN, IF, WP, PO, SP} – the element names are abbreviations for the Likert scale terms, e.g., SN is ‘strongly negative’, IF is ‘indifferent’, and WP is ‘weakly positive.’ Likewise, the frame of discernment for moderators is formed by three values that are used to indicate the moderation direction: Θ = {IP, IF, DP} – ‘inversely proportional,’ ‘indifferent,’ and ‘directly proportional,’ respectively.

Once hypotheses and belief values are defined for each piece of evidence, Dempster’s Rule of Combination is applied. Eq. (1) shows that the aggregated belief value for each hypothesis C is equal to the sum of the products of the belief values of all pairs of hypotheses Ai and Bj, one from each piece of evidence, whose intersection is C, divided by 1 - K. The function m is called the basic probability assignment function which, as the name implies, is used to assign a belief value to the different hypotheses of the power set.

$$ m_3(C)=\frac{\sum\limits_{\substack{i,j\\ A_i\cap B_j=C}} m_1(A_i)\, m_2(B_j)}{1-K}, \quad \text{where}\ K=\sum\limits_{\substack{i,j\\ A_i\cap B_j=\varnothing}} m_1(A_i)\, m_2(B_j) $$
(1)

When the intersection between two hypotheses is an empty set, we say that there is a conflict. The conflicting belief K is then redistributed among the aggregated hypotheses, which is the role of 1 - K in the denominator. More details about how the DST is used in the SSM are available in Santos and Travassos (2013). At the end of Section 5, the reader finds an example of how to compute it.
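A minimal sketch of Dempster's Rule of Combination for the effect-intensity frame is given below. It assumes that each piece of evidence assigns its belief value to one hypothesis and the remaining mass to the whole frame Θ, which is a common DST convention for expressing ignorance; whether the SSM distributes the remaining mass in exactly this way is not stated here. The function name and the belief values in the example are hypothetical.

```python
from itertools import product

# Frame of discernment for effect intensity, as defined in the text.
THETA = frozenset({"SN", "NE", "WN", "IF", "WP", "PO", "SP"})


def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two basic probability assignments (frozenset -> mass) as in Eq. (1)."""
    pooled = {}
    conflict = 0.0  # K: total mass of pairs whose intersection is empty
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            pooled[intersection] = pooled.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("the two pieces of evidence are in total conflict")
    # Normalizing by 1 - K redistributes the conflicting mass.
    return {h: mass / (1.0 - conflict) for h, mass in pooled.items()}


# Hypothetical evidence 1: effect between weakly positive and positive, belief 0.42.
evidence_1 = {frozenset({"WP", "PO"}): 0.42, THETA: 0.58}
# Hypothetical evidence 2: positive effect, belief 0.60.
evidence_2 = {frozenset({"PO"}): 0.60, THETA: 0.40}

pooled = dempster_combine(evidence_1, evidence_2)
```

In this example no pair of hypotheses is disjoint, so K = 0, and the pooled masses are approximately 0.600 for {PO}, 0.168 for {WP, PO}, and 0.232 for Θ. Agreement between the two studies therefore concentrates belief on ‘positive’.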

4 Representation of the benefits and challenges of Kanban

Before presenting the aggregated results, in this section we describe the coding process involved in modeling the theoretical structures and the concepts developed in that process. For the sake of readability and reasonable manuscript size, only two of the 20 theoretical structures created are detailed. The chosen models (one qualitative and the other quantitative) are representative of the overall set of studies. Besides, they have different levels of complexity, as the quantitative study investigated only three benefits, whereas the qualitative study covers several benefits and challenges.

As stated in Section 3.2.1 (SSM Step 2), the benefits and challenges were pre-determined using those enumerated in Ahmad et al. (2018). However, since the benefits and challenges were not thoroughly defined, we possibly interpreted some of them differently. For instance, the benefits ‘identify impediments to flow’ and ‘improve workflow’ relate to each other, but the exact intended difference between them is not explicit.

Using the SSM jargon, the benefits were defined as effects (positive influence) and the challenges as moderators. Not all challenges could be interpreted as moderators, since some of them are more associated with an intrinsic characteristic of Kanban use than with an external aspect moderating the effects of its utilization. For instance, ‘setting up and maintaining Kanban’ is an aspect of Kanban usage itself, not an external issue that, if not addressed, can moderate the effects. Another example is the ‘poor understanding of Kanban concepts and practices,’ which is a direct consequence of Kanban use, while the challenge ‘lack of training’ reflects an external aspect that can address the ‘poor understanding of Kanban concepts and practices.’ Furthermore, the SSM uses concepts, also called constructs, in the theoretical structures. Therefore, we had to code the benefits and challenges enumerated in Ahmad et al. (2018) as constructs. Then, for each construct, we provided a definition based on the technical literature. For instance, ‘improve quality’ was coded as ‘internal quality’ since the quality improvement aspect reported in the primary studies was always related to an internal property of the software product. Once this interpretation was made, we provided a definition based on ISO (2017a). Also, any adjectives were removed, since the effect intensity already represents them – e.g., ‘improve communication’ became ‘communication.’ The output of these interpretations and considerations is shown in Table 6 (Appendix A).

Effects and moderators account for the variable concepts only. The value concepts, in turn, were coded directly from the primary studies. Although most of the papers were observational and provided a relatively rich description of the studies’ context, few of them explicitly stated which factors were decisive for the results found. Under these circumstances, the researcher constructing the theoretical structures needs to use their understanding of these factors to decide when they are relevant enough to be made explicit in the evidence models. An alternative to this approach would be to describe all contextual aspects reported in the studies, such as programming languages and types of products developed. However, besides adding a considerable amount of complexity to the aggregation process, as all conflicts between the models need to be resolved in the Evidence Factory tool, the decision of whether a conflict represents inadmissible evidence (i.e., the results cannot be aggregated) would still be a researcher interpretation. Given this line of reasoning and with the goal of generalizing the results, we adopted the first approach and described only the essential contextual aspects according to our interpretations. As we discuss in the next section, this keeps the aggregation at a manageable size and brings the focus to the essential contextual aspects.

Figure 1 presents the evidence representation for the study P10 (Fitzgerald et al. 2014). The model has three reported benefits: ‘time-to-market,’ ‘control of project activities and tasks,’ and ‘continuous learning.’ Also, besides the cause ‘Kanban’ and the archetypes, it has six value concepts describing the context: ‘portfolio management,’ ‘software development process,’ ‘software project,’ ‘distributed project,’ ‘software team,’ and ‘medium-scale system.’ The paper reports a study in a Polish company in which the researchers used a mathematical model (Erlang-C model) to gather and analyze data to improve the decision-making process regarding the Kanban process used for portfolio management. Projects of medium-scale systems formed the company portfolio: “very common category of development projects at Ericpol was one where the inflow of projects was random and not controlled by the software development function, where fast response time and short service time were crucial, work effort was relatively small, and technical complexity was usually moderate.” As can be seen, the three most salient contextual aspects are ‘portfolio management,’ ‘distributed project,’ and ‘medium-scale system.’ They can be used as a starting point for explanations in case of contradictory results.

Fig. 1 Evidence model representing the results of P10

Of the three evidence model effects, two were quantitative (‘time-to-market’ and ‘control of project activities and tasks’), based on the Erlang-C model, and one was qualitative, based on field observations. ‘Time-to-market’ was measured in terms of cycle time (in days): “the overall cycle time was reduced from almost 100 days to just over 60 days, a significant improvement.” Moreover, ‘control of project activities and tasks’ was extracted from the analysis of the Erlang-C model: “the bigger an organization is (the more teams it has), the less sensitive it is to fluctuations in the average inflow … This suggests that … all the teams across different sites should work on a common input queue or backlog for software development.” Notice that for the ‘time-to-market’ concept the effect intensity was defined as {PO, SP} since it was considered a significant improvement by the authors, whereas for ‘control of project activities and tasks’ the intensity was defined as {WP, PO} since the authors indicated that it was a “suggestion” that the tasks and activities should be put on a common queue. The third and last effect, ‘continuous learning,’ was extracted from the following excerpt: “it demonstrates how an organization can make better decisions based on data gathered and analyzed using a model (in this case, Erlang-C) that is highly relevant in the organization’s context.” In this case, “highly relevant” was the qualification used to define the ‘continuous learning’ intensity as {PO, SP}.
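To make the structure of such an evidence model concrete, the P10 coding described above can be written down as a small data structure. This is only an illustrative sketch: the class and field names are hypothetical and do not reflect the Evidence Factory tool's schema, and the scale labels SN and NE are assumed names for the two lowest points of the seven-point scale (the other labels appear in the text).

```python
from dataclasses import dataclass

# Seven-point intensity scale, ordered from most negative to most positive.
# SN and NE are assumed labels; WN, IF, WP, PO and SP are used in the text.
SCALE = ("SN", "NE", "WN", "IF", "WP", "PO", "SP")


@dataclass(frozen=True)
class Effect:
    concept: str
    intensity: tuple  # (lowest, highest) scale label of the range
    kind: str         # "quantitative" or "qualitative"

    def __post_init__(self):
        low, high = self.intensity
        # SCALE.index raises ValueError for unknown labels.
        if SCALE.index(low) > SCALE.index(high):
            raise ValueError(f"inverted intensity range: {self.intensity}")


@dataclass
class EvidenceModel:
    study: str
    cause: str
    effects: list
    context: list  # value concepts describing the study context


# Coding of P10 as described in the text (belief values omitted).
p10 = EvidenceModel(
    study="P10",
    cause="Kanban",
    effects=[
        Effect("time-to-market", ("PO", "SP"), "quantitative"),
        Effect("control of project activities and tasks", ("WP", "PO"), "quantitative"),
        Effect("continuous learning", ("PO", "SP"), "qualitative"),
    ],
    context=[
        "portfolio management",
        "software development process",
        "software project",
        "distributed project",
        "software team",
        "medium-scale system",
    ],
)

for effect in p10.effects:
    print(f"{effect.concept}: {{{', '.join(effect.intensity)}}} ({effect.kind})")
```

Keeping the intensity as an ordered range rather than a single value mirrors the modeling decision discussed in the text, where qualitative descriptions were not precise enough to pick one point on the scale.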

The second evidence model, presented in Fig. 2, represents a study (Dennehy and Conboy 2017) concerned with the investigation of what the authors denominate ‘flow techniques,’ to which Kanban, particularly the board, is directly related. According to the authors, citing Womack and Jones (1997), flow in product development is defined as “the progressive achievement of tasks along the value stream so that a product proceeds from design to launch, order to delivery, and raw materials into the hands of the customer with no stoppages, scrap, or backflows.” The investigation was based on the Activity Theory, which is a framework for studying different forms of human practice as historically developing cultural systems. It is used to identify the contradictions and congruencies among the six constructs of the model, namely, tools and signs, subject, object, rules and norms, community, and division of labor. This analysis, using the Activity Theory, was performed in two large software companies (with a workforce of more than 100,000 employees) in a multiple-case design with cross-case analysis.

Fig. 2 Evidence model representing the results of P9

The richness of the investigations and analysis with the Activity Theory is reflected in the evidence model. It contains five effects, two moderators, and six contextual aspects. For instance, the improvement of the ‘flow of work,’ which was the main focus of the study, was explained in the following excerpt: “after an initial period of using the physical Kanban, a congruency in the work activity emerged between the subject and the tool, as well as the community. A manager at Company B explained that by using the physical Kanban, it was ‘enacting everyone involved by basically seeing where the problems are.’” In the coding process, ‘flow of work’ was always associated with the identification of impediments to the flow (see Table 6 in Appendix A). Also, as the authors indicate a complete congruency for this issue, the effect intensity was defined as {PO, SP}. One last example is regarding a moderator, the ‘management’ ‘expertise.’ In P9, the possible moderation of the management expertise was described as follows: “a deeper analysis revealed that this change could be linked to the creation of another contradiction between the tools that was caused by the introduction of flow. That is, between the expertise and knowledge of the project manager (mental tool) and the capability offered by the Kanban (physical tool). It led to a shift in rules and norms concerning the activity. Evidence of this contradiction is captured by a manager at Company B who acknowledged that he was initially skeptical of the tool (Kanban) because it ‘looked very gimmicky and I just don’t trust pieces of paper stuck to the wall.’” Regarding the value concepts, the ‘distributed project’ is the only one representing a distinctive P9 study characteristic and as mentioned in the first evidence model can be used for explanations of contradictory results. Also, it is interesting to notice the absence of the ‘system’ archetype as in P9 there was no effect, moderator or contextual aspect related to it.

These two evidence models give a glimpse of how the coding process regarding the Kanban studies was performed using the SSM. For completeness, we show in Table 1 all the effect intensities and moderator directions, in addition to their belief values. Given our understanding and interpretation of the benefits and challenges (Table 6 in Appendix A), their identification in the primary studies diverged from the ones indicated in Ahmad et al. (2018). Shaded cells indicate benefits or challenges that were indicated for the respective study in Ahmad et al. (2018), but that we could not identify in our study. Belief values in italic font represent the benefits or challenges whose identification in the respective study we were not confident about; for those cases, we applied a 50% discount on the belief value. Moreover, the ones in bold font are those which we identified in the respective study, but which were not indicated in Ahmad et al. (2018). Apart from that, the effect intensities were represented using the notation presented in Section 3.2; and the moderators use for directly proportional and for inversely proportional.
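The text does not detail how the 50% discount was computed. One common option in the Mathematical Theory of Evidence is Shafer's discounting, in which the mass of every hypothesis is scaled by a reliability factor and the removed mass is moved to the frame of discernment Θ. The sketch below assumes that operation; the function name, the input masses, and the scale labels SN and NE are hypothetical and are not taken from the study.

```python
# Frame of discernment: the seven-point intensity scale.
# SN and NE are assumed labels for the two lowest points.
THETA = frozenset({"SN", "NE", "WN", "IF", "WP", "PO", "SP"})


def discount(mass, alpha):
    """Shafer discounting (assumed here, not confirmed by the text).

    Each hypothesis other than THETA keeps alpha times its mass;
    the remaining mass is assigned to THETA (total ignorance).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    discounted = {h: alpha * w for h, w in mass.items() if h != THETA}
    discounted[THETA] = 1.0 - sum(discounted.values())
    return discounted


# Hypothetical belief of 0.6 in {PO, SP} for an uncertain identification.
po_sp = frozenset({"PO", "SP"})
original = {po_sp: 0.6, THETA: 0.4}
halved = discount(original, alpha=0.5)

print(halved[po_sp])   # belief in {PO, SP} after a 50% discount
print(halved[THETA])   # mass left uncommitted
```

Under this assumption, a 50% discount halves the belief committed to the coded intensity and leaves the rest uncommitted, so an uncertain identification weighs less in the later combination.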

Table 1 Effects and moderators as reported in the selected studies

5 Results

The aggregation was performed following the procedures described in Section 3.3. We first present the results of the primary aggregation focus, which is the pooled effects and moderators. Then, at the end of this Section, we detail some aspects of the aggregation process such as how the pooled results were computed and how the evidence models compatibility was analyzed.

Table 2 shows the results after performing the aggregation of evidence on the benefits and challenges of using Kanban in SE. Together with this section as a whole, Table 2 synthesizes the answer to the research question defined at the beginning of Section 3. The first column shows the reported effect (i.e., benefit) caused by or the moderator (i.e., challenge) influencing the introduction of Kanban in the organization. The second column indicates the primary studies that have reported this effect and the third the number of papers. The fourth column shows the aggregated effect intensity, i.e., how the use of Kanban causes such effect (e.g., positively or negatively). The fifth column represents the aggregated belief in such effect. It is one of the most exciting results of the aggregation. Table 2 shows in bold the effects and moderators in the upper quartile (Q3) and underlined those equal to or higher than the median (Q2) after the aggregation – the quartile calculation was done separately for effects (Q2 = 75.5; Q3 = 93.75) and moderators (Q2 = 73; Q3 = 85). The sixth column shows whether there was a conflict while aggregating that effect. It is essential to analyze and to characterize the different contexts from which the evidence was gathered. Lastly, the seventh column shows the difference between the confidence gained after the aggregation and the maximum belief value in the individual papers. Therefore, a positive difference indicates the effects that have been reinforced after the aggregation whereas a negative difference shows that the evidence is somewhat contradictory.
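The highlighting rule and the seventh column can be illustrated with a short sketch. The quartile thresholds are the ones reported above; the function names and the belief values in the example are hypothetical, and treating "in the upper quartile" as "greater than or equal to Q3" is an assumption.

```python
# Thresholds reported in the text (belief values in percent).
EFFECT_Q2, EFFECT_Q3 = 75.5, 93.75
MODERATOR_Q2, MODERATOR_Q3 = 73.0, 85.0


def highlight(belief, q2, q3):
    """Classify an aggregated belief value as Table 2 marks it."""
    if belief >= q3:   # assumed: upper quartile includes Q3 itself
        return "bold"
    if belief >= q2:   # equal to or higher than the median
        return "underlined"
    return "plain"


def belief_difference(aggregated, individual):
    """Aggregated belief minus the highest belief in any single paper.

    Positive: the evidence was reinforced by the aggregation.
    Negative: the evidence is somewhat contradictory.
    """
    return aggregated - max(individual)


# Hypothetical effect reported by three papers.
individual_beliefs = [60.0, 75.0, 70.0]
aggregated_belief = 95.0

print(highlight(aggregated_belief, EFFECT_Q2, EFFECT_Q3))
print(belief_difference(aggregated_belief, individual_beliefs))
```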

Table 2 Aggregated effects and moderators of Kanban use

The aggregation indicates that ‘work visibility,’ ‘control of project activities and tasks,’ ‘flow of work,’ and ‘time-to-market’ are the main benefits of Kanban by using the criterion of belief values in the upper quartile. The evidence regarding these benefits is vast and consensual. Among the moderators, a successful Kanban adoption is most conditioned on the ‘organizational culture.’ It should be noticed that we set aside the effect intensity selection strategy used by default in the SSM in favor of the quartile criterion. The strategy tries to balance outcome precision and confidence by selecting a singleton (a value in the Likert scale) or a compound (a range in the Likert scale) hypothesis as an aggregation result. The use of this strategy is just a suggestion of the SSM since, as discussed in Santos and Travassos (2013), this is a definition related to the Mathematical Theory of Evidence and there is no consensual way to perform this selection (Bloch 1996). With this default strategy, we would have three different results in Table 2: (i) ‘work visibility’ with PO intensity and 84% of belief, (ii) ‘control of project activities and tasks’ with PO intensity and 83% of belief, and (iii) ‘flow of work’ with PO intensity and 77% of belief. These results are more precise, but with the tradeoff of lower confidence. Since in this synthesis all effects in the evidence models were modeled using an intensity range (see Section 3.2.1), we opted to keep the range for the outcomes as well, despite the default strategy of the SSM. Still, it shows that for these three effects, contrary to the others, there is a significant amount of belief assigned to the singleton. It demonstrates how the Mathematical Theory of Evidence converges with the accumulation of evidence. For instance, ‘flow of work’ has three evidence models (P3, P15, and P22) with {WP, PO} intensity and five models (P1, P5, P9, P17, and P20) with {PO, SP} intensity. As one can see, {PO} is the intersection between the two ranges.

Kanban {positively – strongly positively} affects the ‘work visibility’ of software development projects. Indeed, Kanban is by definition regarded as a visualization tool to introduce Lean ideas. “The Kanban board offers better transparency of the development process and shows which developer is working on which task” (P20). Compared to Scrum, “the difference is the Scrum board resets between each iteration while the Kanban board is normally persistent and doesn’t need to reset and start over. Further, tasks are visualized on the Scrum board for each sprint, while Kanban visualizes tasks that can be pulled at any time to respect WIP limits; this restricts the allowed number of tasks in every workflow state” (P1). Moreover, even in the educational environment, the “overall perception of students (55%) about the Kanban board was positive. It helps in visualizing and prioritizing an entire work project more efficiently” (P2).

Kanban {positively – strongly positively} affects the ‘flow of work’ of the software development process. Associated with the ‘work visibility,’ the ‘flow of work’ is another Kanban core aspect as it “provides greater visibility into what teams are doing… improving the feedback loops [and] exposing resource constraints and even capacity utilization” (P9). Thus, it can support organizations in making its value stream as efficient as possible avoiding any impediments or overworking. As a result, team members become aware that “organizational entities cannot work independently because the outcome is related to their cooperative capabilities to create value. Interacting components are important for reaching smooth value streams and avoiding local optimizations” (P22).

The use of Kanban {positively – strongly positively} affects the ‘control of project activities and tasks’ of software development projects. One crucial aspect of this improved control is focusing the work conducted by the software team since the “work in progress limit helps teams to avoid working on too many parallel tasks and are forced to work on those tasks that deliver value to the project” (P1). This focus is important even at higher levels of management, for managing the work of more than one team, as stated in P10: “all the teams across different sites should work on a common input queue or backlog for software development.”

The use of Kanban {weakly positively – positively} affects the ‘time-to-market’ of software products. It was one of the few effects having quantitative data since it has a natural surrogate, which is cycle or lead time. In P16, “the great majority of teams reduced their average lead time, some of them for more than 30%”. Kanban helps the features to be released as soon as possible, as in the case of P23, in which “variation in delivery times reduced by 78% from 30.5 to 6.8, and the mean time to develop fewer and smaller software features declined by 73% from 9.2 to 2.5 working days.”

Other benefits that can be expected from Kanban – belief value equal to or higher than the median – are ‘workflow,’ ‘communication,’ ‘motivation,’ and ‘customer satisfaction.’ These represent effects that are likely to be observed on adopting Kanban in software organizations. Moreover, an appropriate level of ‘expertise’ and ‘supporting practices’ – usually agile practices such as test-driven development and pair programming – need close attention to operate Kanban to its full extent. The other effects and moderators have a relatively lower belief value and seem not to appear in practice, as fewer studies reported them or they are not completely congruent. Incomplete congruency means that the results are not the same, but they have an intersection – for instance, the intensities for ‘external quality’ {WP, PO} in P14, {WP, PO} in P17, and {IF, WP} in P21.

The virtual absence of conflicts in the aggregation is also noticeable. A conflict occurs when the intersection between two results is empty, such as between {WN, IF} and {PO, SP}, as occurred for ‘team cohesion’ in the evidence models of papers P20 and P5, respectively. The reasons for this are twofold. First, it is related to how we opted to model the effect intensity. As almost all primary studies are observational and qualitative, the benefits descriptions were not precise enough to define a single value in the seven-point Likert scale. Thus, the aggregation of range intensities is less susceptible to conflict. Second, it is due to the nature of the research on this topic. The point in question is that negative results seem not to be published, which represents an important limitation of the current body of knowledge regarding the use of Kanban in SE. The only conflict presented in Table 2 was associated precisely with a negative influence of Kanban on the ‘team cohesion’ in the study P20.
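The distinction between congruent, partially congruent, and conflicting results comes down to intersecting intensity ranges. The sketch below applies that test to the three cases mentioned in the text; the function names are hypothetical, and SN and NE are assumed labels for the two lowest points of the scale.

```python
# Ordered seven-point scale; SN and NE are assumed labels.
SCALE = ("SN", "NE", "WN", "IF", "WP", "PO", "SP")


def common_intensity(*ranges):
    """Return the scale values shared by all ranges, in scale order."""
    shared = set(SCALE)
    for intensity_range in ranges:
        shared &= set(intensity_range)
    return [value for value in SCALE if value in shared]


def relation(*ranges):
    """Classify a group of reported intensity ranges."""
    if len({frozenset(r) for r in ranges}) == 1:
        return "congruent"
    if common_intensity(*ranges):
        return "partially congruent"
    return "conflict"


# 'flow of work': the two kinds of ranges reported across the studies.
print(common_intensity({"WP", "PO"}, {"PO", "SP"}))

# 'external quality' in P14, P17 and P21.
print(relation({"WP", "PO"}, {"WP", "PO"}, {"IF", "WP"}))

# 'team cohesion' in P20 and P5.
print(relation({"WN", "IF"}, {"PO", "SP"}))
```

This check only looks at the ranges themselves. How much belief ends up on each hypothesis, and how any conflicting mass is redistributed, is decided by the combination rule.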

As described in Section 3.3, the aggregation was performed using Dempster’s Rule of Combination (see eq. 1). To illustrate how it is computed, let us take as an example the ‘team cohesion’ effect. Table 3 below is a schematic representation of the computation of Dempster’s Rule of Combination using eq. 1. Notice how the conflict is redistributed among the hypotheses. The highest belief value (excluding the frame of discernment {Θ} itself) is assigned to {WN, IF} with 0.306.

Table 3 Combination of two basic probability assignment functions (for ‘team cohesion’)

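The values for the combined $m_{P5\text{-team cohesion}} \oplus m_{P20\text{-team cohesion}}$ are:

\[
\begin{aligned}
\kappa &= 0.168 \quad \text{and} \quad 1 - \kappa = 0.832,\\
(m_{P5} \oplus m_{P20})(\{PO, SP\}) &= 0.229 / 0.832 = 0.275,\\
(m_{P5} \oplus m_{P20})(\{WN, IF\}) &= 0.255 / 0.832 = 0.306,\\
(m_{P5} \oplus m_{P20})(\{\Theta\}) &= 0.348 / 0.832 = 0.418,
\end{aligned}
\]

and $m_{P5} \oplus m_{P20}$ is 0 for all other sets of the powerset of $\Theta$.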
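The computation above can be reproduced with a few lines of code. The sketch below is a generic implementation of Dempster's Rule of Combination for two basic probability assignments, not the Evidence Factory implementation. The input masses (0.397 for P5 and 0.423 for P20) are not quoted from Table 3; they are back-computed from the products reported above (for example, 0.168 + 0.229 = 0.397) and should be read as an assumption. The scale labels SN and NE are likewise assumed.

```python
from itertools import product

# Frame of discernment: the seven-point intensity scale.
# SN and NE are assumed labels for the two lowest points.
THETA = frozenset({"SN", "NE", "WN", "IF", "WP", "PO", "SP"})


def combine(m1, m2):
    """Dempster's Rule of Combination for two mass functions.

    Masses are dicts mapping frozenset hypotheses to their mass.
    Returns the normalized combined masses and the conflict kappa.
    """
    unnormalized = {}
    kappa = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            unnormalized[intersection] = (
                unnormalized.get(intersection, 0.0) + mass_a * mass_b
            )
        else:
            # Empty intersection: this product is conflicting mass.
            kappa += mass_a * mass_b
    if kappa >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    combined = {h: w / (1.0 - kappa) for h, w in unnormalized.items()}
    return combined, kappa


PO_SP = frozenset({"PO", "SP"})
WN_IF = frozenset({"WN", "IF"})

# Input masses back-computed from the reported products (assumption).
m_p5 = {PO_SP: 0.397, THETA: 0.603}
m_p20 = {WN_IF: 0.423, THETA: 0.577}

combined, kappa = combine(m_p5, m_p20)
print(f"kappa = {kappa:.3f}")
for hypothesis, mass in combined.items():
    print(sorted(hypothesis), f"{mass:.4f}")
```

With these inputs the results agree with the values listed above to within rounding of the inputs: the conflicting mass is divided out by 1 - κ, which is how the conflict is redistributed among the remaining hypotheses.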

Another vital aspect of the aggregation process is the determination of evidence compatibility. All the studies’ outcomes were considered compatible and, for this reason, aggregated. To reach this conclusion, the general orientation was to seek generalization. It was only possible due to the relatively high number of studies. Thus, we first assumed generalization, and if the results were conflicting, then explanations would be sought. Table 4 enumerates the generalizations applied in the aggregation. All the different kinds and sizes of software systems, such as ‘web system’ and ‘large-scale system,’ were ignored. Another important generalization was related to the primary goal of using Kanban. Most papers report the utilization of Kanban in a typical software development environment, i.e., in the commercial or controlled construction of software products. However, five studies were very distinct regarding the primary goal of Kanban use. Two of them investigated the use of Kanban in ‘portfolio management,’ and three used Kanban in ‘education’ as a tool for learning software concepts. Some studies were also conducted in a ‘distributed project’ setting and one with the specific purpose of software ‘maintenance.’

Table 4 Applied generalizations in the aggregation process

6 Discussion

Despite the important differences in the studies’ context, the aggregation did not present significant conflict in the results. The resulting aggregated evidence model indicates that the main benefits of Kanban are ‘work visibility,’ ‘control of project activities and tasks,’ ‘flow of work,’ and ‘time-to-market,’ regardless of the type of software systems under development or whether the projects are collocated or distributed. Also, the same results are achieved in the specific settings of ‘portfolio management’ and ‘education.’ Still, software organizations need to pay close attention to their ‘organizational culture,’ the ‘expertise’ of their management, and the right set of ‘supporting practices’ in place to obtain these improvements.

This work contributes to the body of knowledge on using Kanban in SE by strengthening evidence of its benefits and challenges. For most of the benefits (effects) and challenges (moderators), the results followed the trends indicated in the previous investigations. This observation helped to see which effects are general to the different settings in which Kanban is used in software organizations. We believe that the aggregated results point to more generalized perceptions and stronger indications of its applicability. Thus, it is expected that practitioners benefit from these indications to support their decision making regarding Kanban in practice. Besides decision making, the synthesis indications provide the basis to detect when Kanban is not producing the expected outcomes, allowing software organizations to act accordingly. Also, it may indeed be the case that the benefits could not be observed in some settings. These situations should be documented appropriately and incorporated into the synthesis to keep it up to date.

Previous primary studies already reported the effects caused by and moderators influencing the use of Kanban. However, most of them bring detailed qualitative observations from interviews and surveys while few others present quantitative data regarding specific aspects of ‘time-to-market’ and ‘external quality.’ Hence, this synthesis, besides strengthening evidence of Kanban benefits and challenges, also presents a concise and organized view of previous works regarding this theme.

On the other hand, previous secondary studies focused on different aspects of the body of knowledge than this synthesis study. Corona and Pani (2013) cover how Kanban boards are used, what is typically represented, and which tools are being employed to support software development teams. They identified and analyzed the most commonly represented activities in Kanban boards. Nine activities were identified from all the boards analyzed, namely Specification, Analyze, Build, Development, Test, Acceptance, Deploy/Delivery, Release Ready, and Documentation. Al-Baik and Miller (2015) is mostly a conceptual paper focused on describing how the main elements are discussed in the technical literature and how they relate to lean thinking. Based on 37 studies, the authors identified 20 different concepts, principles, and techniques related to Kanban, such as ‘pull system,’ ‘prioritized queue,’ ‘done item,’ and ‘validated learning.’ As mentioned in Section 3.1, Ahmad et al. (2013) was updated in Ahmad et al. (2018). The latter (Ahmad et al. 2018) is a mapping study showing, as is typical of studies of this kind, a broad overview of Kanban publications, including their venues, number of published papers over the years, and the research methods used in the studies.

Besides the aspects previously mentioned, both Al-Baik and Miller (2015) and Ahmad et al. (2018) dedicate part of their reports to enumerating and analyzing the reported benefits and challenges of Kanban, which is directly related to this synthesis study. Not by chance, these two works were used as the basis for the identification of the primary studies and as the source for the aggregated benefits and challenges. However, since the benefits and challenges were not the sole goals of these two works, the way that these issues were addressed is relatively more straightforward compared to this synthesis. They used an approach that can be directly related to the vote-counting strategy, whose limitations are broadly discussed in the technical literature (Pickard et al. 1998). Thus, this synthesis study represents an additional contribution to these secondary studies.

Turning the attention back to the aggregated results, although the synthesis shows the main benefits of Kanban, it should be noticed that there are several other factors with fewer studies investigating them. Thus, either they rarely appear in practice (and for this reason were not considered in the investigations), or they still need further studies to improve confidence in them. Particularly regarding the effect ‘team cohesion,’ we should add our interpretation of the divergence between the studies. The studies P5 and P20 have very distinct contexts. P5 was conducted in an educational environment whereas P20 was conducted in a large software process improvement initiative in a software organization. Interestingly enough, the negative intensity for ‘team cohesion’ came from the second study. The authors of P20 did not present their thoughts about this specific result since this was part of a questionnaire answered by the organization’s employees. The question was put in a ‘social factors’ section of the questionnaire. The only consideration in P20 was that “process transition has impacted the social factors the least.” Participants of P5, on the other hand, were more open to collaboration and to obtaining new skills, as this was precisely the goal of the capstone course in which the study was conducted. Using Kanban resulted in “building positive relationships among team members, effective task management, development of a shared vision and responsibility sharing.” We hypothesize that Kanban can help to improve social factors such as ‘team cohesion,’ but for this to happen the team members need to be open to the new mindset imposed by Lean thinking and Kanban concepts.

7 Threats to validity

Aggregating qualitative and quantitative evidence is by nature an endeavor subject to different risks. To mitigate this significant threat, a method appropriate for that goal, the SSM, has been used. The SSM aims to support the SE community in constructing and consolidating empirically-grounded knowledge (Santos and Travassos 2013). It has been used in different research topics such as software reference architecture (Martinez-Fernandez et al. 2015), software productivity (Chapetta 2016), and software inspections (Santos and Travassos 2017b). This section discusses possible threats to validity and emphasizes the mitigation actions used.

For the threat of missing critical primary studies, we used two secondary studies as a source of primary studies. We obtained a set of 20 studies reporting evidence on Kanban, which is a high number in SE considering that they report the same effects (i.e., benefits and drawbacks of Kanban). During this process, we discarded studies that only reported opinions, rather than empirically-grounded evidence.

We are aware that each selected study poses its own validity threats; therefore, we carefully assessed them together with the studies’ context to interpret their results appropriately. Furthermore, while representing empirical evidence from individual studies, researchers can reflect their own opinion and, thus, bias the representation. To mitigate these subjective issues, one researcher first prepared the definition and analysis of each evidence model from each selected primary study and then validated it together with two other researchers. During this process we experienced some semantic issues, meaning that different studies referred to the same concept using different terms. This would lead to a wrong aggregation. To avoid this, we created a glossary of the terms represented in the evidence models and kept track of the matching terms.

To improve the aggregated evidence interpretation, we used some suggested strategies (Santos and Travassos 2013). For instance, we recorded the diverse context of each study, so we could better reflect and understand the aggregated evidence. It is important to note that our aggregated results are based on what the authors reported in their papers. Hence, there is always the risk that valuable information might not have been reported. Also, even though Al-Baik and Miller (2015) and Ahmad et al. (2018) indicate which papers report each benefit and challenge, we read each paper to seek them again. It generated some divergences, as listed in Table 1, which we attribute to the definitions of the benefits and challenges listed in Table 6 (Appendix A).

Our results show that some effects got higher degrees of belief while others did not. Being aware of the correct interpretation of these results is essential. On the one hand, the effects and moderators that got higher belief are potentially those that have been further studied and agreed upon among the studies. On the other hand, those effects that got lower belief values (or even negative) are those that were just partially approached by the available evidence (or got contradictory results among the studies). Therefore, these effects are relevant topics that need to be further studied. We highly encourage the SE community interested in using Kanban to investigate the effects that do not have a high confidence value yet, to increase knowledge and consolidate the beliefs in the benefits and challenges of using Kanban in SE.

8 Conclusions

This paper presented a synthesis study regarding the benefits and challenges of using Kanban in SE. Despite being a relatively recent research topic, there is an extensive amount of available evidence regarding this topic, which supported the achievement of high confidence for several benefits and challenges worked out in the synthesis.

The primary contribution of this paper is to present a condensed view of the benefits and challenges regarding Kanban use in software organizations as reported in primary studies in the field. It brings an objective indication of what are the more relevant ones considering the knowledge available in the technical literature, which is an essential result of a synthesis study in comparison to the individual results of the primary studies that were aggregated. Also, it strengthens the understanding of which aspects Kanban can improve in software organizations and what factors should be addressed to achieve the best results considering all studies as a whole.

The benefits ‘work visibility’, ‘control of project activities and tasks,’ ‘flow of work,’ and ‘time-to-market’ indeed appear to be the ones intrinsically linked to the Lean thinking and the Kanban approach. Moreover, given the synthesis results, they do seem to be present in software projects as well. Still, the results must be taken with caution. We missed primary studies with negative results regarding Kanban implementations in software organizations. According to Staats et al. (2011), failed implementations are common outside the manufacturing realm, and it is quite surprising that the secondary studies used to select the primary studies for this synthesis did not find any.

The synthesis reported in this paper can be evolved from the point where its scope has been delimited. All the data is available in the Evidence Factory tool, and the SSM allows new primary studies to be added as necessary. One exciting addition would be the inclusion of the 23 experience reports listed in Ahmad et al. (2018). It can reinforce some of the effects and moderators which lack evidence and can bring new insights regarding the Kanban use in software organizations.