1 Introduction

“Carbon in the atmosphere is rising, even as emissions stabilize” was the headline of a recent article in the New York Times (Gillis 2017). The author was puzzled by this: “If the amount of the gas that people are putting out has stopped rising, how can the amount that stays in the air be going up faster than ever?” In fact, each ton of carbon dioxide (CO2) emitted from fossil fuel combustion increases CO2 concentration in the atmosphere for at least thousands of years (Archer and Brovkin 2008), meaning that emissions yesterday, today, and tomorrow produce warming that lasts. Hence, the total amount of CO2 emissions needs to be limited to avoid dangerous interference with the climate system, with net CO2 emissions eventually coming down to zero for atmospheric concentrations to stabilize. We are rapidly approaching the amount of carbon we can emit while staying below 2 °C warming, and at current emission levels that carbon budget would be exhausted within a few decades (Goodwin et al. 2018; Peters et al. 2012).

Despite this enormous challenge, the basic relationship between CO2 emissions and atmospheric CO2 concentrations is poorly understood by the public. The first study demonstrating the widespread failure to grasp the fundamental relationship between stocks and flows of CO2 in the carbon cycle—known as stock-flow (SF) failure—was that by Sterman and Booth Sweeney (2007). In their sample of 212 graduate students at Massachusetts Institute of Technology (MIT) within science, technology, engineering, mathematics, or economics, 84% gave answers to an SF task that violated basic mass-balance principles, assuming atmospheric carbon stocks would stabilize even if emissions exceeded removals. This is “analogous to arguing a bathtub filled faster than it drains will never overflow” (ibid. p. 216). The authors hypothesized that SF failure is due to the use of a pattern matching heuristic, where respondents match trends in flows and stocks, rather than accounting for the stock-flow dynamics of the system.
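The mass-balance principle behind the bathtub analogy can be illustrated with a minimal sketch. The function name `simulate_stock` and all numbers below are hypothetical and unitless, chosen only for illustration; they are not taken from the study or from any carbon cycle model. The sketch shows that a stock keeps rising as long as the inflow exceeds the outflow, even when the inflow itself is constant.

```python
def simulate_stock(initial_stock, inflows, outflows):
    """Return the stock level after each period, given per-period flows."""
    levels = []
    stock = initial_stock
    for inflow, outflow in zip(inflows, outflows):
        # Mass balance: the change in the stock equals inflow minus outflow.
        stock += inflow - outflow
        levels.append(stock)
    return levels


# Hypothetical values: emissions are constant ("stabilized") but exceed removals.
emissions = [10.0] * 5
removals = [6.0] * 5
levels = simulate_stock(100.0, emissions, removals)
print(levels)  # [104.0, 108.0, 112.0, 116.0, 120.0]
```

In this toy setting, the stock only stops growing once the inflow equals the outflow, which is the condition the SF tasks probe.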

Since the seminal paper by Sterman and Booth Sweeney (2007), several studies have focused on SF failure, and these can be divided into three main strands of research. First, there are studies that aim to confirm the findings by Sterman and Booth Sweeney (Cronin et al. 2009; Dutt and Gonzalez 2009). Second, there are studies that alter the tasks or the setting in an attempt to establish if the poor performance depends on external factors such as task design and context and background of participants (Cronin and Gonzalez 2007; Sterman and Booth Sweeney 2002, 2007; Guy et al. 2013; Fischer et al. 2015; Newell et al. 2016). Third, there are intervention studies that aim to improve understanding among the participants, mainly through knowledge transfer from other contexts or by active learning methods (Dutt and Gonzalez 2009, 2012a, b; Moxnes and Saysel 2009). A different approach was taken by Dryden et al. (2018), who simply asked for an estimation of the atmospheric residence time for CO2. Their results show that people estimate CO2 to be gone from the atmosphere within decades of being emitted, which further highlights misunderstandings around CO2 accumulation.

In this paper, we report on findings from a mixed methods study of public understanding of atmospheric CO2 accumulation. First and foremost, we wanted to take a closer look at the common yet intriguing finding in the literature on SF failure that most people “have difficulty relating the flows into and out of a stock to the level of the stock, even in simple, familiar contexts such as bank accounts and bathtubs” (Sterman 2011, p. 817). We surmised that most people have an intuitive understanding of the concept of accumulation, but this type of understanding is not revealed in the kind of CO2 stabilization task used by Sterman and Booth Sweeney (2007). We test this hypothesis by drawing on a typology of knowledge (Biggs 2003) that distinguishes between three different types of knowledge that SF tasks can assess: declarative (knowing what), procedural (knowing how), and conditional (knowing when) knowledge.

We note that previous research on SF failure seems to have overlooked this aspect of task design (there is, at least, no explicit discussion of different types of knowledge). Consequently, we developed two alternative SF tasks (using the carbon cycle and a bathtub, respectively, as contexts) with lower knowledge demands,Footnote 1 in the sense that they explicitly ask about the relationship between the flows into and out of a stock for the stock to stabilize. Performance on these two alternative SF tasks was compared with performance on a task with higher knowledge demands, similar to the one used by Sterman and Booth Sweeney (2007). To further test the surmised disconnection between these types of knowledge, we used a pre- and post-test design to investigate whether an explanation of the knowledge required to solve the tasks would have any effect on the performance on the kind of task used by Sterman and Booth Sweeney (2007).

In addition, through qualitative data, we sought to gain insight into different ways of reasoning when solving the SF tasks to better understand what could explain SF failure and why people seem unable to apply intuitive knowledge about accumulation in certain tasks. It is widely acknowledged that an understanding of how people make sense of concepts and principles in science is essential for effective science teaching and communication (Ambrose et al. 2010; Morgan et al. 2002). Yet, most previous research on SF failure has focused on task performance without probing how people actually reason when solving various SF tasks (Korzilius et al. 2014). One notable exception is the study by Korzilius et al. (2014), which used the think-aloud method to explore “reasoning patterns” used by people when solving SF tasks. The SF tasks in their study, however, were more generic, while our study focuses on ways of reasoning about atmospheric CO2 accumulation and how this relates to task performance. There are several reasons for the necessity of studying ways of reasoning in the CO2 context, ranging from the carbon cycle dynamics (which posit that the capacity for uptake of CO2 is determined by the historical emissions) to the amount of public debate on the topic. As an example, the New York Times article mentioned earlier received more than 600 comments online.

Finally, we investigated whether there is a connection between performance on our SF tasks and stated climate policy support, as suggested by some (Sterman 2008; Chen 2011; Dutt and Gonzalez 2012a). While there is some support for the notion that climate science literacy enhances concern for climate change (Hornsey et al. 2016; Guy et al. 2014; Ranney and Clark 2016), the previous literature on SF failure in the climate context has not explicitly tested for a relationship between SF task performance and stated climate policy support.

2 Method

2.1 Study context and participants

The context of the study was a massive open online course (MOOC) entitled “Sustainability in Everyday Life,”Footnote 2 offered by Chalmers University of Technology between Aug 29, 2016, and Oct 16, 2016, using the EdX platform. The course was not part of any university program, required no particular prior knowledge, and was open and free of charge to everyone with internet access. A diploma was issued only if the course was completed. This MOOC was chosen for this study due to the relevant course content and the possibility to get a large number of respondents.

The sustainability MOOC consisted of five modules or themes: globalization, climate, food, energy, and chemicals. Performance on the different kinds of SF tasks was assessed during the climate module, directly after a general introductory video on climate change that did not address the knowledge tested by the SF tasks. A question assessing climate policy support was included in the pre-course survey (i.e., before the students were introduced to any course contents). To motivate task completion, the SF tasks gave points that contributed to the total examination of the course regardless of performance.

Of 3540 participants enrolled in the course, 300 started the climate change module where the SF tasks were placed. Of these, 214 participated in the study by completing all of the SF tasks. A total of 49 countries were represented in the sample, with most participants from the EU/EEA (58), the USA (25), India (11), and Mexico (9). See the supplementary material for the full list. The sample included 119 females and 77 males (18 participants had not disclosed their gender). The participants’ average age was 38 years. Of the 92% who stated their highest attained educational level, 81% had a bachelor’s degree or higher. Admittedly, the high average education level, together with the fact that the participants have opted to take a course in sustainability, implies that our participants do not constitute a representative sample of the general public (see the supplementary material for more information on the course context and participants).

2.2 Study design

In this section, the overall design of the study is described along with the design of the tasks; in the next section we explain—by drawing on a typology of knowledge—how tasks were designed to assess different types of knowledge. Table 1 depicts the overall design of the study, summarizing the different tasks (all tasks were completed online) and the order in which they were completed—the five steps of the study design.

Table 1 An overview of the study design, describing the tasks’ order and format, the types of knowledge assessed, and the number of participants that completed each task

Prior to the SF tasks, the participants were given a question aiming to measure stated preferences with respect to climate policy (T0). Here, the participants were asked which one of the following statements came closest to their personal view:

  1. Society should not take any steps to reduce emissions of greenhouse gases (such as CO2).

  2. Society should reduce emissions of greenhouse gases in the future, in response to climate impacts as they actually occur.

  3. Society should take moderate actions to reduce emissions of greenhouse gases today, to reduce future climate impacts.

  4. Society should take strong action to reduce emissions of greenhouse gases today, to reduce future climate impacts.

  5. I do not know/I have not formed an opinion.

The alternatives were formulated to reflect attitudes of “wait and see” (2) or “go slow” (3), as discussed by Sterman (2008).

In the first SF task (T1), participants completed what we will refer to as the main SF task, which was designed to be similar to the task used by Sterman and Booth Sweeney (2007).Footnote 3 The main SF task consists of a short introductory text, graphs of the annual historic emissions and uptake of CO2, a graph of a scenario with a stabilized amount of CO2 in the atmosphere, and a multiple choice question (see Fig. 1). Participants were asked to choose, among four alternative graphs, the graph depicting emissions and uptake trajectories that is consistent with the scenario for CO2 stabilization. The correct answer is alternative 3 (marked with a green symbol).

Fig. 1 The main SF task (T1/T3), which also included an answer alternative 5: “I don’t know.” The correct answer is alternative 3, in which emissions and uptake meet (which causes the atmospheric CO2 amount to stabilize), after which they jointly diminish over time (since lower emissions cause uptake to fall)

Although the main SF task (see Fig. 1) was designed to be similar to the task used by Sterman and Booth Sweeney (2007), our version of the task contained less superfluous information, both in text and graphs, to avoid cognitive overload. However, we added more elaborate information about the CO2 uptake, which was given the same attention as the emissions. For the first period of the graphs (i.e., 1900–2015), the CO2 emissions and uptake values were produced using a simple climate model (Sterner and Johansson 2017), which simulates the carbon cycle response. For this, widely used “historic emissions” that give a realistic impression were used (Meinshausen et al. 2011).

No feedback on task performance was provided to the participants throughout the full set of tasks. In the second SF task (T2), participants were randomly assigned to complete one of three alternative tasks, T2A–C (see Table 1). In contrast to the main SF task, these tasks were designed to direct the participants’ attention towards the principles of accumulation. This was done by explicitly asking questions about (T2A–B) or describing (T2C) the relationship between the flows into and out of a stock in order for the stock to stabilize at a certain level. As a consequence, and as we argue in the next section, these tasks differ from the main SF task in terms of their knowledge demands, that is, in terms of the type of knowledge they assess. The first task (T2A) uses the carbon cycle as context (see Fig. 2), while the second (T2B) uses a bathtub as context (see Fig. 3). These two tasks are central to our hypothesis (stated in the introduction) as they allow us to investigate whether participants perform better on stock stabilization tasks that explicitly ask about the relationship between the flows into and out of a stock (T2A–B), compared with the kind of task used in previous studies (Dutt and Gonzalez 2012a; Guy et al. 2013; Newell et al. 2016; Sterman and Booth Sweeney 2007) (T1). The third task (T2C), which did not involve a question, uses a bathtub analogy to explain atmospheric CO2 accumulation in a simple way (see figure in the supplementary material); in T2C, the respondents were only asked to confirm that they had studied the analogy. This task, in contrast to T2A–B, presented the participants with the knowledge that is needed to solve the main SF task.

Fig. 2 A description of task T2A, directing participants’ attention towards the principles of accumulation in the original carbon cycle context. T2A was designed to have a lower knowledge demand compared with the main SF task: it (only) assesses declarative and procedural knowledge of accumulation

Fig. 3 A description of task T2B, directing participants’ attention towards the principles of accumulation in a bathtub context. T2B was (like T2A) designed to have a lower knowledge demand compared with the main SF task: it (only) assesses declarative and procedural knowledge of accumulation

Thereafter, the participants were asked to complete the main SF task again (T3) (see Table 1 and Fig. 1). The logic behind this was that the alternative tasks, T2A–C, would help participants by pointing to the knowledge needed for solving the main SF task, thus allowing us to investigate whether these three tasks could serve as educational interventions that improve performance on the main SF task.

In addition to testing people’s performance on SF tasks with different knowledge demands, we aimed to unpack public understanding of CO2 accumulation by exploring people’s ways of reasoning when solving SF tasks. We did this in task T4 by asking participants to provide a short, written explanation of how they reasoned when choosing to keep or change their answer when completing the main SF task again (T3). By combining data on how people answer SF tasks with data on how they reason while doing so, we aimed to study the mental representations used by the participants when answering the main SF task. Mental representations are similar to mental models (which are “personal, internal representations of external reality that people use to interact with the world around them”) (Jones et al. 2011), but we use the former term to emphasize that their nature is not seen to be stable or static to the same extent that mental models are sometimes viewed.

2.3 Task design and knowledge demands

As noted above, the tasks—the main SF task (T1/T3) and the alternative tasks (T2A–B)—were designed to assess different types of knowledge. While knowledge can be classified in many ways (Alexander et al. 1991), we draw on a typology described by (among others) Biggs (2003), comprising three types of knowledge:

  1. Declarative knowledge, which refers to “knowing about things [such as facts, concepts, and principles], or knowing what” (p. 41)

  2. Procedural knowledge, which refers to “knowing how to do things, such as carrying out procedures or enacting skills” (p. 42)Footnote 4

  3. Conditional knowledge, which refers to “knowing when to do these things [...] under what conditions one should do this as opposed to that” (p. 42)

These types of knowledge are “characterized by the function they fulfil in the performance of a target task” (de Jong and Ferguson-Hessler 1996, p. 106). To put it differently, we are interested in knowledge-in-use (ibid. p. 110).Footnote 5 Moreover, while “it is certainly possible to know the what of a thing without knowing the how or when of it” (Alexander et al. 1991, p. 323), successful problem solving requires the use of all three of these types of knowledge (Turns and Van Meter 2011). With these theoretical deliberations in mind, we now turn to an epistemological demand analysis (de Jong and Ferguson-Hessler 1996)—i.e., an analysis of the knowledge demands—of our SF tasks.

Tasks T2A (climate context) and T2B (bathtub context) were designed to assess declarative and procedural knowledge of accumulation. That is, in these tasks, participants first have to recall what the principles of accumulation (i.e., principles of mass balance) say—thus demonstrating declarative knowledge. Next, they have to figure out how to apply these principles to arrive at the relationship between the emissions/inflow and uptake/outflow for the amount of CO2 or water to stabilize at a certain level—thus demonstrating procedural knowledge.Footnote 6 The difference between T2A and T2B is mainly the familiarity of the context, where the more familiar context of a bathtub may make it easier to draw on knowledge that is relevant for solving the problem.

In the main SF task (T1/T3), on the other hand, participants not only have to apply the principles of accumulation—thus demonstrating declarative and procedural knowledge (as in T2A–B)—but also have to realize that this is what the task requires them to do—thus demonstrating conditional knowledge. Note that the main SF task does not direct the participants’ attention towards the principles of accumulation; that is, it does not explicitly ask about the relationship between the emissions and uptake for the amount of CO2 to stabilize. As such, one can argue that the main SF task (T1/T3) poses higher demands on knowledge, compared with tasks T2A–B.

2.4 Data analysis

In addition to descriptive statistics, a chi-square test of homogeneity was used to determine if the rate of success was significantly different between any pair of groups on the same task or any pair of tasks for the same group.
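As an illustration of this kind of test, the following is a minimal sketch of a chi-square test of homogeneity for two groups with a binary outcome (correct/incorrect). The function name `chi_square_2x2` and the counts are hypothetical and are not the study's data. The sketch assumes no continuity correction, since the text does not specify whether one was applied, and it assumes that no row or column total is zero. For a 2×2 table the statistic has one degree of freedom, so the p value can be computed with the complementary error function.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test of homogeneity for two success rates (1 df).

    Assumes no continuity correction and nonzero row and column totals.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    grand_total = total_a + total_b

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if both groups shared the same success rate.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 30 of 100 correct in one group, 50 of 100 in another.
chi2, p = chi_square_2x2(30, 100, 50, 100)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

With these hypothetical counts the statistic is about 8.33, so the two success rates would be judged significantly different at conventional levels.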

An inductive thematic analysis (Braun and Clarke 2006) was used to analyze the participants’ written answers to the open-ended question, “Briefly explain how you reasoned when choosing to keep or change your answer.” In line with this kind of qualitative analysis, a set of themes was identified after coding the data and sorting and sifting the codes in an iterative way. (For a more detailed account of the analysis, see the supplementary material.) These themes provided a deeper understanding of the ways of reasoning being used when answering the main SF task and made it possible to relate the performance on the different SF tasks to different ways of reasoning.

3 Results

3.1 Performance on SF tasks with different knowledge demands

Table 2 shows that there was a large difference in participants’ performance between SF tasks with different knowledge demands, that is, tasks assessing different types of knowledge. The main SF task, both as a pre-test and as a post-test, had a significantly lower success rate than the two alternative tasks, T2A (carbon cycle context) and T2B (bathtub context), that directed the participants’ attention towards the principles of accumulation and hence did not assess conditional knowledge. The success rate for the participants who were assigned T2A went from 26% on the main SF task to 54% on the alternative task. For the T2B group, the success rate increased from 17% to 70%. These differences are statistically significant (p < 0.001) and indicate a high level of intuitive understanding (declarative and procedural knowledge) of the principles of accumulation. The level of education also seems to be positively correlated with performance (see the supplementary material) but was not analyzed further because it is outside the scope of this study.

Table 2 Share of correct answers for SF tasks and a chi-square test of homogeneity, in which statistically significant (p < 0.1) differences are marked in italics

3.2 Efficacy of the interventions

For the full sample, the success rate on the main SF task was 21% in T1 and 28% in T3, after the alternative tasks, serving as interventions (see Table 2). This difference is not statistically significant (p = 0.14). Only one of the three interventions had a weakly statistically significant (p = 0.08) impact on the participants’ performance on the main SF task: the alternative task that directed the participants’ attention towards the principles of accumulation in the bathtub context (T2B). The task (T2C) that involved reading about the bathtub as an analogy for atmospheric CO2 accumulation (see the supplementary material) did not improve the participants’ success rate on the main SF task, even though it presented them with the knowledge needed to answer the task, using both text and visuals.

3.3 Ways of reasoning

Five different ways of reasoning when answering the SF tasks (from answers on task T4) were identified, and these could be grouped into three main categories: system reasoning (with three subcategories), pattern reasoning, and phenomenological reasoning. These reflect different mental representations of the tasks (and possibly different levels of ambition in dealing with the tasks). Below, we describe what the participants focused on when using a certain way of reasoning, with Table 3 showing the frequency of responses that were classified to belong to the different categories of reasoning and some illustrative quotes for the different ways of reasoning.

Table 3 The participants’ answers to the open-ended task (T4) were classified into five ways of reasoning, which are summarized into three overarching categories. The frequencies reported are the fraction of the 214 answers that were classified to belong to a given category or way of reasoning. These do not sum up to 100% since some answers were classified as belonging to several ways of reasoning. The ways of reasoning are exemplified using illustrative quotes

Participants who used system reasoning focused on the system in terms of a relationship between emissions and uptake. We identified three different ways of conceptualizing this relationship:

  1. Conservation of mass, which correctly posits that emissions must equal uptake for CO2 stabilization

  2. No accumulation, which incorrectly posits that the difference between emissions and uptake must be constant for CO2 stabilization. Some participants claimed that the amount of CO2 in the atmosphere is equal to the annual difference between emissions and uptake (i.e., \( A = E - U \)). Consequently, this way of reasoning does not take into account the amount of CO2 in the atmosphere at the start of each year that remains from past years

  3. Historic debt balancing, which incorrectly posits that emissions must go below uptake for CO2 stabilization. According to this way of reasoning, emissions have historically been above uptake and all emitted CO2 needs to be taken up for CO2 stabilization (i.e., \( \dot{A}=0 \) only if \( \int E\,dt = \int U\,dt \))
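The three conceptions can be set side by side against the mass-balance relation. The following is a sketch in which A denotes the atmospheric amount of CO2, E the annual emissions, and U the annual uptake, as in the list above; the explicit time argument and the reference time \( t_0 \) are notational assumptions added here for illustration.

```latex
% Mass balance: the stock changes by the net flow.
\[ \dot{A}(t) = E(t) - U(t) \]

% Conservation of mass (correct): the amount is stable exactly when the flows are equal.
\[ \dot{A}(t) = 0 \iff E(t) = U(t) \]

% No accumulation (incorrect): the stock is equated with the net flow,
% so a constant difference E - U is taken to imply a constant amount.
\[ A(t) = E(t) - U(t) \]

% Under mass balance, the amount instead carries over what remains from past years:
\[ A(t) = A(t_0) + \int_{t_0}^{t} \bigl( E(\tau) - U(\tau) \bigr) \, d\tau \]

% Historic debt balancing (incorrect as a stabilization condition):
\[ \int_{t_0}^{t} E(\tau) \, d\tau = \int_{t_0}^{t} U(\tau) \, d\tau \]
% By the integrated mass balance, this is the condition for A(t) = A(t_0),
% i.e., a return to the initial amount, not for stabilization at the current amount.
```

Read this way, historic debt balancing answers a different question (returning the amount to an earlier level) than the one posed in the task (holding the amount constant).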

Participants who used pattern reasoning inappropriately focused on matching graphical patterns between the amount of CO2 in the atmosphere and the annual emissions or uptake. Alternatively, they focused on the notion of “stabilization,” without being explicit about in what sense.

Participants who used phenomenological reasoning focused on a variety of aspects of phenomena related to climate change that are not needed for solving the SF tasks.Footnote 7 Examples of such phenomena can be found in the illustrative quotes for this way of reasoning in Table 3 but include population growth and sources of emissions and uptake. Based on these phenomena related to climate change, participants seemingly or explicitly inferred what will or should happen to emissions and uptake in the future, rather than dealing with the task as it is formulated.

3.4 Relation between ways of reasoning and answers on the main SF task

Figure 4 shows how the five ways of reasoning, identified from the answers on task T4, are related to answers on the main SF task in the post-test (T3). While some of the participants who chose the first or second (incorrect) alternatives of increasing or stable emission scenarios reasoned in terms of no accumulation, the majority of those who chose the second alternative used pattern reasoning. The vast majority of those who chose the third (correct) alternative used conservation of mass. The majority of those who chose the fourth (incorrect) alternative, where emissions plummet below uptake, reasoned in terms of historic debt balancing. In summary, Fig. 4 shows that apart from phenomenological reasoning—which appears in all four alternative answers—there is a dominant way of reasoning behind each alternative. The occurrence of phenomenological reasoning in all alternative answers in the post-test suggests that the participants struggled to create a correct mental representation of the main SF task; that is, they struggled to judge what prior knowledge is relevant for the SF task at hand.

Fig. 4 Relation between ways of reasoning and answers on the post-test. Pattern reasoning is in orange and shows up in alternative 2 (which matches the pattern of emissions with that of amount), as expected. Phenomenological reasoning is gray and is distributed between the different answers. The system reasoning category is marked with different patterns of blue to highlight that the answers almost exclusively belong to one of three different subcategories of system reasoning: no accumulation (small white dots), conservation of mass (chess squares), and historic debt balancing (diagonal stripes)

We note that among those who managed to create or utilize a mental representation that guided them to the correct answer, only a couple used phenomenological reasoning. The largest shares of unclassified explanations fell into the first two answer alternatives, which also had the largest shares of pattern matchers. This may indicate that a disproportionately large share of the answers for alternatives 1 and 2 were less thought through than the average answer, for two reasons: the main reason for not being classified was that the explanations given were too brief to be classifiable (which we take as a sign of the tasks being given little thought), and pattern matching is considered to be a general solution heuristic (Gilovich and Savitsky 2002) requiring little cognitive effort.

Lastly, we note that among those answering alternative 4 (in which emissions go below uptake), a higher than average number of participants were categorized into more than one way of reasoning. Most often they reasoned both about what they want to happen or what needs to happen in terms of human development (as opposed to in terms of emissions, uptake, and amount)—i.e., phenomenological reasoning—and about the need for emissions to go below uptake for the amount to stabilize—historic debt balancing.

3.5 Relation between performance on SF tasks and stated climate policy support

The stated support for stringent climate policies was very strong in our sample (see the supplementary material), with 93% of the 167 participants that answered both the SF tasks and the climate policy question agreeing with the statement that “society should take strong action to reduce emissions of greenhouse gases today.” This clearly shows that our sample participants constitute an interested and pro-climate policy group of the general public. Given this lack of variance in stated climate policy support, we were unable to explore potential correlations between different types of knowledge (or understanding) of climate physics and stated policy support. However, these results suggest that at least the type of knowledge tested in the main SF task is not a prerequisite for stated support for stringent climate policy.

4 Discussion

4.1 Probing SF failure: knowing how and knowing when

Interestingly, but in line with our hypothesis that SF tasks with lower knowledge demands would result in higher success rates, participants performed significantly better on the SF tasks that directed their attention towards the principles of accumulation (T2A–B), compared with the main SF task (T1/T3). As Newell et al. (2013) pointed out, “Given the low-base of accurate performance in [SF tasks], any manipulation which leads to over 50% of the sample getting the answer (approximately) correct is newsworthy” (p. 3143). Our finding nuances the common finding in the literature on SF failure that most people “have difficulty relating the flows into and out of a stock to the level of the stock, even in simple, familiar contexts such as bank accounts and bathtubs” (Sterman 2011, p. 817). Instead, we found that most participants were able to successfully solve SF tasks (T2A–B) assessing declarative and procedural knowledge of accumulation (knowing what and knowing how) but struggled with conditional knowledge (knowing when) in relation to the main SF task. To put it in simpler terms, our finding suggests that people do “understand” the principles of accumulation and how to use them but do not understand that it is this knowledge they should apply in the main SF task. This finding is in line with research on problem solving in physics, indicating that students find it difficult to create a correct mental representation of a new problem by combining the information provided in the problem statement with relevant background knowledge (Savelsbergh et al. 2002).

Yet, the idea that different kinds of SF tasks may assess different types of knowledge of accumulation seems to be largely overlooked in the literature on SF failure; there is, at least, no explicit discussion of different types of knowledge or what it means to “understand” accumulation. Indeed, we note that the high success rates on several SF tasks reported by Fischer et al. (2015) could be a result of what type of knowledge they assess, rather than the particular format (without graphs), as suggested by the authors.

4.2 Efficacy of the interventions

Only one of the three alternative tasks that directed the participants’ attention towards the principles of accumulation had a (weakly) statistically significant impact on performance on the main SF task in the post-test: the alternative task that used the bathtub analogy as context (T2B). This finding supports the notion that while analogies can be an effective teaching tool (Podolefsky and Finkelstein 2006), active learning methods, such as answering a question, are more conducive to learning compared with just reading or hearing an explanation (Freeman et al. 2014). However, the rather small improvement in the success rate for the main SF task suggests that additional scaffolding is needed to overcome the challenges inherent in the main SF task.

4.3 Ways of reasoning provide additional theoretical insights into SF failure

We identified five ways of reasoning when dealing with the main SF task, and these could be grouped into three categories: system reasoning, pattern reasoning, and phenomenological reasoning. These ways of reasoning provide additional theoretical insights to explain the large difference in performance between the different kinds of SF tasks. More specifically, they provide insights into what background knowledge participants drew on to create a mental representation of the main SF task. Our results therefore support the interesting hypothesis that SF failure “may be less a matter of incorrect knowledge and more a matter of incorrect problem representation” (Cronin and Gonzalez 2007, p. 15).

System reasoning consists of three subcategories, which we have termed conservation of mass, no accumulation, and historic debt balancing. The “no accumulation” subcategory supports the claim made by Cronin and Gonzalez (2007, p. 11) that some people “will look at the difference between the inflow and outflow when thinking about the stock […], but they will ignore current accumulation in the stock”.
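
The contrast between conservation-of-mass reasoning and "no accumulation" reasoning can be illustrated with a minimal numerical sketch. The function names and all numbers below are hypothetical and chosen only for illustration; they are not data from this study or from the SF tasks.

```python
def mass_balance_stock(initial_stock, inflows, outflows):
    """Stock after each period when each net flow is added to the existing stock."""
    stock = initial_stock
    trajectory = []
    for inflow, outflow in zip(inflows, outflows):
        stock += inflow - outflow  # conservation of mass: the old stock carries over
        trajectory.append(stock)
    return trajectory


def no_accumulation_stock(inflows, outflows):
    """'No accumulation' reasoning: only the current net flow is considered."""
    return [inflow - outflow for inflow, outflow in zip(inflows, outflows)]


# Hypothetical values in arbitrary units (not taken from the study).
initial_stock = 100
emissions = [10, 8, 6, 4]  # inflow falls over time
removals = [4, 4, 4, 4]    # outflow stays constant

print(mass_balance_stock(initial_stock, emissions, removals))  # [106, 110, 112, 112]
print(no_accumulation_stock(emissions, removals))              # [6, 4, 2, 0]
```

Under these assumed numbers, the mass-balance stock keeps rising until the inflow equals the outflow and then stabilizes at a higher level, whereas the "no accumulation" view tracks only the shrinking net flow and loses sight of what is already in the stock.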

Pattern reasoning involves using the correlation heuristic as a problem solving strategy, “erroneously assuming that the behavior of a stock matches the pattern of its flows” (Cronin et al. 2009, p. 1). While the correlation heuristic has been put forward as the dominant reason for SF failure (Cronin et al. 2009), it remained an untested hypothesis until recently. As Korzilius et al. (2014) noted:

Thus far, research on stock-flow performance has focused on the outcomes of reasoning processes and inferred that individuals use correlational reasoning while estimating stock-flow behavior, assuming that the flow(s) immediately and directly affect the stock. The actual reasoning process of participants remained hidden from the researchers. […] We may say that the correlation heuristic has the status of a hypothetical idea, a presumption that still has to be tested in research (p. 269).

Our study provides empirical evidence, both quantitative and qualitative, for the claim that people use the correlation heuristic as a problem solving strategy. In the main SF task, the answer alternative that was selected by most participants (about 45%) was the pattern matching alternative, and pattern reasoning was the most frequently used explanation for choosing this alternative. This finding is in line with previous research, demonstrating a strong tendency for pattern matching (e.g., Dutt and Gonzalez 2013; Reichert et al. 2015; Cronin et al. 2009; Sterman 2008).
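
Why pattern matching leads to a wrong answer can be shown with a small sketch that compares the direction of a flow with the direction of the resulting stock. The helper names and the numbers are hypothetical illustrations, not the values used in the main SF task.

```python
from itertools import accumulate


def stock_levels(initial_stock, inflows, outflows):
    """Stock level at the start and after each period (running sum of net flows)."""
    net_flows = [inflow - outflow for inflow, outflow in zip(inflows, outflows)]
    return list(accumulate(net_flows, initial=initial_stock))


def direction(series):
    """Overall direction of a series, judged from its first and last value."""
    if series[-1] > series[0]:
        return "rising"
    if series[-1] < series[0]:
        return "falling"
    return "flat"


# Hypothetical values in arbitrary units: emissions fall but stay above removals.
emissions = [10, 9, 8, 7]
removals = [5, 5, 5, 5]
levels = stock_levels(100, emissions, removals)

print(direction(emissions))  # falling (the pattern a correlation heuristic projects onto the stock)
print(levels)                # [100, 105, 109, 112, 114]
print(direction(levels))     # rising (the mass-balance result)
```

As long as the inflow exceeds the outflow, the stock rises even though the inflow itself is falling, so matching the stock's trajectory to the shape of the flow gives the wrong direction in this assumed scenario.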

To our knowledge, phenomenological reasoning has not been documented in the literature on SF failure. What distinguishes phenomenological reasoning from the other types of reasoning is a strong focus on the context of the SF task and various phenomena related to climate change. Previous research on SF failure has viewed contextual knowledge as something that might be lacking and hence a potential explanation for the poor performance on SF tasks (Cronin et al. 2009; Newell et al. 2013). Interestingly, in our study, the problem was rather the opposite: It is not that participants knew “too little” about the context—it is rather that they knew “too much” and got “lost in the complexity of the context,” to borrow a phrase from Eggert et al. (2017). The crux of phenomenological reasoning is echoed in an observation made by the Spanish novelist Pérez-Reverte (1998):

There are no innocent readers anymore. Each overlays the text with his own perverse view. A reader is the total of all he’s read, in addition to all the films and television he’s seen. To the information supplied by the author he’ll always add his own. And that’s where the danger lies: An excess of references (p. 335).

By unearthing several such “references” and putting phenomenological reasoning next to the other ways of reasoning, we provide novel insights into climate change domain-specific challenges related to solving the kind of SF task used by Sterman and Booth Sweeney (2007).

Our findings have important implications for teaching and climate change communication. First of all, it is unlikely that a single learning activity or explanation will help all people, with their different ways of reasoning, to understand atmospheric CO2 accumulation. People using no accumulation reasoning need help to realize that the CO2 that was present last year does not magically disappear, so to speak. Those using historic debt balancing would likely benefit from being reminded that we are opting to stabilize the CO2 amount at a higher level (compared with pre-industrial times) and that if all CO2 emitted by humans (since industrialization) were taken up, we would fall back to pre-industrial atmospheric CO2 levels. People using phenomenological reasoning, and potentially also those using pattern matching, would likely benefit from a guided step-by-step comparison of the carbon cycle with a carefully chosen analogical system. This could help them focus on the principles of accumulation. Having been told or reminded of how the principles work in a contained and familiar analogical context, the learners should have a chance to follow an assisted transfer of knowledge back to the CO2 context. This may help them realize how the principles apply in the climate context, which by itself may previously have caused them to lose track of their reasoning about accumulation.

A limitation of the thematic analysis presented here was the brevity of the answers provided by most participants to the open-ended question. Thus, a next step could be to conduct semi-structured interviews with a smaller sample to explore in more detail what conceptual and mathematical difficulties people experience when dealing with SF tasks that assess different types of knowledge. Investigating deeper psychological mechanisms behind the different ways of reasoning identified in this study is also a possible next step. The substantial fraction of answers that included people’s attitudes about what they want to see happen suggests that how people answer and reason is affected by more than mere task-specific cognitive reasoning. A large fraction of the participants seem to have unconsciously substituted the cognitively demanding SF task with a simpler question and answered that question instead, which is what Kahneman and Frederick (2002) call attribute substitution. We hypothesize that attribute substitution may explain why people tend to use pattern reasoning and phenomenological reasoning, and thus an inappropriate mental representation of the SF tasks.

4.4 Link between knowledge and stated policy support

Our results clearly demonstrate that performing well on the main SF task is not a necessary condition for stated support for climate policy. This should perhaps come as no surprise, given the extensive evidence that there is a host of other factors, beyond knowledge, that influence people’s attitude and behavior in relation to climate change mitigation and adaptation, such as values, social norms, science skepticism and literacy, and political orientation (Hornsey et al. 2016; Hamilton et al. 2015; Gifford 2011; Wibeck 2014).

On the other hand, in no way do our results rule out that a better understanding of (some aspects of) climate science could affect support for climate policy or that understanding could be important for actual (or revealed) climate policy support. The existing evidence on the connections between climate science literacy and climate policy support does show that greater understanding of climate science correlates with greater belief in or acceptance of climate change (Hornsey et al. 2016; Guy et al. 2014; Ranney and Clark 2016) and that greater belief in turn is associated with stronger support for climate policy (Hornsey et al. 2016), though the latter effect is relatively small. Hence, we agree with Eggert et al. (2017) who argue that conceptual understanding of climate physics “is an important prerequisite to change individuals’ attitudes towards climate change and thus to eventually foster climate literate citizens” (p. 137).

A key question, related to the main focus of this study, is what (type of) knowledge has the largest potential to leverage climate policy support. For instance, the Climate Literacy Framework presented by the US Global Change Research Program lists no fewer than 39 points that climate literate citizens should know in order to make informed decisions on climate change; a better understanding of which of these points are more important for fostering support for climate policies would help promote more effective climate change communication. The results presented by Shi et al. (2016) show that knowledge in different domains of climate science (such as basic physics, causes, and impacts) can affect attitudes to climate risks differently. However, this and other studies on the links between climate literacy and concerns have solely focused on different facets of declarative knowledge (i.e., climate science facts). The results presented in this study suggest that it would also be interesting to further explore the relationship between other types of knowledge (procedural and conditional) and climate policy support.

5 Conclusions

The question of whether people understand atmospheric CO2 accumulation is not as simple as it seems. This mixed methods study of public understanding of atmospheric CO2 accumulation and stated climate policy support extends previous research on SF failure by showing that:

  • Seemingly similar SF tasks may assess different types of knowledge, and people perform significantly better on tasks assessing declarative and procedural knowledge compared with tasks assessing conditional knowledge

  • When faced with a climate SF task, most people use one of three overarching ways of reasoning: system reasoning, pattern reasoning, and phenomenological reasoning

  • System reasoning took on three different forms which we name conservation of mass, no accumulation, and historic debt balancing. These three different ways of reasoning suggest that the system was treated using three distinctly different mental representations

Taken together, our findings show that SF failure can be due to the use of inappropriate mental representations of SF tasks rather than a poor understanding of the principles of accumulation. This calls for both a more nuanced discussion on how to promote understanding of climate science and a more detailed exploration of the links between different (types of) climate science knowledge and climate policy support.