Introduction

STEM (Science, Technology, Engineering, and Mathematics) has become a popular term in education worldwide. It is widely acknowledged that a workforce with adequate STEM knowledge and skills is needed to meet future challenges (e.g., Caprile et al., 2015; English, 2016; Marginson et al., 2013; National Science and Technology Council, 2013; The Royal Society Science Policy Centre, 2014).

While STEM education has drawn increased attention and research, it is also filled with debates and dilemmas (STEM Task Force Report, 2014). There is little consensus on what STEM education means and how it should be realized in practice (NAE & NRC, 2014). In a broad sense, it can refer to either the sum of the individual disciplines involved in STEM or an interdisciplinary approach to STEM education that emphasizes the connections across disciplines. The latter is what this paper focuses on. That is, in this paper, STEM education stands for interdisciplinary STEM education, which will be further elaborated in the “Theoretical framework” section.

STEM education is driven by today’s complex policy, economic, social, and environmental problems, which require solutions that are integrated and interdisciplinary in nature (e.g., Bryan, Moore, Johnson, & Roehrig, 2015; English, 2017; Sanders, 2009). Simply put, it is a means for linking students’ learning across the STEM disciplines (NRC, 2009; STEM Task Force Report, 2014). However, given that the traditional discipline-based approach still dominates the educational system, how interdisciplinary STEM education should be assessed and evaluated has raised many concerns, such as inadequate research evaluating the efficacy of integrated instruction and challenges in assessing students’ interdisciplinary understanding (Griese et al., 2015; Herro et al., 2017; Shen, Liu & Sung, 2014; You et al., 2018).

This review study aims to address the timely issue of assessing student learning in STEM education and to move the field forward. Through reviewing relevant literature and using a two-dimensional framework, the study summarizes and examines approaches to assessing student learning in STEM education beyond the elementary or primary level. While acknowledging that there is a body of literature focusing on interdisciplinary STEM education at the elementary level (e.g., English & King, 2015; Sinatra et al., 2017), we did not include this level for two reasons: (1) theoretically, students at this level tend to have a very naïve concept of the (nature of the) disciplines involved in STEM; (2) pragmatically, we wanted to narrow the scope of this study. The following question guided this review: What is typically included in assessments of student learning in STEM education at the secondary and tertiary levels?

Theoretical framework

Assessment in this paper refers to deliberate efforts to observe student learning through different means in order to evaluate where students are with respect to one or more specific learning objectives. We follow the reasoning elaborated in the “assessment triangle” (NRC, 2001; Pellegrino, 2014), which includes three interconnected elements: cognition, observation, and interpretation. Specifically, we submit that the design of any assessment should begin with the cognition element—the target knowledge, skills, or other aspects to be assessed in a specific subject domain and how such knowledge and skills can be represented. Since we focus on interdisciplinary STEM education that involves multiple disciplines, the cognition element involves two interrelated aspects: (a) the nature of the multiple disciplines to be assessed and (b) the learning objectives in relation to the multiple disciplines.

Nature of disciplines

As STEM education involves multiple STEM disciplines, it is necessary to clarify the nature of disciplines in our study. In general, the term discipline refers to a branch of human knowledge (Choi & Pak, 2006; Klein, 1990) and the tradition, culture, and community associated with it. More specifically, in this study, disciplines include the main fields considered in STEM education such as science, technology, engineering, and mathematics, as well as their sub-fields such as physics in science and geometry in mathematics.

People use different terms (e.g., integrated, interdisciplinary, transdisciplinary) to describe the connections and integration among multiple disciplines (NAE & NRC, 2014; Shen, Sung, & Zhang, 2015; STEM Task Force Report, 2014). An interdisciplinary approach juxtaposes and integrates (parts of) two or more disciplines, focusing on establishing explicit connections between relevant disciplines (Klein, 2004; Miller, 1981). In education, we view interdisciplinarity as learners building connections between different disciplines, such as integrating knowledge and skills from two or more disciplines in order to solve complex problems or explain sophisticated phenomena. We also submit that an interdisciplinary approach is built upon its disciplinary roots (Zhang & Shen, 2015). Therefore, in an interdisciplinary approach, learning in each discipline is clearly discernible and needs to be supported (NAE & NRC, 2014).

A transdisciplinary approach refers to the unity of knowledge and skills beyond disciplinary framing (Nicolescu, 2002). As Miller (1981) put it elegantly, “transdisciplinary approaches are articulated conceptual frameworks which claim to transcend the narrow scope of disciplinary world views and metaphorically encompass the several parts of the material field which are handled separately by the individual specialized disciplines” (p.21). Compared with an interdisciplinary approach, a transdisciplinary approach focuses on the problem space or issue at hand without paying much attention to traces of the individual disciplines (Klein, 2008). Similar to a transdisciplinary perspective, an integrated approach in educational settings often refers to combining ideas and subject matters of different disciplines into a seamless whole (Lederman & Niess, 1997).

Learning objectives

A variety of approaches to STEM integration have been proposed (e.g., Guzey et al., 2014; Moore et al., 2014; NAE & NRC, 2014; Sanders, 2012). Oftentimes, these approaches differ primarily because each program has its own learning objective(s). According to Bloom’s taxonomy (Anderson & Krathwohl, 2001; Bloom, 1956), learning objectives may be categorized into three domains: the cognitive, the affective, and the psychomotor. The cognitive domain includes knowledge as well as processes and skills that are performed mentally (e.g., understanding factual and conceptual knowledge, practicing procedural and metacognitive knowledge). The affective domain concerns constructs related to feelings and emotions (e.g., interest, attitudes, motivation, and values). The psychomotor domain deals with processes and skills that are physically performed (e.g., body movements, physical abilities). As no STEM educational program in our pool of literature focused on the psychomotor domain, we chose to focus on the cognitive and affective domains in this study.

Considering the cognitive domain, a basic learning objective of STEM education is to help students develop content knowledge in one or more specific disciplines within STEM. For instance, a common approach in STEM education is to use engineering design to help students develop math and/or science knowledge (Becker & Park, 2011; Guzey et al., 2014; Sanders, 2009, 2012). In this approach, one or more specific disciplines within STEM (e.g., math and science) are prioritized, and the other discipline(s) (e.g., engineering) serves as a vehicle for delivering that content.

Another common learning objective of STEM education within the cognitive domain is to help students develop skills that go beyond a single discipline. Consistent with this perspective, a range of programs focusing on STEM integration have used learning tasks that are situated in a complicated context and require students to apply knowledge from multiple disciplines (e.g., Hurley, 2001; Moore et al., 2014). In this way, each discipline is treated as equally important (each helps in understanding the situation), and the amount of knowledge drawn from each discipline depends on the nature of the learning problem or situation.

While educators have long pointed out the problem of separating knowledge and skills in teaching and learning, integrating knowledge and skills as a measurable outcome poses significant challenges. For instance, the US Next Generation Science Standards (NGSS; NRC, 2012) adopted the term “practices” to “emphasize that engaging in scientific investigation requires not only skill but also knowledge that is specific to each practice” (p. 45). New approaches to assessing NGSS in a more integrated manner have shown promise (NRC, 2014).

Many have argued that a significant outcome of an interdisciplinary STEM education approach is positive change in students’ affective domain (e.g., Apedoe et al., 2008; Guzey et al., 2016). The affective domain in our study includes measures such as students’ interest, engagement, attitude, and motivation toward STEM content and practices, as well as career aspirations for STEM professions.

Methods

In order to conduct a systematic review, we followed established procedures and guidelines (e.g., Liberati et al., 2009; Namdar & Shen, 2015). We conducted a careful literature search and screening and then extracted key information from the studies included in our pool (see below). We also developed a coding framework based on our theoretical framework to analyze the studies. We report descriptive statistics, key information, and important insights, supplemented with examples to help the reader better understand the results.

Literature collection

Since we targeted STEM education, “STEM” and “science, mathematics, technology, and engineering” were the initial search terms. Due to the different interpretations of STEM education, we also included “integrated”, “interdisciplinary”, and “transdisciplinary” in the search terms. In order to expand the initial pool, we used different combinations of these terms in the title, abstract, and keywords to search for literature in ERIC, ProQuest, as well as major STEM education journals such as International Journal of Science Education, Journal of Research in Science Teaching, Science Education, International Journal of Science and Mathematics Education, and International Journal of STEM Education. The search generated 635 initial references. Fig. 1 shows the search result at each step.
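
To make the search procedure concrete, the sketch below illustrates one way such term combinations could be generated programmatically. The query strings and field handling are our own simplified assumptions for illustration, not the exact queries run in ERIC or ProQuest.

```python
# Illustrative only: generate Boolean query strings by crossing each
# integration-related term with each STEM-related term. These strings would
# then be applied to title/abstract/keyword fields in a database interface.
from itertools import product

integration_terms = ["integrated", "interdisciplinary", "transdisciplinary"]
stem_terms = ['"STEM"', '"science, mathematics, technology, and engineering"']

queries = [f"{integ} AND {stem}" for integ, stem in product(integration_terms, stem_terms)]

for query in queries:
    print(query)
# e.g., integrated AND "STEM"
#       interdisciplinary AND "science, mathematics, technology, and engineering"
```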

Fig. 1

Literature screening flowchart

The abstracts of these articles were reviewed to conduct a preliminary screening. The following criteria were then used to select relevant articles for our study.

  • Only research articles that empirically and explicitly reported interdisciplinary STEM educational programs for G6-college students were included. Again, as elementary or primary school students have spent much less time in solid disciplinary learning in STEM, studies targeting the elementary level were excluded from this review.

  • The articles needed to report quantitative or qualitative methods for assessing student learning, including knowledge, skills, practices, or affective domains, as well as descriptions of curriculum and learning activities that allow us to better understand their assessments. We excluded literature focusing solely on developing specific program evaluation/assessment tools in STEM (e.g., Hernandez et al., 2014; Herro et al., 2017). Studies assessing teachers’ development or pedagogical skills were also excluded.

  • We only included articles published during the 2000–2019 period. Prior to 2000, STEM education paid more attention to improving mathematics and science outcomes separately and involved little integration of engineering and technology (Breiner et al., 2012; Bybee, 2010; Hoachlander & Yanofsky, 2011; Sanders, 2009).

  • To narrow down our scope, we only selected peer-reviewed journal articles published in English.

Analysis

After selecting relevant articles, we read all the articles carefully and recorded their basic information (authors, journal name, year published, grade level, disciplines involved, types of assessment, main results, etc.) in a spreadsheet. In accordance with our theoretical framework, we then examined the assessment aspect of each article using a coding framework with two dimensions: (1) the nature of disciplines and (2) the learning objectives (Fig. 2). The first dimension includes three categories: monodisciplinary (i.e., the assessment targets individual disciplines), interdisciplinary (i.e., the assessment targets connections between disciplines), and transdisciplinary (i.e., the assessment targets constructs beyond disciplinary constraints).

Fig. 2

Coding framework

The second dimension includes four aspects related to learning objectives: knowledge, skills, practices, and the affective domain. Knowledge in our study refers to structured and systematized concepts organized within a single discipline (e.g., force in physics) as well as crosscutting concepts or general principles connecting multiple disciplines (e.g., the concept of size and scale) (McCormick, 2004). Skills, on the other hand, focus on learners’ ability to do something. When put in the context of disciplines, skills may include monodisciplinary skills (e.g., experimental skills in science), interdisciplinary skills (e.g., integrating knowledge from two or more disciplines), and transdisciplinary skills (e.g., creativity, critical thinking).
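
As a concrete illustration of how the two dimensions interact during coding, the minimal sketch below shows one way an article’s codes could be recorded. The data structure, the article identifier, and the specific codes are our own hypothetical simplification, not the authors’ actual spreadsheet or instrument.

```python
# A minimal sketch of recording codes on the two-dimensional framework (Fig. 2).
# The category labels follow the paper; the example article is hypothetical.
from dataclasses import dataclass, field
from typing import List

NATURE_OF_DISCIPLINES = {"monodisciplinary", "interdisciplinary", "transdisciplinary"}
LEARNING_OBJECTIVES = {"knowledge", "skill", "practice", "affective"}

@dataclass
class ArticleCoding:
    article_id: str
    nature: List[str] = field(default_factory=list)       # one or more categories
    objectives: List[str] = field(default_factory=list)   # one or more categories

    def validate(self) -> None:
        # An article may fall into multiple cells, but every code must be a known category.
        assert set(self.nature) <= NATURE_OF_DISCIPLINES
        assert set(self.objectives) <= LEARNING_OBJECTIVES

# Hypothetical example: an article coded as assessing monodisciplinary knowledge (MDK)
# and transdisciplinary affective outcomes (TDA).
example = ArticleCoding("article_017",
                        nature=["monodisciplinary", "transdisciplinary"],
                        objectives=["knowledge", "affective"])
example.validate()
```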

Note that since an article may report multiple types of assessment, each article could be coded into more than one category. Two graduate students were first trained to become familiar with the coding framework. Each coder then coded a set of the articles separately and checked the other coder’s work. Issues emerging from the coding process were brought up for discussion and resolved within the research team. In the end, the two coders double-checked all the coding together. Cohen’s kappa indicated a high level of inter-rater reliability between the two coders (F = 4.0, Sig = .000).
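
For readers who wish to replicate this step, the short sketch below shows a standard way to compute Cohen’s kappa from two raters’ categorical codes. The code lists are placeholders rather than our actual data, and scikit-learn’s cohen_kappa_score is simply one common implementation of the statistic.

```python
# A sketch of computing inter-rater agreement with Cohen's kappa, assuming each
# rater assigns one categorical code per article. The code lists are fabricated
# placeholders for illustration.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["MDK", "IDK", "TDA", "MDA", "IDP", "MDK", "TDS", "TDA"]
rater_2 = ["MDK", "IDK", "TDA", "MDA", "IDK", "MDK", "TDS", "TDA"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")  # values near 1 indicate high agreement
```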

Before we report the results, the reader should keep in mind that this review has several limitations. As the review focused on student learning in STEM educational programs, the search was restricted to literature with an explicit focus on interdisciplinary STEM, not studies on STEM programs consisting of the individual disciplines of science, technology, engineering, or mathematics. In addition, the STEM education movement is still relatively young, only becoming prominent in international discourse in the last two decades (Blackley & Howell, 2015). Moreover, as we used G6-college as a screening criterion and excluded studies solely designed to develop assessments for STEM programs without describing the programs themselves, this review likely underestimates the work done in the field.

Results

After careful screening, a total of 49 articles were included in our library from an initial pool of 635 references (see Supplementary Materials for the list of reviewed articles). In the following, we first describe general information extracted from the reviewed studies; we then report results related to our research question.

General information about the studies

Figure 3 shows the distribution of the articles over the years. Not surprisingly, there is a steep increasing trend, which indicates that research in STEM education is still on the rise. Figure 4 shows the distribution of the articles across disciplines. Science was the dominant discipline (n = 47) and engineering was the second (n = 42). Most articles (n = 40) included both science and engineering. In terms of implementation context, there were more studies conducted in informal settings (n = 28) than in formal settings (n = 21). In terms of grade levels, 22 studies were at the middle grades (G6-8), 20 at the high school level (G9-12), and 12 at the tertiary level.

Fig. 3

Distribution of the articles over the years

Fig. 4

Distribution of the articles over the disciplines

Specific findings

We sought to answer the following research question: What is typically included in assessments of student learning in STEM education? In this section, we first summarize the assessment formats used in the studies and then report the results according to our coding framework (Fig. 2).

The assessment formats vary a great deal, ranging from traditional formats such as standardized tests to research-oriented observational methods (e.g., interviews, video analysis; see Table 1). The formats for monodisciplinary knowledge (MDK) and monodisciplinary affective (MDA) assessments are relatively uniform across studies, whereas those for interdisciplinary knowledge and practices (IDK, IDP) are more diverse. This is intuitive, as assessing content knowledge in individual disciplines has a long tradition and is relatively easy. In contrast, assessing interdisciplinary knowledge, practices, and affective domains is much more challenging and calls for innovation.

Table 1 Formats of assessments in STEM educational programs. (The numbers represent each of the 49 articles included in the study)

In terms of analysis or interpretation of assessment results, there are more quantitative studies (n = 38) than qualitative studies (n = 26). The quantitative studies used a variety of inferential statistics (e.g., Student’s t test, regression analysis, ANOVA) in addition to descriptive and correlational analyses. Psychometric analyses were also performed in many studies to examine the validity and reliability of the instruments (e.g., applying Rasch modeling to examine construct validity), but 10 studies did not include any discussion of the validity or reliability of their assessment methods.
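
As an illustration of the most common quantitative approach in the reviewed studies, the sketch below runs a paired (dependent-samples) t test on pre/post content-test scores. The scores are fabricated for demonstration, and the analysis is a generic example under our own assumptions rather than any particular study’s procedure.

```python
# A generic pre/post analysis sketch: paired t test on fabricated content-test scores.
import numpy as np
from scipy.stats import ttest_rel

pre_scores = np.array([12, 15, 9, 14, 11, 13, 10, 16])    # before the STEM unit
post_scores = np.array([15, 18, 11, 17, 12, 16, 13, 19])  # after the STEM unit

t_stat, p_value = ttest_rel(post_scores, pre_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p suggests a reliable pre-to-post gain
```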

Table 2 shows the distribution of the articles on the two dimensions: the nature of the disciplines and the learning objectives. In the following, we report results of selected cells in Table 2 with illuminating examples.

Table 2 Content involved in assessments of STEM educational programs. (The numbers represent each of the 49 articles included in the study)

Monodisciplinary knowledge

A total of 19 articles (38.8%) included assessments that focused on students’ knowledge in individual disciplines or their sub-disciplines within STEM. All of them used a pre/post design to examine students’ content learning gains. For example, Apedoe et al. (2008) described an 8-week, design-based unit that integrated science and engineering for high school students to improve their understanding of chemistry concepts (e.g., energy changes in reactions). Students completed a three-phase design of a heating/cooling system prototype: conceiving the prototype, developing the subsystems, and presenting the design. The assessment of the unit (i.e., the pre/post-test) focused on chemistry knowledge and consisted of 24 items selected from the American Chemical Society’s and the Chemical Concept Inventory test item pools. The results indicated that students’ understanding of the key chemistry concepts (e.g., reactions, atomic interactions, energy) improved significantly.

Note that 13 of the 19 studies actually included more than one discipline in their assessments. Guzey et al. (2017) presented an engineering design-based STEM program to improve elementary and middle school (G4-8) students’ learning of science, engineering, and mathematics knowledge. The pre/post-test, developed by a team of classroom teachers, school curriculum specialists, and researchers, focused on assessing students’ engineering, science, and mathematics knowledge. The test for the middle grades had 45 multiple-choice items (15 per discipline). The mathematics- and science-related items were chosen from publicly available large-scale assessments such as the Trends in International Mathematics and Science Study (TIMSS) and the National Assessment of Educational Progress (NAEP); the engineering items were developed by the authors. They found that student performance on the different science assessments was mixed when comparing the treatment students (whose teachers attended a corresponding professional development program) with the control students (whose teachers did not). Furthermore, no significant treatment effect was found for the engineering and mathematics assessments. One interesting finding was that the quality of the curriculum units and the type of engineering integration were associated with student learning gains.

Although all of the programs described in these 19 articles emphasized interdisciplinary connections in instruction, the assessments developed or adopted by these researchers focused on measuring students’ content gains within individual disciplines. Even when multiple disciplines were assessed simultaneously, they were assessed separately. Furthermore, the assessment items were typically adopted or adapted from existing standardized tests. Most of the instruments combined multiple-choice and open-ended items. Notably, the majority of the articles assessing multiple disciplines (9 out of 13) reported mixed results with respect to student learning gains across disciplines.

Interdisciplinary knowledge

In contrast, among the 49 articles selected, only 7 (14.3%) included assessments of students’ IDK. The methods of measuring IDK are more diverse (e.g., presentations, written work, and tests that included multiple-choice and open-ended items). For instance, Riskowski, Todd, Wee, Dark, and Harbor (2009) developed a test assessing students’ IDK in an interdisciplinary engineering module for an eighth grade science class. The module aimed at promoting students’ knowledge of an interdisciplinary topic, drinking water safety, by engaging them in designing, building, and testing a water purification device. To evaluate student learning in the module, the researchers developed a pre/post-test that consisted of five true/false questions (focusing on “factual knowledge of water quality” and taken directly from the textbook), five open-ended questions (focusing on the “human impact of the availability of safe drinking water”), and one design question (asking students to use drawings, words, or phrases to “describe and explain what was needed to ensure safe drinking water and how their purification design addressed the water quality issues they identified”). For the design question, the highest level of response required identifying three different types of contaminants in the water and using evidence to justify the design. The instrument as a whole did target students’ interdisciplinary understanding of drinking water safety and related issues. In the analysis, students’ total score was also decomposed into a science component score and an engineering component score. Unfortunately, it was not entirely clear how the researchers assigned the two component scores. Also, the authors described a rubric they developed to evaluate different aspects of students’ final presentations, but the rubric was not provided and this evaluation was not factored into the research analysis.

As a different example, Dierdorp et al. (2014) assessed students’ written work and analyzed their reflections in an interdisciplinary unit. The unit focused on building meaningful connections among natural science, mathematics, and statistics through professional practices. Six design cycles were adopted for students’ learning, and three types of professional practice were used as contexts: sports physiologist, dike height detection, and calibration of measuring instruments. Focusing on IDK development, the researchers analyzed students’ written work completed during the design process, as well as student questionnaires and interviews. Five possible connections (i.e., math-science, math-professional practices, statistics-science, professional practices-science, and professional practices-statistics) were explicitly assessed. Specifically, the written work was first coded by discipline (e.g., “formulas using” represented math). The questionnaire answers were coded as positive, negative, or inconclusive. The results indicated that integrating professional practices with science, mathematics, and statistics could help students build more connections among these disciplines.

Among the seven studies in the IDK cell, the researchers all emphasized assessing students’ interdisciplinary learning by identifying and evaluating the connections between disciplines that students built in the learning process. Compared with studies in other cells, more diverse formats of assessment were used, but the analysis and/or coding processes were much more cumbersome.

Transdisciplinary skill

A total of 11 articles (22.4%) reported assessments of students’ TDS. Ability tests, interviews, and informal observations were used in these studies. Five of these studies used pre/post-tests while the others used interviews or process assessments.

Chen and Lin (2019) reported a 2-year interdisciplinary STEM course in Taiwan that consisted of maker-centered, project-based learning (M-STEM-PjBL). The course aimed at enhancing rural middle school students’ creativity, problem solving, science learning, and attitudes toward science. The M-STEM-PjBL course had five phases: preparation, implementation, presentation, evaluation, and correction. The content of the first year consisted of designing GigoToys and writing S4A programs, and the second-year curriculum focused on creating green energy. The researchers assessed students’ creativity and collected students’ notes, written records, teachers’ notes, and project products as empirical data. Their analysis found that the M-STEM-PjBL curriculum increased students’ practical skills (such as doing hands-on activities) and creativity. However, although multiple types of qualitative data were collected, the authors did not provide sufficient information about the rubrics they used to evaluate these data.

Lou, Chou, Shih, and Chung (2017) presented a study in which they integrated STEM education with project-based learning (PBL) and investigated the effectiveness of this curriculum. The curriculum used a CaC2 steamship as its design theme and had five stages: preparation, implementation, presentation, evaluation, and correction. Students’ learning processes were evaluated through interviews, and their learning gains were examined with the “Creativity Tendency Scale”, adapted from the Williams Creativity Assessment Packet revised by Lin and Wang (1994). Analysis of the Creativity Tendency Scale responses showed that students improved in all four aspects: adventurousness, imagination, curiosity, and challenge. However, although using a TDS test may be appropriate, the study did not provide much information regarding the test itself. Nor did the study state how the different aspects of creativity were integrated into the curriculum activities.

The 11 articles in the TDS cell described assessments that focused on students’ skills beyond specific disciplines. Ability tests (e.g., critical thinking) were used in four of these studies. Interviews were also used in five studies to capture students’ perceptions of their own skill status and gains.

Interdisciplinary practices

A total of 7 articles (14.3%) included some form of assessment related to engineering design, such as designing and manufacturing a solar automatic trolley (Lou, Shih, Diez, & Tseng, 2011). Although design is a trademark of engineering practices, we categorized these studies as IDP because all of them focused on assessing students’ design knowledge and skills simultaneously while attending to how students connect knowledge and skills from multiple disciplines.

English, King, and Smeed (2017) presented a study in which Australian sixth grade students engaged in designing an earthquake-proof structure using simple materials. The students were first introduced to some basic science information (e.g., tectonics) and relevant engineering terms (cross bracing, tapered geometry, and base isolation). The students were then presented with the design challenge, for which they worked in small teams to design and construct a two-story structure resistant to simulated “earthquakes.” While completing the challenge, students responded to an activity booklet that included several guiding questions related to their design (e.g., “What did you learn about your new building from the test, including any mathematics and science that you used?”). Using multiple sources of data (e.g., student design sketches and annotations, responses to questions in the booklets, group work video transcripts), the researchers used open coding to analyze how students engaged in the design process, as well as how they applied their STEM disciplinary knowledge in the process. The results shed light on how students from different regions approached the design task differently and how their STEM disciplinary knowledge developed in the design process. The research delineated the complex design processes from multiple angles and perspectives. However, the assessment method (open coding in data analysis) does not seem readily applicable in a practical classroom setting.

As one can see in English et al. (2017), it is very time consuming to assess practices as students engage in them. Other studies focused on assessing design products instead. For instance, Fan and Yu (2017) developed a STEM engineering module and compared it with a traditional technology module for high school students in Taiwan. The STEM engineering module was designed to intentionally integrate science (e.g., mechanics) and mathematics (e.g., geometry) knowledge and processes with those of technology and engineering. In both the STEM engineering module and the technology module, students spent the last 6 weeks of the 10-week curriculum designing a mechanical toy. Besides other types of assessment, students’ final design products were evaluated to tap into their design practices. More specifically, the products were scored on three aspects: mechanical design, working function, and materials used (each on a five-point scale). The product analysis showed that the students using the STEM engineering module outperformed those using the traditional technology unit on all three aspects. Interestingly, in this study the quality of student products was not correlated with either students’ conceptual knowledge or their higher order thinking skills.

Compared with IDK, assessments in IDP focused more on processes and presented more diverse formats. Out of the seven studies, five assessed both the design processes and the products, one assessed products only, and one assessed processes only. One observation is that, while having an iterative cycle is a defining feature of engineering design, all of the product-based assessments only focused on the final design after the iteration process.

Monodisciplinary and transdisciplinary affective perspectives

The affective domain was the most popular assessment target in our pool of articles. A total of 16 articles (32.7%) included assessments of students’ MDA, and a total of 18 articles (36.7%) targeted the TDA domain.

The themes of MDA assessments include awareness, attitudes, beliefs, motivation, and interest toward specific disciplines in STEM (n = 11) and perceptions of STEM-related careers (n = 7). For instance, Apedoe et al. (2008) showed that their design unit not only increased students’ MDK in chemistry (see the “Monodisciplinary knowledge” section for their program description) but also increased their interest in engineering careers. Students completed a five-point Likert scale survey to self-report the extent of their awareness of and interest in engineering careers. The scale was divided into four dimensions: desire to attend more engineering classes, awareness of engineering as a career, desire to be an engineer, and desire to attend informal engineering activities. Results indicated that students’ interest and awareness in engineering careers significantly increased after the unit. One intriguing point was that the researchers assessed students’ content knowledge in science (chemistry in particular) while assessing their interest in engineering careers.

The themes of TDA assessments involve interest in STEM, self-efficacy and willingness to major in STEM-related disciplines in college, or attitudes toward STEM careers without specifying particular disciplines. For example, Musavi et al. (2018) reported the Stormwater Management and Research Team (SMART) program, which included a weeklong summer institute and an academic-year student research program for high school students. The program activities were designed to help students increase interest and confidence, improve STEM knowledge and skills, and apply them to researching and solving water quality problems in their local communities. Four surveys were used to assess students’ attainment through the program. The post-institute and midterm surveys were designed to assess the students’ engagement in engineering and science practices (e.g., “Working on global problems, such as stormwater, in my local community makes me want to pursue a career in STEM”). The specific items were not presented in the article. Results revealed that students’ confidence to succeed in STEM was strengthened by the combination of mentorship and a global problem situated in the local community.

One article covered both MDA and TDA. Nugent et al. (2015) presented a study examining a multitude of factors related to middle school youth’s STEM learning and career orientation. The students were drawn from those who participated in robotics camps across the country as part of a larger STEM education project. Students attended 40 h (1 week) of hands-on activities that focused on using the LEGO Mindstorms NXT robotics platform to build and program robots. A number of variables, including STEM interest (MDA), youth career orientation (TDA), and STEM self-efficacy (TDA), were assessed using Likert scales. The STEM interest questions focused on students’ perceived value and usefulness of STEM (e.g., “It is important for me to learn how to conduct a scientific investigation”). Career orientation was separated into two dimensions: interest in STEM-related jobs and willingness to take secondary-level science and mathematics lessons. The self-efficacy items focused on students’ confidence in robotics tasks (e.g., “I am confident that I can record data accurately”). Results revealed a positive relationship between youth STEM self-efficacy and career orientation, both of which were influenced by STEM interest. While the study focused on testing a model of social, cognitive, and affective factors contributing to youth’s STEM learning, the corresponding assessments were constrained to capturing a cross-sectional snapshot of students’ self-reported learning and affective factors.

Compared with MDA assessments, where Likert scales were used in all 16 studies, the formats of TDA assessments were much more diverse, including open-ended questions (n = 3), Likert scales (n = 7), structured or semi-structured interviews (n = 5), observations of participation (n = 4), and willingness to major in STEM-related curricula (n = 3).

Discussion

Our review shows that most STEM assessments focused on MDK, MDA, and TDA, and much less work targeted the other aspects (Fig. 5). Although most of the STEM programs aimed at improving students’ interdisciplinary understanding or skills, their assessments barely addressed this goal. In the following, we discuss several challenges in assessing STEM education as a way to steer the research community toward directions that may need more attention.

Fig. 5

Distribution of the articles over the categories

Challenges in assessing interdisciplinary learning

As one can see in Fig. 5, interdisciplinary assessments were least used when compared to monodisciplinary and transdisciplinary assessments. The challenge of assessing interdisciplinary learning has been well-documented (e.g., Herro et al., 2017; NAE & NRC, 2014).

The first issue is that interdisciplinarity in STEM education has been taken for granted. In reality, it is neither explicitly theorized nor well articulated. STEM integration is not simply putting disciplines together as a conglomerate—it needs to be “intentional” and “specific” in considering the connections across disciplines in the curriculum (Bryan et al., 2015; Vasquez et al., 2013). As Guzey et al. (2017) concluded in their study, “simply adding engineering into science instruction is not necessarily supportive of better student learning—teaching high-quality curriculum units that purposefully and meaningfully connect science concepts and the practices of those of engineering is essential to produce positive student outcomes” (p. 219).

The second issue is the inconsistency among curriculum, instruction, and assessment observed in these programs. As we can see in the literature, even though all the programs were categorized as STEM programs, the trace of interdisciplinary integration was often implicit. All articles reviewed in this study emphasized the interdisciplinary nature of their educational programs, but very few described in detail the strategies they used to integrate and connect the disciplines.

Once the connections across disciplines are made explicit in curriculum and instruction, these connections ideally need to be assessed in order to capture students’ interdisciplinary learning. As Moore et al. (2014) noted, even when interdisciplinary connections are emphasized in a curriculum, there is no guarantee that students will identify them or make the connections on their own.

Assessing the connections across disciplines in student understanding also helps determine whether an interdisciplinary STEM education program achieves what its curriculum intends. While most of the reviewed programs stated that their curricula were oriented toward improving students’ understanding of crosscutting concepts or interdisciplinary skills, few of them actually assessed these outcomes. As a result, the impacts of these programs were not properly evaluated (and most likely underestimated), which limited the usefulness of the feedback drawn from the assessment results for further improving the programs.

Challenges in assessing STEM learning processes/practices

Our review shows that although most of the articles (n = 38) developed learning activities based on engineering design, only a few (n = 8) assessed engineering design practices. This reflects the broader challenge we all face in assessing learning processes, similar to other areas (e.g., Ashcroft & Palacio, 1996; Lederman et al., 2014).

Taking science as an example, the incorporation of practices in science education is a welcome advancement internationally (e.g., McNeill et al., 2018; NGSS Lead States, 2013). Research on how to assess these complex constructs framed under three-dimensional science learning is starting to gain momentum. For instance, NRC (2014) suggested that assessment tasks be designed to provide evidence of students’ ability to use the practices, to apply their understanding of the crosscutting concepts, and to draw on their understanding of specific disciplinary ideas, all in the context of addressing specific problems.

We would like to highlight two important aspects pertaining to assessing practices in STEM education. First, among the reported assessments of engineering practices, assessing final products was the most popular approach. However, assessing final products misses the critical process of iterative refinement in engineering design. This also applies to other STEM practices such as modeling (see, e.g., Namdar & Shen, 2015). The iterative nature of engineering design is particularly powerful for students, since it prompts them to test and revise a solution to create the best possible outcome, encouraging idea improvement in generative learning (Crismond & Adams, 2012; Hamilton et al., 2008).

Second, there are many other learning practices key to STEM education that need to be identified and assessed. Given the diversity and heterogeneity of interdisciplinary STEM educational programs, we believe that a framework that incorporates practices (similar to that of NGSS) and a systematic set of assessment methods need to be developed. Besides engineering design, some key processes in STEM education for consideration may include problem solving (Prinsley & Baranyai, 2015), interdisciplinary reasoning and communication processes (Shen, Sung, & Zhang, 2015), and collaboration (Herro et al., 2017).

Developing practical tools and guidelines for classroom use

Formatively, well-crafted assessments provide both the teacher and the students high-quality information about the effects of instruction and ways to improve. Practical tools and guidelines for assessing student learning in STEM education are still urgently needed in classroom settings.

Among the literature we reviewed, many of the reported assessments are research-oriented. Qualitative analyses such as video analysis and open coding are time- and resource-consuming. Therefore, these assessments typically function as research tools to advance our understanding rather than as practical tools for classroom instruction. Many teachers do not have the time or resources to carry out these add-on assessments and therefore will not be able to use them to inform and assist student learning.

Moreover, practical tools are more conducive to longitudinal research. Like many other types of educational programs, interdisciplinary STEM education should aim for long-term impact (NAE & NRC, 2014). Among the 49 articles, only five longitudinal studies traced the progress of students’ STEM learning and careers. Though short-term gains reflect important aspects of the effectiveness of an interdisciplinary curriculum, how students retain learning outcomes (especially those pertaining to interdisciplinary learning) over time is largely unknown. Developing more practical tools could help make long-term research more feasible.

Practitioners need clear guidelines and training to select and coordinate assessment formats for STEM education for different purposes. As can be seen in Table 1, a variety of formats were adopted in the reviewed works. Multiple formats and purposes for assessment should be balanced to meet student needs and provide a basis for curriculum adjustments in specific settings. In practice, it is essential for teachers to align assessment and instruction for each STEM lesson (Solomon, 2003). How to choose appropriate assessment formats while keeping the assessment task feasible in different classroom settings requires further research.

Conclusion

In our study, we reviewed assessments in interdisciplinary STEM educational programs from 2000 to 2019 and provided a two-dimensional framework for assessing student learning in STEM education. The findings suggest that most assessments focused on monodisciplinary learning and the transdisciplinary affective domain. Few assessments paid attention to interdisciplinary learning and practices.

Research in assessing interdisciplinary STEM learning has made important strides. Yet, there is still a long way to go. Based on this review, we offer the following recommendations to help the community of researchers and practitioners working in this area better calibrate our collective work:

  • The nature of the involved disciplines and the mechanisms by which they are connected need to be made explicit in interdisciplinary STEM curriculum and instruction. More importantly, the connections across disciplines need to be operationalized and assessed to provide targeted feedback to students.

  • STEM learning processes and practices are complex and manifold, especially when one considers the many different features of disciplinary processes and practices. These core learning processes and practices need to be clearly delineated in learning objectives, and assessments need to be built around these objectives to capture the complex nature of interdisciplinary STEM learning.

  • Developing practical assessment tools and guidelines for classroom use should be prioritized. While STEM education has penetrated many classrooms, most teachers have not received proper training on how to assess student learning in STEM. While our two-dimensional framework provides a theoretical starting point, building a network or repertoire of resources for practitioners would be a pragmatic step moving forward.