An explanatory item response theory method for alleviating the cold-start problem in adaptive learning environments
Abstract
Electronic learning systems have received increasing attention because they are easily accessible to many students and are capable of personalizing the learning environment in response to students’ learning needs. To that end, fast and flexible algorithms that keep track of students’ ability change in real time are desirable. Recently, the Elo rating system (ERS) has been applied and studied in both research and practical settings (Brinkhuis & Maris, 2009; Klinkenberg, Straatemeier, & van der Maas in Computers & Education, 57, 1813–1824, 2011). However, such adaptive algorithms face the cold-start problem: the system does not know a new student’s ability level at the beginning of the learning stage. The cold-start problem may also occur when a student leaves the e-learning system for a while and returns (i.e., a between-session period). Because external effects could influence the student’s ability level during that period, there is again much uncertainty about the ability level. To address these practical concerns, in this study we propose alternative approaches to cold-start issues in the context of the e-learning environment. In particular, we propose making the ERS more efficient by using explanatory item response theory modeling to estimate students’ ability levels on the basis of their background information and past learning trajectories. A simulation study was conducted under various conditions, and the results showed that the proposed approach substantially reduces ability estimation errors. We illustrate the approach using real data from a popular learning platform.
Keywords
E-learning system · Cold-start problem · Elo rating system · Explanatory IRT · Between-session effect

There has been an increasing trend toward implementing electronic learning (e-learning) environments for higher education as well as for K–12 education, because advanced technologies can offer substantial advantages in assisting students’ learning. One important advantage of technology-based learning environments is accessibility: students can access the learning environment at their own pace, anytime and anyplace. In addition, e-learning environments can be more effective and efficient than traditional classroom learning, because they are capable of personalizing students’ learning opportunities by using an adaptive system (Brusilovsky & Peylo, 2003). Unlike static learning environments, in which the same contents and information are given to every student, the adaptive systems in e-learning environments can take students’ individual characteristics into account. For example, students’ learning preferences (e.g., visual, auditory, or kinesthetic), background information (e.g., gender, age, and socioeconomic status), and knowledge level (e.g., previous courses taken and education level) can be used as information that the adaptive system incorporates so as to optimize the learning conditions. The importance of accounting for individual characteristics in e-learning environments has consistently been emphasized in previous studies (e.g., Kalyuga & Sweller, 2005; Shute & Towle, 2003; Snow, 1989, 1996).
One of the systems used in adaptive e-learning environments is the Elo rating system (ERS; Elo, 1978). The ERS can be used to produce adaptive item sequencing, in which items are selected in real time on the basis of the current estimate of the student’s ability or knowledge level (Wauters, Desmet, & Van den Noortgate, 2010). More specifically, when a student responds to an item, the ERS algorithm computes updated estimates of the student’s ability level and the item difficulty, based on the correctness of the current response. Then the next item is provided, such that the item difficulty level matches the student’s current ability level. The ERS can not only be used to estimate a student’s constant ability level, but is also applicable to tracking a changing ability level in learning environments where the student’s ability level is expected to improve, because of the instant feedback that is provided on each response (Wauters et al., 2010). Note that although the ERS can be used to gradually obtain reliable estimates of both a student’s abilities and the item difficulties, adaptive item sequencing would be more efficient if we could start from a precalibrated item bank, including information on item difficulty and possibly on other characteristics of the items. In previous research, the development and application of the ERS have been widely explored (e.g., Brinkhuis & Maris, 2009; Coomans, Hofman, Brinkhuis, van der Maas, & Maris, 2016; Klinkenberg, Straatemeier, & van der Maas, 2011; Maris & van der Maas, 2012; Savi, van der Maas, & Maris, 2015).
When a new student enters the e-learning system, the ERS algorithm requires an initial value for the student’s ability. Initial values may be drawn randomly, or the mean of previous students’ ability levels can be used (Wauters et al., 2010). However, an ambiguous initial value could lead to inaccurate ability estimation, resulting in a higher number of responses being required to obtain accurate ability estimates, or in higher standard errors of the ability estimates for a given number of responses (Wauters et al., 2010). As a consequence, it may take longer before the environment is optimally adapted to the student. This problematic situation is referred to as the cold-start problem.
In general, the cold-start problem occurs when a student starts working in the learning environment. However, the cold-start problem may also occur if a student decides to leave the e-learning system for a while before starting a new session. Between-session periods are prevalent in the e-learning environment because students have flexibility in when to access this environment. There may be ability change in relation to various experiences during between-session periods. For example, the student might take extra training through printed learning materials or additional online learning platforms. In such cases, the student’s ability tends to increase. Alternatively, the student might not have been involved in any type of constructive learning, and might have partially forgotten the relevant content. In this case, his or her ability tends to decrease. For that reason, assuming that the ability estimated at the end of the previous session equals the ability level at the beginning of the next session, thereby ignoring potential ability change between study sessions, may again lead to inefficient ability estimates. The longer the between-session period, therefore, the higher the uncertainty about the student’s ability at the start of the new session, and the more we can consider the start of a new session a cold start.
One possible solution to both cold-start problems would be to incorporate additional information about the student. More specifically, a student’s background information may give a better idea of the initial ability level and of how the ability level may have changed between sessions. However, the current ERS uses the Rasch model, which does not allow the flexibility to utilize the student’s information, because the model includes only two parameters (i.e., item difficulty and latent ability). Alternatively, explanatory item response theory (explanatory IRT; De Boeck & Wilson, 2004) analyses can be used to explore the effects of background variables on initial abilities and on the evolution in ability. Explanatory IRT models have been investigated for educational data in which student and item effects on the probability of a correct response have been considered random (Van den Noortgate, De Boeck, & Meulders, 2003), allowing the inclusion of student and item characteristics as predictors. Furthermore, data from an e-learning environment were analyzed with an explanatory IRT model in another study (Kadengye, Ceulemans, & Van den Noortgate, 2014). However, the explanatory IRT model has not yet been used in combination with the ERS algorithm. Therefore, it is unknown to what extent the cold-start problems may be resolved by combining explanatory IRT with the ERS. Thus, in the present study we aimed to investigate the cold-start problem in the e-learning system, both when new students come into the learning environment and when students come back to the environment after a between-session period. We further aimed to describe and empirically evaluate one possible approach to addressing the cold-start problem, by combining explanatory IRT and the ERS.
Below, we first give more details about the ERS and its proposed improvement by means of explanatory IRT. Next, we describe a simulation study evaluating the performance of this combined approach. We then demonstrate its applicability to real-life data collected from an e-learning platform.
The Elo rating system
Note that in Eq. 1, it is necessary to specify a starting value for the ability, \( {\widehat{\theta}}_{p(0)} \), in order to initiate the ERS algorithm. In principle, zero (or a randomly drawn value from, e.g., a standard normal distribution) can be used, because the mean ability is often assumed to be zero in order to make the Rasch model identified. However, to avoid the cold-start problem, several alternative methods to specify the initial value for an adaptive-learning system have been suggested (e.g., Bobadilla, Ortega, Hernando, & Bernal, 2012; Pereira & Hruschka, 2015; Tang & McCalla, 2004). Wauters et al. (2010) suggested an individual-specific approach for an adaptive e-learning environment that uses explanatory IRT to obtain more specific starting values from students’ background information.
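To make the update step concrete, the Elo rating update described above can be sketched as follows. This is an illustrative Python sketch, not the software used in the study (which was written in R); the function name and the fixed step size K = 0.4 are our own choices.

```python
import math

def elo_update(theta, beta, correct, k=0.4):
    """One Elo rating step: update the ability estimate (theta) and the
    item difficulty estimate (beta) after a scored response (1 or 0)."""
    # Expected score under the Rasch model
    expected = 1.0 / (1.0 + math.exp(-(theta - beta)))
    # Ability moves up after an unexpectedly correct answer and down
    # otherwise; the item difficulty moves in the opposite direction.
    theta_new = theta + k * (correct - expected)
    beta_new = beta - k * (correct - expected)
    return theta_new, beta_new
```

Starting from theta = 0 corresponds to the cold start discussed above; the approach proposed in this paper instead supplies a model-based starting value.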
Explanatory IRT
Structure of a data set
| Student | Session | Item | Score | btime | wtime | Time |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 1 | 0 | 0.1 | 0.1 |
| 1 | 1 | 2 | 0 | 0 | 0.2 | 0.2 |
| 1 | 1 | 3 | 1 | 0 | 0.3 | 0.3 |
| 1 | 1 | ⋮ | ⋮ | 0 | ⋮ | ⋮ |
| 1 | 1 | n_{11} | 0 | 0 | t_{1} | t_{1} |
| 1 | 2 | n_{11} + 1 | 0 | 24 | 0.1 | t_{1} + 24 + 0.1 |
| 1 | 2 | n_{11} + 2 | 0 | 24 | 0.2 | t_{1} + 24 + 0.2 |
| 1 | 2 | n_{11} + 3 | 1 | 24 | 0.3 | t_{1} + 24 + 0.3 |
| 1 | 2 | ⋮ | ⋮ | 24 | ⋮ | ⋮ |
| 1 | 2 | n_{11} + n_{12} | 1 | 24 | t_{2} | t_{1} + 24 + t_{2} |
| 1 | 3 | n_{11} + n_{12} + 1 | 1 | 48 | 0.1 | t_{1} + t_{2} + 48 + 0.1 |
| 1 | 3 | n_{11} + n_{12} + 2 | 0 | 48 | 0.2 | t_{1} + t_{2} + 48 + 0.2 |
| 1 | 3 | n_{11} + n_{12} + 3 | 1 | 48 | 0.3 | t_{1} + t_{2} + 48 + 0.3 |
| 1 | 3 | ⋮ | ⋮ | 48 | ⋮ | ⋮ |
| 1 | 3 | n_{11} + n_{12} + n_{13} | 1 | 48 | t_{3} | t_{1} + t_{2} + 48 + t_{3} |
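The bookkeeping behind the btime, wtime, and Time columns shown above can be sketched in a few lines. This illustrative Python helper reproduces the example rows (one item every 0.1 h, a 24-h gap before each new session); the function and field names are our own, not part of the paper's software.

```python
def long_format(session_lengths, per_item_hours=0.1, gap_hours=24.0):
    """Build the long-format rows: one row per item response, with
    between-session time (btime), within-session time (wtime), and
    total elapsed time (time)."""
    rows = []
    elapsed_study = 0.0  # cumulative within-session time over past sessions
    for s, n_items in enumerate(session_lengths, start=1):
        btime = gap_hours * (s - 1)       # time spent outside the system
        for i in range(1, n_items + 1):
            wtime = per_item_hours * i    # time within the current session
            rows.append({"session": s,
                         "btime": btime,
                         "wtime": wtime,
                         "time": elapsed_study + btime + wtime})
        elapsed_study += per_item_hours * n_items
    return rows
```

For instance, the first item of Session 2 gets btime = 24 and time = t1 + 24 + 0.1, matching the table.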
Model estimation
The parameters of the explanatory IRT model (including the random effects w_{0p}, w_{1p}, and w_{2p}) shown in Eq. 4 can then be estimated within the Bayesian framework (e.g., Dai & Mislevy, 2009; Frederickx, Tuerlinckx, De Boeck, & Magis, 2010; Kadengye et al., 2014). Bayesian inference draws conclusions about parameters in the form of a posterior distribution that combines the likelihood of the data with prior knowledge about the parameters. Let L(Y, Ω) denote the likelihood function of the set of all parameters in Eq. 4, say Ω, and let P(Ω) denote the prior distribution of those parameters. The posterior distribution is proportional to the product of the two components, P(Ω | Y) ∝ L(Y, Ω)P(Ω). Because obtaining an analytical solution for P(Ω | Y) may not be feasible or may be extremely intensive to compute, a Markov chain Monte Carlo (MCMC) algorithm is typically employed.
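The proportionality P(Ω | Y) ∝ L(Y, Ω)P(Ω) is exactly what an MCMC sampler exploits: it only ever evaluates the un-normalized posterior. As a toy illustration (not the JAGS model used in the paper), here is a random-walk Metropolis sampler for a single Rasch ability with known item difficulties and a vague normal prior; all names and tuning values are our own.

```python
import math
import random

def log_post(theta, responses, difficulties, prior_sd=10.0):
    """log P(theta | Y) up to a constant: Rasch log-likelihood plus a
    vague normal log-prior (sd = 10 here, for illustration)."""
    lp = -0.5 * (theta / prior_sd) ** 2
    for y, b in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        lp += math.log(p) if y == 1 else math.log(1.0 - p)
    return lp

def metropolis(responses, difficulties, n_iter=5000, step=0.5, seed=1):
    """Random-walk Metropolis sampler for one ability parameter."""
    random.seed(seed)
    theta, samples = 0.0, []
    for _ in range(n_iter):
        prop = theta + random.gauss(0.0, step)
        lp_new = log_post(prop, responses, difficulties)
        lp_old = log_post(theta, responses, difficulties)
        # Accept with probability min(1, posterior ratio)
        if math.log(random.random()) < lp_new - lp_old:
            theta = prop
        samples.append(theta)
    return samples
```

The posterior mean is then approximated by averaging the samples after a burn-in period, as in the analyses reported below.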
Initial ability of new students
As a first way to enhance the traditional ERS, an explanatory IRT analysis can be used to provide more accurate starting values when the ERS begins updating the ability estimates of new students. To estimate the initial ability of the new students, we fit the model to data that have already been collected from the previous students in the environment. Given the new students’ background information, this allows us to obtain an estimated initial ability, \( \left({\widehat{\alpha}}_{00}+{\sum}_{j=1}^J{\widehat{\alpha}}_{0j}{Z}_{pj}\right) \). If there is no prior knowledge about α_{00} and α_{0j}, it is typical to choose a noninformative or vague prior distribution for the Bayesian inference (Gelman et al., 2014). Therefore, in this study a normal distribution with an extremely large variance (i.e., small precision) was chosen as the prior distribution for the initial-ability parameters: α_{00}, α_{0j}~N(0, 10^{6}).
Ability change between study sessions
As a second means to enhance the performance of the traditional ERS, the explanatory IRT analysis can give information about ability change between sessions (i.e., the effect of btime_{p(t)} in Eq. 4). The information assists in choosing more accurate starting values when the traditional ERS has restarted updating the ability for the next session. Similar to the estimation of the initial ability (i.e., \( {\widehat{\alpha}}_{00}+{\sum}_{j=1}^J{\widehat{\alpha}}_{0j}{Z}_{pj} \)), the estimate of a slope parameter \( \left({\widehat{\alpha}}_{20}+{\sum}_{j=1}^J{\widehat{\alpha}}_{2j}{Z}_{pj}\right) \) for btime_{p(t)} in Eq. 4 provides the ability change that can be expected for a student with these characteristics. Adding this estimated change to the ability estimate at the end of the previous session gives the starting value for resuming the ERS at the beginning of the next session. If for a given student we already have observed data for at least two sessions, the explanatory IRT model can also estimate w_{2p} from Eq. 4 for the particular student, on top of the expected ability change corresponding to the specific student characteristics.
In this case, the prior distributions P(α_{2j}) and P(w_{2p}) for the analysis based on data up to the current session can be informed by the posterior distribution based on the data up to the previous session. Therefore, we assign the prior for the current session such that \( {\alpha}_{2j}\sim N\left({\mu}_{\alpha 2},{\sigma}_{\alpha 2}^2\right) \), where μ_{α2} and \( {\sigma}_{\alpha 2}^2 \) are informed by the posterior mean and variance based on the previous session. Similarly, the prior for w_{2p} is also updated, along with that for α_{2j}, across the continuing analyses after each session. That is, \( {w}_{2p}\sim N\left({\mu}_{w2},{\sigma}_{w2}^2\right) \), where μ_{w2} and \( {\sigma}_{w2}^2 \) are informed by the posterior mean and variance based on the previous session. Using a normal prior density with an informative mean and variance for w_{2p} is advantageous for obtaining its posterior mean, \( {\widehat{w}}_{2p} \), especially when students have not yet gone through many sessions. With only a few observations, the posterior inference about the ability change w_{2p} is shifted away from the individual-specific estimate toward the population-averaged estimate; with more observations after more sessions, the inference relies more on the individual-specific change. This characteristic enables the analysis to adopt an average student’s slope estimate plus the student-specific deviation, which approximates a more “individualized” ability change between sessions. Note that it is natural to assign noninformative priors for the very first session.
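Putting the two enhancements together, the starting value for a resumed session is the previous end-of-session estimate plus the model-implied between-session change. A schematic Python sketch of that restart rule follows; the function and argument names are ours, and the slope-times-btime form is our reading of Eq. 4 (in practice the slope estimates come from the Bayesian fit described above).

```python
def warm_start(theta_prev_end, alpha20, alpha2, z, btime, w2p=0.0):
    """Starting value for resuming the ERS after a between-session gap.

    theta_prev_end : ability estimate at the end of the previous session
    alpha20, alpha2: estimated fixed between-session slope terms
    z              : the student's background covariates (Z_p1, ..., Z_pJ)
    btime          : length of the between-session period
    w2p            : student-specific slope deviation (kept at 0 until at
                     least two sessions have been observed for the student)
    """
    # Expected ability change per unit of between-session time for a
    # student with these characteristics, plus the individual deviation
    slope = alpha20 + sum(a * zj for a, zj in zip(alpha2, z)) + w2p
    return theta_prev_end + slope * btime
```

With no gap (btime = 0), the rule simply reduces to resuming from the previous session's final estimate.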
Simulation study
To provide evidence of the performance of the proposed method, a simulation study was conducted. We produced simulated data using a data generation process that mimics various aspects of an online learning environment. In accordance with our research questions, the full simulation study consisted of two parts. In Study 1, we examined the effects of addressing the cold-start problem for new students. Here we assumed that students’ true ability evolves while they are engaged in the learning environment, but that there are no ability changes while they are outside the environment (i.e., between sessions). In Study 2, we focused on exploring the effects of addressing ability changes between sessions. Therefore, ability growth both within and between sessions was simulated.
For each condition of the two studies, we simulated data for 300 students who engaged in a total of four sessions and solved 40 items per session. In particular, the simulated data sets contained the students’ responses to a total of 160 items, which were randomly drawn from an item bank of 1,000 items. Each data set was then divided into two subsets: (a) a “training set” (n = 250 students) was used to fit the explanatory IRT model, in order to obtain the starting values for the ERS, and (b) a “validation set” (n = 50 students) was used to execute the ERS by incorporating the information from the training set. Individuals from the validation set were considered new students who had just entered the environment. For those students, we allowed the ERS to estimate their ability growth trajectory within sessions. Note that evaluating an ERS with the support of explanatory IRT modeling was the primary purpose of the present study, and thus we compared the performance of the proposed ERS with a traditional ERS in which no student information was taken into account. In both subsets, we assumed that the data were collected from two groups (i.e., non-dyslexic and dyslexic groups) with equal sample sizes (i.e., a balanced design).
Study 1: No true ability change between sessions
Study design
Z_{p1} and Z_{p2} are binary indicators for two groups, say a non-dyslexic and a dyslexic group, respectively (Z_{p1} = 1 if the student is from the non-dyslexic group, and 0 otherwise; Z_{p2} = 1 if the student is from the dyslexic group, and 0 otherwise). The model does not include an intercept, so α_{01} and α_{02} refer to the expected initial abilities in these two groups. The initial-ability parameter is affected not only by the group membership, but also by individual-specific factors (w_{0p}, or the individual deviation from the group mean). Similarly, the learning trend varies over persons around a group-specific mean. We assumed that the non-dyslexic group would typically show a greater average initial ability level (α_{01}) and learning growth (α_{11}) than the dyslexic group, since many educators and learning platforms are concerned with the lower learning capability of those who experience dyslexia (e.g., Humphrey & Mullins, 2002; Polychroni, Koukoura, & Anagnostou, 2006; Vukovic, Lesaux, & Siegel, 2010).
Initial ability
With regard to the true fixed and random effects for the initial ability levels, two scenarios were considered. In the first scenario, a relatively small difference between the two group means and a relatively small random-effect standard deviation were used: the fixed effects were set at α_{01} = 0.1 and α_{02} = −2.5, and the standard deviation of the random effects was set at σ_{w0} = 0.5, where w_{0p}~N(0, \( {\sigma}_{w0}^2 \)). In the second scenario, the fixed effects were set at α_{01} = 0.5 and α_{02} = −2.5, and the standard deviation of the random effects was set at σ_{w0} = 1.5.
Ability growth
In both scenarios, the within-session time (i.e., wtime_{p(t)}) was simulated to increase by 0.05 h for each item. The fixed time effects were set at α_{11} = 0.18 and α_{12} = 0.05, and the standard deviation of the random effects was set at σ_{w1} = 0.0008, where w_{1p}~N(0, \( {\sigma}_{w1}^2 \)).
Item difficulty
The true item difficulties β_{p(t)}, for measurement occasion t of student p, were drawn randomly from N(0, 2). Given those parameter values, each student’s item response (correct/incorrect) at each measurement occasion t was randomly generated from a binomial distribution with the probability in Eq. 6.
Starting value
Implementation
For the estimation procedure with the training set, the MCMC algorithm was implemented in R 3.3.3 (R Core Team, 2013). More specifically, JAGS (Plummer, 2015) was run through the R package R2jags (Su & Yajima, 2015), which provides wrapper functions for the Bayesian analysis program. For each analysis with JAGS, four chains were run for 10,000 iterations each. We used a thinning parameter of four and discarded the first half of each chain as burn-in. The resulting iterations from each chain were pooled and randomly mixed after burn-in to be used as samples from the posterior distribution of the parameters. The use of multiple chains and thinning served to reduce the dependencies among iterations and to ensure adequate convergence to, and mixing of, the posterior distribution. Gelman and Rubin’s (1995) statistic was used as a convergence diagnostic. The final estimates of the model parameters were obtained by taking the mean of the posterior samples after the burn-in period.
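The post-processing just described (burn-in, thinning, pooling, and the Gelman–Rubin diagnostic) can be sketched in plain Python; this is an illustration of the steps, not the R code actually used.

```python
def pool_chains(chains, burnin_frac=0.5, thin=4):
    """Discard burn-in, thin, and pool MCMC chains (lists of floats)."""
    pooled = []
    for chain in chains:
        kept = chain[int(len(chain) * burnin_frac):]  # drop first half
        pooled.extend(kept[::thin])                   # keep every 4th draw
    return pooled

def gelman_rubin(chains):
    """Gelman-Rubin potential scale reduction factor (R-hat)."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within-chain
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5
```

Values of R-hat close to 1 indicate that the chains have converged to a common distribution.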
Results
Parameter estimates of the explanatory IRT models (n = 250 training set)
| Parameter | σ_{w0} = 0.5: True | Est. | SE | σ_{w0} = 1.5: True | Est. | SE |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept: Non-dyslexia | 0.1 | 0.135 | 0.093 | 1.5 | 1.175 | 0.123 |
| Intercept: Dyslexia | −2.5 | −2.035 | 0.166 | −2.5 | −2.159 | 0.182 |
Each plot in the figure displays the performance when using the group-specific starting values as compared with the cold-start values. Specifically, using the group-specific values shows that the explanatory IRT approach decreases the systematic bias in the initial ability estimates (as can be seen in the right panels). As a result of this decreased bias, the initial MSE is also smaller for the explanatory IRT approach (as can be seen in the left panels). However, because students’ abilities may still deviate from the ability expected on the basis of their background characteristics, there remains variation in the ability estimates around the overall value, adding to the MSE.
When the group difference in initial ability is small (upper left panel of Fig. 1), the MSEs begin at approximately 2.4 and 0.5 for the cold start and the group-specific start, respectively. When the group difference is large (lower left panel of Fig. 1), the MSEs begin at approximately 5.2 and 1.8, respectively. The MSEs in the second scenario are thus generally greater than those in the first scenario, because the simulated data variance was larger. In both scenarios, the MSE tends to decrease gradually as the number of items answered increases. The ERS with the group-specific start already begins with a relatively small MSE, which means that the ability estimation is accurate even after a small number of items. In contrast, the ERS with the cold start produces considerably higher MSEs than the group-specific approach in the first session. In each panel, the performance of the ERS with the two starting values coincides after about 20 items. These results suggest that using the group-specific start can enhance ability estimation without requiring students to answer an excessive number of items.
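For clarity, the MSE curves plotted in these figures are obtained by averaging squared estimation errors over students at each item position. A minimal sketch follows; the nested-list layout (one row of per-item abilities per student) is our own assumption about how the trajectories are stored.

```python
def mse_curve(true_thetas, est_thetas):
    """MSE of the ability estimates at each item position, averaged over
    students: true_thetas[p][t] and est_thetas[p][t] hold student p's
    true and ERS-estimated ability at item position t."""
    n_students = len(true_thetas)
    n_items = len(true_thetas[0])
    return [sum((true_thetas[p][t] - est_thetas[p][t]) ** 2
                for p in range(n_students)) / n_students
            for t in range(n_items)]
```

A decreasing curve means the ERS estimates converge toward the true abilities as more items are answered.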
Study 2: When there is true ability change between sessions
Study design
Between-session change

Four data generation scenarios were considered:

- Scenario A: between-session ability change, small student variation (σ_{w0} = 0.5, σ_{w2} = 0.2), and 40 items per session
- Scenario B: between-session ability change, large student variation (σ_{w0} = 1.5, σ_{w2} = 1.5), and 40 items per session
- Scenario C: between-session ability change, small student variation (σ_{w0} = 0.5, σ_{w2} = 0.2), and 80 items per session
- Scenario D: between-session ability change, large student variation (σ_{w0} = 1.5, σ_{w2} = 1.5), and 80 items per session
The between-session time (i.e., btime_{p(t)}) was generated from a Poisson distribution with a mean and variance of 1 (per day).
Withinsession change
In all of the data generation scenarios, the true fixed and random effects for the within-session period were set as follows: α_{11} = .18, α_{12} = .05, and σ_{w1} = .0008. The within-session time (i.e., wtime_{p(t)}) was generated on the basis of approximately 0.05 h for solving each item in the session.
Item difficulty
As in Study 1, β_{p(t)} was randomly drawn from N(0, 2), and the item responses were randomly generated from a binomial distribution with the probability given in Eq. 9.
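The data generation just described can be sketched end to end: a per-student intercept and slopes (fixed effects plus random effects), within-session growth over study time, Poisson-distributed between-session gaps in days, and N(0, 2) item difficulties. This illustrative Python version makes two assumptions of our own: study time accumulates across sessions, and N(0, 2) is read as a standard deviation of 2.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's method for a Poisson draw (used for between-session gaps)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_student(alpha0, alpha1, alpha2, n_sessions=4,
                     items_per_session=40, item_hours=0.05, seed=0):
    """One student's simulated responses and true ability trajectory.

    alpha0, alpha1, alpha2: the student's total intercept, within-session
    slope (per hour), and between-session slope (per day), i.e., fixed
    effects plus that student's random effects."""
    rng = random.Random(seed)
    responses, theta_true = [], []
    study_hours, gap_days = 0.0, 0.0
    for s in range(n_sessions):
        if s > 0:
            gap_days += poisson(1.0, rng)   # mean-1 Poisson gap, in days
        for _ in range(items_per_session):
            study_hours += item_hours       # 0.05 h per item
            theta = alpha0 + alpha1 * study_hours + alpha2 * gap_days
            beta = rng.gauss(0.0, 2.0)      # item difficulty ~ N(0, 2)
            p = 1.0 / (1.0 + math.exp(-(theta - beta)))
            responses.append(1 if rng.random() < p else 0)
            theta_true.append(theta)
    return responses, theta_true
```

With four 40-item sessions this yields the 160 responses per student used in Scenarios A and B.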
Starting values
The third option (“Group-Specific” in Figs. 2, 3, 4, and 5) was to use the group-specific slopes for the dyslexic and non-dyslexic groups (i.e., α_{2g}, g = 1, 2). Finally, the fourth option (“Individual-Specific” in Figs. 2, 3, 4, and 5) was to use each individual student’s deviation (i.e., w_{2p}, p = 1, …, P) in addition to the group-specific averages. The data-analysis model was equivalent to that in Eq. 7. Note that the accuracy of the students’ deviations improves as more data for each student are collected across study sessions.
Results
The results are summarized for each of the four data generation scenarios.
Between-session ability change with small student variation (40 items per session)
Parameter estimates of the explanatory IRT models (n = 250 training set): Between-session ability change with small student variation
| Parameter | True | Overall Est. | Overall SE | Group-Specific Est. | Group-Specific SE |
| --- | --- | --- | --- | --- | --- |
| Intercept (overall) |  | −0.810 | 0.100 |  |  |
| α_{01} (Non-dyslexia) | 0.5 |  |  | 0.390 | 0.112 |
| α_{02} (Dyslexia) | −2.5 |  |  | −2.217 | 0.178 |
| Between-session slope (overall) |  | 0.446 | 0.090 |  |  |
| α_{21} (Non-dyslexia) | 0.7 |  |  | 0.681 | 0.133 |
| α_{22} (Dyslexia) | 0.1 |  |  | 0.245 | 0.110 |
With the validation data (n = 50 new students), the ERS was implemented with group-specific starting values as the initial ability levels. Figure 2 visualizes the MSE as a function of the total number of items answered at each iteration step of the ERS (left) and the true and estimated ability trajectories for an average student (right). The MSE value at the beginning is around 0.6 and decreases within Session 1. The same decreasing trend in MSE can also be found within Sessions 2–4, meaning that ability estimation in the ERS becomes more accurate as more items are given to students. However, the MSE tends to take considerable leaps between consecutive sessions. This is because there are true systematic ability changes during these periods (α_{21} = 0.7, α_{22} = 0.1), as well as unsystematic (individual-specific) ability changes. When the methods using the explanatory IRT model (in the “Individual-Specific” and “Group-Specific” forms) are used, the MSE values are smaller than with the traditional ERS, because the explanatory IRT model succeeds in accounting for those changes.
Also, the right panel of the figure shows the bias, that is, the gap between the true and estimated ability trajectories. The results suggest that the methods using the explanatory IRT model (both “Individual-Specific” and “Group-Specific”) consistently outperform the traditional ERS, because the explanatory IRT component reduces the bias due to ability change between sessions.
Between-session ability change with large student variation (40 items per session)
Parameter estimates of the explanatory IRT models (n = 250 training set): Between-session ability change with large student variation (40 items per session)
| Parameter | True | Overall Est. | Overall SE | Group-Specific Est. | Group-Specific SE |
| --- | --- | --- | --- | --- | --- |
| Intercept (overall) |  | −0.760 | 0.084 |  |  |
| α_{01} (Non-dyslexia) | 0.5 |  |  | 0.364 | 0.108 |
| α_{02} (Dyslexia) | −2.5 |  |  | −1.912 | 0.125 |
| Between-session slope (overall) |  | 0.428 | 0.089 |  |  |
| α_{21} (Non-dyslexia) | 0.7 |  |  | 0.567 | 0.126 |
| α_{22} (Dyslexia) | 0.1 |  |  | 0.284 | 0.123 |
Figure 3 visualizes the MSE as a function of the total number of items answered at each iteration step of the ERS (left) and the true and estimated ability trajectories for an average student (right). As compared with Scenario A, the MSE values are greater in general. The MSE value at the beginning is around 1.3, and it decreases to almost 0.4 within Session 1. The same decreasing trend in MSE can also be found within Sessions 2–4, meaning that ability estimation in the ERS becomes more accurate as more items are given to students. However, the MSE tends to take considerable leaps between consecutive sessions, due to the true systematic (α_{21} = 0.7, α_{22} = 0.1) and unsystematic (individual-specific) ability changes during these periods. When the methods using the explanatory IRT model (in the “Individual-Specific” and “Group-Specific” versions) are used, the MSE values are smaller than with the traditional ERS. Between the two approaches, the individual-specific estimate performs better, especially in Session 4. The right panel of the figure shows the bias, that is, the gap between the true and estimated ability trajectories. The results suggest that the methods using the explanatory IRT model (both “Individual-Specific” and “Group-Specific”) consistently outperform the traditional ERS, because the explanatory IRT component accounts for the ability change between sessions more accurately.
Between-session ability change with small student variation (80 items per session)
Parameter estimates of the explanatory IRT models (n = 250 training set): Between-session ability change with small student variation (80 items per session)
| Parameter | True | Overall Est. | Overall SE | Group-Specific Est. | Group-Specific SE |
| --- | --- | --- | --- | --- | --- |
| Intercept (overall) |  | −1.001 | 0.107 |  |  |
| α_{01} (Non-dyslexia) | 0.5 |  |  | 0.448 | 0.105 |
| α_{02} (Dyslexia) | −2.5 |  |  | −2.137 | 0.116 |
| Between-session slope (overall) |  | 0.405 | 0.084 |  |  |
| α_{21} (Non-dyslexia) | 0.7 |  |  | 0.650 | 0.113 |
| α_{22} (Dyslexia) | 0.1 |  |  | 0.238 | 0.126 |
Figure 4 visualizes the MSE as a function of the total number of items answered at each iteration step of the ERS (left) and the true and estimated ability trajectories for an average student (right). During Session 1, the MSE begins at 0.6 and approaches zero by the end of the session. During Sessions 3 and 4, noticeable gaps appear between the individual- or group-specific approaches and the traditional ERS approach (in light gray). The two approaches reduce the leaps in MSE between Sessions 1 and 2 (Items 80 and 81), Sessions 2 and 3 (Items 160 and 161), and Sessions 3 and 4 (Items 240 and 241). In this scenario, the individual-specific approach works best, although it does not substantially outperform the group-specific estimate. The right panel of the figure shows the bias, that is, the gap between the true and estimated ability trajectories. The results suggest that the methods using the explanatory IRT model (both “Individual-Specific” and “Group-Specific”) consistently outperform the traditional ERS, because the explanatory IRT component accounts for ability change between sessions more accurately.
Between-session ability change with large student variation (80 items per session)
Parameter estimates of the explanatory IRT models (n = 250 training set): Between-session ability change with large student variation (80 items per session)
| Parameter | True | Overall Est. | Overall SE | Group-Specific Est. | Group-Specific SE |
| --- | --- | --- | --- | --- | --- |
| Intercept (overall) |  | −0.824 | 0.100 |  |  |
| α_{01} (Non-dyslexia) | 0.5 |  |  | 0.311 | 0.105 |
| α_{02} (Dyslexia) | −2.5 |  |  | −2.310 | 0.128 |
| Between-session slope (overall) |  | 0.433 | 0.083 |  |  |
| α_{21} (Non-dyslexia) | 0.7 |  |  | 0.674 | 0.107 |
| α_{22} (Dyslexia) | 0.1 |  |  | 0.193 | 0.118 |
Figure 5 visualizes the MSE as a function of the total number of items answered at each iteration step of the ERS (left) and the true and estimated ability trajectories for an average student (right). Due to the large variation in ability change between sessions, there are considerable leaps between Sessions 1 and 2 (Items 80 and 81), Sessions 2 and 3 (Items 160 and 161), and Sessions 3 and 4 (Items 240 and 241). Although all four approaches perform very similarly during Session 2, the individual-specific and group-specific approaches outperform the others during Sessions 3 and 4. In particular, the individual-specific approach produces smaller MSEs than the group-specific approach, and the gap widens in Session 4. The right panel of the figure shows the bias, that is, the gap between the true and estimated ability trajectories. The results suggest that the methods using the explanatory IRT model (both “Individual-Specific” and “Group-Specific”) consistently outperform the traditional ERS, because the explanatory IRT component accounts for the ability change between sessions more accurately.
Real-data example
Data description
For illustrative purposes, we used a data set collected from a Web-based learning platform, "Oefenweb" (Oefenweb.nl). It was designed as an item-based e-learning environment for 200,000 pupils from a total of 2,000 mainly primary schools in the Netherlands. The learning environment supplements children's cognitive development in math and language at their own ability level, and teachers receive reports that keep them informed about their students. We used data obtained from one of its exercise programs, called Math Garden. In particular, the present data refer to math exercises on the addition operation, collected between fall 2016 and spring 2017, during one school year. The exercises were developed within the Rasch model framework. The platform includes data from large numbers of students and learning items; specifically, we used 1,562 students' responses to items across four study sessions in total. Note that the total number of items across the four study sessions varied by student (the average number of items per student was 81.3, with a minimum of 8 and a maximum of 576), and therefore the data set includes missing responses. In addition to the student responses to the items, the data set contains background information for the students, including (a) type of learning (0 = easy, 1 = moderate, and 2 = challenging), (b) grade (3rd, 4th, 5th, and 6th graders), and (c) gender (0 = female, 1 = male). Finally, the data set contains time stamps recording when the students started and finished solving each item, so the amounts of time spent within (wtime) and outside (btime) the study sessions can be calculated. In particular, wtime indicates the measurement time points in hours across all study sessions, and btime gives the spacing time in days between study sessions, computed from the session-specific time stamps. The median spacing time between two consecutive study sessions was 1.93 days.
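The derivation of wtime and btime from such time stamps can be sketched as follows. The log structure and the dates are made up for illustration; the actual Oefenweb data format is not specified here:

```python
from datetime import datetime

# Hypothetical response log for one student: (session_id, start_time)
log = [
    (1, datetime(2016, 10, 3, 9, 0)),
    (1, datetime(2016, 10, 3, 9, 20)),
    (2, datetime(2016, 10, 5, 9, 0)),
]

t0 = log[0][1]
# wtime: measurement time point in hours since the student's first response
wtime = [(t - t0).total_seconds() / 3600 for _, t in log]

# btime: spacing in days between the first responses of consecutive sessions
session_starts = {}
for sid, t in log:
    session_starts.setdefault(sid, t)  # keep the earliest time per session
starts = sorted(session_starts.values())
btime = [(b - a).total_seconds() / 86400 for a, b in zip(starts, starts[1:])]
```

With these toy values, the second session starts exactly two days after the first, so btime would contain a single spacing of 2.0 days.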
Among the students, 500 were selected for analysis with the explanatory IRT models, and the remaining 1,062 students were used to track ability trajectories with the traditional ERS. On the basis of this setup, the same procedures as described in Studies 1 and 2 were applied.
Results
Parameter estimates of the IRT models for the online data (n = 500)

Parameter                 Empty Model         Explanatory Model
                          Est.      SE        Est.      SE
Intercept                 0.094     0.090     –0.732    0.155
  Type of learning                            –0.302    0.095
  Grade                                       0.315     0.035
  Gender                                      –0.039    0.146
Between-session slope     0.013     0.063     0.019     0.176
  Type of learning                            –0.008    0.007
  Grade                                       –0.001    0.023
  Gender                                      0.017     0.127
For the same student, Fig. 7 compares the estimated ability trajectories for the ERS with and without accounting for ability change during the periods between sessions (i.e., while the student is not engaged with the learning environment). To avoid the initial cold-start problem, the background-specific estimates from Fig. 6 were used. Because the four options differ only in how the between-session effect is accounted for, the curves coincide in Session 1, and the differences are extremely minor in Session 2 ("S2"). At the beginning of Session 3 ("S3"), the ability trajectories using the three model-based starting values tend to be greater than with the traditional ERS. In Session 4, the ability estimates using the three model-based starting values are again greater than with the traditional one; the overall starting value gave the highest value, followed by the background-specific and the individual-specific starting values, so that the estimates using individual-specific starting values lie between the other two model-based options and the traditional ERS. After a longer sequence of items within the session, however, the gaps among the four approaches become negligible.
Discussion and conclusion
The present study proposed methods to address the cold-start problem in e-learning environments by implementing an explanatory IRT model within the ERS. The proposed methods were empirically evaluated via a simulation study. We considered various scenarios that differed in (a) whether group-specific initial ability estimates were specified for new students and (b) whether group-specific and/or individual-specific effects of the explanatory variables were included for the students' ability change while not using the learning environment. The explanatory IRT models were evaluated under conditions in which the true initial ability parameters differed across multiple groups, the true ability trajectory between sessions was either positive or negative, and relatively small or large variances across students were generated. The proposed models were evaluated with a cross-validation technique, in which the models were built on part of the data set (a training set) and tested on the remaining part (a testing set, which we treated as data from new students). Finally, the proposed methods were illustrated using real data obtained from an e-learning environment.
The results of the study showed that the ERS with the explanatory IRT models provided better latent ability estimates when a student entered the e-learning environment than did the traditional ERS with the Rasch model. The MSE values for the ability estimates at the initial stage were consistently lower for the explanatory models, and similar findings were observed across measurement occasions. These findings imply that, to obtain more accurate ability estimates in the ERS, an explanatory IRT model that takes students' information into account should be implemented. The flexibility to include explanatory variables plays a significant role in the ERS, and individualized initial estimates can increase the efficiency of the e-learning system. On the basis of these findings, we recommend using explanatory IRT models with the ERS to obtain more accurate and efficient ability estimates.
In addition, the present simulation results showed that when students resumed the e-learning system, group-specific ability estimates were better starting values for the ERS than the estimates from the last measurement occasion of the previous session. Explanatory IRT models that included between-session variables produced more accurate ability estimates than the model without them whenever a between-session effect was present, and they outperformed it under conditions with both positive and negative between-session effects. These findings imply that, in e-learning environments, the session-specific approach should be encouraged for estimating the between-session effect.
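A session-specific, model-based starting value of the kind recommended here could be computed along these lines. The functional form and the coefficient values below are illustrative assumptions, not fitted estimates from the study:

```python
def starting_value(intercept, session_slope, n_breaks):
    """Model-based ability at the start of a new session: a group-specific
    intercept plus a between-session slope applied once per completed
    between-session period (an assumed linear form)."""
    return intercept + session_slope * n_breaks

# Illustrative group-specific coefficients (placeholders): a student from a
# group with intercept 0.3 and per-break growth 0.7, resuming after two breaks
theta_start = starting_value(intercept=0.3, session_slope=0.7, n_breaks=2)
```

The ERS would then be initialized at theta_start instead of at the ability estimate from the last item of the previous session.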
It is also worth noting that we estimated the explanatory IRT models with a Bayesian estimation method. In general, when the fitted model becomes more complex or the data set includes a large number of observations, it is not uncommon to observe nonconvergence of the parameter estimation, or a much longer runtime before the model converges. A similar issue occurred in a study that examined longitudinal growth IRT modeling for e-learning data (Kadengye, Ceulemans, & Van den Noortgate, 2015). Because the Bayesian approach offers an alternative, obtaining estimates by sampling from the posterior distribution, we initially estimated the models using both a numerical integration approach and MCMC estimation, and found that MCMC yielded more accurate estimates in less runtime. Although more studies will be needed to compare the two estimation methods precisely for explanatory IRT models in the context of e-learning, our study suggests that the Bayesian approach to explanatory IRT models works well.
We also acknowledge some limitations of the study. For the simulation study, we incorporated only one categorical explanatory variable to address the cold-start problem. However, in practical settings more student information will be available and collected, as the real-data illustration shows, and those background variables should be included in the explanatory IRT model. Including more variables is expected to increase precision, but it also increases the complexity of the explanatory IRT model, and the same pattern as in the current approaches may not be obtained. In addition, the ERS that we considered does not assume adaptive item sequencing. In other words, in the simulation study the items were generated randomly across measurement occasions for each student, and the item difficulty parameters were assumed to be fixed and known (i.e., estimated in a large prior calibration study). However, given that the ERS allows for updating not only the ability estimate but also the item difficulties simultaneously, it would be worthwhile to investigate the performance of the explanatory IRT model under more complex conditions (e.g., when item difficulties cannot be considered known, so that the cold-start problem also applies to the item side). Finally, the present work demonstrates the added value of using the explanatory IRT method on top of the traditional method (Glickman, 1999) of addressing the cold-start problem with large step sizes at the beginning of new sessions. Specifically, we took a larger step size to start the ERS algorithm (by setting K = 0.7 at the beginning of each study session) and linearly decreased the size as a function of the total number of items answered. More research exploring the optimal step-size function would be desirable. For example, Klinkenberg et al. (2011, p. 1816) followed Glickman's reasoning and proposed making K a function of the number of items answered and the elapsed time between two answers.
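The step-size schedule just described (K = 0.7 at the start of each session, decreasing linearly with the number of items answered) can be sketched as follows. The starting value follows the text, whereas the floor and decay rate are illustrative assumptions not specified there:

```python
def step_size(items_answered, k_start=0.7, k_min=0.1, decay=0.01):
    """Elo step size K that starts at k_start at the beginning of a study
    session and decreases linearly with the number of items answered.
    k_start = 0.7 follows the text; k_min and decay are assumptions."""
    return max(k_min, k_start - decay * items_answered)

# K shrinks as evidence accumulates, so early (uncertain) responses move
# the ability estimate more than later ones within a session
k_values = [step_size(n) for n in (0, 30, 100)]
```

In the ERS update, a larger K early in a session lets the algorithm recover quickly from a poor starting value, which is exactly the traditional remedy that the explanatory IRT approach builds upon.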
Nonetheless, the results of this study provide valuable information about how to deal with the cold-start problem in e-learning environments. We recommend fitting explanatory IRT models that can be used to obtain good initial starting values for new students, or improved starting values at the start of new sessions. These (relatively computing-intensive) analyses do not have to be done on the fly, but can be repeated to obtain more precise estimates of the model parameters as the amount of available data increases. Likewise, the analyses to estimate student-specific trajectories do not have to be done on the fly; ideally, they would be performed between study sessions, for instance by updating the estimates once a day. The study is expected to allay concerns about the ERS and to enhance the usefulness of e-learning systems in educational settings. Furthermore, we acknowledge that the cold start is a problem that can be encountered in any educational setting, for example, when there is a new student (or new learning materials are introduced) in a traditional classroom. The teacher will likely try to overcome this cold-start problem using the same principle that underlies our explanatory IRT approach: predicting ability from the background characteristics of the student. An advantage of the classroom setting is that the teacher could, in principle, use multiple types of data (e.g., the student's attitude and participation) to estimate the extent to which the student has mastered the content. A disadvantage is that such predictions are not necessarily based on an objective, evidence-based model, but may be prone to prejudice. In addition, teachers cannot observe all students continuously, so they may miss important information that could be used to update their ability estimates. We hope this study will inspire researchers in face-to-face educational settings as well.
Acknowledgements
This research includes a methodological approach and a real-data example from the LEarning analytics for AdaPtive Support (LEAPS) project, funded by imec (Kapeldreef 75, B-3001 Leuven, Belgium) and the Agentschap Innoveren & Ondernemen. The LEAPS project aimed to develop a self-learning analytical system to enable adaptive learning. This support system can be integrated into educational games and into supporting software for struggling readers and for professional communication. Partners from a broad field of expertise work together within the consortium, including educational and cognitive scientists, software developers, statisticians, experts in human-computer interaction, and educational publishers. This extensive interdisciplinary collaboration makes it possible to create a much-needed commercial solution.
References
Bobadilla, J., Ortega, F., Hernando, A., & Bernal, J. (2012). Generalization of recommender systems: Collaborative filtering extended to groups of users and restricted to groups of items. Expert Systems with Applications, 39, 172–186.
Brinkhuis, M. J., & Maris, G. (2009). Dynamic parameter estimation in student monitoring systems (Measurement and Research Department Reports, Rep. No. 2009-1). Arnhem, The Netherlands: Cito.
Brusilovsky, P., & Peylo, C. (2003). Adaptive and intelligent Web-based educational systems. International Journal of Artificial Intelligence in Education, 13, 159–172.
Coomans, F., Hofman, A., Brinkhuis, M., van der Maas, H. L., & Maris, G. (2016). Distinguishing fast and slow processes in accuracy–response time data. PLoS ONE, 11, e0155149. https://doi.org/10.1371/journal.pone.0155149
Dai, Y., & Mislevy, R. (2009). A mixture Rasch model with a covariate: A simulation study via Bayesian Markov chain Monte Carlo estimation (PhD dissertation). College Park, MD: University of Maryland. Retrieved from http://drum.lib.umd.edu/handle/1903/9926
De Boeck, P., & Wilson, M. (2004). Explanatory item response models. New York, NY: Springer.
Elo, A. E. (1978). The rating of chessplayers, past and present (Vol. 3). London, UK: Batsford.
Frederickx, S., Tuerlinckx, F., De Boeck, P., & Magis, D. (2010). RIM: A random item mixture model to detect differential item functioning. Journal of Educational Measurement, 47, 432–457.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Boca Raton, FL: CRC Press.
Gelman, A., & Rubin, D. B. (1995). Avoiding model selection in Bayesian social research. Sociological Methodology, 25, 165–173. https://doi.org/10.2307/271064
Glickman, M. E. (1999). Parameter estimation in large dynamic paired comparison experiments. Applied Statistics, 48, 377–394.
Humphrey, N., & Mullins, P. M. (2002). Research section: Personal constructs and attribution for academic success and failure in dyslexia. British Journal of Special Education, 29, 196–203.
Kadengye, D. T., Ceulemans, E., & Van den Noortgate, W. (2014). A generalized longitudinal mixture IRT model for measuring differential growth in learning environments. Behavior Research Methods, 46, 823–840. https://doi.org/10.3758/s13428-013-0413-3
Kadengye, D. T., Ceulemans, E., & Van den Noortgate, W. (2015). Modeling growth in electronic learning environments using a longitudinal random item response model. Journal of Experimental Education, 83, 175–202.
Kalyuga, S., & Sweller, J. (2005). Rapid dynamic assessment of expertise to improve the efficiency of adaptive e-learning. Educational Technology Research and Development, 53, 83–93.
Klinkenberg, S., Straatemeier, M., & van der Maas, H. L. J. (2011). Computer adaptive practice of maths ability using a new item response model for on the fly ability and difficulty estimation. Computers & Education, 57, 1813–1824. https://doi.org/10.1016/j.compedu.2011.02.003
Maris, G., & van der Maas, H. (2012). Speed–accuracy response models: Scoring rules based on response time and accuracy. Psychometrika, 77, 615–633.
Papousek, J., Pelánek, R., & Stanislav, V. (2014, July). Adaptive practice of facts in domains with varied prior knowledge. Paper presented at the Educational Data Mining 2014 Conference, London, UK.
Pereira, A. L. V., & Hruschka, E. R. (2015). Simultaneous co-clustering and learning to address the cold start problem in recommender systems. Knowledge-Based Systems, 82, 11–19.
Plummer, M. (2015). Just another Gibbs sampler (JAGS) [Software]. Retrieved from http://mcmc-jags.sourceforge.net
Polychroni, F., Koukoura, K., & Anagnostou, I. (2006). Academic self-concept, reading attitudes and approaches to learning of children with dyslexia: Do they differ from their peers? European Journal of Special Needs Education, 21, 415–430.
R Core Team. (2013). R: A language and environment for statistical computing (Version 3.3.3). Vienna, Austria: R Foundation for Statistical Computing. Retrieved from www.R-project.org
Savi, A. O., van der Maas, H. L., & Maris, G. K. (2015). Navigating massive open online courses. Science, 347, 958.
Shute, V., & Towle, B. (2003). Adaptive e-learning. Educational Psychologist, 38, 105–114.
Snow, R. E. (1989). Toward assessment of cognitive and conative structures in learning. Educational Researcher, 18, 8–14.
Snow, R. E. (1996). Aptitude development and education. Psychology, Public Policy, and Law, 2, 536–560. https://doi.org/10.1037/1076-8971.2.3-4.536
Su, Y. S., & Yajima, M. (2015). R2jags: Using R to run "JAGS" (R package version 0.5-7).
Tang, T., & McCalla, G. (2004, August). Utilizing artificial learners to help overcome the cold-start problem in a pedagogically-oriented paper recommendation system. Paper presented at the 3rd International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Eindhoven, The Netherlands.
Van den Noortgate, W., De Boeck, P., & Meulders, M. (2003). Cross-classification multilevel logistic models in psychometrics. Journal of Educational and Behavioral Statistics, 28, 369–386.
Vukovic, R. K., Lesaux, N. K., & Siegel, L. S. (2010). The mathematics skills of children with reading difficulties. Learning and Individual Differences, 20, 639–643.
Wauters, K., Desmet, P., & Van den Noortgate, W. (2010). Adaptive item-based learning environments based on the item response theory: Possibilities and challenges. Journal of Computer Assisted Learning, 26, 549–562.
Wauters, K., Desmet, P., & Van den Noortgate, W. (2012). Item difficulty estimation: An auspicious collaboration between data and judgment. Computers & Education, 58, 1183–1193.