Multiple imputation methods for handling missing values in a longitudinal categorical variable with restrictions on transitions over time: a simulation study
Longitudinal categorical variables are sometimes restricted in terms of how individuals transition between categories over time. For example, with a time-dependent measure of smoking categorised as never-smoker, ex-smoker, and current-smoker, current-smokers or ex-smokers cannot transition to a never-smoker at a subsequent wave. These longitudinal variables often contain missing values; however, there is little guidance on whether these restrictions need to be accommodated when using multiple imputation methods. Multiply imputing such missing values while ignoring the restrictions could lead to implausible transitions.
We designed a simulation study based on the Longitudinal Study of Australian Children, where the target analysis was the association between (incomplete) maternal smoking and childhood obesity. We set varying proportions of data on maternal smoking to missing completely at random or missing at random. We compared the performance of the following multiple imputation methods: fully conditional specification with multinomial logistic imputation, ordinal logistic imputation, and predictive mean matching; two-fold fully conditional specification; indicator based imputation under multivariate normal imputation with projected distance-based rounding; and continuous imputation under multivariate normal imputation with calibration. Each method was applied in its standard form and in a form that accounted for the restrictions using a semi-deterministic imputation procedure.
Overall, we observed reduced bias when applying multiple imputation methods with restrictions, and fully conditional specification with predictive mean matching performed the best. Applying fully conditional specification and two-fold fully conditional specification for imputing nominal variables based on multinomial logistic regression had severe convergence issues. Both imputation methods under multivariate normal imputation produced biased estimates when restrictions were not accommodated; however, we observed substantial reductions in bias when restrictions were applied with continuous imputation under multivariate normal imputation with calibration.
In a similar longitudinal setting we recommend the use of fully conditional specification with predictive mean matching, with restrictions applied during the imputation stage.
Keywords: Fully conditional specification; Longitudinal categorical data; Missing data; Multiple imputation; Multivariate normal imputation; Restricted transitions
Abbreviations
ACA: available case analysis
BMI: body mass index
BMIz: BMI for age z-scores
CCA: complete case analysis
Continuous-calibration: imputation as a continuous variable using multivariate normal imputation with calibration
FCS: fully conditional specification
Indicator-PDBR: indicator based imputation using multivariate normal imputation with projected distance-based rounding
LSAC: Longitudinal Study of Australian Children
MAR: missing at random
MCAR: missing completely at random
MSE: mean square error
MVNI: multivariate normal imputation
PMM: predictive mean matching
Two-fold FCS: two-fold fully conditional specification
The problem of missing data is prominent in longitudinal studies as these studies involve gathering information from respondents at multiple waves over a long period of time. One approach for handling such missing data is multiple imputation (MI), which has become a frequently used method for handling missing data in observational epidemiological studies. MI is a two-stage process. In the first stage, the incomplete dataset is replicated multiple times, with the missing values replaced by values drawn from an appropriate imputation model. In the second stage, the analysis of interest is performed on each of the imputed datasets and the resulting parameter estimates are combined using Rubin’s rules. Multivariate normal imputation (MVNI) and fully conditional specification (FCS) are widely available MI methods that have been used in longitudinal studies [4, 5] to impute missing values.
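The second stage described above can be illustrated with a minimal sketch of Rubin's rules for a single parameter. The function name `pool_rubin` and the numerical inputs are hypothetical and serve only as an illustration; the degrees-of-freedom calculation used for confidence intervals is omitted.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Hypothetical estimates and variances from m = 3 imputed datasets
est, se = pool_rubin([0.10, 0.12, 0.14], [0.0004, 0.0004, 0.0004])
```

The pooled standard error exceeds the average within-imputation standard error whenever the estimates differ across imputed datasets, which is how MI propagates the uncertainty due to the missing values.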
MVNI imputes missing values by fitting a joint imputation model for all the variables with missing data, assuming that these variables follow a multivariate normal distribution. FCS uses univariate regression models fitted to each variable with missing data, with the model chosen according to the type of variable [7, 8]. When handling missing values in longitudinal data, standard implementations of MVNI and FCS can be applied by treating repeated measurements of the same variable at different time points as distinct variables, sometimes referred to as the “Just Another Variable” approach. For example, measurements of quality of life at different time points are treated as separate variables. This needs to be done for all the longitudinal variables. This approach does not explicitly model the longitudinal structure of the data, although it does allow for the correlations between the repeated measurements. The two-fold FCS algorithm is a recently proposed version of FCS that takes into consideration the longitudinal structure of the data by imputing missing values in a variable at a certain time point, using information only from the specific time point and immediately adjacent time points [9, 10]. Two-fold FCS may help to reduce convergence issues encountered with FCS in longitudinal studies with large numbers of waves and incomplete variables.
In many epidemiological studies, variables are collected that involve several restrictions. One example is that of restricted-transition variables. These are categorical variables where the set of possible future states depends on its current and previous states. For example, with a time-dependent measure of smoking categorised as never-, ex-, and current-smoker, current- or ex-smokers cannot transition to a never-smoker at a subsequent wave. Oral contraceptive use measured repeatedly as a never-user, ex-user or current-user is another example of a time-dependent variable which is restricted such that an ex- or current-user cannot transition into a never-user at a subsequent wave. However, never-users may start using oral contraceptives at any time.
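As an illustration of the restriction described above, the following minimal sketch checks whether a sequence of smoking categories contains an implausible transition. The function name `is_plausible` is hypothetical, and the numerical codes follow the coding of maternal smoking used in this paper (0 = never-smoker, 1 = ex-smoker, 2 = current-smoker).

```python
NEVER, EX, CURRENT = 0, 1, 2  # coding of the smoking categories


def is_plausible(trajectory):
    """Return True if no wave shows a never-smoker after an ex- or current-smoker.

    `trajectory` is a list of category codes ordered by wave; None marks a missing wave.
    """
    ever_smoked = False
    for state in trajectory:
        if state is None:
            continue                      # missing waves carry no information
        if state == NEVER and ever_smoked:
            return False                  # return to never-smoker is not allowed
        if state in (EX, CURRENT):
            ever_smoked = True
    return True
```

For example, `is_plausible([0, 0, 2, 1, 2])` is true, whereas `is_plausible([0, 2, 0])` is false because a current-smoker is followed by a never-smoker.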
Guidance on how MI methods should be applied for handling missing data in such variables is limited in the statistical literature. For incomplete smoking data (non-, ex- and current-smoker), Welch et al. focused on a simulation scenario where non-smokers at baseline did not transition into other smoking categories, and used deterministic imputation for the non-smoking category in this simulation study. Specifically, all respondents observed as non-smokers at any of the time points were imputed as non-smokers for missing time points. Missing values for the remaining respondents were imputed stochastically, as either a current-smoker or ex-smoker. Although this semi-deterministic approach is appealing, it may not always be appropriate, as in real-world situations some non-smokers may start smoking. Similarly, in the contraceptive use example, never-users may start using oral contraceptives over time. Another simulation study by Kalaycioglu et al. explored a number of scenarios for handling missing values in longitudinal data, including a categorical treatment variable, which had transition restrictions. However, little information was available on how missing values were handled in this variable.
While the primary goal of MI is to obtain valid inferences, and not to replace the actual missing values per se, it is important to assess the impact of implausible imputed values on the parameter estimates of interest [6, 7, 12]. Therefore, the aim of this paper was to evaluate the performance of possible MI approaches (namely MVNI, FCS, and the two-fold FCS algorithm) for handling missing values in a longitudinal categorical variable with restrictions on transitions over time. We report the findings of a case study from the Longitudinal Study of Australian Children (LSAC), and a simulation study based on the LSAC where up to approximately 65% of data on maternal smoking were set to missing completely at random (MCAR) or missing at random (MAR). In this study, maternal smoking was a time-dependent categorical exposure variable with restrictions, measured repeatedly over six time points.
Motivating example: Longitudinal study of Australian children (LSAC)
The Longitudinal Study of Australian Children (LSAC) is a prospective study of 10,000 children, involving two cohorts, the infant cohort (B) and the child cohort (K). Data collected at six time points, from 2004 to 2014, were available for this study. LSAC was approved by the Australian Institute of Family Studies Ethics Committee, and written informed consent was obtained from the caregiver on behalf of each of the study children, as the children were minors at the time of data collection.
Epidemiological analysis of interest
Childhood obesity is a growing epidemic in most developed countries, and a common problem among Australian children. Many severe diseases are attributable to childhood obesity. Importantly, exposure to maternal smoking has been found to be a risk factor for childhood obesity [16, 17, 18, 19]. The motivating example for our simulation study was to quantify the relationship between exposure to maternal smoking and body mass index (BMI).
Target analysis model
Table 1 Description of variables from the Longitudinal Study of Australian Children used in the simulation study for respondent i at wave j
Study child’s BMI for age
Maternal smoking: 0 = Never-smoker; 1 = Ex-smoker; 2 = Current-smoker
0 = No; 1 = Yes
Maternal age at child birth
0 = Not completed; 1 = Completed
0 = No; 1 = Yes
Family socio-economic status
Study child’s sex: 0 = Female; 1 = Male
Study child’s birth weight
Study child’s age
Simulation of complete data
The simulation study was based on six waves of the LSAC infant cohort, which had a participation of 5107 children at wave 1 (see Additional file 1: Table S1). Data were generated as specified below based on the causal diagram in Fig. 1a. This process was repeated to generate 1000 complete datasets. A detailed description of the simulation procedure is provided in Additional file 1.
Maternal smoking at wave 0 (i.e. during pregnancy) (m_smoking_{i,0}) was generated from a multinomial logistic regression model:
Maternal smoking at waves j = 1,…,6 (m_smoking_{i,j}) was generated in two stages.
Stage 1: Maternal smoking was generated for respondents who were never-smokers at the previous wave using the multinomial logistic regression model:
Stage 2: Maternal smoking for the remaining respondents (current- or ex-smoker) was generated using the logistic regression model:
We considered β1,1 = 0.10 and β1,2 = 0.15. In general, parameter values used in the simulation process were chosen to mimic the LSAC data (see Additional file 1: Table S2).
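The two-stage generation of maternal smoking can be sketched as follows. This is a simplified illustration and not the simulation model of the study: the covariates in the regression models are ignored, and all probabilities below (`P_WAVE0`, `P_FROM_NEVER`, `P_CURRENT_GIVEN_EVER`) are hypothetical values chosen only to show how the restriction is built into the data-generating process.

```python
import random

NEVER, EX, CURRENT = 0, 1, 2

# Hypothetical probabilities for illustration only (not the LSAC-based values)
P_WAVE0 = {NEVER: 0.60, EX: 0.25, CURRENT: 0.15}
P_FROM_NEVER = {NEVER: 0.95, EX: 0.02, CURRENT: 0.03}  # stage 1: three possible states
P_CURRENT_GIVEN_EVER = {EX: 0.15, CURRENT: 0.80}       # stage 2: P(current | previous state)


def draw(probabilities, rng):
    """Draw one category according to the given probabilities."""
    states = list(probabilities)
    return rng.choices(states, weights=[probabilities[s] for s in states])[0]


def simulate_trajectory(n_waves, rng):
    """Generate wave 0 plus `n_waves` follow-up waves for one respondent."""
    path = [draw(P_WAVE0, rng)]
    for _ in range(n_waves):
        previous = path[-1]
        if previous == NEVER:
            # Stage 1: never-smokers may stay never-smokers or start smoking
            path.append(draw(P_FROM_NEVER, rng))
        else:
            # Stage 2: ex- and current-smokers can only move between these two states
            is_current = rng.random() < P_CURRENT_GIVEN_EVER[previous]
            path.append(CURRENT if is_current else EX)
    return path


rng = random.Random(2019)
cohort = [simulate_trajectory(6, rng) for _ in range(1000)]
```

Because stage 2 is binary, no generated trajectory can return to the never-smoker category once another category has been observed.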
Generation of missing data
Table 2 Specifications of the parameters in the logistic regression models used to impose missing data under the missing at random scenarios
Predictor of missingness | Weak MAR: Model A (Equation 5b) | Weak MAR: Model B (Equation 6b) | Strong MAR: Model A (Equation 5b) | Strong MAR: Model B (Equation 6b)
Maternal depression at wave j | exp(ν1) = 1.67 | exp(ω1) = 1.61 | exp(ν1) = 2.80 | exp(ω1) = 2.70
BMI for age z-scores at wave j + 1 | exp(ν2) = 1.64 | exp(ω2) = 1.58 | exp(ν2) = 2.60 | exp(ω2) = 2.50
Specifically, under each MAR mechanism, it was assumed that the probability of missingness in maternal smoking followed a logistic regression model dependent on BMIz and the auxiliary variable maternal depression (Fig. 1b). The d-separation criterion was used to show that missingness is independent of unobserved data conditional on maternal depression at wave j (m_depression_j) and BMIz measured at the subsequent wave (BMIz_{j+1}); that is, the MAR assumption holds given these variables (see Additional file 1). The models used to generate missing values in maternal smoking were:
Model A introduces monotone missingness, such that, if the measurement at wave j is specified as missing using model A, then the individual will have measurements missing for all subsequent waves j + 1, … , 5.
where R_{i,j} is an indicator variable of missingness, and maternal smoking was assigned to missing for respondent i at wave j if R_{i,j} = 1.
Model B was only applied to the respondents who were not specified as missing using model A. The strong MAR scenario was obtained by doubling the log of the odds ratios used in the weak MAR scenario (see Table 2 for parameter values).
For each missingness mechanism (MCAR, weak MAR, or strong MAR), the overall missingness proportion for maternal smoking was set at 45% and 65%, representing realistic and extreme scenarios respectively, resulting in 6 simulation scenarios.
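The MAR mechanisms described above can be sketched as follows, assuming that each model has the form logit P(R = 1) = intercept + log(OR for depression) × m_depression at wave j + log(OR for BMIz) × BMIz at wave j + 1. The odds ratios are the weak MAR values reported in Table 2; the intercepts and all function names are hypothetical placeholders, since in the study the intercepts were tuned to reach the target missingness proportions.

```python
import math
import random

# Weak MAR odds ratios as reported in Table 2; intercepts are hypothetical
WEAK = {"A": {"intercept": -3.0, "or_dep": 1.67, "or_bmiz": 1.64},
        "B": {"intercept": -1.5, "or_dep": 1.61, "or_bmiz": 1.58}}


def strengthen(model):
    """Strong MAR: double the log odds ratios, which squares the odds ratios."""
    return {"intercept": model["intercept"],
            "or_dep": math.exp(2 * math.log(model["or_dep"])),
            "or_bmiz": math.exp(2 * math.log(model["or_bmiz"]))}


def p_missing(model, depression, bmiz_next):
    """Logistic model for P(R = 1) given depression at wave j and BMIz at wave j + 1."""
    lp = (model["intercept"]
          + math.log(model["or_dep"]) * depression
          + math.log(model["or_bmiz"]) * bmiz_next)
    return 1 / (1 + math.exp(-lp))


def missingness_pattern(depression, bmiz_next, models, rng):
    """Return R for successive waves: model A gives monotone dropout, model B gaps."""
    pattern, dropped_out = [], False
    for dep, bmiz in zip(depression, bmiz_next):
        if not dropped_out and rng.random() < p_missing(models["A"], dep, bmiz):
            dropped_out = True               # missing from this wave onwards
        if dropped_out:
            pattern.append(1)
        else:                                # model B applies only if not missing under A
            pattern.append(int(rng.random() < p_missing(models["B"], dep, bmiz)))
    return pattern
```

A call such as `missingness_pattern([0, 1, 0, 1, 0], [0.2, -0.1, 0.5, 1.0, 0.3], WEAK, random.Random(1))` returns one missingness indicator per wave for a single respondent.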
Methods to handle missing data
For comparison with MI methods, we first performed a complete case analysis (CCA), excluding all respondents with missing values for maternal smoking at any of the 5 waves, and an available case analysis (ACA), including available data at each wave in the analyses. These approaches are commonly used due to simplicity [2, 22, 23, 24]. CCA and ACA are expected to produce biased estimates under the MAR scenarios explored in this study. Both CCA and ACA condition on the missingness indicator R_j (see Fig. 1b). This missingness indicator is a collider as it lies in the pathway ‘m_depression_j → R_j ← BMIz_{j+1}’, opening a backdoor path between the exposure and outcome of interest that is not blocked in the analysis model, given that maternal depression is an auxiliary variable not included in the target analysis. Therefore, in principle we expect biased estimates under CCA and ACA, although this bias may be small.
We then assessed three MI methods, MVNI, FCS, and two-fold FCS, to multiply impute missing values in maternal smoking at waves 1 to 5. Given that the missingness mechanism generated satisfies the MAR assumption given m_depression_j and BMIz_{j+1}, as explained previously, we expect in principle that appropriate MI methods incorporating the target analysis variables as well as the auxiliary maternal depression variable will produce unbiased estimates under the missing data scenarios considered. Specifically, we considered two versions of each of these MI methods: the standard version, and the restriction-adapted version that accounts for restrictions in transitions over time.
In the standard implementation of MVNI and FCS, repeated measurements of maternal smoking were included as distinct variables in the imputation model (i.e. one variable for each time point). This ‘single-level’ imputation was used to impute missing data at all the time points. The correlation between the repeated measures is captured in this approach [4, 5]. However, treating repeated measurements of the same variable as distinct variables fails to account for the temporal ordering of the data, which may affect imputation.
With MVNI, due to the assumption of multivariate normality, the imputed values for maternal smoking could take non-integer values. Therefore, we used two methods for imputation: maternal smoking imputed as indicators using MVNI, followed by projected distance-based rounding (indicator-PDBR), and maternal smoking imputed as a continuous variable using MVNI, followed by calibration (continuous-calibration) [27, 28], to re-categorise imputed values into the original categories (see Additional file 1, Figures S2 and S3).
Within the FCS framework we considered three univariate imputation methods: multinomial logistic regression, ordinal logistic regression (treating the smoking variable as ordered based on the numerical codes 0, 1, 2), and predictive mean matching (PMM), which uses a linear prediction model to obtain predicted values and then draws at random from the k observed values nearest to the predicted value (we used k = 5 and k = 10).
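The matching step of PMM can be illustrated with the following minimal sketch, which uses a single predictor and a least-squares fit. It is a simplification and not the algorithm implemented in Stata: the regression parameters are not drawn from their posterior distribution, so the sketch does not propagate imputation-model uncertainty as proper MI requires. The function name `pmm_impute` and the example data are hypothetical.

```python
import random
import statistics


def pmm_impute(x_obs, y_obs, x_mis, k=5, rng=None):
    """Impute y for each x in `x_mis` by drawing from the k closest observed donors."""
    rng = rng or random.Random()
    # Least-squares fit of y on x using the observed cases
    mean_x, mean_y = statistics.fmean(x_obs), statistics.fmean(y_obs)
    sxx = sum((x - mean_x) ** 2 for x in x_obs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_obs, y_obs)) / sxx
    intercept = mean_y - slope * mean_x
    predicted_obs = [intercept + slope * x for x in x_obs]

    imputed = []
    for x in x_mis:
        target = intercept + slope * x
        # Indices of the k observed cases whose predictions are closest to the target
        donors = sorted(range(len(y_obs)),
                        key=lambda i: abs(predicted_obs[i] - target))[:k]
        imputed.append(y_obs[rng.choice(donors)])   # imputed value is an observed value
    return imputed


# Hypothetical data: smoking codes (0, 1, 2) against one numeric predictor
values = pmm_impute([0, 1, 2, 3, 4, 5, 6, 7], [0, 0, 0, 1, 1, 2, 2, 2],
                    [0.5, 6.5], k=3, rng=random.Random(1))
```

Because each imputed value is copied from an observed donor, PMM can only produce categories that occur in the observed data, which avoids the rounding step needed after MVNI.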
With the two-fold FCS algorithm, missing values in maternal smoking were imputed using information from only the specific and immediately adjacent time points, and assuming a multinomial logistic imputation model (ordinal logistic regression is not available in the current implementation of two-fold FCS).
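The time window used by two-fold FCS can be illustrated with a short sketch that lists the waves contributing information when imputing at a given wave. The function name `two_fold_window` and its default arguments are hypothetical; the sketch shows only the window selection, not the iterative imputation itself.

```python
def two_fold_window(wave, first_wave=1, last_wave=5, width=1):
    """Waves whose measurements inform the imputation at `wave` (itself plus neighbours)."""
    if not first_wave <= wave <= last_wave:
        raise ValueError("wave outside the study period")
    start = max(first_wave, wave - width)   # do not go before the first wave
    stop = min(last_wave, wave + width)     # do not go past the last wave
    return list(range(start, stop + 1))
```

With the default width of one wave, imputation at wave 3 draws on waves 2, 3 and 4, whereas imputation at wave 1 or wave 5 draws on only two waves.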
We used a linear mixed-effects model with a random intercept as our analysis model. Even though we used a multilevel analysis model, missing data were imputed using single-level fixed-effect imputation methods. These single-level fixed-effect MI methods allow an unstructured correlation structure between the repeated measurements. This means that no unnecessary assumptions are made about the correlations, which makes the single-level fixed-effect MI methods more general than a multilevel MI method. Furthermore, all imputation models included all variables in the analysis model as predictors, as well as the time-dependent auxiliary variable maternal depression. Hence the MI methods considered are approximately compatible with the analysis model. Even though single-level fixed-effect MI may lead to increased precision, the statistical literature has highlighted limitations of this method: it can inflate the sampling variance, lead to low coverage probabilities, and may be computationally demanding. These issues are discussed by Enders et al.
Stage 1: If a respondent was observed as a never-smoker at a specific wave, any missing values in all previous waves were deterministically assigned to be a never-smoker (Fig. 2a).
Stage 2: If a respondent was observed as a current- or ex-smoker at a specific wave, any missing values in all subsequent waves were imputed stochastically as current- or ex-smokers (i.e. as a binary variable) (Fig. 2b).
Stage 3: For the remaining scenarios (Fig. 2c), the missing values were imputed stochastically as never-, current- or ex-smokers.
In stage 3, it is inevitable that a small proportion of imputed values will violate the restrictions. However, we accepted these implausible values as it would be difficult to further introduce restrictions within the already existing restrictions.
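The three stages above can be expressed as a rule that assigns each missing value to a stage and to the set of categories it may take. The following is a minimal sketch of that rule only, not of the imputation itself; the function names `restriction_stage` and `candidate_categories` are hypothetical.

```python
NEVER, EX, CURRENT = 0, 1, 2


def restriction_stage(trajectory, wave):
    """Classify the missing value (None) at index `wave` into stage 1, 2 or 3."""
    if trajectory[wave] is not None:
        raise ValueError("value at this wave is observed")
    later = [s for s in trajectory[wave + 1:] if s is not None]
    earlier = [s for s in trajectory[:wave] if s is not None]
    if NEVER in later:
        return 1   # deterministic: never-smoker later implies never-smoker now
    if any(s in (EX, CURRENT) for s in earlier):
        return 2   # stochastic, binary: ex- or current-smoker
    return 3       # stochastic, all three categories


def candidate_categories(trajectory, wave):
    """Categories that may be imputed for the missing value at `wave`."""
    stage = restriction_stage(trajectory, wave)
    return {1: [NEVER], 2: [EX, CURRENT], 3: [NEVER, EX, CURRENT]}[stage]
```

For a trajectory such as `[0, None, None]`, both missing values fall in stage 3, so independent draws could yield a current-smoker followed by a never-smoker; this is the residual source of implausible values noted above.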
Performance measures for evaluating different methods
We compared the performance of CCA, ACA, and the different MI methods (standard and restriction-adapted versions) using the absolute bias (absolute difference between the true value and the average of the estimates from the 1000 simulations); the empirical standard error (square root of the variance of the 1000 estimates); and the coverage of the 95% confidence interval (proportion of simulated datasets in which the true parameter value was contained in the estimated 95% confidence interval). The relative bias (bias relative to the true parameter value), the model-based standard error (average of the standard errors of the 1000 estimates) and the mean square error (MSE), which is a combined measure of bias and efficiency, were also reported. The Monte Carlo errors for the MI estimates were used to assess the variation in estimated parameters across the simulations.
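The performance measures defined above can be computed as in the following minimal sketch. The function name `performance` and the example inputs are hypothetical, and the confidence intervals are formed with a normal quantile (1.96) as a simplifying assumption rather than the t-based intervals typically used after MI.

```python
import statistics


def performance(estimates, std_errors, true_value, z=1.96):
    """Summarise simulation results for one parameter and one method."""
    bias = statistics.fmean(estimates) - true_value
    covered = sum(abs(e - true_value) <= z * s
                  for e, s in zip(estimates, std_errors))
    return {
        "absolute_bias": abs(bias),
        "relative_bias": bias / true_value,
        "empirical_se": statistics.stdev(estimates),      # SD of the estimates
        "model_se": statistics.fmean(std_errors),         # average estimated SE
        "coverage": covered / len(estimates),             # proportion of CIs containing truth
        "mse": statistics.fmean([(e - true_value) ** 2 for e in estimates]),
    }


# Hypothetical results from four simulated datasets
summary = performance([0.9, 1.0, 1.1, 1.2], [0.1, 0.1, 0.1, 0.1], true_value=1.0)
```

The MSE combines squared bias and the variance of the estimates, so a method with small bias can still have a large MSE if its estimates are imprecise.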
Case study analysis
In addition to the simulation study, we also provide an empirical comparison of the methods considered, using the data from the LSAC infant cohort. We used wave-specific measures of whether the mother currently smoked or not to derive the never-smoker, ex-smoker and current-smoker at waves 1 through 6 (see Additional file 1).
Stata 13 statistical software was used for all analyses.
Results from simulation study
The standard and two-fold FCS methods with multinomial logistic regression imputation models failed to converge in all 1000 simulations for each of the 6 simulation scenarios. Standard FCS with ordinal logistic regression imputation showed extremely high non-convergence rates (up to 95%). The results for standard and two-fold FCS methods with multinomial logistic regression imputation, and for FCS with ordinal regression, are therefore not considered further in the following description of the results.
For all MI methods with no issues of convergence, we found substantial gains in precision compared to CCA. However, for ACA we observed slightly larger empirical standard errors compared to these MI approaches. Across these MI methods, there was minimal difference in precision irrespective of the imputation approach and whether it was applied with or without restrictions. The gain in precision for MI compared to CCA and ACA was also reflected in the MSE, in which the MI methods produced a substantially lower MSE compared to CCA, and a slightly lower MSE compared to ACA. FCS with PMM performed better in terms of MSE than the other imputation approaches in most missingness scenarios when no restrictions were applied; however, we did not observe much difference in MSE when restrictions were applied.
For most scenarios, the coverage was between 93.6% and 96.4%, which is the expected range for a nominal level of 95% based on 1000 simulations. However, a slight over-coverage was reported by both continuous-calibration and FCS with PMM for parameter estimates corresponding to ex-smokers relative to never-smokers, under both standard and restriction-adapted versions.
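The expected range for coverage quoted above follows from the Monte Carlo standard error of a proportion. A minimal sketch of the arithmetic, assuming a normal approximation with a 1.96 multiplier (the function name `coverage_range` is hypothetical):

```python
import math


def coverage_range(nominal=0.95, n_sim=1000, z=1.96):
    """Range within which estimated coverage should fall if true coverage is nominal."""
    half_width = z * math.sqrt(nominal * (1 - nominal) / n_sim)
    return nominal - half_width, nominal + half_width


low, high = coverage_range()
print(f"{100 * low:.1f}% to {100 * high:.1f}%")  # prints "93.6% to 96.4%"
```

Estimated coverage outside this range suggests under- or over-coverage beyond what simulation error alone would explain.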
Results from case study
Similar to the simulation study, the multinomial and ordinal logistic imputation models fitted under the FCS methods (both with and without restrictions) did not converge. Additionally, indicator-PDBR with restrictions, which showed some convergence issues in the simulation study, did not converge with the real data.
We compared the performance of three MI methods (MVNI, FCS, and two-fold FCS), applied with and without restrictions, in addition to CCA and ACA, for handling missing data in a categorical variable with restrictions over time. We considered 6 different scenarios of missing data in maternal smoking, a longitudinal categorical exposure with three levels (never-smoker, ex-smoker and current-smoker), where an ex- or current-smoker at a specific wave is restricted from transitioning into a never-smoker.
Consistent with previously published studies [9, 21, 36, 37, 38], CCA and ACA produced negligible bias under MCAR. CCA excluded all individuals with missing data in at least one wave from the analysis. Missing data in maternal smoking were generated such that missingness was dependent on the outcome, BMI for age z-scores, after conditioning on the variables of the target analysis model. Therefore, as expected, CCA produced biased estimates when data were MAR, with larger bias in the stronger MAR scenario. In contrast, in nearly all missingness scenarios investigated, ACA produced less biased estimates than MI without restrictions. This may be because ACA accounts for most of the missingness mechanism through the correlation between the repeated measurements. The imputation of implausible transitions under standard MI without restrictions is a possible reason why this method produced more biased estimates than ACA. Furthermore, standard MVNI and FCS methods do not account for the temporal ordering of the repeated measurements, as they treat repeated measurements of the same variable as distinct variables, which may explain the under-performance. However, simulation studies by Kalaycioglu et al. and De Silva et al. have shown that both MVNI and FCS may not be too susceptible to this issue, as they have both been shown to have very good performance when including as much information as possible (i.e. all the repeated measurements) in the imputation model, as implemented in our study. Conversely, Kalaycioglu et al. reported more biased estimates using ACA compared with MI without restrictions in the presence of multiple longitudinal variables with missing data, many of which were not restricted. In terms of precision, we observed substantial and slight gains with MI in both standard and restriction-adapted versions compared to CCA and ACA respectively, consistent with previous studies [4, 5]. This was presumably because we used maternal depression (a fully observed time-dependent variable) in the imputation models, which was a strong predictor of missingness [4, 21, 31, 39, 40].
The standard FCS approach imputing smoking using multinomial or ordinal logistic regression imputation failed to converge in 95–100% of the simulated datasets. Our findings agree with the results of simulation studies by Welch et al. and Kalaycioglu et al., which reported convergence issues in FCS, albeit of smaller proportions. Welch et al. assumed that non-smokers at baseline remained non-smokers throughout, and that only current- and ex-smokers transitioned between the two categories, thus converting the imputation of maternal smoking into a binary imputation. Despite this, approximately 25% of the simulated datasets did not converge with standard FCS. Of note, application of the two-fold FCS in our simulation study, which reduced the number of categorical predictor variables in each univariate imputation model where imputation of smoking was performed using multinomial logistic regression, still did not overcome the convergence issues. With the real data, we observed convergence issues similar to those seen in the simulation study.
Multinomial logistic regression faces difficulties of convergence when the imputation model includes a large number of categorical variables with rare categories and/or high collinearity. In our study, under FCS, six categorical smoking variables (one for each time point) were included in the multinomial logistic imputation model, and only a small number of ex-smokers were present in the simulated data mimicking the real cohort. Even though under the two-fold FCS algorithm only four categorical smoking variables (current and immediately adjacent time points, and smoking during pregnancy) were included in the multinomial logistic imputation model, all of these variables had a rare category leading to convergence issues.
FCS with PMM imputation produced the least biased estimates when compared to other MI methods irrespective of whether restrictions were applied. It also produced the smallest MSE across the 6 missingness scenarios, gaining precision over ACA, which performed best in terms of bias. While all other MI methods either failed to converge for all simulated datasets or resulted in large bias, PMM performed well both with and without restrictions. PMM replaces missing values with observed values [29, 41]; therefore, even without restrictions, the proportion of implausible transitions imputed was low. PMM also avoids the problems arising from rounding methods related to MVNI. Slight issues of over-coverage were observed under PMM. Rodwell et al. also reported issues with coverage when using PMM for imputing limited range variables, due to the matching algorithm used in Stata for PMM imputation. PMM can use one of three types of matching (0, 1 and 2) to calculate a predictive distance between an observed value and a value obtained from a linear predictor, and identifies the k observations which minimise this predictive distance. The ‘mi impute pmm’ command in Stata uses type 2 matching. PMM can also be implemented in R using the ‘mice’ package, which uses type 1 matching. Type 2 matching differs from type 1 matching in that it does not adequately account for the uncertainty around the parameters of the imputation model when computing the predictive distance. A simulation study by Morris et al. reported under-coverage for PMM under both type 1 and type 2 matching, with type 2 matching leading to slightly worse coverage probabilities for this reason. Therefore, the coverage probabilities may have been better when implementing PMM using the ‘mice’ package in R compared to the ‘mi impute pmm’ command in Stata.
Simulation studies by Kalaycioglu et al. and De Silva et al. have shown that MVNI can have very good performance when used to impute missing longitudinal data. However, the underlying assumption of multivariate normality is not plausible in our study as maternal smoking is a categorical variable. While MVNI can result in valid inferences despite the departure from multivariate normality [6, 43], adoption of a suitable rounding method to deal with non-integer imputed smoking values is required for the analysis of interest. Although there are a number of rounding techniques available for categorical variables at a single time point [44, 45], rounding methods in the context of longitudinal data are yet to be explored. We observed high biases with both MVNI approaches under different scenarios, especially without restrictions. This is presumably because indicator-PDBR uses an indicator based approach for imputation followed by projected distance-based rounding, which does not aim to preserve the marginal proportion in each category, and because continuous-calibration imputes maternal smoking as a continuous variable, followed by calibration for rounding, which distorts the association between the exposure and outcome, even though it aims to preserve the marginal proportion in each category [44, 45]. Continuous-calibration resulted in substantial reductions in bias when restrictions were applied, and there were slight gains in MSE from continuous-calibration compared to indicator-PDBR, which agrees with the findings of Galati et al. It should, however, be noted that continuous-calibration was originally proposed for ordinal variables, while maternal smoking is technically a nominal variable. Indicator-PDBR also faced some convergence issues, presumably because it uses an indicator-based approach for imputation.
The three-stage restriction procedure employed in our study is an extension of the semi-deterministic approach used by Welch et al., where they simplified the imputation to ex- and current-smokers as discussed previously. We observed moderate to substantial reductions in bias for PMM and continuous-calibration, and fewer convergence issues for indicator-PDBR, when restrictions were applied. However, when restrictions were applied, we observed that the empirical standard errors either slightly increased or remained the same compared with the standard implementation of MI. The MSE was greatly influenced by the empirical standard error due to its relatively large magnitude compared with the absolute bias. Therefore, even in scenarios which showed substantial improvements in bias, little or no change in empirical standard errors resulted in no changes in MSE when restrictions were applied.
There is currently limited guidance on the imputation of missing values in time-dependent categorical variables even without restrictions. With standard FCS often facing convergence issues in the presence of categorical variables with rare categories, and unsatisfactory rounding methods for MVNI, this area warrants further research. Enders et al. suggested using a joint imputation procedure with latent variable formulation for categorical variables, available in the MLwiN software. The ‘jomo’ package in R is designed for multilevel joint modelling MI, but to date has not been widely adopted. Our study was limited to currently available methods in the Stata statistical software, and multilevel MI methods such as ‘jomo’ are currently not available in Stata. Additionally, further research is required to examine how to implement restrictions within these multilevel imputation methods, and this was beyond the scope of this study.
Our simulation study was designed based on the LSAC infant cohort to assess the performance of MI methods in a realistic setting [4, 21, 36]. We also provide a case study for an empirical illustration of what we observed in the simulation study. This simulation study was designed based on a single cohort, and the performance of the methods may vary with changes in various factors, including the magnitude and structure of the correlations between the repeated measurements, and the magnitudes of the parameters used in the simulation models. Therefore, caution is required when generalising these results.
The findings from this study, which was based on a longitudinal cohort study, indicate that among the MI methods available in Stata (which are all single-level fixed-effect models), FCS with PMM, applied with restrictions, performs best in terms of bias and precision, when handling up to 65% missing values in a time-dependent categorical exposure variable with restrictions on transitioning over time. In a similar longitudinal setting, we would recommend the use of PMM within the FCS framework with a suitable procedure to implement restrictions within the imputations.
This work was supported by funding from the National Health and Medical Research Council: a Centre of Research Excellence grant, ID 1035261, awarded to the Victorian Centre of Biostatistics (ViCBiostat); and a Senior Research Fellowship ID 1104975 (JAS) and Career Development Fellowship ID 1053609 (KJL). APDS is funded by a Victorian International Research Scholarship and a Melbourne International Fee Remission Scholarship.
Availability of data and materials
All data generated and analysed during the current study are available from the corresponding author on reasonable request.
Authors APDS and JAS designed the study with critical review from AMDL, MMB and KJL. APDS performed the simulation study and statistical analyses under the supervision of JAS and AMDL. APDS drafted the paper with input from JAS, MMB, AMDL and KJL. All authors were responsible for critical revision of the manuscript and have approved the final version to be published.
Ethics approval and consent to participate
For the simulation study, data were completely simulated, which did not require approval from the ethics committee or consent from participants. The case study example used in this study was based on the infant cohort of LSAC which has been provided ethical clearance by the Australian Institute of Family Studies Ethics Committee. The authors are approved users of the LSAC data and were granted access to the data through The University of Melbourne’s Organisational Deed of License. Written informed consent was obtained from the caregiver on behalf of each of the study children, as the children were minors at the time of data collection. The signed consent forms are retained by the field agency (Australian Bureau of Statistics).
Consent for publication
Competing interests
The authors declare that they have no competing interests.
References
4. De Silva AP, Moreno-Betancur M, De Livera AM, Lee KJ, Simpson JA. A comparison of multiple imputation methods for handling missing values in longitudinal data in the presence of a time-varying covariate with a non-linear association with time: a simulation study. BMC Med Res Methodol. 2017;17(1):114–24.
7. Raghunathan TE, Lepkowski JM, Van Hoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology. 2001;27(1):85–95.
13. Australian Institute of Family Studies. The Longitudinal Study of Australian Children: An Australian Government Initiative, Data User Guide. 2013.
26. Allison PD. Missing data. Thousand Oaks, Calif. London: SAGE Publications; 2002.
29. Little RJA. Missing-data adjustments in large surveys. J Bus Econ Stat. 1988;6(3):287–96.
35. StataCorp. Stata statistical software, release 13. College Station: StataCorp LP; 2013.
47. Quartagno M, Carpenter J. Package 'jomo'. R statistical software package. 2016.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.