Japanese Journal of Statistics and Data Science, Volume 1, Issue 2, pp 413–433

Application of the bootstrap method for change points analysis in generalized linear models

Original Paper

Abstract

In this paper, we focus on the construction methods of the prediction model, estimation methods of the change point locations, and the confidence intervals for the generalized linear model with piecewise different coefficients. As a standard approach for multiple change point analysis, the hierarchical splitting algorithm is widely used. However, the hierarchical splitting algorithm carries a high risk that the standard error of the change point estimators becomes large and, therefore, that the prediction accuracy of the estimated model decreases. To deal with this problem, we consider the application of a bootstrap method based on the hierarchical splitting algorithm. Through simulation studies, we compare the algorithms in terms of the prediction accuracy of the estimated model, bias and variance of the change point estimators, and the accuracy of the confidence intervals of the change points. From the results, we confirmed the utility of the bootstrap-based methods for change point analysis, especially the increased prediction accuracy of the obtained model, decreased standard error of the change point estimators, and construction of better confidence intervals depending on the situation. We also present the results of a simple example to demonstrate the utility of the method.

Keywords

Bagging, Break point, Confidence interval, Ensemble method, Hierarchical splitting

1 Introduction

Generalized linear models are widely used for modeling a response variable of interest based on explanatory variables. In ordinary analysis based on a generalized linear model (GLM), the model is assumed to hold for the entire data set. However, it is widely understood that this assumption does not hold in several situations. For example, in epidemiological studies in occupational medicine, there is often a threshold concentration of a specific agent, which has an adverse health effect (Ulm 1991). As another example, in medical research, there is a possibility that the mortality rate for a certain disease changes suddenly at a certain threshold of age. To deal with these data, we can consider linear models, where the structure is changed at some points by an explanatory variable. These points are called change points or break points. In this study, we focus on the construction method of the prediction model, estimation methods of the change points, and the confidence interval for the GLM with piecewise different coefficients.

Change point analysis has been studied for a number of years. For example, Hawkins (1977), Worsley (1979), Inclán (1993), and Chen and Gupta (1997) studied the detection of change point locations in a sequence of random variables that follows a normal distribution. Hawkins (1977) and Worsley (1979) described a method based on the likelihood procedure test. On the other hand, Inclán (1993) proposed a Bayesian-based approach, and Chen and Gupta (1997) studied an approach based on a Bayesian information criterion (Schwarz 1978). If there are multiple change points, the grid search could be used with each method. However, if the number of search points is large, this method is not practical from the viewpoint of computational complexity. To deal with this problem, the hierarchical splitting (HS) algorithm, which dichotomizes data recursively in the same way as a classification and regression tree (Breiman et al. 1984), is widely used (Chen and Gupta 2012). As an improved version of the HS algorithm, Hawkins (2001) proposed the dynamic programming (DP) algorithm, which can change the determination of the location of change points according to the number of change points. In cases where the number of change points is unknown, information criteria or tests based on the limit theorem are used to estimate it (Chen and Gupta 2012). The studies in change point analysis for a sequence of random variables are summarized by Csörgő and Horváth (1997) and Chen and Gupta (2012).

There are also many studies on change point analysis for ordinary linear models (OLM). For example, the likelihood-ratio-based methods are discussed in Quandt (1958, 1960), Kim and Siegmund (1989), and Kim (1994). Brown et al. (1975) and James et al. (1987) introduced the regression residual-based method, and Brown et al. (1975) also described the recursive residual-based method. These methods are based on the use of the limiting distribution under the null hypothesis that there is no change point. In addition to these methods, there are regression spline-based approaches (Smith 1979), and Bayesian-based approaches (Holbert 1982), etc. Recently, a method for carrying out change point analysis and variable selection simultaneously has also been proposed (Wu 2008). For the case of OLM, the HS algorithm is generally used to search for the locations of multiple change points, and there is also research into methods using the DP algorithm (Bai and Perron 2003). In cases where the number of change points is unknown, information criteria are generally used to estimate it. Studies into change point analysis for OLM have been summarized by Chen and Gupta (2012).

Although there are fewer studies on change point analysis in GLM than on the sequence of random variables or OLM, several have been published. Stasinopoulos and Rigby (1992) discussed the detection method of a change point in univariate GLM, and showed the results of its application to medical data. Ulm (1991) and Gurevich and Vexler (2005) discussed the detection method of a change point in logistic regression models for epidemiological data analysis. Küchenhoff and Carroll (1997) discussed the estimation methods of change points in a segmented GLM with measurement error. As with the case of a sequence of random variables or OLM, application of the HS algorithm to the detection of the multiple change points in GLM can naturally be considered. On the other hand, the DP algorithm cannot be applied under the assumption that the dispersion parameters in each segment are equal. Because we treat the GLM by assuming that the coefficients are different in each segment but the dispersion parameters are equal, methods based on the HS algorithm are considered in this study. In cases where the number of change points is unknown, we consider the use of the information criteria.

As a disadvantage of the HS algorithm, the estimated locations of change points are fixed until the end of the algorithm. From this, an optimal combination of change points may not be found in some cases. As a result, there is a high risk that the variance of the estimator will become large. Moreover, if the locations of change points are estimated incorrectly, it is expected that the prediction accuracy of the finally obtained model will decrease. To deal with this problem, we consider the application of the HS algorithm with a bootstrap method in this study. It is expected to decrease the variance of estimators of change points by aggregating the estimators obtained from each model, which are given by resampling data and the HS algorithm. Moreover, it is expected to increase the prediction accuracy by aggregating (bagging) the models obtained from resampling data and the HS algorithm.

As another disadvantage of the HS algorithm, the distributions of estimators of change points are not clear. As discussed above, in many studies the limiting distributions of statistics are investigated under the null hypothesis that there is no change point. On the other hand, the distribution of the estimators given by the HS algorithm is not well known. Therefore, we compare the confidence intervals of estimators through simulation studies. For the construction method of confidence intervals, we mainly compare two methods: the method which assumes the asymptotic normality of estimators, and a method based on empirical distribution.

The remainder of this paper is organized as follows: In Sect. 2, we introduce the notation and models treated in this study. In Sect. 3, the HS algorithm and the bagging algorithm of the models obtained by the HS algorithm for prediction are described. In Sect. 4, we present the construction methods of confidence intervals of change point estimators, which are compared in this study. In Sect. 5, the results of the simulation studies are described. The results of applying the method to simple example data are shown in Sect. 6. Finally, we describe the conclusions of this paper in Sect. 7.

2 Notation and model

Let $${\mathcal {L}}=\{(y_i, {\varvec{x}}_i, {\varvec{z}}_i);\ i= 1, 2, \ldots , n\}$$ denote a set of observed learning samples, where $$y_i$$ is the response, and $${\varvec{x}}_i = (1, x_{i1}, x_{i2},\ldots , x_{i q-1})'$$ and $${\varvec{z}}_i = (z_{i1}, z_{i2},\ldots , z_{ip})'$$ denote the vectors of independent explanatory variables. We assume that $$y_i$$ comes from a distribution in the exponential family with probability density function,
\begin{aligned} f(y_i|\theta , \phi )=\exp \left[ \frac{y_i\theta - b(\theta )}{\phi } + c(y_i, \phi )\right] , \end{aligned}
where $$\theta$$ is the canonical parameter, $$\phi$$ is the dispersion parameter, and $$b(\cdot )$$ and $$c(\cdot )$$ are known functions. In addition to this, we assume that the n pairs $$(y_i, {\varvec{x}}_i, {\varvec{z}}_i)$$ of observations are arranged in ascending order based on the continuous variable $$x_{i1}$$ considering change points.
In GLM, the piecewise different mean structure arises from the change in structure of the linear predictor. That is, the change in structure of the link function and/or the explanatory variables and/or the coefficients can be the cause of the difference in the mean structure. On the other hand, the change in variance structure can arise from the change in structure of the dispersion parameter and/or variance function and/or mean. Therefore, we can consider piecewise different mean and variance structure by thinking about the GLM with piecewise different coefficients. Let $$\mu _i=E(y_i)$$ represent the mean of $$y_i$$; we consider the linear predictor model with $$d-1$$ change points:
\begin{aligned} \eta _i \equiv g(\mu _i) =\left\{ \begin{array}{ll} {\varvec{x}}_i'{\varvec{\beta }}_1 + {\varvec{z}}_i'{\varvec{\alpha }}, &{}\quad \tau _0< x_{i1} \le \tau _1 ,\\ {\varvec{x}}_i'{\varvec{\beta }}_2 + {\varvec{z}}_i'{\varvec{\alpha }}, &{}\quad \tau _1< x_{i1} \le \tau _2 ,\\ \quad \quad \vdots &{} \\ {\varvec{x}}_i'{\varvec{\beta }}_d + {\varvec{z}}_i'{\varvec{\alpha }}, &{}\quad \tau _{d-1} < x_{i1} \le \tau _d ,\\ \end{array} \right. \end{aligned}
where $$g(\cdot )$$ represents the link function, and $$\tau _0 = -\infty$$ and $$\tau _d = +\infty$$ are assumed. Here $${\varvec{\beta }}_k=(\beta _{0k}, \beta _{1k}, \ldots , \beta _{q-1 k})'$$ represents the piecewise different coefficient vector ($$k=1, 2, \ldots , d$$), and $${\varvec{\alpha }}=(\alpha _1, \alpha _2, \ldots , \alpha _p)'$$ represents the coefficient vector common to all segments. As a restriction to guarantee that the coefficients are estimable, it is assumed that the number of learning samples included in each segment is larger than q. Then, our aim is to estimate the locations of the change points $$\tau _1, \tau _2, \ldots , \tau _{d-1}$$, the coefficient vectors $${\varvec{\beta }}_1, {\varvec{\beta }}_2, \ldots , {\varvec{\beta }}_d, {\varvec{\alpha }}$$, the dispersion parameter $$\phi$$ if present, and the number of segments d if it is unknown.
To estimate these parameters with the maximum likelihood method, the log likelihood of this model can be represented by the sum of the d log likelihoods using the samples included in each segment:
\begin{aligned} l({\varvec{\tau }},{\varvec{\beta }},{\varvec{\alpha }},\phi |{\varvec{y}}) = \sum _{k=1}^d \sum _{{\mathop {x_{i1}\in (\tau _{k-1}, \tau _k]}\limits ^{i}}} \log f(y_i|{\varvec{\beta }}_k, {\varvec{\alpha }}, \phi ), \end{aligned}
where $${\varvec{\tau }} = (\tau _1, \tau _2, \ldots , \tau _{d-1})$$ and $${\varvec{\beta }} = ({\varvec{\beta }}_1, {\varvec{\beta }}_2, \ldots , {\varvec{\beta }}_d)$$. The maximum likelihood estimators (MLE) of $${\varvec{\beta }}$$, $${\varvec{\alpha }}$$, and $$\phi$$ are obviously dependent on the unknown location of change points $${\varvec{\tau }}$$. If $${\varvec{\tau }}$$ is fixed, the score functions of $${\varvec{\beta }}$$ and $${\varvec{\alpha }}$$ are given by
\begin{aligned} \frac{\partial }{\partial \beta _{j_1 k}}l({\varvec{\beta }},{\varvec{\alpha }},\phi |{\varvec{\tau }},{\varvec{y}}) = \sum _{{\mathop {x_{i1}\in (\tau _{k-1}, \tau _k]}\limits ^{i}}} \left[ \frac{(y_i-\mu _i)}{\phi V(\mu _i)}\frac{\partial \mu _i}{\partial \eta _i}x_{ij_1} \right] , \end{aligned}
for $$j_1=0, 1, \ldots , q-1$$, and
\begin{aligned} \frac{\partial }{\partial \alpha _{j_2}}l({\varvec{\beta }},{\varvec{\alpha }},\phi |{\varvec{\tau }},{\varvec{y}}) = \sum _{k=1}^d \sum _{{\mathop {x_{i1}\in (\tau _{k-1}, \tau _k]}\limits ^{i}}} \left[ \frac{(y_i-\mu _i)}{\phi V(\mu _i)}\frac{\partial \mu _i}{\partial \eta _i}z_{ij_2} \right] , \end{aligned}
for $$j_2=1, 2, \ldots , p$$, respectively. Therefore, the MLE $$\hat{\varvec{\beta }}$$ and $${\hat{\varvec{\alpha }}}$$ can be obtained by general iterative methods such as iterative weighted least squares. Moreover, the MLE $$\hat{\phi }$$ can be calculated with a common method such as the Pearson statistic by using $$\hat{\varvec{\beta }}$$ and $${\hat{\varvec{\alpha }}}$$.
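As an illustration of the estimation step for fixed $${\varvec{\tau }}$$, the following is a minimal sketch of iteratively reweighted least squares for a Poisson GLM with log link. It assumes one segment-specific intercept and slope on a single variable x (that is, $$q=2$$) and $$\phi =1$$; the function names `piecewise_design` and `irls_poisson` are hypothetical and are not taken from the implementation used in this study. For the canonical log link, the score functions above reduce to $$X'({\varvec{y}}-{\varvec{\mu }})$$, which is what the iteration drives to zero.

```python
import numpy as np

def piecewise_design(x, z, taus):
    """Design matrix with one (intercept, slope) pair per segment and shared z columns.

    `taus` holds the interior change points; segment k is tau_{k-1} < x <= tau_k.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(x), -1)
    edges = np.concatenate(([-np.inf], np.sort(taus), [np.inf]))
    cols = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ind = ((x > lo) & (x <= hi)).astype(float)
        cols += [ind, ind * x]          # beta_{0k} and beta_{1k}
    return np.column_stack(cols + [z])  # alpha is shared across segments

def irls_poisson(X, y, n_iter=50, tol=1e-8):
    """Iteratively reweighted least squares for a Poisson GLM with log link."""
    y = np.asarray(y, dtype=float)
    # Starting value: ordinary least squares on a log-transformed response.
    coef = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]
    for _ in range(n_iter):
        eta = X @ coef
        mu = np.exp(eta)
        w = mu                          # (dmu/deta)^2 / V(mu) = mu for the log link
        work = eta + (y - mu) / mu      # working response
        new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * work))
        converged = np.max(np.abs(new - coef)) < tol
        coef = new
        if converged:
            break
    return coef
```

The coefficient vector is returned in the column order of the design matrix: the intercept and slope of each segment in turn, followed by the shared coefficients on z.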

Because $${\varvec{\tau }}$$ is actually unknown, some iterative search method is needed that estimates $${\varvec{\beta }}$$, $${\varvec{\alpha }}$$ and $$\phi$$ under possible combinations of fixed $${\varvec{\tau }}$$. It seems to be intuitive to use the grid search over all possible $${\varvec{\tau }}$$, but there is a problem from the perspective of computational effort. That is, the order of a grid search for the known number of change points d is $$O(n^d)$$, and this method is not realistic when the number of samples n or segments d is large. A more efficient method that dichotomizes samples recursively is introduced in Sect. 3.

If the number of segments d is unknown, there are several conceivable methods to estimate it. For example, iterative methods for testing $$d=d'$$ versus $$d=d'+1$$ based on the limit theorem for a sequence of random variables or OLM have been given (see Chen and Gupta (2012) and Csörgő and Horváth (1997) for more details). In the GLM case, however, it is difficult to derive the limiting distribution of the test statistics for estimating the number of change points. Another, easier method that can be considered is the use of information criteria. Although the Akaike information criterion (AIC; see Akaike 1973) exists as a representative information criterion, it is well known that it does not become an asymptotically consistent estimator for changes in model order (Schwarz 1978). From this, the Bayesian information criterion (BIC; see Schwarz 1978), which is a modification of the AIC, is widely used in the analysis of change points (Chen and Gupta 2012). Based on the free parameters $${\varvec{\beta }}, {\varvec{\alpha }}, \phi , {\varvec{\tau }}$$, in our model these criteria are given as follows:
\begin{aligned} AIC(d')= & {} -2\log l\left( \hat{\varvec{\beta }},{\hat{\varvec{\alpha }}},\hat{\phi }|{\varvec{\tau }}, {\varvec{y}}\right) + 2\{d'q + p + 1 + (d'-1) \}, \end{aligned}
(1)
\begin{aligned} BIC(d')= & {} -2\log l\left( \hat{\varvec{\beta }},{\hat{\varvec{\alpha }}},\hat{\phi }|{\varvec{\tau }}, {\varvec{y}}\right) + \{d'q + p + 1 + (d'-1) \}\log n. \end{aligned}
(2)
Recently, the minimum description length (MDL) theorem-based criterion (see Rissanen (2007) for details) has also been used in the analysis of change points with a genetic algorithm (see Davis et al. (2006) and Lu et al. (2010)). However, the MDL-based criterion has almost the same value as the BIC in our simulation setting; therefore, we used the AIC and BIC in this study.
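The penalty terms in (1) and (2) can be checked with a short sketch. It assumes that `loglik` is the maximized log likelihood of the fitted model with $$d'$$ segments; the function names are hypothetical.

```python
import math

def n_free_params(d, q, p):
    """Number of free parameters in (1) and (2) for a model with d segments."""
    # d*q segment coefficients + p shared coefficients
    # + 1 dispersion parameter + (d - 1) change points
    return d * q + p + 1 + (d - 1)

def aic(loglik, d, q, p):
    return -2.0 * loglik + 2.0 * n_free_params(d, q, p)

def bic(loglik, d, q, p, n):
    return -2.0 * loglik + n_free_params(d, q, p) * math.log(n)
```

For example, with $$d'=3$$, $$q=2$$, and $$p=1$$ there are 10 free parameters, so the BIC penalty exceeds the AIC penalty as soon as $$\log n > 2$$, that is, for $$n \ge 8$$.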

3 Model construction

3.1 Hierarchical splitting

The HS algorithm is an iterative method similar to classification and regression trees. As the first step of the algorithm, all learning samples $${\mathcal {L}}$$ are split into two segments based on the splitting rule $$x_{i1}\le \tau '$$. To determine the splitting rule $$x_{i1}\le \tau '$$, that is, to estimate the change point $$\tau '$$, all possible splits are evaluated and the split that maximizes the sum of the log likelihoods of both segments is selected as the optimum.

As the second step, the next splitting rule $$x_{i1}\le \tau ''$$ is determined under the assumption that the splitting rule $$x_{i1}\le \tau '$$ is retained. That is, all learning samples $${\mathcal {L}}$$ are split into three segments based on the two splitting rules $$x_{i1}\le \tau '$$ and $$x_{i1}\le \tau ''$$. The second rule $$x_{i1}\le \tau ''$$ is chosen, from all possible splits, as the one that maximizes the sum of the log likelihoods of the three segments. It should be noted that all possible splits must satisfy the condition that the number of samples included in each segment is larger than q.

By repeating this procedure, we construct the model with $$d-1$$ change points. The stopping rule for the algorithm is defined by the number of known change points or information criteria such as the AIC or BIC. The HS algorithm used in this study is described as follows:
1. The initial set of change points is given by $$T_1 = \{-\infty , +\infty \}$$.

2. For $$k\leftarrow 2$$ to the number of known segments d, or the predefined maximum search number of segments $$d^{\max }$$, do

3. Find the set of possible change points $$T_k'=\{(T_{k-1}, \tau '_{(1)}), (T_{k-1}, \tau '_{(2)}), \ldots \}$$ which segments the data into k segments under the assumption that the splitting rule $$T_{k-1}$$ is given. Here $$\tau '_{(1)}, \tau '_{(2)}, \ldots$$ are candidates for the kth change point.

4. Define the optimal set of k change points by
\begin{aligned} T_k = \arg \max _{(T_{k-1}, \tau '_{(l)})\in T_k'} l\left( \hat{\varvec{\beta }},{\hat{\varvec{\alpha }}},\hat{\phi }|(T_{k-1}, \tau '_{(l)}),{\varvec{y}}\right) . \end{aligned}

5. end

6. If the number of segments d is unknown, the optimal set of change points is estimated by minimizing (1) or (2) as follows:
\begin{aligned} T_d = \arg \min _{T_{d'}\in \{T_1, T_2, \ldots , T_{\max }\}} AIC(d'), \end{aligned}
or
\begin{aligned} T_d = \arg \min _{T_{d'}\in \{T_1, T_2, \ldots , T_{\max }\}} BIC(d'). \end{aligned}

7. The estimated linear predictor model is given by
\begin{aligned} {\hat{\eta }}_i^\mathrm{HS} = \sum _{k=1}^{d} I({\hat{\tau }}_{k-1} < x_{i1} \le {\hat{\tau }}_{k}){\varvec{x}}_i'\hat{\varvec{\beta }}_k + {\varvec{z}}_i'{\hat{\varvec{\alpha }}}, \end{aligned}
(3)
where $$I(\cdot )$$ represents the indicator function, and $$({\hat{\tau }}_0=-\infty< {\hat{\tau }}_1< {\hat{\tau }}_2< \cdots< {\hat{\tau }}_{d-1}< {\hat{\tau }}_d=+\infty )$$ are values obtained by rearranging the elements in $$T_d$$ in ascending order.
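The steps above can be sketched for a known number of segments d as follows. This is a minimal illustration, not the implementation used in this study: it assumes a Gaussian response with identity link and a single variable x with segment-specific intercept and slope, so that maximizing the log likelihood under a common dispersion parameter is equivalent to minimizing the residual sum of squares of the joint fit. The names `design`, `rss`, and `hierarchical_split` and the default `min_size=10` are assumptions made for the sketch.

```python
import numpy as np

def design(x, z, taus):
    """Piecewise design matrix and segment sizes for interior change points `taus`."""
    taus = np.sort(np.asarray(taus, dtype=float))
    seg = np.searchsorted(taus, x, side="left")  # k with tau_{k-1} < x <= tau_k
    d = len(taus) + 1
    X = np.zeros((len(x), 2 * d))
    rows = np.arange(len(x))
    X[rows, 2 * seg] = 1.0       # segment-specific intercept
    X[rows, 2 * seg + 1] = x     # segment-specific slope on x
    return np.column_stack([X, z]), np.bincount(seg, minlength=d)

def rss(x, z, y, taus, min_size):
    """Residual sum of squares of the joint fit; inf if any segment is too small."""
    X, sizes = design(x, z, taus)
    if sizes.min() < min_size:
        return np.inf
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return float(resid @ resid)

def hierarchical_split(x, z, y, d, min_size=10):
    """Greedy search: earlier change points stay fixed while the next one is added."""
    taus = []
    candidates = np.unique(x)[:-1]   # a split at the largest x leaves an empty segment
    for _ in range(d - 1):
        scores = [rss(x, z, y, taus + [c], min_size) for c in candidates]
        best = int(np.argmin(scores))
        if not np.isfinite(scores[best]):
            break                    # no admissible split remains
        taus.append(float(candidates[best]))
    return sorted(taus)
```

Each pass over the candidates refits the whole model because the coefficient on z is shared by all segments. For a non-Gaussian response, `rss` would be replaced by the negative maximized log likelihood of the corresponding GLM fit.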

The obvious advantages of the HS algorithm are its ease of implementation and its computational speed. For a known number of segments d, the computational cost is of the order of $$O(n(d-1))$$ and, therefore, the amount of calculation required is much smaller than that of the grid search when n or d is large.
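To give a rough sense of the difference, the following sketch counts the number of model fits required by each search. It is a simplification that ignores the minimum segment size restriction and treats every observed value of $$x_{i1}$$ except the largest as a candidate; the function names are hypothetical.

```python
import math

def grid_search_fits(n_candidates, d):
    """Number of ways to choose d - 1 change points jointly from the candidates."""
    return math.comb(n_candidates, d - 1)

def hs_fits(n_candidates, d):
    """Upper bound for HS: each of the d - 1 steps scans the candidates once."""
    return n_candidates * (d - 1)
```

With 99 candidates and $$d=3$$ segments, the grid search requires 4851 fits, whereas the HS algorithm requires at most 198.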

On the other hand, the disadvantage is that the change points that are determined as optimal in previous steps are fixed, and cannot be changed in later steps. Because of this lack of flexibility in the HS algorithm, there are often cases in which the optimal combination of change points is not found. Thus, the variance of the estimates of change points will become large. This is also shown in the simulation results of Sect. 5. Owing to this disadvantage, there is a risk that the model obtained by the HS algorithm has low prediction accuracy. Moreover, as discussed in the introduction, the distribution of the estimators given by the HS algorithm is not well known. Obviously, there are cases where the estimates of the change points are not the MLE. Therefore, for example, in the construction of the confidence interval, there is a risk that the interval will be largely miscalculated if asymptotic normality is assumed. To deal with these problems, we consider the application of the bootstrap method to the model, and study the performance through simulations in the following sections.

3.2 Bagging

To construct a model with better prediction accuracy, we consider the use of the bagging algorithm. The bagging algorithm (Breiman 1996) is a representative parallel ensemble method that constructs a set of base models and combines them. In general, the base models are called base learners or base predictors in the field of machine learning. It is well known that bagging can reduce the prediction error substantially by exploiting the independence between the base learners. As stated by Zhou (2012), for a regression problem, the degree of improvement of the mean squared error by bagging depends on the instability of the base learners. Because the variance of estimators of change points given by the HS algorithm is expected to be large, as discussed above, the instability of the obtained model will also increase. Therefore, the bagging algorithm is expected to work effectively in the construction of models that include change points.

To construct multiple base models, bootstrap samples are used. In the framework of linear models, there are two main bootstrap methods (Fox 2015). One is the method that treats the explanatory variables as random and constructs a set of bootstrap samples directly from the observed learning sample. The other method treats the explanatory variables as fixed and samples a set of residuals from the model fitted by using $${\mathcal {L}}$$. Then, the bootstrap sample of the response variable $$y_i$$ corresponding to $$({\varvec{x}}_i, {\varvec{z}}_i)$$ is constructed as the sum of the predicted value from the fitted model given $$({\varvec{x}}_i, {\varvec{z}}_i)$$ and the sampled residual.

The latter approach assumes that the fitted regression model using $${\mathcal {L}}$$ is correct and the errors are identically distributed. This assumption roughly holds in general OLM, but it is unlikely that this assumption would hold in our model. First, the theoretical errors of GLM corresponding to different explanatory variables are usually different. In addition, the fitted model with change points using the HS algorithm is at high risk of instability due to the reasons discussed above. For these reasons, we use the former method which directly resamples the original data set $${\mathcal {L}}$$.

The bagging algorithm used in this study to construct a prediction model is described as follows:
1. For $$b\leftarrow 1$$ to the predefined number of iterations B do

2. Construct a set of bootstrap samples $${\mathcal {L}}^{(b)}$$ by sampling with replacement from $${\mathcal {L}}$$.

3. Estimate a linear predictor model $${\hat{\eta }}_i^{HS(b)}$$ by using the HS algorithm based on $${\mathcal {L}}^{(b)}$$:
\begin{aligned} {\hat{\eta }}_i^{HS(b)} = \sum _{k=1}^{d^{(b)}} I\left( {\hat{\tau }}_{k-1}^{(b)} < x_{i1} \le {\hat{\tau }}_{k}^{(b)}\right) {\varvec{x}}_i'\hat{\varvec{\beta }}_k^{(b)} + {\varvec{z}}_i'{\hat{\varvec{\alpha }}}^{(b)}, \end{aligned}
where $$d^{(b)}$$ is the known number of segments d or, if the number of segments is unknown, the number estimated by using the AIC or BIC based on $${\mathcal {L}}^{(b)}$$.

4. end

5. The estimated linear predictor model is given by
\begin{aligned} {\hat{\eta }}_i^\mathrm{Bag} = \frac{1}{B}\sum _{b=1}^B \left\{ \sum _{k=1}^{d^{(b)}} I\left( {\hat{\tau }}_{k-1}^{(b)} < x_{i1} \le {\hat{\tau }}_{k}^{(b)}\right) {\varvec{x}}_i'\hat{\varvec{\beta }}_k^{(b)} + {\varvec{z}}_i'{\hat{\varvec{\alpha }}}^{(b)} \right\} . \end{aligned}
(4)
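The bagging steps can be sketched as follows for a known number of segments d. As before, this is an illustration under simplifying assumptions (Gaussian response with identity link, one variable x with segment-specific intercept and slope, least squares in place of a general GLM fit) rather than the implementation used in this study, and the names `design`, `fit_hs`, `bagged_hs`, and `predict_bagged` are hypothetical.

```python
import numpy as np

def design(x, z, taus):
    """Piecewise design matrix and segment sizes for interior change points `taus`."""
    taus = np.sort(np.asarray(taus, dtype=float))
    seg = np.searchsorted(taus, x, side="left")  # k with tau_{k-1} < x <= tau_k
    d = len(taus) + 1
    X = np.zeros((len(x), 2 * d))
    rows = np.arange(len(x))
    X[rows, 2 * seg] = 1.0
    X[rows, 2 * seg + 1] = x
    return np.column_stack([X, z]), np.bincount(seg, minlength=d)

def fit_hs(x, z, y, d, min_size=10):
    """Greedy hierarchical splitting; returns (change points, coefficients)."""
    taus = []
    for _ in range(d - 1):
        best_rss, best_tau = np.inf, None
        for c in np.unique(x)[:-1]:
            X, sizes = design(x, z, taus + [c])
            if sizes.min() < min_size:
                continue
            coef = np.linalg.lstsq(X, y, rcond=None)[0]
            resid = y - X @ coef
            if resid @ resid < best_rss:
                best_rss, best_tau = resid @ resid, float(c)
        if best_tau is None:
            break
        taus.append(best_tau)
    taus = np.array(sorted(taus))
    X, _ = design(x, z, taus)
    return taus, np.linalg.lstsq(X, y, rcond=None)[0]

def bagged_hs(x, z, y, d, B=50, min_size=10, seed=0):
    """Fit one HS model per bootstrap sample drawn by resampling rows of the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # sampling with replacement
        models.append(fit_hs(x[idx], z[idx], y[idx], d, min_size))
    return models

def predict_bagged(models, x_new, z_new):
    """Average the linear predictors of the base models, as in (4)."""
    preds = [design(x_new, z_new, taus)[0] @ coef for taus, coef in models]
    return np.mean(preds, axis=0)
```

Because each base model carries its own change points, the averaged predictor is no longer a single piecewise linear function of x. The list returned by `bagged_hs` also stores the change point estimates of every bootstrap sample.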

Although it is possible to use the estimated linear predictor model (4) directly for prediction, it is difficult to interpret it. That is, in the model obtained by the HS algorithm, the point estimates of the location of change points are obviously $${\hat{\tau }}_1, {\hat{\tau }}_2, \ldots , {\hat{\tau }}_{d-1}$$ in (3). On the other hand, the point estimates of the change points in (4) are not clear. To estimate the change points based on (4), we consider the use of the mean values or medians of change points for all base models when the number of change points is known (that is $$d^{(b)} = d$$). That is, the estimate of $$\tau _k$$ is given by
\begin{aligned} {\hat{\tau }}_k^* = \frac{1}{B}\sum _{b=1}^B {\hat{\tau }}_k^{(b)}, \end{aligned}
(5)
or
\begin{aligned} {\tilde{\tau }}_k^* = \mathrm{Median}\left( {\hat{\tau }}_k^{(1)}, {\hat{\tau }}_k^{(2)}, \ldots , {\hat{\tau }}_k^{(B)}\right) . \end{aligned}
(6)
In practice, when we analyze actual data, we first need to fix d by estimating it from $${\mathcal {L}}$$ with the AIC or BIC (see Sect. 6 for an example).
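A small numerical illustration of (5) and (6) is given below. The bootstrap estimates are made-up values for $$B=5$$ and two change points, chosen so that one replicate contains an extreme estimate of the first change point.

```python
import numpy as np

# Hypothetical bootstrap estimates: one row per bootstrap sample, one column per change point.
taus_boot = np.array([
    [2.9, 6.1],
    [3.1, 5.9],
    [3.0, 6.0],
    [4.8, 6.2],   # extreme estimate of the first change point
    [3.2, 6.0],
])

tau_mean = taus_boot.mean(axis=0)          # estimator (5)
tau_median = np.median(taus_boot, axis=0)  # estimator (6)
```

Here the mean of the first change point is pulled to 3.4 by the single extreme value, whereas the median stays at 3.1; for the second change point the two estimators nearly coincide (6.04 and 6.0).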

From the central limit theorem, if the bootstrap distribution used in bagging mimics the population distribution well, the estimator $${\hat{\tau }}_k^*$$ is expected to follow a normal distribution asymptotically. On the other hand, $${\tilde{\tau }}_k^*$$ is expected to be robust against extreme estimates. Because it is predicted that the variance of estimators of change points given by the HS algorithm will become large, there is a possibility of obtaining an extreme estimated value. We expect $${\tilde{\tau }}_k^*$$ to deal with this problem. To clarify the notation, we express the estimate of $$\tau _k$$ which is given by the HS algorithm on $${\mathcal {L}}$$ as $${\hat{\tau }}_k^\mathrm{HS}$$. The performance of these estimators is compared through simulations in Sect. 5.

4 Confidence intervals

As discussed in the introduction, the distributions of estimators of the change points given by the HS algorithm are not clear. In this study, we compare three methods to construct the confidence intervals of change points. The first is the widely used method to construct confidence intervals by bootstrap, called the basic bootstrap. If the bootstrap distribution mimics the population distribution well, the correct confidence interval for the true parameter of the bootstrap distribution should converge to the correct confidence interval for the true parameter of the population distribution. That is, the behavior of $$\tau _k - {\hat{\tau }}_k^\mathrm{HS}$$ is approximately the same as the behavior of $${\hat{\tau }}_k^\mathrm{HS} - {\hat{\tau }}_k^*$$, where $${\hat{\tau }}_k^*$$ is the estimator of $$\tau _k$$ by the HS algorithm based on the bootstrap samples. From this, the confidence interval of $$\tau _k$$ at level $$1-\alpha$$ is given by
\begin{aligned} {\mathrm{CI}}_{\mathrm{basic}} = \left[ 2{\hat{\tau }}_k^\mathrm{HS}-{\hat{\tau }}_{k,1-\alpha /2}^*, 2{\hat{\tau }}_k^\mathrm{HS}-{\hat{\tau }}_{k,\alpha /2}^*\right] , \end{aligned}
(7)
where $${\hat{\tau }}_{k,\alpha }^*$$ represents the lower empirical distribution percentile at level $$\alpha$$ which is given by the $$100\times \alpha$$ percentile in $$({\hat{\tau }}_k^{(1)}, {\hat{\tau }}_k^{(2)}, \ldots , {\hat{\tau }}_k^{(B)})$$.
The second and third methods use the percentiles of the empirical distribution of the estimator more directly. The difference between the two methods lies in determining the confidence limits. In the second method, the confidence interval is constructed so that both sides of the interval have equal probabilities for the empirical distribution of the estimator. That is, the confidence interval at level $$1-\alpha$$ is given by
\begin{aligned} {\mathrm{CI}}_{\mathrm{equal}} = \left[ {\hat{\tau }}_{k,\alpha /2}^*,~ {\hat{\tau }}_{k,1-\alpha /2}^*\right] . \end{aligned}
(8)
This method is called the percentile bootstrap. If the empirical distribution is symmetric and centered on $${\hat{\tau }}_k^\mathrm{HS}$$, this interval is expected to work well. The details of the basic and percentile bootstraps are described in Efron and Tibshirani (1993) and Davison and Hinkley (1997).
In the third method, the confidence interval is constructed so that the interval is the shortest among the intervals satisfying the level $$1-\alpha$$ for the empirical distribution. Of course, both sides of the interval will no longer have the same probability. The confidence interval at level $$1-\alpha$$ is given by
\begin{aligned} {\mathrm{CI}}_{\mathrm{unequal}} = \left[ {\hat{\tau }}_{k,L(\alpha )}^*,~ {\hat{\tau }}_{k,U(\alpha )}^*\right] , \end{aligned}
(9)
where $${\hat{\tau }}_{k,L(\alpha )}^*$$ and $${\hat{\tau }}_{k,U(\alpha )}^*$$ are the limits of the shortest interval among the intervals that include $$100\times (1-\alpha )$$ percent of the points $$({\hat{\tau }}_k^{(1)}, {\hat{\tau }}_k^{(2)}, \ldots , {\hat{\tau }}_k^{(B)})$$.

When the empirical distribution has a long tail or is very skewed, it is expected that the interval (8) would be much longer than (9). In this case, the interval (8) will be too conservative compared with (9). On the other hand, when the distribution is almost symmetric, these two intervals will be almost the same. In the next section, we compare the performance of these intervals through the simulation studies.
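The three intervals (7), (8), and (9) can be computed from the bootstrap estimates of one change point as in the following sketch. The function names are hypothetical, `np.quantile` is used with its default linear interpolation, and the shortest interval is taken over windows of consecutive order statistics covering at least $$100\times (1-\alpha )$$ percent of the bootstrap estimates; other percentile conventions would give slightly different limits.

```python
import numpy as np

def basic_ci(tau_hs, boot, alpha=0.05):
    """Basic bootstrap interval (7): percentiles reflected around the HS estimate."""
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return 2 * tau_hs - hi, 2 * tau_hs - lo

def percentile_ci(boot, alpha=0.05):
    """Equal-tailed percentile interval (8)."""
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def shortest_ci(boot, alpha=0.05):
    """Shortest interval (9) containing at least (1 - alpha) of the bootstrap estimates."""
    s = np.sort(boot)
    B = len(s)
    # Number of estimates the interval must contain; the small offset guards
    # against floating point error in (1 - alpha) * B.
    m = int(np.ceil((1 - alpha) * B - 1e-9))
    widths = s[m - 1:] - s[:B - m + 1]   # width of every window of m consecutive values
    i = int(np.argmin(widths))
    return s[i], s[i + m - 1]
```

For a right-skewed set of bootstrap estimates, `shortest_ci` drops the long upper tail and keeps the lower end of the distribution, which is the situation described above in which (8) is much longer than (9).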

5 Simulations

5.1 Models and setting

We present simulation studies to compare the HS algorithm and the algorithm based on the bootstrap method described in previous sections. We compare the algorithms in terms of the prediction accuracy of the obtained model, bias and variance of the estimator of the change point, and the accuracy of the confidence interval of the change point. We used the data generated from two models. In the first model, we assume the multivariate regression model with two change points and the responses are generated from the following model:
\begin{aligned} \eta _i = \mu _i =\left\{ \begin{array}{ll} -2 + 0.5x_{i} + 0.2 z_i, &{}\quad -\infty< x_{i} \le 3, \\ 2 - x_{i} + 0.2 z_i, &{}\quad 3< x_{i} \le 6, \\ -7 + x_{i} + 0.2 z_i, &{}\quad 6 < x_{i} \le +\infty ,\\ \end{array} \right. \end{aligned}
(10)
where the explanatory variable $$x_i$$, on which the change points are defined, and $$z_i$$, which has a common coefficient in all segments, both follow a uniform distribution between 0 and 10. The dispersion parameters are set to 1 for all segments. We call this Model 1. One simulated data set generated from Model 1 is shown in Fig. 1.

Fig. 1 One simulated data set of $$n=100$$ samples generated from Model 1. The red line is the true mean value given by (10), in which $$z_i$$ is fixed at the average value of $$z_i$$ over all 100 samples

In the second model, we assume the Poisson regression with two change points. The responses are generated from the following model:
\begin{aligned} \eta _i = g(\mu _i) =\left\{ \begin{array}{ll} 1 + 0.4x_{i} + 0.2 z_i, &{}\quad -\infty< x_{i} \le 3 ,\\ 3 - 0.3x_{i} + 0.2 z_i, &{}\quad 3< x_{i} \le 6 ,\\ 0.3x_{i} + 0.2 z_i, &{}\quad 6 < x_{i} \le +\infty ,\\ \end{array} \right. \end{aligned}
(11)
where $$g(\cdot )$$ is the log link function. The explanatory variables $$x_i$$ and $$z_i$$ follow a uniform distribution between 0 and 10 as in the first model. We call this Model 2. One simulated data set generated from this model is shown in Fig. 2.

Fig. 2 One simulated data set of $$n=100$$ samples generated from Model 2. The red curve is the true mean value given by (11), in which $$z_i$$ is fixed at the average value of $$z_i$$ over all 100 samples
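For readers who wish to reproduce data of this form, the two models can be simulated as in the following sketch. It assumes that the response in Model 1 is normally distributed with variance 1 (our reading of a dispersion parameter of 1 with identity link); the function names and the use of NumPy are assumptions of the sketch, since the simulations in this study were run in MATLAB.

```python
import numpy as np

def mean_model1(x, z):
    """True mean of Model 1, Eq. (10) (identity link)."""
    return np.where(x <= 3, -2 + 0.5 * x,
                    np.where(x <= 6, 2 - x, -7 + x)) + 0.2 * z

def eta_model2(x, z):
    """True linear predictor of Model 2, Eq. (11) (log link)."""
    return np.where(x <= 3, 1 + 0.4 * x,
                    np.where(x <= 6, 3 - 0.3 * x, 0.3 * x)) + 0.2 * z

def simulate(n, model=1, seed=None):
    """Draw n samples; returns x (ascending), z, the response y, and the true mean mu."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(0, 10, n))   # samples are arranged in ascending order of x
    z = rng.uniform(0, 10, n)
    if model == 1:
        mu = mean_model1(x, z)
        y = rng.normal(mu, 1.0)          # assumed Gaussian errors with variance 1
    else:
        mu = np.exp(eta_model2(x, z))
        y = rng.poisson(mu)
    return x, z, y, mu
```

The returned `mu` is the true mean of each sample, which is the quantity needed to evaluate prediction error against the truth on simulated test data.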

The number of learning samples n is set to 100 or 300. We set the minimum number of learning samples included in a segment to 10 for efficient calculation. When the number of change points is unknown, we set the maximum search number of segments $$d^{\max }$$ to 6. The number of bootstrap iterations B is set to 500. Since there were no major differences in the simulation results under other settings of these values, we only show the results under these settings here. Simulations are repeated 300 times in every data group.

We used a workstation with an Intel Core i7-4600U CPU (Intel Corp.). All the simulations and analyses were performed in MATLAB (Version R2017b, MathWorks Inc.). A typical single simulation run to obtain $${\hat{\eta }}^\mathrm{HS}$$ and $${\hat{\eta }}^\mathrm{Bag}$$ for Model 1 with $$n = 100$$ required 2 and 443 s, respectively. For $$n=300$$, it required 8 s for $${\hat{\eta }}^\mathrm{HS}$$ and 2330 s for $${\hat{\eta }}^\mathrm{Bag}$$. In Model 2 with $$n=100$$, typical running times for $${\hat{\eta }}^\mathrm{HS}$$ and $${\hat{\eta }}^\mathrm{Bag}$$ were 1.5 and 313 s, respectively. For $$n=300$$, they were 7 s for $${\hat{\eta }}^\mathrm{HS}$$ and 2070 s for $${\hat{\eta }}^\mathrm{Bag}$$.

5.2 Comparison of model prediction accuracy

To compare the model accuracy of the two algorithms, we used the mean squared error of the prediction for test data. Let $$\hat{\mu }_i$$ ($$i=1, 2, \ldots , n_\mathrm{test}$$) be the predicted mean value of a test sample $$(x_i, z_i)$$ given by (3) or (4), which is constructed using the learning samples, and let $$\mu _i$$ be the corresponding true mean value. Then, the mean squared error of the prediction for test data is given by
\begin{aligned} \mathrm{MSE}(\hat{\mu }) = \frac{1}{n_\mathrm{test}}\sum _{i=1}^{n_\mathrm{test}}(\hat{\mu }_i - \mu _i)^2. \end{aligned}
The number of test samples $$n_\mathrm{test}$$ is set to 1000.
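As a minimal sketch, this criterion can be computed as below. The function name `mse` is hypothetical. Note that the predictions are compared with the true mean values $$\mu _i$$ of the test samples, not with their noisy responses.

```python
def mse(predicted_means, true_means):
    """Mean squared error between predicted and true mean values of the test samples."""
    if len(predicted_means) != len(true_means):
        raise ValueError("both sequences must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted_means, true_means)]
    return sum(squared_errors) / len(squared_errors)

# Toy usage with three hypothetical test samples.
example = mse([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
```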
The results of the simulations are listed in Table 1. The table lists the average values and standard deviations of $$\mathrm{MSE}(\hat{\mu })$$ over all simulations for both cases where the number of change points is known or unknown. As expected, the prediction accuracy of the model given by the bagging algorithm is better than that given by the HS algorithm for all simulation settings. The average values and the standard deviations of $$\mathrm{MSE}(\hat{\mu })$$ for $${\hat{\eta }}^\mathrm{Bag}$$ are lower than those for $${\hat{\eta }}^\mathrm{HS}$$.
Table 1 Comparison of the simulation results of model prediction accuracy for the HS algorithm and bagging algorithm when the true model contains change points

| Model | n | Algorithm | d known | d unknown (AIC) | d unknown (BIC) |
| --- | --- | --- | --- | --- | --- |
| Model 1 | 100 | $${\hat{\eta }}^\mathrm{HS}$$ | 0.22 (0.13) | 0.24 (0.12) | 0.24 (0.14) |
| | | $${\hat{\eta }}^\mathrm{Bag}$$ | 0.16 (0.07) | 0.18 (0.07) | 0.16 (0.07) |
| | 300 | $${\hat{\eta }}^\mathrm{HS}$$ | 0.08 (0.05) | 0.12 (0.04) | 0.06 (0.03) |
| | | $${\hat{\eta }}^\mathrm{Bag}$$ | 0.06 (0.03) | 0.07 (0.02) | 0.05 (0.02) |
| Model 2 | 100 | $${\hat{\eta }}^\mathrm{HS}$$ | 9.17 (3.65) | 7.58 (4.28) | 7.14 (3.78) |
| | | $${\hat{\eta }}^\mathrm{Bag}$$ | 5.56 (2.15) | 5.65 (2.50) | 5.00 (2.17) |
| | 300 | $${\hat{\eta }}^\mathrm{HS}$$ | 6.54 (1.81) | 3.25 (1.34) | 1.92 (1.06) |
| | | $${\hat{\eta }}^\mathrm{Bag}$$ | 3.61 (1.05) | 2.12 (0.77) | 1.63 (0.68) |

The values in the table represent the average values of $$\mathrm{MSE}(\hat{\mu })$$ in 300 simulations. The values in parentheses represent the standard deviations of $$\mathrm{MSE}(\hat{\mu })$$ in the simulations

Models 1 and 2 are given by (10) and (11), respectively, and $${\hat{\eta }}^\mathrm{HS}$$ and $${\hat{\eta }}^\mathrm{Bag}$$ are given by (3) and (4), respectively

For Model 1, which is the multivariate regression model with two change points, there is only a slight difference in the results of $${\hat{\eta }}^\mathrm{Bag}$$ between the cases where the number of change points is known and unknown. For the results of $${\hat{\eta }}^\mathrm{HS}$$ when $$n=300$$, the model obtained by using BIC is about twice as accurate as that obtained by using AIC. The reason for this is that AIC tends to overestimate the number of change points.

For Model 2, which is the Poisson regression model with two change points, the model obtained by using BIC has the greatest accuracy of the three patterns (d known, AIC, and BIC). This result is somewhat strange, because the accuracy of the model obtained in the case where d is unknown is higher than in the case where d is known. For $${\hat{\eta }}^\mathrm{HS}$$, a possible reason is that the estimates of the change points can be largely incorrect. This is discussed in the next simulation result.

For $${\hat{\eta }}^\mathrm{Bag}$$, this result seems to be due to the diversity of the base models. That is, when the number of change points is known, the base models that constitute the model (4) all have the same number of segments. On the other hand, when the number of change points is unknown, the base models have several different numbers of segments. As a result, the diversity of the base models included in the estimated model is higher when d is unknown than when d is known. See Zhou (2012) for details on the diversity of the bagging algorithm.
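The averaging step that underlies bagging can be illustrated with a short Python sketch. It is written under strong simplifying assumptions and is not the procedure of this paper: the base learner `fit_stump` is a hypothetical one-change-point, piecewise-constant least-squares fit on a single variable (not the HS algorithm, and not a GLM with a common covariate), and all function names are illustrative.

```python
import random

def fit_stump(xs, ys, min_seg=2):
    """Least-squares fit of a one-change-point, piecewise-constant model.

    Returns (tau, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best = None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        # Skip splits that leave too few points or fall between tied x values.
        if i < min_seg or n - i < min_seg or pairs[i - 1][0] == pairs[i][0]:
            continue
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        # Minimising the residual sum of squares is equivalent to maximising this score.
        score = i * left_mean ** 2 + (n - i) * right_mean ** 2
        if best is None or score > best[0]:
            tau = 0.5 * (pairs[i - 1][0] + pairs[i][0])
            best = (score, tau, left_mean, right_mean)
    if best is None:
        raise ValueError("no admissible split")
    return best[1:]

def predict_stump(model, x):
    tau, left_mean, right_mean = model
    return left_mean if x <= tau else right_mean

def bagged_models(xs, ys, n_boot, rng):
    """Fit one base model per bootstrap resample (sampling with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_bagged(models, x):
    """Bagged prediction: the average of the base-model predictions."""
    return sum(predict_stump(m, x) for m in models) / len(models)

# Toy data with a single jump at x = 5 (hypothetical, not Model 1 or 2).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [(2.0 if x > 5 else 0.0) + rng.gauss(0, 0.3) for x in xs]
models = bagged_models(xs, ys, 50, rng)
```

Because each bootstrap resample may place its change point at a slightly different location, the averaged prediction changes gradually near the jump instead of in a single step. Letting the base models also differ in their number of segments, as in the case where d is unknown, adds a further source of diversity.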

5.3 Comparison of change point estimators

To compare the three change point estimators $${\hat{\tau }}_k^\mathrm{HS}$$, $${\hat{\tau }}_k^*$$, and $${\tilde{\tau }}_k^*$$, we used the average values and standard deviations of the estimators over all simulations for the case when the number of change points is known. In addition, histograms of the estimates in all simulations are used for visual verification. The results of the simulations are shown in Table 2 and Figs. 3 and 4. For $$n = 300$$, the spread of the histograms becomes smaller but their shape is almost the same as for $$n = 100$$, so we only show the results for $$n = 100$$ in Figs. 3 and 4.
Table 2 The simulation results of the comparison of three change point estimators. The values in the table represent the average values of the estimators in 300 simulations. The values in parentheses represent the standard deviations of the estimators in the simulations.

| Model | n | Estimator | $$\tau _1 = 3$$ | $$\tau _2 = 6$$ |
| --- | --- | --- | --- | --- |
| Model 1 | 100 | $${\hat{\tau }}_k^\mathrm{HS}$$ | 3.16 (1.02) | 5.88 (0.63) |
| | | $${\hat{\tau }}_k^*$$ | 3.29 (0.44) | 5.97 (0.35) |
| | | $${\tilde{\tau }}_k^*$$ | 3.31 (0.69) | 5.95 (0.36) |
| | 300 | $${\hat{\tau }}_k^\mathrm{HS}$$ | 2.99 (0.76) | 5.98 (0.19) |
| | | $${\hat{\tau }}_k^*$$ | 3.14 (0.41) | 5.92 (0.16) |
| | | $${\tilde{\tau }}_k^*$$ | 3.11 (0.60) | 5.99 (0.13) |
| Model 2 | 100 | $${\hat{\tau }}_k^\mathrm{HS}$$ | 3.29 (0.78) | 5.15 (0.84) |
| | | $${\hat{\tau }}_k^*$$ | 3.24 (0.39) | 5.31 (0.41) |
| | | $${\tilde{\tau }}_k^*$$ | 3.33 (0.56) | 5.24 (0.68) |
| | 300 | $${\hat{\tau }}_k^\mathrm{HS}$$ | 3.41 (0.63) | 5.13 (0.82) |
| | | $${\hat{\tau }}_k^*$$ | 3.33 (0.30) | 5.13 (0.37) |
| | | $${\tilde{\tau }}_k^*$$ | 3.38 (0.46) | 5.12 (0.71) |

Models 1 and 2 are given by (10) and (11), respectively, $${\hat{\tau }}_k^\mathrm{HS}$$ is given by the HS algorithm, and $${\hat{\tau }}_k^*$$ and $${\tilde{\tau }}_k^*$$ are given by (5) and (6), respectively

Fig. 3 The relative frequencies of the estimated change points in all simulations for Model 1 and $$n=100$$. The horizontal axis represents $$x_i$$. The red line is the true location of the change points ($$\tau _1 = 3$$, $$\tau _2 = 6$$)

Fig. 4 The relative frequencies of the estimated change points in all simulations for Model 2 and $$n=100$$. The horizontal axis represents $$x_i$$. The red line is the true location of the change points ($$\tau _1 = 3$$, $$\tau _2 = 6$$)

For Model 1, the average values of the estimates for $${\hat{\tau }}_k^\mathrm{HS}$$, $${\hat{\tau }}_k^*$$, and $${\tilde{\tau }}_k^*$$ are similar. On the other hand, the standard deviations differ considerably. The standard deviation of $${\hat{\tau }}_k^*$$ is the smallest, followed by that of $${\tilde{\tau }}_k^*$$, and that of $${\hat{\tau }}_k^\mathrm{HS}$$ is the largest. From Fig. 3, it can be seen that the empirical distributions of the estimates for $${\hat{\tau }}_1^\mathrm{HS}$$, $${\hat{\tau }}_1^*$$, and $${\tilde{\tau }}_1^*$$ on $$\tau _1$$ are unimodal, and the dispersion of $${\hat{\tau }}_1^*$$ is the smallest, though its center is a little biased. Although the center of $${\hat{\tau }}_1^\mathrm{HS}$$ is almost unbiased, its standard deviation is about twice as large as that of $${\hat{\tau }}_1^*$$. The empirical distributions of $${\hat{\tau }}_2^\mathrm{HS}$$ and $${\tilde{\tau }}_2^*$$ for $$\tau _2$$ are slightly bimodal towards the center. Here $${\hat{\tau }}_2^*$$ has a unimodal empirical distribution with an almost unbiased center.

For Model 2, the average values of the estimates for $${\hat{\tau }}_k^\mathrm{HS}$$, $${\hat{\tau }}_k^*$$, and $${\tilde{\tau }}_k^*$$ are slightly biased. In particular, for $$\tau _2$$, the differences between the average values of the estimates and the true value are about 0.8 for all estimators. As in the case of Model 1, the standard deviation of $${\hat{\tau }}_k^*$$ is the smallest, followed by that of $${\tilde{\tau }}_k^*$$, and that of $${\hat{\tau }}_k^\mathrm{HS}$$ is the largest. The standard deviation of $${\hat{\tau }}_k^\mathrm{HS}$$ is almost double that of $${\hat{\tau }}_k^*$$ for all patterns. From Fig. 4, it is clear that the empirical distributions of the estimators are biased. As in the case of Model 1, $${\hat{\tau }}_1^*$$ and $${\hat{\tau }}_2^*$$ have unimodal empirical distributions. The empirical distributions of $${\hat{\tau }}_1^\mathrm{HS}$$ and $${\tilde{\tau }}_1^*$$ for $$\tau _1$$ are skewed towards $$\tau _2$$. The empirical distributions of $${\hat{\tau }}_2^\mathrm{HS}$$ and $${\tilde{\tau }}_2^*$$ for $$\tau _2$$ are obviously bimodal, where one center is about the true value of $$\tau _2$$ and the other center is biased towards $$\tau _1$$.

These simulations show that, depending on the model, there are cases where the obtained estimates of the change points have an obvious bias. The estimates tend to be pulled in the direction of the other true change point. The empirical distribution of $${\hat{\tau }}_k^*$$ tends to be unimodal with a small standard deviation. On the other hand, the empirical distributions of $${\hat{\tau }}_k^\mathrm{HS}$$ and $${\tilde{\tau }}_k^*$$ have the same shape, but the standard deviation of $${\tilde{\tau }}_k^*$$ tends to be smaller than that of $${\hat{\tau }}_k^\mathrm{HS}$$.

5.4 Comparison of confidence intervals

To compare the accuracy and length of the confidence intervals (7), (8), and (9) for change points, we used the percentage of simulations in which the true change point is included in the constructed interval, and the average and standard deviation of the length of the intervals over all simulations. The level for constructing the intervals is set to $$1-\alpha =0.95$$. The results are given in Table 3.
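These summary measures can be sketched in Python as follows. The function name `summarize_intervals` is hypothetical. Two details are assumptions of the sketch rather than statements about the original computation: interval endpoints are counted as included, and the sample standard deviation (divisor $$n-1$$) is used.

```python
import statistics

def summarize_intervals(intervals, true_value):
    """Return (coverage in %, mean length, std of length) for (lower, upper) pairs."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    lengths = [upper - lower for lower, upper in intervals]
    coverage = 100.0 * hits / len(intervals)
    return coverage, statistics.mean(lengths), statistics.stdev(lengths)

# Toy usage with four hypothetical intervals for a true change point at 3.
coverage, mean_length, std_length = summarize_intervals(
    [(2.0, 4.0), (3.5, 5.5), (1.0, 3.0), (2.5, 3.5)], 3.0
)
```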

Table 3 Comparison of the simulation results of three confidence intervals

| Model | n | Interval | $$\tau _1 = 3$$: Prop. | $$\tau _1 = 3$$: Length (std.) | $$\tau _2 = 6$$: Prop. | $$\tau _2 = 6$$: Length (std.) |
| --- | --- | --- | --- | --- | --- | --- |
| Model 1 | 100 | $${\mathrm{CI}}_{\mathrm{basic}}$$ | 56.0 | 3.9 (0.8) | 82.7 | 3.2 (1.3) |
| | | $${\mathrm{CI}}_{\mathrm{equal}}$$ | 100.0 | 3.9 (0.8) | 100.0 | 3.2 (1.3) |
| | | $${\mathrm{CI}}_{\mathrm{unequal}}$$ | 100.0 | 3.6 (0.8) | 100.0 | 2.8 (1.3) |
| | 300 | $${\mathrm{CI}}_{\mathrm{basic}}$$ | 60.3 | 2.9 (0.6) | 87.7 | 0.9 (0.8) |
| | | $${\mathrm{CI}}_{\mathrm{equal}}$$ | 99.7 | 2.9 (0.6) | 93.7 | 0.9 (0.8) |
| | | $${\mathrm{CI}}_{\mathrm{unequal}}$$ | 98.7 | 2.7 (0.6) | 91.0 | 0.7 (0.7) |
| Model 2 | 100 | $${\mathrm{CI}}_{\mathrm{basic}}$$ | 49.0 | 2.8 (0.6) | 43.7 | 3.0 (1.0) |
| | | $${\mathrm{CI}}_{\mathrm{equal}}$$ | 98.3 | 2.8 (0.6) | 97.3 | 3.0 (1.0) |
| | | $${\mathrm{CI}}_{\mathrm{unequal}}$$ | 97.0 | 2.5 (0.6) | 96.0 | 2.7 (0.9) |
| | 300 | $${\mathrm{CI}}_{\mathrm{basic}}$$ | 44.3 | 2.1 (0.5) | 40.7 | 2.0 (0.2) |
| | | $${\mathrm{CI}}_{\mathrm{equal}}$$ | 98.3 | 2.1 (0.5) | 88.7 | 2.0 (0.2) |
| | | $${\mathrm{CI}}_{\mathrm{unequal}}$$ | 96.0 | 1.9 (0.5) | 86.7 | 1.9 (0.3) |

The values of prop. in the table represent the percentages ($$\%$$) that a true change point $$\tau _k$$ is included in the $$95\%$$ confidence intervals in 300 simulations. The values of length represent the averages and standard deviations of length of the confidence intervals in 300 simulations

Models 1 and 2 are given by (10) and (11), respectively, and $${\mathrm{CI}}_{\mathrm{basic}}$$, $${\mathrm{CI}}_{\mathrm{equal}}$$ and $${\mathrm{CI}}_{\mathrm{unequal}}$$ are given by (7), (8), and (9), respectively

For Model 1, all intervals are too wide when the sample size is small, and the results of $${\mathrm{CI}}_{\mathrm{equal}}$$ and $${\mathrm{CI}}_{\mathrm{unequal}}$$ become conservative. $${\mathrm{CI}}_{\mathrm{basic}}$$ does not work well. When the number of samples increases, all intervals give results that are closer to the nominal level. Since the length of $${\mathrm{CI}}_{\mathrm{basic}}$$ is the same as that of $${\mathrm{CI}}_{\mathrm{equal}}$$, $${\mathrm{CI}}_{\mathrm{equal}}$$ performs better than $${\mathrm{CI}}_{\mathrm{basic}}$$ in these situations. A likely reason is that the empirical distribution of the estimators is symmetric and centered on $${\hat{\tau }}_k^\mathrm{HS}$$. This is supported by the fact that the histogram in Fig. 3a is nearly symmetrical in shape.

For Model 2, $${\mathrm{CI}}_{\mathrm{basic}}$$ often fails to cover the true change point. A likely reason is that the confidence interval derived from the bootstrap distribution does not necessarily converge to the correct confidence interval for the true parameter under the population distribution. The coverage of $${\mathrm{CI}}_{\mathrm{equal}}$$ and $${\mathrm{CI}}_{\mathrm{unequal}}$$ deviates slightly from the nominal level, especially for $$\tau _2$$ when $$n = 300$$. This is considered to be due to the bias of the estimators $${\hat{\tau }}_k^\mathrm{HS}$$ from the true value. Based on these simulations, we recommend using $${\mathrm{CI}}_{\mathrm{equal}}$$ or $${\mathrm{CI}}_{\mathrm{unequal}}$$ when building a confidence interval for a change point based on the HS algorithm.

6 Example

We present a simple application of the change point analysis discussed in this study, using data from a study of female horseshoe crabs on an island in the Gulf of Mexico. These data were used by Agresti (2013), and the data set is available on his website. The data consist of observations on 173 female crabs. The response variable of interest is the number of male crabs that cluster around a female crab during spawning. In this study, we used the carapace width as the explanatory variable with piecewise different coefficient vectors within the model. In addition to the carapace width, we used the crab’s weight as the explanatory variable with a common coefficient vector. That is, $$x_{i}$$ = "carapace width (cm)" and $$z_{i}$$ = "crab’s weight (kg)". As in the simulation studies, the number of bootstrap iterations B is set to 500.

For comparison, we first present the result of the general Poisson regression model without considering the change points. We used the log link function. The linear predictor model obtained by using the maximum likelihood method is
\begin{aligned} \eta _i = -1.30 + 0.05 x_i + 0.45 z_i. \end{aligned}
(12)
Figure 5 plots the response counts against carapace width together with the mean model obtained from (12). The slope of $$x_i$$ is 0.05, which indicates that the number of clusters tends to increase slightly as carapace width increases. In addition, an increase in the crab’s weight is also associated with an increase in the number of clusters.

Fig. 5 Plots of the response counts against the carapace width. The red line is the mean value given by the general Poisson regression model, in which the value of $$z_i$$ is given by the average value of $$z_i$$ of all 173 data points
The linear predictor model obtained by using the HS algorithm with AIC is
\begin{aligned} \eta _i =\left\{ \begin{array}{ll} -12.10 + 0.47 x_{i} + 0.63 z_i, &{}\quad -\infty< x_{i} \le 25.15, \\ -51.52 + 1.99 x_{i} + 0.63 z_i, &{}\quad 25.15< x_{i} \le 26.05 ,\\ -3.92 + 0.12 x_{i} + 0.63 z_i, &{}\quad 26.05< x_{i} \le 28.10 ,\\ 5.55 - 0.21 x_{i} + 0.63 z_i, &{}\quad 28.10 < x_{i} \le +\infty . \\ \end{array} \right. \end{aligned}
(13)
Figure 6 shows the mean model obtained by (13). In this case, the number of change points is estimated as 3. In the segment where the value of the carapace width is 25.15–26.05, the slope of $$x_i$$ has a high value of 1.99. On the other hand, in the segment where the value of the carapace width is higher than 28.10, the slope of $$x_i$$ is negative ($$-0.21$$), so the number of clusters tends to decrease as the carapace width increases.

Fig. 6 Plots of the response counts against carapace width. The red line is the mean value given by the HS algorithm with AIC, in which the value of $$z_i$$ is given by the average value of $$z_i$$ of all 173 data points
The linear predictor model obtained by using the HS algorithm with BIC is
\begin{aligned} \eta _i =\left\{ \begin{array}{ll} -12.29 + 0.49 x_{i} + 0.57 z_i, &{}\quad -\infty< x_{i} \le 25.15, \\ -51.67 + 2.00 x_{i} + 0.57 z_i, &{}\quad 25.15< x_{i} \le 26.05 ,\\ 0.51 - 0.03 x_{i} + 0.57 z_i, &{}\quad 26.05 < x_{i} \le +\infty .\\ \end{array} \right. \end{aligned}
(14)
Figure 7 shows the mean model obtained by (14). The number of change points is estimated as 2, and the third and fourth segments in (13) are aggregated into a single segment. In this new segment, where the value of the carapace width is higher than 26.05, the number of clusters tends to decrease moderately as the value of the carapace width increases.

Fig. 7 Plots of the response counts against carapace width. The red line is the mean value given by the HS algorithm with BIC, in which the value of $$z_i$$ is given by the average value of $$z_i$$ of all 173 data points
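To make the piecewise model (14) concrete, the following Python sketch evaluates its linear predictor and the corresponding mean under the log link. The coefficients are copied from (14); the function names and the weight of 2.0 kg used in the example are hypothetical choices for illustration.

```python
import math

# (upper bound of segment, intercept, slope of x), copied from Eq. (14).
SEGMENTS_BIC = [
    (25.15, -12.29, 0.49),
    (26.05, -51.67, 2.00),
    (math.inf, 0.51, -0.03),
]
Z_COEF = 0.57  # common coefficient of the crab's weight z

def eta_bic(x, z):
    """Linear predictor of Eq. (14) for carapace width x (cm) and weight z (kg)."""
    for upper, intercept, slope in SEGMENTS_BIC:
        if x <= upper:
            return intercept + slope * x + Z_COEF * z
    raise ValueError("x must be a number")  # reached only if x is NaN

def mean_bic(x, z):
    """Predicted mean count under the log link."""
    return math.exp(eta_bic(x, z))

# Within the middle segment, the mean grows by a factor exp(2.00 * dx)
# when the width increases by dx; here dx = 0.5 cm and z = 2.0 kg (hypothetical).
growth = mean_bic(26.05, 2.0) / mean_bic(25.55, 2.0)
```

Over the full width of the middle segment (0.9 cm), the slope of 2.00 corresponds to a multiplicative change of $$\exp (2.00 \times 0.9) \approx 6.05$$ in the predicted mean, which is the rapid increase visible in Fig. 7.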
Because the simulation results in Table 1 suggest that the aggregated model based on BIC has greater prediction accuracy than the model based on AIC, we show the mean model obtained by the bagging algorithm with BIC in Fig. 8. The model obtained by the bagging algorithm with AIC is nearly the same as the model in Fig. 8, but it has a more jagged curve. The main difference between the models obtained by the HS algorithm and the bagging algorithm is the prediction in the segment where the value of the carapace width is 25.15–26.05. In this segment, the predicted mean value of the number of clusters against the carapace width increases rapidly in the model obtained by the HS algorithm. In the model obtained by the bagging algorithm, on the other hand, the rate of increase is reduced and the model seems to have a natural shape.

Fig. 8 Plots of the response counts against the carapace width. The red line is the mean value given by the bagging algorithm with BIC, in which the value of $$z_i$$ is given by the average value of $$z_i$$ of all 173 data points

To proceed with the analysis discussed in this study, we need to fix the number of change points as a known number. It is well known that AIC tends to overestimate the number of parameters in a model. In addition, because AIC is not an asymptotically consistent estimator of the model order, we consider 2 to be the optimal number of change points for these data. Therefore, in the following analysis, we treat the number of change points as known, with $$d - 1 = 2$$.
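The different behaviour of the two criteria can be seen from their penalty terms. The sketch below uses the standard definitions $$\mathrm{AIC} = -2\log L + 2k$$ and $$\mathrm{BIC} = -2\log L + k\log n$$; how the number of parameters $$k$$ is counted for a model with change points is left open here, and the function names are hypothetical.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 log L + 2 k."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: -2 log L + k log n."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# For the n = 173 crabs, each additional parameter costs log(173) under BIC
# but only 2 under AIC, so BIC penalises extra segments more heavily.
n = 173
penalty_ratio = math.log(n) / 2.0
```

Since $$\log n > 2$$ whenever $$n \ge 8$$, BIC favours models with fewer segments than AIC for any realistic sample size, which is consistent with (14) having one change point fewer than (13).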

For the estimation of the location of change points, the estimates given by the HS algorithm are $${\hat{\tau }}_1^\mathrm{HS} = 25.15$$ and $${\hat{\tau }}_2^\mathrm{HS} = 26.05$$ as described in (14). On the other hand, the estimates given by (5) are $${\hat{\tau }}_1^* = 25.28$$ and $${\hat{\tau }}_2^* = 27.16$$. Moreover, the estimates given by (6) are $${\tilde{\tau }}_1^* = 25.15$$ and $${\tilde{\tau }}_2^* = 27.35$$. For $$\tau _1$$, there is only a small difference between the three estimates. On the other hand, for $$\tau _2$$, the estimates based on the bootstrap method are more than one unit higher than the value of $${\hat{\tau }}_2^\mathrm{HS}$$. Because the simulation results in Table 2 indicate that the standard error of $${\hat{\tau }}_2^\mathrm{HS}$$ is high, we should be aware that the value of $$\tau _2$$ may be higher than 26.05.

Based on the simulation results in Table 3, we think that $${\mathrm{CI}}_{\mathrm{equal}}$$ and $${\mathrm{CI}}_{\mathrm{unequal}}$$ are more suitable than $${\mathrm{CI}}_{\mathrm{basic}}$$. Therefore, we calculate the confidence intervals of $${\hat{\tau }}_k^\mathrm{HS}$$ given by (8) and (9). The 95% confidence intervals for $${\hat{\tau }}_1^\mathrm{HS}$$ are $${\mathrm{CI}}_{\mathrm{equal}}=[24.05, 26.40]$$ and $${\mathrm{CI}}_{\mathrm{unequal}}=[24.05, 26.15]$$. The difference between these intervals is small, and we conclude that the first change point of the carapace width against the number of clusters is in the interval of about 24–26 cm.

The 95% confidence intervals for $${\hat{\tau }}_2^\mathrm{HS}$$ given by (8) and (9) are $${\mathrm{CI}}_{\mathrm{equal}}=[25.55, 28.40]$$ and $${\mathrm{CI}}_{\mathrm{unequal}}=[25.85, 28.45]$$, respectively. The model (13) has three change points, with 28.10 as the third change point; the values of $${\hat{\tau }}_2^*$$ and $${\tilde{\tau }}_2^*$$ are greater than 27; and the upper bounds of the intervals $${\mathrm{CI}}_{\mathrm{equal}}$$ and $${\mathrm{CI}}_{\mathrm{unequal}}$$ are greater than 28. From these results, we conclude that the second change point of the carapace width against the number of clusters is in the interval of about 25.5–28.5 cm.

7 Conclusion

The HS algorithm is widely used as a standard approach for multiple change point analysis. The algorithm is easy to execute and computationally efficient. However, there is a risk that the estimated change points obtained by the algorithm are not the MLE, and as a result their variance increases and they lack consistency and asymptotic normality. To deal with this problem, we focused on the application of the bootstrap method based on the HS algorithm in GLM with piecewise different coefficients. In particular, we studied the usefulness of the method from three viewpoints: improvement of the prediction accuracy by bagging, reduction of the standard error of the estimator of the change points, and construction of the confidence interval of change points.

The following main results were obtained by simulation studies. First, the prediction accuracy of the model obtained by the bagging algorithm was higher than that of the model obtained by the HS algorithm in all simulation settings. A surprising result was that the model obtained by the bagging algorithm, when the number of change points was estimated in each base model, was more accurate than the model obtained when the number of change points was known. This is considered to be due to the diversity of the base models, and therefore application of further developments of the algorithm such as the VR-Tree ensemble (Liu et al. 2008) can be considered.

Second, there is little difference between the average values of estimators of the change points obtained by the HS algorithm and bootstrap method. Depending on the settings of the true model, both estimators have bias, but the standard error of the estimator obtained by the bootstrap method is smaller than that of the HS algorithm. Moreover, the estimator based on the mean value of the bootstrap estimates has a distribution that is unimodal and nearly symmetrical.

Third, $${\mathrm{CI}}_{\mathrm{equal}}$$ and $${\mathrm{CI}}_{\mathrm{unequal}}$$ are recommended when building the confidence interval of a change point based on the HS algorithm. Although the coverage of the intervals based on these methods deviates somewhat from the nominal level, it is closer to the nominal level than that of the intervals based on the basic bootstrap method.

Through the change point analysis of the study of female horseshoe crabs, we showed a simple application of the discussed methods. From the plot of the mean models given by the HS algorithm and bagging algorithm, the latter model appears to be reasonable and to represent the data well. Based on the plot, the estimates of the change points, and the confidence intervals, we gave ranges that are somewhat conservative but are likely to contain the change points.

There are several tasks that need further research. In this study, we compared the generally used HS algorithm and bagging algorithm from the perspective of prediction accuracy. Another comparable method is, for example, a model constructed by a non-parametric approach or a Bayesian approach. Therefore, a comparison of these methods through an extensive simulation study is a subject for future research.

In addition to this, there are many elements that are necessary for constructing an appropriate model, such as variable selection, detection of interaction, sensitivity analysis, and confirmation of the linearity of the explanatory variables in the linear predictor. Moreover, we focused on the change in mean and/or variance structure by considering the piecewise different coefficients in GLM. As an extension of this, the detection of variance change points in the model is an important problem. In order to deal with this problem, further extensions of the methods need to be considered. These tasks will be the subject of future studies.

References

1. Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Hoboken, New Jersey: Wiley.
2. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), Proceedings of the 2nd International Symposium on Information Theory (pp. 267–281). Budapest.
3. Bai, J., & Perron, P. (2003). Computation and analysis of multiple structural change models. Journal of Applied Econometrics, 18, 1–22.
4. Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
5. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. (1984). Classification and Regression Trees. California: Wadsworth.
6. Brown, R. L., Durbin, J., & Evans, J. M. (1975). Techniques for testing the constancy of regression relationships over time. Journal of the Royal Statistical Society Series B, 37, 149–192.
7. Chen, J., & Gupta, A. K. (1997). Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association, 92, 739–747.
8. Chen, J., & Gupta, A. K. (2012). Parametric Statistical Change Point Analysis (2nd ed.). New York: Birkhäuser.
9. Csörgő, M., & Horváth, L. (1997). Limit Theorems in Change-Point Analysis. New York: John Wiley & Sons.
10. Davis, R. A., Lee, T. C. M., & Rodriguez-Yam, G. A. (2006). Structural break estimation for nonstationary time series models. Journal of the American Statistical Association, 101, 223–239.
11. Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and their Application. Cambridge: Cambridge University Press.
12. Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Boca Raton, Florida: Chapman and Hall/CRC Press.
13. Fox, J. (2015). Applied Regression Analysis and Generalized Linear Models (3rd ed.). Thousand Oaks: Sage Publications.
14. Gurevich, G., & Vexler, A. (2005). Change point problems in the model of logistic regression. Journal of Statistical Planning and Inference, 131, 313–331.
15. Hawkins, D. M. (1977). Testing a sequence of observations for a shift in location. Journal of the American Statistical Association, 72, 180–186.
16. Hawkins, D. M. (2001). Fitting multiple change-point models to data. Computational Statistics & Data Analysis, 37, 323–341.
17. Holbert, D. (1982). A Bayesian analysis of a switching linear model. Journal of Econometrics, 19, 77–87.
18. Inclán, C. (1993). Detection of multiple changes of variance using posterior odds. Journal of Business and Economic Statistics, 11, 289–300.
19. James, B. J., James, K. L., & Siegmund, D. (1987). Tests for a change-point. Biometrika, 74, 71–84.
20. Kim, H. (1994). Tests for a change-point in linear regression. IMS Lecture Notes-Monograph Series, 23, 170–176.
21. Kim, H., & Siegmund, D. (1989). The likelihood ratio test for a change-point in simple linear regression. Biometrika, 76, 409–423.
22. Küchenhoff, H., & Carroll, R. J. (1997). Segmented regression with errors in predictors: semi-parametric and parametric methods. Statistics in Medicine, 16, 169–188.
23. Liu, F. T., Ting, K. M., Yu, Y., & Zhou, Z. H. (2008). Spectrum of variable-random trees. Journal of Artificial Intelligence Research, 32, 355–384.
24. Lu, Q., Lund, R., & Lee, T. C. M. (2010). An mdl approach to the climate segmentation problem. The Annals of Applied Statistics, 4, 299–319.
25. Quandt, R. E. (1958). The estimation of parameters of a linear regression system obeying two separate regimes. Journal of the American Statistical Association, 53, 873–880.
26. Quandt, R. E. (1960). Tests of the hypothesis that a linear regression system obeys two separate regimes. Journal of the American Statistical Association, 55, 324–330.
27. Rissanen, J. (2007). Information and Complexity in Statistical Modeling. New York: Springer.
28. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
29. Smith, P. L. (1979). Splines as a useful and convenient statistical tool. The American Statistician, 33, 57–62.
30. Stasinopoulos, D. M., & Rigby, R. A. (1992). Detecting break points in generalised linear models. Computational Statistics & Data Analysis, 13, 461–471.
31. Ulm, K. (1991). A statistical method for assessing a threshold in epidemiological studies. Statistics in Medicine, 10, 341–349.
32. Worsley, K. J. (1979). On the likelihood ratio test for a shift in location of normal populations. Journal of the American Statistical Association, 74, 365–367.
33. Wu, Y. (2008). Simultaneous change point analysis and variable selection in a regression problem. Journal of Multivariate Analysis, 99, 2154–2171.
34. Zhou, Z. H. (2012). Ensemble Methods Foundations and Algorithms. Boca Raton: Chapman and Hall/CRC Press. 