# A note on the Gao et al. (2019) uniform mixture model in the case of regression


## Abstract

We extend the uniform mixture model of Gao et al. (Ann Oper Res, 2019. https://doi.org/10.1007/s10479-019-03236-9) to the case of linear regression. Gao et al. proposed that the probability distributions of multimodal and irregular data observed in engineering can be characterized by a uniform mixture model, a weighted combination of multiple uniform distribution components. This case is of empirical interest since, in many instances, the distribution of the error term in a linear regression model cannot be assumed unimodal. Bayesian methods of inference organized around Markov chain Monte Carlo are proposed. In a Monte Carlo experiment, significant efficiency gains are found over least squares, justifying the use of the uniform mixture model.

## Keywords

Multimodal data · Uniform mixture model · Regression models · Statistical inference · Bayesian analysis

## 1 Introduction

Gao et al. (2019) proposed that to characterize the probability distributions of multimodal and irregular data observed from practical engineering, a uniform mixture model (UMM) can be used, which is a weighted combination of multiple uniform distribution components. As these authors notice, because of noise in many data sets, “probability distributions of observed data can not be accurately characterized by typical unimodal distributions (such as normal, lognormal, and Weibull distributions), and the adequacy of typical unimodal distributions may be questioned”.

A uniform random variable on the interval (*a*, *b*) has probability density \(f(y\mid a,b)=\frac{1}{b-a}\) for \(a\le y\le b\). Gao et al. (2019) build the UMM assuming the number of components *N* is given, and using the following mixture density: \(f(y)=\sum _{j=1}^{N}w_{j}\,f(y\mid a_{j},b_{j})\), where the weights satisfy \(w_{j}\ge 0\) and \(\sum _{j=1}^{N}w_{j}=1\).
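A mixture of uniforms of this kind is easy to evaluate and sample. The sketch below assumes known component intervals \((a_{j},b_{j})\) and weights \(w_{j}\) summing to one; the function names are illustrative, not from Gao et al.:

```python
import numpy as np

def umm_pdf(y, a, b, w):
    """Density of a uniform mixture: sum_j w_j * 1/(b_j - a_j) on (a_j, b_j)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    a, b, w = (np.asarray(v, dtype=float) for v in (a, b, w))
    inside = (y[:, None] >= a) & (y[:, None] <= b)  # which components cover each y
    return inside @ (w / (b - a))

def umm_sample(n, a, b, w, rng=None):
    """Draw n variates: pick a component by weight, then draw uniformly on it."""
    rng = np.random.default_rng(rng)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    j = rng.choice(len(w), size=n, p=w)  # component labels
    return rng.uniform(a[j], b[j])
```

With well-separated components (e.g. intervals (0, 1) and (2, 3)), the resulting density is visibly bimodal, which is the kind of irregular shape the UMM is designed to capture.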

## 2 The case of linear regression

## 3 Statistical inference

### 3.1 Markov chain Monte Carlo (MCMC) in general

Suppose the parameter vector is partitioned into two blocks, \(\alpha _{1}\) and \(\alpha _{2}\), and \(\mathcal {D}\) denotes the data. At iteration *s*, the Gibbs sampler cycles through the posterior conditional distributions:

1. Draw \(\alpha _{1}^{(s)}\) from its conditional distribution \(\alpha _{1}|\alpha _{2}^{(s-1)},\mathcal {D}\).
2. Draw \(\alpha _{2}^{(s)}\) from its conditional distribution \(\alpha _{2}|\alpha _{1}^{(s)},\mathcal {D}\).
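The two-block cycle can be illustrated on a toy target whose full conditionals are known in closed form. The bivariate normal below is purely illustrative and is not the UMM posterior:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws=5000, burn=500, rng=None):
    """Two-block Gibbs sampler for a standard bivariate normal with correlation rho.
    Each full conditional is normal: alpha_1 | alpha_2 ~ N(rho * alpha_2, 1 - rho^2),
    and symmetrically for alpha_2, mirroring the generic two-block cycle."""
    rng = np.random.default_rng(rng)
    sd = np.sqrt(1.0 - rho ** 2)
    a1, a2 = 0.0, 0.0
    draws = np.empty((n_draws, 2))
    for s in range(burn + n_draws):
        a1 = rng.normal(rho * a2, sd)  # draw alpha_1 | alpha_2, D
        a2 = rng.normal(rho * a1, sd)  # draw alpha_2 | alpha_1, D
        if s >= burn:
            draws[s - burn] = (a1, a2)
    return draws
```

Discarding an initial burn-in and retaining the subsequent draws is the standard practice the section alludes to; the retained draws approximate the joint posterior as the number of iterations grows.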

### 3.2 MCMC in the UMM linear regression model

\(\mathcal {D}\) denotes the entire data set \(\{y_{i},x_{i}\}_{i=1}^{n}\). Therefore, we have:

*j*th sub-interval.

Given *N*, the *endpoint* \(a_{1}\) *can be estimated from the data*. Define the parameter vector as \(\theta =[\beta ',a_{1},\{J_{i}\}_{i=1}^{n},w']'\). Given the \(J_{i}\)s we must have:

Given *N* and \(\Delta \), these equations determine the values of the endpoints. Suppose our prior is

*X* is the \(n\times k\) matrix of regressors. In turn, the posterior conditional distribution of \(\beta \) is \(p(\beta )\propto \mathrm {const}.\) subject to these restrictions. Suppose \(X=[\mathbf {x}_{1},\ldots ,\mathbf {x}_{k}]\) where \(\mathbf {x}_{j}\) is the *j*th column of *X*, an \(n\times 1\) vector. We can write (22) as follows:

*S* increases.
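The bookkeeping for the sub-intervals is mechanical. Assuming, as the use of a common width \(\Delta \) suggests, that the error support is partitioned into *N* equal-width sub-intervals starting at \(a_{1}\), the endpoints and the indicators \(J_{i}\) can be sketched as follows (the function names are ours):

```python
import numpy as np

def umm_endpoints(a1, delta, N):
    """Equal-width sub-intervals: a_j = a1 + (j - 1) * delta, b_j = a_j + delta."""
    a = a1 + delta * np.arange(N)
    return a, a + delta

def component_indicators(u, a1, delta, N):
    """J_i: 1-based index of the sub-interval containing residual u_i."""
    j = np.floor((np.asarray(u, dtype=float) - a1) / delta).astype(int) + 1
    return np.clip(j, 1, N)  # clamp boundary points into the grid
```

Given draws of \(\beta \) and \(a_{1}\), the residuals determine the \(J_{i}\)s, which in turn drive the conditional draws of the weights *w*.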

## 4 Monte Carlo evidence

For each case we assume that the sample size is \(n=25\), 50, 100, 500, 1000 and 10,000. We have two correlated regressors: the first one, \(x_{i1}\sim N(0,1)\) and the second is \(x_{i2}=x_{i1}+0.1\varepsilon _{i}\), where \(\varepsilon _{i}\sim N(0,1),\,i=1,\ldots ,n\). The regression model is: \(y_{i}=\beta _{0}+\beta _{1}x_{i1}+\beta _{2}x_{i2}+u_{i},\) where \(u_{i}\) is generated according to cases (a) through (d). The true parameter values are: \(\beta _{0}=10,\,\beta _{1}=1,\,\beta _{2}=-1\).
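The data-generating design just described can be replicated directly. The error draw is left to the caller, since the distributions for cases (a) through (d) are not reproduced here:

```python
import numpy as np

def simulate_design(n, rng=None):
    """The paper's design: x_{i1} ~ N(0,1), x_{i2} = x_{i1} + 0.1 * eps_i,
    with true coefficients (beta_0, beta_1, beta_2) = (10, 1, -1)."""
    rng = np.random.default_rng(rng)
    x1 = rng.normal(0.0, 1.0, n)
    x2 = x1 + 0.1 * rng.normal(0.0, 1.0, n)
    X = np.column_stack([np.ones(n), x1, x2])  # intercept plus two regressors
    beta = np.array([10.0, 1.0, -1.0])
    return X, beta

def simulate_y(X, beta, u):
    """y_i = beta_0 + beta_1 x_{i1} + beta_2 x_{i2} + u_i for a caller-supplied u."""
    return X @ beta + u
```

Note that the two regressors are, by construction, very highly correlated, which makes this a demanding design for any estimator of the slope coefficients.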

**Table 1** Efficiency of regression-UMM versus LS

|  | Case (a) | Case (b) | Case (c) | Case (d) |
|---|---|---|---|---|
| \(n=25\) | 1.712 | 1.912 | 1.981 | 2.231 |
| \(n=50\) | 1.515 | 1.832 | 1.872 | 1.945 |
| \(n=500\) | 1.350 | 1.644 | 1.750 | 1.717 |
| \(n=1{,}000\) | 1.210 | 1.355 | 1.515 | 1.422 |
| \(n=10{,}000\) | 1.07 | 1.101 | 1.113 | 1.130 |

From the results in Table 1, regression-UMM-based techniques are considerably more efficient than LS, particularly for “small” samples (i.e. \(n\le 1000\)), although even at \(n=10{,}000\) the improvement in efficiency is quite evident. With \(n=10{,}000\) the efficiency is close to unity, but the efficiency of UMM is still larger (notice that LS is best linear unbiased, but the UMM-regression estimator is not linear, so efficiency gains are possible even in quite large samples). Moreover, the regression-UMM-based estimator is practically unbiased, as its mean squared error and variance are very similar (results available on request). Finally, efficiency gains are largest in cases (b) and (c), where the mixing components are far from normality (viz. Student-*t* with one degree of freedom and lognormal components).

Another interesting case is to consider \(u_{i}\sim N(0,\sigma ^{2}),\,i=1,\ldots ,n\), where \(\sigma ^{2}\) is estimated using the LS estimator \(s^{2}=\frac{\sum _{i=1}^{n}(y_{i}-x'_{i}b_{LS})^{2}}{n-k}\), and \(b_{LS}=(X'X)^{-1}X'y\). In turn, we know that the support of the error terms is, approximately, \(\left( -3s,3s\right) \) (perhaps too “generously”). Even a plot of LS residuals can inform us, at least in large samples, about the support as well as the form of the distribution of errors.
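A minimal sketch of this residual-based approach, computing \(s^{2}\) and the approximate support \((-3s,3s)\) from an LS fit:

```python
import numpy as np

def residual_support(y, X):
    """LS fit, s^2 with n - k degrees of freedom, and the approximate
    error support (-3s, 3s) suggested in the text."""
    n, k = X.shape
    b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # b_LS = (X'X)^{-1} X'y
    resid = y - X @ b_ls
    s2 = float(resid @ resid) / (n - k)
    s = np.sqrt(s2)
    return b_ls, s2, (-3.0 * s, 3.0 * s)
```

A histogram of the residuals over this interval then gives a first, informal look at the shape of the error distribution, as the text suggests.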

Results for different numbers of components (*N*) in the support of UMM-regression are reported in Table 2.

**Table 2** Bias and efficiency of LS estimator of \(\beta _{1}\) and UMM-regression

|  | \(N=10\) | \(N=50\) | \(N=100\) |
|---|---|---|---|
| Bias LS | 0.014 | 0.014 | 0.014 |
| Bias UMM | 0.012 | 0.011 | 0.011 |
| s.e. LS | 0.011 | 0.011 | 0.011 |
| s.e. UMM | 0.009 | 0.007 | 0.007 |

(The LS estimator does not depend on *N*, so its bias and standard error are constant across columns.)

For example the mean square error (MSE) of LS is \(0.011^{2}+0.014^{2}=0.000317\) while the MSE of UMM-regression estimator with \(N=50\) is \(0.007^{2}+0.011^{2}=0.00017\) so the ratio of MSEs is almost 1.86. The MSE is lower compared to LS even if we use only \(N=10\) points in the support of the error.
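The arithmetic can be checked directly, computing each MSE as squared standard error plus squared bias from the Table 2 entries:

```python
# MSE = variance + bias^2, using the Table 2 entries (s.e. and bias).
mse_ls = 0.011 ** 2 + 0.014 ** 2    # LS: s.e. 0.011, bias 0.014
mse_umm = 0.007 ** 2 + 0.011 ** 2   # UMM with N = 50: s.e. 0.007, bias 0.011
ratio = mse_ls / mse_umm            # about 1.86, as reported
print(mse_ls, mse_umm, round(ratio, 2))
```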


## References

- Gao, J., An, Z., & Bai, X. (2019). A new representation method for probability distributions of multimodal and irregular data based on uniform mixture model. *Annals of Operations Research*. https://doi.org/10.1007/s10479-019-03236-9
- Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. *Journal of the American Statistical Association*, *85*(410), 398–409.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.