Abstract

Regression is one of the most commonly used multivariate statistical methods. Multivariate linear regression can integrate many explanatory variables to predict the target variable. However, collinearity due to intercorrelations in the explanatory variables leads to many surprises in multivariate regression. This chapter presents both basic and advanced regression methods, including standard least square linear regression, ridge regression and principal component regression. Pitfalls in using these methods for geoscience applications are also discussed.

Any statistics can be extrapolated to the point where they show disaster.

Thomas Sowell


References

  • Bertrand, P. V., & Holder, R. L. (1988). A quirk in multiple regression: The whole regression can be greater than the sum of its parts. The Statistician, 37, 371–374.
  • Chen, A., Bengtsson, T., & Ho, T. K. (2009). A regression paradox for linear models: Sufficient conditions and relation to Simpson’s paradox. The American Statistician, 63(3), 218–225.
  • Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation for the behavioral sciences (3rd ed.; 1st ed., 1975). Mahwah: Lawrence Erlbaum Associates, 703 p.
  • Darmawan, I. G. N., & Keeves, J. P. (2006). Suppressor variables and multilevel mixture modeling. International Education Journal, 7(2), 160–173.
  • Delfiner, P. (2007). Three pitfalls of Phi-K transforms. SPE Reservoir Evaluation & Engineering, 10(6), 609–617.
  • Friedman, L., & Wall, M. (2005). Graphic views of suppression and multicollinearity in multiple linear regression. The American Statistician, 59(2), 127–136.
  • Gonzalez, A. B., & Cox, D. R. (2007). Interpretation of interaction: A review. The Annals of Applied Statistics, 1(2), 371–385.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
  • Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–68.
  • Huang, D. Y., Lee, R. F., & Panchapakesan, S. (2006). On some variable selection procedures based on data for regression models. Journal of Statistical Planning and Inference, 136(7), 2020–2034.
  • Jones, T. A. (1972). Multiple regression with correlated independent variables. Mathematical Geology, 4, 203–218.
  • Liao, D., & Valliant, R. (2012). Variance inflation factors in the analysis of complex survey data. Survey Methodology, 38(1), 53–62.
  • Lord, F. M. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin, 68, 304–305.
  • Ma, Y. Z. (2010). Error types in reservoir characterization and management. Journal of Petroleum Science and Engineering, 72(3–4), 290–301. https://doi.org/10.1016/j.petrol.2010.03.030
  • Ma, Y. Z. (2011). Pitfalls in predictions of rock properties using multivariate analysis and regression method. Journal of Applied Geophysics, 75, 390–400.
  • O’Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41, 673–690.
  • Smith, A. C., Koper, N., Francis, C. M., & Fahrig, L. (2009). Confronting collinearity: Comparing methods for disentangling the effects of habitat loss and fragmentation. Landscape Ecology, 24, 1271–1285.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
  • Vargas-Guzman, J. A. (2009). Unbiased estimation of intrinsic permeability with cumulants beyond the lognormal assumption. SPE Journal, 14, 805–810.
  • Webster, J. T., Gunst, R. F., & Mason, R. L. (1974). Latent root regression analysis. Technometrics, 16(4), 513–522.

Appendices

1.1 Appendix 6.1: Lord’s Paradox and Importance of Judgement Objectivity

Lord (1967) framed his paradox based on the following hypothetical example (p. 304):

A large university is interested in investigating the effects on the students of the diet provided in the university dining halls and any sex difference in these effects. … the weight of each student at the time of his arrival in September and his weight the following June are recorded.

Because students’ weights vary from September to the following June, there is a meaningful spread in the crossplot between the weight at arrival and the weight at the end of the school year (although the correlation is still significant, as Lord implied). Lord also indicated that the average gain was zero for both the male and the female students. He then asked two hypothetical statisticians to determine whether the school diet (or anything else) influenced student weight; he was looking for any evidence of a differential effect on the two sexes.

Barring a possible cancelation of multiple effects from several factors, which Lord (1967) implied was not the case, the answer should be obvious to a researcher with common sense. Lord’s first hypothetical statistician concluded that there was no diet effect because the average weights of males and females remained unchanged. Lord’s second statistician, however, performed two linear regressions, one for the males and one for the females, and concluded that the diet affected males and females differently (Fig. 6.6). His conclusion rested on the fact that the two linear regressions gave different results; this statistician fell into the trap of the regression paradox. Unfortunately, Lord and other researchers following him have claimed that it is not clear which statistician is correct and that no simple explanation is available for the conflicting results (Chen et al. 2009).

Fig. 6.6 Simulated Lord’s paradox. O = female, + = male

Lord’s paradox is fundamentally a manifestation of the regression paradox, twisted with two preexisting heterogeneous classes. Many researchers focus on the group effects and ignore the regression paradox. Above all, because this was a descriptive problem, not a prediction problem, analysis of covariance suffices, and regression should not have been used in the first place (a major-axis regression could be used to describe the trend).
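The mechanism can be sketched numerically. The script below is a minimal simulation with hypothetical group means, spread, and correlation (not Lord’s actual data): each group’s average gain is zero, yet the two separate regressions of June weight on September weight have slopes below one and different intercepts, which is exactly the apparent “differential effect” the second statistician reported.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_group(mean_w, n=500, sd=10.0, rho=0.7):
    # September and June weights share the same mean (zero average gain)
    # but are imperfectly correlated, so regression to the mean occurs.
    cov = [[sd**2, rho * sd**2], [rho * sd**2, sd**2]]
    sep, jun = rng.multivariate_normal([mean_w, mean_w], cov, size=n).T
    return sep, jun

def fit_line(x, y):
    # Ordinary least-squares slope and intercept.
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return slope, y.mean() - slope * x.mean()

# Hypothetical group means (females lighter on average than males).
sep_f, jun_f = simulate_group(55.0)
sep_m, jun_m = simulate_group(70.0)

# Statistician 1: the average gain is ~0 in both groups -> no diet effect.
print(round(jun_f.mean() - sep_f.mean(), 1), round(jun_m.mean() - sep_m.mean(), 1))

# Statistician 2: the separate regressions have slopes < 1 (regression to
# the mean) and hence different intercepts -- an apparent "differential
# effect" that is really the regression paradox, not a diet effect.
s_f, i_f = fit_line(sep_f, jun_f)
s_m, i_m = fit_line(sep_m, jun_m)
print(round(s_f, 2), round(s_m, 2), round(i_m - i_f, 1))
```

The gap between the two fitted lines comes entirely from regression to the mean acting on two groups with different baselines, not from any treatment.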

Incidentally, Lord’s paradox does show an effect of pre-existing heterogeneous groups. The question is, should all the data be analyzed together or analyzed separately for each of the two groups? This question has profound implications regarding multiple levels of heterogeneities in geosciences, and appropriate uses of hierarchical modeling for descriptions of heterogeneities in hierarchy. Examples with appropriate regression uses are presented in Chap. 20.

1.2 Appendix 6.2: Effects of Collinearity in Multivariate Linear Regression

Besides redundancy, collinearity has another important aspect, termed suppression in the early statistical literature and variance inflation in the more recent literature. Although the current literature discusses variance inflation more, that framing tends to treat only the symptoms of collinearity; understanding suppression gives a better grasp of collinearity’s effect. The term suppression does not imply suppressing information; rather, it refers to inflation of the weighting coefficients of the predictor variables in multivariate linear regression.
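For the two-predictor case, the variance inflation factor has a simple closed form, VIF = 1/(1 − r₁₂²), where r₁₂ is the correlation between the two predictors. A short sketch (the function name is mine; the correlation values are illustrative):

```python
def vif_two_predictors(r12):
    """VIF for either predictor in a two-predictor regression,
    where r12 is the correlation between the two predictors."""
    return 1.0 / (1.0 - r12**2)

# A mild predictor correlation of -0.15 barely inflates the variance;
# this value equals the diagonal of the inverse correlation matrix.
print(round(vif_two_predictors(-0.15), 3))   # -> 1.023

# Strong collinearity inflates coefficient variance sharply.
print(round(vif_two_predictors(0.95), 1))    # -> 10.3
```

This is why the symptom (inflated variance) and the cause (correlated predictors) are two views of the same phenomenon.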

The suppression phenomenon often causes confusion because of its paradoxical effects on regression results (Cohen et al. 2003). Recent analyses in multivariate applications have shed light on the problem (Friedman and Wall 2005; Smith et al. 2009; Ma 2011). A strong suppression effect causes instability in the predictive system. Three types of suppression have been reported: classical, cooperative, and net suppression. An example of classical suppression is discussed in the main text; the following sections discuss cooperative and net suppression, updated from a previous study (Ma 2011).

1.2.1 A6.2.1 Cooperative Suppression

In a multivariate linear regression with two predictors, cooperative suppression occurs when both predictor variables are positively correlated with the response variable but negatively correlated with each other, or when the two predictors are positively correlated with each other but correlated with the response variable in opposite signs (one positive, one negative). This happens when nontransitivity of correlation (see Chap. 4) is present.

Figure 6.7 shows an example of trivariate regression in which both the total porosity (PHIT) and Vsand are positively correlated with the effective porosity (PHIE) but are negatively correlated with each other. Knowing the correlation coefficients between each pair of the three variables (PHIT, Vsand, and PHIE), it is straightforward to obtain the weighting coefficients using the matrix equations (Eqs. 6.13 and 6.14 in the main text):

$$ \begin{aligned} &\begin{pmatrix} 1 & -0.150 \\ -0.150 & 1 \end{pmatrix} \begin{pmatrix} \beta_t \\ \beta_v \end{pmatrix} = \begin{pmatrix} 0.520 \\ 0.431 \end{pmatrix} \\ &\begin{pmatrix} \beta_t \\ \beta_v \end{pmatrix} = \begin{pmatrix} 1.023 & 0.153 \\ 0.153 & 1.023 \end{pmatrix} \begin{pmatrix} 0.520 \\ 0.431 \end{pmatrix} = \begin{pmatrix} 0.598 \\ 0.521 \end{pmatrix} \\ &\mathrm{PHIE}^{\ast} = m_p + 0.598\,\frac{\sigma_p}{\sigma_t}\left(\mathrm{PHIT} - m_t\right) + 0.521\,\frac{\sigma_p}{\sigma_v}\left(V\mathrm{sand} - m_v\right) \end{aligned} $$
(6.22)

where $m_t$, $m_v$, and $m_p$ are the mean values, and $\sigma_t$, $\sigma_v$, and $\sigma_p$ are the standard deviations of PHIT, Vsand, and PHIE, respectively.

Fig. 6.7 Crossplots between each pair of three well logs and the regression variable in a subsurface formation. (a) Total porosity (PHIT) versus effective porosity (PHIE). (b) Vsand versus PHIE. (c) PHIT versus Vsand. (d) PHIE versus its trivariate regression from Eq. 6.22. (e) Illustration of correlation, suppression, and redundancy using Venn diagrams for the cooperative suppression example: (f + g) represents the squared correlation between PHIE and PHIT, (h + g) the squared correlation between PHIE and Vsand, and (k + g) the squared correlation between PHIT and Vsand. When the two predictors are correlated positively, redundancy is dominant and the area g is active; when they are correlated negatively, mutual suppression is dominant and the area k is active. Notice the direct effects of PHIT and Vsand on the prediction of PHIE, and the interaction between PHIT and Vsand. (Adapted from Ma (2011))

Both regression coefficients are greater than the corresponding correlations because of the mutual suppression between the two predictor variables. PHIT has a weight of 0.598, compared with its correlation coefficient of 0.520, which would be its weight in a bivariate linear regression. Similarly, the weight of Vsand increases to 0.521 from its correlation coefficient of 0.431 with PHIE. A comparison of R-squared values also shows a large gain from the trivariate regression with the two predictor variables (Table 6.4).

Table 6.4 Summary statistics for the trivariate linear regression (Eq. 6.22)

Consider a hypothetical case in which the two predictor variables are positively correlated at 0.15 (instead of −0.15 in the real case). Then the regression coefficients for the two predictors are smaller than their correlations with the response variable (Table 6.4). This is because the correlation transitivity condition is satisfied (the two predictors are positively correlated); suppression is subdued, and redundancy is the main actor.
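Both cases can be checked by solving the correlation-form normal equations directly. The sketch below reproduces the coefficients of Eq. 6.22 and the hypothetical positive-correlation variant (the helper function name is mine, not from the chapter):

```python
import numpy as np

def std_coeffs(r12, r1y, r2y):
    # Standardized regression coefficients from the correlation-form
    # normal equations (Eqs. 6.13-6.14 in the main text): R beta = r,
    # where R is the predictor correlation matrix and r holds the
    # predictor-response correlations.
    R = np.array([[1.0, r12], [r12, 1.0]])
    r = np.array([r1y, r2y])
    return np.linalg.solve(R, r)

# Real case: PHIT and Vsand correlated at -0.15 (cooperative suppression);
# both coefficients exceed the raw correlations 0.520 and 0.431.
b = std_coeffs(-0.150, 0.520, 0.431)
print(np.round(b, 3))    # -> [0.598 0.521]

# Hypothetical case: predictors correlated at +0.15 (redundancy dominates);
# both coefficients fall below the raw correlations.
b2 = std_coeffs(0.150, 0.520, 0.431)
print(np.round(b2, 3))
```

Flipping only the sign of the predictor intercorrelation moves the system from mutual suppression (inflated weights) to redundancy (deflated weights), with the response correlations unchanged.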

Comparing classical and cooperative suppression: in classical suppression, the third variable is essentially uncorrelated with the response variable, and the suppressor acts indirectly on the prediction through the other predictor. In contrast, when both predictors are positively correlated with the response variable, either cooperative suppression or redundancy occurs, depending on whether the transitivity condition is satisfied. When the two predictors are positively correlated, redundancy dominates; when they are negatively correlated, cooperative suppression dominates. In such three-way correlations (no nil correlation), the direct effect of the predictors on the prediction is either mutual suppression or redundancy, depending on the transitivity condition.

1.2.2 A6.2.2 Net Suppression

In multivariate regression, net suppression occurs quite often, especially when one predictor variable has a low correlation with the response variable. In the classical suppression example presented in the main text, the resistivity has a very small positive (almost nil) correlation with porosity. If that correlation were slightly negative, the resistivity’s weighting coefficient in the linear regression would change only slightly and remain positive, which would be a net suppression. As discussed in Chap. 4, correlation can be sensitive to the sampling scheme, whether through sampling bias or missing values; thus, a small correlation can easily flip between positive and negative in practice.

Ma (2011) reported another resistivity log that had a small negative correlation of −0.073 with PHI in the same study presented in the main text (Sect. 6.3.2). For the linear regression of PHI on Vsand and this resistivity log (named Resistivity2), the regression equation is:

$$ \mathrm{PHI}^{\ast} = m_p + 0.735\,\frac{\sigma_p}{\sigma_v}\left(V\mathrm{sand} - m_v\right) + 0.228\,\frac{\sigma_p}{\sigma_r}\left(\mathrm{Resistivity2} - m_r\right) $$
(6.23)

where $m_v$, $m_r$, and $m_p$ are the mean values, and $\sigma_v$, $\sigma_r$, and $\sigma_p$ are the standard deviations of Vsand, Resistivity2, and PHI, respectively.

Notice the reversal of the regression coefficient of Resistivity2 to positive, despite its negative correlation with the response variable, PHI. The R-squared has also increased (Table 6.5).

Table 6.5 Summary statistics for the trivariate linear regression (Eq. 6.23)
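The sign reversal can be illustrated with the same correlation-form normal equations. Since this appendix does not report r(PHI, Vsand) or r(Vsand, Resistivity2), the values below are hypothetical, chosen only to show how a small negative correlation with the response can turn into a positive regression weight:

```python
import numpy as np

# Hypothetical correlations for illustration; only r_ry (-0.073) is the
# value reported in the text for Resistivity2 versus PHI.
r_vy = 0.70    # Vsand vs PHI (hypothetical)
r_ry = -0.073  # Resistivity2 vs PHI (from the text)
r_vr = -0.45   # Vsand vs Resistivity2 (hypothetical)

# Solve R beta = r with R the predictor correlation matrix.
R = np.array([[1.0, r_vr], [r_vr, 1.0]])
beta = np.linalg.solve(R, np.array([r_vy, r_ry]))
print(np.round(beta, 3))
# The Resistivity2 coefficient is positive despite its negative
# correlation with PHI -- a net suppression sign reversal.
```

The reversal arises because the negative Vsand–Resistivity2 correlation lets Resistivity2 remove noise from Vsand’s prediction, outweighing its own weak negative relation to PHI.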

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this chapter

Ma, Y.Z. (2019). Regression-Based Predictive Analytics. In: Quantitative Geosciences: Data Analytics, Geostatistics, Reservoir Characterization and Modeling. Springer, Cham. https://doi.org/10.1007/978-3-030-17860-4_6