To further assess the robustness of my model, I ran a number of post-estimation tests. To look for influential observations that could be skewing the results, I calculated Cook’s distance (D) for each observation. This statistic measures how much the parameter estimates of a model would change if a particular observation were dropped from the analysis. A general rule of thumb is that observations with D values larger than 4/n, where n is the number of observations, are considered influential. Out of 700 observations, 40 had a value of D higher than 4/700 (approximately 0.0057). I also created added-variable plots to examine which coefficient estimates seemed particularly influenced by a few observations. The plots for trade openness, urban population, democracy, and the capital-labor ratio showed nothing unusual. Only the plot for the labor-land ratio stood out.
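As an illustration of the 4/n rule of thumb, the following is a minimal sketch of how Cook's distance can be computed and thresholded. It assumes a simple regression with a single predictor and uses made-up data; the function names `cooks_distance` and `influential` are hypothetical, and this is not the code used for the analysis reported here.

```python
def cooks_distance(x, y):
    """Cook's distance for each observation in a simple OLS fit y = a + b*x."""
    n = len(x)
    p = 2  # number of estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each observation (diagonal of the hat matrix).
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]


def influential(x, y):
    """Indices of observations whose Cook's distance exceeds 4/n."""
    threshold = 4 / len(x)
    return [i for i, d in enumerate(cooks_distance(x, y)) if d > threshold]


# Made-up data: nine points close to a line plus one distant, off-trend point.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.2, 5.0]
flagged = influential(x, y)
```

With several predictors the same quantities come from the full hat matrix, but the thresholding step is unchanged.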
Models with different ways of measuring trade openness
The United Arab Emirates, Qatar, and Kuwait all seem to be influencing the estimated relationship between education inequality and the land-labor ratio. The proper response to influential data is not to drop the data points, since that would also introduce bias, but rather to try other estimation techniques that are more robust to extreme values. One such technique is median regression. Rather than modeling the mean of y as a function of x, median regression models the median of y conditional on x. This means that it minimizes the sum of absolute deviations rather than the sum of squared errors. In other words, a median regression line is placed so that roughly the same number of residuals lie above and below it; because large residuals are not squared, outliers have much less power to pull the regression line toward themselves. Using quantile regression, it is possible to model not only the median but any quantile, in order to see whether the parameter estimates differ across quantiles. Table 5.13 shows the results of three models for different quantiles: the 25th percentile, the median, and the 75th percentile. The results are all fairly similar and consistent with my main model.
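To make the contrast between the two criteria concrete, the sketch below fits the simplest possible quantile model, a single constant with no predictors, by minimizing the check (pinball) loss that quantile regression uses. The data are made up, the function names are hypothetical, and the search is restricted to observed values for simplicity; it is meant only to show why an extreme value moves the mean far more than the median.

```python
def pinball_loss(residual, q):
    """Check loss for quantile q; at q = 0.5 it equals half the absolute deviation."""
    return q * residual if residual >= 0 else (q - 1) * residual


def fit_constant_quantile(y, q):
    """Observed value c that minimizes the total check loss of the residuals y_i - c."""
    return min(y, key=lambda c: sum(pinball_loss(yi - c, q) for yi in y))


# Made-up data with one extreme value.
y = [2.0, 3.0, 4.0, 5.0, 100.0]
median_fit = fit_constant_quantile(y, 0.5)    # minimizes absolute deviations
upper_fit = fit_constant_quantile(y, 0.75)    # 75th percentile fit
mean_fit = sum(y) / len(y)                    # minimizes squared errors
```

Here the median fit stays at 4.0 while the mean is pulled up to 22.8 by the single extreme observation.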
Models with different measures of inequality
Models accounting for average years of education
Models separating out communist countries
Table 5.13: Quantile regression models
Next, I checked the normality of the residuals in my main model. In a linear regression where the errors are identically distributed, residuals in the middle of the domain have higher variability than residuals for observations near the end points, since high-leverage points pull the regression line toward themselves. For this reason, I used studentized residuals, which are residuals rescaled by their expected variability, which depends on whether an observation falls in the middle of the domain or near one of the ends. The distribution of the studentized residuals looked fairly normal.
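The adjustment described above can be written out as follows. This is the standard textbook form of the internally studentized residual, given here as an assumption since the text does not specify which variant was used; h_{ii} denotes the leverage of observation i (the i-th diagonal element of the hat matrix), e_i the raw residual, and s the estimated residual standard deviation.

```latex
\operatorname{Var}(e_i) = \sigma^2 \,(1 - h_{ii}),
\qquad
r_i = \frac{e_i}{s \sqrt{1 - h_{ii}}}
```

Because h_{ii} is larger for observations near the ends of the domain, their raw residuals have smaller variance, and dividing by \sqrt{1 - h_{ii}} puts all residuals on a common scale.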
Finally, I tested for heteroskedasticity, since this would be a violation of the assumption of constant variance for the errors. Scatter plots of the residuals versus the fitted values, the dependent variable, and the independent variables did not indicate any strong patterns.
Nevertheless, the Breusch-Pagan test for heteroskedasticity had a p-value reported as 0.000 (that is, below 0.001), indicating that the null hypothesis of constant variance could be rejected. Therefore, I re-ran the model with standard errors clustered by country, as seen in Table 5.14. All of the variables that were statistically significant in the original model remained so, except for the percentage of the population living in urban areas.
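For readers unfamiliar with the test, the following is a rough sketch of the idea behind it: regress the squared residuals on the predictors and check whether they explain a meaningful share of the variation. The sketch assumes a single predictor and uses Koenker's studentized variant (n times the R-squared of the auxiliary regression), which may differ from the variant used for the result reported above; the function names and data are made up for illustration.

```python
import math


def ols_simple(x, y):
    """Fit y = a + b*x by least squares; return (fitted values, R-squared)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return fitted, 1 - ss_res / ss_tot


def breusch_pagan_koenker(x, y):
    """Return (LM statistic, p-value) for the null of constant error variance."""
    fitted, _ = ols_simple(x, y)
    sq_resid = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]
    # Auxiliary regression: do the squared residuals vary with x?
    _, r2_aux = ols_simple(x, sq_resid)
    lm = max(len(x) * r2_aux, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function with 1 degree of freedom (one predictor).
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


# Made-up data: error size grows with x in the first series, is constant in the second.
x = list(range(1, 41))
y_hetero = [2 * xi + (0.5 * xi if xi % 2 == 0 else -0.5 * xi) for xi in x]
y_homo = [2 * xi + (1 if xi % 2 == 0 else -1) for xi in x]
lm_hetero, p_hetero = breusch_pagan_koenker(x, y_hetero)
lm_homo, p_homo = breusch_pagan_koenker(x, y_homo)
```

A small p-value, as for the first series, leads to rejecting constant variance, which is what motivates the use of robust or clustered standard errors.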
Table 5.14: Main model with clustered standard errors
Cross-sectional regression models 1980–1995
Cross-sectional regression models 2000–2010