
Abstract

The past two chapters have provided the necessary technical background for a consideration of statistical procedures that can be especially effective in criminal justice forecasting. The joint probability distribution model, data partitioning, and asymmetric costs should now be familiar. These features combine to make tree-based methods of recursive partitioning the fundamental building blocks for the machine learning procedures discussed. The main focus is random forests. Stochastic gradient boosting and Bayesian trees are discussed briefly as worthy competitors to random forests. Although neural nets and deep learning are not tree-based, they are also considered. Current claims about remarkable performance need to be dispassionately addressed, especially in comparison to tree-based methods.


Notes

  1. When pruning, one works back up the tree, reversing splits for which at least one of the two terminal nodes does not make a sufficient contribution to the fit. This is a way to increase tree stability, but there are better ways using machine learning.
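     A minimal sketch of one common implementation, cost-complexity pruning with the rpart package; the object names (rt1, temp2, Fail) are hypothetical stand-ins, not code from the book.

       # Sketch only: grow a large classification tree, then prune it back
       # using a complexity parameter (cp) chosen after inspecting the cp table.
       library(rpart)
       rt1 <- rpart(Fail ~ ., data = temp2, method = "class",
                    control = rpart.control(cp = 0.001))   # grow a big tree
       printcp(rt1)                                        # inspect the cp table
       rt1.pruned <- prune(rt1, cp = 0.01)                 # reverse weak splits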

  2. In some treatments (Hastie et al. 2009, pp. 310–311), the matrix is denoted by L despite the same symbol being used for the loss function in discussions of expected prediction error.

  3. Some software allows the loss matrix to be entered directly. There can be some formatting requirements, however, that make that approach a little tedious.
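     For example, rpart accepts a loss matrix through its parms argument. A sketch with hypothetical names and a hypothetical 10-to-1 cost ratio:

       # Sketch only: a loss matrix with zeros on the diagonal; one kind of
       # classification error is treated as 10 times more costly than the other
       # (which cell is which depends on the ordering of the factor levels).
       library(rpart)
       L <- matrix(c(0, 10,
                     1,  0), nrow = 2, byrow = TRUE)
       rt2 <- rpart(Fail ~ ., data = temp2, method = "class",
                    parms = list(loss = L))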

  4. And yes, all four proportions will usually differ.

  5. There are several recent variants on the classic algorithm, which are mostly of theoretical interest to statisticians and computer scientists. But work on random forests continues. Some practical advances can be expected.

  6. Classification importance is often called “forecasting performance.” But it is not really about forecasting because the outcome is already known. I too have been guilty of this naming inaccuracy.

  7. In R, the code for variable importance plots, available from the random forests procedure, would take something like the following form: varImpPlot(rf1, type=1, class="Fail", scale=FALSE, main="Importance Plot for Violent Outcome"), where rf1 is the name of the saved output object from random forests, type=1 calls for the mean decrease in accuracy, class="Fail" is the outcome class "Fail" for which the importance measures are being requested, scale=FALSE requests no standardization, and main="Importance Plot for Violent Outcome" is the plot’s heading.
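     A minimal sketch of the full workflow with the randomForest package, assuming a training data frame temp2 whose outcome factor (here called Fail) has a class labeled "Fail"; the names and tuning values are stand-ins taken from these notes, not code from the book.

       # Sketch only: fit random forests with importance measures turned on,
       # then request the importance plot described in this note.
       library(randomForest)
       set.seed(531)                          # arbitrary seed for reproducibility
       rf1 <- randomForest(Fail ~ ., data = temp2,
                           ntree = 500,       # number of trees in the ensemble
                           importance = TRUE) # permutation-based importance
       varImpPlot(rf1, type = 1, class = "Fail", scale = FALSE,
                  main = "Importance Plot for Violent Outcome")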

  8. It is sometimes desirable in regression analysis to drop one or more regressors from a model to see how a smaller model performs compared to the larger model. When this is done, both the set of predictors and the model itself change; the impact of the predictors and of the model are confounded. When, in random forests, predictors are shuffled before being dropped down a fixed ensemble of trees, there is no such confounding.

  9. The terms partial response plot and partial dependence plot can be used interchangeably.

  10. This avoids the problem of having to choose a reference category.

  11. In R, the code would look something like this: partialPlot(rf1, pred.data=temp2, x.var=Age, which.class="Fail", main="Dependence Plot for Age"), where rf1 is the name of the saved random forests output object, pred.data=temp2 calls the input dataset, usually the training data (here, temp2), x.var=Age indicates that the plot is being requested for the predictor called "Age", which.class="Fail" specifies the outcome class, here "Fail", for which the plot is being requested, and main="Dependence Plot for Age" is the heading of the plot.
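     A minimal sketch, continuing the hypothetical rf1 and temp2 from Note 7; the second predictor name (Priors) is invented purely for illustration.

       # Sketch only: partial dependence plots for the "Fail" class,
       # one call per predictor of interest.
       library(randomForest)
       partialPlot(rf1, pred.data = temp2, x.var = Age,
                   which.class = "Fail", main = "Dependence Plot for Age")
       partialPlot(rf1, pred.data = temp2, x.var = Priors,   # hypothetical predictor
                   which.class = "Fail", main = "Dependence Plot for Priors")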

  12. Recall that there are three sources of the instability: the sampling of the training data, the sampling of the OOB data, and the sampling of predictors.

  13. The name “boosting” comes from the ability of the procedure to “boost” the performance of otherwise “weak learners” in the process by which they are aggregated.

  14. This serves much the same purpose as the sampling with replacement used in random forests. A smaller sample is adequate because when sampling without replacement, no case is selected more than once; there are no “duplicates.”

  15. For recursive partitioning with a numeric outcome variable, the loss function is the usual OLS error sum of squares. At each potential partitioning, the predictor and split that most reduce the error sum of squares are selected. Means rather than proportions are the summary statistics computed in the terminal nodes.
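     A minimal sketch with the rpart package, assuming a numeric outcome y in a hypothetical data frame temp2:

       # Sketch only: a regression tree grown under the error-sum-of-squares loss.
       library(rpart)
       rt3 <- rpart(y ~ ., data = temp2, method = "anova",   # "anova" = numeric outcome
                    control = rpart.control(cp = 0.01, minbucket = 20))
       print(rt3)     # each terminal node reports a mean, not a proportion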

  16. This improves performance by helping to ensure that the fitted values do not overshoot the best optimization path. Slow and steady wins the race.

  17. Stochastic gradient boosting uses an approximation of the gradient descent optimization algorithm.
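     A minimal sketch with the gbm package (Ridgeway 2007), using hypothetical names and tuning values; shrinkage is the slow-learning step size of Note 16, and bag.fraction is the without-replacement sampling fraction of Note 14. A 0/1 version of the outcome, here called fail01, is assumed.

       # Sketch only: stochastic gradient boosting for a binary outcome.
       library(gbm)
       set.seed(2)
       gb1 <- gbm(fail01 ~ ., data = temp2,
                  distribution = "bernoulli",   # binary outcome
                  n.trees = 3000,               # number of boosting iterations
                  interaction.depth = 3,        # size of each small tree
                  shrinkage = 0.01,             # slow learning (Note 16)
                  bag.fraction = 0.5)           # sampling without replacement (Note 14)
       best.iter <- gbm.perf(gb1, method = "OOB")  # rough choice of stopping point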

  18. Following Hastie et al. (2009, p. 298), suppose there are t = 1, …, T trees and i = 1, …, N observations (e.g., 50 trees and 500 observations). We are seeking the set of \(\hat{f}_{t}\) functions linking each of, say, 50 sets of fitted values from the 50 predictor trees to the outcome.

    1. Initialize: \(\hat{\alpha} = \text{prop}(y_{i}),\ \hat{f}_{t} \equiv 0,\ \forall\, i, t\). The value of \(\hat{\alpha}\) is the outcome variable proportion over all observations. This does not change. The functions for each predictor are initially set to zero.

    2. Cycle: t = 1, …, T, 1, …, T, …

       $$\displaystyle \begin{aligned} \begin{array}{c} \hat{f}_{t} \leftarrow S_{t}\!\left(y_{i}- \hat{\alpha} - \sum_{r \ne t} \hat{f}_{r}(x_{ir})\right),\\ \hat{f}_{t} \leftarrow \hat{f}_{t}- \frac{1}{N} \sum_{i=1}^{N} \hat{f}_{t}(x_{it}), \end{array} \end{aligned}$$

       where \(S_{t}\) is a smoother. Continue Step 2 until the individual functions do not change. The cycling depends on constructing a long sequence of pseudo-residuals and fitting those with a smoother such as smoothing splines or lowess.
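     A toy sketch in R of this backfitting cycle, with simulated data and a smoothing-spline smoother standing in for the book's setup; none of the names or values below come from the book. The columns of X play the role of the T sets of fitted values from the predictor trees, and y is a 0/1 outcome.

       set.seed(7)
       N <- 500; n.trees <- 5
       X <- matrix(rnorm(N * n.trees), N, n.trees)
       y <- rbinom(N, 1, plogis(X[, 1] - 0.5 * X[, 2]))

       alpha.hat <- mean(y)                  # proportion of 1s; never updated
       f.hat <- matrix(0, N, n.trees)        # each column is one f_t, starting at zero

       for (pass in 1:20) {                  # cycle t = 1,...,T repeatedly
         for (t in 1:n.trees) {
           pseudo <- y - alpha.hat - rowSums(f.hat[, -t, drop = FALSE])  # pseudo-residuals
           st <- smooth.spline(X[, t], pseudo)                           # the smoother S_t
           f.hat[, t] <- predict(st, X[, t])$y
           f.hat[, t] <- f.hat[, t] - mean(f.hat[, t])                   # center f_t
         }
       }
       fitted <- alpha.hat + rowSums(f.hat)  # combined fitted values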

  19. Work is underway on theory that might be used for multinomial outcomes. In principle, the prior tree distribution could be altered to allow for asymmetric costs.

  20. Another very good machine learning candidate is support vector machines. There is no ensemble of trees; other means are employed to explore the sample space effectively. Hastie et al. (2009, pp. 417–436) provide an excellent overview. What experience there is comparing support vector machines to the tree-based methods emphasized here suggests that usually they all perform about the same. When they do not, the differences are typically too small to matter much in practice or are related to unusual features of the data or simulation. But a major difficulty with support vector machines is that they do not scale up well. They do not play well with data having a large number of observations (e.g., 100,000).

  21. Some implementations of neural networks allow some of the weights to be fixed in advance.

  22. The details of backpropagation are not hard to understand but are beyond the scope of this discussion. A good treatment can be found in Bishop (2006, Sect. 5.3).

  23. In practice, it can be important to standardize all inputs. Otherwise, the algorithm can have much more trouble converging.
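     A minimal sketch with the nnet package, assuming a hypothetical data frame temp2 with a binary factor outcome Fail and numeric predictors; the tuning values are placeholders.

       # Sketch only: standardize the numeric inputs, then fit a small
       # single-hidden-layer network.
       library(nnet)
       X <- scale(temp2[, sapply(temp2, is.numeric)])   # mean 0, sd 1 inputs
       dat <- data.frame(Fail = temp2$Fail, X)
       set.seed(13)
       nn1 <- nnet(Fail ~ ., data = dat,
                   size = 10,       # number of hidden units
                   decay = 0.01,    # weight decay (regularization)
                   maxit = 500)     # iteration limit for the optimizer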

  24. Some software builds in reasonable defaults for most of the tuning parameters, which at least provides a sensible base for subsequent tuning.

  25. In some cases, the rows are assembled in chronological order. Then the row numbers contain information that might improve forecasting accuracy. But usually there will also be a date variable, such as when an offender was released on parole. As long as that information is available as a date variable for all offenders, shuffling all of the rows at random does not affect the forecasts. The chronological information is carried along in the date variable and can be used by a forecasting algorithm.
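     A one-line sketch, assuming a hypothetical data frame temp2 that includes a date column such as release_date:

       # Sketch only: shuffle the rows; the date column travels with each row,
       # so no chronological information is lost.
       set.seed(42)
       temp2 <- temp2[sample(nrow(temp2)), ]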

  26. Criminal justice applications for machine learning are evolving very quickly. It is very hard to keep up. Literally, 8 days after writing this paragraph there was an article in the New Yorker Magazine describing related ideas for tasers. Body cameras are placed directly on tasers. But that is just the beginning. The business firm Axon Enterprise (previously called Taser International) is “building a network of electrical weapons, cameras, drones, and someday, possibly, robots, connected by a software platform called Evidence.com” (Goodyear 2018, p. 4).

  27. There are extensions of recurrent neural networks called “long short-term memory networks” that in many circumstances can improve performance. The basic idea is that within each hidden layer some information is discarded because it is no longer needed, and some is stored to pass forward in time because it is needed. In the example of the word “quit,” there may be no reason to retain the last letter of the previous word (e.g., the “u” in “you”). But you probably want to store the “q.” In other words, the algorithm is getting some help in determining what information is important.

References

  • Berk, R. A., & Bleich, J. (2013) Statistical procedures for forecasting criminal behavior: a comparative assessment. Criminology & Public Policy 12(3): 515–544.

  • Bishop, C. M. (2006) Pattern Recognition and Machine Learning. New York: Springer.

  • Breiman, L. (1996) Bagging predictors. Machine Learning 26: 123–140.

  • Breiman, L. (2001a) Random forests. Machine Learning 45: 5–32.

  • Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984) Classification and Regression Trees. Monterey, CA: Wadsworth Press.

  • Chen, T., & Guestrin, C. (2016) XGBoost: a scalable tree boosting system. arXiv:1603.02754v3 [cs.LG].

  • Chipman, H. A., George, E. I., & McCulloch, R. E. (1998) Bayesian CART model search (with discussion). Journal of the American Statistical Association 93: 935–960.

  • Chipman, H. A., George, E. I., & McCulloch, R. E. (2010) BART: Bayesian additive regression trees. Annals of Applied Statistics 4(1): 266–298.

  • Culp, M., Johnson, K., & Michailidis, G. (2006) ada: an R package for stochastic boosting. Journal of Statistical Software 17(2): 1–27.

  • Friedman, J. H. (2002) Stochastic gradient boosting. Computational Statistics and Data Analysis 38: 367–378.

  • Goodyear, D. (2018) Can the manufacturer of tasers provide the answer to police abuse? New Yorker Magazine, August 27, 2018. https://www.newyorker.com/magazine/2018/08/27/can-the-manufacturer-of-tasers-provide-the-answer-to-police-abuse

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009) The Elements of Statistical Learning, Second Edition. New York: Springer.

  • Ho, T. K. (1998) The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8): 832–844.

  • Mease, D., Wyner, A. J., & Buja, A. (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8: 409–439.

  • Ridgeway, G. (2007) Generalized boosted models: a guide to the gbm package. cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf

  • Scharre, P. (2018) Army of None. New York: Norton.

  • Tan, M., Chen, B., Pang, R., Vasudevan, V., & Le, Q. V. (2018) MnasNet: platform-aware neural architecture search for mobile. arXiv:1807.11626v1 [cs.CV].


Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Berk, R. (2019). Tree-Based Forecasting Methods. In: Machine Learning Risk Assessments in Criminal Justice Settings. Springer, Cham. https://doi.org/10.1007/978-3-030-02272-3_5


  • DOI: https://doi.org/10.1007/978-3-030-02272-3_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02271-6

  • Online ISBN: 978-3-030-02272-3

  • eBook Packages: Computer Science (R0)
