
Supervised Learning

Introduction to Data Science

Part of the book series: Undergraduate Topics in Computer Science (UTICS)

Abstract

In this chapter, we introduce the basics of classification, a type of supervised machine learning. We also give a brief practical tour of learning theory and of good practices for the successful use of classifiers in a real case, using Python. The chapter starts by introducing the classic machine learning pipeline, defining features, and evaluating the performance of a classifier. We then introduce the notion of generalization error, which allows us to present learning curves in terms of the number of examples and the complexity of the classifier, and to define overfitting. That notion in turn allows us to develop a strategy for model selection. Finally, two of the best-known techniques in machine learning are introduced: support vector machines and random forests. These are then applied to the proposed problem of predicting which loans will not be successfully covered once they have been accepted.


Notes

  1.

    https://www.lendingclub.com/info/download-data.action.

  2.

    Several well-known techniques, such as support vector machines or adaptive boosting (AdaBoost), are originally defined for the binary case. Any binary classifier can be extended to the multiclass case in two different ways. We may change the formulation of the learning/optimization process, which requires deriving a new learning algorithm capable of handling the new modeling. Alternatively, we may adopt ensemble techniques: divide the multiclass problem into several binary problems, solve them, and then aggregate the results. Readers interested in these techniques should look up the one-versus-all, one-versus-one, and error-correcting output codes methods.
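
    As a minimal sketch of the ensemble approach (assuming scikit-learn and its built-in iris dataset, which are not part of this chapter's loan example), one-versus-all and one-versus-one strategies can wrap any binary classifier:

        from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
        from sklearn.svm import LinearSVC
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split

        X, y = load_iris(return_X_y=True)  # three classes
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # One-versus-all: one binary classifier per class.
        ova = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)
        # One-versus-one: one binary classifier per pair of classes.
        ovo = OneVsOneClassifier(LinearSVC()).fit(X_train, y_train)
        print(ova.score(X_test, y_test), ovo.score(X_test, y_test))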

  3.

    Many problems are described using categorical data. In these cases, we either need classifiers capable of coping with this kind of data, or we need to change the representation of those variables into numerical values.
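
    For instance, a categorical variable can be converted into numerical values via one-hot encoding, sketched here with pandas (the column name and values are hypothetical):

        import pandas as pd

        df = pd.DataFrame({'purpose': ['car', 'wedding', 'car', 'house']})
        # One-hot encoding: one binary column per observed category.
        encoded = pd.get_dummies(df, columns=['purpose'])
        print(encoded)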

  4.

    The companion notebook shows the preprocessing steps, from reading the dataset, through cleaning and imputing the data, up to saving a subsampled clean version of the original dataset.
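
    A rough sketch of such a pipeline in pandas might look as follows (file names, thresholds, and sample size are illustrative, not those of the actual notebook):

        import pandas as pd

        df = pd.read_csv('loans.csv', low_memory=False)      # read the raw dataset
        df = df.dropna(axis=1, thresh=int(0.8 * len(df)))    # drop mostly empty columns
        df = df.fillna(df.median(numeric_only=True))         # impute numeric gaps
        # Save a subsampled clean version for the rest of the chapter.
        df.sample(n=10000, random_state=0).to_csv('loans_clean.csv', index=False)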

  5.

    The term unbalanced describes data where the ratio between positive and negative examples is very small, i.e., positives are much rarer than negatives. In these scenarios, always predicting the majority class usually yields high accuracy, though it is not very informative. This kind of problem is very common when we want to model unusual events such as rare diseases, machinery failures, fraudulent credit card operations, etc. In these scenarios, gathering data on the usual events is very easy, but collecting data on the unusual events is difficult and results in a comparatively small dataset.
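
    A toy illustration of why accuracy is misleading here, using scikit-learn's DummyClassifier (an assumption beyond the chapter's own code):

        import numpy as np
        from sklearn.dummy import DummyClassifier

        # Suppose 95 negatives and only 5 positives.
        y = np.array([0] * 95 + [1] * 5)
        X = np.zeros((100, 1))  # the features do not matter for this baseline

        baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
        print(baseline.score(X, y))  # 0.95 accuracy, yet no positive is ever detected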

  6.

    sklearn allows us to easily automate the train/test splitting using the function train_test_split(...).
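
    For example, on a toy dataset (assuming a recent scikit-learn, where the function lives in sklearn.model_selection):

        import numpy as np
        from sklearn.model_selection import train_test_split

        X = np.arange(20).reshape(10, 2)  # toy feature matrix
        y = np.arange(10) % 2             # toy binary labels

        # Hold out 30% of the examples for testing; fix the seed for reproducibility.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42)
        print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)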

  7.

    The reader should note that there are several bounds in machine learning to characterize the generalization error. Most of them come from variations of Hoeffding’s inequality.
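
    For reference, one common form of the bound: for a single fixed hypothesis with out-of-sample error $E_{\text {out}}$ and in-sample error $E_{\text {in}}$ measured on $N$ i.i.d. examples, Hoeffding's inequality gives

    $$P\big (|E_{\text {out}}-E_{\text {in}}|>\epsilon \big )\le 2e^{-2\epsilon ^2N},$$

    and a union bound over a finite set of $M$ hypotheses replaces the factor 2 by $2M$.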

  8.

    This set cannot be used to select a classifier, model or hyperparameter; nor can it be used in any decision process.

  9.

    This reduction in the complexity of the best model should not surprise us. Remember that complexity and the number of examples are intimately related for the learning to succeed. By using a test set we perform model selection with a smaller dataset than in the former case.

  10.

    These techniques have been shown to be two of the most powerful families for classification [1].

  11.

    Remember the regularization cure for overfitting.
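
    As a reminder in code (a sketch using scikit-learn's SVC on a synthetic dataset, not the chapter's loan data):

        from sklearn.svm import SVC
        from sklearn.datasets import make_classification
        from sklearn.model_selection import cross_val_score

        X, y = make_classification(n_samples=200, n_features=20, random_state=0)
        # Smaller C means stronger regularization, i.e., a simpler decision boundary.
        for C in (0.01, 1.0, 100.0):
            print(C, cross_val_score(SVC(C=C), X, y, cv=5).mean())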

  12.

    Note the strict inequalities in the formulation. Informally, consider the constraint that is satisfied with the smallest value; all the others must then be satisfied with a larger one. Because rescaling $(a, b)$ does not change the separating hyperplane, we can arbitrarily set that smallest value to 1 and rewrite the problem as

    $$a^Ts_i+b\ge 1\; \text {and}\; a^Tr_i+b\le -1.$$
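
    These rescaled constraints lead to the standard hard-margin formulation: maximizing the margin is equivalent to

    $$\min _{a,b}\ \frac{1}{2}\Vert a\Vert ^2\quad \text {subject to}\quad a^Ts_i+b\ge 1\; \text {and}\; a^Tr_i+b\le -1.$$
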
  13.

    It is worth mentioning another useful tool for visualizing the trade-off between true positives and false positives in order to choose the operating point of the classifier: the receiver operating characteristic (ROC) curve. This curve plots the true positive rate, also called sensitivity or recall (TP/(TP+FN)), against the false positive rate (FP/(FP+TN)).
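
    A minimal sketch of computing this curve with scikit-learn (the labels and scores are made up):

        import numpy as np
        from sklearn.metrics import roc_curve, auc

        y_true = np.array([0, 0, 1, 1, 0, 1])
        scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # classifier scores

        # Each threshold on the scores gives one (FPR, TPR) operating point.
        fpr, tpr, thresholds = roc_curve(y_true, scores)
        print(auc(fpr, tpr))  # area under the ROC curve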

Reference

  1. M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, "Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?", Journal of Machine Learning Research 15, 3133–3181 (2014). http://jmlr.org/papers/v15/delgado14a.html


Acknowledgements

This chapter was co-written by Oriol Pujol and Petia Radeva.

Author information

Correspondence to Laura Igual.


Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Igual, L., Seguí, S. (2017). Supervised Learning. In: Introduction to Data Science. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-50017-1_5

  • DOI: https://doi.org/10.1007/978-3-319-50017-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-50016-4

  • Online ISBN: 978-3-319-50017-1

