Machine Learning Basics

Part of the book series: Undergraduate Topics in Computer Science (UTICS)

Abstract

This chapter explores the fundamentals of machine learning, since deep learning is, above everything else, a technique for machine learning. We explore the idea of classification and what it means for a classifier to classify data, and then proceed to evaluating the performance of a general classifier. The first actual classifier we present is naive Bayes (which includes a general discussion on data encoding and normalization), and we also present the simplest neural network, logistic regression, which is the bread and butter of deep learning. We introduce the classic MNIST dataset of handwritten digits, the so-called ‘fruit fly of machine learning’. We also present two showcase techniques of unsupervised learning: K-means, to explain clustering and the general principle of learning without labels, and principal component analysis (PCA), to explain how to learn representations. PCA is explored in more detail later on. We conclude with a brief exposition of how to represent language for learning with the bag of words model.

Notes

  1.

    You may wonder how a side gets a label. This procedure is different for the various machine learning algorithms and has a number of peculiarities, but for now you may just think that the side will get the label which the majority of datapoints on that side have. This will usually be true, but it is not an elegant definition. One case where this is not true is when you have only one dog with two cats overlapping it (in 2D space) and four other cats. Most classifiers will place the dog and the two overlapping cats in the category ‘dog’. Cases like this are rare, but they may be quite meaningful.

  2.

    A dataset is simply a set of datapoints, some labelled, some unlabelled.

  3.

    Noise is just a name for the random fluctuations present in the data. These imperfections happen, and we do not want to learn to predict the noise, but rather the elements that are actually relevant to what we want.

  4.

    It does not have to be a perfect separation; a good separation will do.

  5.

    Think about how one-hot encoding can boost the understanding of n-dimensional space.

  6.

    Deep learning is no exception.

  7.

    Notice that to do one-hot encoding, we need to make two passes over the data: the first pass collects the names of the new columns, then we create the columns, and then we make a second pass over the data to fill them.
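    As a rough illustration (a minimal sketch in plain Python, not the chapter's code; the feature names and values are made up), the two passes could look like this:

        # Sketch of two-pass one-hot encoding for a hypothetical 'colour' feature.
        rows = [{"height": 54, "colour": "brown"},
                {"height": 47, "colour": "black"},
                {"height": 15, "colour": "brown"}]

        # First pass: collect the names of the new columns.
        values = sorted({row["colour"] for row in rows})       # ['black', 'brown']
        new_columns = ["colour_" + v for v in values]          # ['colour_black', 'colour_brown']

        # Second pass: fill the new columns with 0s and 1s and drop the original one.
        encoded = []
        for row in rows:
            new_row = {c: 0 for c in new_columns}
            new_row["colour_" + row["colour"]] = 1
            new_row["height"] = row["height"]
            encoded.append(new_row)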

  8.

    Strictly speaking, these vectors would not look exactly the same: the training sample would be (54,17,1,0,0, Dog), which is a row vector of length 6, and the row vector for which we want to predict the label would have to be of length 5 (without the last component which is the label), e.g. (47,15,0,0,1).

  9.

    If we need more precision, we will keep more decimals, but in this book we will usually round off to four.

  10.

    It is mostly a matter of choice; there is no objective way of determining how much to split.

  11.

    The prior probability is just a matter of counting. If you have a dataset with 20 datapoints and in some feature there are five values of ‘New Vegas’ while the others (15 of them) are ‘Core region’, the prior probability \(\mathbb{P}(\text{New Vegas})=0.25\).
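    In code, this counting could look like the following minimal sketch (plain Python, using the values from the example above):

        # Sketch: the prior probability is just counting.
        region = ["New Vegas"] * 5 + ["Core region"] * 15   # 20 datapoints in total
        prior_new_vegas = region.count("New Vegas") / len(region)
        print(prior_new_vegas)                              # 0.25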

  12.

    If we were to have n features, this would be an n-dimensional row vector such as \((x_1,x_2,\ldots , x_n)\), but now we have only one feature, so we have a 1D row vector of the form \((x_1)\). A 1D vector is exactly the same as the scalar \(x_1\), but we keep referring to it as a vector to emphasize that in the general case it would be an n-dimensional vector.

  13.

    That is, the assumption that features are conditionally independent given the target.

  14.

    Regression problems can be simulated with classification. For example, if we had to find the proper value between 0 and 1 rounded to two decimals, we could treat it as a 100-class classification problem. The opposite also holds, and we have actually seen this in the naive Bayes section, where we had to pick a threshold above which we would consider the output a 1 and below which it would be a 0.
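    As a minimal sketch (an illustration, not code from the chapter), the rounding trick maps the continuous target to a class index and back:

        # Sketch: a value in [0, 1], rounded to two decimals, becomes a class label.
        def value_to_class(y):
            return int(round(y * 100))   # e.g. 0.4271 -> class 43 (i.e. 0.43)

        def class_to_value(c):
            return c / 100               # back from the class label to the rounded value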

  15.

    Afterwards, we may do a bit of feature engineering and use an altogether different model. This is important when we do not have an understanding of the data we use, which is often the case in industry.

  16.

    We will see later that logistic regression has more than one neuron, since each component of the input vector will have to have an input neuron, but it has ‘one’ neuron in the sense of having a single ‘workhorse’ neuron.

  17.

    If the training set consists of n-dimensional row vectors, then there are exactly \(n-1\) features—the last one is the target or label.

  18.

    Mathematically, the bias is useful for making an offset, called the intercept.

  19.

    There are other error functions that can be used, but the SSE is one of the simplest.
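    For reference, a minimal sketch of the SSE (assuming a list of targets and a list of predictions of equal length):

        # Sketch: sum of squared errors over a set of predictions.
        def sse(targets, predictions):
            return sum((t - p) ** 2 for t, p in zip(targets, predictions))

        print(sse([1, 0, 1], [0.9, 0.2, 0.6]))   # 0.01 + 0.04 + 0.16 = 0.21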

  20.

    Recall that this is not the same as a \(3\times 5\) matrix.

  21.

    In the older literature, this is sometimes called the activation function.

  22.

    See http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec1.pdf.

  23.

    Available at https://www.kaggle.com/c/digit-recognizer/data.

  24.

    The interested reader may look up the details in Chap. 4 of [10].

  25.

    But PCA itself is not that simple to understand.

  26.

    K-means (also called the Lloyd-Forgy algorithm) was first proposed independently by S. P. Lloyd in [16] and E. W. Forgy in [17].

  27.

    Usually a predefined number of times; there are other tactics as well.

  28.

    Imagine that a centroid is pinned down and connected to all its datapoints with rubber bands, and then you unpin it from the surface. It will move so that the rubber bands are less tense in total (even though individual rubber bands may become more tense).
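    Letting the rubber bands contract corresponds to the standard K-means update: the centroid moves to the mean of its assigned datapoints, which minimizes the total squared stretch. A minimal sketch in plain Python (the 2D points are made up):

        # Sketch: the centroid update is just the mean of the assigned points.
        points = [(1.0, 2.0), (2.0, 1.5), (1.5, 3.0)]   # datapoints assigned to one centroid
        new_centroid = (sum(x for x, _ in points) / len(points),
                        sum(y for _, y in points) / len(points))
        print(new_centroid)                             # (1.5, 2.1666...)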

  29.

    Recall that a cluster in K-means is a region around a centroid separated by the hyperplane.

  30.

    We have to use the same number of centroids in both clusterings for this to work.

  31.

    These features are known as latent variables in statistics.

  32.

    One of the reasons for this is that we have not yet developed all the tools we need to write out the details now.

  33.

    See Chap. 2.

  34.

    And if a feature is always the same, it has a variance of 0 and it carries no information useful for drawing the hyperplane.

  35.

    An example of an expansion of the basic bag of words model is a bag of n-grams. An n-gram is an n-tuple consisting of n words that occur next to each other. If we have the sentence ‘I will go now’, the set of its 2-grams will be \(\{(`I',`will'), (`will',`go'), (`go', `now')\}\).
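    A minimal sketch of extracting the 2-grams of a tokenized sentence (plain Python, using the sentence from the example):

        # Sketch: the 2-grams of a sentence.
        tokens = "I will go now".split()
        bigrams = list(zip(tokens, tokens[1:]))
        print(bigrams)   # [('I', 'will'), ('will', 'go'), ('go', 'now')]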

  36.

    For most language processing tasks, especially tasks requiring the use of data collected from social media, it makes sense to convert all text to lowercase first and get rid of all commas, apostrophes and other non-alphanumerics, which we have already done here.
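    A rough sketch of such preprocessing followed by bag-of-words counts (an illustration in plain Python, not the chapter's code; the sentence is made up):

        import re
        from collections import Counter

        text = "I'll go now, and I will go again."
        # Lowercase and keep only lowercase letters, digits and spaces.
        cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
        bag_of_words = Counter(cleaned.split())
        print(bag_of_words)   # Counter({'go': 2, 'ill': 1, 'now': 1, ...})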

References

  1. R. Tibshirani, T. Hastie, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. (Springer, New York, 2016)

  2. F. van Harmelen, V. Lifschitz, B. Porter, Handbook of Knowledge Representation (Elsevier Science, New York, 2008)

  3. R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, 1998)

  4. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1, 81–106 (1986)

  5. M.E. Maron, Automatic indexing: an experimental inquiry. J. ACM 8(3), 404–417 (1961)

  6. D.R. Cox, The regression analysis of binary sequences (with discussion). J. Roy. Stat. Soc. B (Methodol.) 20(2), 215–242 (1958)

  7. P.J. Grother, NIST special database 19: handprinted forms and characters database (1995)

  8. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  9. M.A. Nielsen, Neural Networks and Deep Learning (Determination Press, 2015)

  10. P.N. Klein, Coding the Matrix (Newtonian Press, London, 2013)

  11. I. Färber, S. Günnemann, H.P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek, On using class-labels in evaluation of clusterings, in MultiClust: Discovering, Summarizing, and Using Multiple Clusterings, ed. by X.Z. Fern, I. Davidson, J. Dy (ACM SIGKDD, 2010)

  12. J. Dunn, Well separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (1974)

  13. K. Pearson, On lines and planes of closest fit to systems of points in space. Phil. Mag. 2(11), 559–572 (1901)

  14. C. Manning, H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, Cambridge, 1999)

  15. D. Jurafsky, J. Martin, Speech and Language Processing (Prentice Hall, New Jersey, 2008)

  16. S.P. Lloyd, Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)

  17. E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21(3), 768–769 (1965)

Author information

Correspondence to Sandro Skansi.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Skansi, S. (2018). Machine Learning Basics. In: Introduction to Deep Learning. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-73004-2_3

  • DOI: https://doi.org/10.1007/978-3-319-73004-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73003-5

  • Online ISBN: 978-3-319-73004-2
