Machine learning (ML) algorithms perform classification. Given a large set of sensor data, an ML algorithm determines a discriminant that can classify future sensor data into the correct classes. Most ML algorithms are statistical. A simple form of ML uses the means and variances of the data from two sensors to choose the sensor that produces the better discriminant. An optimal discriminant can be obtained by combining data from two sensors using linear discriminant analysis (LDA). LDA depends on statistical properties of the samples that do not always hold. When LDA is not appropriate, perceptrons, which are related to neural networks, can be used to perform classification.
Consider a robot that recognizes and grasps yellow objects (Fig. 14.1). It can use a color camera to identify yellow objects, but the objects will appear different in different environments, such as in sunlight, in a dark room, or in a showroom. Furthermore, it is hard to precisely define what “yellow” means: what is the boundary between yellow and lemon-yellow or between yellow and orange? Rather than write detailed instructions for the robot, we would prefer that the robot learn color recognition as it is performing the task, so that it could adapt to the environment where the task takes place. Specifically, we want to design a classification algorithm that can be trained to perform the task without supplying all the details in advance.
Classification algorithms are a central topic in machine learning, a field of computer science and statistics that develops computations to recognize patterns and to predict outcomes without explicit programming. These algorithms extract rules from the raw data acquired by the system during a training period. The rules are subsequently used to classify a new object and then to take the appropriate action according to the class of the object. For the color-recognition task, we train the robot by presenting it with objects of different colors and telling the robot which objects are yellow and which are not. The machine learning algorithm generates a rule for color classification. When presented with new objects, it uses the rule to decide which objects are yellow and which are not.
This chapter assumes that you are familiar with the concepts of mean, variance and covariance. Tutorials on these concepts appear in Appendices B.3 and B.4.
14.1 Distinguishing Between Two Colors
We start with the problem of distinguishing yellow balls from non-yellow balls. To simplify the task, we modify the problem to one of distinguishing dark gray areas from light gray areas that are printed on paper taped to the floor (Fig. 14.2). The robot uses two ground sensors that sample the reflected light as the robot moves over the two areas.
14.1.1 A Discriminant Based on the Means
By examining the plots in Fig. 14.3, it is easy to see which samples are from the dark gray area and which are from the light gray area. For the left sensor, the values of the light gray area are in the range 500–550, while the values of the dark gray area are in the range 410–460. For the right sensor, the ranges are 460–480 and 380–400. For the left sensor, a threshold of 480 would clearly distinguish between light and dark gray, while for the right sensor a threshold of 440 would clearly distinguish between light and dark gray. But how can these optimal values be chosen automatically and how can we reconcile the thresholds of the two sensors?
Let us first focus on the left sensor. We are looking for a discriminant, a value that distinguishes samples from the two colors. Consider the values \( max _{dark}\), the maximum value returned by sampling dark gray, and \( min _{{light}}\), the minimum value returned by sampling light gray. Under the reasonable assumption that \( max _{dark} < min _{light}\), any value x such that \( max _{dark}< x < min _{light}\) can distinguish between the two shades of gray. The midpoint between the two values would seem to offer the most robust discriminant.
From Fig. 14.3 we see that \( max _{dark}\approx 460\) occurs at about 10 s and \( min _{{light}}\approx 500\) occurs at about 60 s, so we choose their average 480 as the discriminant. While this is correct for this particular data set, in general it is not a good idea to use the maximum and minimum values because they could be outliers: extreme values resulting from unusual circumstances, such as a hole in the paper which would incorrectly return a very high value in the dark gray area.
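The midpoint rule, and its sensitivity to a single outlier, can be sketched in a few lines of Python. The function name and the sample values are hypothetical; the values are chosen only to resemble the ranges read off Fig. 14.3 and are not the measured data.

```python
def midpoint_of_extremes(dark, light):
    """Discriminant halfway between the largest dark and the smallest light sample."""
    max_dark = max(dark)
    min_light = min(light)
    if max_dark >= min_light:
        raise ValueError("the two sets of samples overlap")
    return (max_dark + min_light) / 2


# Hypothetical samples, not the measured data from the figure.
dark = [410, 425, 440, 460]
light = [500, 515, 530, 550]
print(midpoint_of_extremes(dark, light))          # 480.0

# One outlier among the dark samples is enough to shift the discriminant.
print(midpoint_of_extremes(dark + [490], light))  # 495.0
```

The second call shows why the extremes are fragile: one unusual reading moves the threshold by 15 units, even though the other samples are unchanged.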
14.1.2 A Discriminant Based on the Means and Variances
The variance is the average of the squared distances of each sample from the mean of the samples. The distances are squared because a sample can be greater than or less than the mean, but we want a positive distance that shows how far the sample is from the mean.
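As a small illustration, the mean and variance of a list of samples might be computed as follows. This sketch assumes the sample variance with divisor n - 1; dividing by n instead gives the population variance. The sample values are hypothetical.

```python
def mean(samples):
    """Arithmetic mean of a non-empty list of samples."""
    return sum(samples) / len(samples)


def variance(samples):
    """Sample variance: squared distances from the mean, averaged with
    divisor n - 1 (an assumption; requires at least two samples)."""
    mu = mean(samples)
    return sum((x - mu) ** 2 for x in samples) / (len(samples) - 1)


readings = [428, 431, 434]   # hypothetical sensor readings
print(mean(readings))        # 431.0
print(variance(readings))    # 9.0
```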
The difference of the means and the quality criteria J

                                  Left             Right
                                  Dark    Light    Dark    Light
\(\mu \)                          431     519      383     467
\(s^{2}\)                         11      15       4       7
\(|\mu _{dark}-\mu _{light}|\)    88               84
J                                 22               104
14.1.3 Algorithm for Learning to Distinguish Colors
These computations are done by the robot itself, so the choice of the better discriminant and the better sensor is automatic. The details of the computation are given in Algorithms 14.1 and 14.2.
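The following Python sketch shows one way the learning and recognition phases could fit together. It is not a transcription of Algorithms 14.1 and 14.2. It assumes that the quality criterion J is the squared difference of the means divided by the sum of the variances (the usual Fisher criterion) and that the discriminant is the midpoint between the two means. The function names and the training data are hypothetical.

```python
from statistics import mean, variance


def quality(dark, light):
    """Assumed criterion J: squared difference of the means over the sum of
    the variances. Assumes the variances are not both zero."""
    return (mean(dark) - mean(light)) ** 2 / (variance(dark) + variance(light))


def learn(samples):
    """Learning phase. `samples` maps a sensor name to a (dark, light) pair
    of sample lists. Returns the better sensor and its discriminant."""
    best = max(samples, key=lambda name: quality(*samples[name]))
    dark, light = samples[best]
    discriminant = (mean(dark) + mean(light)) / 2
    return best, discriminant


def classify(value, discriminant):
    """Recognition phase: compare one new reading with the discriminant."""
    return "light" if value > discriminant else "dark"


# Hypothetical training data for the two ground sensors.
samples = {
    "left": ([420, 430, 445], [505, 520, 530]),
    "right": ([381, 383, 385], [465, 467, 469]),
}
sensor, threshold = learn(samples)
print(sensor, threshold, classify(400, threshold))
```

In this made-up data set the right sensor has the smaller difference of means but a much smaller spread, so it obtains the larger J and is the one selected.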
Activity 14.1:

Construct an environment as shown in Fig. 14.2. Print two pieces of paper with different uniform gray levels and tape them to the floor.

Write a program that causes the robot to move at a constant speed over the area of one color and sample the reflected light periodically. Repeat for the other color.

Plot the data, compute the means and the discriminant.

Implement a program that classifies the measurements of the sensor. When the robot classifies a measurement, it displays the recognized color, like a chameleon (or gives other feedback if changing color cannot be done).

Apply the same method with a second sensor and compare the separability of the classes using the criterion J.

Repeat the exercise with two very close levels of gray. What do you observe?
14.2 Linear Discriminant Analysis
In the previous section we classified samples of two levels of gray based on the measurements of one sensor out of two; the sensor was chosen automatically based on a quality criterion. This approach is simple but not optimal. Instead of choosing a discriminant based on one sensor, we can achieve better recognition by combining samples from both sensors. One method is called linear discriminant analysis (LDA) and is based upon pioneering work in 1936 by the statistician Ronald A. Fisher.
14.2.1 Motivation
To understand the advantages of combining samples from two sensors, suppose that we need to classify objects of two colors: electric violet (ev) and cadmium red (cr). Electric violet is composed largely of blue with a bit of red, while cadmium red is composed largely of red with a bit of blue. Two sensors are used: one measures the level of red and the other measures the level of blue. For a set of samples, we can compute their means \(\mu _j^k\) and variances \((s_j^k)^2\), for \(j= ev , cr \), \(k= blue , red \).
From the diagram we see that there is a larger difference between the means for the blue sensor than between the means for the red sensor. At first glance, it appears that using the blue sensor only would give a better discriminant. However, this is not true: the dashed lines show that the red-only discriminant completely distinguishes between electric violet and cadmium red, while the blue-only discriminant falsely classifies some electric violet samples as cadmium red (some samples are below the line) and falsely classifies some cadmium red samples as electric violet (some samples are above the line).
The reason for this unintuitive result is that the blue sensor returns values that are widely spread out (have a large variance), while the red sensor returns values that are narrowly spread out (have a small variance), and we saw in Sect. 14.1 that classification is better if the variance is small. The right plot in Fig. 14.5 shows that by constructing a discriminant from both sensors it is possible to better separate the electric violet objects from the cadmium red objects. The discriminant is still linear (a straight line) but it is no longer parallel to one of the axes. This line is computed using the variances as well as the means. The method is called linear discriminant analysis because the discriminant is linear.
14.2.2 The Linear Discriminant
In Fig. 14.6, classification based only on the left sensor corresponds to the vertical dashed line, while classification based only on the right sensor corresponds to the horizontal dashed line. Neither the horizontal nor the vertical separation line is optimal. Suppose that classification based on the left sensor (the vertical line) is used and consider a sample for which the left sensor returns 470 and the right sensor returns 460. The sample will be classified as dark gray even though the classification as light gray is better. Intuitively, it is clear that the solid diagonal line in the graph is a far more accurate discriminant than either of the two discriminants based on a single sensor.
Linear discriminant analysis automatically defines the vector \(\mathbf {w}\) and constant c that generate an optimal discriminant line between the data sets of the two classes. The first step is to choose a point on the discriminant line. Infinitely many lines pass through that point, and we have to choose the line whose slope gives the optimal discriminant. Finally, the value c can be computed from the slope and the chosen point. The following subsections describe each of these steps in detail.
14.2.3 Choosing a Point for the Linear Discriminant
How can we choose a point? LDA is based upon the assumption that the values of both classes have the same distribution. Informally, when looking at an x–y plot, both sets of points should have similar size and shape. The distributions will almost certainly not be exactly the same (say, both exactly Gaussian) because they result from measurements in the real world. However, since both sensors are subject to the same types of variability (sensor noise, uneven floor surface), it is likely that the distributions will be similar.
14.2.4 Choosing a Slope for the Linear Discriminant
Once we have chosen the point M on the discriminant line, the next step is to choose the slope of the line. From Fig. 14.6, we see that there are infinitely many lines through the point M that would distinguish between the two sets of samples. Which is the best line based on the statistical properties of our data?
When the values of M and \(\mathbf {w}\) have been computed, all that remains is to compute the constant c to fully define the discriminant line. This completes the learning phase of the LDA algorithm. In the recognition phase, the robot uses the line defined by \(\mathbf {w}\) and c for classifying new samples.
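The learning and recognition phases can be sketched for two sensors in plain Python. This sketch assumes the standard formulation of Fisher's LDA, which may differ in detail from the derivation in this chapter: M is the midpoint between the two class means, \(\mathbf {w}\) is the inverse of the averaged covariance matrix applied to the difference of the means, and c is the dot product of \(\mathbf {w}\) and M. All function names and sample values are hypothetical.

```python
def mean_vector(points):
    """Mean of a list of (left, right) samples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def covariance(points):
    """Entries (sxx, sxy, syy) of the 2x2 sample covariance matrix."""
    n = len(points)
    mx, my = mean_vector(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy


def learn_lda(dark, light):
    """Learning phase: return (w, c) for the line w . x = c."""
    mu_d, mu_l = mean_vector(dark), mean_vector(light)
    # Average the covariance matrices of the two classes (assumed pooling).
    a, b, d = ((p + q) / 2 for p, q in zip(covariance(dark), covariance(light)))
    det = a * d - b * b
    if det == 0:
        raise ValueError("singular covariance matrix")
    dx, dy = mu_l[0] - mu_d[0], mu_l[1] - mu_d[1]
    # w = S^-1 (mu_light - mu_dark), using the closed-form 2x2 inverse.
    w = ((d * dx - b * dy) / det, (a * dy - b * dx) / det)
    # M is the midpoint between the class means; c = w . M.
    m = ((mu_d[0] + mu_l[0]) / 2, (mu_d[1] + mu_l[1]) / 2)
    return w, w[0] * m[0] + w[1] * m[1]


def classify(w, c, sample):
    """Recognition phase: which side of the line is the sample on?"""
    return "light" if w[0] * sample[0] + w[1] * sample[1] > c else "dark"


# Hypothetical (left, right) training samples.
dark = [(420, 380), (430, 385), (440, 382), (435, 390)]
light = [(510, 460), (520, 470), (530, 465), (515, 475)]
w, c = learn_lda(dark, light)
print(classify(w, c, (470, 460)))   # light
```

With this made-up training set, the sample (470, 460) discussed in Sect. 14.2.2 falls on the light gray side of the line, whereas a threshold of 480 on the left sensor alone would label it dark gray.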
14.2.5 Computation of a Linear Discriminant: Numerical Example
14.2.6 Comparing the Quality of the Discriminants
If we compare the linear discriminant found above with the two simple discriminants based upon the means of a single sensor, we see a clear improvement. Because of the overlap between the classes in a single direction, the simple discriminant for the right sensor correctly classifies only \(84.1\%\) of the samples, while the simple discriminant for the left sensor is somewhat better, classifying \(93.7\%\) of samples correctly. The linear discriminant found using LDA is better, correctly classifying \(97.5\%\) of the samples.
It might be surprising that there are discriminant lines that can correctly classify all of the samples! One such discriminant is shown by the thick dashed line in Fig. 14.9. Why didn’t LDA find this discriminant? LDA assumes both classes have a similar distribution (spread of values) around the mean and the LDA discriminant is optimal under this assumption. For our data, some points in the second class are far from the mean and thus the distributions of the two classes are slightly different. It is hard to say if these samples are outliers, perhaps caused by a problem when printing the gray areas on paper. In that case, it is certainly possible that subsequent sampling of the two areas would result in distributions that are similar to each other, leading to the correct classification by the LDA discriminant.
14.2.7 Activities for LDA
Activities for LDA are collected in this section.
Activity 14.2:

Construct an environment as shown in Fig. 14.2 but with two gray levels very similar to each other.

Write a program that causes the robot to move at a constant speed over the area of one color and sample the reflected light periodically. Repeat for the other color.

Plot the data.

Compute the averages, the covariance matrices and the discriminant.

Implement a program that classifies measurements of the sensor. When the robot classifies a measurement it displays which color is recognized (or gives other feedback if changing color cannot be done).
Activity 14.3:

Figure 14.10 shows a robot approaching a wall. The upper part of the diagram shows various situations where the robot detects the wall with its right sensors; therefore, it should turn left to move around the wall. Similarly, in the lower part of the diagram the robot should turn right.

Write a program that stores the sensor values from both the right and left sensors when a button is pressed. The program also stores the identity of which button was pressed; this represents the class we are looking for when doing obstacle avoidance.

Train the robot: Place the robot next to a wall and run the program. Touch the left button if the robot should turn left or the right button if the robot should turn right. Repeat many times.

Plot the samples from the two sensors on an x–y graph and group them by class: turn right or left to avoid the wall. You should obtain a graph similar to the one in Fig. 14.11.

Draw a discriminant line separating the two classes.

How successful is your discriminant line? What percentage of the samples can it successfully classify?

Compute the optimal discriminant using LDA. How successful is it? Do the assumptions of LDA hold?
Activity 14.4:

Write a program that causes the robot to follow an object. The robot moves forward if it detects the object in front; it moves backwards if it is too close to the object. The robot turns right if the object is to its right and the robot turns left if the object is to its left.

Use two sensors so we can visualize the data on an x–y plot.

Acquire and plot the data as in Activity 14.3. The plot should be similar to the one shown in Fig. 14.12.

Explain the classifications in Fig. 14.12. What is the problem with classifying a sample as going forwards or going backwards? Why do the samples for going forwards and backwards have different values for the left and right sensors?

Suggest an algorithm for classifying the four situations. Could you use a combination of linear separators?
14.3 Generalization of the Linear Discriminant
In this section we point out some ways in which LDA can be extended and improved.
First, we can have more sensors. The mathematics becomes more complex because with n sensors, the vectors will have n elements and the covariance matrix will have \(n\times n\) elements, requiring more computing power and more memory. Instead of a discriminant line, the discriminant will be an \((n-1)\)-dimensional hyperplane. Classification with multiple sensors is used with electroencephalography (EEG) signals from the brain in order to control a robot by thought alone.
Activity 14.4 demonstrated another generalization: classification into more than two classes. Discriminants are used to classify between each pair of classes. Suppose you have three classes \(C_1\), \(C_2\), and \(C_3\), and discriminants \(\varDelta _{12}, \varDelta _{13}, \varDelta _{23}\). If a new sample is classified in class \(C_2\) by \(\varDelta _{12}\), in class \(C_1\) by \(\varDelta _{13}\), and in class \(C_2\) by \(\varDelta _{23}\), the final classification will be into class \(C_2\) because more discriminants assign the sample to that class.
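The voting scheme can be sketched in Python. To keep the example short, each pairwise discriminant is a hypothetical threshold on a single value; in practice each one would be a discriminant learned for that pair of classes.

```python
from collections import Counter


def vote(sample, discriminants):
    """Return the class chosen by the largest number of pairwise discriminants."""
    votes = Counter(decide(sample) for decide in discriminants)
    return votes.most_common(1)[0][0]


# Hypothetical pairwise discriminants for the classes C1, C2 and C3.
def delta_12(x):
    return "C1" if x < 10 else "C2"


def delta_13(x):
    return "C1" if x < 20 else "C3"


def delta_23(x):
    return "C2" if x < 30 else "C3"


# delta_12 votes C2, delta_13 votes C1, delta_23 votes C2.
print(vote(15, [delta_12, delta_13, delta_23]))   # C2
```

With three classes it is possible for each class to receive exactly one vote, so a real implementation also needs a rule for breaking ties; this sketch simply returns the first class that reached the highest count.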
A third generalization is to use a higher order curve instead of a straight line, for example, a quadratic function. A higher order discriminant can separate classes whose data sets are not simple clusters of samples.
14.4 Perceptrons
LDA can distinguish between classes only under the assumption that the samples have similar distributions in the classes. In this section, we present another approach to classification using perceptrons, which are related to neural networks (Chap. 13). There we showed how learning rules can generate specified behaviors linking sensors and motors; here we show how they can be used to classify data into classes.
14.4.1 Detecting a Slope
Consider a robot exploring difficult terrain. It is important that the robot identify steep slopes so it won’t fall over, but it is difficult to specify in advance all dangerous situations since these depend on characteristics such as the geometry of the ground and its properties (wet/dry, sand/mud). Instead, we wish to train the robot to adapt its behavior in different environments.
To simplify the problem, assume that the robot can move just forwards and backwards, and that it has accelerometers on two axes relative to the body of the robot: one measures acceleration forwards and backwards, and the other measures acceleration upwards and downwards. A robot that is stationary on a level surface will measure zero acceleration forwards and backwards, and a downward acceleration of 9.8 m/sec\(^{2}\) due to gravity. Gravitational acceleration is relatively strong compared with the acceleration of a slow-moving robot, so the relative values of the accelerometer along the two axes will give a good indication of its attitude.
The dashed lines in the figure show the means for the two data sets. It is clear that they do not help us classify the data because of the large overlap between the two sets of samples. Furthermore, LDA is not appropriate because there is no similarity in the distributions: samples when the robot is stable appear in many parts of the plot, while samples from dangerous situations are concentrated in a small area around their mean.
14.4.2 Classification with Perceptrons
The data are normalized so that all inputs are in the same range, usually \(-1\le x_i\le +1\). The data in Fig. 14.14 can be normalized by dividing each value by 30.
Given a set of input values \(\{x_0=1,x_1,\ldots ,x_n\}\) of a sample, the object of a training session is to find a set of weights \(\{w_0,w_1,\ldots ,w_n\}\) so that the output will be the value \(\pm 1\) that assigns the sample to the correct class.
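A minimal sketch of the classification step, assuming that the output of the perceptron is the sign of the weighted sum of its inputs. The function names, weights and raw values are hypothetical; the divisor 30 is the one mentioned above for the data in Fig. 14.14.

```python
def normalize(raw, scale=30.0):
    """Bring the raw sensor values into a common range by dividing by `scale`."""
    return [value / scale for value in raw]


def perceptron_output(weights, inputs):
    """Return +1 or -1 from the sign of the weighted sum (assumed output
    function). `inputs` holds x_1..x_n; the constant input x_0 = 1 is
    prepended here."""
    total = sum(w * x for w, x in zip(weights, [1.0] + list(inputs)))
    return 1 if total >= 0 else -1


sample = normalize([15, 24])                          # [0.5, 0.8]
print(perceptron_output([0.1, 0.1, 0.1], sample))     # 1
print(perceptron_output([-0.5, 0.1, 0.1], sample))    # -1
```

The same sample lands in different classes under the two sets of weights, which is why training consists entirely of finding suitable weights.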
14.4.3 Learning by a Perceptron
The iterative search for values of the weights \(\{w_0,w_1,\ldots ,w_n\}\) starts by setting them to a small value such as 0.1. During the learning phase, a set of samples is presented to the perceptron, together with the expected output (the class) for each element of the set. The set of samples must be constructed randomly and include elements from all the classes; furthermore, the elements from a single class must also be chosen randomly. This is to prevent the learning algorithm from generating a discriminant that is optimal in one specific situation, and to ensure that the process converges rapidly to an overall optimal discriminant, rather than spending too much time optimizing for specific cases.
Equation 14.7 corrects the weights by adding or subtracting a value that is proportional to the input, where the coefficient of proportionality is the learning rate. A small value for the learning rate means that the corrections to the weights will be in small increments, while a high learning rate will cause the corrections to the weights to be in larger increments. Once learning is completed, the weights are used to classify subsequent samples.
When should the learning phase be terminated? One could specify an arbitrary value, for example: terminate the learning phase when \(98\%\) of the samples are classified correctly. However, it may not be possible to achieve this level. A better method is to terminate the learning phase when the magnitudes of the corrections to the weights become small.
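The learning phase might look like the following sketch. It assumes the classic perceptron rule, in which a misclassified sample changes each weight by the learning rate times the input times the expected output; this agrees with the description of the correction above but is not a transcription of Eq. 14.7. Learning stops when a complete pass over the shuffled samples produces corrections whose total magnitude is below a tolerance. The names, parameters and data are hypothetical.

```python
import random


def output(weights, x):
    """Sign of the weighted sum; x already includes the constant x_0 = 1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1


def train(samples, rate=0.1, tolerance=1e-9, max_passes=1000, seed=0):
    """`samples` is a list of (inputs, expected) pairs, expected being +1 or -1."""
    rng = random.Random(seed)
    weights = [0.1] * len(samples[0][0])   # small initial values
    for _ in range(max_passes):
        order = samples[:]
        rng.shuffle(order)                 # present the samples in random order
        correction = 0.0
        for x, expected in order:
            if output(weights, x) != expected:
                for i, xi in enumerate(x):
                    delta = rate * xi * expected   # assumed correction rule
                    weights[i] += delta
                    correction += abs(delta)
        if correction < tolerance:         # corrections have become small
            break
    return weights


# Hypothetical normalized samples: ([x_0, x_1, x_2], class).
samples = [
    ([1, 0.10, 0.90], 1), ([1, 0.20, 0.80], 1), ([1, 0.15, 0.95], 1),
    ([1, 0.80, 0.30], -1), ([1, 0.90, 0.20], -1), ([1, 0.70, 0.40], -1),
]
weights = train(samples)
print(all(output(weights, x) == y for x, y in samples))   # True
```

The limit on the number of passes matters when the classes cannot be separated by a line: in that case the corrections never die out and the loop would otherwise not end.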
14.4.4 Numerical Example
We return to the robot that is learning to avoid dangerous slopes and apply the learning algorithm to the data in Fig. 14.14. The perceptron has three inputs: \(x_0\) which is always set to 1, \(x_1\) for the data from the front/back accelerometer, and \(x_2\) for the data from the up/down accelerometer. The data is normalized by dividing each sample by 30 so that values will be between 0 and 1. We specify that an output of 1 corresponds to class \(C_1\) (stable) and an output of \(-1\) corresponds to class \(C_2\) (dangerous).
14.4.5 Tuning the Parameters of the Perceptron
The performance of a perceptron is determined by the number of iterations and the learning rate. Figure 14.16 shows that there is a strong variation in the weights at the beginning, but the weights stabilize as the number of iterations increases. Thus it is relatively simple to monitor the weights and terminate the computation when the weights stabilize.
This evolution of the weights depends strongly on the learning rate. Increasing the learning rate speeds the variation at the beginning, but strong corrections are not beneficial when the weights begin to stabilize. From Fig. 14.16, it is clear that even at the end of the run, there are significant variations in the weights which oscillate around the optimal value. This suggests that we reduce the learning rate to reduce the oscillations, but doing so will slow down the convergence to the optimal weights at the beginning of the learning phase.
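One common way to reconcile the two requirements is to start with a high learning rate and reduce it as learning proceeds. The schedule below is only an illustration of that idea and is not taken from this chapter; the function name and the parameter `half_life` are hypothetical.

```python
def decayed_rate(initial_rate, iteration, half_life=100):
    """Learning rate that starts at `initial_rate` and has fallen to half of
    it after `half_life` iterations (a hypothetical decay schedule)."""
    return initial_rate / (1 + iteration / half_life)


# Large corrections at the beginning, smaller ones as the weights stabilize.
for iteration in (0, 100, 300, 900):
    print(iteration, decayed_rate(0.1, iteration))
```

The learning loop would call such a function once per iteration and use the returned value in place of a fixed learning rate.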
Activity 14.5:

Take a set of measurements of the accelerometers on your robot on various slopes and plot the data. For each sample you will have to decide if the robot is in danger of falling off the slope.

Classify the data using a perceptron. What discriminant line do you find?

Use a perceptron to classify the gray areas using the data of Activity 14.2. What discriminant do you find? Compare the discriminant found by the perceptron to the discriminant found by LDA.
14.5 Summary
Samples of two classes can be distinguished using their means alone or using both the means and the variances. Linear discriminant analysis is a method for classification that is based on computing the covariances between the samples of the classes. LDA performs well only when the distributions of the samples of the classes are similar. When this assumption does not hold, perceptrons can be used. For optimum performance, the learning rate of a perceptron must be adjusted, if possible dynamically during the learning phase.
14.6 Further Reading
