On the Perceptron’s Compression
Abstract
We study and provide exposition to several phenomena that are related to the perceptron’s compression. One theme concerns modifications of the perceptron algorithm that yield better guarantees on the margin of the hyperplane it outputs. These modifications can be useful in training neural networks as well, and we demonstrate them with some experimental data. In a second theme, we deduce conclusions from the perceptron’s compression in various contexts.
Keywords
Machine learning · Compression · Convex separation
1 Introduction
The perceptron is an abstraction of a biological neuron that was introduced in the 1950s by Rosenblatt [31], and has been extensively studied in many works (see e.g. the survey [27]). It receives as input a list of real numbers (various electrical signals in the biological case), and if the weighted sum of its inputs is greater than some threshold it outputs 1 and otherwise \(-1\) (it fires or not in the biological case).
Formally, a perceptron computes a function of the form \(\mathsf {sign}(w \cdot x - b)\) where \(w \in \mathbb {R}^d\) is the weight vector, \(b \in \mathbb {R}\) is the threshold, \(\cdot \) is the standard inner product, and \(\mathsf {sign}:\mathbb {R}\rightarrow \{\pm 1\}\) is 1 on the nonnegative numbers. It is only capable of representing binary functions that are induced by partitions of \(\mathbb {R}^d\) by hyperplanes.
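For concreteness (an illustration of ours, not from the paper), the function \(\mathsf{sign}(w \cdot x - b)\) that a perceptron computes can be written as:

```python
import numpy as np

def perceptron_output(w: np.ndarray, b: float, x: np.ndarray) -> int:
    """Compute sign(w.x - b), with sign taken to be +1 on the
    nonnegative numbers, as in the text."""
    return 1 if np.dot(w, x) - b >= 0 else -1
```

For example, with \(w = (1,0)\) and \(b = 1/2\), the point \((1,0)\) is mapped to 1 and the point \((0,1)\) to \(-1\).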
Definition 1
A map \(Y: \mathcal{X} \rightarrow \{\pm 1\}\) over a finite set \(\mathcal{X} \subset \mathbb {R}^d \) is (linearly)^{1} separable if there exists \(w\in \mathbb {R}^d\) such that \(\mathsf {sign}(w \cdot x)=Y(x)\) for all \(x\in \mathcal{X}\). When the Euclidean norm of w is \(\Vert w\Vert =1\), the number \(\mathsf {marg}(w,Y) = \min _{x\in \mathcal{X}} Y(x)\, w\cdot x \) is the margin of w with respect to Y. The number \(\mathsf {marg}(Y) = \sup _{w \in \mathbb {R}^d : \Vert w\Vert =1} \mathsf {marg}(w,Y)\) is the margin of Y. We call Y an \(\varepsilon \)-partition if its margin is at least \(\varepsilon \).
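The margin of Definition 1 can be computed directly from the data; a small sketch of ours (the function name is not from the paper):

```python
import numpy as np

def margin(w, X, Y):
    """marg(w, Y): the minimum of Y(x) * (w.x) over x in X,
    after normalizing w to unit Euclidean norm."""
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)  # Definition 1 requires ||w|| = 1
    return min(y * np.dot(w, x) for x, y in zip(X, Y))
```

A positive value certifies that \(\mathsf{sign}(w \cdot x) = Y(x)\) on all of \(\mathcal{X}\); the supremum of this quantity over unit vectors w is \(\mathsf{marg}(Y)\).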
Variants of the perceptron (neurons) are the basic building blocks of general neural networks. Typically, the sign function is replaced by some other activation function (e.g., sigmoid or rectified linear unit \(\mathsf {ReLu}(z) = \max \{0, z\}\)). Therefore, studying the perceptron and its variants may help in understanding neural networks, their design and their training process.
Overview. In this paper, we provide some insights into the perceptron’s behavior, survey some of the related work, deduce some geometric applications, and discuss their usefulness in other learning contexts. Below is a summary of our results and a discussion of related work, partitioned into five parts numbered (i) to (v). Each of the results we describe highlights a different aspect of the perceptron’s compression (the perceptron’s output is a sum of a small subset of the examples). For more details, definitions and references, see the relevant sections.
(i) Variants of the perceptron (Sect. 2). The well-known perceptron algorithm (see Algorithm 1 below) is guaranteed to find a separating hyperplane in the linearly separable case. However, there is no guarantee on the hyperplane’s margin compared to the optimal margin \(\varepsilon ^*\). This problem was already addressed in several works, as we now explain (see also references within). The authors of [9] and [23] defined a variant of the perceptron that yields a margin of the form \(\varOmega (\varepsilon ^*/R^2)\); see Algorithm 2 below. The authors of [10] defined the passive-aggressive perceptron algorithms, which allow one to deal with, e.g., noise, but provided no guarantee on the margin of the output. The authors of [22] defined a variant of the perceptron that yields a provable margin under the assumption that a lower bound on the optimal margin is known. The author of [17] designed the ALMA algorithm and showed that it provides an almost optimal margin under the assumption that the samples lie on the unit sphere. It is worth noting that normalizing the examples to be on the unit sphere may significantly alter the margin, and even change the optimal separating hyperplane. The author of [37] defined the minimal overlap algorithm, which guarantees an optimal margin but is not online since it knows the samples in advance. Finally, the authors of [36] analyzed gradient descent for a single neuron and showed convergence to the optimal separating hyperplane under certain assumptions (appropriate activation and loss functions). We provide two new ideas that improve the learning process: one that adaptively changes the “scale” of the problem and by doing so improves the guarantee on the margin of the output (Algorithm 3), and one that yields an almost optimal margin (Algorithm 4).
(ii) Applications for neural networks (Sect. 3). Our variants of the perceptron algorithm are simple to implement, and can therefore be easily applied in the training process of general neural networks. We validate their benefits by training a basic neural network on the MNIST dataset.
(iii) Convex separation (Sect. 4). We use the perceptron’s compression to prove a sparse separation lemma for convex bodies. This perspective also suggests a different proof of Novikoff’s theorem on the perceptron’s convergence [30]. In addition, we interpret this sparse separation lemma in the language of game theory as yielding sparse strategies in a related zero-sum game.
(iv) Generalization bounds (Sect. 5). An important aspect of a learning algorithm is its generalization capabilities; namely, its error on new examples that are independent of the training set (see the textbook [33] for background and definitions). We follow the theme of [18], and observe that even though the (original) perceptron algorithm does not yield an optimal hyperplane, it still generalizes.
(v) Robust concepts (Sect. 6). The robust concepts theme presented by Arriaga and Vempala [3] suggests focusing on well-separated data. We notice that the perceptron fits well into this framework; specifically, that its compression yields efficient dimension reductions. Similar dimension reductions were used in several previous works (e.g. [3, 4, 5, 6, 16, 21]).
Summary. In parts (i)–(ii) we provide a couple of new ideas for improving the training process and explain their contribution in the context of previous work. In part (iii) we use the perceptron’s compression as a tool for proving geometric theorems. We are not aware of previous works that studied this connection. Parts (iv)–(v) are mostly about presenting ideas from previous works in the context of the perceptron’s compression. We think that parts (iv) and (v) help to understand the picture more fully.
2 Variants of the Perceptron
The perceptron algorithm terminates whenever its input sample is linearly separable, in which case its output represents a separating hyperplane. Novikoff analyzed the number of steps T required for the perceptron to stop as a function of the margin of the input sample [30].
Part I: The Projection Grows Linearly in Time. In each iteration, the projection of \(w^{(t)}\) on \(w^*\) grows by at least \(\varepsilon ^*\), since \(y_i x_i \cdot w^* \ge \varepsilon ^*\). By induction, we get \(w^{(t)} \cdot w^* \ge \varepsilon ^* t\) for all \(t \ge 0\).
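For concreteness, the classic update rule analyzed here (add \(y_i x_i\) to the current weight vector whenever \(y_i\, w \cdot x_i \le 0\)) can be sketched as follows; this is a standard rendering of ours, and the epoch budget is an implementation detail not in the paper:

```python
import numpy as np

def perceptron(X, Y, max_epochs=10_000):
    """Classic perceptron: start from w = 0 and, on every mistake
    (y_i * w.x_i <= 0), update w <- w + y_i * x_i.  Terminates on
    linearly separable samples by Novikoff's theorem."""
    w = np.zeros(len(X[0]))
    for _ in range(max_epochs):
        mistake = False
        for x, y in zip(X, Y):
            if y * np.dot(w, x) <= 0:
                w = w + y * np.asarray(x)
                mistake = True
        if not mistake:
            return w  # w now separates the sample
    raise RuntimeError("no separator found within the epoch budget")
```

The output w is, by construction, a signed sum of examples, which is exactly the compression property the paper exploits.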
As discussed in Sect. 1, Algorithm 1 has several drawbacks. Here we describe some simple ideas that allow us to improve it. Below we describe three algorithms, each followed by a theorem that summarizes its main properties.
In the following, \(\mathcal {X}\subset \mathbb {R}^d\) is a finite set, Y is a linear partition, \(\varepsilon ^* = \mathsf {marg}(Y)\) is the optimal margin, and \(R=\max _{x \in \mathcal{X}} \Vert x \Vert \) is the maximal norm of a point.
Theorem 1
([9, 23]). The \(\beta \)-perceptron algorithm performs at most \(\frac{2\beta +R^2}{(\varepsilon ^*)^2}\) updates and achieves a margin of at least \(\frac{\beta \varepsilon ^*}{2\beta +R^2} \).
Proof
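Algorithm 2 is not reproduced in the text above; the sketch below assumes the standard Krauth–Mézard rule of updating whenever the unnormalized margin \(y_i\, w \cdot x_i\) is at most \(\beta\), rather than only on mistakes (our rendering, under that assumption):

```python
import numpy as np

def beta_perceptron(X, Y, beta, max_epochs=10_000):
    """Sketch of the beta-perceptron: update on every example whose
    unnormalized margin y_i * w.x_i is at most beta, not just on
    mistakes.  For beta = 0 this degenerates to the classic rule."""
    w = np.zeros(len(X[0]))
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(X, Y):
            if y * np.dot(w, x) <= beta:
                w = w + y * np.asarray(x)
                updated = True
        if not updated:
            return w  # every example now clears the beta threshold
    return w
```

Per Theorem 1, on termination the normalized margin of w is at least \(\beta \varepsilon^*/(2\beta + R^2)\).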
Theorem 2
The R-independent perceptron algorithm performs at most \(\tfrac{10 R^2}{(\varepsilon ^*)^2}\) updates and achieves a margin of at least \(\frac{\varepsilon ^*}{3}\).
Proof
Theorem 3
If \(R \le 1\), the \(\infty \)-perceptron algorithm performs at most \((1/\varepsilon ^*)^{2/(2-\alpha )}\) updates and achieves a margin of at least \(\frac{\alpha \varepsilon ^*}{2}\).
Proof
For simplicity, we assume here that \(R=\max _i \Vert x_i \Vert =1\). The idea is as follows. The analysis of the classical perceptron relies on the fact that \(\Vert w^{(t)}\Vert ^2\le t\) in each step. On the other hand, in an “extremely aggressive” version of the perceptron that always updates, one can only obtain a trivial bound \(\Vert w^{(t)}\Vert ^2\le t^2\) (as \(w^{(t)}\) can be the sum of t unit vectors in the same direction). The update rule in the version below is tailored so that a bound of \(\Vert w^{(t)}\Vert ^2\le t^\alpha \) for \(\alpha \in (1,2)\) is maintained.
Remark 1
The bound on the running time is sharp, as the following example shows. The two points \((\sqrt{1-\varepsilon ^2} , \varepsilon ), (\sqrt{1-\varepsilon ^2} , -\varepsilon )\) with labels \(1,-1\) are linearly separated with margin \(\varOmega (\varepsilon )\). The algorithm stops after \(\varOmega \big ((1/\varepsilon )^{2/(2-\alpha )} \big )\) iterations (if \(\varepsilon \) is small enough and \(\alpha \) is close enough to 2).
3 Application for Neural Networks
Our results explain some choices that are made in practice, and can potentially help to improve them. Observe that if one applies gradient descent on a neuron of the form \(\mathsf {ReLu}(w \cdot x)\) with a loss function of the form \(\mathsf {ReLu}(\beta - y_x w \cdot x)\) with \(\beta =0\), then one gets the same update rule as in the perceptron algorithm. Choosing \(\beta =1\) corresponds to using the hinge loss to drive the learning process. The fact that \(\beta =1\) yields provable bounds on the output’s margin for a single neuron provides formal evidence supporting the benefits of the hinge loss.
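To make the observation concrete, a single gradient step on the loss \(\mathsf{ReLu}(\beta - y_x\, w \cdot x)\) can be written as follows (our sketch); for \(\beta = 0\) and unit learning rate it coincides with the perceptron update, up to the tie-breaking convention at \(y_x\, w \cdot x = 0\), where the loss is not differentiable:

```python
import numpy as np

def hinge_grad_step(w, x, y, beta=0.0, lr=1.0):
    """One gradient-descent step on the loss ReLu(beta - y * w.x).
    When the loss is active (beta - y * w.x > 0), its gradient in w
    is -y * x, so the step is w <- w + lr * y * x; with beta = 0 and
    lr = 1 this is the perceptron update."""
    if beta - y * np.dot(w, x) > 0:
        return w + lr * y * np.asarray(x)
    return w  # loss is zero here, so the gradient vanishes
```

Raising \(\beta\) makes the step fire even on correctly classified points with low confidence, which is exactly the mechanism behind the margin guarantees above.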
Moreover, in practice, \(\beta \) is treated as a hyperparameter and tuning it is a common challenge that needs to be addressed in order to maximize performance. We proposed a couple of new options for choosing and updating \(\beta \) throughout the training process that may contribute towards a more systematic approach for setting \(\beta \) (see Algorithms 4 and 3). Theorems 2 and 3 explain the theoretical advantages of these options in the case of a single neuron.
We also provide some experimental data. Our experiments verify that our suggestions for choosing \(\beta \) can indeed yield better results. We used the MNIST database [24] of handwritten digits as a test case with no preprocessing. We used a simple and standard neural network with one hidden layer consisting of 800/300 neurons and 10 output neurons (the choices of 800 and 300 are the same as in Simard et al. [35] and LeCun et al. [24]). We trained the network by backpropagation (gradient descent). The loss function of each output neuron, which is of the form \(\mathsf {ReLu}(w \cdot G(x))\) where G(x) is the output of the hidden layer, is \(\mathsf {ReLu}(-y_x w \cdot G(x) + \beta )\) for various values of \(\beta \). This loss function is 0 if w provides a correct and confident (depending on \(\beta \)) classification of x and is linear in G(x) otherwise. This choice updates the network even when the network classifies correctly but with less than \(\beta \) confidence. It has the added value of yielding simple and efficient calculations compared to other choices (like cross entropy or softmax).^{3}
Finally, a natural suggestion that emerges from our work is to add \(\beta >0\) as a parameter for each individual neuron in the network, and not just to the loss function. Namely, to translate the input to a \(\mathsf {ReLu}\) neuron by \(\beta \). The value of \(\beta \) may change during the learning process. Figuratively, this can be thought of as “internal clocks” of the neurons. A neuron changes its behavior as time progresses. For example, the “older” the neuron is, the less it is inclined to change.
4 Convex Separation
Linear programming (LP) is a central paradigm in computer science and mathematics. LP duality is a key ingredient in many algorithms and proofs, and is deeply related to von Neumann’s minimax theorem that is seminal in game theory [29]. Two related and fundamental geometric properties are Farkas’ lemma [12], and the following separation theorem.
Theorem 4
(Convex separation theorem). For all nonempty convex sets \(K ,L\subset \mathbb {R}^d\), precisely one of the following holds: (i) \(\mathsf {dist}(K,L) = \inf \{ \Vert p-q\Vert : p \in K, q \in L \} = 0\), or (ii) there is a hyperplane separating K and L.
We observe that the following stronger version of the separation theorem follows from the perceptron’s compression (a similar version of Farkas’ lemma can be deduced as well).
Lemma 1
 1.
\(\mathsf {dist}(K,L) < \varepsilon \).
 2.There is a hyperplane \(H = \{ x : w \cdot x = b\}\) separating K from L so that its normal vector is “sparse”:
 (i)
\(\frac{w\cdot p - b}{\Vert w\Vert } > \frac{\varepsilon }{30}\) for all \(p \in K\),
 (ii)
\(\frac{w\cdot q - b}{\Vert w\Vert } < -\frac{\varepsilon }{30}\) for all \(q \in L\), and
 (iii)
w is a sum of at most \((10/\varepsilon )^2\) points in K and \(-L\).
Proof
Let K, L be convex sets and \(\varepsilon > 0\). For \(x \in \mathbb {R}^d\), let \(\tilde{x}\) in \(\mathbb {R}^{d+1}\) be the same as x in the first d coordinates and 1 in the last (we have \(\Vert \tilde{x}\Vert \le \Vert x\Vert +1\)). We thus get two convex bodies \(\tilde{K}\) and \(\tilde{L}\) in \(d+1\) dimensions (using the map \(x \mapsto \tilde{x}\)).
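The lifting \(x \mapsto \tilde{x}\) is straightforward to implement; a minimal sketch of ours:

```python
import numpy as np

def lift(x):
    """The map x -> x~: append a last coordinate equal to 1, moving an
    affine separation problem in R^d to a linear one in R^(d+1).
    Note ||x~|| = sqrt(||x||^2 + 1) <= ||x|| + 1, as used in the proof."""
    x = np.asarray(x, dtype=float)
    return np.append(x, 1.0)
```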
Run Algorithm 2 with \(\beta =1\) on inputs that positively label \(\tilde{K}\) and negatively label \(\tilde{L}\). This produces a sequence of vectors \(w^{(0)},w^{(1)},\ldots \) so that \(\Vert w^{(t)} \Vert \le \sqrt{6t}\) for all t. For every \(t>0\), the vector \(w^{(t)}\) is of the form \(w^{(t)} = k^{(t)} - \ell ^{(t)}\) where \(k^{(t)}\) is a sum of \(t_1\) elements of \(\tilde{K}\) and \(\ell ^{(t)}\) is a sum of \(t_2\) elements of \(\tilde{L}\) so that \(t_1+t_2 = t\). In particular, we can write \(\frac{1}{t} w^{(t)} = \alpha ^{(t)} p^{(t)} - (1-\alpha ^{(t)}) q^{(t)}\) for \(\alpha ^{(t)} \in [0,1]\), where \(p^{(t)} \in \tilde{K}\) and \(q^{(t)} \in \tilde{L}\) (note that the last coordinate of \(\frac{1}{t} w^{(t)}\) equals \(2\alpha ^{(t)}-1\)).
The lemma is strictly stronger than the preceding separation theorem. Below, we also explain how this perspective yields an alternative proof of Novikoff’s theorem on the convergence of the perceptron [30]. It is interesting to note that the usual proof of the separation theorem relies on a concrete construction of the separating hyperplane that is geometrically similar to hard-SVMs. The proof using the perceptron, however, does not include any “geometric construction” and yields a sparse and strong separator (it also holds in infinite-dimensional Hilbert space, but it uses the fact that the sets are bounded in norm).
Alternative Proof of the Perceptron’s Convergence. Assume without loss of generality that all of the examples are labelled positively (by replacing x by \(-x\) if necessary). Also assume that \(R = \max _i \Vert x_i\Vert = 1\). As in the proof above, let \(w^{(0)},w^{(1)},\ldots \) be the sequence of vectors generated by the perceptron (Algorithm 1). Instead of arguing that the projection on \(w^*\) grows linearly with t, argue as follows. The vectors \(v^{(1)},v^{(2)},\ldots \) defined by \(v^{(t)} = \frac{1}{t} w^{(t)}\) are in the convex hull of the examples and have norm at most \(\Vert v^{(t)}\Vert \le \frac{1}{\sqrt{t}}\). Specifically, for every w of norm 1 we have \(v^{(t)} \cdot w \le \frac{1}{\sqrt{t}}\), and so there is an example x so that \(x \cdot w \le \frac{1}{\sqrt{t}}\). This implies that the running time T satisfies \(\frac{1}{\sqrt{T}} \ge \varepsilon ^*\), since for every example x we have \(x \cdot w^* \ge \varepsilon ^*\).
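This argument is easy to check numerically; the following sketch (our synthetic test data, not from the paper) runs the perceptron on positively labelled unit vectors, counting t as the number of updates, and verifies \(\Vert \frac{1}{t} w^{(t)}\Vert \le \frac{1}{\sqrt{t}}\) after every update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic positively labelled examples: unit vectors whose first
# coordinate is bounded away from 0, so w* = e_1 separates them with
# a positive margin (hence the loop below terminates).
raw = rng.normal(size=(50, 5))
raw = raw / np.linalg.norm(raw, axis=1, keepdims=True)
raw[:, 0] = np.abs(raw[:, 0]) + 1.0          # push the first coordinate >= 1
X = [x / np.linalg.norm(x) for x in raw]     # renormalize so that R = 1

w = np.zeros(5)
t = 0  # number of updates performed so far
while True:
    mistakes = [x for x in X if np.dot(w, x) <= 0]
    if not mistakes:
        break
    w = w + mistakes[0]
    t += 1
    # w/t is an average of t examples, hence in their convex hull,
    # and the classic bound ||w(t)||^2 <= t gives ||w/t|| <= 1/sqrt(t).
    assert np.linalg.norm(w / t) <= 1 / np.sqrt(t) + 1e-9
```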
Claim
Proof
The last strategy in the sequence \(\mu _1,\mu _2,\ldots \) guarantees the Point player a loss of at most \(3 \varepsilon ^*\). This sequence is naturally and efficiently generated by the perceptron algorithm, and it produces a strategy for the Point player that is optimal up to a constant factor. The ideas presented in Sect. 2 allow us to reduce the constant 3 to as close to 1 as we want, at the price of a longer running time (see Algorithm 4).
5 Generalization Bounds
Generalization is one of the key concepts in learning theory. One typically formalizes it by assuming that the input sample consists of i.i.d. examples drawn from an unknown distribution D on \(\mathbb {R}^d\) that are labelled by some unknown function \(c: \mathbb {R}^d \rightarrow \{\pm 1\}\). The algorithm is said to generalize if it outputs a hypothesis \(h: \mathbb {R}^d \rightarrow \{\pm 1\}\) so that \(\Pr _D[h \ne c]\) is as small as possible.
We focus on the case that c is linearly separable. A natural choice for h in this case is given by hard-SVM; namely, the halfspace with maximum margin on the input sample. It is known that if D is supported on points that are \(\gamma \)-far from some hyperplane, then the hard-SVM choice generalizes well (see Theorem 15.4 in [33]). The proof of this property of hard-SVMs uses Rademacher complexity.
We suggest that using the perceptron algorithm, instead of the hard-SVM solution, yields a more general statement with a simpler proof. The reason is that the perceptron can be interpreted as a sample compression scheme.
Theorem 5
The theorem can also be interpreted as a local-to-global statement in the following sense. Assume that we know nothing of c, but we get a list of m samples that are linearly separable with a significant margin (this is a local condition that we can empirically verify). Then we can deduce that c is close to being linearly separable. The perceptron’s compression allows us to deduce more general local-to-global statements, like bounding the global margin via the local/empirical margins (this is related to [32]).
Condition (2) holds when the expected value of one over the margin is bounded from above (and it may hold even when c is not linearly separable). This assumption is weaker than the assumption in [33] on the behavior of hard-SVMs (that the margin is always bounded from below).
For the proof of Theorem 5 we will need the following.
Definition 2

\(\kappa \) maps S to a subsample of S of size at most d.

\(\rho \) maps \(\kappa (S)\) to a hypothesis \(\rho (\kappa (S)): \mathcal{X} \rightarrow \{\pm 1\}\); this is the output of the learning algorithm induced by the selection scheme.
Following Littlestone and Warmuth, David et al. showed that selection schemes do not overfit their data [11]: Let \((\kappa ,\rho )\) be a selection scheme of size d. Let S be a sample of m independent examples from an arbitrary distribution D that are labelled by some fixed concept c, and let \(K(S) = \rho \left( \kappa \left( S\right) \right) \) be the output of the selection scheme. For a hypothesis h, let \(L_D(h) = \Pr _D[h \ne c]\) denote the true error of h and \(L_S(h) = \frac{1}{m} \sum _{i=1}^m 1_{h(x_i) \ne c(x_i)}\) denote the empirical error of h.
Theorem 6
Proof
(Theorem 5). Consider the following selection scheme of size \(1/\varepsilon ^2\) that agrees with the perceptron on samples with margin at least \(\varepsilon \): If the input sample S has \(\mathsf {marg}(S) \ge \varepsilon \), apply the perceptron (which gives a compression of size \(1/\varepsilon ^2\)). Else, compress it to the empty set and reconstruct it to some dummy hypothesis. The theorem now follows by applying Theorem 6 to this selection scheme and by the assumption that \(\mathsf {marg}(S) \ge \varepsilon \) on \(1-\delta /2\) of the space (note that \(L_S(K(S))=0\) when \(\mathsf {marg}(S)\ge \varepsilon \)). \(\square \)
6 Robust Concepts
Here we follow the theme of robust concepts presented by Arriaga and Vempala [3]. Let \(\mathcal{X} \subset \mathbb {R}^d\) be of size n so that \(\max _{x \in \mathcal{X}} \Vert x \Vert =1\). Think of \(\mathcal{X}\) as representing a collection of high resolution images. As in many learning scenarios, some assumptions on the learning problem should be made in order to make it accessible. A typical assumption is that the unknown function to be learnt belongs to some specific class of functions. Here we focus on the class of all \(\varepsilon \)-separated partitions of \(\mathcal{X}\); these are functions \(Y : \mathcal{X} \rightarrow \{\pm 1\}\) that are linearly separable with margin at least \(\varepsilon \). Such partitions are called robust concepts in [3] and correspond to “easy” classification problems.
Arriaga and Vempala demonstrated the difference between robust and non-robust concepts with the following analogy: it is much easier to distinguish between “Elephant” and “Dog” than between “African Elephant” and “Indian Elephant.” They proved that random projections can help to perform efficient dimension reduction for \(\varepsilon \)-separated learning problems (and more general examples). They also described “neuronal” devices for performing it, and discussed their advantages. Similar dimension reductions were used in several other works in learning, e.g. [4, 6, 15, 16, 21].
We observe that the perceptron’s compression allows us to deduce a simultaneous dimension reduction. Namely, the dimension reduction works simultaneously for the entire class of robust concepts. This follows from results in Ben-David et al. [5], who studied limitations of embedding learning problems in linearly separated classes.
We now explain this in more detail. The first step in the proof is the following theorem.
Theorem 7
([5]). The number of \(\varepsilon \)-separated partitions of \(\mathcal{X}\) is at most \((2(n+1))^{1/\varepsilon ^2}\).
Proof
Given an \(\varepsilon \)-partition of the set \(\mathcal{X}\), the perceptron algorithm finds a separating hyperplane after making at most \(1/ \varepsilon ^2\) updates. It follows that every \(\varepsilon \)-partition can be represented by a multiset of \(\mathcal{X}\) together with the corresponding signs. The total number of options is at most \((n+1)^{1/\varepsilon ^2} \cdot 2^{1/\varepsilon ^2}\). \(\square \)
The theorem is sharp in the following sense.
Example 1
Let \(e_1,\ldots ,e_n \in \mathbb {R}^n\) be the n standard unit vectors. Every subset of the form \((e_i)_{i \in I}\) for \(I \subset [n]\) of size k is \(\varOmega (1/\sqrt{k})\)-separated, and there are \(\binom{n}{k}\) such subsets.
The example also allows us to lower bound the number of updates of any perceptron-like algorithm: if an algorithm, given \(Y : \mathcal{X} \rightarrow \{\pm 1\}\) of margin \(\varepsilon \), finds a vector w with \(Y(x) = \mathsf {sign}(w \cdot x)\) for all \(x \in \mathcal{X}\) that can be described by at most K of the points in \(\mathcal{X}\), then K must be \(\varOmega (1/\varepsilon ^2)\).
Theorem 8
(implicit in [5]). With probability at least \(1-\delta \) over the choice of A, all \(\varepsilon \)-partitions of \(\mathcal{X}\) are \(\varepsilon /2\)-partitions of \(A\mathcal{X}\), and all \(\varepsilon /2\)-partitions of \(A\mathcal{X}\) are \(\varepsilon /4\)-partitions of \(\mathcal{X}\).
The proof of the above theorem is a simple application of Theorem 7 together with the Johnson-Lindenstrauss lemma.
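A sketch of such a random projection, assuming the common Gaussian construction of the Johnson-Lindenstrauss lemma (the \(1/\sqrt{k}\) scaling is the standard one; the exact construction used for A is not specified in this excerpt):

```python
import numpy as np

def random_projection(d, k, rng=None):
    """A Gaussian random matrix A : R^d -> R^k scaled by 1/sqrt(k),
    the common Johnson-Lindenstrauss construction: with high
    probability it preserves norms and inner products (and hence
    margins) up to small distortion."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(size=(k, d)) / np.sqrt(k)
```

For a unit vector x, the norm \(\Vert Ax\Vert\) concentrates around 1 when k is moderately large, which is what the margin-preservation statement of Theorem 8 rests on.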
Lemma 2
Footnotes
 1.
We focus on the linear case, when the threshold is 0. A standard lifting that adds a coordinate with value 1 to every vector allows one to translate the general (affine) case to the linear case. This lifting may significantly decrease the margin; e.g., the map Y on \(\mathcal{X} = \{999,1001\} \subset \mathbb {R}\) defined by \(Y(999)=-1\) and \(Y(1001) = 1\) has margin 1 in the affine sense, but the lift to (999, 1) and (1001, 1) in \(\mathbb {R}^2\) yields a very small margin in the linear sense. This solution may therefore cause an unnecessary increase in running time. This tax can be avoided, for example, if one has prior knowledge of \(R=\max _{x \in \mathcal{X}} \Vert x \Vert \). In this case, setting the last coordinate to be R does not significantly decrease the margin. In fact, it can be avoided without any prior knowledge using the ideas in Algorithm 3 below.
 2.
We assume that S is consistent with a function (does not contain identical points with opposite labels).
 3.
An additional added value is that with this loss function there is a dichotomy: either an error occurred or not. This dichotomy can be helpful in making decisions throughout the learning process. For example, instead of choosing the batch size to be of fixed size B, we can choose the batch size in a dynamic but simple way: just wait until B errors have occurred.
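This dynamic rule can be sketched as follows (the `stream` interface and function name are hypothetical helpers of ours, not from the paper):

```python
def dynamic_batches(stream, B):
    """Dynamic batching: instead of fixed-size batches, emit a batch
    each time B errors have accumulated.  `stream` yields
    (example, is_error) pairs, where `is_error` comes from the
    dichotomy of the ReLu loss (loss > 0 or not)."""
    batch, errors = [], 0
    for example, is_error in stream:
        batch.append(example)
        errors += bool(is_error)
        if errors >= B:
            yield batch
            batch, errors = [], 0
```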
 4.
Time is measured by the number of updates.
 5.
Other distributions will work just as well.
References
 1. Andoni, A., Panigrahy, R., Valiant, G., Zhang, L.: Learning polynomials with neural networks. PMLR 32(2), 1908–1916 (2014)
 2. Anlauf, J.K., Biehl, M.: The AdaTron: an adaptive perceptron algorithm. EPL 10, 687 (1989)
 3. Arriaga, R.I., Vempala, S.: An algorithmic theory of learning: robust concepts and random projection. Mach. Learn. 63(2), 161–182 (2006)
 4. Balcan, N., Blum, A., Vempala, S.: On kernels, margins and low-dimensional mappings. In: ALT (2004)
 5. Ben-David, S., Eiron, N., Simon, H.U.: Limitations of learning via embeddings in Euclidean half spaces. JMLR 3, 441–461 (2002)
 6. Blum, A., Kannan, R.: Learning an intersection of \(k\) halfspaces over a uniform distribution. In: FOCS (1993)
 7. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: COLT, pp. 144–152 (1992)
 8. Cesa-Bianchi, N., Conconi, A., Gentile, C.: On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory 50(9), 2050–2057 (2004)
 9. Collobert, R., Bengio, S.: Links between perceptrons, MLPs and SVMs. IDIAP (2004)
 10. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)
 11. David, O., Moran, S., Yehudayoff, A.: Supervised learning through the lens of compression. In: NIPS, pp. 2784–2792 (2016)
 12. Farkas, G.: Über die Theorie der einfachen Ungleichungen. Journal für die Reine und Angewandte Mathematik 124, 1–27 (1902)
 13. Freund, Y.: Boosting a weak learning algorithm by majority. Inf. Comput. 121(2), 256–285 (1995)
 14. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. Mach. Learn. 37, 277–296 (1999)
 15. Garg, A., Har-Peled, S., Roth, D.: On generalization bounds, projection profile, and margin distribution. In: ICML, pp. 171–178 (2002)
 16. Garg, A., Roth, D.: Margin distribution and learning. In: ICML, pp. 210–217 (2003)
 17. Gentile, C.: A new approximate maximal margin classification algorithm. J. Mach. Learn. Res. 2, 213–242 (2001)
 18. Graepel, T., Herbrich, R., Shawe-Taylor, J.: PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Mach. Learn. 59, 55–76 (2005)
 19. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. In: Conference in Modern Analysis and Probability (1982)
 20. Khardon, R., Wachman, G.: Noise tolerant variants of the perceptron algorithm. J. Mach. Learn. Res. 8, 227–248 (2007)
 21. Klivans, A.R., Servedio, R.A.: Learning intersections of halfspaces with a margin. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI), vol. 3120, pp. 348–362. Springer, Heidelberg (2004)
 22. Korzeń, M., Klęsk, P.: Maximal margin estimation with perceptron-like algorithm. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 597–608. Springer, Heidelberg (2008)
 23. Krauth, W., Mézard, M.: Learning algorithms with optimal stability in neural networks. J. Phys. A: Math. Gen. 20, L745–L752 (1987)
 24. LeCun, Y., Cortes, C.: The MNIST database of handwritten digits (1998)
 25. Littlestone, N., Warmuth, M.: Relating data compression and learnability (1986, unpublished)
 26. Matoušek, J.: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2), 142–156 (2008)
 27. Mohri, M., Rostamizadeh, A.: Perceptron mistake bounds. arXiv:1305.0208
 28. Moran, S., Yehudayoff, A.: Sample compression schemes for VC classes. JACM 63(3), 1–21 (2016)
 29. von Neumann, J.: Zur Theorie der Gesellschaftsspiele. Math. Ann. 100, 295–320 (1928)
 30. Novikoff, A.B.J.: On convergence proofs for perceptrons. In: Proceedings of the Symposium on the Mathematical Theory of Automata, vol. 12, pp. 615–622 (1962)
 31. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386–408 (1958)
 32. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat. 26(5), 1651–1686 (1998)
 33. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
 34. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127(1), 3–30 (2011)
 35. Simard, P.Y., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural networks applied to visual document analysis. ICDAR 3, 958–962 (2003)
 36. Soudry, D., Hoffer, E., Srebro, N.: The implicit bias of gradient descent on separable data. arXiv:1710.10345 (2017)
 37. Wendemuth, A.: Learning the unlearnable. J. Phys. A: Math. Gen. 28, 5423 (1995)