
Implementing PEGASOS

We're going to implement the algorithm proposed in the original PEGASOS paper by Shalev-Shwartz, Singer, and Srebro.

SVMs are a very popular classification learning tool, and in their original form, the task of learning an SVM is actually a loss minimization problem with a penalty term for the norm of the classifier being learned. So for a given training set of $m$ training examples $S = \{(x_i, y_i)\}_{i=1}^m$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$, we want to build a minimizer for the following function (note that here $w$ and $x$ are vectors):

$$F(w) = \frac{\lambda}{2}\lVert w\rVert^2 + \frac{1}{m}\sum_{(x, y) \in S} l(w; (x, y))$$

where $l$ is the loss function, given by $l(w; (x, y)) = \max\{0,\, 1 - y\langle w, x \rangle\}$, and $\langle w, x \rangle$ denotes the inner product of the two vectors. To get an intuitive insight into why this loss function works, let's consider the sign of the inner product in two cases:

$$\langle w, x \rangle = \begin{cases} +\text{ve}, & \text{when } w \text{ is very similar to } x\\ -\text{ve}, & \text{when } w \text{ is quite different from } x \end{cases}$$

The variable $y$ carries the classification information: $1$ means the example belongs to the class, and $-1$ means it does not belong to the class for which the classifier is being trained. Therefore, a positive value of $\langle w, x \rangle$ along with $y = 1$, or a negative value of $\langle w, x \rangle$ along with $y = -1$, is favorable, because both cases mean the prediction is correct. It also means that the loss function is small in these two scenarios (it drops to zero once $y\langle w, x \rangle \geq 1$; work it out if needed), and this loss is exactly the second part of the main function $F$ that we need to build the minimizer for.
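To make the loss concrete, here's a minimal C++ sketch of the hinge loss for dense vectors (the helper names `dot` and `hinge_loss` are mine, not from the paper):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inner product <w, x> of two dense vectors (assumed to be the same length).
double dot(const std::vector<double>& w, const std::vector<double>& x) {
    double s = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
    return s;
}

// Hinge loss l(w; (x, y)) = max{0, 1 - y<w, x>}, with y in {-1, +1}.
// A confident, correct prediction (y<w, x> >= 1) costs nothing; everything
// else is penalized linearly in the margin violation.
double hinge_loss(const std::vector<double>& w,
                  const std::vector<double>& x, int y) {
    return std::max(0.0, 1.0 - y * dot(w, x));
}
```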

PEGASOS stands for Primal Estimated sub-GrAdient SOlver for SVM. It is a stochastic method, where stochastic means having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely. Let's dive into the algorithm and see why it's stochastic. The PEGASOS algorithm performs stochastic sub-gradient descent on the primal objective function $F$ with a carefully chosen step size.

The basic procedure

In the basic (single-example) version, we start with $w_1 = 0$ and run for $T$ iterations. On iteration $t$, we pick one training example $(x_{i_t}, y_{i_t})$ uniformly at random and replace the objective $F$ with the approximation $f(w; i_t) = \frac{\lambda}{2}\lVert w\rVert^2 + l(w; (x_{i_t}, y_{i_t}))$. We then take a step of size $\eta_t = \frac{1}{\lambda t}$ opposite to a sub-gradient of this approximate objective, $\nabla_t = \lambda w_t - \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\, y_{i_t} x_{i_t}$ (where $\mathbb{1}[\cdot]$ is 1 if the condition holds and 0 otherwise), i.e. $w_{t+1} = w_t - \eta_t \nabla_t$.

Let's try to see why this approximation is okay for a suitable $T$. For this, we'll look at the complete gradient of $F$, and the sub-gradient $\nabla_t$ computed with the approximated function above, and see a relation between them. The complete gradient of $F$, call it $\dot{F}$, is

$$\dot{F} = \frac{\lambda}{2}\nabla\lVert w\rVert^2 + \frac{1}{m}\sum_{(x, y) \in S}\nabla l(w; (x, y))$$

Now what will be the expected value of the sub-gradient $\nabla_t$ that we computed above? To find the expected value, we observe that the example used to approximate the objective is chosen uniformly at random, which means that the probability of any particular example being selected is $P(e) = 1/m$. Thus the expected value of the sub-gradient $\nabla_t$ turns out to be equal to the complete gradient of $F$, our primal objective function. That is why, intuitively, this approximation is expected to work well enough. The next section deals with mini-batch iterations, which approximate the objective with less variance.
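Putting the basic procedure together, here is a rough, unoptimized C++ sketch of the single-example ($k = 1$) loop. The function name `pegasos_basic` and the fixed RNG seed are my own choices, and it reuses the `dot` helper from the earlier snippet:

```cpp
#include <random>
#include <vector>

// Basic PEGASOS: T iterations, each using one example chosen uniformly at
// random. X holds the training vectors, Y the labels in {-1, +1}.
// Reuses dot() from the previous snippet.
std::vector<double> pegasos_basic(const std::vector<std::vector<double>>& X,
                                  const std::vector<int>& Y,
                                  double lambda, int T) {
    std::vector<double> w(X[0].size(), 0.0);              // w_1 = 0
    std::mt19937 rng(42);                                  // fixed seed, arbitrary choice
    std::uniform_int_distribution<std::size_t> pick(0, X.size() - 1);

    for (int t = 1; t <= T; ++t) {
        std::size_t i = pick(rng);                         // uniform random example
        double eta = 1.0 / (lambda * t);                   // step size eta_t = 1/(lambda*t)
        bool violated = Y[i] * dot(w, X[i]) < 1.0;         // is the hinge loss active?

        // w_{t+1} = (1 - eta*lambda) * w_t  (+ eta * y_i * x_i if the loss is active)
        for (std::size_t j = 0; j < w.size(); ++j) {
            w[j] *= 1.0 - eta * lambda;
            if (violated) w[j] += eta * Y[i] * X[i][j];
        }
    }
    return w;
}
```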

Mini-batch iterations

As an extension of the basic procedure, we now select a subset of examples, rather than a single example, for approximating the objective. So for a given $k \in \{1, 2, \ldots, m\}$, we choose a subset of size $k$ and approximate the objective as before. Note that $k = 1$ is the case we already saw above. So now the objective can be written as $f(w; A_t) = \frac{\lambda}{2}\lVert w\rVert^2 + \frac{1}{k}\sum_{i \in A_t} l(w; (x_i, y_i))$, where $A_t$ is the subset chosen in the $t^{th}$ iteration.
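Under the same assumptions as the previous sketch, the mini-batch variant might look like the following; note that for brevity I sample $A_t$ with replacement, which is a simplification on my part:

```cpp
#include <random>
#include <vector>

// Mini-batch PEGASOS: each iteration averages the loss sub-gradient over a
// randomly chosen subset of size k (sampled with replacement here for brevity).
// Reuses dot() from the earlier snippet.
std::vector<double> pegasos_minibatch(const std::vector<std::vector<double>>& X,
                                      const std::vector<int>& Y,
                                      double lambda, int T, int k) {
    std::vector<double> w(X[0].size(), 0.0);
    std::mt19937 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, X.size() - 1);

    for (int t = 1; t <= T; ++t) {
        double eta = 1.0 / (lambda * t);

        // g = (1/k) * sum over A_t of y_i * x_i, counting only margin violators.
        std::vector<double> g(w.size(), 0.0);
        for (int s = 0; s < k; ++s) {
            std::size_t i = pick(rng);
            if (Y[i] * dot(w, X[i]) < 1.0)
                for (std::size_t j = 0; j < w.size(); ++j)
                    g[j] += Y[i] * X[i][j] / k;
        }

        // w_{t+1} = (1 - eta*lambda) * w_t + eta * g
        for (std::size_t j = 0; j < w.size(); ++j)
            w[j] = (1.0 - eta * lambda) * w[j] + eta * g[j];
    }
    return w;
}
```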

Projection step

A potential variation of the above algorithm is to limit the set of admissible solutions to a ball of radius $1/\sqrt{\lambda}$. To enforce this, we project $w_t$ after each iteration onto this ball as $w_{t+1} \leftarrow \min\left\{1, \frac{1/\sqrt{\lambda}}{\lVert w_{t+1}\rVert}\right\} w_{t+1}$. The revised analysis presented in the paper does not require this projection step; it is mentioned as optional because no major difference was found in the experiments between the projected and unprojected variants.
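Here is a small sketch of that projection (the helper name `project_to_ball` is mine); it would be called right after the weight update in either loop above:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Optional projection onto the ball of radius 1/sqrt(lambda):
// w <- min{1, (1/sqrt(lambda)) / ||w||} * w.
// Leaves w untouched when it already lies inside the ball (or is zero).
void project_to_ball(std::vector<double>& w, double lambda) {
    double norm = 0.0;
    for (double wj : w) norm += wj * wj;
    norm = std::sqrt(norm);
    if (norm == 0.0) return;
    double scale = std::min(1.0, (1.0 / std::sqrt(lambda)) / norm);
    for (double& wj : w) wj *= scale;
}
```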

It is proved in the paper that the number of iterations required to obtain a solution of accuracy $\epsilon$ is $O(1/\epsilon)$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs required $\Omega(1/\epsilon^2)$ iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with $1/\lambda$, where $\lambda$ is the regularization parameter of the SVM. Because PEGASOS works on a randomly sampled approximation of the objective, the runtime of the algorithm does not depend on the number of training examples; it depends only on $k$, the size of the subset we take in each iteration, and $T$, the number of iterations we make. The implemented code (in C++) will be put up later when I'm not feeling lazy.


