SVM is now one of the most widely applied techniques in supervised machine learning.
In order to understand how it works, let’s go back to its origins and look at a binary classification task (reds and blues) in a 2D data space that is linearly separable. If we try with paper and pen, we see that many separating lines exist, and the million-dollar question is which line (a hyperplane in the N-dimensional case) separates the two classes best.
<img src="uploads/a74fff981e021b4459b34f8db9286161/svm_1.png" width="300">
For mathematical convenience, let’s move from colors into the number domain and say that we are trying to separate positive numbers from negative numbers. In this case, the decision boundary corresponds to the locations where this special line evaluates to zero:
$`w \cdot x + b = 0`$

For the points on the positive side of the boundary we require

$`y_{i} = w \cdot x_i + b \geq 1`$

while for the points on the negative side

$`y_{i} = w \cdot x_i + b \leq -1`$
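
As a small numerical illustration of these two constraints (reusing the hypothetical toy points from above; the candidate w and b are assumptions, not a fitted model), the check below verifies that every positive point scores at least +1 and every negative point at most -1:

```python
import numpy as np

# Same hypothetical points: first three are the positive class, last three the negative class.
X = np.array([[1.0, 3.0], [2.0, 4.0], [1.5, 3.5],
              [3.0, 1.0], [4.0, 2.0], [3.5, 1.5]])
labels = np.array([+1, +1, +1, -1, -1, -1])

w = np.array([-0.5, 0.5])   # candidate weight vector (an assumption)
b = 0.0                     # candidate bias

scores = X @ w + b          # y_i = w . x_i + b for every sample
feasible = np.all(scores[labels == +1] >= 1) and np.all(scores[labels == -1] <= -1)
print(scores)               # [ 1.  1.  1. -1. -1. -1.]
print(feasible)             # True: this (w, b) satisfies both constraints
```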
We are adding this rule to limit the degrees of freedom, as we could otherwise propose many solutions to the classification problem. This little touch is what we call the ‘margin’. The margin is defined as the perpendicular distance between the decision boundary and the closest data points. When we set an objective such as “maximize the margin”, we end up with a particular solution. Here, the solution is found via a subset of points called support vectors.
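
Staying with the same toy setup, here is a sketch of how the margin could be measured numerically: the perpendicular distance from the boundary $`w \cdot x + b = 0`$ to a point $`x_i`$ is $`|w \cdot x_i + b| / \lVert w \rVert`$, and the margin is the smallest of these distances. The points attaining that minimum are the support vectors.

```python
import numpy as np

def margin(w, b, X):
    """Perpendicular distance from the boundary w . x + b = 0 to the closest point in X."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Continuing the hypothetical example from above.
X = np.array([[1.0, 3.0], [2.0, 4.0], [1.5, 3.5],
              [3.0, 1.0], [4.0, 2.0], [3.5, 1.5]])

print(margin(np.array([-0.5, 0.5]), 0.0, X))   # ~1.414: every point lies exactly on a margin here
print(margin(np.array([-1.0, 2.0]), -2.0, X))  # ~0.894: also separates the data, but with a smaller margin
```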
Then the question becomes “how can we evaluate the margin in the first place?”. We should first have an expression for the margin and then a smart way to find its maximum. Here we will use a normalized definition. It is best seen with an illustration: consider a plane separating blue and red data points as shown below. Here we can easily evaluate the width of the “street” separating the observations as twice the margin (γ):
<img src="uploads/1363053e315b3f87492ed55dbdbfeee2/svm_2.png" width="300">
<img src="uploads/f1688d840610c5aabf43042f8111e57c/svm_3.png" width="300">
Now our objective becomes the maximization of the geometric margin γ. Nonetheless, solving this optimization problem directly is quite complex, so we make another decision and convert it into a different, easier-to-solve problem (as we always do).
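
The standard conversion, consistent with the margin expression above, is to note that maximizing $`\gamma = 1 / \lVert w \rVert`$ is the same as minimizing $`\lVert w \rVert`$, which is usually written as

$`\min_{w, b} \ \tfrac{1}{2} \lVert w \rVert^{2} \quad \text{subject to} \quad w \cdot x_i + b \geq +1 \text{ for the positives}, \quad w \cdot x_i + b \leq -1 \text{ for the negatives}`$

The square and the factor of one half do not move the minimum; they only make the objective smooth and keep the later derivatives tidy.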
Here we refresh our memories and think about solution methods that work well with:
* the presence of mathematical constraints (our earlier restrictions on the degrees of freedom)
* finding the minima / maxima of the given expression while satisfying constraints
Luckily for us, we have the perfect remedy for exactly this: the [method of Lagrange multipliers](https://en.wikipedia.org/wiki/Lagrange_multiplier):
<img src="uploads/82d423a38538131d2302da3ab891d704/svm_4.png" width="600">
If you are curious about the full derivation and the underlying mathematics of the kernel trick and soft-margin formulations, you may explore these two references:
* Christopher M. Bishop, Pattern Recognition and Machine Learning, Ch. 7.
* Nello Cristianini and John Shawe-Taylor, [An Introduction to Support Vector Machines and other Kernel-Based Learning Methods](https://www.cambridge.org/core/books/an-introduction-to-support-vector-machines-and-other-kernelbased-learning-methods/A6A6F4084056A4B23F88648DDBFDD6FC)
...