Support Vector Machine

Formulation of Support Vector Machine (SVM)

Primal Optimization Problem

SVM is a maximum-margin classifier: the decision boundary $f(x) = \vec w \cdot \vec x + b$ separates two classes labeled by $y = \pm 1$ in $\mathbb R^n$ with the largest possible margin.

The normal vector of the hyperplane

The normal vector in the direction of increasing $f(x)$ is

$$\hat n = \frac{\nabla f(x)}{\Vert \nabla f(x) \Vert_2} = \frac{\vec w}{\Vert \vec w \Vert_2}$$

The perpendicular distance to the hyperplane from any point $\vec x' \in \mathbb R^n$

The distance between a point $\vec x'$ and the plane $f(x) = 0$ is the vector from any point $\vec x$ on the plane to $\vec x'$, projected onto the normal direction: $d = \vert (\vec x' - \vec x)\cdot \hat n \vert$, where $f(x) = \Vert \vec w \Vert\, \hat n \cdot \vec x + b$. Since points on the decision boundary satisfy $0 = \Vert \vec w \Vert\, \hat n \cdot \vec x + b$, we obtain

$$d = \frac{\left\vert \vec w \cdot \vec x' + b \right\vert}{\Vert \vec w \Vert}$$
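
As a quick sanity check, a small NumPy snippet (with made-up numbers) can confirm that $\vert \vec w \cdot \vec x' + b \vert / \Vert \vec w \Vert$ agrees with the projection definition $\vert (\vec x' - \vec x)\cdot \hat n \vert$:

```python
import numpy as np

w = np.array([3.0, 4.0])            # normal vector of the hyperplane
b = -5.0
x_prime = np.array([2.0, 1.0])      # an arbitrary query point

# closed-form distance |w.x' + b| / ||w||
d = abs(w @ x_prime + b) / np.linalg.norm(w)

# projection definition: pick any point on the plane, e.g. x = (1, 0.5), which satisfies w.x + b = 0
x_on_plane = np.array([1.0, 0.5])
n_hat = w / np.linalg.norm(w)
d_proj = abs((x_prime - x_on_plane) @ n_hat)

print(d, d_proj)                    # both print 1.0
```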

Maximize the distance subject to the constraint of correct class labels

The optimization is

$$\max_{w, b} \min_x d = \max_{w, b}\min_{x} \frac{\left\vert \vec w\cdot \vec x + b \right\vert}{\Vert \vec w \Vert}$$

subject to

$$\vec w \cdot \vec x + b \ge +1, \quad y = +1$$
$$\vec w \cdot \vec x + b \le -1, \quad y = -1$$

which is equivalent to

$$\max_{w,b} \frac{1}{\Vert w \Vert} \equiv \min_{w,b} \frac12\Vert w \Vert_2^2, \quad \forall (\vec x, y):\; 1 - y(\vec w \cdot \vec x + b) \le 0$$
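
This hard-margin problem is a small quadratic program, so it can be handed to an off-the-shelf convex solver. Below is a minimal sketch using cvxpy on a hypothetical toy dataset; the data, variable names, and solver choice are illustrative assumptions, not part of the original derivation.

```python
import cvxpy as cp
import numpy as np

# Hypothetical linearly separable toy data with labels +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min 1/2 ||w||^2   s.t.   y_i (w . x_i + b) >= 1 for every i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)
# The dual variables of the margin constraints are the KKT multipliers alpha_i;
# they are nonzero only for the support vectors.
print(constraints[0].dual_value)
```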

Introducing a Lagrange (KKT) multiplier $\alpha_i$ for each data-label pair $(\vec x_i, y_i)$, the primal Lagrangian, or cost function, is

$$\mathcal L^P(w,b) = \frac12 w^\top w + \sum_i \alpha_i \left[1-y_i(w^\top x_i + b)\right]$$

with

$$\alpha_i \left[1 - y_i(w^\top x_i + b)\right] = 0, \quad \alpha_i \ge 0$$

Soft margin and slack variables

The hinge loss $\max(0, 1-y(w^\top x + b))$ measures the degree of margin violation; the slack variable $\xi = \max(0, 1 - y(w^\top x + b))$ records how far a point lies inside the margin or on the wrong side of the boundary.

The optimization becomes

$$\min_{w,b} \frac12 \Vert w \Vert^2 + C\sum_i \max\left(0, 1-y_i(w^\top x_i + b)\right)$$
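
Since the hinge-loss objective is unconstrained (though non-smooth), it can also be minimized directly. Here is a rough sketch of full-batch subgradient descent in NumPy; the step size, epoch count, and function name are illustrative assumptions.

```python
import numpy as np

def soft_margin_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violated = margins < 1                     # points with nonzero hinge loss
        # subgradient of the objective: the hinge term contributes only for violated points
        grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
        grad_b = -C * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# usage: w, b = soft_margin_subgradient(X, y, C=1.0)
```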

If we wish to use the dual form, we need to cast the hinge loss into constraints. The soft-margin constraint becomes

$$(1-\xi) - y(w^\top x + b) \le 0, \quad \xi \ge 0$$

The primal Lagrangian becomes

$$\mathcal L^P(w,b, \{\xi_i\}) = \frac12 w^\top w + C\sum_i \xi_i + \sum_i \alpha_i \left[(1-\xi_i)-y_i(w^\top x_i + b)\right] - \sum_i \mu_i\xi_i$$

The gradients are

$$\partial_w \mathcal L^P = w - \sum_i \alpha_i y_i x_i = 0$$
$$\partial_b \mathcal L^P = -\sum_i \alpha_i y_i = 0$$
$$\partial_{\xi_i} \mathcal L^P = C - \mu_i - \alpha_i = 0$$

with $\mu_i \ge 0$, $\alpha_i \ge 0$, $\xi_i \ge 0$.

Dual Formulation

The dual Lagrangian reads

$$\mathcal L^D = \sum_i \alpha_i -\frac12\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$

which follows from substituting the stationarity conditions back into $\mathcal L^P$: $C - \mu_i - \alpha_i = 0$ cancels every term containing $\xi_i$, while $w = \sum_i \alpha_i y_i x_i$ reduces $\frac12 w^\top w - \sum_i \alpha_i y_i w^\top x_i$ to $-\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$, and the term $-b\sum_i \alpha_i y_i$ vanishes because $\sum_i \alpha_i y_i = 0$.

Now the optimization problem becomes

$$\max_{\{\alpha_i\}} \sum_i \alpha_i -\frac12\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$

subject to $\alpha_i \in [0, C]$ (from $C - \mu_i - \alpha_i = 0$ with $\mu_i, \alpha_i \ge 0$) and $\sum_i \alpha_i y_i = 0$ (from $\partial_b \mathcal L^P = 0$).
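
The dual is again a small quadratic program. For a linear kernel the quadratic term equals $\Vert \sum_i \alpha_i y_i x_i \Vert^2$, which gives a compact cvxpy sketch; the toy data and the value of $C$ are assumptions for illustration.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, C = len(y), 10.0

alpha = cp.Variable(n)
# For the linear kernel, sum_{i,j} a_i a_j y_i y_j <x_i, x_j> = ||sum_i a_i y_i x_i||^2.
w_expr = X.T @ cp.multiply(alpha, y)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(w_expr))
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

# Recover w from the stationarity condition w = sum_i alpha_i y_i x_i.
w = X.T @ (alpha.value * y)
print(alpha.value, w)
```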

Kernel Trick

Replace $\langle x_i, x_j \rangle$ with a kernel function $K(x_i,x_j) = \langle \phi(x_i),\phi(x_j)\rangle_K$. The weight becomes (Representer Theorem)

$$w = \sum_i \alpha_i y_i \phi(x_i)$$

and the decision hypersurface

$$f(x) = \langle w, \phi(x) \rangle + b = \sum_i \alpha_i y_i \langle \phi(x_i), \phi(x)\rangle + b = \sum_i \alpha_i y_i K(x_i, x) + b$$

where the bias (or threshold) can be solved using

$$y_j f(x_j) = 1 \;\Rightarrow\; \sum_{i} \alpha_i y_i y_j K(x_i, x_j) = 1 - b y_j$$

where $x_j$ is any support vector lying exactly on the margin, i.e. $0 < \alpha_j < C$.
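
To see the kernelized decision function in action, one option is to let scikit-learn's SVC fit a small RBF-kernel problem and then rebuild $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ by hand from its stored coefficients (in scikit-learn, `dual_coef_` holds $\alpha_i y_i$ for the support vectors and `intercept_` holds $b$). The data and hyperparameters below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable, so a kernel is required.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

gamma = 0.5
clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)

def rbf(A, x):
    # K(a, x) = exp(-gamma * ||a - x||^2), evaluated row-wise against x
    return np.exp(-gamma * np.sum((A - x) ** 2, axis=-1))

x_test = np.array([0.25, 0.75])
k = rbf(clf.support_vectors_, x_test)
f_manual = clf.dual_coef_[0] @ k + clf.intercept_[0]

print(f_manual, clf.decision_function([x_test])[0])   # the two values should agree
```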
