One Class SVM

An algorithm for novelty detection and empirical distribution support estimation


Separate the data distribution from the origin by a hyperplane

Assume the points are separable from the origin; what is the maximum-margin separating hyperplane?

The problem can be formulated as

$$\min_{w} \frac12 \Vert w \Vert^2 \quad\text{subject to}\quad w\cdot x_i \ge \rho, \ \text{with } \rho\ge 0$$

for all data points $x_i \in \mathbb R^n$. The problem is equivalent to the binary classification case in which $(x_i; 1)$ and $(-x_i; -1)$ are the two classes.
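For intuition, the separating hyperplane $\{x : w\cdot x = \rho\}$ lies at distance $\rho / \Vert w \Vert$ from the origin, so for a fixed $\rho > 0$ minimizing $\frac12 \Vert w \Vert^2$ maximizes the margin

$$\text{margin} = \frac{\rho}{\Vert w \Vert},$$

mirroring the maximum-margin objective of the binary SVM.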

The Lagrangian is

$$\mathcal L(w, \rho) = \frac12 \Vert w \Vert^2 - \sum_i \alpha_i (w\cdot x_i - \rho) - \mu \rho$$

with KKT conditions $\alpha_i (w\cdot x_i -\rho) = 0$ and $\mu\rho = 0$, where $\alpha_i \ge 0$ and $\mu \ge 0$ are the Lagrange multipliers.

Taking gradients with respect to $w$ and $\rho$,

$$\partial_w \mathcal L = w - \sum_i \alpha_i x_i = 0, \qquad \partial_\rho \mathcal L = \sum_i \alpha_i - \mu = 0$$

In other words, $w = \sum_i \alpha_i x_i$ and $\sum_i \alpha_i = \mu$. Substituting $w = \sum_i \alpha_i x_i$ back into the Lagrangian (the $\rho$ terms cancel since $\sum_i \alpha_i = \mu$), the dual problem Lagrangian is

$$\mathcal L^* (\alpha) = - \frac12 \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j\rangle \quad\text{with } \alpha_i \ge 0 \text{ and } \sum_i\alpha_i = {\rm const.} \ge 0$$

Without loss of generality, we set $\sum_i \alpha_i = 1$. Writing $K_{ij} = \langle x_i, x_j \rangle$ for the Gram matrix, we obtain the quadratic program

$$\max_{\alpha}\; - \frac12 \alpha^\top K \alpha \quad\text{subject to } \alpha_i\in[0,\infty),\ \alpha^\top \mathbf 1 = 1$$

or, equivalently,

$$\min_{\alpha}\; \frac12 \alpha^\top K \alpha \quad\text{subject to } \alpha_i\in[0,\infty),\ \alpha^\top \mathbf 1 = 1$$

Then $w=\sum_i \alpha_i x_i$, and $\rho$ can be recovered by plugging in any support vector $x_i^*$ (a point with $\alpha_i > 0$): $\rho = w\cdot x_i^*$.

The decision function is $f(x) = {\rm sgn}(w\cdot x -\rho)$, and $f(x_i)\ge 0$ for all data points.
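A minimal numerical sketch of this separable-case recipe, solving the dual QP with an off-the-shelf solver (`scipy.optimize.minimize`) and recovering $w$, $\rho$, and the decision function; the toy data and the `1e-6` support-vector threshold are illustrative assumptions, not part of the derivation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-D data, all points away from the origin (separable case).
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=0.5, size=(50, 2))

K = X @ X.T  # Gram matrix K_ij = <x_i, x_j>

# Dual QP: minimize (1/2) a^T K a  subject to  a_i >= 0, sum_i a_i = 1.
def objective(a):
    return 0.5 * a @ K @ a

m = X.shape[0]
a0 = np.full(m, 1.0 / m)                                 # feasible starting point
constraints = [{"type": "eq", "fun": lambda a: a.sum() - 1.0}]
bounds = [(0.0, None)] * m
res = minimize(objective, a0, bounds=bounds, constraints=constraints)

alpha = res.x
w = alpha @ X                                            # w = sum_i alpha_i x_i
support = alpha > 1e-6                                   # support vectors: alpha_i > 0
rho = np.mean(X[support] @ w)                            # rho = w . x_i* on support vectors

decision = np.sign(X @ w - rho)                          # f(x) = sgn(w . x - rho)
print("fraction of training points on the positive side:", np.mean(decision >= 0))
```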

Inseparable case and soft margin

Now we allow violations of the classification criterion $w\cdot x_i \ge \rho$, and penalize the misclassifications using the hinge loss

$$\mathcal L^{\rm hinge} (w, \rho) = \sum_i \max\bigl(0, -(w\cdot x_i-\rho)\bigr) = -\sum_i (w\cdot x_i -\rho)\, 1_{\{w\cdot x_i < \rho\}}$$
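A quick numerical check of the two equivalent forms of the hinge loss above; `w`, `rho`, and the random data are arbitrary placeholders chosen only for illustration.

```python
import numpy as np

def hinge_loss(w, rho, X):
    """sum_i max(0, -(w . x_i - rho)) over the rows of X."""
    margins = X @ w - rho
    return np.maximum(0.0, -margins).sum()

def hinge_loss_indicator(w, rho, X):
    """Equivalent indicator form: -sum_i (w . x_i - rho) * 1{w . x_i < rho}."""
    margins = X @ w - rho
    return -(margins * (margins < 0)).sum()

X = np.random.randn(10, 2)
w, rho = np.array([1.0, -0.5]), 0.3
assert np.isclose(hinge_loss(w, rho, X), hinge_loss_indicator(w, rho, X))
```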

Introduce a slack variable $\xi_i \ge 0$ as the discrepancy $\xi_i = \rho - w \cdot x_i$ for each violation. With the help of the slack variables, the constraints become

$$w\cdot x_i -\rho +\xi_i \ge 0$$

Thus, the soft-margin Lagrangian is

$$\mathcal L(w, \rho, \xi) = \frac12 \Vert w\Vert^2 + C\sum_i \xi_i - \sum_i \alpha_i (w\cdot x_i -\rho +\xi_i) - \sum_i \beta_i \xi_i - \mu\rho$$

where $C\ge 0$, $\beta_i \ge 0$, and $\mu\ge 0$. The gradients are

$$\partial_w \mathcal L = w - \sum_i \alpha_i x_i = 0, \qquad \partial_\rho \mathcal L = \sum_i \alpha_i - \mu = 0, \qquad \partial_{\xi_j} \mathcal L = C - \alpha_j - \beta_j = 0$$

The last condition, together with $\beta_j \ge 0$, gives the box constraint $0 \le \alpha_j \le C$, and the $\xi$ terms drop out of the Lagrangian. Thus, the dual problem is

$$\mathcal L^*(\alpha) = -\frac12 \sum_{i,j}\alpha_i\alpha_j \langle x_i,x_j \rangle = -\frac12 \alpha^\top K \alpha$$

Hence we have the quadratic program

$$\min_{\alpha}\; \frac12 \alpha^\top K \alpha \quad\text{subject to } \alpha_i \in [0, C] \ \text{and } \sum_i \alpha_i = 1$$

where $C=(\nu m_{\rm sample})^{-1}$ in the original paper, so $C^{-1} = \nu m_{\rm sample}$ is an upper bound on the number of outliers.
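In practice the soft-margin QP is usually handled by a library; below is a minimal sketch using scikit-learn's `OneClassSVM`, where the `nu` parameter plays the role of $\nu$ above. The toy data, the RBF kernel, and the value `nu=0.05` are illustrative choices, not prescribed by the derivation.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy training data: a single "inlier" cluster.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu upper-bounds the fraction of training outliers (and lower-bounds the
# fraction of support vectors); it corresponds to C = 1 / (nu * m_sample).
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
clf.fit(X_train)

X_test = np.array([[0.1, -0.2], [4.0, 4.0]])
print(clf.predict(X_test))            # +1 for points inside the estimated support, -1 outside
print(clf.decision_function(X_test))  # signed score relative to the learned threshold rho
```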