One Class SVM

An algorithm for novelty detection and for estimating the support of an empirical distribution.

The idea: separate the data from the origin by a hyperplane.

Assuming the points are separable from the origin, what is the maximum margin?

The problem can be formulated as

$$\min_{w} \frac12 \Vert w \Vert^2 \quad \text{subject to} \quad w\cdot x_i \ge \rho, \ \rho \ge 0$$

for all data points $x_i \in \mathbb R^n$. The problem is equivalent to the binary case where $(x_i; +1)$ and $(-x_i; -1)$ are the two classes.

The Lagrangian is

$$\mathcal L(w, \rho) = \frac12 \Vert w \Vert^2 - \sum_i \alpha_i (w\cdot x_i - \rho) - \mu \rho$$

with KKT conditions $\alpha_i (w\cdot x_i -\rho) = 0$ and $\mu\rho = 0$, where $\alpha_i \ge 0$ and $\mu \ge 0$.

Take gradients with respect to $w$ and $\rho$:

$$\partial_w \mathcal L = w - \sum_i \alpha_i x_i = 0, \qquad \partial_\rho \mathcal L = \sum_i \alpha_i - \mu = 0$$

Then $w=\sum_i \alpha_i x_i$. Substituting back, the dual Lagrangian is

$$\mathcal L^* (\alpha) = - \frac12 \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j\rangle \quad \text{with } \alpha_i \ge 0 \text{ and } \sum_i\alpha_i = \text{const.} \ge 0$$

Without loss of generality, we set $\sum_i \alpha_i = 1$ (rescaling all $\alpha_i$ rescales $w$ and $\rho$ by the same factor, so the decision boundary is unchanged).

In other words, writing $K_{ij} = \langle x_i, x_j\rangle$ for the Gram matrix,

$$\max_{\alpha}\; - \frac12 \alpha^\top K \alpha \quad \text{subject to } \alpha_i \ge 0,\ \alpha^\top \mathbf 1 = 1$$

or

$$\min_{\alpha} \frac12 \alpha^\top K \alpha \quad \text{subject to } \alpha_i \ge 0,\ \alpha^\top \mathbf 1 = 1$$

The offset $\rho$ can then be recovered by plugging in any support vector $x_i^*$ (a point with $\alpha_i > 0$): $\rho = w\cdot x_i^*$.

The decision function is $f(x) = \operatorname{sgn}(w\cdot x -\rho)$, and $f(x_i)\ge 0$ for all training points.
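To make the recipe concrete, here is a minimal numerical sketch of the steps above: solve the dual QP, set $w=\sum_i \alpha_i x_i$, and read $\rho$ off a support vector. It uses NumPy and SciPy's generic SLSQP solver on made-up toy data; a dedicated QP solver would be preferable in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: points clustered away from the origin (separable case).
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=0.5, size=(20, 2))

K = X @ X.T                      # Gram matrix K_ij = <x_i, x_j>
m = len(X)

# Dual: min_alpha 1/2 alpha^T K alpha  s.t.  alpha_i >= 0, sum_i alpha_i = 1
res = minimize(
    fun=lambda a: 0.5 * a @ K @ a,
    x0=np.full(m, 1.0 / m),
    jac=lambda a: K @ a,
    bounds=[(0.0, None)] * m,
    constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
    method="SLSQP",
)
alpha = res.x

w = alpha @ X                    # w = sum_i alpha_i x_i
sv = np.argmax(alpha)            # any index with alpha_i > 0 is a support vector
rho = w @ X[sv]                  # rho = w . x_i*

f = np.sign(X @ w - rho)         # decision function on the training points
print("rho =", rho, "min margin =", (X @ w - rho).min())
```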

Inseparable case and soft margin

Now we allow violations of the classification criterion $w\cdot x_i \ge \rho$, and penalize them using the hinge loss

$$\mathcal L^{\rm hinge} (w, \rho) = \sum_i \max\bigl(0, -(w\cdot x_i-\rho)\bigr) = -\sum_i (w\cdot x_i -\rho)\, 1_{\{w\cdot x_i < \rho\}}$$
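As a quick sanity check (a throwaway sketch with arbitrary `w`, `rho`, and data, not part of the derivation), the two expressions above agree numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
w = np.array([1.0, -0.5])
rho = 0.3

margins = X @ w - rho                        # w . x_i - rho for each point

# sum_i max(0, -(w . x_i - rho))
hinge_max = np.maximum(0.0, -margins).sum()

# -sum of (w . x_i - rho) over the violations w . x_i < rho
hinge_indicator = -(margins[margins < 0.0]).sum()

assert np.isclose(hinge_max, hinge_indicator)
```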

Introduce a slack variable $\xi_i \ge 0$ for each point, equal to the discrepancy $\xi_i = \rho - w \cdot x_i$ whenever the constraint is violated (and zero otherwise). With the help of slack variables, the constraint becomes

$$w\cdot x_i -\rho +\xi_i \ge 0$$

Thus, the soft-margin Lagrangian is

$$\mathcal L(w, \rho, \xi) = \frac12 \Vert w\Vert^2 + C\sum_i \xi_i - \sum_i \alpha_i (w\cdot x_i -\rho +\xi_i) - \sum_i \beta_i \xi_i - \mu\rho$$

where $C\ge 0$ is the penalty parameter and $\alpha_i, \beta_i, \mu \ge 0$ are Lagrange multipliers. The gradients are

$$\partial_w \mathcal L = w - \sum_i \alpha_i x_i = 0, \qquad \partial_\rho \mathcal L = \sum_i \alpha_i - \mu = 0, \qquad \partial_{\xi_j} \mathcal L = C - \alpha_j - \beta_j = 0$$

Since $\beta_j \ge 0$, the last condition implies $\alpha_j \le C$. Thus, the dual problem is

$$\mathcal L^*(\alpha) = -\frac12 \sum_{i,j}\alpha_i\alpha_j \langle x_i,x_j \rangle = -\frac12 \alpha^\top K \alpha$$

Hence we have the quadratic program

$$\min_{\alpha} \frac12 \alpha^\top K \alpha \quad \text{subject to } \alpha_i \in [0, C] \ \text{ and } \sum_i \alpha_i = 1$$

where $C=(\nu\, m_{\rm sample})^{-1}$ in the original paper, so $C^{-1} = \nu\, m_{\rm sample}$ is an upper bound on the number of outliers.
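In the numerical sketch above, the soft-margin dual only changes the bounds, from `(0.0, None)` to `(0.0, C)` for each $\alpha_i$ (with $C \ge 1/m$ so that $\sum_i \alpha_i = 1$ remains feasible). In practice one rarely solves the QP by hand: scikit-learn's `OneClassSVM` exposes this $\nu$-parameterization directly. A minimal usage sketch, assuming scikit-learn is available and using a linear kernel to match the derivation above:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy training data: a single cluster away from the origin.
rng = np.random.default_rng(2)
X_train = rng.normal(loc=3.0, scale=0.5, size=(200, 2))

# nu upper-bounds the fraction of training points treated as outliers
# and lower-bounds the fraction of support vectors.
clf = OneClassSVM(kernel="linear", nu=0.05)
clf.fit(X_train)

pred = clf.predict(X_train)                        # +1 inliers, -1 outliers
print("outlier fraction:", np.mean(pred == -1))    # roughly at most nu
```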
