One-Class SVM
An algorithm for novelty detection and for estimating the support of an empirical distribution.
The idea is to separate the data distribution from the origin by a hyperplane.
Assume the points are separable from the origin; what is the maximum-margin hyperplane?
The problem can be formulated as
$$\min_{w} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad w \cdot x_i \ge \rho, \quad \rho \ge 0,$$
for all data points $x_i \in \mathbb{R}^n$. The problem is equivalent to the binary case where $(x_i, +1)$ and $(-x_i, -1)$ are the two classes.
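To illustrate the reduction, here is a minimal sketch (assuming scikit-learn; the toy data and variable names are made up): train a no-intercept binary linear SVM on the reflected dataset and check that the resulting hyperplane through the origin puts all original points on the positive side.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=0.5, size=(50, 2))  # toy one-class data, away from origin

# Reflected binary problem: (x_i, +1) and (-x_i, -1).
Xb = np.vstack([X, -X])
yb = np.hstack([np.ones(len(X)), -np.ones(len(X))])

# No intercept forces the hyperplane through the origin;
# a large C approximates the hard-margin (separable) case.
svm = LinearSVC(fit_intercept=False, C=100.0, max_iter=10_000).fit(Xb, yb)
w = svm.coef_.ravel()

rho = (X @ w).min()  # smallest margin to the origin
assert rho > 0       # every original point lies on the positive side
```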
The Lagrangian is
$$L(w, \rho) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i (w \cdot x_i - \rho) - \mu \rho$$
with KKT conditions $\alpha_i (w \cdot x_i - \rho) = 0$, where $\alpha_i \ge 0$ and $\mu \ge 0$.
Taking gradients,
$$\partial_w L = w - \sum_i \alpha_i x_i = 0, \qquad \partial_\rho L = \sum_i \alpha_i - \mu = 0.$$
Then $w = \sum_i \alpha_i x_i$. Substituting back, the dual Lagrangian is
$$L^*(\alpha) = -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j \rangle$$
with $\alpha_i \ge 0$ and $\sum_i \alpha_i = \text{const.} \ge 0$. Without loss of generality, we set $\sum_i \alpha_i = 1$ (rescaling $\alpha$ only rescales $w$ and $\rho$ jointly, leaving the sign of the decision function unchanged).
In other words,
$$\max_\alpha\ -\frac{1}{2} \alpha^\top K \alpha \quad \text{subject to} \quad \alpha_i \in [0, \infty), \quad \alpha^\top \mathbf{1} = 1,$$
or equivalently
$$\min_\alpha\ \frac{1}{2} \alpha^\top K \alpha \quad \text{subject to} \quad \alpha_i \in [0, \infty), \quad \alpha^\top \mathbf{1} = 1,$$
where $K_{ij} = \langle x_i, x_j \rangle$ is the Gram matrix. The offset $\rho$ can be recovered by plugging in any support vector $x_{i^*}$ (one with $\alpha_{i^*} > 0$): $\rho = w \cdot x_{i^*}$.
The decision function is $f(x) = \operatorname{sgn}(w \cdot x - \rho)$, and $f(x_i) \ge 0$ for all training points.
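As a sanity check of the derivation, here is a minimal sketch that solves the dual QP above numerically, assuming SciPy's SLSQP solver, a linear kernel, and made-up 2-D data (all variable names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=0.5, size=(50, 2))  # toy one-class data
K = X @ X.T                                       # linear-kernel Gram matrix
m = len(X)

# min_alpha 1/2 alpha^T K alpha  s.t.  alpha_i >= 0,  sum_i alpha_i = 1.
res = minimize(
    fun=lambda a: 0.5 * a @ K @ a,
    x0=np.full(m, 1.0 / m),
    jac=lambda a: K @ a,
    bounds=[(0.0, None)] * m,
    constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0},
    method="SLSQP",
)
alpha = res.x

w = alpha @ X                 # w = sum_i alpha_i x_i
sv = np.argmax(alpha)         # a support vector (alpha_i > 0)
rho = w @ X[sv]               # rho = w . x_{i*}
f = np.sign(X @ w - rho)      # decision function on the training data
```

Any $\alpha_i > 0$ marks a support vector, and up to solver tolerance $f(x_i) \ge 0$ holds on all training points.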
Inseparable case and soft margin
Now we allow violations of the classification criterion $w \cdot x_i \ge \rho$ and penalize them with the hinge loss
$$L_{\text{hinge}}(w, \rho) = \sum_i \max\big(0, -(w \cdot x_i - \rho)\big) = -\sum_i (w \cdot x_i - \rho)\, \mathbf{1}\{w \cdot x_i < \rho\}.$$
Introduce a slack variable $\xi_i \ge 0$ for each point, equal to the discrepancy $\xi_i = \rho - w \cdot x_i$ for each violation (and $0$ otherwise). With the slack variables, the constraints become
$$w \cdot x_i - \rho + \xi_i \ge 0.$$
Thus the soft-margin Lagrangian is
$$L(w, \rho, \xi) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i (w \cdot x_i - \rho + \xi_i) - \sum_i \beta_i \xi_i - \mu \rho$$
where $C \ge 0$, $\beta_i \ge 0$, and $\mu \ge 0$. The gradients are
$$\partial_w L = w - \sum_i \alpha_i x_i = 0, \qquad \partial_\rho L = \sum_i \alpha_i - \mu = 0, \qquad \partial_{\xi_j} L = C - \alpha_j - \beta_j = 0.$$
Since $\beta_j \ge 0$, the last condition gives $\alpha_j = C - \beta_j \le C$. Thus the dual problem is
$$L^*(\alpha) = -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j \rangle = -\frac{1}{2} \alpha^\top K \alpha.$$
Hence we have the quadratic program
$$\min_\alpha\ \frac{1}{2} \alpha^\top K \alpha \quad \text{subject to} \quad \alpha_i \in [0, C] \quad \text{and} \quad \sum_i \alpha_i = 1,$$
where $C = (\nu m_{\text{sample}})^{-1}$ in the original paper, so $C^{-1} = \nu m_{\text{sample}}$ is an upper bound on the number of outliers (margin errors).
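scikit-learn's OneClassSVM solves this kernelized program and is parameterized directly by $\nu$ rather than $C$. A minimal usage sketch on made-up data (the tolerance in the final check is an assumption to absorb solver inaccuracy):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=0.5, size=(200, 2))  # toy training data

nu = 0.05                                # target: at most ~5% outliers
clf = OneClassSVM(kernel="linear", nu=nu).fit(X)

pred = clf.predict(X)                    # +1 inlier, -1 outlier
n_outliers = int((pred == -1).sum())

# nu * m upper-bounds the number of margin errors (outliers) at the optimum.
assert n_outliers <= nu * len(X) + 1
```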