nearest neighbour

See also: slides 13, slides 14, slides 15
expected error minimisation
Think of it as a bias-variance tradeoff.
Squared loss: $l(\hat{y},y)=(y-\hat{y})^2$

The solution to $y^* = \argmin_{\hat{y}} E_{X,Y}(Y-\hat{y}(X))^2$ is $E[Y \mid X=x]$.

Instead we have $Z = \{(x^i, y^i)\}_{i=1}^{n}$.
error decomposition
$$
\begin{aligned}
E_{x,y}(y-\hat{y}_Z(x))^2 &= E_{x,y}(y-y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y}_Z(x))^2 \\
&= \text{noise} + \text{estimation error}
\end{aligned}
$$
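The cross term $2 E_{x,y}\big[(y-y^{*}(x))(y^{*}(x)-\hat{y}_Z(x))\big]$ vanishes because $y^{*}(x) = E[Y \mid X=x]$, so $y - y^{*}(x)$ has zero mean conditional on $x$.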
bias-variance decompositions
For a linear estimator:
$$
\begin{aligned}
E_Z & E_{x,y}(y-(\hat{y}_Z(x)\coloneqq W^T_Z x))^2 \\
=\ & E_{x,y}(y-y^{*}(x))^2 \quad \text{noise} \\
&+ E_x(y^{*}(x) - E_Z(\hat{y}_Z(x)))^2 \quad \text{bias} \\
&+ E_x E_Z(\hat{y}_Z(x) - E_Z(\hat{y}_Z(x)))^2 \quad \text{variance}
\end{aligned}
$$
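As a quick check of this decomposition, here is a hedged Monte Carlo sketch (my own construction, not from the slides): fit a least-squares $W_Z$ on many independent training sets $Z$ and estimate the noise, bias, and variance terms on a common test grid.

```python
# Monte Carlo check of noise + bias^2 + variance for a least-squares linear
# estimator; w_true, sigma, and the sample sizes are arbitrary choices of mine.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, n_datasets = 3, 20, 2000, 500
w_true, sigma = np.array([1.0, -2.0, 0.5]), 0.3

X_test = rng.normal(size=(n_test, d))
y_star = X_test @ w_true                      # y*(x) = E[Y | X = x]

preds = np.empty((n_datasets, n_test))
for k in range(n_datasets):                   # draw many training sets Z
    X = rng.normal(size=(n_train, d))
    y = X @ w_true + sigma * rng.normal(size=n_train)
    W_Z, *_ = np.linalg.lstsq(X, y, rcond=None)
    preds[k] = X_test @ W_Z                   # \hat{y}_Z(x) on the shared test grid

noise = sigma ** 2                            # E_{x,y}(y - y*(x))^2
bias2 = np.mean((y_star - preds.mean(axis=0)) ** 2)
variance = np.mean(preds.var(axis=0))
total = noise + bias2 + variance              # ~ E_Z E_{x,y}(y - \hat{y}_Z(x))^2
print(f"noise={noise:.4f} bias^2={bias2:.4f} var={variance:.4f} total={total:.4f}")
```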
accuracy
zero-one loss:
$$
l^{0-1}(y, \hat{y}) = 1_{y \neq \hat{y}} = \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}
$$
linear classifier
$$
\begin{aligned}
\hat{y}_W(x) &= \text{sign}(W^T x) = 1_{W^T x \geq 0} \\[8pt]
\text{where } \hat{W} &= \argmin_{W} L_{Z}^{0-1}(\hat{y}_W)
\end{aligned}
$$
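A small illustrative snippet (the helper names are mine) of the classifier and its empirical zero-one loss $L_Z^{0-1}(\hat{y}_W)$, assuming labels in $\{-1,+1\}$:

```python
# Zero-one loss of a linear classifier sign(W^T x) on a data set Z.
import numpy as np

def zero_one_loss(y_hat, y):
    """Empirical zero-one loss: fraction of points where y_hat != y."""
    return np.mean(y_hat != y)

def linear_classifier(W, X):
    """Predict with sign(W^T x); here the indicator W^T x >= 0 is mapped to +/-1."""
    return np.where(X @ W >= 0, 1, -1)

# Toy usage with labels in {-1, +1}.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [3.0, -1.0]])
y = np.array([1, -1, 1])
W = np.array([1.0, 0.0])
print(zero_one_loss(linear_classifier(W, X), y))   # L_Z^{0-1}(\hat{y}_W)
```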
surrogate loss functions
Assume the classifier returns a discrete value $\hat{y}_W = \text{sign}(W^T x) \in \{0,1\}$.

What if the classifier's output is continuous?

$\hat{y}$ will also capture the "confidence" of the classifier.

Think of continuous loss functions: margin loss, cross-entropy/negative log-likelihood, etc.
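For instance, two standard continuous surrogates on the raw score $W^T x$: a margin (hinge) loss and the logistic negative log-likelihood. The concrete choices below are mine, not prescribed by the slides.

```python
# Two common surrogate losses evaluated on continuous scores W^T x,
# with labels assumed to be in {-1, +1}.
import numpy as np

def hinge_loss(scores, y):
    """Margin-based surrogate: mean of max(0, 1 - y * score)."""
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

def logistic_loss(scores, y):
    """Negative log-likelihood of a sigmoid model: mean of log(1 + exp(-y * score))."""
    return np.mean(np.log1p(np.exp(-y * scores)))

scores = np.array([2.3, -0.4, 0.1])   # W^T x for three examples
y = np.array([1, 1, -1])
print(hinge_loss(scores, y), logistic_loss(scores, y))
```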
linearly separable data
A binary classification data set $Z=\{(x^i, y^i)\}_{i=1}^{n}$ is linearly separable if there exists a $W^{*}$ such that:

$$\forall i \in [n] \mid \text{SGN}(\langle x^i, W^{*} \rangle) = y^i$$

Or, for every $i \in [n]$ we have $({W^{*}}^T x^i)\, y^i > 0$
linear programming
$$
\begin{aligned}
\max_{W \in \mathbb{R}^d} \quad &\langle u, w \rangle = \sum_{i=1}^{d} u_i w_i \\
\text{s.t.} \quad & A w \ge v
\end{aligned}
$$
Given that the data is linearly separable:
$$
\begin{aligned}
\exists\, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i > 0 \\
\exists\, W^{*}, \gamma > 0 &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i \ge \gamma \\
\exists\, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i \ge 1
\end{aligned}
$$
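The second line follows by taking $\gamma = \min_{i \in [n]} ({W^{*}}^T x^i)y^i$, which is positive because $n$ is finite, and the third by rescaling $W^{*} \mapsto W^{*}/\gamma$.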
LP for linear classification
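One way to instantiate this, sketched below under my own naming: set $u = 0$ so the LP is a pure feasibility problem, take the rows of $A$ to be $y^i x^i$, and $v = \mathbf{1}$; any feasible $w$ then satisfies $(w^T x^i)y^i \ge 1$ and separates the data. The solver call uses `scipy.optimize.linprog`; the helper `lp_separator` is illustrative, not from the slides.

```python
# Posing linear separability as an LP feasibility problem:
# find w with (y_i * x_i) @ w >= 1 for all i.
import numpy as np
from scipy.optimize import linprog

def lp_separator(X, y):
    """Return a separating weight vector w, or None if the LP is infeasible.

    X: (n, d) array of inputs, y: (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    A = y[:, None] * X            # rows are y_i * x_i
    # linprog minimises c @ w subject to A_ub @ w <= b_ub;
    # we only need feasibility, so c = 0, and A w >= 1 becomes -A w <= -1.
    res = linprog(c=np.zeros(d),
                  A_ub=-A,
                  b_ub=-np.ones(n),
                  bounds=[(None, None)] * d)   # allow negative weights
    return res.x if res.success else None

# Toy usage on a separable 2-D data set (labels assumed to be +/-1).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = lp_separator(X, y)
print(w, None if w is None else np.sign(X @ w))
```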
perceptron
Rosenblatt’s perceptron algorithm
"\\begin{algorithm}\n\\caption{Batch Perceptron}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE Initialize $\\mathbf{w}^{(1)} = (0,\\ldots,0)$\n\\FOR{$t = 1,2,\\ldots$}\n \\IF{$(\\exists \\space i \\text{ s.t. } y_i\\langle\\mathbf{w}^{(t)}, \\mathbf{x}_i\\rangle \\leq 0)$}\n \\STATE $\\mathbf{w}^{(t+1)} = \\mathbf{w}^{(t)} + y_i\\mathbf{x}_i$\n \\ELSE\n \\STATE \\textbf{output} $\\mathbf{w}^{(t)}$\n \\ENDIF\n\\ENDFOR\n\\end{algorithmic}\n\\end{algorithm}"
Algorithm: Batch Perceptron

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

Initialize $\mathbf{w}^{(1)} = (0,\ldots,0)$
for $t = 1, 2, \ldots$ do
  if $\exists\, i$ s.t. $y_i\langle\mathbf{w}^{(t)}, \mathbf{x}_i\rangle \leq 0$ then
    $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i\mathbf{x}_i$
  else
    output $\mathbf{w}^{(t)}$
  end if
end for
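A minimal NumPy sketch of the batch perceptron above (variable names are mine; labels are assumed to be in $\{-1,+1\}$):

```python
import numpy as np

def batch_perceptron(X, y, max_iter=10_000):
    """Run the batch perceptron; return (w, number_of_updates)."""
    n, d = X.shape
    w = np.zeros(d)                      # w^(1) = (0, ..., 0)
    updates = 0
    for _ in range(max_iter):            # cap iterations in case data is not separable
        margins = y * (X @ w)            # y_i <w, x_i> for every example
        mistakes = np.where(margins <= 0)[0]
        if mistakes.size == 0:           # no violated example: output w
            return w, updates
        i = mistakes[0]                  # any violated example works
        w = w + y[i] * X[i]              # w^(t+1) = w^(t) + y_i x_i
        updates += 1
    return w, updates                    # may not separate if the loop cap was hit
```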
greedy update
$$
\begin{aligned}
W_{\text{new}}^T x^i y^i &= \langle W_{\text{old}} + y^i x^i,\ x^i \rangle\, y^i \\
&= W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2\, y^{i} y^{i}
\end{aligned}
$$
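Since $y^i y^i = 1$ (taking $y^i \in \{-1,+1\}$ as in the perceptron update above), each update raises the margin $W^T x^i y^i$ on the offending example by $\|x^i\|_2^2$, pushing it towards correct classification.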
proof
See also (Novikoff, 1962)
Assume there exists some parameter vector $\underline{\theta}^{*}$ such that $\|\underline{\theta}^{*}\| = 1$ and some $\gamma > 0$ such that, for all $t$,

$$y_t(\underline{x_t} \cdot \underline{\theta^{*}}) \ge \gamma$$

Assumption: $\forall\, t = 1 \ldots n,\ \|\underline{x_t}\| \le R$

Then the perceptron makes at most $\frac{R^2}{\gamma^2}$ errors.
proof by induction
Define $\underline{\theta^k}$ to be the parameter vector when the algorithm makes its $k^{\text{th}}$ error. Note that we have $\underline{\theta^{1}} = \underline{0}$.

Assume that the $k^{\text{th}}$ error is made on example $t$. Then:
$$
\begin{align}
\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} &= (\underline{\theta^k} + y_t \underline{x_t}) \cdot \underline{\theta^{*}} \\
&= \underline{\theta^k} \cdot \underline{\theta^{*}} + y_t\, \underline{x_t} \cdot \underline{\theta^{*}} \\
&\ge \underline{\theta^k} \cdot \underline{\theta^{*}} + \gamma \\[12pt]
&\because \text{ Assumption: } y_t\, \underline{x_t} \cdot \underline{\theta^{*}} \ge \gamma
\end{align}
$$
It follows by induction on $k$ that

$$\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} \ge k \gamma$$

Using Cauchy–Schwarz we have $\|\underline{\theta^{k+1}}\| \times \|\underline{\theta^{*}}\| \ge \underline{\theta^{k+1}} \cdot \underline{\theta^{*}}$, hence
$$
\begin{align}
\|\underline{\theta^{k+1}}\| &\ge k \gamma \\[16pt]
&\because \|\underline{\theta^{*}}\| = 1
\end{align}
$$
In the second part, we will find an upper bound for (5):
$$
\begin{align}
\|\underline{\theta^{k+1}}\|^2 &= \|\underline{\theta^k} + y_t \underline{x_t}\|^2 \\
&= \|\underline{\theta^k}\|^2 + y_t^2 \|\underline{x_t}\|^2 + 2 y_t\, \underline{x_t} \cdot \underline{\theta^k} \\
&\le \|\underline{\theta^k}\|^2 + R^2
\end{align}
$$
(9) is due to:
- $y_t^2 \|\underline{x_t}\|^2 = \|\underline{x_t}\|^2 \le R^2$ by the assumption of the theorem
- $y_t\, \underline{x_t} \cdot \underline{\theta^k} \le 0$, since the parameter vector $\underline{\theta^k}$ gave an error on the $t^{\text{th}}$ example
It follows by induction on $k$ that
$$
\begin{align}
\|\underline{\theta^{k+1}}\|^2 \le kR^2
\end{align}
$$
Squaring (5) and combining with (10) gives us:
$$
\begin{aligned}
k^2 \gamma^2 &\le \|\underline{\theta^{k+1}}\|^2 \le kR^2 \\
k &\le \frac{R^2}{\gamma^2}
\end{aligned}
$$
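As a sanity check of the bound (my own, self-contained snippet): run the perceptron on synthetic separable data and compare the number of updates with $R^2/\gamma^2$.

```python
# On separable data the number of perceptron updates should not exceed R^2 / gamma^2.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic separable data: labels generated by a fixed unit-norm theta_star.
d, n = 5, 200
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
X = rng.normal(size=(n, d))
y = np.sign(X @ theta_star)

# Perceptron mistake loop (same update as the pseudocode above).
theta, mistakes = np.zeros(d), 0
while True:
    viol = np.where(y * (X @ theta) <= 0)[0]
    if viol.size == 0:
        break
    i = viol[0]
    theta += y[i] * X[i]
    mistakes += 1

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ theta_star))      # margin achieved by theta_star
print(mistakes, "<=", (R / gamma) ** 2)   # mistake bound from the theorem
```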