
See also: slides 13, slides 14, slides 15

expected error minimisation

Think of it as a bias-variance tradeoff.

Squared loss: $l(\hat{y},y)=(y-\hat{y})^2$

The solution to $y^* = \argmin_{\hat{y}} E_{X,Y}(Y-\hat{y}(X))^2$ is $y^{*}(x) = E[Y \mid X=x]$.

Instead we have a training set $Z = \{(x^i, y^i)\}_{i=1}^{n}$.
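
A quick numerical sanity check of the claim above, as a minimal sketch: the data-generating process ($Y = \sin(X)$ plus Gaussian noise) is an illustrative assumption, not part of the notes, and the conditional mean $E[Y \mid X = x] = \sin(x)$ achieves a lower expected squared error than another fixed predictor.

```python
import numpy as np

# Minimal sketch: for squared loss, the conditional mean E[Y | X = x]
# beats any other fixed predictor of Y from X in expected error.
# The data-generating process (Y = sin(X) + Gaussian noise) is an
# illustrative assumption.
rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(-3, 3, size=n)
Y = np.sin(X) + rng.normal(scale=0.5, size=n)   # so E[Y | X = x] = sin(x)

bayes = np.sin(X)   # the optimal predictor y*(x)
other = 0.8 * X     # some other predictor y_hat(x)

print("E[(Y - y*(X))^2] ~", np.mean((Y - bayes) ** 2))   # ~ 0.25, the pure noise
print("E[(Y - y_hat(X))^2] ~", np.mean((Y - other) ** 2))  # strictly larger
```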

error decomposition

$$\begin{aligned} E_{x,y}(y-\hat{y}_Z(x))^2 &= E_{x,y}(y-y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y}_Z(x))^2 \\ &= \text{noise} + \text{estimation error} \end{aligned}$$

bias-variance decompositions

For a linear estimator:

$$\begin{aligned} E_Z&E_{x,y}\big(y-(\hat{y}_Z(x)\coloneqq W^T_Z x)\big)^2 \\ =& E_{x,y}(y-y^{*}(x))^2 \quad \text{noise} \\ &+ E_x\big(y^{*}(x) - E_Z(\hat{y}_Z(x))\big)^2 \quad \text{bias} \\ &+ E_x E_Z\big(\hat{y}_Z(x) - E_Z(\hat{y}_Z(x))\big)^2 \quad \text{variance} \end{aligned}$$
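
A small resampling sketch of this decomposition, assuming an illustrative quadratic ground truth $y^{*}(x)$ and a deliberately misspecified least-squares linear fit; the noise level, sample sizes, and number of resampled training sets are arbitrary choices.

```python
import numpy as np

# Sketch of the bias/variance decomposition for a linear estimator
# y_hat_Z(x) = W_Z^T x fit by least squares on resampled training sets Z.
rng = np.random.default_rng(1)

def sample_set(n):
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.3, size=n)  # y*(x) + noise
    return x, y

def fit_linear(x, y):
    # least-squares fit of y ~ w0 + w1 * x (misspecified: ignores the x^2 term)
    A = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

x_test = rng.uniform(-1, 1, size=2000)
y_star = 1.0 + 2.0 * x_test + 1.5 * x_test**2          # y*(x) = E[Y | X = x]

preds = np.array([
    np.column_stack([np.ones_like(x_test), x_test]) @ fit_linear(*sample_set(50))
    for _ in range(500)                                  # E_Z via 500 training sets
])

noise = 0.3**2
bias2 = np.mean((y_star - preds.mean(axis=0)) ** 2)
var = np.mean(preds.var(axis=0))
total = np.mean((y_star + rng.normal(scale=0.3, size=preds.shape) - preds) ** 2)

print(f"noise={noise:.3f}  bias^2={bias2:.3f}  variance={var:.3f}")
print(f"sum={noise + bias2 + var:.3f}  vs  E_Z E_xy error={total:.3f}")
```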

accuracy

zero-one loss:

$$l^{0-1}(y, \hat{y}) = 1_{y \neq \hat{y}} = \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}$$
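
A tiny sketch of the zero-one loss and the corresponding accuracy, on made-up labels:

```python
import numpy as np

# Zero-one loss per example and the resulting empirical risk / accuracy.
y     = np.array([ 1, -1,  1,  1, -1])
y_hat = np.array([ 1,  1,  1, -1, -1])

zero_one = (y != y_hat).astype(int)              # l^{0-1}(y, y_hat) per example
print("per-example loss:", zero_one)             # [0 1 0 1 0]
print("empirical risk  :", zero_one.mean())      # 0.4
print("accuracy        :", 1 - zero_one.mean())  # 0.6
```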

linear classifier

$$\begin{aligned} \hat{y}_W(x) &= \text{sign}(W^T x) = 1_{W^T x \geq 0} \\ \hat{W} &= \argmin_{W} L_{Z}^{0-1} (\hat{y}_W) \end{aligned}$$

surrogate loss functions

Assume the classifier returns a discrete value $\hat{y}_W = \text{sign}(W^T x) \in \{0,1\}$.

What if the classifier's output is continuous?

$\hat{y}$ will then also capture the "confidence" of the classifier.

Think of continuous surrogate loss functions: margin loss, cross-entropy/negative log-likelihood, etc.
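
A short sketch comparing the zero-one loss with two such surrogates as functions of the margin $m = y\,\hat{y}$; the grid of margin values is illustrative.

```python
import numpy as np

# Surrogate losses as functions of the margin m = y * y_hat, where
# y_hat = w^T x is the classifier's continuous, confidence-carrying output.
m = np.linspace(-2, 2, 9)

zero_one = (m <= 0).astype(float)     # non-convex, non-differentiable
hinge    = np.maximum(0.0, 1.0 - m)   # margin / hinge loss
logistic = np.log1p(np.exp(-m))       # cross-entropy / negative log-likelihood

for name, loss in [("0-1", zero_one), ("hinge", hinge), ("logistic", logistic)]:
    print(f"{name:9s}", np.round(loss, 2))
# The hinge loss upper-bounds the 0-1 loss; both surrogates are convex in m,
# which is what makes minimising them tractable.
```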

linearly separable data

linearly separable

A binary classification data set $Z=\{(x^i, y^i)\}_{i=1}^{n}$ is linearly separable if there exists a $W^{*}$ such that:

  • $\forall i \in [n] \mid \text{SGN}(\langle x^i, W^{*} \rangle) = y^i$
  • Or, for every $i \in [n]$ we have $(W^{* T}x^i)y^i > 0$ (a quick numerical check of this condition follows this list)
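
The check below is a minimal sketch on a toy data set with a candidate $W^{*}$; both are illustrative choices, not data from the notes.

```python
import numpy as np

# Check the condition (W*^T x^i) y^i > 0 for all i on toy data.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
W_star = np.array([1.0, 0.2])

margins = (X @ W_star) * y
print("margins:", margins)
print("linearly separated by W*:", bool(np.all(margins > 0)))
```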

linear programming

$$\begin{aligned} \max_{w \in \mathbb{R}^d} &\langle u, w \rangle = \sum_{i=1}^{d} u_i w_i \\ &\text{s.t. } A w \ge v \end{aligned}$$

Given that the data is linearly separable:

$$\begin{aligned} \exists \, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i > 0 \\ \exists \, W^{*}, \gamma > 0 &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i \ge \gamma \\ \exists \, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i \ge 1 \end{aligned}$$

(the last line follows by rescaling $W^{*}$ by $1/\gamma$)

LP for linear classification

  • Define $A = [x_j^i y^i]_{n \times d}$

  • Finding such a $W$ is then equivalent to the following feasibility LP (see the code sketch after this list):

    $$\begin{aligned} \max_{w \in \mathbb{R}^d} &\langle \vec{0}, w \rangle \\ & \text{s.t. } Aw \ge \vec{1} \end{aligned}$$
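
A sketch of this feasibility LP with `scipy.optimize.linprog`, reusing the illustrative toy data from above. `linprog` minimises $c^T w$ subject to $A_{ub} w \le b_{ub}$, so $Aw \ge \vec{1}$ is rewritten as $-Aw \le -\vec{1}$ and the objective is the zero vector.

```python
import numpy as np
from scipy.optimize import linprog

# Feasibility LP for linear classification: find w with Aw >= 1.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

A = X * y[:, None]                    # A = [x_j^i y^i], shape (n, d)
n, d = A.shape

res = linprog(
    c=np.zeros(d),                    # <0, w>: any feasible point is optimal
    A_ub=-A, b_ub=-np.ones(n),        # encodes Aw >= 1
    bounds=[(None, None)] * d,        # w is unconstrained (default is w >= 0)
)
print("feasible:", res.success)
print("W:", res.x, "margins:", A @ res.x)
```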

perceptron

Rosenblatt’s perceptron algorithm

"\\begin{algorithm}\n\\caption{Batch Perceptron}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE Initialize $\\mathbf{w}^{(1)} = (0,\\ldots,0)$\n\\FOR{$t = 1,2,\\ldots$}\n \\IF{$(\\exists \\space i \\text{ s.t. } y_i\\langle\\mathbf{w}^{(t)}, \\mathbf{x}_i\\rangle \\leq 0)$}\n \\STATE $\\mathbf{w}^{(t+1)} = \\mathbf{w}^{(t)} + y_i\\mathbf{x}_i$\n \\ELSE\n \\STATE \\textbf{output} $\\mathbf{w}^{(t)}$\n \\ENDIF\n\\ENDFOR\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 5 Batch Perceptron

Require: Training set (x1,y1),,(xm,ym)(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)

1:Initialize w(1)=(0,,0)\mathbf{w}^{(1)} = (0,\ldots,0)

2:for t=1,2,t = 1,2,\ldots do

3:if ( i s.t. yiw(t),xi0)(\exists \space i \text{ s.t. } y_i\langle\mathbf{w}^{(t)}, \mathbf{x}_i\rangle \leq 0) then

4:w(t+1)=w(t)+yixi\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i\mathbf{x}_i

5:else

6:output w(t)\mathbf{w}^{(t)}

7:end if

8:end for
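
A minimal NumPy sketch of the batch perceptron above; the helper name `perceptron` and the toy data set (the same illustrative one as in the LP example) are my own choices.

```python
import numpy as np

# Batch perceptron: repeatedly pick any example with y_i <w, x_i> <= 0
# and update w <- w + y_i x_i until no violating example remains.
def perceptron(X, y, max_iters=10_000):
    """X: (m, d) inputs, y: (m,) labels in {-1, +1}. Returns (w, #mistakes)."""
    m, d = X.shape
    w = np.zeros(d)                         # w^(1) = (0, ..., 0)
    mistakes = 0
    for _ in range(max_iters):
        margins = y * (X @ w)
        bad = np.flatnonzero(margins <= 0)  # examples violating the margin
        if bad.size == 0:
            return w, mistakes              # output w^(t)
        i = bad[0]
        w = w + y[i] * X[i]                 # w^(t+1) = w^(t) + y_i x_i
        mistakes += 1
    raise RuntimeError("did not converge (data may not be linearly separable)")

# Toy usage.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, k = perceptron(X, y)
print("w =", w, "after", k, "mistakes; margins:", y * (X @ w))
```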

greedy update

Each update strictly increases the margin term $W^T x^i y^i$ on the misclassified example:

$$\begin{aligned} W_{\text{new}}^T x^i y^i &= \langle W_{\text{old}}+ y^i x^i,\, x^i \rangle y^i \\ &= W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2 \,(y^{i})^2 \\ &= W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2 \end{aligned}$$
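
A one-line numerical check of this identity; the random values are illustrative only.

```python
import numpy as np

# The update on example i raises W^T x^i y^i by exactly ||x^i||_2^2,
# since (y^i)^2 = 1.
rng = np.random.default_rng(2)
W_old = rng.normal(size=3)
x_i = rng.normal(size=3)
y_i = -1

W_new = W_old + y_i * x_i
lhs = (W_new @ x_i) * y_i
rhs = (W_old @ x_i) * y_i + np.dot(x_i, x_i)
print(np.isclose(lhs, rhs))   # True
```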

proof

See also (Novikoff, 1962)

Theorem

Assume there exists some parameter vector $\underline{\theta}^{*}$ such that $\|\underline{\theta}^{*}\| = 1$, and $\exists \, \gamma > 0$ such that

$$y_t(\underline{x_t} \cdot \underline{\theta^{*}}) \ge \gamma$$

Assumption: $\forall \, t = 1 \ldots n,\ \|\underline{x_t}\| \le R$

Then the perceptron makes at most $\frac{R^2}{\gamma^2}$ errors.

proof by induction

definition of $\underline{\theta^k}$

the parameter vector at the point where the algorithm makes its $k^{\text{th}}$ error.

Note that we have $\underline{\theta^{1}} = \underline{0}$.

Assume that the $k^{\text{th}}$ error is made on example $t$. Then:

$$\begin{align} \underline{\theta^{k+1}} \cdot \underline{\theta^{*}} &= (\underline{\theta^k} + y_t \underline{x_t}) \cdot \underline{\theta^{*}} \\ &= \underline{\theta^k} \cdot \underline{\theta^{*}} + y_t \, \underline{x_t} \cdot \underline{\theta^{*}} \\ &\ge \underline{\theta^k} \cdot \underline{\theta^{*}} + \gamma \\[12pt] &\because \text{ assumption: } y_t \, \underline{x_t} \cdot \underline{\theta^{*}} \ge \gamma \end{align}$$

It follows by induction on $k$ that

$$\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} \ge k \gamma$$

Using Cauchy-Schwarz we have $\|\underline{\theta^{k+1}}\| \times \|\underline{\theta^{*}}\| \ge \underline{\theta^{k+1}} \cdot \underline{\theta^{*}}$

$$\begin{align} \|\underline{\theta^{k+1}}\| &\ge k \gamma \\[16pt] &\because \|\underline{\theta^{*}}\| = 1 \end{align}$$

In the second part, we will find an upper bound for (5):

$$\begin{align} \|\underline{\theta^{k+1}}\|^2 &= \|\underline{\theta^k} + y_t \underline{x_t}\|^2 \\ &= \|\underline{\theta^k}\|^2 + y_t^2 \|\underline{x_t}\|^2 + 2 y_t \, \underline{x_t} \cdot \underline{\theta^k} \\ &\le \|\underline{\theta^k}\|^2 + R^2 \end{align}$$

(9) is due to:

  • $y_t^2 \|\underline{x_t}\|^2 = \|\underline{x_t}\|^2 \le R^2$ by the assumption of the theorem (and $y_t^2 = 1$)
  • $y_t \, \underline{x_t} \cdot \underline{\theta^k} \le 0$, since the parameter vector $\underline{\theta^k}$ made an error on the $t^{\text{th}}$ example.

It follows by induction on $k$ that

$$\begin{align} \|\underline{\theta^{k+1}}\|^2 \le kR^2 \end{align}$$

Combining (5) and (10) gives us

$$\begin{aligned} k^2 \gamma^2 &\le \|\underline{\theta^{k+1}}\|^2 \le kR^2 \\ k &\le \frac{R^2}{\gamma^2} \end{aligned}$$
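
An empirical sketch of the bound: run the `perceptron` helper from the earlier sketch on randomly generated separable data (with an artificially enforced margin) and compare the mistake count with $R^2/\gamma^2$. The data-generating choices are illustrative.

```python
import numpy as np

# Compare the perceptron's actual mistake count with the Novikoff bound
# R^2 / gamma^2 on random separable data. Uses the `perceptron` function
# sketched earlier in these notes.
rng = np.random.default_rng(3)
d, m = 5, 200
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)        # ||theta*|| = 1

X = rng.normal(size=(m, d))
y = np.sign(X @ theta_star)
keep = np.abs(X @ theta_star) >= 0.1            # enforce a margin of at least 0.1
X, y = X[keep], y[keep]

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ theta_star))

w, k = perceptron(X, y)                          # from the earlier sketch
print(f"mistakes = {k},  bound R^2/gamma^2 = {R**2 / gamma**2:.1f}")
```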

Bibliography

Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, Vol. 12, pp. 615-622. Polytechnic Institute of Brooklyn.