
See also some statistical theory

Bayes rule and chain rule

Joint distribution: $P(X,Y)$

Conditional distribution of $X$ given $Y$: $P(X\mid Y) = \frac{P(X,Y)}{P(Y)}$

Bayes rule: $P(X\mid Y) = \frac{P(Y\mid X)P(X)}{P(Y)}$

Chain rule:

for two events:

$$P(A, B) = P(B \mid A)P(A)$$

generalised:

$$\begin{aligned} P(X_1, X_2, \ldots , X_k) &= P(X_1) \prod_{j=2}^{k} P(X_j \mid X_1,\dots,X_{j-1}) \\[12pt] &\because \text{expansion: } P(X_1)P(X_2\mid X_1)\ldots P(X_k\mid X_1,X_2,\ldots,X_{k-1}) \end{aligned}$$
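As a quick sanity check (a sketch I added, not from the slides), we can build a random joint table over three binary variables and confirm that the chain-rule product recovers the joint exactly:

```python
import numpy as np

# verify P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2) on a random joint table
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                 # joint distribution over (X1, X2, X3)

P1 = P.sum(axis=(1, 2))      # P(x1)
P12 = P.sum(axis=2)          # P(x1, x2)

for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            chain = P1[x1] * (P12[x1, x2] / P1[x1]) * (P[x1, x2, x3] / P12[x1, x2])
            assert np.isclose(chain, P[x1, x2, x3])
```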

i.i.d assumption

assume an underlying distribution $D$, and that the train and test sets are independent and identically distributed (i.i.d.)

Example: flip a coin

Outcome $H=0$ or $T=1$ with $P(H) = p$ and $P(T) = 1-p$; equivalently $x \in \{0,1\}$, where $x$ is a Bernoulli random variable.

$P(x=0)=\alpha$ and $P(x=1)=1-\alpha$


Note that for any random variables $A,B,C$ we have:

$$P(A,B \mid C) = P(A\mid B,C)\, P(B \mid C)$$

nearest neighbour

See also: slides 13, slides 14, slides 15

expected error minimisation

think of it as bias-variance tradeoff

Squared loss: $l(\hat{y},y)=(y-\hat{y})^2$

solution to $y^* = \argmin_{\hat{y}} E_{X,Y}(Y-\hat{y}(X))^2$ is $E[Y \mid X=x]$

Instead we have $Z = \{(x^i, y^i)\}^n_{i=1}$

error decomposition

$$\begin{aligned} E_{x,y}(y-\hat{y}_Z(x))^2 &= E_{x,y}(y-y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y}_Z(x))^2 \\ &= \text{noise} + \text{estimation error} \end{aligned}$$

bias-variance decompositions

For linear estimator:

$$\begin{aligned} E_Z E_{x,y}(y-(\hat{y}_Z(x)\coloneqq W^T_Z x))^2 =\ & E_{x,y}(y-y^{*}(x))^2 && \text{noise} \\ &+ E_x(y^{*}(x) - E_Z(\hat{y}_Z(x)))^2 && \text{bias} \\ &+ E_x E_Z(\hat{y}_Z(x) - E_Z(\hat{y}_Z(x)))^2 && \text{variance} \end{aligned}$$
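A rough Monte Carlo illustration of the decomposition (my own sketch; the ground-truth function, noise level, and sample size are assumptions): repeatedly draw training sets $Z$, fit a linear estimator, and estimate the bias and variance terms empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * x)       # assumed ground truth y*(x)
sigma = 0.3                            # assumed noise standard deviation
x_test = np.linspace(0, 1, 200)

def fit_linear(n=20):
    """Draw one training set Z and fit y = w1*x + w0 by least squares."""
    x = rng.uniform(0, 1, n)
    y = f_star(x) + sigma * rng.normal(size=n)
    w1, w0 = np.polyfit(x, y, deg=1)
    return w1 * x_test + w0

preds = np.stack([fit_linear() for _ in range(500)])   # predictions over 500 draws of Z
mean_pred = preds.mean(axis=0)                         # E_Z[ŷ_Z(x)]

noise = sigma ** 2                                     # E_{x,y}(y - y*(x))^2
bias2 = np.mean((f_star(x_test) - mean_pred) ** 2)     # E_x(y*(x) - E_Z ŷ_Z(x))^2
variance = np.mean((preds - mean_pred) ** 2)           # E_x E_Z(ŷ_Z(x) - E_Z ŷ_Z(x))^2
print(noise, bias2, variance)
```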

accuracy

zero-one loss:

$$l^{0-1}(y, \hat{y}) = 1_{y \neq \hat{y}} = \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}$$

linear classifier

$$\begin{aligned} \hat{y}_W(x) &= \text{sign}(W^T x) = 1_{W^T x \geq 0} \\[8pt] &\because \hat{W} = \argmin_{W} L_{Z}^{0-1} (\hat{y}_W) \end{aligned}$$

surrogate loss functions

assume the classifier returns a discrete value $\hat{y}_W = \text{sign}(W^T x) \in \{0,1\}$

What if classifier's output is continuous?

$\hat{y}$ will also capture the “confidence” of the classifier.

Think of continuous loss functions: margin loss, cross-entropy/negative log-likelihood, etc.

linearly separable data

linearly separable

A binary classification data set $Z=\{(x^i, y^i)\}_{i=1}^{n}$ is linearly separable if there exists a $W^{*}$ such that:

  • $\forall i \in [n],\ \text{SGN}(\langle x^i, W^{*}\rangle) = y^i$
  • Or, for every $i \in [n]$ we have $(W^{*T}x^i)y^i > 0$

linear programming

$$\begin{aligned} \max_{w \in \mathbb{R}^d} \quad &\langle u, w \rangle = \sum_{i=1}^{d} u_i w_i \\ &\text{s.t. } A w \ge v \end{aligned}$$

Given that data is linearly separable

$$\begin{aligned} \exists\, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i > 0 \\ \exists\, W^{*}, \gamma > 0 &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i \ge \gamma \\ \exists\, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i \ge 1 \end{aligned}$$

LP for linear classification

  • Define $A = [x_j^i y^i]_{n \times d}$

  • finding an optimal $W$ is then equivalent to the following feasibility problem (a quick solver sketch follows):

$$\begin{aligned} \max_{w \in \mathbb{R}^d} \quad &\langle \vec{0}, w \rangle \\ &\text{s.t. } Aw \ge \vec{1} \end{aligned}$$
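As a sketch (my addition), this feasibility problem can be handed straight to an off-the-shelf LP solver. Here `scipy.optimize.linprog` is used with a zero objective, rewriting $Aw \ge \vec{1}$ as $-Aw \le -\vec{1}$; the toy data is made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# toy linearly separable data; rows of A are x^i scaled by y^i
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
A = X * y[:, None]                       # A = [x_j^i y^i]

d = A.shape[1]
res = linprog(
    c=np.zeros(d),                       # maximise <0, w>: pure feasibility
    A_ub=-A, b_ub=-np.ones(len(A)),      # Aw >= 1  <=>  -Aw <= -1
    bounds=[(None, None)] * d,           # w unconstrained in sign
)
print(res.status, res.x)                 # status 0 means a separating w was found
```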

perceptron

Rosenblatt’s perceptron algorithm

"\\begin{algorithm}\n\\caption{Batch Perceptron}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE Initialize $\\mathbf{w}^{(1)} = (0,\\ldots,0)$\n\\FOR{$t = 1,2,\\ldots$}\n \\IF{$(\\exists \\space i \\text{ s.t. } y_i\\langle\\mathbf{w}^{(t)}, \\mathbf{x}_i\\rangle \\leq 0)$}\n \\STATE $\\mathbf{w}^{(t+1)} = \\mathbf{w}^{(t)} + y_i\\mathbf{x}_i$\n \\ELSE\n \\STATE \\textbf{output} $\\mathbf{w}^{(t)}$\n \\ENDIF\n\\ENDFOR\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 5 Batch Perceptron

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: Initialize $\mathbf{w}^{(1)} = (0,\ldots,0)$
2: for $t = 1,2,\ldots$ do
3:   if $\exists\, i \text{ s.t. } y_i\langle\mathbf{w}^{(t)}, \mathbf{x}_i\rangle \leq 0$ then
4:     $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i\mathbf{x}_i$
5:   else
6:     output $\mathbf{w}^{(t)}$
7:   end if
8: end for
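A minimal numpy version of the batch perceptron above (my own sketch; the data and names are illustrative):

```python
import numpy as np

def batch_perceptron(X, y, max_iter=1_000):
    """X: (m, d) inputs, y: (m,) labels in {-1, +1}. Returns a separating w or None."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:
            return w                     # every example now has positive margin
        i = mistakes[0]                  # any violated example works
        w = w + y[i] * X[i]              # perceptron update
    return None                          # gave up (data may not be separable)

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(batch_perceptron(X, y))
```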

greedy update

$$\begin{aligned} W_{\text{new}}^T x^i y^i &= \langle W_{\text{old}}+ y^i x^i,\, x^i \rangle y^i \\ &= W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2 y^{i} y^{i} \end{aligned}$$

so each update increases the margin term $W^T x^i y^i$ of the violated example by $\|x^i\|_2^2 > 0$.

proof

See also (Novikoff, 1962)

Theorem

Assume there exists some parameter vector $\underline{\theta}^{*}$ such that $\|\underline{\theta}^{*}\| = 1$, and $\exists\, \upgamma > 0$ such that

$$y_t(\underline{x_t} \cdot \underline{\theta^{*}}) \ge \upgamma$$

Assumption: $\forall\, t = 1 \ldots n,\ \|\underline{x_t}\| \le R$

Then the perceptron makes at most $\frac{R^2}{\upgamma^2}$ errors

proof by induction

definition of $\underline{\theta^k}$

to be the parameter vector when the algorithm makes its $k^{\text{th}}$ error.

Note that we have $\underline{\theta^{1}}=\underline{0}$

Assume that the $k^{\text{th}}$ error is made on example $t$; then

$$\begin{align} \underline{\theta^{k+1}} \cdot \underline{\theta^{*}} &= (\underline{\theta^k} + y_t \underline{x_t}) \cdot \underline{\theta^{*}} \\ &= \underline{\theta^k} \cdot \underline{\theta^{*}} + y_t \underline{x_t} \cdot \underline{\theta^{*}} \\ &\ge \underline{\theta^k} \cdot \underline{\theta^{*}} + \upgamma \\[12pt] &\because \text{ Assumption: } y_t \underline{x_t} \cdot \underline{\theta^{*}} \ge \upgamma \end{align}$$

It follows by induction on $k$ that

$$\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} \ge k \upgamma$$

Using Cauchy-Schwarz we have $\|\underline{\theta^{k+1}}\| \times \|\underline{\theta^{*}}\| \ge \underline{\theta^{k+1}} \cdot \underline{\theta^{*}}$

$$\begin{align} \|\underline{\theta^{k+1}}\| &\ge k \upgamma \\[16pt] &\because \|\underline{\theta^{*}}\| = 1 \end{align}$$

In the second part, we will find an upper bound for the quantity in (5):

$$\begin{align} \|\underline{\theta^{k+1}}\|^2 &= \|\underline{\theta^k} + y_t \underline{x_t}\|^2 \\ &= \|\underline{\theta^k}\|^2 + y_t^2 \|\underline{x_t}\|^2 + 2 y_t \underline{x_t} \cdot \underline{\theta^k} \\ &\le \|\underline{\theta^k}\|^2 + R^2 \end{align}$$

(9) is due to:

  • $y_t^2 \|\underline{x_t}\|^2 = \|\underline{x_t}\|^2 \le R^2$ by the assumption of the theorem (since $y_t^2 = 1$)
  • $y_t \underline{x_t} \cdot \underline{\theta^k} \le 0$, given that the parameter vector $\underline{\theta^k}$ made an error on the $t^{\text{th}}$ example.

It follows by induction on $k$ that

$$\begin{align} \|\underline{\theta^{k+1}}\|^2 \le kR^2 \end{align}$$

Combining (5) and (10) gives us

$$\begin{aligned} k^2 \upgamma^2 &\le \|\underline{\theta^{k+1}}\|^2 \le kR^2 \\ k &\le \frac{R^2}{\upgamma^2} \end{aligned}$$

Support Vector Machine

idea: maximise the margin, making the classifier more robust to “perturbations”

The Euclidean distance between a point $x$ and the hyperplane parametrised by $W$ is:

$$\frac{\mid W^T x + b \mid }{\|W\|_2}$$

Assuming $\|W\|_2=1$, the distance is $\mid W^T x + b \mid$

regularization

SVMs are good for high-dimensional data

We can probably use a solver, or gradient descent

maximum margin hyperplane

$W$ has margin $\gamma$ if

$$\begin{aligned} W^T x + b \ge \gamma \quad &\forall \text{ blue } x \\ W^T x + b \le -\gamma \quad &\forall \text{ red } x \end{aligned}$$

Margin:

$$Z = \{(x^{i}, y^{i})\}_{i=1}^{n},\quad y \in \{-1, 1\},\quad \|W\|_2 = 1$$

hard-margin SVM

this is the version with bias

"\\begin{algorithm}\n\\caption{Hard-SVM}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{solve:} $(w_{0},b_{0}) = \\argmin\\limits_{(w,b)} \\|w\\|^2 \\text{ s.t } \\forall i, y_{i}(\\langle{w,x_i} \\rangle + b) \\ge 1$\n\\STATE \\textbf{output:} $\\hat{w} = \\frac{w_0}{\\|w_0\\|}, \\hat{b} = \\frac{b_0}{\\|w_0\\|}$\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 2 Hard-SVM

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: solve: $(w_{0},b_{0}) = \argmin\limits_{(w,b)} \|w\|^2 \text{ s.t. } \forall i,\ y_{i}(\langle w,x_i \rangle + b) \ge 1$
2: output: $\hat{w} = \frac{w_0}{\|w_0\|},\ \hat{b} = \frac{b_0}{\|w_0\|}$

Note that this version is sensitive to outliers

it assumes that the training set is linearly separable

soft-margin SVM

can be applied even if the training set is not linearly separable

"\\begin{algorithm}\n\\caption{Soft-SVM}\n\\begin{algorithmic}\n\\REQUIRE Input $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{parameter:} $\\lambda > 0$\n\\STATE \\textbf{solve:} $\\min_{\\mathbf{w}, b, \\boldsymbol{\\xi}} \\left( \\lambda \\|\\mathbf{w}\\|^2 + \\frac{1}{m} \\sum_{i=1}^m \\xi_i \\right)$\n\\STATE \\textbf{s.t: } $\\forall i, \\quad y_i (\\langle \\mathbf{w}, \\mathbf{x}_i \\rangle + b) \\geq 1 - \\xi_i \\quad \\text{and} \\quad \\xi_i \\geq 0$\n\\STATE \\textbf{output:} $\\mathbf{w}, b$\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 3 Soft-SVM

Require: Input $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: parameter: $\lambda > 0$
2: solve: $\min_{\mathbf{w}, b, \boldsymbol{\xi}} \left( \lambda \|\mathbf{w}\|^2 + \frac{1}{m} \sum_{i=1}^m \xi_i \right)$
3: s.t.: $\forall i,\quad y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0$
4: output: $\mathbf{w}, b$

Equivalent form of soft-margin SVM:

$$\begin{aligned} \min_{w}\ &(\lambda \|w\|^2 + L_S^{\text{hinge}}(w)) \\[8pt] L_{S}^{\text{hinge}}(w) &= \frac{1}{m} \sum_{i=1}^{m} \max{(\{0,\, 1 - y_i \langle w, x_i \rangle\})} \end{aligned}$$
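The hinge-loss form lends itself to a simple (sub)gradient method; below is a rough sketch I added (no bias term, fixed step size, toy data):

```python
import numpy as np

def soft_svm_subgradient(X, y, lam=0.1, lr=0.05, epochs=200):
    """Minimise lam*||w||^2 + (1/m) sum_i max(0, 1 - y_i <w, x_i>)."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                              # examples with non-zero hinge loss
        g_hinge = -(y[active, None] * X[active]).sum(axis=0) / m
        w -= lr * (2 * lam * w + g_hinge)                 # subgradient step
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(np.sign(X @ soft_svm_subgradient(X, y)))            # should match y
```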

SVM with basis functions

$$\min_{W} \frac{1}{n} \sum \max \{0, 1 - y^i \langle w, \phi(x^i) \rangle\} + \lambda \|w\|^2_2$$

$\phi(x^i)$ can be high-dimensional

representer theorem

$$W^{*} = \argmin_{W} \frac{1}{n} \sum \max \{0, 1- y^i \langle w, \phi (x^i) \rangle\} + \lambda \|w\|^2_2$$

theorem

There are real values $a_{1},\ldots,a_{m}$ such that 1

$$W^{*} = \sum a_i \phi(x^i)$$

kernelized SVM

from the representer theorem, we have the kernel:

$$K(x,z) = \langle \phi(x), \phi(z) \rangle$$
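A small illustration (my addition): the kernel lets us work with $\langle \phi(x), \phi(z) \rangle$ without ever materialising $\phi$. For instance, the RBF kernel only needs pairwise squared distances:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)                     # the n x n Gram matrix K
print(K.shape, np.allclose(K, K.T))      # (5, 5) True: symmetric, positive semi-definite
```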

drawbacks

  • prediction-time complexity
  • need to store all training data
  • dealing with $\mathbf{K}_{n \times n}$
  • choice of kernel, which is tricky and sometimes quite heuristic.

minimize squared error

Fit a homogeneous line $y = ax$ to the non-linear curve $f(x) = x^2 + 1$, where $a,y,x \in \mathbb{R}$

Assuming $x$ is uniformly distributed on $[0,1]$, what is the value of $a$ that minimizes the squared error?

$$\argmin_{a} E[(ax - x^2 - 1)^2]$$

or we need to find

$$\argmin_{a} \int_{-\infty}^{\infty} P_X(x) (ax - x^2 -1)^2 \, dx$$
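A numerical check I added (not part of the original exercise text): approximate the expectation with a dense uniform grid and minimise over a grid of $a$ values; the answer agrees with the closed-form minimiser $a^{*} = \frac{E[x(x^2+1)]}{E[x^2]} = \frac{9}{4}$.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 200_001)               # dense grid standing in for Uniform[0, 1]

def expected_sq_error(a: float) -> float:
    return np.mean((a * xs - xs**2 - 1.0) ** 2)   # ≈ E[(a x - x^2 - 1)^2]

a_grid = np.linspace(0.0, 5.0, 5001)
a_star = a_grid[np.argmin([expected_sq_error(a) for a in a_grid])]
print(a_star)                                     # ≈ 2.25 = 9/4
```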

multi-variate chain rule

$$\nabla_x (f \circ g)(x) = [\nabla g]_{d \times m} \cdot [\nabla f]_{m \times n}$$

Or we can find the Jacobian $\mathcal{J}_f$

if $f = Ax$ then $\nabla f = A$

classification

one-versus-all classification

idea: train $k$ different binary classifiers:

$$h_i(x) = \text{sgn}(\langle w_i, x \rangle)$$

end-to-end version, or multi-class SVM with generalized Hinge loss:

"\\begin{algorithm}\n\\caption{Multiclass SVM}\n\\begin{algorithmic}\n\\REQUIRE Input $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\REQUIRE\n \\STATE Regularization parameter $\\lambda > 0$\n \\STATE Loss function $\\Delta: \\mathcal{Y} \\times \\mathcal{Y} \\to \\mathbb{R}_+$\n \\STATE Class-sensitive feature mapping $\\Psi: \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}^d$\n\\ENSURE\n\\STATE \\textbf{solve}: $\\min_{\\mathbf{w} \\in \\mathbb{R}^d} \\left(\\lambda\\|\\mathbf{w}\\|^2 + \\frac{1}{m}\\sum_{i=1}^m \\max_{y' \\in \\mathcal{Y}} \\left(\\Delta(y', y_i) + \\langle\\mathbf{w}, \\Psi(\\mathbf{x}_i, y') - \\Psi(\\mathbf{x}_i, y_i)\\rangle\\right)\\right)$\n\\STATE \\textbf{output}: the predictor $h_{\\mathbf{w}}(\\mathbf{x}) = \\argmax_{y \\in \\mathcal{Y}} \\langle\\mathbf{w}, \\Psi(\\mathbf{x}, y)\\rangle$\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 4 Multiclass SVM

Require: Input $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

Require:
1: Regularization parameter $\lambda > 0$
2: Loss function $\Delta: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$
3: Class-sensitive feature mapping $\Psi: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$
4: solve: $\min_{\mathbf{w} \in \mathbb{R}^d} \left(\lambda\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{i=1}^m \max_{y' \in \mathcal{Y}} \left(\Delta(y', y_i) + \langle\mathbf{w}, \Psi(\mathbf{x}_i, y') - \Psi(\mathbf{x}_i, y_i)\rangle\right)\right)$
5: output: the predictor $h_{\mathbf{w}}(\mathbf{x}) = \argmax_{y \in \mathcal{Y}} \langle\mathbf{w}, \Psi(\mathbf{x}, y)\rangle$

all-pairs classification

For each distinct pair $i,j \in \{1,2,\ldots,k\}$, we train a classifier to distinguish samples from class $i$ and samples from class $j$:

$$h_{i,j}(x) = \text{sgn}(\langle w_{i,j}, x \rangle)$$

linear multi-class predictor

think of multi-vector encoding for $y \in \{1,2,\ldots,k\}$, where $(x,y)$ is encoded as $\Psi(x,y) = [0\ \ldots\ 0\ x\ 0\ \ldots\ 0]^T$

thus the linear multi-class predictor becomes:

$$h(x) = \argmax_{y} \langle w, \Psi(x,y) \rangle$$

error type

type 1: false positive; type 2: false negative

accuracy: $\frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}$

precision is $\frac{\text{TP}}{\text{TP}+\text{FP}}$

recall is $\frac{\text{TP}}{\text{TP + FN}}$
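A tiny sketch (my addition) computing these metrics from raw predictions:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))   # type 1 errors
fn = np.sum((y_pred == 0) & (y_true == 1))   # type 2 errors

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)           # 0.75 0.75 0.75 for this toy example
```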

probabilistic modeling

example: assume each class is a Gaussian

discriminant analysis

$$P(x \mid y = 1, \mu_0, \mu_1, \beta) = \frac{1}{a_0} e^{-\|x-\mu_1\|^2_2}$$

maximum likelihood estimate

see also prior and posterior distribution

given $\Theta = \{\mu_1, \mu_2, \beta\}$:

$$\argmax_{\Theta} P(Z \mid \Theta) = \argmax_{\Theta} \prod_{i=1}^{n} P(x^i, y^i \mid \Theta)$$

How can we predict the label of a new test point?

Or in other words, how can we run inference?

Check whether $\frac{P(y=0 \mid X, \Theta)}{P(y=1 \mid X, \Theta)} \ge 1$

Generalization for correlated features

Gaussian for correlated features:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2 \pi)^{d/2}|\Sigma|^{1/2}} \exp \left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$$

Naive Bayes Classifier

assumption

Given the label, the coordinates are statistically independent

$$P(x \mid y = k, \Theta) = \prod_j P(x_j \mid y=k, \Theta)$$

idea: comparison between discriminative and generative models

Logistic regression

😄 fun fact: despite the name, it is actually used for classification rather than regression problems

Assume there is a plane in $\mathbb{R}^d$ parameterized by $W$

$$\begin{aligned} P(Y = 1 \mid x, W) &= \phi (W^T x) \\ P(Y= 0 \mid x, W) &= 1 - \phi (W^T x) \\[12pt] &\because \phi (a) = \frac{1}{1+e^{-a}} \end{aligned}$$

maximum likelihood

$1 - \phi (a) = \phi (-a)$

$$\begin{aligned} W^{\text{ML}} &= \argmax_{W} \prod P(x^i, y^i \mid W) \\ &= \argmax_{W} \prod \frac{P(x^i, y^i, W)}{P(W)} \\ &= \argmax_{W} \prod P(y^i \mid x^i, W) P(x^i) \\ &= \argmax_{W} \lbrack \prod P(x^i) \rbrack \lbrack \prod P(y^i \mid x^i, W) \rbrack \\ &= \argmax_{W} \sum_{i=1}^{n} \log (\phi (y^i W^T x^i)) \end{aligned}$$

equivalent form

maximize the following:

$$\sum_{i=1}^{n} (y^i \log p^i + (1-y^i) \log (1-p^i))$$

softmax

$$\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}$$

where $y \in \mathbb{R}^k$
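In practice one shifts the logits by their maximum for numerical stability (the result is unchanged); a small sketch I added:

```python
import numpy as np

def softmax(y: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = y - y.max(axis=-1, keepdims=True)    # shifting leaves the output unchanged
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))            # sums to 1
print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # same output, no overflow
```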


cross entropy



feed-forward neural network

stateless, and usually has no feedback loops.

universal approximation theorem

see also pdf (Cybenko, 1989)

idea: an FFN with a single hidden layer of sigmoid activation functions can closely approximate any given continuous function.

regression

Think of just linear layers with some activation functions

import torch
import torch.optim as optim
import torch.nn as nn
 
class LinearRegression(nn.Module):
  def __init__(self, input_dim, output_dim):
    super().__init__()
    self.fc = nn.Linear(input_dim, output_dim)
  def forward(self, x): return self.fc(x)
 
model = LinearRegression(224, 10)
loss = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.005)
 
# dummy data so the loop runs end to end (illustrative shapes)
X, Y = torch.randn(100, 224), torch.randn(100, 10)
 
for ep in range(10):
  y_pred = model(X)
  l = loss(y_pred, Y)          # mean squared error between predictions and targets
  optimizer.zero_grad()        # clear gradients accumulated from the previous step
  l.backward()                 # backpropagate
  optimizer.step()             # update the parameters

classification

Think of one-hot encoding (binary or multiclass) cases

backpropagation

context: using SGD we can compute the gradient:

$$\nabla_W(L(w,b)) = \sum_{i} \nabla_W (l(f_{w,b}(x^i), y^i))$$

This is expensive: for a deep model, naively recomputing each partial derivative repeats a lot of work!

intuition: we want to minimize the error while reusing the intermediate values saved during one forward pass when computing the gradients.

vanishing gradient

happens in deeper networks with respect to the partial derivatives of the early layers

because we apply the chain rule, propagating error signals backward from the output layer through all the hidden layers to the input; in very deep networks, this involves successive multiplication of gradients from each layer.

thus for saturated neurons $\sigma^{\prime}(x) \approx 0$, and the gradient does not reach the first layers.
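A quick numerical illustration of the effect (my addition): multiplying sigmoid derivatives layer after layer shrinks the signal exponentially, since $\sigma^{\prime}(x) \le \frac{1}{4}$.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # never larger than 0.25

pre_activations = np.full(20, 2.0)     # assume 20 layers, each with pre-activation 2.0
grad_factor = np.prod(dsigmoid(pre_activations))
print(grad_factor)                     # ~3e-20: almost nothing reaches the first layer
```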

solution:

  • we can use non-saturating activation functions (e.g. Leaky ReLU)
  • better initialisation
  • residual network

regularization

usually prone to overfitting given they are often over-parameterized

  1. We can usually add regularization terms to the objective functions
  2. Early stopping
  3. Adding noise
  4. structural regularization, via adding dropout

dropout

a case of structural regularization

a technique that randomly drops each node with probability $p$
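A minimal sketch of (inverted) dropout at training time (my addition; at test time the layer is just the identity):

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    """Randomly drop each unit of h with probability p (inverted dropout)."""
    if not training or p == 0.0:
        return h
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)        # rescale so the expected activation is unchanged

print(dropout(np.ones((2, 8)), p=0.5))
```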


Bibliography

  • Novikoff, A. B. J. (1962). On Convergence Proofs for Perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata, 12, 615–622. https://apps.dtic.mil/sti/tr/pdf/AD0298258.pdf
  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314.


Convolutional Neural Network

See also: this one assignment on CNN

how can we exploit sparsity and locality?

think of sparse connectivity rather than full connectivity

we exploit translation invariance: a feature that is useful in one part of the image is likely useful in other parts as well

convolution

accepts a volume of size $W_1 \times H_1 \times D_1$ with four hyperparameters:

  • number of filters $K$
  • spatial extent $F$
  • stride $S$
  • amount of zero padding $P$

calculation

produces a volume of size $W_2 \times H_2 \times D_2$ where (checked in the sketch after this list):

  • $W_2 = \frac{W_1 - F + 2P}{S} + 1$
  • $H_2 = \frac{H_1 - F + 2P}{S} + 1$
  • $D_2 = K$
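A quick check of the formula against PyTorch's `Conv2d` (my addition; the numbers are arbitrary):

```python
import torch
import torch.nn as nn

W1 = H1 = 32; D1 = 3            # input volume 32 x 32 x 3
K, F, S, P = 16, 5, 1, 2        # filters, spatial extent, stride, zero padding

conv = nn.Conv2d(in_channels=D1, out_channels=K, kernel_size=F, stride=S, padding=P)
out = conv(torch.zeros(1, D1, H1, W1))
print(out.shape)                # torch.Size([1, 16, 32, 32])

W2 = (W1 - F + 2 * P) // S + 1
print(W2)                       # 32, matching the formula above
```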

1D convolution:

$$\begin{aligned} y &= (x*w) \\ y(i) &= \sum_{t}x(t)\,w(i-t) \end{aligned}$$

2D convolution:

$$\begin{aligned} y &= (x*w) \\ y(i,j) &= \sum_{t_1} \sum_{t_2} x(t_1, t_2)\, w(i-t_1,j-t_2) \end{aligned}$$

max pooling

idea: reduce the spatial size of the representation (and with it the computation and parameters of later layers)

batchnorm

$x^{j} = [x_1^j,\ldots,x_d^j]$

Batch $X = [(x^1)^T \ldots (x^b)^T]^T$
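A minimal sketch of the batchnorm forward pass over such a batch $X$ (my addition; $\gamma, \beta$ are the learned scale and shift, $\epsilon$ avoids division by zero):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """X: (b, d) batch; normalise each coordinate over the batch, then scale and shift."""
    mu = X.mean(axis=0)                      # per-coordinate batch mean
    var = X.var(axis=0)                      # per-coordinate batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 4))
out = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per coordinate
```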


autoencoders

Think of using autoencoders to extract representations.

sparsity allows us to interpret hidden layers and the internal representations of Transformer models.

graph TD
    A[Input X] --> B[Layer 1]
    B --> C[Layer 2]
    C --> D[Latent Features Z]
    D --> E[Layer 3]
    E --> F[Layer 4]
    F --> G[Output X']

    subgraph Encoder
        A --> B --> C
    end

    subgraph Decoder
        E --> F
    end

    style D fill:#c9a2d8,stroke:#000,stroke-width:2px,color:#fff
    style A fill:#98FB98,stroke:#000,stroke-width:2px
    style G fill:#F4A460,stroke:#000,stroke-width:2px

see also latent space

definition

$$\begin{aligned} \text{Enc}_{\Theta_1}&: \mathbb{R}^d \to \mathbb{R}^q \\ \text{Dec}_{\Theta_2}&: \mathbb{R}^q \to \mathbb{R}^d \\[12pt] &\because q \ll d \end{aligned}$$

loss function: $l(x) = \|\text{Dec}_{\Theta_2}(\text{Enc}_{\Theta_1}(x)) - x\|$
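A minimal PyTorch sketch of this encoder/decoder pair and its reconstruction loss (my addition; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d=784, q=32):                  # q << d
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, q))
        self.dec = nn.Sequential(nn.Linear(q, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, x):
        return self.dec(self.enc(x))                  # Dec(Enc(x))

model = AutoEncoder()
x = torch.randn(16, 784)
loss = nn.functional.mse_loss(model(x), x)            # ||Dec(Enc(x)) - x||^2
loss.backward()
```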

contrastive representation learning

The goal of contrastive representation learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised settings. article

intuition: form positive and negative pairs for optimizing the loss function.


training objective

we want smaller reconstruction error, or

$$\|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|_2^2$$

we want the latent space distribution to look similar to an isotropic Gaussian!

Kullback-Leibler divergence

denoted as $D_{\text{KL}}(P \parallel Q)$

definition

The statistical distance measuring how a model probability distribution $Q$ differs from a true probability distribution $P$:

$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right)$$

alternative form:

$$\begin{aligned} \text{KL}(p \parallel q) &= E_{x \sim p}\left(\log \frac{p(x)}{q(x)}\right) \\ &= \int_x p(x) \log \frac{p(x)}{q(x)}\, dx \end{aligned}$$
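A small numerical example I added, for two discrete distributions:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])        # "true" distribution
Q = np.array([0.4, 0.4, 0.2])        # model distribution

kl_pq = np.sum(P * np.log(P / Q))
kl_qp = np.sum(Q * np.log(Q / P))
print(kl_pq, kl_qp)                  # both >= 0, and not symmetric
```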

variational autoencoders

idea: add a Gaussian sampler after computing the latent space.

objective function:

$$\min \left(\sum_{x} \|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|^2_2 + \lambda \sum_{i=1}^{q}(-\log (\sigma_i^2) + \sigma_i^2 + \mu_i^2)\right)$$
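A rough sketch of the sampler and this objective in PyTorch (my addition): the encoder is assumed to output $\mu$ and $\log \sigma^2$, and the reparameterisation trick keeps sampling differentiable.

```python
import torch

def sampler(mu, log_var):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + (0.5 * log_var).exp() * eps

def vae_loss(x, x_hat, mu, log_var, lam=1.0):
    """Reconstruction error plus the Gaussian regulariser from the objective above."""
    recon = ((x_hat - x) ** 2).sum()
    reg = (-log_var + log_var.exp() + mu ** 2).sum()   # sum_i (-log sigma_i^2 + sigma_i^2 + mu_i^2)
    return recon + lam * reg

# dummy encoder outputs for a batch of 8 points, q = 4 latent dimensions
mu, log_var = torch.zeros(8, 4), torch.zeros(8, 4)
z = sampler(mu, log_var)
x, x_hat = torch.randn(8, 16), torch.randn(8, 16)
print(z.shape, vae_loss(x, x_hat, mu, log_var))
```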

ensemble learning

idea: train multiple classifiers and then combine them to improve performance.

aggregate their decisions via a voting procedure.

Think of boosting, decision trees.

bagging

using non-overlapping training subsets creates truly independent/diverse classifiers

bagging is essentially bootstrap aggregating where we do random sampling with replacement.

random forests

bagging but with random subspace methods 2

decision tree

  • handle categorical features

NOTE

can overfit easily with deeper trees.

boosting

a greedier approach for reducing bias where we “pick base classifiers incrementally”.

we train “weak learners”, which can then be combined into a “stronger learner”.

Remarks

  1. note that we can also write $a^T \phi$ where $\phi = [\phi(x^1),\ldots,\phi(x^n)]^T$

  2. The idea of training each classifier using a random subset of the feature sets. Also known as feature bagging


Vapnik-Chervonenkis dimension

fancy name for a measure of capacity: the cardinality of the largest set of points that the algorithm can shatter.

definition

Let $H$ be a set family and $C$ a set. Their intersection is defined as the following set:

$$H \cap C \coloneqq \{h \cap C \mid h \in H\}$$

We say that the set $C$ is shattered by $H$ if $H \cap C$ contains all the subsets of $C$, or:

$$|H \cap C| = 2^{|C|}$$

Thus, the VC dimension $D$ of $H$ is the cardinality of the largest set that is shattered by $H$.

Note that if arbitrarily large sets can be shattered, then the VC dimension is $\infty$
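A brute-force sketch of the shattering check (my addition): enumerate the labelings a hypothesis class can realise on a finite set $C$ and compare with $2^{|C|}$. Here the class is 1D thresholds $h_t(x) = 1_{x \ge t}$, whose VC dimension is 1.

```python
import numpy as np

def threshold_labelings(points):
    """All labelings of `points` realisable by h_t(x) = 1[x >= t]."""
    thresholds = np.concatenate(([-np.inf], np.sort(points) + 1e-9, [np.inf]))
    return {tuple((points >= t).astype(int)) for t in thresholds}

def is_shattered(points):
    return len(threshold_labelings(points)) == 2 ** len(points)

print(is_shattered(np.array([0.0])))        # True  -> VC dimension >= 1
print(is_shattered(np.array([0.0, 1.0])))   # False -> labeling (1, 0) unreachable, so VC dim = 1
```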
