
See also some statistical theory

Bayes rule and chain rule

Joint distribution: $P(X,Y)$

Conditional distribution of $X$ given $Y$: $P(X\mid Y) = \frac{P(X,Y)}{P(Y)}$

Bayes rule: $P(X\mid Y) = \frac{P(Y\mid X)P(X)}{P(Y)}$

Chain rule:

for two events:

$$P(A, B) = P(B \mid A)P(A)$$

generalised:

$$\begin{aligned} P(X_1, X_2, \ldots , X_k) &= P(X_1) \prod_{j=2}^{k} P(X_j \mid X_1,\dots,X_{j-1}) \\[12pt] &\because \text{expansion: } P(X_1)P(X_2\mid X_1)\ldots P(X_k\mid X_1,X_2,\ldots,X_{k-1}) \end{aligned}$$
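As a quick sanity check (a sketch I added, not from the slides), we can build a random joint table over three binary variables and confirm that the chain-rule product recovers the joint exactly:

```python
import numpy as np

# verify P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2) on a random joint table
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                 # joint distribution over (X1, X2, X3)

P1 = P.sum(axis=(1, 2))      # P(x1)
P12 = P.sum(axis=2)          # P(x1, x2)

for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            chain = P1[x1] * (P12[x1, x2] / P1[x1]) * (P[x1, x2, x3] / P12[x1, x2])
            assert np.isclose(chain, P[x1, x2, x3])
```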

i.i.d assumption

assume an underlying distribution $D$, and that the train and test sets are independent and identically distributed (i.i.d.)

Example: flip a coin

Outcome $H=0$ or $T=1$ with $P(H) = p$ and $P(T) = 1-p$; equivalently $x \in \{0,1\}$, where $x$ is a Bernoulli random variable.

$P(x=0)=\alpha$ and $P(x=1)=1-\alpha$


Note that for any random variables $A,B,C$ we have:

$$P(A,B \mid C) = P(A\mid B,C)\, P(B \mid C)$$

nearest neighbour

See also: slides 13, slides 14, slides 15

expected error minimisation

think of it as bias-variance tradeoff

Squared loss: $l(\hat{y},y)=(y-\hat{y})^2$

solution to $y^* = \argmin_{\hat{y}} E_{X,Y}(Y-\hat{y}(X))^2$ is $E[Y \mid X=x]$

Instead we have $Z = \{(x^i, y^i)\}^n_{i=1}$

error decomposition

$$\begin{aligned} E_{x,y}(y-\hat{y}_Z(x))^2 &= E_{x,y}(y-y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y}_Z(x))^2 \\ &= \text{noise} + \text{estimation error} \end{aligned}$$

bias-variance decompositions

For linear estimator:

$$\begin{aligned} E_Z E_{x,y}(y-(\hat{y}_Z(x)\coloneqq W^T_Z x))^2 =\ & E_{x,y}(y-y^{*}(x))^2 && \text{noise} \\ &+ E_x(y^{*}(x) - E_Z(\hat{y}_Z(x)))^2 && \text{bias} \\ &+ E_x E_Z(\hat{y}_Z(x) - E_Z(\hat{y}_Z(x)))^2 && \text{variance} \end{aligned}$$
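A rough Monte Carlo illustration of the decomposition (my own sketch; the ground-truth function, noise level, and sample size are assumptions): repeatedly draw training sets $Z$, fit a linear estimator, and estimate the bias and variance terms empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * x)       # assumed ground truth y*(x)
sigma = 0.3                            # assumed noise standard deviation
x_test = np.linspace(0, 1, 200)

def fit_linear(n=20):
    """Draw one training set Z and fit y = w1*x + w0 by least squares."""
    x = rng.uniform(0, 1, n)
    y = f_star(x) + sigma * rng.normal(size=n)
    w1, w0 = np.polyfit(x, y, deg=1)
    return w1 * x_test + w0

preds = np.stack([fit_linear() for _ in range(500)])   # predictions over 500 draws of Z
mean_pred = preds.mean(axis=0)                         # E_Z[ŷ_Z(x)]

noise = sigma ** 2                                     # E_{x,y}(y - y*(x))^2
bias2 = np.mean((f_star(x_test) - mean_pred) ** 2)     # E_x(y*(x) - E_Z ŷ_Z(x))^2
variance = np.mean((preds - mean_pred) ** 2)           # E_x E_Z(ŷ_Z(x) - E_Z ŷ_Z(x))^2
print(noise, bias2, variance)
```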

accuracy

zero-one loss:

$$l^{0-1}(y, \hat{y}) = 1_{y \neq \hat{y}} = \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}$$

linear classifier

$$\begin{aligned} \hat{y}_W(x) &= \text{sign}(W^T x) = 1_{W^T x \geq 0} \\[8pt] &\because \hat{W} = \argmin_{W} L_{Z}^{0-1} (\hat{y}_W) \end{aligned}$$

surrogate loss functions

assume the classifier returns a discrete value $\hat{y}_W = \text{sign}(W^T x) \in \{0,1\}$

What if classifier's output is continuous?

$\hat{y}$ will also capture the “confidence” of the classifier.

Think of continuous loss functions: margin loss, cross-entropy/negative log-likelihood, etc.

linearly separable data

linearly separable

A binary classification data set $Z=\{(x^i, y^i)\}_{i=1}^{n}$ is linearly separable if there exists a $W^{*}$ such that:

  • $\forall i \in [n],\ \text{SGN}(\langle x^i, W^{*}\rangle) = y^i$
  • Or, for every $i \in [n]$ we have $(W^{*T}x^i)y^i > 0$

linear programming

$$\begin{aligned} \max_{w \in \mathbb{R}^d} \quad &\langle u, w \rangle = \sum_{i=1}^{d} u_i w_i \\ &\text{s.t. } A w \ge v \end{aligned}$$

Given that data is linearly separable

$$\begin{aligned} \exists\, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i > 0 \\ \exists\, W^{*}, \gamma > 0 &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i \ge \gamma \\ \exists\, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)y^i \ge 1 \end{aligned}$$

LP for linear classification

  • Define $A = [x_j^i y^i]_{n \times d}$

  • finding an optimal $W$ is then equivalent to the following feasibility problem (a quick solver sketch follows):

$$\begin{aligned} \max_{w \in \mathbb{R}^d} \quad &\langle \vec{0}, w \rangle \\ &\text{s.t. } Aw \ge \vec{1} \end{aligned}$$
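As a sketch (my addition), this feasibility problem can be handed straight to an off-the-shelf LP solver. Here `scipy.optimize.linprog` is used with a zero objective, rewriting $Aw \ge \vec{1}$ as $-Aw \le -\vec{1}$; the toy data is made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# toy linearly separable data; rows of A are x^i scaled by y^i
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
A = X * y[:, None]                       # A = [x_j^i y^i]

d = A.shape[1]
res = linprog(
    c=np.zeros(d),                       # maximise <0, w>: pure feasibility
    A_ub=-A, b_ub=-np.ones(len(A)),      # Aw >= 1  <=>  -Aw <= -1
    bounds=[(None, None)] * d,           # w unconstrained in sign
)
print(res.status, res.x)                 # status 0 means a separating w was found
```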

perceptron

Rosenblatt’s perceptron algorithm

"\\begin{algorithm}\n\\caption{Batch Perceptron}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE Initialize $\\mathbf{w}^{(1)} = (0,\\ldots,0)$\n\\FOR{$t = 1,2,\\ldots$}\n \\IF{$(\\exists \\space i \\text{ s.t. } y_i\\langle\\mathbf{w}^{(t)}, \\mathbf{x}_i\\rangle \\leq 0)$}\n \\STATE $\\mathbf{w}^{(t+1)} = \\mathbf{w}^{(t)} + y_i\\mathbf{x}_i$\n \\ELSE\n \\STATE \\textbf{output} $\\mathbf{w}^{(t)}$\n \\ENDIF\n\\ENDFOR\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 5 Batch Perceptron

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: Initialize $\mathbf{w}^{(1)} = (0,\ldots,0)$
2: for $t = 1,2,\ldots$ do
3:   if $\exists\, i \text{ s.t. } y_i\langle\mathbf{w}^{(t)}, \mathbf{x}_i\rangle \leq 0$ then
4:     $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i\mathbf{x}_i$
5:   else
6:     output $\mathbf{w}^{(t)}$
7:   end if
8: end for
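A minimal numpy version of the batch perceptron above (my own sketch; the data and names are illustrative):

```python
import numpy as np

def batch_perceptron(X, y, max_iter=1_000):
    """X: (m, d) inputs, y: (m,) labels in {-1, +1}. Returns a separating w or None."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:
            return w                     # every example now has positive margin
        i = mistakes[0]                  # any violated example works
        w = w + y[i] * X[i]              # perceptron update
    return None                          # gave up (data may not be separable)

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(batch_perceptron(X, y))
```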

greedy update

$$\begin{aligned} W_{\text{new}}^T x^i y^i &= \langle W_{\text{old}}+ y^i x^i,\, x^i \rangle y^i \\ &= W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2 y^{i} y^{i} \end{aligned}$$

so each update increases the margin term $W^T x^i y^i$ of the violated example by $\|x^i\|_2^2 > 0$.

proof

See also (Novikoff, 1962)

Theorem

Assume there exists some parameter vector $\underline{\theta}^{*}$ such that $\|\underline{\theta}^{*}\| = 1$, and $\exists\, \upgamma > 0$ such that

$$y_t(\underline{x_t} \cdot \underline{\theta^{*}}) \ge \upgamma$$

Assumption: $\forall\, t = 1 \ldots n,\ \|\underline{x_t}\| \le R$

Then the perceptron makes at most $\frac{R^2}{\upgamma^2}$ errors

proof by induction

definition of $\underline{\theta^k}$

to be the parameter vector when the algorithm makes its $k^{\text{th}}$ error.

Note that we have $\underline{\theta^{1}}=\underline{0}$

Assume that the $k^{\text{th}}$ error is made on example $t$; then

$$\begin{align} \underline{\theta^{k+1}} \cdot \underline{\theta^{*}} &= (\underline{\theta^k} + y_t \underline{x_t}) \cdot \underline{\theta^{*}} \\ &= \underline{\theta^k} \cdot \underline{\theta^{*}} + y_t \underline{x_t} \cdot \underline{\theta^{*}} \\ &\ge \underline{\theta^k} \cdot \underline{\theta^{*}} + \upgamma \\[12pt] &\because \text{ Assumption: } y_t \underline{x_t} \cdot \underline{\theta^{*}} \ge \upgamma \end{align}$$

It follows by induction on $k$ that

$$\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} \ge k \upgamma$$

Using Cauchy-Schwarz we have $\|\underline{\theta^{k+1}}\| \times \|\underline{\theta^{*}}\| \ge \underline{\theta^{k+1}} \cdot \underline{\theta^{*}}$

$$\begin{align} \|\underline{\theta^{k+1}}\| &\ge k \upgamma \\[16pt] &\because \|\underline{\theta^{*}}\| = 1 \end{align}$$

In the second part, we will find an upper bound for the quantity in (5):

$$\begin{align} \|\underline{\theta^{k+1}}\|^2 &= \|\underline{\theta^k} + y_t \underline{x_t}\|^2 \\ &= \|\underline{\theta^k}\|^2 + y_t^2 \|\underline{x_t}\|^2 + 2 y_t \underline{x_t} \cdot \underline{\theta^k} \\ &\le \|\underline{\theta^k}\|^2 + R^2 \end{align}$$

(9) is due to:

  • $y_t^2 \|\underline{x_t}\|^2 = \|\underline{x_t}\|^2 \le R^2$ by the assumption of the theorem (since $y_t^2 = 1$)
  • $y_t \underline{x_t} \cdot \underline{\theta^k} \le 0$, given that the parameter vector $\underline{\theta^k}$ made an error on the $t^{\text{th}}$ example.

It follows by induction on $k$ that

$$\begin{align} \|\underline{\theta^{k+1}}\|^2 \le kR^2 \end{align}$$

Combining (5) and (10) gives us

$$\begin{aligned} k^2 \upgamma^2 &\le \|\underline{\theta^{k+1}}\|^2 \le kR^2 \\ k &\le \frac{R^2}{\upgamma^2} \end{aligned}$$

Support Vector Machine

idea: maximise the margin, making the classifier more robust to “perturbations”

The Euclidean distance between a point $x$ and the hyperplane parametrised by $W$ is:

$$\frac{\mid W^T x + b \mid }{\|W\|_2}$$

Assuming $\|W\|_2=1$, the distance is $\mid W^T x + b \mid$

regularization

SVMs are good for high-dimensional data

We can probably use a solver, or gradient descent

maximum margin hyperplane

$W$ has margin $\gamma$ if

$$\begin{aligned} W^T x + b \ge \gamma \quad &\forall \text{ blue } x \\ W^T x + b \le -\gamma \quad &\forall \text{ red } x \end{aligned}$$

Margin:

$$Z = \{(x^{i}, y^{i})\}_{i=1}^{n},\quad y \in \{-1, 1\},\quad \|W\|_2 = 1$$

hard-margin SVM

this is the version with bias

"\\begin{algorithm}\n\\caption{Hard-SVM}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{solve:} $(w_{0},b_{0}) = \\argmin\\limits_{(w,b)} \\|w\\|^2 \\text{ s.t } \\forall i, y_{i}(\\langle{w,x_i} \\rangle + b) \\ge 1$\n\\STATE \\textbf{output:} $\\hat{w} = \\frac{w_0}{\\|w_0\\|}, \\hat{b} = \\frac{b_0}{\\|w_0\\|}$\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 2 Hard-SVM

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: solve: $(w_{0},b_{0}) = \argmin\limits_{(w,b)} \|w\|^2 \text{ s.t. } \forall i,\ y_{i}(\langle w,x_i \rangle + b) \ge 1$
2: output: $\hat{w} = \frac{w_0}{\|w_0\|},\ \hat{b} = \frac{b_0}{\|w_0\|}$

Note that this version is sensitive to outliers

it assumes that the training set is linearly separable

soft-margin SVM

can be applied even if the training set is not linearly separable

"\\begin{algorithm}\n\\caption{Soft-SVM}\n\\begin{algorithmic}\n\\REQUIRE Input $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{parameter:} $\\lambda > 0$\n\\STATE \\textbf{solve:} $\\min_{\\mathbf{w}, b, \\boldsymbol{\\xi}} \\left( \\lambda \\|\\mathbf{w}\\|^2 + \\frac{1}{m} \\sum_{i=1}^m \\xi_i \\right)$\n\\STATE \\textbf{s.t: } $\\forall i, \\quad y_i (\\langle \\mathbf{w}, \\mathbf{x}_i \\rangle + b) \\geq 1 - \\xi_i \\quad \\text{and} \\quad \\xi_i \\geq 0$\n\\STATE \\textbf{output:} $\\mathbf{w}, b$\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 3 Soft-SVM

Require: Input $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: parameter: $\lambda > 0$
2: solve: $\min_{\mathbf{w}, b, \boldsymbol{\xi}} \left( \lambda \|\mathbf{w}\|^2 + \frac{1}{m} \sum_{i=1}^m \xi_i \right)$
3: s.t.: $\forall i,\quad y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0$
4: output: $\mathbf{w}, b$

Equivalent form of soft-margin SVM:

$$\begin{aligned} \min_{w}\ &(\lambda \|w\|^2 + L_S^{\text{hinge}}(w)) \\[8pt] L_{S}^{\text{hinge}}(w) &= \frac{1}{m} \sum_{i=1}^{m} \max{(\{0,\, 1 - y_i \langle w, x_i \rangle\})} \end{aligned}$$
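The hinge-loss form lends itself to a simple (sub)gradient method; below is a rough sketch I added (no bias term, fixed step size, toy data):

```python
import numpy as np

def soft_svm_subgradient(X, y, lam=0.1, lr=0.05, epochs=200):
    """Minimise lam*||w||^2 + (1/m) sum_i max(0, 1 - y_i <w, x_i>)."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                              # examples with non-zero hinge loss
        g_hinge = -(y[active, None] * X[active]).sum(axis=0) / m
        w -= lr * (2 * lam * w + g_hinge)                 # subgradient step
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(np.sign(X @ soft_svm_subgradient(X, y)))            # should match y
```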

SVM with basis functions

$$\min_{W} \frac{1}{n} \sum \max \{0, 1 - y^i \langle w, \phi(x^i) \rangle\} + \lambda \|w\|^2_2$$

$\phi(x^i)$ can be high-dimensional

representer theorem

$$W^{*} = \argmin_{W} \frac{1}{n} \sum \max \{0, 1- y^i \langle w, \phi (x^i) \rangle\} + \lambda \|w\|^2_2$$

theorem

There are real values $a_{1},\ldots,a_{m}$ such that 1

$$W^{*} = \sum a_i \phi(x^i)$$

kernelized SVM

from the representer theorem, we have the kernel:

$$K(x,z) = \langle \phi(x), \phi(z) \rangle$$
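A small illustration (my addition): the kernel lets us work with $\langle \phi(x), \phi(z) \rangle$ without ever materialising $\phi$. For instance, the RBF kernel only needs pairwise squared distances:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)                     # the n x n Gram matrix K
print(K.shape, np.allclose(K, K.T))      # (5, 5) True: symmetric, positive semi-definite
```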

drawbacks

  • prediction-time complexity
  • need to store all training data
  • dealing with $\mathbf{K}_{n \times n}$
  • choice of kernel, which is tricky and sometimes quite heuristic.

minimize squared error

Fit a homogeneous line $y = ax$ to the non-linear curve $f(x) = x^2 + 1$, where $a,y,x \in \mathbb{R}$

Assuming $x$ is uniformly distributed on $[0,1]$, what is the value of $a$ that minimizes the squared error?

$$\argmin_{a} E[(ax - x^2 - 1)^2]$$

or we need to find

$$\argmin_{a} \int_{-\infty}^{\infty} P_X(x) (ax - x^2 -1)^2 \, dx$$
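A numerical check I added (not part of the original exercise text): approximate the expectation with a dense uniform grid and minimise over a grid of $a$ values; the answer agrees with the closed-form minimiser $a^{*} = \frac{E[x(x^2+1)]}{E[x^2]} = \frac{9}{4}$.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 200_001)               # dense grid standing in for Uniform[0, 1]

def expected_sq_error(a: float) -> float:
    return np.mean((a * xs - xs**2 - 1.0) ** 2)   # ≈ E[(a x - x^2 - 1)^2]

a_grid = np.linspace(0.0, 5.0, 5001)
a_star = a_grid[np.argmin([expected_sq_error(a) for a in a_grid])]
print(a_star)                                     # ≈ 2.25 = 9/4
```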

multi-variate chain rule

$$\nabla_x (f \circ g)(x) = [\nabla g]_{d \times m} \cdot [\nabla f]_{m \times n}$$

Or we can find the Jacobian $\mathcal{J}_f$

if $f = Ax$ then $\nabla f = A$

classification

one-versus-all classification

idea: train $k$ different binary classifiers:

$$h_i(x) = \text{sgn}(\langle w_i, x \rangle)$$

end-to-end version, or multi-class SVM with generalized Hinge loss:

"\\begin{algorithm}\n\\caption{Multiclass SVM}\n\\begin{algorithmic}\n\\REQUIRE Input $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\REQUIRE\n \\STATE Regularization parameter $\\lambda > 0$\n \\STATE Loss function $\\Delta: \\mathcal{Y} \\times \\mathcal{Y} \\to \\mathbb{R}_+$\n \\STATE Class-sensitive feature mapping $\\Psi: \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}^d$\n\\ENSURE\n\\STATE \\textbf{solve}: $\\min_{\\mathbf{w} \\in \\mathbb{R}^d} \\left(\\lambda\\|\\mathbf{w}\\|^2 + \\frac{1}{m}\\sum_{i=1}^m \\max_{y' \\in \\mathcal{Y}} \\left(\\Delta(y', y_i) + \\langle\\mathbf{w}, \\Psi(\\mathbf{x}_i, y') - \\Psi(\\mathbf{x}_i, y_i)\\rangle\\right)\\right)$\n\\STATE \\textbf{output}: the predictor $h_{\\mathbf{w}}(\\mathbf{x}) = \\argmax_{y \\in \\mathcal{Y}} \\langle\\mathbf{w}, \\Psi(\\mathbf{x}, y)\\rangle$\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 4 Multiclass SVM

Require: Input $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

Require:
1: Regularization parameter $\lambda > 0$
2: Loss function $\Delta: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$
3: Class-sensitive feature mapping $\Psi: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$
4: solve: $\min_{\mathbf{w} \in \mathbb{R}^d} \left(\lambda\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{i=1}^m \max_{y' \in \mathcal{Y}} \left(\Delta(y', y_i) + \langle\mathbf{w}, \Psi(\mathbf{x}_i, y') - \Psi(\mathbf{x}_i, y_i)\rangle\right)\right)$
5: output: the predictor $h_{\mathbf{w}}(\mathbf{x}) = \argmax_{y \in \mathcal{Y}} \langle\mathbf{w}, \Psi(\mathbf{x}, y)\rangle$

all-pairs classification

For each distinct pair $i,j \in \{1,2,\ldots,k\}$, we train a classifier to distinguish samples from class $i$ and samples from class $j$:

$$h_{i,j}(x) = \text{sgn}(\langle w_{i,j}, x \rangle)$$

linear multi-class predictor

think of multi-vector encoding for $y \in \{1,2,\ldots,k\}$, where $(x,y)$ is encoded as $\Psi(x,y) = [0\ \ldots\ 0\ x\ 0\ \ldots\ 0]^T$

thus the linear multi-class predictor becomes:

$$h(x) = \argmax_{y} \langle w, \Psi(x,y) \rangle$$

error type

type 1: false positive; type 2: false negative

accuracy: $\frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}$

precision is $\frac{\text{TP}}{\text{TP}+\text{FP}}$

recall is $\frac{\text{TP}}{\text{TP + FN}}$
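A tiny sketch (my addition) computing these metrics from raw predictions:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))   # type 1 errors
fn = np.sum((y_pred == 0) & (y_true == 1))   # type 2 errors

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)           # 0.75 0.75 0.75 for this toy example
```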

probabilistic modeling

example: assume each class is a Gaussian

discriminant analysis

$$P(x \mid y = 1, \mu_0, \mu_1, \beta) = \frac{1}{a_0} e^{-\|x-\mu_1\|^2_2}$$

maximum likelihood estimate

see also prior and posterior distribution

given $\Theta = \{\mu_1, \mu_2, \beta\}$:

$$\argmax_{\Theta} P(Z \mid \Theta) = \argmax_{\Theta} \prod_{i=1}^{n} P(x^i, y^i \mid \Theta)$$

How can we predict the label of a new test point?

Or in other words, how can we run inference?

Check whether $\frac{P(y=0 \mid X, \Theta)}{P(y=1 \mid X, \Theta)} \ge 1$

Generalization for correlated features

Gaussian for correlated features:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2 \pi)^{d/2}|\Sigma|^{1/2}} \exp \left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$$

Naive Bayes Classifier

assumption

Given the label, the coordinates are statistically independent

$$P(x \mid y = k, \Theta) = \prod_j P(x_j \mid y=k, \Theta)$$

idea: comparison between discriminative and generative models

Logistic regression

😄 fun fact: despite the name, it is actually used for classification rather than regression problems

Assume there is a plane in $\mathbb{R}^d$ parameterized by $W$

$$\begin{aligned} P(Y = 1 \mid x, W) &= \phi (W^T x) \\ P(Y= 0 \mid x, W) &= 1 - \phi (W^T x) \\[12pt] &\because \phi (a) = \frac{1}{1+e^{-a}} \end{aligned}$$

maximum likelihood

$1 - \phi (a) = \phi (-a)$

$$\begin{aligned} W^{\text{ML}} &= \argmax_{W} \prod P(x^i, y^i \mid W) \\ &= \argmax_{W} \prod \frac{P(x^i, y^i, W)}{P(W)} \\ &= \argmax_{W} \prod P(y^i \mid x^i, W) P(x^i) \\ &= \argmax_{W} \lbrack \prod P(x^i) \rbrack \lbrack \prod P(y^i \mid x^i, W) \rbrack \\ &= \argmax_{W} \sum_{i=1}^{n} \log (\phi (y^i W^T x^i)) \end{aligned}$$

equivalent form

maximize the following:

$$\sum_{i=1}^{n} (y^i \log p^i + (1-y^i) \log (1-p^i))$$

softmax

$$\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}$$

where $y \in \mathbb{R}^k$
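In practice one shifts the logits by their maximum for numerical stability (the result is unchanged); a small sketch I added:

```python
import numpy as np

def softmax(y: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = y - y.max(axis=-1, keepdims=True)    # shifting leaves the output unchanged
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))            # sums to 1
print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # same output, no overflow
```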


cross entropy



feed-forward neural network

stateless, and usually has no feedback loops.

universal approximation theorem

see also pdf (Cybenko, 1989)

idea: an FFN with a single hidden layer of sigmoid activation functions can closely approximate any given continuous function.

regression

Think of just linear layers with some activation functions

import torch
import torch.optim as optim
import torch.nn as nn
 
class LinearRegression(nn.Module):
  def __init__(self, input_dim, output_dim):
    super().__init__()
    self.fc = nn.Linear(input_dim, output_dim)
  def forward(self, x): return self.fc(x)
 
model = LinearRegression(224, 10)
loss = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.005)
 
# dummy data so the loop runs end to end (illustrative shapes)
X, Y = torch.randn(100, 224), torch.randn(100, 10)
 
for ep in range(10):
  y_pred = model(X)
  l = loss(y_pred, Y)          # mean squared error between predictions and targets
  optimizer.zero_grad()        # clear gradients accumulated from the previous step
  l.backward()                 # backpropagate
  optimizer.step()             # update the parameters

classification

Think of one-hot encoding (binary or multiclass) cases

backpropagation

context: using SGD we can compute the gradient:

$$\nabla_W(L(w,b)) = \sum_{i} \nabla_W (l(f_{w,b}(x^i), y^i))$$

This is expensive: for a deep model, naively recomputing each partial derivative repeats a lot of work!

intuition: we want to minimize the error while reusing the intermediate values saved during one forward pass when computing the gradients.

vanishing gradient

happens in deeper networks with respect to the partial derivatives of the early layers

because we apply the chain rule, propagating error signals backward from the output layer through all the hidden layers to the input; in very deep networks, this involves successive multiplication of gradients from each layer.

thus for saturated neurons $\sigma^{\prime}(x) \approx 0$, and the gradient does not reach the first layers.
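A quick numerical illustration of the effect (my addition): multiplying sigmoid derivatives layer after layer shrinks the signal exponentially, since $\sigma^{\prime}(x) \le \frac{1}{4}$.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # never larger than 0.25

pre_activations = np.full(20, 2.0)     # assume 20 layers, each with pre-activation 2.0
grad_factor = np.prod(dsigmoid(pre_activations))
print(grad_factor)                     # ~3e-20: almost nothing reaches the first layer
```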

solution:

  • we can use non-saturating activation functions (e.g. Leaky ReLU)
  • better initialisation
  • residual network

regularization

usually prone to overfitting given they are often over-parameterized

  1. We can usually add regularization terms to the objective functions
  2. Early stopping
  3. Adding noise
  4. structural regularization, via adding dropout

dropout

a case of structural regularization

a technique that randomly drops each node with probability $p$
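A minimal sketch of (inverted) dropout at training time (my addition; at test time the layer is just the identity):

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    """Randomly drop each unit of h with probability p (inverted dropout)."""
    if not training or p == 0.0:
        return h
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)        # rescale so the expected activation is unchanged

print(dropout(np.ones((2, 8)), p=0.5))
```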


Bibliography

  • Novikoff, A. B. J. (1962). On Convergence Proofs for Perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata, 12, 615–622. https://apps.dtic.mil/sti/tr/pdf/AD0298258.pdf
  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314.


Convolutional Neural Network

See also: this one assignment on CNN

how can we exploit sparsity and locality?

think of sparse connectivity rather than full connectivity

we exploit translation invariance: a feature that is useful in one part of the image is likely useful in other parts as well

convolution

accepts a volume of size $W_1 \times H_1 \times D_1$ with four hyperparameters:

  • number of filters $K$
  • spatial extent $F$
  • stride $S$
  • amount of zero padding $P$

calculation

produces a volume of size $W_2 \times H_2 \times D_2$ where (checked in the sketch after this list):

  • $W_2 = \frac{W_1 - F + 2P}{S} + 1$
  • $H_2 = \frac{H_1 - F + 2P}{S} + 1$
  • $D_2 = K$
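A quick check of the formula against PyTorch's `Conv2d` (my addition; the numbers are arbitrary):

```python
import torch
import torch.nn as nn

W1 = H1 = 32; D1 = 3            # input volume 32 x 32 x 3
K, F, S, P = 16, 5, 1, 2        # filters, spatial extent, stride, zero padding

conv = nn.Conv2d(in_channels=D1, out_channels=K, kernel_size=F, stride=S, padding=P)
out = conv(torch.zeros(1, D1, H1, W1))
print(out.shape)                # torch.Size([1, 16, 32, 32])

W2 = (W1 - F + 2 * P) // S + 1
print(W2)                       # 32, matching the formula above
```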

1D convolution:

$$\begin{aligned} y &= (x*w) \\ y(i) &= \sum_{t}x(t)\,w(i-t) \end{aligned}$$

2D convolution:

$$\begin{aligned} y &= (x*w) \\ y(i,j) &= \sum_{t_1} \sum_{t_2} x(t_1, t_2)\, w(i-t_1,j-t_2) \end{aligned}$$

max pooling

idea: reduce the spatial size of the representation (and with it the computation and parameters of later layers)

batchnorm

$x^{j} = [x_1^j,\ldots,x_d^j]$

Batch $X = [(x^1)^T \ldots (x^b)^T]^T$
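A minimal sketch of the batchnorm forward pass over such a batch $X$ (my addition; $\gamma, \beta$ are the learned scale and shift, $\epsilon$ avoids division by zero):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """X: (b, d) batch; normalise each coordinate over the batch, then scale and shift."""
    mu = X.mean(axis=0)                      # per-coordinate batch mean
    var = X.var(axis=0)                      # per-coordinate batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 4))
out = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per coordinate
```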


autoencoders

Think of using autoencoders to extract representations.

sparsity allows us to interpret hidden layers and the internal representations of Transformer models.

graph TD
    A[Input X] --> B[Layer 1]
    B --> C[Layer 2]
    C --> D[Latent Features Z]
    D --> E[Layer 3]
    E --> F[Layer 4]
    F --> G[Output X']

    subgraph Encoder
        A --> B --> C
    end

    subgraph Decoder
        E --> F
    end

    style D fill:#c9a2d8,stroke:#000,stroke-width:2px,color:#fff
    style A fill:#98FB98,stroke:#000,stroke-width:2px
    style G fill:#F4A460,stroke:#000,stroke-width:2px

see also latent space

definition

$$\begin{aligned} \text{Enc}_{\Theta_1}&: \mathbb{R}^d \to \mathbb{R}^q \\ \text{Dec}_{\Theta_2}&: \mathbb{R}^q \to \mathbb{R}^d \\[12pt] &\because q \ll d \end{aligned}$$

loss function: $l(x) = \|\text{Dec}_{\Theta_2}(\text{Enc}_{\Theta_1}(x)) - x\|$
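A minimal PyTorch sketch of this encoder/decoder pair and its reconstruction loss (my addition; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d=784, q=32):                  # q << d
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, q))
        self.dec = nn.Sequential(nn.Linear(q, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, x):
        return self.dec(self.enc(x))                  # Dec(Enc(x))

model = AutoEncoder()
x = torch.randn(16, 784)
loss = nn.functional.mse_loss(model(x), x)            # ||Dec(Enc(x)) - x||^2
loss.backward()
```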

contrastive representation learning

The goal of contrastive representation learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised settings. article

intuition: form positive and negative pairs for optimizing the loss function.


training objective

we want smaller reconstruction error, or

$$\|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|_2^2$$

we want the latent space distribution to look similar to an isotropic Gaussian!

Kullback-Leibler divergence

denoted as $D_{\text{KL}}(P \parallel Q)$

definition

The statistical distance measuring how a model probability distribution $Q$ differs from a true probability distribution $P$:

$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right)$$

alternative form:

$$\begin{aligned} \text{KL}(p \parallel q) &= E_{x \sim p}\left(\log \frac{p(x)}{q(x)}\right) \\ &= \int_x p(x) \log \frac{p(x)}{q(x)}\, dx \end{aligned}$$
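A small numerical example I added, for two discrete distributions:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])        # "true" distribution
Q = np.array([0.4, 0.4, 0.2])        # model distribution

kl_pq = np.sum(P * np.log(P / Q))
kl_qp = np.sum(Q * np.log(Q / P))
print(kl_pq, kl_qp)                  # both >= 0, and not symmetric
```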

variational autoencoders

idea: add a Gaussian sampler after computing the latent space.

objective function:

$$\min \left(\sum_{x} \|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|^2_2 + \lambda \sum_{i=1}^{q}(-\log (\sigma_i^2) + \sigma_i^2 + \mu_i^2)\right)$$
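A rough sketch of the sampler and this objective in PyTorch (my addition): the encoder is assumed to output $\mu$ and $\log \sigma^2$, and the reparameterisation trick keeps sampling differentiable.

```python
import torch

def sampler(mu, log_var):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + (0.5 * log_var).exp() * eps

def vae_loss(x, x_hat, mu, log_var, lam=1.0):
    """Reconstruction error plus the Gaussian regulariser from the objective above."""
    recon = ((x_hat - x) ** 2).sum()
    reg = (-log_var + log_var.exp() + mu ** 2).sum()   # sum_i (-log sigma_i^2 + sigma_i^2 + mu_i^2)
    return recon + lam * reg

# dummy encoder outputs for a batch of 8 points, q = 4 latent dimensions
mu, log_var = torch.zeros(8, 4), torch.zeros(8, 4)
z = sampler(mu, log_var)
x, x_hat = torch.randn(8, 16), torch.randn(8, 16)
print(z.shape, vae_loss(x, x_hat, mu, log_var))
```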

ensemble learning

idea: train multiple classifiers and then combine them to improve performance.

aggregate their decisions via a voting procedure.

Think of boosting, decision trees.

bagging

using non-overlapping training subsets creates truly independent/diverse classifiers

bagging is essentially bootstrap aggregating where we do random sampling with replacement.

random forests

bagging but with random subspace methods 2

decision tree

  • handle categorical features

NOTE

can overfit easily with deeper trees.

boosting

a greedier approach for reducing bias where we “pick base classifiers incrementally”.

we train “weak learners”, which can then be combined into a “stronger learner”.

Remarks

  1. note that we can also write $a^T \phi$ where $\phi = [\phi(x^1),\ldots,\phi(x^n)]^T$

  2. The idea of training each classifier using a random subset of the feature sets. Also known as feature bagging


Vapnik-Chervonenkis dimension

fancy name for a measure of capacity: the cardinality of the largest set of points that the algorithm can shatter.

definition

Let $H$ be a set family and $C$ a set. Their intersection is defined as the following set:

$$H \cap C \coloneqq \{h \cap C \mid h \in H\}$$

We say that the set $C$ is shattered by $H$ if $H \cap C$ contains all the subsets of $C$, or:

$$|H \cap C| = 2^{|C|}$$

Thus, the VC dimension $D$ of $H$ is the cardinality of the largest set that is shattered by $H$.

Note that if arbitrarily large sets can be shattered, then the VC dimension is $\infty$
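A brute-force sketch of the shattering check (my addition): enumerate the labelings a hypothesis class can realise on a finite set $C$ and compare with $2^{|C|}$. Here the class is 1D thresholds $h_t(x) = 1_{x \ge t}$, whose VC dimension is 1.

```python
import numpy as np

def threshold_labelings(points):
    """All labelings of `points` realisable by h_t(x) = 1[x >= t]."""
    thresholds = np.concatenate(([-np.inf], np.sort(points) + 1e-9, [np.inf]))
    return {tuple((points >= t).astype(int)) for t in thresholds}

def is_shattered(points):
    return len(threshold_labelings(points)) == 2 ** len(points)

print(is_shattered(np.array([0.0])))        # True  -> VC dimension >= 1
print(is_shattered(np.array([0.0, 1.0])))   # False -> labeling (1, 0) unreachable, so VC dim = 1
```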
