Machine Learning

Cheat Sheet

Thomas Karl

Classical ML

Estimation Theory

Maximum Likelihood

data matrix \(X = (\vec{x}_1,\dots,\vec{x}_n)\), model \(M\) with parameter vector \(\vec{\Theta}\)

probability density \(p(X|\vec{\Theta},M)\), e.g. the Gaussian \(p(x|\mu,\sigma) = \frac1{\sqrt{2\pi}\sigma}\exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)\)

statistically independent samples: \(p(X|\vec{\Theta}) = \prod\limits_{i=1}^{n}p(\vec{x}_i|\vec{\Theta})\) joint probability density

\(p(X|\vec{\Theta})\) as a function of \(X\) (fixed \(\vec{\Theta}\)): probability density; as a function of \(\vec{\Theta}\) (fixed \(X\)): likelihood \(L(\vec{\Theta})\)

Task: \(L(\vec{\Theta}_{\text{ML}}) = \max\limits_{\vec{\Theta}} L(\vec{\Theta})\)

neg-log-likelihood \(\mathcal{L}(\vec{\Theta}) = -\ln L(\vec{\Theta}) = -\sum_{i=1}^{n}\ln p(\vec{x}_i|\vec{\Theta})\)

Theorem 1.1 (Maximum-Likelihood Method). \[\frac{\partial \mathcal{L}(\vec{\Theta})}{\partial\vec{\Theta}} = -\sum\limits_{i=1}^n \frac1{p(\vec{x}_i|\vec{\Theta})} \frac{\partial p(\vec{x}_i|\vec{\Theta})}{\partial \vec{\Theta}} = \vec{0}\]
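
A minimal sketch (not part of the original cheat sheet): for the Gaussian example the stationarity condition can be solved in closed form, and the same estimates are recovered by numerically minimizing the neg-log-likelihood; all names below are illustrative.

```python
# Sketch: ML estimation of (mu, sigma) for i.i.d. Gaussian samples.
# Closed-form solution of the stationarity condition vs. numerical
# minimization of the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)      # synthetic data

def neg_log_likelihood(theta, x):
    mu, log_sigma = theta                          # parametrize sigma > 0 via its log
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)

mu_ml, sigma_ml = x.mean(), x.std()                # closed-form ML estimates
res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(x,))

print(mu_ml, sigma_ml)                             # closed form
print(res.x[0], np.exp(res.x[1]))                  # numerical, should agree
```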

Properties:

Theorem 1.2 (Bayes). \[P(\vec{\Theta}|X) = \frac{P(X|\vec{\Theta})P(\vec{\Theta})}{P(X)}\] with \(P(\vec{\Theta}|X)\) a posteriori probability, \(P(X|\vec{\Theta})\) likelihood, \(P(\vec{\Theta})\) a priori probability of the model, \(P(X)\) a priori probability of the data

maximum a posteriori \(\vec{\Theta}_{\text{MAP}} = \mathrm{argmax}_{\vec{\Theta}} \left( p(X|\vec{\Theta},M)p(\vec{\Theta}|M) \right)\)

advantages:

Expectation Maximization Algorithm

latent variables \(Z\), complete likelihood \(L_C(\vec{\Theta}|X,Z)\)

E-step: expectation of \(\ln L_C\): \(Q(\vec{\Theta}|\vec{\Theta^l}) = E[ \ln L_C(\vec{\Theta}|X,Z)|X,\vec{\Theta^l} ]\)

M-step: \(\vec{\Theta^{l+1}} = \mathrm{argmax}_{\vec{\Theta}} (Q(\vec{\Theta}|\vec{\Theta^l}))\)

\(L_C(\vec{\Theta^{l+1}}|X) \geq L_C(\vec{\Theta^{l}}|X)\), i.e. the likelihood never decreases

Mixtures

Gaussian mixture \(p(x) = \sum\limits_{k=1}^K p(x|G_k)P(G_k)\) with component densities \(p(x|G_k)\) and a priori probabilities \(P(G_k)\)
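
A minimal EM sketch for a one-dimensional Gaussian mixture, instantiating the E- and M-steps above (responsibilities, then re-estimation of \(P(G_k)\), means and variances); function and variable names are illustrative.

```python
# Sketch: EM for a 1-D Gaussian mixture with K components.
import numpy as np

def em_gmm_1d(x, K=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(K, 1.0 / K)                       # a priori P(G_k)
    mu = rng.choice(x, K)                          # random initial means
    var = np.full(K, x.var())
    for _ in range(iters):
        # E-step: responsibilities r[i, k] ~ P(G_k | x_i)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete log-likelihood
        nk = r.sum(axis=0)
        pi, mu = nk / n, (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.5, 200)])
print(em_gmm_1d(x, K=2))
```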

\(k\)-Means Clustering

codebook vectors \(m_k \in \mathbb{R}^M\)
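
A minimal \(k\)-means sketch (illustrative, not from the text): alternate between assigning each sample to its nearest codebook vector \(m_k\) and recomputing each \(m_k\) as the mean of its cluster.

```python
# Sketch: k-means clustering with codebook vectors m_k.
import numpy as np

def kmeans(X, K=3, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), K, replace=False)]            # initial codebook vectors
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # nearest codebook vector
        for k in range(K):
            if np.any(labels == k):
                m[k] = X[labels == k].mean(axis=0)         # update codebook vector
    return m, labels

X = np.random.default_rng(1).normal(size=(200, 2))
centers, labels = kmeans(X, K=3)
print(centers)
```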

Classification and Regression

Decision Trees and Random Forests

parameter: function of probability distribution

statistic: function of sample \(x\)

\(B\) bootstrap samples, replication \(\hat{\vec{\Theta}}^*(b) = s(x^*_b)\), \(b = 1\dots B\)

bootstrap estimate of the standard error \(\hat{e}_B = \sqrt{\widehat{\mathrm{Var}}_B\big(\hat{\vec{\Theta}}^*\big)}\), the empirical variance taken over the \(B\) replications
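
A minimal bootstrap sketch under the assumption that the statistic \(s\) is the sample mean (an illustrative choice): draw \(B\) resamples, compute the replications and their empirical standard deviation.

```python
# Sketch: bootstrap replications and standard-error estimate.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100)       # original sample
B = 1000

# B bootstrap replications of the statistic s(x) = mean(x)
reps = np.array([rng.choice(x, size=len(x), replace=True).mean() for _ in range(B)])
se_B = reps.std(ddof=1)                            # bootstrap standard error

print(se_B, x.std(ddof=1) / np.sqrt(len(x)))       # compare with the analytic SE of the mean
```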

CART: \(s_{\text{opt}}\) best split for feature \(x_m\)

\(N\) samples, \(M\) features

\(m \ll M\) variables selected at random at node \(t\)

score criterion \(L(y,\hat{y}) = (y-\hat{y})^2\)

regression: \(\hat{c}_m = \frac1{N_m}\sum\limits_{x_n\in \hat{R}_m}y_n\), \(Q_m(T) = \frac1{N_m}\sum\limits_{x_n\in \hat{R}_m}(y_n - \hat{c}_m)^2\)

classification: Gini impurity \(\phi(\vec{p}) = \sum\limits_j p_j(1-p_j)\) or entropy \(\phi(\vec{p}) = -\sum\limits_j p_j\log p_j\); typical choices \(m = \sqrt{M}\) for classification, \(m=\frac{M}{3}\) for regression
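
A minimal sketch of the two impurity measures used as split criteria (names are illustrative):

```python
# Sketch: Gini impurity and entropy of a class-probability vector p.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1.0 - p))

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # avoid log(0)
    return -np.sum(p * np.log(p))

p = np.array([0.7, 0.2, 0.1])
print(gini(p), entropy(p))
```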

bagging: random forest with \(m = M\)

Support Vector Machines

\(N\)-dimensional data in two classes, linearly separated by an \((N-1)\)-dimensional hyperplane \(h\) with normal vector \(w\)

\(\langle w,x \rangle + b \begin{cases} > 0 & x \in \text{pos. halfspace} \\ = 0 & x \in h \\ < 0 & x \in \text{neg. halfspace} \end{cases}\)

\(h\) does not contain origin: \(\langle w,x \rangle = \text{const.}\)

for the closest points \(x_1\) and \(x_2\) of the two classes: \(\langle w,x_1-x_2 \rangle = 2\), i.e. \(\left\langle \frac{w}{\|w\|},x_1-x_2 \right\rangle = \frac{2}{\|w\|}\)

margin \(\|w\|^{-1}\): maximize the margin \(\Leftrightarrow\) minimize \(\|w\|\)

cost function \(\tau(w) = \frac12 \|w\|^2\)

Lagrangian \(L(w,b,\alpha_{1\dots m}) = \frac12\|w\|^2 - \sum\limits_{i=1}^m\alpha_i(y_i(\langle x_i,w \rangle + b)-1)\), labels \(y_i = \pm 1\)

task: maximize with respect to the dual variables (Lagrange multipliers) \(\alpha_i \geq 0\), minimize with respect to the primal variables \(w, b\)

stationarity conditions: \(\sum\limits_{i=1}^m \alpha_i y_i = 0\) and \(w = \sum\limits_{i=1}^m\alpha_iy_ix_i\); \(\alpha_i > 0\) only for the support vectors (dual optimization problem, quadratic optimization)

only support vectors count

data not linearly separable: slack variables \(\varepsilon_i \geq 0\) with \(y_i(\langle x_i,w \rangle + b) \geq 1-\varepsilon_i\)

\(\tau(w,\varepsilon) = \frac12\|w\|^2 + C\sum\limits_{i=1}^m \varepsilon_i^k\), \(0 \leq \alpha_i \leq \frac{C}{m}\)
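
A hedged sketch of fitting a soft-margin linear SVM with scikit-learn's SVC, where the parameter C weights the slack term in \(\tau(w,\varepsilon)\); the toy data are illustrative.

```python
# Sketch: soft-margin linear SVM on two overlapping point clouds.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)                 # labels y_i = +-1

clf = SVC(kernel="linear", C=1.0).fit(X, y)        # larger C -> less slack allowed
print(clf.support_vectors_.shape)                  # only the support vectors define w
print(clf.coef_, clf.intercept_)                   # w and b of the separating hyperplane
```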

Kernel Methods

\(\vec{\phi} = \phi(\vec{x})\) non-linear mapping from input to feature space

kernel function \(k(\vec{x},\vec{x}') = \langle \phi(\vec{x}),\phi(\vec{x}') \rangle = \phi^T(\vec{x})\phi(\vec{x}') = \vec{\phi}^T\vec{\phi}' = k(\vec{x}',\vec{x})\)

stationary: \(k(\vec{x},\vec{x}') = k(\vec{x}-\vec{x}')\) translation invariant space

homogeneous: \(k(\vec{x},\vec{x}') = k(\|\vec{x}-\vec{x}'\|)\)
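
A minimal sketch of common kernel functions; the RBF kernel below is both stationary and homogeneous in the sense above, since it depends only on \(\|\vec{x}-\vec{x}'\|\) (parameter names are illustrative).

```python
# Sketch: linear, polynomial, and RBF kernels.
import numpy as np

def linear_kernel(x, xp):
    return np.dot(x, xp)

def polynomial_kernel(x, xp, degree=3, c=1.0):
    return (np.dot(x, xp) + c) ** degree

def rbf_kernel(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp), rbf_kernel(x, xp))
```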

Subspace Methods and Exploratory Matrix Factorization

de-mixing basis vectors \(W\), sources, components, projections \(Z = WX\)

linear models: algebraic approach

PCA \(Z = U^TX\)

BSS, GEVD, ICA, NMF

non-linear: kernels

feature space \(Z=W\phi^T(X_B)\phi(X)\)

PCA in feature space: KPCA, kernel trick

Singular Value Decomposition

task: decompose \(N\times M\) data matrix \(X = U\Sigma V^T\)

\(U\): \(N\times N\) orthogonal

\(V\): \(M\times M\) orthogonal

\(\Sigma\): \(\mathrm{diag}(\sigma_1,\dots,\sigma_r)\) contains \(r = \min(N,M)\) singular values

covariance matrix \(\mathrm{Cov} = XX^T = U\Sigma\Sigma^TU^T = UDU^T\) eigenvalue decomposition

kernel matrix \(K = X^TX = VDV^T\), the number of eigenvalues \(>0\) is at most \(\min(N,M)\)

\(D = \Sigma\Sigma^T = \Sigma^T\Sigma\), singular values \(\sigma_i = \sqrt{\lambda_i}\) with eigenvalues \(\lambda_i\)

\(\sigma_1 > \sigma_2 > \dots\), \(X = \sigma_1 u_1 v_1^T + \dots\)
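
A minimal NumPy sketch relating the SVD of \(X\) to the eigendecomposition of \(XX^T\) (illustrative sizes):

```python
# Sketch: SVD of a data matrix and lambda_i = sigma_i^2.
import numpy as np

N, M = 6, 4
X = np.random.default_rng(0).normal(size=(N, M))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
lam, _ = np.linalg.eigh(X @ X.T)                    # eigenvalues of X X^T

print(np.allclose(X, (U * s) @ Vt))                 # reconstruction
print(np.allclose(np.sort(lam)[::-1][:len(s)], s ** 2))  # lambda_i = sigma_i^2
```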

Principal Component Analysis

mean centered data: covariance matrix equals correlation matrix

\(u_i\) is the eigenvector associated with \(\lambda_i\); projection \(z = u^T x\); seek \(u_1\) with maximum variance under the constraint \(u_1^Tu_1 = 1\)

\(\mathrm{Cor}\cdot u_1 = \alpha_1 u_1\)

component with the largest eigenvalue: \(\frac{\partial}{\partial u_1} \left\{ u_1^T\mathrm{Cor}\,u_1 - \alpha_1(u_1^Tu_1-1) \right\} = 0\)

PCA: \(u_m^TX = \tilde{z}_m \longleftrightarrow u_m u_m^TX = u_m\tilde{z}_m\)

\(Z = U_L^TX = U_L^TU\Sigma V^T = \Sigma_LV_L^T = D_L^{\frac12}V_L^T\), \(ZZ^T = D_L\), \(U_L^TU = [\mathbb{1}_L~0]\); the projections are related to the eigenvectors in \(V\), and the projected data set is uncorrelated

whitening transformation: \(YY^T = \mathbb{1}\)

\(D^{-\frac12}ZZ^TD^{-\frac12} = \mathbb{1}\)

\(Y = D_L^{-\frac12}U_L^TX = B^TX\), \(B = U_L D_L^{-\frac12}\)
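
A minimal sketch of PCA projection and whitening via the SVD of mean-centred data, following \(Z = U_L^TX\) and \(Y = D_L^{-\frac12}U_L^TX\) (sizes are illustrative):

```python
# Sketch: PCA projection (uncorrelated) and whitening (unit covariance).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))                      # 3 features, 500 samples
X = X - X.mean(axis=1, keepdims=True)              # mean centring

U, s, Vt = np.linalg.svd(X, full_matrices=False)
L = 2                                              # keep the leading L components
U_L, D_L = U[:, :L], s[:L] ** 2

Z = U_L.T @ X                                      # PCA projections, Z Z^T = D_L
Y = np.diag(D_L ** -0.5) @ Z                       # whitened data
print(np.round(Y @ Y.T, 6))                        # identity (up to numerics)
```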

kernel trick: \(U = XVD^{-\frac12} = XA\) linear combination

\(\Phi = [\phi(x_1),\phi(x_2),...]\), \(U = \Phi VD^{-\frac12}\)

\(Z = U^T\Phi = D^{-\frac12}V^T\Phi^T\Phi = D^{-\frac12}V^TK\) with kernel matrix \(K_{ij} = k(\vec{x}_i,\vec{x}_j)\)
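
A minimal kernel-PCA sketch assuming an RBF kernel and a kernel matrix centred in feature space (both illustrative choices), giving projections \(Z = D^{-\frac12}V^TK\):

```python
# Sketch: kernel PCA via the eigendecomposition of the centred kernel matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))                          # 2 features, 100 samples

sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
K = np.exp(-0.5 * sq)                                  # RBF kernel matrix

n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                                         # centring in feature space

lam, V = np.linalg.eigh(Kc)                            # K = V D V^T
lam, V = lam[::-1], V[:, ::-1]                         # sort descending
L = 2
Z = np.diag(lam[:L] ** -0.5) @ V[:, :L].T @ Kc         # kernel-PCA projections
print(Z.shape)                                         # (L, n)
```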

Blind Signal Separation

task: separate mixed signals from independent sources

\(X \in \mathbb{R}^{N\times M}\), e.g. time series of \(N\) images containing \(M\) pixels

\(X\approx WH\), \(W: N\times K\), \(H: K\times M\)

\(K = N\) \(\tilde{x}_n = W\tilde{h}_m\) time series for \(n\)-th pixel

\(X^T\approx WH\), \(W: M\times L\), \(H: L\times N\)

\(K = N\) \(\tilde{x}_m = H\tilde{s}_n\) \(m\)-th statistically independent image mode
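
A hedged sketch of the factorization \(X\approx WH\) using scikit-learn's NMF as one possible matrix-factorization method (the sizes and \(K\) are illustrative):

```python
# Sketch: non-negative matrix factorization X ~ W H.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(20, 50)))              # N = 20 "images" with M = 50 "pixels"

model = NMF(n_components=5, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)                         # N x K
H = model.components_                              # K x M
print(W.shape, H.shape, np.linalg.norm(X - W @ H))
```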

Independent Component Analysis

mixing matrix \(W\), non-linear activation \(\vec{f}(\dots)\), source \(\vec{x}\), statistically independent noise \(\vec{n}\)

output signal \(\vec{z} = \vec{f}(W\vec{x}) + \vec{n}\)

task:

statistically independent but number of signals not known: ICA \(\rightarrow\) decorrelates higher-order statistics, non-orthogonal system

occurrence probability \(p(\vec{x})\), \(\vec{x}\) in alphabet \(M_X\)

Shannon information \(I(\vec{x}) = -\log(p(\vec{x}))\)

information entropy (average information) \(H(\vec{x}) = E[I(\vec{x})] = -\sum\limits_{x\in M_x} p(\vec{x})\log(p(\vec{x}))\)

information that \(\vec{z}\) conveys about \(\vec{x}\): \(I(\vec{x}) - I(\vec{x}|\vec{z})\)

mutual information \(MI(\vec{x}|\vec{z}) = H(\vec{x}) - H(\vec{x}|\vec{z}) = MI(\vec{z}|\vec{x})\) average information of \(\vec{x}\) when observing \(\vec{z}\)

\(= H(\vec{z}) - H(\vec{N})\), the latter is the information contribution of the noise

Theorem 3.1 (ICA update rule). \[\begin{aligned} &\Delta W = \eta\frac{\partial H(\vec{z}|W)}{\partial W} = (W^T)^{-1} - \vec{\phi}(\vec{y})\,\vec{x}^T \qquad \text{with}\quad \vec{y} = W\vec{x} \\ &\text{and}\quad \phi_i(y_i) = -\frac{\mathrm{d}\log(p(y_i))}{\mathrm{d}y_i} \nonumber \end{aligned}\]
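
A minimal sketch of the update rule of Theorem 3.1 on a toy mixture, using \(\phi(y) = \tanh(y)\) as a common choice for super-Gaussian sources (an assumption, not prescribed by the text); the average over samples replaces \(\vec{\phi}(\vec{y})\vec{x}^T\).

```python
# Sketch: gradient-ascent ICA (Infomax-style) on a 2-source toy mixture.
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 5000
S = np.vstack([rng.laplace(size=T), rng.laplace(size=T)])   # independent sources
A = rng.normal(size=(n, n))                        # unknown mixing matrix
X = A @ S                                          # observed mixtures

W, eta = np.eye(n), 0.01
for _ in range(2000):
    Y = W @ X                                      # y = W x
    phi = np.tanh(Y)                               # phi(y), assumed form
    grad = np.linalg.inv(W.T) - (phi @ X.T) / T    # (W^T)^{-1} - E[phi(y) x^T]
    W = W + eta * grad                             # Delta W = eta dH/dW

print(np.round(W @ A, 2))                          # ideally a scaled permutation matrix
```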

Dictionary Learning

Supervised DL

Convolutional DL

Empirical Mode Decomposition

Intrinsic Mode Function

properties of \(x(t)\)

Decomposition

\(x(t) = \sum\limits_n x_n(t) + r(t)\) achieved through sifting

completeness is fulfilled automatically (by construction) and can be verified numerically

IMFs locally orthogonal: \(x(t)^2 = \sum\limits_{j=1}^{i+1}x_j^2(t) + 2Y\) with \(Y = \sum\limits_{j=1}^{i+1}\sum\limits_{k>j}x_j(t) x_k(t)\) and \(\sum\limits_{t=0}^T\frac{Y}{x(t)^2} = 0\)

requirements:

Theorem 5.1 (Nyquist). A function with no frequencies larger than \(f_{\text{max}}\) is uniquely determined by a series of equidistant function values with spacing \(\tau < \frac1{2f_{\text{max}}}\).

Ensemble-EMD

add noise with zero mean and unit variance

averaging cancels noise

Local EMD

sifting in regions with large mean values

Hilbert Transformation

\(H\left\{ x(t) \right\} = \frac1\pi P\int\limits_{-\infty}^{\infty}\!\!\mathrm{d}\tau \, \frac{x(\tau)}{t-\tau}\) with Cauchy principal value \(P\) of the singular integral

\(z(t) = x(t) + \mathrm{i}\,H\{x(t)\} = a(t)e^{\mathrm{i}\int\!\!\mathrm{d}t\,\omega(t)}\) with \(\omega(t) = -\frac{\mathrm{d}\Theta}{\mathrm{d}t} = -\frac{\mathrm{d}}{\mathrm{d}t}\mathrm{atan}\left(\frac{x(t)}{H\{x(t)\}}\right)\)
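
A minimal sketch using scipy.signal.hilbert, which returns the analytic signal \(z(t)\) directly; from it the instantaneous amplitude and frequency follow (signal parameters are illustrative):

```python
# Sketch: analytic signal, instantaneous amplitude and frequency.
import numpy as np
from scipy.signal import hilbert

fs = 1000.0                                        # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)
x = np.cos(2 * np.pi * 50 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))

z = hilbert(x)                                     # analytic signal z(t)
a = np.abs(z)                                      # instantaneous amplitude a(t)
theta = np.unwrap(np.angle(z))                     # instantaneous phase
f_inst = np.diff(theta) / (2 * np.pi) * fs         # instantaneous frequency (~50 Hz)
print(a[:3], f_inst[:3])
```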

Neural Networks

Single-Layer Perceptron

network output signal \(y_i = g(h_i) = g\left( \sum\limits_k w_{ik}x_k \right)\), i.e. \(\vec{y} = g(W\vec{x})\), with local field \(h_i\)

\(g\): continuous, continuously differentiable, monotonic

linearly independent patterns are linearly separable; linear independence is sufficient but not necessary for convergence

error function \(E(\vec{w}) = \frac12\sum\limits_{i,\mu}(t_i^{\mu} - y_i^{\mu})^2 = \frac12\sum\limits_{i,\mu}(\delta_i^{\mu})^2\) with label vector \(\vec{t}\) and training pattern \(\mu\)

gradient descent \(\Delta w_{ik} = -\eta\frac{\partial E}{\partial w_{ik}} = \eta\sum\limits_{\mu}\delta_i^{\mu}x_k^{\mu}\)

non-linear: \(E = \frac12\sum\limits_{i,\mu}\left[ t_i^{\mu} - g\left(\sum\limits_k w_{ik}x_k^{\mu}\right) \right]^2\)

\(\frac{\partial E}{\partial w_{ik}} = -\sum\limits_{\mu}\left\{\left[ t_i^{\mu} - g(h_i^{\mu}) \right] g'(h_i^{\mu})x_k^{\mu} \right\} = -\sum\limits_{\mu}\delta_i^{\mu}g'(h_i^{\mu})x_k^{\mu}\)

Theorem 1.1 (SLP update rule). \[\Delta w_{ik} = \eta \sum\limits_{\mu}\delta_i^{\mu}g'(h_i^{\mu})x_k^{\mu} \qquad \text{with}\quad \delta_i^{\mu} = t_i^{\mu} - g(h_i^{\mu})\]
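
A minimal sketch of the SLP update rule with a logistic activation on a linearly separable toy problem (all choices illustrative); note \(g'(h) = y(1-y)\) for the logistic function.

```python
# Sketch: single-layer perceptron trained with the delta rule.
import numpy as np

def g(h):                                          # logistic activation
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)          # targets of a separable problem

w, eta = np.zeros(2), 0.5
for epoch in range(100):
    h = X @ w                                      # local fields h^mu
    y = g(h)
    delta = t - y                                  # delta^mu = t^mu - g(h^mu)
    w += eta * (delta * y * (1 - y)) @ X           # Delta w = eta sum_mu delta g'(h) x
print(w)
```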

Multi-Layer Perceptron

number of training samples \(P\), number of input neurons \(n\), number of output neurons \(m\)

number of hidden units \(k \approx \frac{P}{10(n+m)}\)

error output layer: \(\delta_i = (t_i - y_i)g'(h_i)\)

error hidden layer: \(\delta_j = g'(h_j)\sum\limits_i w_{ij}\delta_i\)

Theorem 2.1 (MLP update rule (hidden layer)). \[\Delta w_{jk} = \eta\,\delta_j x_k \qquad \text{with}\quad \delta_j = g'(h_j)\sum\limits_i w_{ij}\delta_i\]

stop criterion: \(\|\vec{\nabla}_w E\|\) sufficiently small or \(\|\vec{w}(t+1) - \vec{w}(t)\| \leq \varepsilon\)

for a function \(\vec{y}_j = \begin{pmatrix} y_{1j}\\ y_{2j}\\ \vdots \end{pmatrix} = \begin{pmatrix} F_1({\vec{x}_j})\\ F_2({\vec{x}_j})\\ \vdots \end{pmatrix} = \vec{F}(\vec{x}_j)\) that minimizes the mean squared error
\(L(\vec{F}) = \frac1{2N}\sum\limits_{j=1}^N \|\vec{t}_j - \vec{F}(\vec{x}_j)\|^2\), choose the neuron \(k\) for which \(F_k(\vec{x}) \geq F_j(\vec{x})\) holds \(\forall j\neq k\)
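
A minimal one-hidden-layer MLP sketch implementing the output- and hidden-layer deltas above with a logistic activation (layer sizes, data and learning rate are illustrative choices, biases omitted for brevity):

```python
# Sketch: backpropagation in a one-hidden-layer MLP.
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # inputs
T = ((X[:, 0] * X[:, 1]) > 0).astype(float)[:, None]   # XOR-like targets

n_in, n_hid, n_out = 2, 8, 1
W1 = rng.normal(scale=0.5, size=(n_in, n_hid))         # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(n_hid, n_out))        # hidden -> output weights
eta = 0.1

for epoch in range(2000):
    H = g(X @ W1)                                      # hidden activations
    Y = g(H @ W2)                                      # network outputs
    delta_out = (T - Y) * Y * (1 - Y)                  # output-layer delta
    delta_hid = (delta_out @ W2.T) * H * (1 - H)       # hidden-layer delta
    W2 += eta * H.T @ delta_out                        # Delta w = eta * delta * y
    W1 += eta * X.T @ delta_hid
print(np.mean((Y > 0.5) == (T > 0.5)))                 # training accuracy
```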

Self-Organizing Maps

task: map \(m\)-dimensional data onto \(n\) dimensions (typically \(n \in \left\{1,2\right\}\))

weight vector \(\vec{w}\): Euclidean point in input space

\(h_i = \sum\limits_j w_{ij}x_j = \vec{w}_i \vec{x}\)

if the weight vectors \(\vec{w}_i\) are normalized: \(\Delta w_{i*j} = \eta (x_j^{\mu} - w_{i*j})\)

else: \(\Delta w_{i*j} = \eta \left(\frac{x_j^{\mu}}{\sum\limits_l x_l^{\mu}} - w_{i*j}\right)\)

binary: \(\Delta w_{i*j} = \eta(y_i x_j - y_i w_{ij})\), Hebb minus decay term, \(y_i = 1\) if \(i = i*\) else \(0\)

with normalized vectors \(w_{i*}(t+1) = \frac{\vec{w}_{i*}(t) + \eta(\vec{x} - \vec{w}_{i*}(t))}{\|\vec{w}_{i*}(t) + \eta(\vec{x} - \vec{w}_{i*}(t))\|}\)

\(\eta(t) = \eta_0 t^{-\alpha}\), \(\eta(t) = \eta_0(1-\alpha t)\), \(\alpha\leq 1\)

stimulus: \(y_j(t+1) = f\left( \sum\limits_{l=0}^p w_{jl}x_l + \eta\sum\limits_{k=-K}^{K} c_{jk}y_{j+k}(t) \right)\) with Mexican-hat shaped lateral couplings \(c_{jk}\)

\(g(y_j)\) binary, \(h(i,i*)\) neighborhood

\(y_j = \begin{cases} 1 & j\in U\\ 0 & \text{else}\end{cases}\), \(\qquad g(y_j) = \begin{cases} \eta & j\in U\\ 0 & \text{else}\end{cases}\), \(\qquad h(i,i*) = \exp\left( -\frac{|\vec{r}_i - \vec{r}_{i*}|^2}{2\sigma^2} \right)\)

Theorem 3.1 (SOM update rule). \[\begin{aligned} \vec{w}_j(t+1) &= \vec{w}_j(t) + \eta(t)h(i,i*)(\vec{x}-\vec{w}_j(t)) \nonumber\\ \eta(t), \sigma(t) &\sim \begin{cases}\exp\left( -\frac{t}{\tau} \right)\\ t^{-\alpha}\end{cases} \qquad \text{with}\quad 0\leq\alpha\leq 1 \end{aligned}\]
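
A minimal one-dimensional SOM sketch implementing the update rule of Theorem 3.1 with a Gaussian neighbourhood and exponentially decaying \(\eta(t)\), \(\sigma(t)\) (all parameters illustrative):

```python
# Sketch: 1-D self-organizing map trained on 2-D data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 2))              # input data
n_nodes = 20
W = rng.uniform(0, 1, size=(n_nodes, 2))           # weight vectors in input space
r = np.arange(n_nodes)                             # node positions on the 1-D map

eta0, sigma0, tau = 0.5, 5.0, 1000.0
for t, x in enumerate(X):
    eta = eta0 * np.exp(-t / tau)
    sigma = sigma0 * np.exp(-t / tau)
    i_star = np.argmin(np.linalg.norm(W - x, axis=1))     # best-matching unit i*
    h = np.exp(-(r - r[i_star]) ** 2 / (2 * sigma ** 2))  # neighbourhood h(i, i*)
    W += eta * h[:, None] * (x - W)                        # SOM update rule
print(W[:5])
```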

Deep Learning

Recurrent Neural Networks

input layer \(W\): \(u_t \leftarrow W_{hx}x_t + U_{hh}h_{t-1}\)
hidden layer \(U\): \(h_t \leftarrow g_h(u_t)\), \(h_0 = g(Wx_0)\)
output layer \(V\): \(o_t \leftarrow V_{oh}h_t\), \(y_t \leftarrow g_y(o_t)\)
loss \(L = (y-\hat{y})^2\)
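
A minimal sketch of this forward pass with \(g_h = \tanh\) and an identity output activation (weight shapes and sizes are illustrative):

```python
# Sketch: forward pass of a simple recurrent network.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 2, 10
W = rng.normal(scale=0.3, size=(n_hid, n_in))      # input -> hidden (W_hx)
U = rng.normal(scale=0.3, size=(n_hid, n_hid))     # hidden -> hidden (U_hh)
V = rng.normal(scale=0.3, size=(n_out, n_hid))     # hidden -> output (V_oh)

x = rng.normal(size=(T, n_in))                     # input sequence
h = np.zeros(n_hid)
ys = []
for t in range(T):
    u = W @ x[t] + U @ h                           # u_t
    h = np.tanh(u)                                 # h_t = g_h(u_t)
    o = V @ h                                      # o_t
    ys.append(o)                                   # y_t = g_y(o_t), here identity
print(np.array(ys).shape)                          # (T, n_out)
```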

back-propagation through time:

Autoencoder

\(\tilde{x} = UVx\)

PCA: \(U = V^T\)

Error: \(\sum\limits_{m=1}^M (\tilde{x}^{(m)} - x^{(m)})^2\)

Deep Autoencoder

Sparse Autoencoder

Restricted Boltzmann Machines

Gibbs distribution \(p(x) = \frac{\exp(-E(x))}{\sum\limits_x\exp(-E(x))}\)

energy: \(E(v,h) = -(h^TWv + b^Tv + c^Th)\)

conditional independence: \(p(h|v) = \prod\limits_i p(h_i|v)\), \(p(v|h) = \prod\limits_j p(v_j|h)\)

\(P(h_i=1|v) = \sigma(c_i + \sum w_{ij}v_j)\)

\(P(v_j=1|h) = \sigma(b_j + \sum w_{ij}h_i)\)

positive phase: use probability to sample \(h_j\)

\(h_j \leftarrow \begin{cases}1 & P(h_j=1|v) > \text{rand}(0,1)\\0 & \text{else}\end{cases}\)

negative phase: reconstruct \(\hat{v}_j\) from \(h\), then \(\hat{h}_i\) from \(\hat{v}\)

Theorem 7.1 (RBM update rule). \[\begin{aligned} \Delta W &= \eta(vh^T - \hat{v}\hat{h}^T) \nonumber\\ \Delta b &= \eta(v - \hat{v}) \nonumber\\ \Delta c &= \eta(h - \hat{h}) \end{aligned}\]
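
A minimal contrastive-divergence (CD-1) sketch for a binary RBM, following the positive/negative phases and the update rule of Theorem 7.1 (layer sizes, learning rate, and the single training vector are illustrative):

```python
# Sketch: CD-1 training steps for a binary RBM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_vis, n_hid, eta = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_hid, n_vis))
b, c = np.zeros(n_vis), np.zeros(n_hid)            # visible / hidden biases

v = rng.integers(0, 2, size=n_vis).astype(float)   # one binary training vector

for step in range(1000):
    # positive phase: sample h from P(h = 1 | v)
    p_h = sigmoid(c + W @ v)
    h = (p_h > rng.random(n_hid)).astype(float)
    # negative phase: reconstruct v_hat, then h_hat
    p_v_hat = sigmoid(b + W.T @ h)
    v_hat = (p_v_hat > rng.random(n_vis)).astype(float)
    p_h_hat = sigmoid(c + W @ v_hat)
    # update rule (Theorem 7.1)
    W += eta * (np.outer(p_h, v) - np.outer(p_h_hat, v_hat))
    b += eta * (v - v_hat)
    c += eta * (p_h - p_h_hat)
print(np.round(sigmoid(b + W.T @ h), 2))           # reconstruction probabilities
```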

Deep Belief Networks

Convolutional Neural Networks

Architectures

Graph Neural Networks

\(V\) vertices \(|V| = N\)

edges \(E\), \(e_{nn'} = (v_n,v_{n'})\)

\(A\in \mathbb{R}^{N\times N}\), \(A_{nn'} = \begin{cases}w_{nn'} > 0 \qquad \text{if}\quad e_{nn'}\in E \\ 0 \qquad \text{else}\end{cases}\)

degree matrix \(D\) with \(d_{nn'} = d_{nn}\) if \(n = n'\) else \(0\), \(d_{nn} = \sum\limits_{n'}A_{nn'}\)

Laplacian \(L = D-A\), \(L = U\Lambda U^T\), \(M = D^{-\frac12}U\)

\(L_n = D^{-\frac12}LD^{-\frac12} = M\Lambda M^T\)

transition matrix \(L_{\text{tr}} = D^{-1}A\), random-walk Laplacian \(L_{\text{rw}} = \mathbb{1} - L_{\text{tr}}\)

graph Fourier transform \(\tilde{x} = \mathrm{GFT}(x) = M^T x\)

\(F_{\Theta}\) kernel filter (diffusion), diagonal feature matrix \(\Theta\) \(\rightarrow\) convolution theorem

\(x\otimes F_{\Theta} = M\left( (M^T F_{\Theta})\circ(M^T x) \right) = (M\Theta M^T)x = M\Theta\tilde{x}\), \(\circ\) denotes the Hadamard product

\(\vec{\Theta} = \sum\limits_{k=0}^{K-1}\gamma_k\Lambda^k\)

if \(K = 2\): \(\gamma = \gamma_0 = -\gamma_1\)
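
A minimal sketch of a spectral graph convolution on a small example graph, building \(M = D^{-\frac12}U\), the graph Fourier transform, and a polynomial filter \(\Theta\) (graph, signal and coefficients are illustrative):

```python
# Sketch: graph Fourier transform and a polynomial spectral filter.
import numpy as np

A = np.array([[0, 1, 1, 0],                        # adjacency matrix of a small graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))                         # degree matrix
L = D - A                                          # graph Laplacian
lam, U = np.linalg.eigh(L)                         # L = U Lambda U^T
M = np.diag(np.diag(D) ** -0.5) @ U                # M = D^{-1/2} U

x = np.array([1.0, 0.0, 2.0, -1.0])                # signal on the nodes
x_tilde = M.T @ x                                  # graph Fourier transform

gamma = [1.0, -1.0]                                # polynomial filter with K = 2
Theta = sum(g * np.diag(lam ** k) for k, g in enumerate(gamma))
y = M @ Theta @ x_tilde                            # filtered signal M Theta GFT(x)
print(y)
```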

Reinforcement Learning

Basics