Datenanalyse und Stochastische Modellierung
11. Supervised Learning

### Training and Testing

• A good fit to the training data does not imply good predictive power
• How to ensure the model 'generalizes' to new data?
• Divide data into training set and test set
• Model has to be fitted without including information from the test set
• Only run training as long as the performance on a held-out validation set still improves; the test set stays untouched until the final evaluation
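A minimal sketch of such a split, assuming a three-way division into training, validation, and test data. The function name `split_data` and the fractions 60/20/20 are illustrative choices, not prescribed by the lecture.

```python
import random

def split_data(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and split them into training, validation, and test sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test

data = list(range(100))                # hypothetical data set of 100 samples
train, val, test = split_data(data)
```

The model is fitted on `train` only; `val` decides when to stop training, and `test` is evaluated once at the very end.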

### Linear Models

$\begin{array}{c c}x_1 & \searrow \\ x_2& \nearrow \end{array}\; w_1 x_1 + w_2 x_2 + b_1$

• There are different ways of fitting a model
• Example from exercise: Nelder-Mead, an algorithm for finding a local optimum (usually formulated as a minimum) of a nonlinear function of several parameters: iterate an ensemble of points (simplex); reflect, expand, or contract the worst point; otherwise shrink the simplex
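The reflect / expand / contract / shrink steps can be sketched as follows. This is a simplified variant (only inside contraction, fixed coefficients 1, 2, 0.5, 0.5) and not necessarily the version used in the exercise; the function name `nelder_mead` and the test function `bowl` are assumptions for illustration.

```python
def nelder_mead(f, start, step=1.0, iterations=200):
    """Simplified Nelder-Mead minimization of f, starting near `start`."""
    n = len(start)
    # Initial simplex: the start point plus one point shifted along each axis.
    simplex = [list(start)]
    for k in range(n):
        point = list(start)
        point[k] += step
        simplex.append(point)

    for _ in range(iterations):
        simplex.sort(key=f)  # best point first, worst point last
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        # Centroid of all points except the worst one.
        centroid = [sum(p[k] for p in simplex[:-1]) / n for k in range(n)]

        def along(t):
            """Point on the line through the centroid, pointing away from the worst point."""
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            # Reflection is the new best: try to go even further (expand).
            expanded = along(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(second_worst):
            simplex[-1] = reflected
        else:
            # Contract the worst point towards the centroid.
            contracted = along(-0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink all points halfway towards the best point.
                simplex = [best] + [
                    [b + 0.5 * (v - b) for b, v in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)

def bowl(p):
    """Hypothetical test function with its minimum at (1, -2)."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

minimum = nelder_mead(bowl, [0.0, 0.0])
```

No gradient is needed, which is why the method is convenient when the loss is not easily differentiable. To find a maximum, minimize the negated function.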

Gradient descent: minimize a function F with respect to a parameter a: $a_{n+1} = a_n - l \nabla F(a_n)$ with learning rate l

F is the loss function that compares the prediction x with the measurement z. E.g. for the squared error of a single sample, $F=[x(a)-z]^2$: $a_{n+1} = a_n - 2l [x(a_n)-z]\frac{\partial x}{\partial a}$

• stochastic: at each update step, minimize with one sample or a subset of samples (batch) instead of the full data set
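A minimal sketch of the update rule $a_{n+1} = a_n - 2l [x(a)-z]\frac{\partial x}{\partial a}$ in its stochastic form, assuming a one-parameter model $x(a) = a\,u$ with input u, so that $\partial x/\partial a = u$. The model, the data, and the name `sgd_fit_slope` are made up for illustration.

```python
import random

def sgd_fit_slope(inputs, targets, learning_rate=0.01, n_passes=50, seed=0):
    """Fit the slope a of the model x(a) = a * u, one sample per update step."""
    rng = random.Random(seed)
    a = 0.0
    indices = list(range(len(inputs)))
    for _ in range(n_passes):            # each pass visits every sample once
        rng.shuffle(indices)             # random order: the "stochastic" part
        for i in indices:
            prediction = a * inputs[i]   # x(a)
            dx_da = inputs[i]            # derivative of the prediction w.r.t. a
            a -= 2 * learning_rate * (prediction - targets[i]) * dx_da
    return a

inputs = [0.5, 1.0, 1.5, 2.0, 2.5]
targets = [3.0 * u for u in inputs]      # noise-free data with true slope 3
a_fit = sgd_fit_slope(inputs, targets)
```

With a learning rate that is too large the updates overshoot and diverge; with one that is too small convergence becomes slow.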

### The Perceptron

$x\;\begin{array}{c c c}\nearrow & \Theta (w_1 x + b_1) & \searrow \\ \searrow & \Theta(w_2 x + b_2) & \nearrow \end{array}\; w_3 \Theta(w_1 x + b_1) + w_4 \Theta(w_2 x + b_2) +b_3$
• There can be multiple inputs and outputs
• $\begin{array}{c c c c}x_1 &{-\rightarrow \atop \diagdown\nearrow} & \Theta (w_1 x_1 + w_2 x_2 + b_1) & \searrow \\ x_2& {\diagup\searrow \atop -\rightarrow} & \Theta(w_3 x_1 + w_4 x_2 + b_2) & \nearrow \end{array}\; w_5 \Theta(w_1 x_1 + w_2 x_2 + b_1) + w_6 \Theta(w_3 x_1 + w_4 x_2 + b_2) +b_3$
• Tasks: regression and classification
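A small sketch of the two-input perceptron as a classifier, assuming $\Theta$ is the Heaviside step function and that both hidden nodes receive both inputs $x_1$ and $x_2$. The weights are hand-picked (not trained) so that the network reproduces XOR; they are purely illustrative.

```python
def step(s):
    """Heaviside step function; the value at s == 0 is a convention (here 0)."""
    return 1.0 if s > 0 else 0.0

def perceptron(x1, x2, w, b):
    """Two inputs, two hidden step nodes, one linear output node."""
    w1, w2, w3, w4, w5, w6 = w
    b1, b2, b3 = b
    hidden1 = step(w1 * x1 + w2 * x2 + b1)
    hidden2 = step(w3 * x1 + w4 * x2 + b2)
    return w5 * hidden1 + w6 * hidden2 + b3

# Hand-picked parameters: hidden1 acts as OR, hidden2 as AND, output = OR - AND.
XOR_WEIGHTS = (1.0, 1.0, 1.0, 1.0, 1.0, -1.0)
XOR_BIASES = (-0.5, -1.5, 0.0)

xor_table = {
    (x1, x2): perceptron(x1, x2, XOR_WEIGHTS, XOR_BIASES)
    for x1 in (0, 1)
    for x2 in (0, 1)
}
```

XOR cannot be represented by a single linear model of $x_1$ and $x_2$, so the hidden layer is what makes this classification possible.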

### The Neural Network

Different activation functions: replace the step function $\Theta$ by $h(x)$

• tanh $h(x)=\tanh(x)$
• relu $h(x)=\left\lbrace\begin{array}{ll}0 & x\leq 0 \\ x & x>0 \end{array}\right.$
• softplus $h(x)=\log(1+e^x)$
• ...
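The three activation functions listed above, written out as a short sketch. The softplus is evaluated in a rearranged but mathematically equivalent form to avoid overflow for large x; that rearrangement is an implementation choice, not part of the lecture.

```python
import math

def tanh(x):
    """Hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: 0 for x <= 0, x otherwise."""
    return x if x > 0 else 0.0

def softplus(x):
    """Smooth version of relu: log(1 + exp(x)), rewritten to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

values = [(x, tanh(x), relu(x), softplus(x)) for x in (-2.0, 0.0, 2.0)]
```

Unlike the step function, these functions have a useful derivative (almost) everywhere, which gradient-based training requires.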

### More Layers

$x\;\begin{array}{c c c c c c c} & h(w_1 x + b_1) &-\rightarrow & h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) \\ \nearrow & & \diagdown\nearrow & & \searrow \\ \searrow & &\diagup\searrow & & \nearrow \\ & h(w_2 x + b_2) &-\rightarrow & h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) & \end{array}\; w_7 h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) + w_8 h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) + b_5$
• Input layer - hidden layers - output layer
• Here: feedforward neural network in contrast to recurrent neural networks
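A sketch of the forward pass through the network above (1 input, two hidden layers with 2 nodes each, 1 linear output), assuming $h = \tanh$. The helper `dense_layer` and all parameter values are hypothetical; they only serve to show how each layer feeds the next.

```python
import math

def dense_layer(inputs, weights, biases, activation):
    """Fully connected layer: one weighted sum plus bias per node, then the activation."""
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def identity(s):
    """The output node is linear, i.e. it applies no activation."""
    return s

# Hypothetical parameter values, named as in the formula above.
W1, W2, W3, W4, W5, W6, W7, W8 = 0.5, -1.0, 0.3, 0.8, -0.6, 0.2, 1.5, -0.7
B1, B2, B3, B4, B5 = 0.1, 0.0, -0.2, 0.4, 0.05

def forward(x, h=math.tanh):
    """Input layer -> hidden layer 1 -> hidden layer 2 -> output layer."""
    hidden1 = dense_layer([x], [[W1], [W2]], [B1, B2], h)
    hidden2 = dense_layer(hidden1, [[W3, W4], [W5, W6]], [B3, B4], h)
    (y,) = dense_layer(hidden2, [[W7, W8]], [B5], identity)
    return y

y = forward(0.7)
```

Information only flows from left to right here; a recurrent network would additionally feed node outputs back as inputs at the next step.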

### Back-Propagation

Initialize weights (e.g. with Gaussian distribution) and biases (e.g. as 0).

Perform stochastic gradient descent for the network. Compute the gradients sequentially via the chain rule, starting at the output layer and moving backwards towards the input.

Example:$y_i=w_3 h(w_1 x_i + b_1) + w_4 h(w_2 x_i + b_2) +b_3$

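Error: $E=\sum_{i}(y_i-z_i)^2$

Gradients, starting from the output layer (for stochastic gradient descent the sum runs only over the samples of the current batch):

$\frac{\partial E}{\partial b_3} = \sum_i 2(y_i-z_i)$

$\frac{\partial E}{\partial w_4} = \sum_i 2(y_i-z_i)h(w_2 x_i + b_2)$

$\frac{\partial E}{\partial w_3} = \sum_i 2(y_i-z_i)h(w_1 x_i + b_1)$

$\frac{\partial E}{\partial b_2} = \sum_i 2(y_i-z_i)w_4\frac{\partial h(w_2 x_i + b_2)}{\partial b_2}$

$...$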
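The chain-rule gradients can be checked numerically. The sketch below assumes $h=\tanh$ (so $h'(s) = 1-\tanh^2(s)$) and a single sample; the parameter values are arbitrary, and the finite-difference comparison is added here only as a sanity check.

```python
import math

def forward(params, x):
    """y = w3 h(w1 x + b1) + w4 h(w2 x + b2) + b3 with h = tanh."""
    w1, w2, w3, w4, b1, b2, b3 = params
    return w3 * math.tanh(w1 * x + b1) + w4 * math.tanh(w2 * x + b2) + b3

def gradients(params, x, z):
    """Analytic gradient of (y - z)^2 for one sample, output layer first."""
    w1, w2, w3, w4, b1, b2, b3 = params
    h1 = math.tanh(w1 * x + b1)
    h2 = math.tanh(w2 * x + b2)
    y = w3 * h1 + w4 * h2 + b3
    d_y = 2 * (y - z)                      # dE/dy
    d_b3 = d_y
    d_w3 = d_y * h1
    d_w4 = d_y * h2
    # Propagate back through the hidden nodes: tanh'(s) = 1 - tanh(s)^2.
    d_s1 = d_y * w3 * (1 - h1 ** 2)
    d_s2 = d_y * w4 * (1 - h2 ** 2)
    d_b1, d_b2 = d_s1, d_s2
    d_w1, d_w2 = d_s1 * x, d_s2 * x
    return [d_w1, d_w2, d_w3, d_w4, d_b1, d_b2, d_b3]

def numeric_gradients(params, x, z, eps=1e-6):
    """Central finite differences of the same error, one parameter at a time."""
    grads = []
    for k in range(len(params)):
        up, down = list(params), list(params)
        up[k] += eps
        down[k] -= eps
        e_up = (forward(up, x) - z) ** 2
        e_down = (forward(down, x) - z) ** 2
        grads.append((e_up - e_down) / (2 * eps))
    return grads

params = [0.4, -0.7, 1.2, 0.5, 0.1, -0.3, 0.2]   # arbitrary w1..w4, b1..b3
analytic = gradients(params, x=0.8, z=1.0)
numeric = numeric_gradients(params, x=0.8, z=1.0)
```

Note how `d_y` is computed once and reused by every deeper derivative; this reuse is what makes back-propagation efficient.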

### Epochs and early stopping

• Batch: subset of the data used for each update step
• Epoch: one pass through the whole training set, i.e. every batch has been used once
• Early stopping: instead of training for a fixed number of epochs, stop when a condition is fulfilled (e.g. the error on the validation set increases)
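A minimal sketch of an early-stopping loop. It assumes a `patience` parameter (stop after this many rounds without improvement), which is a common refinement and not part of the notes; the two callables and the error curve are stand-ins for a real training routine.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100, patience=3):
    """Train until the validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_epoch = 0
    stopped_at = 0
    for epoch in range(1, max_epochs + 1):
        stopped_at = epoch
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_error, best_epoch = error, epoch   # new best model so far
        elif epoch - best_epoch >= patience:
            break                                   # no improvement for too long
    return best_epoch, best_error, stopped_at

# Stand-in for a real model: a made-up validation error curve that first
# falls and then rises again as the model starts to overfit.
simulated_errors = iter([1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.9])
result = train_with_early_stopping(
    train_one_epoch=lambda: None,
    validation_error=lambda: next(simulated_errors),
)
```

In practice one would also store the parameters of the best epoch and restore them after stopping.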

### Deep Learning

• Neural networks with many layers
• Typically, not all layers are of the same type

### Convolutional Neural Networks

• Sometimes it is beneficial to link only neighboring nodes instead of all nodes
• Convolutional layer - filter over input data
• This establishes a context between nodes, e.g. for images, pixels should be connected if they are close to each other
• Networks for images also typically contain max-pooling layers
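A one-dimensional sketch of a convolutional layer followed by max pooling. Real image networks work in two dimensions and learn the filter weights; the signal and the fixed edge-detecting filter below are made up for illustration.

```python
def convolve1d(signal, kernel):
    """Slide the kernel over the signal (no padding); each output sees only neighbors."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]

signal = [0, 0, 1, 1, 1, 0, 0, 2, 2]
edge_filter = [-1, 1]                 # responds to increases between neighbors
features = convolve1d(signal, edge_filter)
pooled = max_pool1d(features, 2)
```

The same two filter weights are reused at every position, so the layer has far fewer parameters than a fully connected one.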

### Recurrent Neural Networks

$\begin{array}{rcl}\mathrm{Input}_1 \rightarrow \times w_1 \rightarrow +b_1 \rightarrow&h(\bullet)&\rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_1 \\&{| \atop w_2}&\\ \mathrm{Input}_2 \rightarrow \times w_1 \rightarrow&\stackrel{\downarrow}{+}&b_1 \rightarrow h(\bullet) \rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_2 \end{array}$
• Vanishing/exploding gradient problem
• Similar to AR(1) process
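The unrolled diagram can be sketched as a loop in which the hidden value, scaled by $w_2$, is added to the next input before the bias and the activation. The function name `rnn_outputs` and the parameter values are illustrative. With a linear activation the recurrence reduces to an AR(1)-like process driven by the inputs.

```python
import math

def rnn_outputs(inputs, w1, w2, w3, b1, b2, h=math.tanh):
    """Apply the same weights at every time step, feeding the hidden value back."""
    outputs = []
    feedback = 0.0                      # nothing is fed back before the first input
    for value in inputs:
        hidden = h(w1 * value + feedback + b1)
        outputs.append(w3 * hidden + b2)
        feedback = w2 * hidden          # passed on to the next time step
    return outputs

def linear(s):
    """Identity activation, used here only to expose the AR(1)-like behavior."""
    return s

# A single impulse decays geometrically with factor w2 = 0.5.
impulse_response = rnn_outputs([1.0, 0.0, 0.0, 0.0],
                               w1=1.0, w2=0.5, w3=1.0, b1=0.0, b2=0.0, h=linear)
```

The influence of an input after T steps scales like $w_2^T$ in this linear case, which vanishes for $|w_2|<1$ and explodes for $|w_2|>1$; the same factor appears in the gradients.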

### Long Short-Term Memory

Long-term memory L and short-term memory (= output) S; the current input is $I_i$; h is the sigmoid function

Forget gate

$L\rightarrow L h( w_1S+w_2I_i+b_1 )$

Input gate

$L\rightarrow L+ h( w_3S+w_4I_i+b_2 ) \tanh( w_5S+w_6I_i+b_3 )$

Output gate

$S\rightarrow h( w_7S+w_8I_i+b_4 ) \tanh(L)$
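The three gate updates applied in sequence for one input, as a scalar sketch. The function name `lstm_step` and the grouping of the parameters into tuples `w` and `b` are assumptions for illustration; the equations themselves are the ones given above.

```python
import math

def sigmoid(s):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def lstm_step(L, S, I, w, b):
    """One LSTM update: forget gate, input gate, output gate (all scalars)."""
    w1, w2, w3, w4, w5, w6, w7, w8 = w
    b1, b2, b3, b4 = b
    # Forget gate: keep a fraction of the long-term memory.
    L = L * sigmoid(w1 * S + w2 * I + b1)
    # Input gate: add a gated candidate value to the long-term memory.
    L = L + sigmoid(w3 * S + w4 * I + b2) * math.tanh(w5 * S + w6 * I + b3)
    # Output gate: new short-term memory from the updated long-term memory.
    S = sigmoid(w7 * S + w8 * I + b4) * math.tanh(L)
    return L, S

# With all parameters zero every gate is 0.5 and the candidate value is 0.
zero_w = (0.0,) * 8
zero_b = (0.0,) * 4
L_new, S_new = lstm_step(L=2.0, S=0.0, I=1.0, w=zero_w, b=zero_b)
```

Because L is only rescaled and incremented, not passed through an activation at every step, information can persist over many steps when the forget gate stays close to 1.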