Introduction
Shahab et al. (2024)
PyTorch
TensorFlow
JAX
keras: PyTorch, TensorFlow, or jax as a backend
numpy
Python
batch_size (think “sample size”)
timepoint, feature, variable, row, column, etc.
Regression + non-linear activation
Multiple regressions + non-linear activations
\[ \begin{aligned} z_k &= \sigma \Big(b_k + \sum_{j=1}^J W_{jk}x_j\Big)\\ z &= \sigma \Big(b + x W'\Big) \end{aligned} \]
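As a sketch in plain numpy (hypothetical sizes): each output unit \(z_k\) is its own regression on the shared inputs followed by a non-linearity, and the vectorised form \(z = \sigma(b + xW')\) computes all units at once.
Python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

J, K = 3, 4                          # J inputs, K output units (hypothetical sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(J,))            # one input sample
W = rng.normal(size=(J, K))          # weight W[j, k] connects input j to unit k
b = rng.normal(size=(K,))            # one bias per output unit

# unit by unit: z_k = sigma(b_k + sum_j W[j, k] * x[j])
z_units = np.array([sigmoid(b[k] + np.sum(W[:, k] * x)) for k in range(K)])
# all units at once: z = sigma(b + x W)
z_vec = sigmoid(b + x @ W)
np.allclose(z_units, z_vec)          # True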
Multiple “layers” of perceptrons
Function composition
\[ \begin{aligned} z & = f(x) \text{ where } f = f_L \circ f_{L-1} \circ \dots \circ f_1 \\ z & = f_L(\dots(f_2(f_1(x)))) \\ \end{aligned} \]
Python
import keras
network = keras.models.Sequential([
keras.Input((3,)),
keras.layers.Dense(4, activation="relu"), # 3 inputs, 4 outputs
keras.layers.Dense(4, activation="relu"), # 4 inputs, 4 outputs
keras.layers.Dense(2, activation="softmax") # 4 inputs, 2 outputs
])
network.summary()
x = keras.random.normal((100, 3))
x.shape # TensorShape([100, 3])
z = network(x)
z.shape # TensorShape([100, 2])
A composition of linear functions is itself a linear function
Non-linear activations introduce non-linearity
\(\rightarrow\) can approximate any non-linear function
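A small numerical check of this point (plain numpy, random matrices): two stacked linear layers collapse into a single linear layer, whereas a ReLU in between has no such equivalent.
Python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))                      # 5 samples, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(4,))
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=(2,))

# two linear layers ...
two_linear = (x @ W1 + b1) @ W2 + b2
# ... equal one linear layer with weights W1 @ W2 and bias b1 @ W2 + b2
one_linear = x @ (W1 @ W2) + (b1 @ W2 + b2)
np.allclose(two_linear, one_linear)              # True

# a ReLU in between breaks this collapse
relu = lambda a: np.maximum(a, 0.0)
nonlinear = relu(x @ W1 + b1) @ W2 + b2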
Often used for output range control
\[ \tanh{(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]
\[ \text{ReLU}(x) = \begin{cases} 0, x \leq 0 \\ x, x > 0\end{cases} \]
\[ \text{softplus}(x) = \log(1 + e^x) \]
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
\[ \text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}} \]
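All of these are available in keras; a minimal sketch evaluating them with the keras.ops namespace (Keras 3):
Python
import keras
from keras import ops

x = ops.convert_to_tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

ops.tanh(x)        # squashes values into (-1, 1)
ops.relu(x)        # max(0, x)
ops.softplus(x)    # smooth approximation of ReLU
ops.sigmoid(x)     # squashes values into (0, 1)
ops.softmax(x)     # non-negative, sums to 1 -- suitable for class probabilities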
\[ \mathcal{L}(x; \theta) \]
Minimise the loss with respect to the model parameters
\[ \operatorname*{argmin}_{\theta} \mathcal{L}(x; \theta) \]
\[ \theta_{n+1} = \theta_n - \gamma \nabla_\theta \mathcal{L}(x; \theta_n) \]
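A from-scratch sketch of this update rule on a toy quadratic loss (hypothetical numbers):
Python
# toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3)
grad = lambda theta: 2.0 * (theta - 3.0)

theta = 0.0     # initial parameter value
gamma = 0.1     # learning rate
for _ in range(100):
    theta = theta - gamma * grad(theta)    # theta_{n+1} = theta_n - gamma * g_n

theta           # close to the minimiser, 3.0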
\[ g_n = \nabla_\theta \mathcal{L}(x; \theta_n) \]
\[ G_n = G_{n-1} + g_n^2 \]
\[ \theta_{n+1} = \theta_n - \frac{\gamma}{\sqrt{G_n} + \epsilon} g_n \]
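The same toy loss with the AdaGrad-style accumulator above (a sketch, not the library implementation): the effective step size shrinks as squared gradients accumulate.
Python
import numpy as np

grad = lambda theta: 2.0 * (theta - 3.0)   # same toy quadratic loss as before

theta, gamma, eps = 0.0, 0.5, 1e-8
G = 0.0                                    # running sum of squared gradients
for _ in range(1000):
    g = grad(theta)
    G = G + g**2                           # G_n = G_{n-1} + g_n^2
    theta = theta - gamma / (np.sqrt(G) + eps) * g

theta                                      # approaches 3.0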
\[ m_n = \beta m_{n-1} + (1-\beta) \nabla_\theta \mathcal{L}(x; \theta_n) \]
\[ \theta_{n+1} = \theta_n - \gamma m_n \]
\[ g_n = \nabla_\theta \mathcal{L}(x; \theta_n) \] \[ \begin{aligned} m_n & = \beta_1 m_{n-1} + (1-\beta_1) g_n; & \hat{m}_n & = \frac{m_n}{1 - \beta_1^n} \\ v_n & = \beta_2 v_{n-1} + (1-\beta_2) g_n^2; & \hat{v}_n & = \frac{v_n}{1 - \beta_2^n}\\ \end{aligned} \]
\[ \theta_{n+1} = \theta_n - \frac{\gamma}{\sqrt{\hat{v}_n} + \epsilon} \hat{m}_n \]
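In practice these update rules are not written by hand; keras provides them as optimizers that plug into compile(). A minimal sketch (hypothetical learning rates), reusing the network defined earlier:
Python
import keras

sgd      = keras.optimizers.SGD(learning_rate=0.01)                  # plain gradient descent
momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)    # with momentum
adagrad  = keras.optimizers.Adagrad(learning_rate=0.01)
adam     = keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)

network.compile(optimizer=adam, loss="categorical_crossentropy")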
\[\nabla_\theta \mathcal{L}(x; \theta_n)\]
\[ \frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial z_L} \frac{\partial z_L}{\partial z_{L-1}} \dots \frac{\partial z_l}{\partial \theta_l} \]
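A sketch of this chain rule by hand for a tiny network with one weight per layer (plain numpy, hypothetical numbers), checked against a finite-difference approximation:
Python
import numpy as np

x, y = 1.5, 0.3                      # one input, one target (hypothetical values)
w1, w2 = 0.8, -1.2                   # one weight per layer

# forward pass
z1 = w1 * x                          # first layer
z2 = np.tanh(z1)                     # non-linear activation
out = w2 * z2                        # second layer
loss = (out - y) ** 2                # squared-error loss

# backward pass: dL/dw1 = dL/dout * dout/dz2 * dz2/dz1 * dz1/dw1
dL_dw1 = 2 * (out - y) * w2 * (1 - np.tanh(z1) ** 2) * x

# finite-difference check of the analytic gradient
eps = 1e-6
loss_eps = (w2 * np.tanh((w1 + eps) * x) - y) ** 2
dL_dw1, (loss_eps - loss) / eps      # the two values agree closely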
\(\rightarrow\) deeper networks add more factors to this product
Gradients can become excessively large or vanishingly small
Scale down gradients if they exceed a certain threshold
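keras optimizers expose this through the clipnorm and clipvalue arguments; a minimal sketch:
Python
import keras

# rescale a gradient whenever its norm exceeds 1.0
clipped_by_norm = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# alternatively, clip each gradient component to the range [-0.5, 0.5]
clipped_by_value = keras.optimizers.Adam(learning_rate=1e-3, clipvalue=0.5)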
Large networks tend to overfit
Add (weighted) norm of the parameters to the loss \[ \mathcal{L}(x, \theta) = \mathcal{L}_0(x, \theta) + \lambda ||\theta|| \]
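A minimal keras sketch of this idea, adding an L2 (squared-norm) penalty on one layer with a hypothetical strength \(\lambda = 10^{-4}\), via the regularizer arguments listed below:
Python
import keras

network = keras.Sequential([
    keras.Input((3,)),
    keras.layers.Dense(
        4, activation="relu",
        kernel_regularizer=keras.regularizers.l2(1e-4),   # penalty on the weights
        bias_regularizer=keras.regularizers.l2(1e-4)      # penalty on the biases
    ),
    keras.layers.Dense(2, activation="softmax")
])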
keras
kernel_regularizer: Weights
bias_regularizer: Bias
activity_regularizer: Layer output
During training, “turn off” each activation with a probability \(p\)
Python
network = keras.Sequential([
keras.Input((2,)),
keras.layers.Dense(64, activation="relu"),
keras.layers.Dropout(0.1),
keras.layers.Dense(64, activation="relu"),
keras.layers.Dropout(0.05),
keras.layers.Dense(10, activation="softmax")
])
x = keras.random.normal((100, 2))
network(x)                  # inference: dropout is inactive
network(x, training=True)   # training mode: dropout randomly zeroes activations
Examples:
\(\rightarrow\) leverage properties of the data to our advantage by building networks that make the correct assumptions
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\text{T}}}{\sqrt{d_k}}\right)V \]
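A sketch of this operation with keras.ops, including the \(1/\sqrt{d_k}\) scaling used in scaled dot-product attention (shapes are hypothetical):
Python
import keras
from keras import ops

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = ops.matmul(q, ops.transpose(k, axes=(0, 2, 1)))   # (batch, n_queries, n_keys)
    weights = ops.softmax(scores / d_k**0.5, axis=-1)          # each row sums to 1
    return ops.matmul(weights, v)                              # (batch, n_queries, d_v)

q = keras.random.normal((8, 5, 16))    # 8 samples, 5 queries, d_k = 16
k = keras.random.normal((8, 7, 16))    # 7 keys
v = keras.random.normal((8, 7, 32))    # 7 values, d_v = 32
attention(q, k, v).shape               # (8, 5, 32)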
Permutation invariant function: \(f(x) = f(\pi(x))\)
Embeddings of sets
\[ f(X = \{ x_i \}) = \rho \left( \sigma(\tau(X)) \right) \]
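One way to read this formula (an assumption about the notation): \(\tau\) is applied to every set element independently, \(\sigma\) is a permutation-invariant pooling such as a sum or mean, and \(\rho\) maps the pooled summary to the final embedding. A keras sketch along those lines:
Python
import keras
from keras import ops

class SetEmbedding(keras.layers.Layer):
    # rho(pool(tau(x_i))): invariant to the order of the set elements
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.tau = keras.layers.Dense(32, activation="relu")   # element-wise transform
        self.rho = keras.layers.Dense(16, activation="relu")   # acts on the pooled summary

    def call(self, x):                 # x: (batch, set_size, features)
        h = self.tau(x)                # applied to each element independently
        pooled = ops.mean(h, axis=1)   # permutation-invariant pooling
        return self.rho(pooled)        # (batch, 16)

embed = SetEmbedding()
x = keras.random.normal((8, 10, 3))    # 8 sets of 10 elements with 3 features each
shuffled = ops.flip(x, axis=1)         # reorder the elements within each set
ops.max(ops.abs(embed(x) - embed(shuffled)))   # ~0: the embedding ignores element order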
Amortized Bayesian Inference