Learn \(p_X\) given a set of training data \(x_1, \dots, x_n\)
Weighted sum of multiple simpler distributions, e.g., Normal \[p_X(x) = \sum_{k=1}^K w_k \times \text{Normal}(x; \mu_k, \sigma_k)\]
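For instance, a minimal numpy sketch of evaluating such a mixture density (the weights, means, and scales below are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import norm

# Illustrative K = 2 mixture; weights must sum to 1
w = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sigma = np.array([0.5, 1.5])

def mixture_pdf(x):
    # p_X(x) = sum_k w_k * Normal(x; mu_k, sigma_k)
    x = np.atleast_1d(x)
    return np.sum(w * norm.pdf(x[:, None], loc=mu, scale=sigma), axis=-1)

print(mixture_pdf(np.array([-2.0, 0.0, 1.0])))
```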
Map \(p_X\) to a base distribution \(p_Z\) through some operation \(g\)
\[ x \sim g(z) \text{ where } z \sim p_Z \]
Built on invertible transformations of random variables
\[ X \xrightarrow{\;f\;} Z \qquad X \xleftarrow{\;f^{-1}\;} Z \]
Change of variables formula
\[ p_X(x) = p_Z(f(x)) \left| \det{J}_f(x) \right| \]
\[Z \sim \text{Uniform}(0, 1)\]
\[X = 2Z - 1\]
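As a quick numerical sanity check (a sketch, not part of the slides), sampling this transformation recovers the expected density of \(1/2\) on \([-1, 1]\):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=100_000)  # Z ~ Uniform(0, 1)
x = 2.0 * z - 1.0                        # X = 2Z - 1

# Change of variables: f(x) = (x + 1) / 2, |df/dx| = 1/2,
# so p_X(x) = p_Z(f(x)) * 1/2 = 1/2 on [-1, 1]
hist, _ = np.histogram(x, bins=20, range=(-1, 1), density=True)
print(hist.round(2))  # all bins close to 0.5
```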
\[f: Z = a X + b\]
\[p_X(x) = p_Z(f(x)) \times \left| a \right|\]
\[ \scriptsize \begin{aligned} p_Z(z) & = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2 \right) \\[10pt] f: Z & = \frac{(X - \mu)}{\sigma} \\ \end{aligned} \]
\[ \scriptsize \begin{aligned} p_X(x) & = p_Z(f(x)) \times \left| a \right| \\[10pt] & = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} f(x)^2 \right) \times \frac{1}{\sigma} \\[10pt] & = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2 \right) \end{aligned} \]
\[ p_X(x) = p_Z(f(x)) \left| \frac{d}{dx} f(x) \right| \]
\[ \scriptsize \begin{aligned} f: Z & = \log(X) \\[10pt] p_Z(z) & = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2 \right) \end{aligned} \]
\[ \scriptsize \begin{aligned} \frac{d}{dx} f(x) & = \frac{d}{dx} \log(x) = \frac{1}{x} \\[10pt] p_X(x) & = \frac{1}{x\sqrt{2\pi}} \exp\left(-\frac{1}{2} \log(x)^2\right) \end{aligned} \]
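A small numerical check of this derived density against scipy's standard log-normal (illustrative only):

```python
import numpy as np
from scipy.stats import lognorm

x = np.array([0.5, 1.0, 2.0, 5.0])

# Density derived via change of variables: f(x) = log(x), |df/dx| = 1/x
p_x = 1.0 / (x * np.sqrt(2 * np.pi)) * np.exp(-0.5 * np.log(x) ** 2)

# scipy's lognorm with shape s = 1 is exp(Z) for Z ~ Normal(0, 1)
print(np.allclose(p_x, lognorm.pdf(x, s=1)))  # True
```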
\[ p_X(x) = p_Z(f(x)) \left| \det{J}_f(x) \right| \]
\[ J_f(x) = \begin{bmatrix} \frac{\partial z_1}{\partial x_1} & \dots & \frac{\partial z_1}{\partial x_K} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_K}{\partial x_1} & \dots & \frac{\partial z_K}{\partial x_K} \end{bmatrix} \]
\[f\left(\begin{bmatrix}x_1 \\ x_2\end{bmatrix}\right) = \begin{bmatrix} x_1^2 x_2 \\ 3x_1 + \sin x_2 \end{bmatrix} = \begin{bmatrix}z_1 \\ z_2\end{bmatrix}\]
\[J_f(x) = \begin{bmatrix} \frac{\partial z_1}{\partial x_1} & \frac{\partial z_1}{\partial x_2} \\ \frac{\partial z_2}{\partial x_1} & \frac{\partial z_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1x_2 & x_1^2 \\ 3 & \cos x_2 \end{bmatrix} \]
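To make this concrete, a finite-difference check of the analytic Jacobian above (a plain-numpy sketch):

```python
import numpy as np

def f(x):
    return np.array([x[0] ** 2 * x[1], 3 * x[0] + np.sin(x[1])])

def jac_analytic(x):
    return np.array([[2 * x[0] * x[1], x[0] ** 2],
                     [3.0, np.cos(x[1])]])

def jac_numeric(x, eps=1e-6):
    # Central finite differences; column j approximates d f / d x_j
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([1.3, -0.7])
print(np.allclose(jac_analytic(x), jac_numeric(x), atol=1e-5))  # True
```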
\[ p_X(x) = p_Z(f(x)) \left| \det{J}_f(x) \right| \]
Define \(f\) as a neural network with trainable weights \(\phi\)
Maximum likelihood (or rather: minimizing the negative log likelihood)
\[ \arg \min_\phi \; - \sum_{i=1}^n \left[ \log p_Z(f(x_i \mid \phi)) + \log \left| \det{J}_f(x_i \mid \phi) \right| \right] \]
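A minimal PyTorch sketch of this objective, assuming a callable `f` that returns both \(z = f(x)\) and \(\log|\det J_f(x)|\) (the demo transformation is a fixed elementwise affine map, chosen purely for illustration):

```python
import math
import torch

base = torch.distributions.Normal(0.0, 1.0)  # p_Z

def nll(f, x):
    # Negative log likelihood: f maps x to (z, log|det J_f(x)|), both depending on phi
    z, logdet = f(x)
    return -(base.log_prob(z).sum(dim=-1) + logdet).mean()

# Tiny demo with a fixed elementwise affine f: z = (x - 1) / 2, so log|det J| = -d * log(2)
def f_affine(x):
    return (x - 1.0) / 2.0, torch.full((x.shape[0],), -x.shape[1] * math.log(2.0))

print(nll(f_affine, torch.randn(8, 2)))
```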
Invertible and differentiable functions are “closed” under composition
\[ f = f_L \circ f_{L-1} \circ \dots \circ f_1 \]
\[ X \xrightarrow{\;f_1\;} \cdot \xrightarrow{\;f_2\;} \cdot \xrightarrow{\;f_3\;} Z \]
To invert a flow composition, we invert individual flows and run them in the opposite order
\[ f^{-1} = f_1^{-1} \circ f_2^{-1} \circ \dots \circ f_L^{-1} \]
\[ X \xleftarrow{\;f_1^{-1}\;} \cdot \xleftarrow{\;f_2^{-1}\;} \cdot \xleftarrow{\;f_3^{-1}\;} Z \]
Chain rule \[ \left| \det{J}_f(x) \right| = \left| \det \prod_{l=1}^L J_{f_l}(x^{(l-1)})\right| = \prod_{l=1}^L \left| \det{J}_{f_l}(x^{(l-1)})\right|, \quad x^{(0)} = x, \; x^{(l)} = f_l(x^{(l-1)}) \]
If we have a Jacobian for each individual transformation, then we have a Jacobian for their composition \[ \arg \min_\phi \; - \sum_{i=1}^n \left[ \log p_Z(f(x_i \mid \phi)) + \sum_{l=1}^L \log \left| \det{J}_{f_l}(x_i \mid \phi) \right| \right] \]
\[ f(x) = Ax + b \]
Inverse: \(f^{-1}(z) = A^{-1}(z - b)\)
Jacobian: \(\left| \det{J}_f(x) \right| = \left| \det{A} \right|\)
Limitations: computing \(\det{A}\) and \(A^{-1}\) costs \(O(K^3)\) in the dimension \(K\), and a composition of linear maps is itself linear, so expressivity is limited; a minimal sketch of the linear flow itself follows below
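A minimal numpy sketch of such a linear flow (the matrix \(A\), shift \(b\), and standard-Normal base are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K = 3
A = rng.normal(size=(K, K))                    # assumed invertible (true almost surely)
b = rng.normal(size=K)
base = multivariate_normal(mean=np.zeros(K))   # p_Z = Normal(0, I)

def log_prob(x):
    # z = f(x) = A x + b;  log p_X(x) = log p_Z(z) + log|det A|
    z = A @ x + b
    _, logabsdet = np.linalg.slogdet(A)
    return base.logpdf(z) + logabsdet

def sample():
    # x = f^{-1}(z) = A^{-1}(z - b), with z ~ p_Z
    z = rng.normal(size=K)
    return np.linalg.solve(A, z - b)

print(log_prob(sample()))
```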
\[ J_f = \begin{bmatrix} \text{I} & 0 \\ \frac{\partial}{\partial x_A}f(x_B \mid \theta(x_A)) & J_f(x_B \mid \theta(x_A)) \end{bmatrix} \]
\[ \det{J}_f = \det(\text{I}) \times \det{J}_f(x_B \mid \theta(x_A)) = \det{J}_f(x_B \mid \theta(x_A)) \]
\(\theta(x_A)\): Trainable coupling networks, e.g., MLP
Linear (affine) transform function \(f(x_B\mid\theta(x_A)) = \frac{x_B - \mu(x_A)}{\sigma(x_A)}\)
Log-determinant of the Jacobian: \(\log \left| \det{J}_f \right| = -\sum_j \log{\sigma_j(x_A)}\)
Build your own affine coupling normalizing flow!
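One possible starting point, as a compact PyTorch sketch (the layer sizes, even input dimension, \(\log\sigma\) parameterization, and toy ring-shaped data are illustrative choices, not prescribed by the slides):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """z_A = x_A,  z_B = (x_B - mu(x_A)) / sigma(x_A),  log|det J| = -sum log sigma(x_A)."""

    def __init__(self, dim, hidden=64, flip=False):
        super().__init__()
        assert dim % 2 == 0, "sketch assumes an even dimension"
        self.flip = flip
        half = dim // 2
        # theta(x_A): trainable coupling network (an MLP) producing mu and log sigma for x_B
        self.net = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * half))

    def _split(self, x):
        a, b = x.chunk(2, dim=-1)
        return (b, a) if self.flip else (a, b)

    def _merge(self, a, b):
        return torch.cat((b, a), dim=-1) if self.flip else torch.cat((a, b), dim=-1)

    def forward(self, x):
        xa, xb = self._split(x)
        mu, log_sigma = self.net(xa).chunk(2, dim=-1)
        zb = (xb - mu) * torch.exp(-log_sigma)
        logdet = -log_sigma.sum(dim=-1)          # log|det J_f| = -sum_j log sigma_j(x_A)
        return self._merge(xa, zb), logdet

    def inverse(self, z):
        za, zb = self._split(z)
        mu, log_sigma = self.net(za).chunk(2, dim=-1)
        return self._merge(za, zb * torch.exp(log_sigma) + mu)

class CouplingFlow(nn.Module):
    def __init__(self, dim=2, n_layers=6):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=bool(l % 2)) for l in range(n_layers)])
        self.base = torch.distributions.Normal(0.0, 1.0)  # factorized standard Normal p_Z

    def log_prob(self, x):
        # log p_X(x) = log p_Z(f(x)) + sum_l log|det J_{f_l}|
        logdet = x.new_zeros(x.shape[0])
        for layer in self.layers:                 # f = f_L o ... o f_1
            x, ld = layer(x)
            logdet = logdet + ld
        return self.base.log_prob(x).sum(dim=-1) + logdet

    def sample(self, n):
        z = self.base.sample((n, self.dim))
        for layer in reversed(self.layers):       # f^{-1} = f_1^{-1} o ... o f_L^{-1}
            z = layer.inverse(z)
        return z

# Train by maximum likelihood (negative log likelihood) on toy ring-shaped data
flow = CouplingFlow(dim=2, n_layers=6)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
angle = torch.rand(2000) * 2 * torch.pi
data = torch.stack([torch.cos(angle), torch.sin(angle)], dim=-1) * 2 + 0.1 * torch.randn(2000, 2)

for step in range(2000):
    batch = data[torch.randint(0, len(data), (256,))]
    loss = -flow.log_prob(batch).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(flow.sample(5))
```

Alternating which half passes through unchanged (the `flip` flag) ensures that every dimension gets transformed somewhere in the composition.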
Essentials from an extensive tutorial by Lipman et al. (2024) available at https://neurips.cc/virtual/2024/tutorial/99531.
Figure: Lipman et al. (2024)
\[ \begin{aligned} \mathbb{E}_{t, X_t}|| u_{t,\theta}(X_t) - u_t(X_t) ||^2 \\ t\sim\text{Uniform}(0,1) \\ X_t \sim p_t(X_t) \end{aligned} \]
Figure: Lipman et al. (2024)
Linear probability path
\[X_t = (1-t) X_0 + t X_1\]
Velocity
\[u_t(X_t \mid X_1, X_0) = X_1 - X_0\]
Figure: Lipman et al. (2024)
\[ \begin{aligned} \mathbb{E}_{t, X_0, X_1}|| u_{t,\theta}\big(\underbrace{(1-t) X_0 + t X_1}_{X_t}\big) - (\underbrace{X_1 - X_0}_{u_t}) ||^2 \\ t\sim\text{Uniform}(0,1) \\ X_0 \sim p_0 \\ X_1 \sim p_1 \end{aligned} \]
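A minimal PyTorch sketch of this conditional flow matching objective with the linear path above (the velocity network architecture, toy target distribution, and Euler sampler are illustrative assumptions):

```python
import torch
import torch.nn as nn

# u_{t,theta}(x): a small MLP taking (x, t) and predicting a 2-D velocity
net = nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(),
                    nn.Linear(64, 64), nn.SiLU(),
                    nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def sample_p1(n):
    # Toy target p_1: two Gaussian blobs (illustrative stand-in for real data)
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    return centers[torch.randint(0, 2, (n,))] + 0.3 * torch.randn(n, 2)

for step in range(5000):
    x1 = sample_p1(256)                    # X_1 ~ p_1 (data)
    x0 = torch.randn(256, 2)               # X_0 ~ p_0 = Normal(0, I)
    t = torch.rand(256, 1)                 # t ~ Uniform(0, 1)
    xt = (1 - t) * x0 + t * x1             # linear probability path X_t
    target = x1 - x0                       # conditional velocity u_t(X_t | X_1, X_0)
    pred = net(torch.cat([xt, t], dim=-1))
    loss = ((pred - target) ** 2).sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sample: integrate dx/dt = u_{t,theta}(x) from t = 0 (noise) to t = 1 (data) with Euler steps
x = torch.randn(1000, 2)
with torch.no_grad():
    for k in range(100):
        t = torch.full((1000, 1), k / 100.0)
        x = x + net(torch.cat([x, t], dim=-1)) / 100.0
print(x.mean(dim=0), x.std(dim=0))
```

After training, sampling only requires integrating the learned velocity field forward in time, here with simple fixed-step Euler integration.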
Figure 6: Fjelde et al. (2024)
Amortized Bayesian Inference