Nice Jacobians for normalizing flows

We cover some nice Jacobians used in various implementations of normalizing flows. Here "nice" means the determinant of these Jacobians is easy to calculate. The key paradigm of normalizing flows is learning a transformation of input data \( y = f(x), \ x, y \in R^n \) such that the distribution of \( y \) is a known probability distribution (usually Gaussian or uniform). The density of \( x \) then follows from the change-of-variables formula: \[ p(x) = p(y) \times \underbrace{\lvert \det(\overbrace{\partial y / \partial x}^{\text{Jacobian}}) \rvert}_{\amber{\text{easy to calculate}}} \] We cover three transformations whose Jacobian determinants are simply the product of the diagonal (possibly block) elements of the Jacobian.
1. Autoregressive transformations
An autoregressive transformation computes each output from the current and preceding inputs only: \( y_i = f_i(x_1, \dots, x_i) \). This is arguably the simplest and most likely the earliest transformation used in normalizing flows. Its Jacobian is a lower triangular matrix since \( \amber{\partial y_i / \partial x_j = 0} \ \text{if} \ i \lt j \).
Jacobian of an autoregressive transformation is a lower triangular matrix.
The determinant of this Jacobian is the product of the diagonal elements, which is easy to calculate.
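As a concrete sketch (using a hypothetical masked linear map, not any particular published flow), the triangular structure can be made explicit in NumPy: for a linear autoregressive transform the Jacobian is the weight matrix itself, and its determinant reduces to a product over the diagonal.

```python
import numpy as np

# Illustrative masked linear autoregressive transform: y_i depends only on
# x_1..x_i, so the Jacobian is lower triangular.
rng = np.random.default_rng(0)
n = 4
W = np.tril(rng.normal(size=(n, n)))  # lower-triangular weights (the mask)
x = rng.normal(size=n)
y = W @ x                             # y_i = sum_{j <= i} W_ij x_j

J = W                                 # Jacobian of a linear map is W itself
det_full = np.linalg.det(J)           # O(n^3) in general
det_diag = np.prod(np.diag(J))        # O(n) for a triangular matrix
assert np.isclose(det_full, det_diag)
```

The assertion checks the claim above: for a triangular Jacobian, the full determinant and the product of diagonal entries agree.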
2. Affine transformations
The key idea here is to partition the vector \( x \in R^n \) into two parts \( x_A, x_B \in R^{n/2} \) of equal length. These two parts are transformed to \( y_A, y_B \in R^{n/2} \) in the following way: \[ \begin{align} y_A &= x_A \\ y_B &= \overbrace{\amber{\gamma(x_A) \odot x_B}}^{\amber{\text{elementwise scaling}}} \underbrace{\rose{\oplus \ \beta(x_A)}}_{\amber{\text{elementwise shifting}}} \end{align} \] This leads to a Jacobian that is even sparser than a lower triangular matrix. The determinant of this Jacobian is the product of the scaling parameters (which is also the product of the diagonal elements).
The diagonal of the Jacobian for \( x \in R^8 \) is \( (1, 1, 1, 1, \gamma_1, \gamma_2, \gamma_3, \gamma_4) \).
Determinant of Jacobian is just the product of scaling parameters \( \prod_i \gamma_i \).
This transformation was used in the paper Density estimation using Real NVP. So that every dimension is eventually transformed, the roles of \( x_A, x_B \) were swapped in the next coupling layer (i.e. \( x_B \) was used to calculate the affine parameters that scale and shift \( x_A \)).
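A minimal sketch of one coupling step, with toy functions standing in for the learned scale and shift networks (the names `s` and `b` are placeholders, not from Real NVP): the log-determinant is just a sum of log-scales, and the inverse is equally cheap because \( y_A = x_A \) is available to recompute the parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
x = rng.normal(size=n)
x_A, x_B = x[: n // 2], x[n // 2 :]

s = lambda u: np.tanh(u)   # toy "network" producing log-scales
b = lambda u: 0.5 * u      # toy "network" producing shifts

gamma = np.exp(s(x_A))     # elementwise scaling, positive by construction
y_A = x_A
y_B = gamma * x_B + b(x_A)

# log|det J| is the sum of log-scales -- no matrix determinant needed.
log_det = np.sum(s(x_A))
assert np.isclose(log_det, np.log(np.prod(gamma)))

# Inversion only needs (y_A, y_B): recompute the parameters from y_A.
x_B_rec = (y_B - b(y_A)) / np.exp(s(y_A))
assert np.allclose(x_B_rec, x_B)
```

Parameterizing the scale as \( \gamma = e^{s(x_A)} \) is a common choice because it keeps every scaling factor positive, so the transformation stays invertible for any network output.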
3. Channel-wise transformations
In this transformation, the input vector \( x \in R^n \) is partitioned into \( c \) channels of equal length \( C_i \in R^{n/c}, i \in \{1, \dots, c\} \) and each channel is linearly transformed by multiplying with a matrix \( W \in R^{(n/c) \times (n/c)} \).
\[ \underbrace{x_1, x_2}_{\amber{C_1}}, \ \underbrace{x_3, x_4}_{\rose{C_2}}, \ \dots, \ \underbrace{x_{n-1}, x_{n}}_{\amber{C_c}} \]\[ \underbrace{\amber{y_1, y_2}}_{\zinc{WC_1}}, \ \underbrace{\rose{y_3, y_4}}_{\zinc{WC_2}}, \ \dots, \ \underbrace{\amber{y_{n-1}, y_n}}_{\zinc{WC_c}} \]
Each channel \( C_i \in R^2 \) is linearly transformed by a matrix \( W \)
The Jacobian of this transformation is a block diagonal matrix where each block is the \( n/c \times n/c \) matrix \( W \). In the example below, \( x \in R^8 \) is partitioned into four channels.
The Jacobian for this example is a block diagonal matrix with four copies of \( W \) along the diagonal.
Determinant of Jacobian is \( (\det W)^c \).
This transformation was used in the paper Glow: Generative Flow with Invertible 1x1 Convolutions. In that paper, the linear transformation was applied at each spatial location of the input tensor of shape \( (c, h, w) \). In this case, the number of "channels" (in the sense used here) is \( h \times w \).
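The block-diagonal determinant identity can be checked directly for the \( x \in R^8 \), four-channel example above (a sketch, not Glow's actual convolution code): the full Jacobian is \( I_c \otimes W \), and its determinant equals \( (\det W)^c \).

```python
import numpy as np

# Channel-wise linear transform: x in R^8 split into c = 4 channels of
# length 2, each multiplied by the same 2x2 matrix W.
rng = np.random.default_rng(2)
n, c = 8, 4
k = n // c
W = rng.normal(size=(k, k))
x = rng.normal(size=n)

y = (x.reshape(c, k) @ W.T).reshape(n)  # apply W to every channel

# The full Jacobian is block diagonal with c copies of W: kron(I_c, W).
J = np.kron(np.eye(c), W)
assert np.allclose(J @ x, y)
assert np.isclose(np.linalg.det(J), np.linalg.det(W) ** c)
```

So only one small \( k \times k \) determinant ever needs to be computed, however many channels there are.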