Matrix scaling

In the paper A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices, Richard Sinkhorn provided a proof of the following theorem:
For any square matrix A with strictly positive entries, there exist diagonal matrices with positive entries D₁ and D₂ such that D₁AD₂ is a doubly stochastic matrix.
Here we explore this statement and its generalizations.
0. Sinkhorn iteration
A not-so-obvious observation about the above statement is that we can arrive at such a matrix D₁AD₂ by iteratively scaling the rows and columns of A:
import torch
dim = 5

# matrix with strictly positive entries
matrix = torch.randn(dim, dim).exp()
print(f'Col sums of matrix: {matrix.sum(dim=0).half().tolist()}')
print(f'Row sums of matrix: {matrix.sum(dim=1).half().tolist()}')

for i in range(10):
    # normalize rows, then columns
    matrix /= matrix.sum(dim=1, keepdim=True)
    matrix /= matrix.sum(dim=0, keepdim=True)
    if i % 3 == 0:
        print(f'Iteration {i}:\n'
              f'\tRow sums: {matrix.sum(dim=1).half().tolist()},\n'
              f'\tcol sums: {matrix.sum(dim=0).half().tolist()}')

##############
### Output ###
##############
Col sums of matrix: [10.890625, 13.4921875, 9.4296875, 9.0859375, 9.6953125]
Row sums of matrix: [12.9921875, 14.390625, 10.6953125, 8.2578125, 6.2578125]
Iteration 0: 
	Row sums: [0.94140625, 1.009765625, 1.0732421875, 0.95556640625, 1.01953125],
	col sums: [1.0, 1.0, 1.0, 1.0, 1.0]
...
...
Iteration 9: 
	Row sums: [1.0, 1.0, 1.0, 1.0, 1.0],
	col sums: [1.0, 1.0, 1.0, 1.0, 1.0]
Each step of row normalization is equivalent to left-multiplication by a diagonal matrix with positive entries. Each step of column normalization is equivalent to right-multiplication by a diagonal matrix with positive entries.
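As a quick sanity check (a minimal sketch, independent of the snippet above), we can verify both equivalences numerically:
import torch

A = torch.randn(4, 4).exp()

# row normalization == left-multiplication by a diagonal matrix of inverse row sums
D_row = torch.diag(1.0 / A.sum(dim=1))
assert torch.allclose(D_row @ A, A / A.sum(dim=1, keepdim=True))

# column normalization == right-multiplication by a diagonal matrix of inverse column sums
D_col = torch.diag(1.0 / A.sum(dim=0))
assert torch.allclose(A @ D_col, A / A.sum(dim=0, keepdim=True))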
1. Proof of the above theorem
We prove uniqueness first, then existence.
1.1. Uniqueness
We want to prove that if for a matrix M with strictly positive entries there exist two pairs of diagonal matrices C₁, C₂ and D₁, D₂ such that C₁MC₂ and D₁MD₂ are equal to doubly stochastic matrices S₁ and S₂, then C₁ = kD₁ and C₂ = k⁻¹D₂ for some positive constant k and S₁ = S₂. In other words, there is a unique way to scale the rows and columns of M to obtain a doubly stochastic matrix.
Note that \( S_1 = C_1MC_2, \ S_2 = D_1MD_2 \Rightarrow S_2 = \underbrace{\rose{D_1C_1^{-1}}}_{\rose{A}} S_1\underbrace{\amber{C_2^{-1}D_2}}_{\amber{B}} = \rose{A}S_1\amber{B} \)
To prove that \( C_1, D_1 \) and \( C_2, D_2 \) are equal up to a constant multiple, we simply need to prove that \( A \) and \( B \) are equal to the identity matrix up to a constant multiple. We argue by contradiction, assuming that they are not.
We start with the observation that for any doubly stochastic matrix \( S \), we have:
\( \underbrace{\langle \text{row/col of S}, \amber{v} \rangle}_{\amber{\text{convex sum of } v}} \in [v_{min}, v_{max}] \)
Applying this to any row of \( AS_1B \) gives us:
\( 1 = \sum_{k} a_ib_kS_{1ik} = a_i \underbrace{\amber{\sum_{k} b_k S_{1ik}}}_{\amber{\in [b_{min}, b_{max}]}} \)
... which implies \( \rose{a_i b_{min} \le 1 \ \forall i} \) and \( \amber{a_i b_{max} \ge 1 \ \forall i} \). But this means that the column of \( AS_1B \) corresponding to \( b_{min} \) has a sum ≤ 1 and the column corresponding to \( b_{max} \) has a sum ≥ 1:
\( \rose{\underbrace{\sum_{i} a_ib_{min}\,\text{col}^{min}_{i}}_{\le\ \sum_{i}\text{col}^{min}_{i}\ =\ 1}} \qquad \amber{\underbrace{\sum_{i} a_ib_{max}\,\text{col}^{max}_{i}}_{\ge\ \sum_{i}\text{col}^{max}_{i}\ =\ 1}} \)
Here \( \text{col}^{min} \) and \( \text{col}^{max} \) denote the columns of \( S_1 \) corresponding to \( b_{min} \) and \( b_{max} \). But \( AS_1B = S_2 \) is doubly stochastic, so both of these column sums must equal exactly 1, and since every entry of \( S_1 \) is strictly positive, the inequalities can only hold with equality.
This leads to the result that \( a_ib_{min} = a_ib_{max} = 1 \Rightarrow B = cI \) for some constant \( c \). Similar reasoning can be used to prove the same result for \( A \).
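Uniqueness has a practical consequence that we can check numerically: no matter how we pre-scale the rows and columns of a positive matrix M, the iteration ends up at the same doubly stochastic matrix. A minimal sketch (the sinkhorn helper below is our own illustration, not from the paper):
import torch

torch.manual_seed(0)

def sinkhorn(M, n_iters=200):
    # alternately normalize rows and columns until (approximately) doubly stochastic
    M = M.clone()
    for _ in range(n_iters):
        M /= M.sum(dim=1, keepdim=True)
        M /= M.sum(dim=0, keepdim=True)
    return M

M = torch.randn(5, 5).exp()
# pre-scale by arbitrary positive diagonal matrices C1, C2
C1 = torch.diag(torch.rand(5) + 0.1)
C2 = torch.diag(torch.rand(5) + 0.1)

S_from_M = sinkhorn(M)
S_from_scaled_M = sinkhorn(C1 @ M @ C2)
print(torch.allclose(S_from_M, S_from_scaled_M, atol=1e-6))  # expected: True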
1.2. Existence
Consider a matrix A whose columns have been scaled to sum to 1. Let the row sums of A be \( \{r_1, \ldots, r_n\} \), with minimum \( r_{min} \) and maximum \( r_{max} \). Since every column sums to 1, the entries of A sum to n, so the row sums average to 1 and \( r_{min} \le 1 \le r_{max} \). We show that when the rows of this matrix are scaled to sum to 1, the column sums lie in the range \( [1/r_{max}, 1/r_{min}] \), which also contains 1.
\( \begin{bmatrix} ... & a_{12} & ... \\ ... & a_{22} & ... \\ ... & a_{32} & ... \end{bmatrix} \xrightarrow{\text{row normalization}} \begin{bmatrix} ... & a_{12}/r_1 & ... \\ ... & a_{22}/r_2 & ... \\ ... & a_{32}/r_3 & ... \end{bmatrix} \)
In the example above, since \( \sum_i a_{i2} = 1 \), the second column sum after row normalization, \( \sum_i a_{i2}/r_i \), is a convex sum of \( \{1/r_1, \ldots, 1/r_n\} \). This is true for every column sum after row normalization. Thus the column sums \( \{ c_1, \ldots, c_n \} \) after row normalization lie in the range \( [1/r_{max}, 1/r_{min}] \). More specifically, \( \rose{c_{min} \ge 1/r_{max}} \) and \( \amber{c_{max} \le 1/r_{min}} \), hence \( c_{max}/c_{min} \le r_{max}/r_{min} \).
In the next time step, when the columns are scaled to sum to 1, the new row sums become convex sums of \( \{1/c_1, \ldots, 1/c_n\} \), and the same argument gives \( r_{max}/r_{min} \le c_{max}/c_{min} \) at time step t+1. This leads to the following insight:
At each step of row/column scaling, the ratio of the maximum to the minimum of these sums does not increase. Since this ratio is bounded below by 1, it must converge to a limit.
Since the average value of the row/column sums is equal to 1 at each time step and does not change, the row/column sums must converge to 1 (this is not a rigorous proof, but it can be made rigorous with some more work).
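We can watch this happen numerically. The sketch below tracks the ratio of the maximum to the minimum row sum after each full row+column scaling step; it never increases and drifts toward 1:
import torch

torch.manual_seed(0)
A = torch.randn(6, 6).exp()
A /= A.sum(dim=0, keepdim=True)  # start with columns summing to 1

prev_ratio = float('inf')
for step in range(20):
    A /= A.sum(dim=1, keepdim=True)  # rows now sum to 1
    A /= A.sum(dim=0, keepdim=True)  # columns now sum to 1
    row_sums = A.sum(dim=1)
    ratio = (row_sums.max() / row_sums.min()).item()
    assert ratio <= prev_ratio + 1e-6  # max/min ratio of the sums never increases
    prev_ratio = ratio

print(f'max/min ratio of row sums after 20 steps: {prev_ratio:.6f}')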
2. Generalization to rectangular matrices
The above idea of alternately scaling rows and columns to sum to 1 can be extended to rectangular matrices as well, and the proof is similar to the square case. The main difference is that, since the number of rows is not equal to the number of columns, the row sums and the column sums cannot both equal 1 at the end (the total of all entries cannot be both m and n). However, the row/column sums will still converge to a limit.
One can also make the rows and columns sum to any other value, say \( \alpha \), by scaling them to sum to \( \alpha \) instead of 1, or even to prescribed marginal vectors as in the snippet below. This is particularly useful when learning to map a probability distribution \( P \in R^m \) to another probability distribution \( Q \in R^n \) from samples, such that the mapping minimizes a given cost function.
n_cols, n_rows = 10, 4
matrix = torch.randn((n_rows, n_cols)).exp()

# source probability distribution
row_sums = torch.randn(n_rows,1).softmax(dim=0)

# target probability distribution
col_sums = torch.randn(1,n_cols).softmax(dim=1)

# alternately rescale columns and rows to match the target marginals
for index in range(100):
    matrix *= col_sums / matrix.sum(dim=0, keepdim=True)
    matrix *= row_sums / matrix.sum(dim=1, keepdim=True)
    row_sums_converged = torch.allclose(matrix.sum(dim=1, keepdim=True), row_sums)
    col_sums_converged = torch.allclose(matrix.sum(dim=0, keepdim=True), col_sums)
    if row_sums_converged and col_sums_converged:
        print(f'row and column sums converged at iteration {index}')
        break

##############
### Output ###
##############
row and column sums converged at iteration 7
This forms the basis of the Sinkhorn algorithm, which we will look into next.
3. Generalization to non-negative matrices
One last note about matrix scaling. The above idea of alternately scaling rows and columns of a square matrix to sum to 1 can be extended to non-negative matrices as well. However, it only works for indecomposable matrices.
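For instance (a sketch with an arbitrarily chosen zero pattern; convergence depends on where the zeros are, and the iteration can fail for decomposable matrices):
import torch

torch.manual_seed(0)
# non-negative matrix: strictly positive entries with a couple set to zero
A = torch.randn(5, 5).exp()
A[0, 1] = 0.0
A[3, 2] = 0.0

for _ in range(500):
    A /= A.sum(dim=1, keepdim=True)
    A /= A.sum(dim=0, keepdim=True)

print(A.sum(dim=1).half().tolist())  # row sums: close to 1 for this zero pattern
print(A.sum(dim=0).half().tolist())  # column sums: exactly 1 after the final column step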
Finally, it is time to look into the Sinkhorn distance and the Sinkhorn algorithm.