An entropic generalization of Caffarelli's contraction theorem via covariance inequalities

The optimal transport map between the standard Gaussian measure and an $\alpha$-strongly log-concave probability measure is $\alpha^{-1/2}$-Lipschitz, as first observed in a celebrated theorem of Caffarelli. In this paper, we apply two classical covariance inequalities (the Brascamp-Lieb and Cram\'er-Rao inequalities) to prove a sharp bound on the Lipschitz constant of the map that arises from entropically regularized optimal transport. In the limit as the regularization tends to zero, we obtain an elegant and short proof of Caffarelli's original result. We also extend Caffarelli's theorem to the setting in which the Hessians of the log-densities of the measures are bounded by arbitrary positive definite commuting matrices.


Introduction
In [Caf00], Caffarelli proved the following seminal result.
Theorem 1 (Caffarelli's contraction theorem). Let $P = \exp(-V)$ and $Q = \exp(-W)$ be probability measures on $\mathbb{R}^d$ such that $\nabla^2 V \preceq \beta_V I$ and $\nabla^2 W \succeq \alpha_W I \succ 0$. Then the optimal transport map $\nabla\phi_0$ from $P$ to $Q$ is $\sqrt{\beta_V/\alpha_W}$-Lipschitz.
Here, $\phi_0 : \mathbb{R}^d \to \mathbb{R}$ is a convex function, known as a Brenier potential. The optimal transport map $\nabla\phi_0 : \mathbb{R}^d \to \mathbb{R}^d$ pushes forward $P$ to $Q$, in the sense that if $X$ is a random variable with law $P$, then $\nabla\phi_0(X)$ is a random variable with law $Q$. See Section 2.2 and the textbook [Vil03] for background on optimal transport.
Caffarelli's contraction theorem can be used to transfer functional inequalities, such as a Poincaré inequality, from the standard Gaussian measure on $\mathbb{R}^d$ to other probability measures [BGL14]. Towards this end, recent works have also constructed and studied alternative Lipschitz transport maps (e.g., [KM12, MS21, MS22, Nee22]), but the properties of the original optimal transport map remain of fundamental interest, with many questions unresolved [Val07, CFJ17].
Indeed, besides the application to functional inequalities, the structural properties of optimal transport maps play a fundamental role in theoretical and methodological advances in optimal transport, such as the control of the curvature of the Wasserstein space through the notion of extendible geodesics [LPRS19,ACLGP20], the stability of Wasserstein barycenters [CMRS20], and the statistical estimation of optimal transport maps [HR21].
In applied domains, however, the inauspicious computational and statistical burden of solving the original optimal transport problem has instead led practitioners to consider entropically regularized optimal transport, as pioneered in [Cut13]. In addition to its practical merits, entropic optimal transport enjoys a rich mathematical theory, rooted in its connection to the classical Schrödinger bridge problem [Léo14], which has led to powerful applications to high-dimensional probability [Led18, FGP20, GLRT20]. As such, it is natural to study the properties of the entropic analogue of the optimal transport map.
In this paper, we prove a generalization of Caffarelli's contraction theorem to the setting of entropic optimal transport. Namely, we study the Hessian of the entropic Brenier potential (see Section 2.3), which admits a representation as a covariance matrix (Lemma 1). By applying two well-known inequalities for covariance matrices (the Brascamp-Lieb inequality and the Cramér-Rao inequality), we quickly deduce a sharp upper bound on the operator norm of the Hessian which holds for any value $\varepsilon > 0$ of the regularization parameter.
As a byproduct of our analysis, by sending $\varepsilon \searrow 0$ and appealing to recent convergence results for the entropic Brenier potentials [NW21], we obtain the shortest proof of Caffarelli's contraction theorem to date. Notably, our argument allows us to sidestep the regularity of the optimal transport map, which is a key obstacle in Caffarelli's original proof.
Recently, in [FGP20], Fathi, Gozlan, and Prod'homme gave a proof of Caffarelli's theorem using a surprising equivalence between Theorem 1 and a statement about Wasserstein projections, which was discovered through the theory of weak optimal transport [GJ20]. In order to verify the latter, their proof also used ideas from entropic optimal transport. In comparison, our argument is more direct and also allows us to handle the case of non-zero regularization ($\varepsilon > 0$).
To further demonstrate the applicability of our technique, in Section 4 we prove a generalization of Caffarelli's result: if $\nabla^2 V \preceq A^{-1}$ and $\nabla^2 W \succeq B^{-1}$, where $A$ and $B$ are arbitrary commuting positive definite matrices, then the Hessian of the Brenier potential from $P$ to $Q$ is pointwise upper bounded (in the PSD ordering) by $A^{-1/2} B^{1/2}$. This result implies a remarkable extremal property of optimal transport maps between Gaussian measures, namely: the optimal transport map from $N(0, A)$ to $N(0, B)$ maximizes the Hessian of the Brenier potential at any point among all possible measures $P$ and $Q$ satisfying our assumptions. To the best of our knowledge, this result is new.

Assumptions
We study probability measures P and Q on R d satisfying the following mild regularity assumptions.
Assumption 1 (Regularity conditions). We henceforth refer to the source measure as $P$ and the target measure as $Q$. We say that $(P, Q)$ satisfies our regularity conditions if:
1. $P$ has full support on $\mathbb{R}^d$ and $Q$ is supported on a convex subset of $\mathbb{R}^d$. Let $\Omega_Q$ denote the interior of the support of $Q$, so that $\Omega_Q$ is a convex open set.
2. $P$ and $Q$ admit positive Lebesgue densities on $\mathbb{R}^d$ and $\Omega_Q$, which can therefore be written $\exp(-V)$ and $\exp(-W)$ respectively for functions $V, W : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$. We abuse notation and identify the measures with their densities, thus writing $P = \exp(-V)$ and $Q = \exp(-W)$.
3. $V$ and $W$ are twice continuously differentiable on $\mathbb{R}^d$ and $\Omega_Q$ respectively.
Some of these assumptions can eventually be relaxed, but they suffice for the purposes of this work. Throughout the rest of the paper, and for the sake of simplicity, these regularity assumptions are assumed to hold for the probability measures under consideration.

Optimal transport without regularization
Let $P$ and $Q$ be probability measures with finite second moment. The optimal transport problem is the optimization problem
$$\operatorname*{minimize}_{\pi \in \Pi(P,Q)} \quad \int \tfrac{1}{2}\,\|x-y\|^2 \,\mathrm{d}\pi(x,y), \tag{1}$$
where $\Pi(P, Q)$ is the set of joint probability measures with marginals $P$ and $Q$. The following fundamental result characterizes the optimal solution to (1).
Theorem 2 (Brenier's theorem). Suppose that $P$ admits a density with respect to Lebesgue measure. Then, there exists a proper, convex, lower semicontinuous function $\phi_0 : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ such that the optimal transport plan in (1) can be written $\pi_0 = (\mathrm{id}, \nabla\phi_0)_\sharp P$. The function $\phi_0$ is called the Brenier potential, and the mapping $\nabla\phi_0$ is called the optimal transport map from $P$ to $Q$. Moreover, the optimal transport map $\nabla\phi_0$ is unique up to $P$-almost everywhere equality. The Brenier potential $\phi_0$ is obtained as the solution to the dual problem
$$\operatorname*{minimize}_{\phi \in \Gamma_0} \quad \int \phi \,\mathrm{d}P + \int \phi^* \,\mathrm{d}Q,$$
where $\phi^*$ is the convex conjugate of $\phi$, and $\Gamma_0$ is the set of proper, convex, lower semicontinuous functions on $\mathbb{R}^d$.
We refer to [Vil03] for further background.
Entropic optimal transport adds an entropic penalty to (1): for a regularization parameter $\varepsilon > 0$, it is the optimization problem
$$\operatorname*{minimize}_{\pi \in \Pi(P,Q)} \quad \int \tfrac{1}{2}\,\|x-y\|^2 \,\mathrm{d}\pi(x,y) + \varepsilon\,\mathrm{KL}(\pi \,\|\, P \otimes Q), \tag{3}$$
where $\mathrm{KL}$ denotes the Kullback-Leibler divergence.
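Theorem 3 (Entropic optimal transport). Let $P$ and $Q$ be probability measures on $\mathbb{R}^d$ and fix $\varepsilon > 0$. Then there exists a unique solution $\pi_\varepsilon \in \Pi(P, Q)$ to (3). Moreover, $\pi_\varepsilon$ has the form
$$\pi_\varepsilon(\mathrm{d}x, \mathrm{d}y) = \exp\Bigl(\frac{f_\varepsilon(x) + g_\varepsilon(y) - \frac{1}{2}\|x-y\|^2}{\varepsilon}\Bigr)\, P(\mathrm{d}x)\, Q(\mathrm{d}y),$$
where $(f_\varepsilon, g_\varepsilon)$ are maximizers for the dual problem
$$\operatorname*{maximize}_{(f,g) \in L^1(P) \times L^1(Q)} \quad \int f \,\mathrm{d}P + \int g \,\mathrm{d}Q - \varepsilon \iint \exp\Bigl(\frac{f(x) + g(y) - \frac{1}{2}\|x-y\|^2}{\varepsilon}\Bigr)\,\mathrm{d}P(x)\,\mathrm{d}Q(y) + \varepsilon.$$
The constraint that $\pi_\varepsilon$ has marginals $P$ and $Q$ implies the following dual optimality conditions for $(f_\varepsilon, g_\varepsilon)$ (see [MNW19, NW21]):
$$f_\varepsilon(x) = -\varepsilon \log \int \exp\Bigl(\frac{g_\varepsilon(y) - \frac{1}{2}\|x-y\|^2}{\varepsilon}\Bigr)\,\mathrm{d}Q(y), \qquad g_\varepsilon(y) = -\varepsilon \log \int \exp\Bigl(\frac{f_\varepsilon(x) - \frac{1}{2}\|x-y\|^2}{\varepsilon}\Bigr)\,\mathrm{d}P(x).$$
In particular, $f_\varepsilon$ and $g_\varepsilon$ are smooth. In this work, it is more convenient to work with the entropic Brenier potentials, defined as
$$(\phi_\varepsilon, \psi_\varepsilon) := \bigl(\tfrac{1}{2}\|\cdot\|^2 - f_\varepsilon,\ \tfrac{1}{2}\|\cdot\|^2 - g_\varepsilon\bigr).$$
Since $(f_\varepsilon, g_\varepsilon)$ are only unique up to adding a constant to $f_\varepsilon$ and subtracting the same constant from $g_\varepsilon$, we fix the normalization convention $\int f_\varepsilon \,\mathrm{d}P = \int g_\varepsilon \,\mathrm{d}Q$. Under this condition, it was shown in [NW21] that we have convergence to the Brenier potential, $\phi_\varepsilon \to \phi_0$ as $\varepsilon \searrow 0$.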
Adopting this new notation, with $P = \exp(-V)$ and $Q = \exp(-W)$, we can rewrite the entropic optimal plan as
$$\pi_\varepsilon(x, y) = \exp\Bigl(-\frac{\phi_\varepsilon(x) + \psi_\varepsilon(y) - \langle x, y\rangle}{\varepsilon} - V(x) - W(y)\Bigr).$$
The entropic Brenier potentials were first introduced to develop a computationally tractable estimator of the optimal transport map $\nabla\phi_0$ [SDF+18, PNW21, PCNW22]. Indeed, this is motivated by the following observation, which acts as an entropic version of Brenier's theorem. Write $\pi_\varepsilon^{Y|X=x}$ for the conditional distribution of $Y$ given $X = x$ for $(X, Y) \sim \pi_\varepsilon$, and similarly define $\pi_\varepsilon^{X|Y=y}$. For clarity of exposition, we abuse notation and abbreviate $\pi_\varepsilon^{Y|X=x}$ by $\pi_\varepsilon^x$ and $\pi_\varepsilon^{X|Y=y}$ by $\pi_\varepsilon^y$ when there is no danger of confusion.
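Lemma 1. It holds that
$$\nabla\phi_\varepsilon(x) = \mathbb{E}_{\pi_\varepsilon}[Y \mid X = x], \qquad \nabla^2\phi_\varepsilon(x) = \varepsilon^{-1}\operatorname{Cov}_{\pi_\varepsilon}(Y \mid X = x),$$
$$\nabla\psi_\varepsilon(y) = \mathbb{E}_{\pi_\varepsilon}[X \mid Y = y], \qquad \nabla^2\psi_\varepsilon(y) = \varepsilon^{-1}\operatorname{Cov}_{\pi_\varepsilon}(X \mid Y = y).$$
In particular, both $\phi_\varepsilon$ and $\psi_\varepsilon$ are convex. Moreover, under our regularity conditions,
$$-\nabla_y^2 \log \pi_\varepsilon^x(y) = \varepsilon^{-1}\nabla^2\psi_\varepsilon(y) + \nabla^2 W(y), \qquad -\nabla_x^2 \log \pi_\varepsilon^y(x) = \varepsilon^{-1}\nabla^2\phi_\varepsilon(x) + \nabla^2 V(x).$$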
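As a numerical sanity check of the covariance representation in Lemma 1, the following minimal sketch runs Sinkhorn iterations on a one-dimensional grid and compares the derivative of $x \mapsto \mathbb{E}[Y \mid X = x]$ with $\varepsilon^{-1}$ times the conditional variance. It assumes the cost $\frac{1}{2}|x-y|^2$ and regularization relative to $P \otimes Q$; the grid, the target density, and all function names are illustrative choices, not part of the paper.

```python
import math

def sinkhorn_scalings(xs, p, ys, q, eps, n_iter=200):
    """Sinkhorn iterations for the cost |x - y|^2 / 2, regularized relative to p (x) q.

    Returns (u, v) with plan weights pi[i][j] = u[i] * K[i][j] * v[j] * p[i] * q[j].
    """
    K = [[math.exp(-0.5 * (x - y) ** 2 / eps) for y in ys] for x in xs]
    n, m = len(xs), len(ys)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately enforce the two marginal constraints (dual optimality conditions).
        u = [1.0 / sum(K[i][j] * v[j] * q[j] for j in range(m)) for i in range(n)]
        v = [1.0 / sum(K[i][j] * u[i] * p[i] for i in range(n)) for j in range(m)]
    return u, v

def conditional_mean_and_variance(x, ys, q, v, eps):
    """Mean and variance of Y under pi^x, using the scaling v to extend to any real x."""
    w = [math.exp(-0.5 * (x - y) ** 2 / eps) * vj * qj for y, vj, qj in zip(ys, v, q)]
    total = sum(w)
    mean = sum(wi * y for wi, y in zip(w, ys)) / total
    var = sum(wi * (y - mean) ** 2 for wi, y in zip(w, ys)) / total
    return mean, var

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

eps = 0.5
grid = [-3.0 + 0.1 * k for k in range(61)]
p = normalize([math.exp(-0.5 * x * x) for x in grid])                  # discretized N(0, 1)
q = normalize([math.exp(-0.25 * y ** 4 - 0.5 * y * y) for y in grid])  # illustrative target
u, v = sinkhorn_scalings(grid, p, grid, q, eps)

x0, h = 0.7, 1e-4
mean_plus, _ = conditional_mean_and_variance(x0 + h, grid, q, v, eps)
mean_minus, _ = conditional_mean_and_variance(x0 - h, grid, q, v, eps)
_, var0 = conditional_mean_and_variance(x0, grid, q, v, eps)

slope = (mean_plus - mean_minus) / (2 * h)  # finite-difference derivative of E[Y | X = x]
cov_over_eps = var0 / eps                   # right-hand side of the Hessian identity
print(slope, cov_over_eps)
```

The two printed numbers should agree up to finite-difference error, and positivity of the second one reflects the convexity of $\phi_\varepsilon$.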

Covariance inequalities
In our proofs, we make use of the following key inequalities.
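Lemma 2. Let $P = \exp(-V)$ be a probability measure on $\mathbb{R}^d$ and assume that $V$ is twice continuously differentiable on the interior of its domain. Then, the following hold.
1. (Brascamp-Lieb inequality) If in addition we assume that $P$ is strictly log-concave, then it holds that
$$\operatorname{Cov}_{X \sim P}(X) \preceq \mathbb{E}_{X \sim P}\bigl[(\nabla^2 V(X))^{-1}\bigr].$$
2. (Cramér-Rao inequality) If in addition we assume that $P$ has full support on $\mathbb{R}^d$, then it holds that
$$\operatorname{Cov}_{X \sim P}(X) \succeq \bigl(\mathbb{E}_{X \sim P}[\nabla^2 V(X)]\bigr)^{-1}.$$
The Brascamp-Lieb inequality is classical, and we refer readers to [BL00, BGL14, CE17] for several proofs. To make our exposition more self-contained, we provide a proof of the Cramér-Rao inequality in the appendix.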
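To see the two covariance inequalities at work, the sketch below checks them numerically in dimension one, where they read $1/\mathbb{E}[V''(X)] \le \operatorname{Var}(X) \le \mathbb{E}[1/V''(X)]$. The potential $V(x) = x^4/4 + x^2/2$, the quadrature grid, and the function names are assumptions made purely for illustration.

```python
import math

def V(x):
    return 0.25 * x ** 4 + 0.5 * x ** 2

def d2V(x):
    # Second derivative of V; strictly positive, so P is strictly log-concave.
    return 3.0 * x ** 2 + 1.0

def expectation(h, lo=-8.0, hi=8.0, n=20001):
    """E_P[h(X)] for P proportional to exp(-V), via a Riemann sum on a uniform grid."""
    step = (hi - lo) / (n - 1)
    num = den = 0.0
    for k in range(n):
        x = lo + k * step
        w = math.exp(-V(x))
        num += w * h(x)
        den += w
    return num / den

mean = expectation(lambda x: x)
var = expectation(lambda x: (x - mean) ** 2)
brascamp_lieb_rhs = expectation(lambda x: 1.0 / d2V(x))  # E[(V'')^{-1}]
cramer_rao_rhs = 1.0 / expectation(d2V)                  # (E[V''])^{-1}
print(cramer_rao_rhs, var, brascamp_lieb_rhs)
```

For a Gaussian all three quantities coincide, which is why the two inequalities are complementary and jointly sharp in the Gaussian case.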

Main theorem
We now state and prove our main theorem.
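Theorem 4.
1. Suppose that $(P, Q)$ satisfy our regularity assumptions, as well as
$$\nabla^2 V \preceq \beta_V I, \qquad \nabla^2 W \succeq \alpha_W I \succ 0.$$
Then, for every $\varepsilon > 0$ and all $x \in \mathbb{R}^d$, the Hessian of the entropic Brenier potential satisfies
$$\nabla^2\phi_\varepsilon(x) \preceq \frac{1}{2}\Bigl(\sqrt{4\beta_V/\alpha_W + \varepsilon^2\beta_V^2} - \varepsilon\beta_V\Bigr)\, I.$$
2. Suppose that $(Q, P)$ satisfy our regularity assumptions, as well as
$$\nabla^2 V \succeq \alpha_V I \succ 0, \qquad \nabla^2 W \preceq \beta_W I.$$
Then, for every $\varepsilon > 0$ and all $x \in \Omega_P := \operatorname{int}(\operatorname{supp}(P))$, the Hessian of the entropic Brenier potential satisfies
$$\nabla^2\phi_\varepsilon(x) \succeq \frac{1}{2}\Bigl(\sqrt{4\alpha_V/\beta_W + \varepsilon^2\alpha_V^2} - \varepsilon\alpha_V\Bigr)\, I.$$
Observe that as $\varepsilon \searrow 0$, we formally expect the following bounds on the Brenier potential:
$$\sqrt{\alpha_V/\beta_W}\; I \preceq \nabla^2\phi_0 \preceq \sqrt{\beta_V/\alpha_W}\; I.$$
In particular, this recovers Caffarelli's contraction theorem (Theorem 1). We make this intuition rigorous below by appealing to convergence results for the entropic potentials as the regularization parameter $\varepsilon$ tends to zero.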
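The behaviour of the entropic bounds as $\varepsilon$ varies can be explored with a few lines of code. The sketch below assumes the bounds take the closed forms $\frac{1}{2}(\sqrt{4\beta_V/\alpha_W + \varepsilon^2\beta_V^2} - \varepsilon\beta_V)$ and $\frac{1}{2}(\sqrt{4\alpha_V/\beta_W + \varepsilon^2\alpha_V^2} - \varepsilon\alpha_V)$, where $\beta_V, \alpha_W, \alpha_V, \beta_W$ denote the spectral bounds on $\nabla^2 V$ and $\nabla^2 W$; the function names and numerical values are hypothetical.

```python
import math

def hessian_upper_bound(eps, beta_V, alpha_W):
    """Assumed closed form of the upper bound on the Hessian of the entropic potential."""
    return 0.5 * (math.sqrt(4.0 * beta_V / alpha_W + (eps * beta_V) ** 2) - eps * beta_V)

def hessian_lower_bound(eps, alpha_V, beta_W):
    """Assumed closed form of the lower bound (roles of the two measures swapped)."""
    return 0.5 * (math.sqrt(4.0 * alpha_V / beta_W + (eps * alpha_V) ** 2) - eps * alpha_V)

def self_bounding_rhs(L, eps, beta_V, alpha_W):
    """Map L -> ((L + eps * beta_V)^{-1} + eps * alpha_W)^{-1}; the upper bound is its fixed point."""
    return 1.0 / (1.0 / (L + eps * beta_V) + eps * alpha_W)

eps, beta_V, alpha_W = 0.3, 2.0, 0.5
L = hessian_upper_bound(eps, beta_V, alpha_W)
print(L, self_bounding_rhs(L, eps, beta_V, alpha_W))  # the two values should coincide

# As eps decreases to 0, the bound approaches sqrt(beta_V / alpha_W).
for small_eps in (1.0, 0.1, 0.01, 0.001):
    print(small_eps, hessian_upper_bound(small_eps, beta_V, alpha_W))
```

The upper bound decreases as $\varepsilon$ grows, consistent with the intuition that stronger regularization concentrates the conditional means.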
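Proof of Theorem 4. Upper bound. Fix $x \in \mathbb{R}^d$. Recall from Lemma 1 that $\nabla^2\phi_\varepsilon(x) = \varepsilon^{-1}\operatorname{Cov}_{Y \sim \pi_\varepsilon^x}(Y)$. By an application of the Brascamp-Lieb inequality, this results in the upper bound
$$\nabla^2\phi_\varepsilon(x) \preceq \mathbb{E}_{Y \sim \pi_\varepsilon^x}\bigl[(\nabla^2\psi_\varepsilon(Y) + \varepsilon\nabla^2 W(Y))^{-1}\bigr] \preceq \mathbb{E}_{Y \sim \pi_\varepsilon^x}\bigl[(\nabla^2\psi_\varepsilon(Y) + \varepsilon\alpha_W I)^{-1}\bigr],$$
where in the last inequality we also used the lower bound on the spectrum of $\nabla^2 W$. Next, using Lemma 1 and the Cramér-Rao inequality (Lemma 2), we obtain the lower bound
$$\nabla^2\psi_\varepsilon(y) \succeq \bigl(\mathbb{E}_{X \sim \pi_\varepsilon^y}[\nabla^2\phi_\varepsilon(X) + \varepsilon\nabla^2 V(X)]\bigr)^{-1} \succeq \bigl(\mathbb{E}_{X \sim \pi_\varepsilon^y}[\nabla^2\phi_\varepsilon(X)] + \varepsilon\beta_V I\bigr)^{-1},$$
where we used the upper bound on the spectrum of $\nabla^2 V$. Combining these inequalities,
$$\nabla^2\phi_\varepsilon(x) \preceq \mathbb{E}_{Y \sim \pi_\varepsilon^x}\Bigl[\Bigl(\bigl(\mathbb{E}_{X \sim \pi_\varepsilon^Y}[\nabla^2\phi_\varepsilon(X)] + \varepsilon\beta_V I\bigr)^{-1} + \varepsilon\alpha_W I\Bigr)^{-1}\Bigr].$$
Now, define the quantity
$$L_\varepsilon := \sup_{x \in \mathbb{R}^d} \lambda_{\max}\bigl(\nabla^2\phi_\varepsilon(x)\bigr).$$
Then, we have shown
$$\lambda_{\max}\bigl(\nabla^2\phi_\varepsilon(x)\bigr) \le \bigl((L_\varepsilon + \varepsilon\beta_V)^{-1} + \varepsilon\alpha_W\bigr)^{-1}.$$
Taking the supremum over $x \in \mathbb{R}^d$,
$$L_\varepsilon \le \bigl((L_\varepsilon + \varepsilon\beta_V)^{-1} + \varepsilon\alpha_W\bigr)^{-1}.$$
Solving the inequality, which is equivalent to the quadratic inequality $\alpha_W L_\varepsilon^2 + \varepsilon\alpha_W\beta_V L_\varepsilon - \beta_V \le 0$, yields
$$L_\varepsilon \le \frac{1}{2}\Bigl(\sqrt{4\beta_V/\alpha_W + \varepsilon^2\beta_V^2} - \varepsilon\beta_V\Bigr).$$
Lower bound. The lower bound argument is symmetric, but we give the details for completeness. Using Lemma 1 and the Cramér-Rao inequality (Lemma 2),
$$\nabla^2\phi_\varepsilon(x) \succeq \bigl(\mathbb{E}_{Y \sim \pi_\varepsilon^x}[\nabla^2\psi_\varepsilon(Y)] + \varepsilon\beta_W I\bigr)^{-1}.$$
Applying Lemma 1 and the Brascamp-Lieb inequality (Lemma 2),
$$\nabla^2\psi_\varepsilon(y) \preceq \mathbb{E}_{X \sim \pi_\varepsilon^y}\bigl[(\nabla^2\phi_\varepsilon(X) + \varepsilon\alpha_V I)^{-1}\bigr].$$
Combining the two inequalities and setting
$$\ell_\varepsilon := \inf_{x \in \Omega_P} \lambda_{\min}\bigl(\nabla^2\phi_\varepsilon(x)\bigr),$$
we deduce that
$$\ell_\varepsilon \ge \bigl((\ell_\varepsilon + \varepsilon\alpha_V)^{-1} + \varepsilon\beta_W\bigr)^{-1}.$$
On the other hand, from Lemma 1, we know that $\ell_\varepsilon \ge 0$. Solving the inequality then yields
$$\ell_\varepsilon \ge \frac{1}{2}\Bigl(\sqrt{4\alpha_V/\beta_W + \varepsilon^2\alpha_V^2} - \varepsilon\alpha_V\Bigr).$$
Next, we rigorously deduce Caffarelli's contraction theorem from Theorem 4.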
Remark 1. Our main theorem provides both upper and lower bounds for $\nabla^2\phi_\varepsilon$. In the case when $\varepsilon = 0$, the lower bound follows from the upper bound. Indeed, if $\phi_0$ is the Brenier potential for the optimal transport from $P$ to $Q$, then the convex conjugate $\phi_0^*$ is the Brenier potential for the optimal transport from $Q$ to $P$. Applying Caffarelli's contraction theorem to $\phi_0^*$ and appealing to convex duality yields a lower bound on $\nabla^2\phi_0$. However, we are not aware of a method of deducing the lower bound from the upper bound for positive values of $\varepsilon$.
Remark 2. In Appendix B, by inspecting the Gaussian case, we show that Theorem 4 is sharp for every ε > 0.
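The sharpness in the Gaussian case can be illustrated in dimension one. The sketch below assumes the known fact that the entropic plan between centered Gaussians $N(0, a)$ and $N(0, b)$ is itself a centered Gaussian, and determines its cross-covariance $c$ from the covariance identity of Lemma 1 (conditional variance $= \varepsilon \cdot$ slope), which gives $c^2 + \varepsilon c - ab = 0$. The closed form of the upper bound and all names are assumptions for illustration.

```python
import math

def gaussian_entropic_slope(a, b, eps):
    """Slope of x -> E[Y | X = x] and Var(Y | X) for the entropic plan N(0, a) -> N(0, b)."""
    c = 0.5 * (math.sqrt(eps ** 2 + 4.0 * a * b) - eps)  # positive root of c^2 + eps*c - a*b = 0
    slope = c / a                                        # Gaussian regression coefficient
    conditional_variance = b - c * c / a                 # Gaussian conditional variance
    return slope, conditional_variance

def theorem4_upper_bound(eps, beta_V, alpha_W):
    """Assumed closed form of the upper bound in Theorem 4."""
    return 0.5 * (math.sqrt(4.0 * beta_V / alpha_W + (eps * beta_V) ** 2) - eps * beta_V)

a, b, eps = 2.0, 3.0, 0.5
slope, cond_var = gaussian_entropic_slope(a, b, eps)
# For N(0, a) the potential has second derivative 1/a, and similarly 1/b for N(0, b).
bound = theorem4_upper_bound(eps, beta_V=1.0 / a, alpha_W=1.0 / b)
print(slope, bound, cond_var, eps * slope)
```

Here the slope of the conditional mean coincides with the assumed bound, so no smaller constant can hold for all measures satisfying the hypotheses.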
An inspection of the proof of the upper bound in Theorem 4 reveals the following more general pair of inequalities.
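Proposition 1. Let $(P, Q)$ be probability measures satisfying our regularity conditions. Then, for all $x \in \mathbb{R}^d$ and $y \in \Omega_Q$,
$$\nabla^2\phi_\varepsilon(x) \preceq \mathbb{E}_{Y \sim \pi_\varepsilon^x}\bigl[(\nabla^2\psi_\varepsilon(Y) + \varepsilon\nabla^2 W(Y))^{-1}\bigr], \qquad \nabla^2\psi_\varepsilon(y) \succeq \bigl(\mathbb{E}_{X \sim \pi_\varepsilon^y}[\nabla^2\phi_\varepsilon(X) + \varepsilon\nabla^2 V(X)]\bigr)^{-1}.$$
In the next section, we use these inequalities to prove a generalization of Caffarelli's theorem.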

A generalization to commuting positive definite matrices
In the next result, we replace the main assumptions of Caffarelli's contraction theorem, namely $\nabla^2 V \preceq \beta_V I$ and $\nabla^2 W \succeq \alpha_W I \succ 0$, by the conditions
$$\nabla^2 V \preceq A^{-1}, \qquad \nabla^2 W \succeq B^{-1}, \tag{11}$$
where $A$ and $B$ are commuting positive definite matrices. Recall that the Hessian of the Brenier potential between the Gaussian distributions $N(0, A)$ and $N(0, B)$ is the matrix
$$A^{-1/2}\bigl(A^{1/2} B A^{1/2}\bigr)^{1/2} A^{-1/2} = A^{-1/2} B^{1/2}.$$
In light of this observation, the following theorem is sharp for every pair of commuting positive definite $(A, B)$, and shows that the Brenier potential between Gaussians achieves the largest possible Hessian among all source and target measures obeying the constraint (11).
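For completeness, the claim about the Gaussian case can be verified directly. The following short computation is a sketch that uses only the commutativity of $A$ and $B$ (hence of their square roots and inverses) together with Brenier's theorem; the symbol $T$ is introduced here for convenience.

```latex
\begin{align*}
T &:= A^{-1/2} B^{1/2}
  && \text{(symmetric positive definite, since $A^{-1/2}$ and $B^{1/2}$ commute)} \\
T A T^{\top} &= A^{-1/2} B^{1/2} \, A \, B^{1/2} A^{-1/2} = B
  && \text{(so $x \mapsto Tx$ pushes $N(0,A)$ forward to $N(0,B)$)} \\
T x &= \nabla \Bigl( \tfrac{1}{2} \langle x, T x \rangle \Bigr)
  && \text{(gradient of a convex quadratic)}
\end{align*}
```

By Brenier's theorem, a gradient of a convex function that pushes $N(0, A)$ forward to $N(0, B)$ is the optimal transport map, so the Brenier potential is $\phi_0(x) = \frac{1}{2}\langle x, Tx\rangle$ up to an additive constant, with constant Hessian $A^{-1/2}B^{1/2}$.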
Theorem 5. Let $(P, Q)$ satisfy our regularity conditions as well as the condition (11). Then, the Hessian of the Brenier potential satisfies the uniform bound: for all $x \in \mathbb{R}^d$, it holds that
$$\nabla^2\phi_0(x) \preceq A^{-1/2} B^{1/2}.$$
As in Theorem 4, the proof technique also yields a lower bound on $\nabla^2\phi_0$ under appropriate assumptions. We omit this result because it is straightforward.
In light of Theorem 4, $C_\varepsilon$ is well-defined and finite. Equivalently, Let $(x, e)$ achieve the above supremum. (If the supremum is not attained, then the rest of the proof goes through with minor modifications.) Using our assumptions and Proposition 1, we obtain From our assumptions and Theorem 4, we know that the spectrum of $M_\varepsilon := A^{-1/2} B^{1/2} + C_\varepsilon I$ is bounded away from zero and infinity as $\varepsilon \searrow 0$, which justifies the Taylor expansion Hence, This shows that $\lim_{\varepsilon \searrow 0} C_\varepsilon = 0$ (otherwise $(C_\varepsilon)_{\varepsilon > 0}$ would have a strictly positive cluster point, which would contradict the above inequality for small enough $\varepsilon > 0$). By combining this fact with convergence of the entropic Brenier potentials as in the proof of Theorem 1, we deduce the result.

Next, we show how our theorem recovers and extends a result of Valdimarsson [Val07]. Valdimarsson proves that if:
• $\bar{A}$, $B$, and $G$ are positive definite matrices;
• $\bar{A} \preceq G$ and $B$ commutes with $G$;
• $P = N(0, BG^{-1}) * \mu$ where $*$ denotes convolution and $\mu$ is an arbitrary probability measure on $\mathbb{R}^d$; and
then the Brenier potential satisfies $\nabla^2\phi_0 \preceq G$. This result was then used to derive new forms of the Brascamp-Lieb inequality. To prove this result, we first check that convolution with any probability measure only makes the density more log-smooth.
Lemma 3. Let $\tilde{P} \propto \exp(-\tilde{V})$ be a probability measure, where $\tilde{V} : \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable. Let $P := \tilde{P} * \mu = \exp(-V)$, where $\mu$ is any probability measure on $\mathbb{R}^d$. Suppose that for some positive definite matrix $A^{-1}$, we have $\nabla^2\tilde{V} \preceq A^{-1}$. Then, $\nabla^2 V \preceq A^{-1}$ as well.
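Proof. An elementary computation shows that if we define the probability measure
$$\mu_x(\mathrm{d}z) \propto \exp\bigl(-\tilde{V}(x - z)\bigr)\,\mu(\mathrm{d}z),$$
then
$$\nabla^2 V(x) = \mathbb{E}_{Z \sim \mu_x}\bigl[\nabla^2\tilde{V}(x - Z)\bigr] - \operatorname{Cov}_{Z \sim \mu_x}\bigl(\nabla\tilde{V}(x - Z)\bigr) \preceq \mathbb{E}_{Z \sim \mu_x}\bigl[\nabla^2\tilde{V}(x - Z)\bigr],$$
from which the result follows.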
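A one-dimensional example makes the log-smoothness statement of Lemma 3 concrete. In the sketch below, the standard Gaussian (whose potential has second derivative $1$) is convolved with the two-point measure $\mu = \frac{1}{2}\delta_{-2} + \frac{1}{2}\delta_{2}$; the resulting potential $V$ is no longer convex, yet its second derivative never exceeds $1$. The choice of $\mu$, the grid, and the helper names are illustrative assumptions.

```python
import math

def V(x, centers=(-2.0, 2.0)):
    """V = -log density of N(0, 1) * mu, up to an additive constant, with mu uniform on `centers`."""
    return -math.log(sum(math.exp(-0.5 * (x - z) ** 2) for z in centers) / len(centers))

def second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

grid = [-6.0 + 0.05 * k for k in range(241)]
hessians = [second_derivative(V, x) for x in grid]
print(min(hessians), max(hessians))
```

The minimum is negative (near $x = 0$ the mixture is bimodal), while the maximum stays below the bound inherited from the Gaussian factor, in line with the covariance term being subtracted in the proof.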
From the lemma, we deduce that under Valdimarsson's assumptions, for $P = \exp(-V)$, we have $\nabla^2 V \preceq B^{-1} G$. By Theorem 5, the Brenier potential $\phi_0$ satisfies $\nabla^2\phi_0 \preceq G$. However, our argument yields more. For example, rather than requiring $P$ to be a convolution with a Gaussian measure, we can allow $P$ to be a convolution with any measure $\exp(-\tilde{V})$ satisfying $\nabla^2\tilde{V} \preceq B^{-1} G$.
Remark 3. It is natural to ask whether Theorem 5 can be obtained by first applying Caffarelli's contraction theorem to show that the optimal transport map $\tilde{T}_0$ between the measures $(A^{-1/2})_\sharp P$ and $(B^{-1/2})_\sharp Q$ is 1-Lipschitz, and then considering the mapping $T_0(x) := B^{1/2}\,\tilde{T}_0(A^{-1/2} x)$. Although $T_0$ is indeed a valid transport mapping from $P$ to $Q$, under our assumptions $\nabla T_0$ is not guaranteed to be symmetric, so it does not make sense to ask whether or not $\nabla T_0 \preceq A^{-1/2} B^{1/2}$. In Valdimarsson's application to Brascamp-Lieb inequalities, it is crucial that the transport map $T_0$ is chosen so that $\nabla T_0$ is a symmetric positive definite matrix. Symmetry of $\nabla T_0$ implies that $T_0$ is the gradient $\nabla\phi_0$ of a function $\phi_0 : \mathbb{R}^d \to \mathbb{R}$, and positive definiteness implies that $\phi_0$ is convex. By Brenier's theorem, the unique gradient of a convex function that pushes forward $P$ to $Q$ is the optimal transport map. Thus, it is crucial that we consider the optimal transport map here; in particular, alternative maps such as the ones in [KM12, MS21] cannot be applied.
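A small numerical example shows how symmetry can fail for the rescaled map. By the chain rule, a map of the form $x \mapsto B^{1/2}\,\tilde{T}(A^{-1/2}x)$ has Jacobian $B^{1/2} S A^{-1/2}$, where $S$ is the (symmetric) Jacobian of $\tilde{T}$ at the corresponding point. The matrices below are hypothetical choices with $A$ and $B$ commuting and $0 \prec S \preceq I$; they are not taken from the paper.

```python
def matmul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

sqrt_B = [[1.0, 0.0], [0.0, 2.0]]      # B = diag(1, 4)
inv_sqrt_A = [[1.0, 0.0], [0.0, 1.0]]  # A = I, which commutes with B
S = [[0.5, 0.2], [0.2, 0.5]]           # symmetric, eigenvalues 0.3 and 0.7, so 0 < S <= I

J = matmul(matmul(sqrt_B, S), inv_sqrt_A)  # Jacobian of the rescaled map
is_symmetric = abs(J[0][1] - J[1][0]) < 1e-12
print(J, is_symmetric)
```

Since the off-diagonal entries of `J` differ, this Jacobian is not the Hessian of any function, so the rescaled map cannot be a gradient of a convex potential.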

Discussion
We have proven a generalization of Caffarelli's celebrated theorem on the Lipschitz properties of the optimal transport map to the setting of entropic optimal transport using two complementary covariance inequalities (the Brascamp-Lieb inequality and the Cramér-Rao inequality).
We conjecture that our proof technique can also be used to recover the bounds on the moment measure mapping in [Kla14], provided that the existence of an "entropic moment measure" can be established (with convergence towards the true moment measure as the regularization tends to zero).As this is outside the scope of this work, we do not pursue this question here.
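Integration by parts shows that $\int (\nabla V)^{\otimes 2} \,\mathrm{d}P = \int \nabla^2 V \,\mathrm{d}P$, and upon rearranging we deduce that
$$\operatorname{Var}_P h \ge \bigl\langle \mathbb{E}_P \nabla h,\ (\mathbb{E}_P \nabla^2 V)^{-1}\, \mathbb{E}_P \nabla h \bigr\rangle. \tag{13}$$
By approximation, this continues to hold for any locally Lipschitz $h : \mathbb{R}^d \to \mathbb{R}$ with $\mathbb{E}_P \|\nabla h\| < \infty$. Specializing the inequality (13) to $h := \langle e, \cdot\rangle$ for a unit vector $e \in \mathbb{R}^d$ then recovers the Cramér-Rao inequality of Lemma 2.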
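One standard route to the inequality (13), sketched here for smooth compactly supported $h$ and under the assumption that $\mathbb{E}_P \nabla^2 V$ is invertible, combines integration by parts with the Cauchy-Schwarz inequality. The auxiliary vector $u$ is notation introduced only for this sketch.

```latex
\begin{align*}
\mathbb{E}_P \nabla h
  &= \mathbb{E}_P[h \, \nabla V]
   = \mathbb{E}_P[(h - \mathbb{E}_P h) \, \nabla V]
  && \text{(integration by parts; } \mathbb{E}_P \nabla V = 0 \text{)} \\
\langle u, \mathbb{E}_P \nabla h \rangle^2
  &\le \operatorname{Var}_P(h) \; \mathbb{E}_P \langle u, \nabla V \rangle^2
   = \operatorname{Var}_P(h) \; \langle u, (\mathbb{E}_P \nabla^2 V) \, u \rangle
  && \text{(Cauchy--Schwarz)} \\
u := (\mathbb{E}_P \nabla^2 V)^{-1} \, \mathbb{E}_P \nabla h
  &\;\Longrightarrow\;
  \langle \mathbb{E}_P \nabla h, (\mathbb{E}_P \nabla^2 V)^{-1} \, \mathbb{E}_P \nabla h \rangle
  \le \operatorname{Var}_P(h) \\
h = \langle e, \cdot \rangle
  &\;\Longrightarrow\;
  \langle e, \operatorname{Cov}_P(X) \, e \rangle
  = \operatorname{Var}_P \langle e, X \rangle
  \ge \langle e, (\mathbb{E}_P \nabla^2 V)^{-1} e \rangle
\end{align*}
```

Since the last line holds for every unit vector $e$, it is equivalent to the matrix inequality $\operatorname{Cov}_P(X) \succeq (\mathbb{E}_P \nabla^2 V)^{-1}$.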