An elementary analysis of ridge regression with random design

In this note, we provide an elementary analysis of the prediction error of ridge regression with random design. The proof is short and self-contained. In particular, it bypasses the use of Rudelson's deviation inequality for covariance matrices, through a combination of exchangeability arguments, matrix perturbation and operator convexity.


Introduction
Let (X, Y) be a random vector in R^d × R with distribution P. We consider the problem of random-design regression, namely prediction of Y by linear functions of X. (This setting also allows one to consider nonlinear functions of general covariates X′, taking values in a measurable space 𝒳′, by letting X = Φ(X′) for some feature map Φ : 𝒳′ → R^d.) Specifically, the prediction error of a regression parameter θ ∈ R^d is measured by its risk

L(θ) = E[(Y − ⟨θ, X⟩)²],

where ⟨θ, x⟩ = θ^⊤x is the standard scalar product on R^d. In the statistical setting, the true distribution P, and in particular the (population) risk L : R^d → R_+ and its minimizer θ* = arg min_{θ∈R^d} L(θ), are unknown. The aim is then, given a random independent and identically distributed (i.i.d.) sample (X_1, Y_1), ..., (X_n, Y_n) from P, to find a good parameter θ̂, as measured by its excess risk

E(θ̂) = L(θ̂) − L(θ*).    (1)

A popular approach to this problem is the method of regularized least squares (a form of regularized empirical risk minimization), where the estimator minimizes the sum of an empirical error term and a regularization term favoring some structure on the parameter. In this note, we consider the classical ridge estimator [13,34], defined for λ > 0 by

θ̂_λ := arg min_{θ∈R^d} { n^{-1} ∑_{i=1}^n (Y_i − ⟨θ, X_i⟩)² + λ‖θ‖² } = (Σ̂_n + λ)^{-1} · n^{-1} ∑_{i=1}^n Y_i X_i,

where ‖θ‖ := ⟨θ, θ⟩^{1/2} is the Euclidean norm and Σ̂_n := n^{-1} ∑_{i=1}^n X_i X_i^⊤ is the empirical covariance matrix. While this estimator is well-studied (see Section 3), our aim here is to present a short and elementary analysis of its performance. In particular, the analysis presented here does not rely on matrix concentration [29,1,26,35] or uniform deviation bounds for empirical processes [22,16,4], but rather on a combination of exchangeability and matrix convexity arguments. It draws inspiration from an analysis of [25] in the context of conditional density estimation. Our main error estimate is provided in Theorem 1 below.
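As a concrete illustration (not part of the note's analysis), the ridge estimator above can be computed in a few lines on synthetic well-specified data; all names and constants below are illustrative choices, and the closed form used is the standard solution of the regularized normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1

# Synthetic well-specified data: Y = <theta_star, X> + noise.
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_star + 0.1 * rng.normal(size=n)

# Ridge estimator: minimizer of n^{-1} sum_i (Y_i - <theta, X_i>)^2 + lam * ||theta||^2,
# given in closed form by (Sigma_n + lam I)^{-1} (n^{-1} sum_i Y_i X_i).
Sigma_n = X.T @ X / n
theta_hat = np.linalg.solve(Sigma_n + lam * np.eye(d), X.T @ Y / n)
```

Solving the linear system directly (rather than forming an explicit inverse) is the usual numerically stable way to evaluate the closed form.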
Notation. Given a d × d matrix A, we denote by Tr(A) its trace and by ‖A‖_op its operator norm. The d × d identity matrix is denoted I_d, or simply I; for λ ∈ R, we write A + λ = A + λI. The symbol ≼ denotes the standard order on symmetric matrices: A ≼ B means that ⟨Av, v⟩ ≤ ⟨Bv, v⟩ for all v ∈ R^d.

Risk analysis of ridge regression
Assumptions. The analysis requires two assumptions on the joint distribution P of (X, Y), the first one on the distribution of the error Y − ⟨θ*, X⟩, and the second one on the distribution of X.

Assumption 1. There exist θ* ∈ R^d and σ > 0 such that

E[Y | X] = ⟨θ*, X⟩  and  E[(Y − ⟨θ*, X⟩)² | X] ≤ σ²  almost surely.

The first condition in Assumption 1 states that the linear model is well-specified, in the sense that the true regression function x → E[Y | X = x] is linear. This condition is standard, although restrictive when the dimension d is low. On the other hand, the guarantees we consider do not explicitly depend on the dimension d, and extend with minor changes in notation to the case where R^d is replaced by an infinite-dimensional Hilbert space. This allows one to handle the case of reproducing kernel Hilbert spaces [2] (such as certain Sobolev spaces), for which ridge regression is a classical estimator [41,32]. When considering a "universal" kernel, the corresponding Hilbert space is dense in the space L²(P_X) of square-integrable functions of X [32], in which case the well-specified assumption is considerably less restrictive. The main issue is then the dependence of the bound on θ*. In this respect, the bound of Theorem 1 will only depend on θ* through the approximation properties of balls of the Hilbert space. Finally, the second condition in Assumption 1 is a bound on the conditional variance of Y given X, which controls the amount of noise. It holds for instance if Y is bounded, or if the error Y − ⟨θ*, X⟩ is independent of X with finite variance.

Assumption 2. There exists a constant R > 0 such that ‖X‖ ≤ R almost surely.
The boundedness Assumption 2 is classical in the context of ridge regression. This condition is automatically satisfied, for instance, in the case where X is the feature associated with a bounded reproducing kernel Hilbert space [32]. Assumption 2 implies in particular that the covariance matrix Σ = E[XX^⊤] of X is well-defined and satisfies Tr(Σ) = E[‖X‖²] ≤ R².
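The identity Tr(Σ) = E[‖X‖²] ≤ R² can be checked on its empirical counterpart: the trace of the empirical covariance matrix equals the average squared norm of the samples. The sketch below (a hypothetical setup, with X drawn uniformly on a sphere so that Assumption 2 holds exactly) verifies this.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, R = 1000, 4, 1.0

# Draw X uniformly on the sphere of radius R, so ||X|| <= R holds surely.
X = rng.normal(size=(n, d))
X = R * X / np.linalg.norm(X, axis=1, keepdims=True)

# Tr(Sigma_n) equals the empirical mean of ||X_i||^2, hence is at most R^2.
Sigma_n = X.T @ X / n
```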
Risk analysis of ridge regression. The proof hinges on two main lemmas. Before presenting them, we start with a classical bias-variance decomposition.
Lemma 1 identifies the main quantities that need to be controlled in the random-design setting. The first term in (4) is a bias term, due to the use of a regularization favoring solutions with small norm. The second one is a variance term, due to the presence of the errors ε_i = Y_i − ⟨θ*, X_i⟩. Both terms depend on the (random) sample covariance matrix Σ̂_n, as well as on the population covariance matrix Σ. Both matrices appear because, in the random-design/statistical learning setting, one is interested in making a prediction at a new point X, rather than at the sample points X_1, ..., X_n.
In order to argue that the sample covariance matrix Σ̂_n is "close" to Σ in a suitable sense, and to deduce an explicit error bound, some assumption on the distribution of X is needed. This is where Assumption 2 is used. Under this assumption, one can apply Rudelson's inequality for sample covariance matrices [29]; this is the approach adopted, for instance, in [7,14]. Rudelson's inequality, a consequence of the non-commutative Khintchine inequality of [21], is however a non-trivial result, despite subsequent simplifications of its proof [1,26,35]. In addition, matrix concentration through Rudelson's inequality introduces an additional logarithmic term [35,39].
In what follows, we present an alternative approach to controlling the random matrix terms on the right-hand side of (4), which only uses short and elementary arguments. The proof relies on a combination of exchangeability, matrix perturbation and operator convexity. It draws inspiration from an analysis of [25], where similar arguments were used in the context of conditional density estimation and logistic regression.

Lemma 2. Under Assumption 2, we have, for every λ > 0,

Tr[(Σ + λ)^{-1} Σ] ≤ E[Tr[(Σ̂_n + λ)^{-1} Σ]] ≤ (1 + R²/(λn)) Tr[(Σ + λ)^{-1} Σ].

Proof. The lower bound comes from convexity of A → Tr(A^{-1}Σ) over positive matrices (see Lemma 5 below) and Jensen's inequality. Let us now prove the upper bound. We start by writing

E[Tr[(Σ̂_n + λ)^{-1} Σ]] = E[⟨(Σ̂_n + λ)^{-1} X_{n+1}, X_{n+1}⟩],

where X_{n+1} is a random variable distributed as X and independent of X_1, ..., X_n. Now, the Sherman-Morrison identity (10) with S = nΣ̂_n + λnI and v = X_{n+1} shows that

⟨(Σ̂_n + λ)^{-1} X_{n+1}, X_{n+1}⟩ = (1 + n^{-1}⟨(Σ̂_n + λ)^{-1} X_{n+1}, X_{n+1}⟩) ⟨((1 + 1/n) Σ̂_{n+1} + λ)^{-1} X_{n+1}, X_{n+1}⟩ ≤ (1 + R²/(λn)) ⟨(Σ̂_{n+1} + λ)^{-1} X_{n+1}, X_{n+1}⟩,

where Σ̂_{n+1} = (n + 1)^{-1} ∑_{i=1}^{n+1} X_i X_i^⊤, and where we used that, by Assumption 2, ⟨(Σ̂_n + λI)^{-1} X_{n+1}, X_{n+1}⟩ ≤ ‖X_{n+1}‖²/λ ≤ R²/λ. It follows that

E[Tr[(Σ̂_n + λ)^{-1} Σ]] ≤ (1 + R²/(λn)) E[⟨(Σ̂_{n+1} + λ)^{-1} X_{n+1}, X_{n+1}⟩]
= (1 + R²/(λn)) E[Tr[(Σ̂_{n+1} + λ)^{-1} Σ̂_{n+1}]]    (6)
≤ (1 + R²/(λn)) Tr[(Σ + λ)^{-1} Σ],    (7)

where (6) follows from exchangeability of (X_1, ..., X_{n+1}), while (7) follows from concavity of the map A → Tr[(A + λI)^{-1} A] over positive matrices (by Lemma 5) and Jensen's inequality.
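As a numerical sanity check (not part of the note's argument), the sketch below estimates E[Tr[(Σ̂_n + λ)^{-1} Σ]] by Monte Carlo for X uniform on a sphere, so that Σ = (R²/d)I is known in closed form, and compares it with the two-sided bound of Lemma 2 as stated above. The distribution, sample size and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam, R, trials = 50, 3, 0.5, 1.0, 2000

# X uniform on the sphere of radius R: then Sigma = (R^2 / d) * I exactly.
Sigma = (R ** 2 / d) * np.eye(d)
d_lam = np.trace(np.linalg.solve(Sigma + lam * np.eye(d), Sigma))

# Monte Carlo estimate of E Tr[(Sigma_n + lam)^{-1} Sigma].
vals = []
for _ in range(trials):
    X = rng.normal(size=(n, d))
    X = R * X / np.linalg.norm(X, axis=1, keepdims=True)
    Sigma_n = X.T @ X / n
    vals.append(np.trace(np.linalg.solve(Sigma_n + lam * np.eye(d), Sigma)))
mc = np.mean(vals)
```

The Monte Carlo average should lie between the lower bound Tr[(Σ + λ)^{-1}Σ] and the upper bound inflated by the factor 1 + R²/(λn).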
Next, we turn to controlling the bias term, in Lemma 3 below. The proof follows a recipe similar to that of Lemma 2.
Before deriving the excess risk bound, we express the bias term in a more explicit form.

Discussion
Comments on the bound (9). For λ ≥ cR²/n, the bound (9) is at most

C ( inf_{θ∈R^d} { L(θ) − L(θ*) + λ‖θ‖² } + σ² d_λ / n ),

where c, C are universal constants and d_λ := Tr[(Σ + λ)^{-1} Σ]. In addition, the first term above is at most λ‖θ*‖² (take θ = θ* instead of the infimum over θ). A bound in finite dimension can be deduced by letting λ ≍ R²/n and bounding d_λ ≤ d, which yields a O((σ²d + R²‖θ*‖²)/n) bound. Both the variance and the bias terms in (9) are distribution-dependent; they depend, respectively, on the spectrum of Σ (through the effective dimension d_λ) and on the approximation properties of Euclidean balls of R^d.
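The effective dimension d_λ = Tr[(Σ + λ)^{-1}Σ] is computed directly from the spectrum of Σ. The following sketch, using an illustrative toy spectrum, checks the elementary facts used above: d_λ ≤ d, d_λ ≤ Tr(Σ)/λ, and monotone decrease of d_λ in λ.

```python
import numpy as np

lam = 0.1
# A toy spectrum for Sigma (diagonal without loss of generality).
eigs = np.array([1.0, 0.5, 0.1, 0.01, 0.001])
d = len(eigs)

# Effective dimension: d_lambda = Tr[(Sigma + lam)^{-1} Sigma] = sum_j mu_j / (mu_j + lam).
d_lam = np.sum(eigs / (eigs + lam))
```

Each term μ_j/(μ_j + λ) lies in (0, 1) and is at most μ_j/λ, which gives the two bounds above.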
As shown by the lower bound of Lemma 2, the upper bound on the variance term from the decomposition of Lemma 1 is sharp up to universal constants in the regime n ≳ R²/λ. We note that the variance term from Lemma 2 is itself only an upper bound on the actual variance (obtained after bounding (Σ̂_n + λ)^{-1} Σ̂_n ≼ I), though it is generally of the correct order (an exception is the "interpolation" regime [5], where λ is very small or equal to 0 and n ≪ d). For instance, in the "nonparametric" regime d ≫ n, under a polynomial decay of the eigenvalues of Σ, namely Tr(Σ^{1/b}) ≤ B for some b > 1 and B > 0, one has d_λ ≤ 2Bλ^{-1/b}, and this gives the optimal variance under this assumption [7].
The bound on the bias term is somewhat less accurate (in particular, there is no matching lower bound in Lemma 3), though still of the correct order in some relevant regimes. Ideally, one may wish to replace Σ̂_n by Σ (at least for n large enough) in the bias term of Lemma 1, leading to a term L(θ_λ) − L(θ*), where θ_λ = arg min_{θ∈R^d} {L(θ) + λ‖θ‖²} = (Σ + λ)^{-1} Σ θ*. Instead, the bound (9) gives a term L(θ_λ) − L(θ*) + λ‖θ_λ‖², with an additional λ‖θ_λ‖² component. Roughly speaking, this extra term dominates L(θ_λ) − L(θ*) when θ* is highly aligned with the leading eigenvectors of Σ. Similarly to the variance term, a simple way to assess this bound is to consider the stylized nonparametric regime, with d large or infinite, and polynomial decay (this time, of the coefficients of θ* in the basis of eigenvectors of Σ). Specifically, assume as in [7] that ‖Σ^{(1−r)/2} θ*‖ ≤ ρ for some r > 0 and ρ > 0; the parameter r controls the rate of decay of the components of θ*. One can bound the bias term by C(ρ)λ^{min(r,1)}, while it is known (e.g., from [7]) that the actual bias of ridge can be bounded as C(ρ)λ^{min(r,2)} under these assumptions. The bound is therefore of the correct order for 0 < r ≤ 1, but suboptimal in the regime r > 1.
Additional comments and references. A possible approach to analyzing ridge regression is based on viewing it as empirical risk minimization and using tools from empirical process theory together with localization and fixed-point arguments [37,22,4,16]; see for instance [17, Example 2, p. 86], [23] and [38] for analyses in this spirit.
A direct approach is based on matrix concentration [29,21,1,26,35]. Results in this direction were first derived in [10] and then refined in a series of works [40,30,31,7]. In particular, optimal bounds depending on the effective dimension d λ were first derived in [7], see also [14,33,6]. In fact, two-sided matrix concentration is not necessary to control the error, and one-sided lower bounds on the sample covariance matrix suffice, see [27,20,9,24,43] for results of this type. In a different direction, a risk analysis of ridge regression (assuming that X is a sub-Gaussian random vector, see [18] for matrix concentration results in this case) covering more regimes of choices of θ * , Σ, λ, n can be found in [36].
The elementary analysis of ridge regression presented here does not explicitly rely on matrix concentration (or lower tail) results. The main arguments, namely exchangeability, matrix perturbation and matrix convexity, were also used in [25], but for different estimators and for conditional density estimation rather than regression. For ridge regression, an analysis related to the one presented in this note, based on average stability arguments, is proposed in [38]. Additional relevant works include [19,12], that use average stability to analyze empirical minimization in exp-concave statistical learning (with bounds depending on the dimension d). It is worth noting that, while exchangeability/leave-one-out/average stability arguments typically lead to simple and direct proofs, a shortcoming of these approaches is that they generally give in-expectation rather than deviation bounds.
Finally, a precise in-expectation finite-sample analysis of estimators based on stochastic gradient descent (SGD) can be found in [11] (extending results in [3]), see also [42,28,15] (and references therein) for more information on SGD for least squares. The arguments of [11] are direct and do not rely on matrix concentration either. This analysis is however specific to iterative methods such as stochastic gradient descent, and it is unclear whether it applies to ridge regression.

A Operator convexity and Sherman-Morrison's identity
In this appendix, we provide for the sake of completeness statements and short proofs of facts used in the proofs of Lemmas 2 and 3.
We start with (operator) convexity of the matrix inverse.
Lemma 5 (Lemma 2.7 in [8]). Let S be a symmetric, positive semi-definite matrix. Then, the map A → Tr(A^{-1}S), defined on the cone of positive-definite matrices, is convex.
Proof. By continuity of the map A → Tr(A^{-1}S) on its domain, it suffices to prove that it is midpoint-convex. Hence, it suffices to show that, for any positive-definite matrices A, B,

Tr[((A + B)/2)^{-1} S] ≤ (1/2) (Tr(A^{-1}S) + Tr(B^{-1}S)),

and, since S is positive semi-definite, it is enough to establish the operator inequality ((A + B)/2)^{-1} ≼ (A^{-1} + B^{-1})/2. Now, letting C = A^{-1/2} B A^{-1/2}, since

((A + B)/2)^{-1} = A^{-1/2} ((I + C)/2)^{-1} A^{-1/2}  and  (A^{-1} + B^{-1})/2 = A^{-1/2} ((I + C^{-1})/2) A^{-1/2},

it suffices to show that ((I + C)/2)^{-1} ≼ (I + C^{-1})/2. Up to conjugating with a rotation, one may assume that C is diagonal. In this case, the desired inequality follows from the convexity of the (scalar) inverse on (0, ∞), applied to each entry of the diagonal.
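The midpoint-convexity inequality at the heart of Lemma 5 can be spot-checked numerically on random positive-definite instances; the construction of the test matrices below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

def rand_pd(rng, d):
    # Random positive-definite matrix: M M^T + I.
    M = rng.normal(size=(d, d))
    return M @ M.T + np.eye(d)

# Midpoint convexity of A -> Tr(A^{-1} S) on a random instance:
# Tr[((A+B)/2)^{-1} S] <= (Tr(A^{-1} S) + Tr(B^{-1} S)) / 2.
S = rand_pd(rng, d)
A, B = rand_pd(rng, d), rand_pd(rng, d)
lhs = np.trace(np.linalg.solve((A + B) / 2, S))
rhs = 0.5 * (np.trace(np.linalg.solve(A, S)) + np.trace(np.linalg.solve(B, S)))
```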
We also use the following identity involving the inverse of rank-one perturbations of matrices, which follows from the Sherman-Morrison identity.
Lemma 6. Let S be a positive-definite d × d matrix, and let v ∈ R^d. Then, one has

⟨(S + vv^⊤)^{-1} v, v⟩ = ⟨S^{-1}v, v⟩ / (1 + ⟨S^{-1}v, v⟩).

Proof. We recall the Sherman-Morrison identity:

(S + vv^⊤)^{-1} = S^{-1} − (S^{-1} v v^⊤ S^{-1}) / (1 + ⟨S^{-1}v, v⟩),    (10)

which can be checked by multiplying both sides by S + vv^⊤. Applying (10) to v and taking the scalar product with v, we deduce that

⟨(S + vv^⊤)^{-1} v, v⟩ = ⟨S^{-1}v, v⟩ − ⟨S^{-1}v, v⟩² / (1 + ⟨S^{-1}v, v⟩) = ⟨S^{-1}v, v⟩ / (1 + ⟨S^{-1}v, v⟩).
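Both the Sherman-Morrison identity (10) and the rank-one quadratic-form identity it implies (in the form ⟨(S + vv^⊤)^{-1}v, v⟩ = ⟨S^{-1}v, v⟩/(1 + ⟨S^{-1}v, v⟩), as reconstructed from the proof of Lemma 2) can be verified numerically; the matrices below are arbitrary test data.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
M = rng.normal(size=(d, d))
S = M @ M.T + np.eye(d)  # positive definite
v = rng.normal(size=d)

Sinv = np.linalg.inv(S)
# Sherman-Morrison: (S + v v^T)^{-1} = S^{-1} - S^{-1} v v^T S^{-1} / (1 + <S^{-1} v, v>).
lhs = np.linalg.inv(S + np.outer(v, v))
rhs = Sinv - np.outer(Sinv @ v, Sinv @ v) / (1 + v @ Sinv @ v)

# Rank-one quadratic-form identity: <(S + v v^T)^{-1} v, v> = a / (1 + a), a = <S^{-1} v, v>.
q = (lhs @ v) @ v
q_formula = (v @ Sinv @ v) / (1 + v @ Sinv @ v)
```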