Nonlinear compressive reduced basis approximation for PDE’s



Introduction
The approximation of the solution of a parameterized partial differential equation (PDE): given $\mu$, find $u$ solution to $D(u;\mu)=0$, can benefit from the a priori analysis of the set of all solutions generated when the parameter $\mu$ is varied, that is,
$$ K := \{u_\mu \,:\, \mu\in P\}, $$
where $u_\mu$ is the solution for the given value $\mu=(\mu_1,\dots,\mu_d)$ of the parameter vector that ranges in some set $P\subset\mathbb R^d$. The set $K$ is also referred to as the solution manifold, since it may be thought of as a parameterized $d$-dimensional manifold, typically immersed in a Hilbert space $X$ where the solution to the PDE is well defined. In what follows, the norm and the scalar product in $X$ are respectively denoted by $\|\cdot\|_X$ and $\langle\cdot,\cdot\rangle_X$. Assuming $K$ to be compact in $X$, its Kolmogorov $m$-width, defined as
$$ d_m(K)_X := \inf_{\dim(X_m)\le m}\ \sup_{u\in K}\ \min_{v\in X_m}\|u-v\|_X, $$
describes how well the set can be approximated by an ideally selected (and usually out of reach) $m$-dimensional space. If $d_m$ has a certain rate of decay as $m\to\infty$, it is possible to practically construct low-dimensional spaces $X_m$ that perform with the same approximation rate, by pre-computing offline a reduced basis (RB) consisting of $m$ solutions associated with a well-chosen set of parameters. If $d_m$ has a fast rate of decay, the RB method yields an approximation of the solution for any parameter based on an algebraic system involving very few unknowns. We refer to [QMN16], [HRS15] or [MP20] for a presentation of RB methods.
If the parameters $\mu_i$ are considered as random variables, and thus $u_\mu$ is an $X$-valued random variable, a stochastic counterpart to these concepts is described by principal component analysis in the Hilbert space $X$, that is, the spectral analysis of the covariance operator $v\mapsto \mathbb E(\langle v,u_\mu\rangle_X\, u_\mu)$, once $u_\mu$ has been recentered so that $\mathbb E(u_\mu)=0$. Denoting by $\sigma_1\ge\sigma_2\ge\cdots$ the sequence of positive eigenvalues of this compact self-adjoint operator, and by $e_1,e_2,\dots$ the Karhunen-Loève orthonormal basis of eigenfunctions, it is well known that
$$ \kappa_m^2 := \min_{\dim(X_m)= m}\ \mathbb E\Big(\min_{v\in X_m}\|u_\mu-v\|_X^2\Big) = \sum_{j>m}\sigma_j, $$
and that the minimizing space is spanned by $e_1,\dots,e_m$. This is the starting point of the proper orthogonal decomposition (POD) method, which amounts to replacing the aforementioned eigenfunctions by approximations computed offline, based on a sufficiently large set of training solutions. These linear reduced modeling methods have penetrated industrial applications, a guarantee of their success. However, there are still cases where these approaches have difficulties, namely, when the Kolmogorov width $d_m$ or the eigenvalues $\sigma_m$ do not decay fast. This is in particular what happens for transport type problems. Even in the conceptually simple case of constant speed translation of an initial condition given by a step function, where the only parameter is the position of the discontinuity, it is well known that with $X=L^2$, the numbers $d_m$ and $\kappa_m$ decay slowly, like $O(m^{-1/2})$. In other words, for a target precision $\varepsilon$, the basis is of prohibitive size $O(\varepsilon^{-2})$.
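To make the offline POD step concrete, here is a minimal self-contained sketch (not the code of the paper; the grid, snapshot family and sizes are placeholders) that computes the POD modes and the truncation error estimate $\kappa_m^2$ from a matrix of snapshots by a singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)                 # spatial grid on (0,1)
M = 500                                        # number of training snapshots

# Toy snapshot family: translated step functions (one parameter, the position a)
snapshots = np.stack(
    [np.where((x >= a) & (x <= a + 0.3), 1.0, 0.0) for a in rng.uniform(0.0, 0.7, M)],
    axis=1,
)                                              # shape (n_grid, M)

centered = snapshots - snapshots.mean(axis=1, keepdims=True)   # recenter: E(u) ~ 0

# POD / discrete Karhunen-Loeve: left singular vectors of the centered snapshot matrix
modes, svals, _ = np.linalg.svd(centered, full_matrices=False)
sigma = svals**2 / M                           # eigenvalue estimates (up to the grid quadrature weight)

m = 10
kappa_m_sq = sigma[m:].sum()                   # estimate of kappa_m^2 = sum_{j>m} sigma_j
print(f"estimated kappa_{m}^2 = {kappa_m_sq:.3e}")
```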
For families of such functions, a substantial gain can be expected when searching for nonlinear reduced models. Prominent examples of nonlinear approximation include rational fractions, finite elements on adaptive grids of fixed cardinality, $n$-term approximations in a basis or dictionary, and neural networks, see [DeV98, DHP21] for a general treatment. In these methods, the "coordinates" describing the approximation to a function $u$ are typically nonlinear functionals applied to $u$, and the reconstruction map from such parameters is also nonlinear. In the frame of model reduction, we refer to [AZF12, GFTBM21, BCD+21], which consider libraries of affine reduced models, [BF22], which uses quadratic manifolds, and [LC20, FDM21, GGJW22, PDG+22, PWL23, BFM23] for neural network based approaches, see also [BIR18, CMS19, BSU20, HMC22], and [Peh22] for an overview of these nonlinear approaches.
Interestingly, it appears that an efficient approach to nonlinear model reduction is to maintain linear functionals for computing the coordinates, while performing the reconstruction in a well-chosen nonlinear way. This state of affairs is in particular illustrated by the development of compressed sensing over the last two decades, where signals are reconstructed from linear measurements by nonlinear methods promoting sparsity, such as $\ell_1$ minimization.
In this note, we begin by substantiating this idea more precisely in §2, by recalling and comparing certain notions of linear and nonlinear $m$-widths. We present in §3 a general approach that consists in taking as linear functionals the first components in a linear reduced model (RB or POD) that has been learned offline, and in also using the offline stage to learn a computationally tractable nonlinear map that reconstructs the missing components from these first ones to reach a better accuracy. One key aspect lies in the type of nonlinear maps that is allowed. This approach is analyzed in §4 in the case of a simple univariate model of step functions; it is illustrated by numerical tests for this model in §5.

Linear and nonlinear notions of m-widths
Generally speaking, the process of dimensionality reduction can be described by a pair of continuous mappings, the encoder $E : X\to\mathbb R^m$ and the decoder $D : \mathbb R^m\to X$.
The maximum distortion of the encoding-decoding procedure over $K$ is given by the quantity
$$ \sup_{u\in K}\ \|u - D(E(u))\|_X. $$
Then, for a general Banach space $X$ and a compact set $K\subset X$, we can define various notions of widths by optimizing the choice of $E$ and $D$ under specific restrictions:

• If $D$ and $E$ are both assumed to be linear, one obtains the approximation numbers
$$ a_m(K)_X := \inf_{\mathrm{rank}(L)\le m}\ \sup_{u\in K}\|u - Lu\|_X, $$
where the infimum is taken over operators $L$ of rank at most $m$.
• If only $D$ is assumed to be linear, one obtains the already mentioned Kolmogorov width $d_m(K)_X$. When $X$ is a general Banach space, the inequality $d_m(K)_X\le a_m(K)_X$ can be strict. Equality obviously holds in the case when $X$ is a Hilbert space, since best approximation in a subspace of $X$ of dimension $m$ is achieved by linear orthogonal projection.
• The sensing numbers $s_m(K)_X$ correspond to the reciprocal situation, where $E$ is assumed to be linear and $D$ is allowed to be nonlinear. In other words, they can be defined as
$$ s_m(K)_X := \inf\ \sup_{u\in K}\ \|u - D(\lambda_1(u),\dots,\lambda_m(u))\|_X, $$
where the infimum is taken over all choices of linear functionals $\lambda_1,\dots,\lambda_m\in X'$ and of the decoding map $D$. These numbers are closely related to the Gelfand widths, classically defined as
$$ d^m(K)_X := \inf_{\mathrm{codim}(F)\le m}\ \sup\{\|u\|_X \,:\, u\in K\cap F\}. $$
It is easily checked that $s_m(K)_X = d^m(K)_X$ in the case where $K$ is convex and centrally symmetric, and that $s_m(K)_X \le d^m(K-K)_X$ for a general compact set $K$, where $K-K$ is a notation for the set $\{u-v \,:\, u,v\in K\}$.

• Finally, the nonlinear width or manifold width $\delta_m(K)_X$ is defined when no other assumption but continuity is made on the operators $E$ and $D$. For numerical stability purposes, it is interesting to tame this notion by imposing that $D$ and $E$ are both Lipschitz continuous, that is,
$$ \|E(u)-E(v)\|_m \le \gamma\,\|u-v\|_X \quad\text{and}\quad \|D(x)-D(y)\|_X \le \gamma\,\|x-y\|_m, $$
for some fixed $\gamma>1$, with $\|\cdot\|_m$ an arbitrary norm on $\mathbb R^m$. The resulting infimized quantity $\delta^\gamma_m(K)_X$ is referred to as the stable width.
The last two notions of width $s_m$ and $\delta_m$ (or $\delta^\gamma_m$) are natural to describe the expected performance of optimal nonlinear model reduction, since the manifold is approximated by the set $D(\mathbb R^m)$, which is no longer a linear space. However, the sensing numbers take the view that encoding can be restricted to simple linear measurements.
As already mentioned, the quantities $d_m$ and $a_m$ typically decay slowly for families of piecewise smooth functions, which reflects the fact that such families cannot be approximated efficiently by linear spaces. A substantial gain in the rate of decay can be expected, however, when considering the nonlinear widths $\delta_m$ and $\delta^\gamma_m$. Interestingly, it appears that this substantial optimal gain is already present when considering the sensing numbers $s_m$.
As a basic example, consider the two-parameter family of univariate step functions
$$ K := \{\chi_{[a,a+\ell]} \,:\, 0\le a\le a+\ell\le 1\}. $$
Clearly, the parameters $(a,\ell)$ are not linear functionals of $u$. However, any $u\in K$ can be exactly reconstructed from two linear functionals, namely, the first moments
$$ \lambda_0(u) := \int_0^1 u(x)\,dx \quad\text{and}\quad \lambda_1(u) := \int_0^1 x\,u(x)\,dx. $$
Indeed, $\lambda_0 = \ell$ and $\lambda_1 = \frac12\ell(2a+\ell)$, so that $a$ and $\ell$ can be exactly recovered from such data. Therefore, one has $s_m(K)_X = 0$ for any $m\ge 2$ and for any Banach space $X$.
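As a quick numerical illustration (a sketch with hypothetical values, relying only on the two identities above), the pair $(a,\ell)$ can be recovered from the two moments computed by quadrature:

```python
import numpy as np

# u = chi_[a, a+l] on (0,1); recover (a, l) from lambda_0 = int u and
# lambda_1 = int x*u(x) dx, using the identities lambda_0 = l and
# lambda_1 = l*(2a + l)/2 quoted above.
a_true, l_true = 0.23, 0.41
x = np.linspace(0.0, 1.0, 200001)
u = ((x >= a_true) & (x <= a_true + l_true)).astype(float)

lam0 = u.mean()                    # Riemann sum on the uniform grid ~ integral of u
lam1 = (x * u).mean()              # ~ integral of x*u(x)

l_rec = lam0
a_rec = lam1 / lam0 - lam0 / 2.0   # invert lambda_1 = a*l + l^2/2
print(a_rec, l_rec)                # ~ (0.23, 0.41) up to quadrature error
```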
At a more general level, it was proved in [CDPW22] that when $X$ is a Hilbert space, both $s_m(K)_X$ and $\delta^\gamma_m(K)_X$ are tied to the so-called entropy numbers $\varepsilon_m(K)_X$, defined as the smallest value of $\varepsilon>0$ such that $K$ can be covered by $2^m$ balls of radius $\varepsilon$. More precisely, it was shown that, on the one hand, for any $s>0$, one has a Carl-type inequality
$$ \sup_{m\ge1} m^s\,\varepsilon_m(K)_X \le C_s\, \sup_{m\ge1} m^s\,\delta^\gamma_m(K)_X, $$
where $C_s$ depends on $(s,\gamma)$, and that, on the other hand, there exists $c>0$ depending on $\gamma>1$ such that
$$ \delta^\gamma_{cm}(K)_X \le 3\,\varepsilon_m(K)_X, \qquad m\ge 1. $$
In the proof of this last inequality, the $\gamma$-stable encoding-decoding pair $(E,D)$ which is constructed actually has a linear $E$. In turn, one also has
$$ s_{cm}(K)_X \le 3\,\varepsilon_m(K)_X, \qquad m\ge 1. $$
One consequence of these results is that $s_m(K)_X$, $\delta^\gamma_m(K)_X$ and $\varepsilon_m(K)_X$ share the same algebraic rates of decay.
Remark 1. An additional aspect of non-linear dimensionality reduction is the notion of adaptivity, which means that the measurements $E(u) = (E_1(u),\dots,E_m(u))$ are chosen incrementally, that is, the functional $E_j$ is picked depending on the values of $E_1(u),\dots,E_{j-1}(u)$. This allows the definition of similar notions of adaptive sensing numbers and non-linear widths. The approach described next is not of this form, since we use linear functionals that are pre-defined through the standard POD or RB analysis.

Non-linear compressive Reduced Basis approximation
In this contribution, we thus intend to deal with situations where:

• The Kolmogorov widths $d_m(K)_X$, or the singular values $\sigma_m$, decay slowly.

• The sensing numbers $s_m(K)_X$, and the stable non-linear widths $\delta^\gamma_m(K)_X$, decay much faster.

In other words, a target accuracy $\varepsilon>0$ can be reached by $d_N(K)_X$ or $\kappa_N$, however with a dimension $N=N(\varepsilon)$ much larger than the value $n=n(\varepsilon)$ for which $s_n(K)_X$ reaches the same accuracy.
Since the optimal linear functionals in the definition of $s_n(K)_X$ are usually out of reach and could be computationally impractical to apply, we take the view of fixing these measurements to be a small number $n$ of components in the offline computed (orthonormalized) RB or POD basis $(e_j)_{j=1,\dots,N}$ for some $N\gg n$. Typically, we choose the $n$ first ones, that is,
$$ \lambda_j(v) := \langle v, e_j\rangle_X, \qquad j=1,\dots,n. $$
Intuitively, it is expected that, in the situation where $s_n(K)_X$ is very small, the unknown components $(\lambda_j(v))_{j=n+1,\dots,N}$ should somehow depend, up to a small error, on the $n$ first ones that carry most of the relevant information. This idea was first presented in [BFM23]. Here, we formalize it and study its validity in detail on a simple step function model, and propose a general numerical strategy that we test on this model.
Our objective is thus to predict from these first components the extra components $\lambda_k(v)$ for $k=n+1,\dots,N$ that are needed to approximate the functions $v\in K$ with target accuracy $\varepsilon$. We are thus interested in constructing $N-n$ functions $\psi_k : \mathbb R^n\to\mathbb R$ so that
$$ \tilde\lambda_k(v) := \psi_k(\lambda_1(v),\dots,\lambda_n(v)) $$
is a very accurate approximation to $\lambda_k(v)$ for $k=n+1,\dots,N$ and can be computed fast.
Let us stress that $\psi_k$ should typically be a nonlinear function. Indeed, consider the ideal case of the PCA basis computed after having recentered the variable $u_\mu$. Then the variables $z_j := \lambda_j(u_\mu)$ are uncorrelated and centered, such that, for any $k>n$,
$$ \min_{\alpha_1,\dots,\alpha_n}\ \mathbb E\Big(\big|z_k - \sum_{j=1}^n \alpha_j z_j\big|^2\Big) = \mathbb E(|z_k|^2). $$
Thus, the best choice of a linear function would be the null one, which does not deliver any information.
On the other hand, the best choice of a nonlinear function in this mean square sense, that is, minimizing $\mathbb E(|z_k-\psi(z_1,\dots,z_n)|^2)$ over all functions $\psi$, is given by the conditional expectation
$$ \psi_k^*(z_1,\dots,z_n) := \mathbb E\big(z_k \,|\, z_1,\dots,z_n\big), $$
which is out of reach and should be approximated by a computationally tractable function.
Our practical approach to the construction of $\psi_k$ is to learn it in a second step of the offline stage, after the basis $(e_j)_{j=1,\dots,N}$ has been identified. Having in mind the above mean square loss, one typical approach is to select $\psi_k$ within a sufficiently rich class $\mathcal F$ of nonlinear functions by empirical risk minimization: with $(u_i)_{i=1,\dots,M}$ a training set of random snapshots $u_i = u_{\mu_i}$, we define
$$ \psi_k := \mathop{\mathrm{argmin}}_{\psi\in\mathcal F}\ \frac1M \sum_{i=1}^M \big|\lambda_k(u_i) - \psi(\lambda_1(u_i),\dots,\lambda_n(u_i))\big|^2. $$
A critical aspect in this approach lies in the choice of the class $\mathcal F$, which could be, for example, the set of:

• Quadratic functions, as in [BF22] or [GWW23].
• Polynomials of some higher degree d > 2.
• Neural networks with a given architecture, as proposed in [BFM23] (see also [LC20], where an autoencoder-based approximation was proposed, in a way a pioneering idea but unfortunately not a computationally tractable one).
This class should be able to approximate correctly the ideal but out of reach $\psi_k^*$ by a computationally tractable function $\psi_k\in\mathcal F$. Another difficulty with this approach is the fact that when the number $n$ of informative components is chosen to be not very small, one faces a regression problem in high dimension, for which classical methods such as splines or polynomials are known to suffer from the curse of dimensionality.
For these reasons, we have also considered in our numerical tests regression methods based on trees (CART) and random forests, which are both universally consistent and able to tackle high-dimensional problems. These methods seem to deliver the best numerical results for the considered problems.
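For concreteness, here is a minimal sketch (illustrative names and sizes, not the code of the paper) of this second offline step with scikit-learn, fitting one random forest regressor $\psi_k$ per extra component from the coefficient matrix of the training snapshots:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_extra_coefficients(coeffs, n):
    """coeffs: (M, N) array with coeffs[i, j] = lambda_{j+1}(u_i).
    Returns one regressor psi_k per component k = n+1, ..., N."""
    X = coeffs[:, :n]                          # observed components (lambda_1, ..., lambda_n)
    models = {}
    for k in range(n, coeffs.shape[1]):
        models[k] = RandomForestRegressor(n_estimators=100).fit(X, coeffs[:, k])
    return models

def predict_extra_coefficients(models, observed):
    """observed: (n,) array of measured components for a new snapshot."""
    obs = np.asarray(observed).reshape(1, -1)
    return np.array([models[k].predict(obs)[0] for k in sorted(models)])
```

The polynomial variants of the empirical risk minimization can be obtained in the same way, for instance by replacing the forest with a scikit-learn pipeline combining PolynomialFeatures and LinearRegression.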

Analysis of a model framework: periodic step functions
In order to investigate the aforementioned questions, we place ourselves in a framework where the Karhunen-Loève basis is explicitly known. Specifically, we work in the Hilbert space $X=L^2(0,1)$, and consider a randomly parameterized family $(u_\mu)$ such that
$$ \mathrm{Cov}\big(u_\mu(x),u_\mu(y)\big) = R(x-y), $$
where $R$ is an even and 1-periodic function. In other words, $u_\mu$ is a periodic stationary process, its covariance operator coincides with the convolution operator by $R$, and therefore its Karhunen-Loève basis is exactly given by the basis of the Fourier series on $[0,1]$ (see e.g. [PUP02]).
More specifically, we consider a simple model of periodic stationary step functions by introducing the three-parameter family
$$ u_\mu := b\,\chi_{[a,a+\ell]}, \qquad \mu=(a,b,\ell), $$
where the indicator is understood in a 1-periodic sense (it wraps around the endpoint of $[0,1]$ when $a+\ell>1$).
Here $a$, $\ell$, and $b$ are assumed to have independent uniform distributions. Taking the base point $a$ to be uniformly distributed over $[0,1]$, it is easily checked that the process is periodic stationary. In addition, we take the height $b$ to be uniformly distributed in $[0,1]$ and the length $\ell$ to be uniformly distributed in $[\ell_{\min}, 1-\ell_{\min}]$ for some $0<\ell_{\min}<\frac12$. The best linear approximation of dimension $m=2n+1$ is thus given by the truncation up to $k\le n$ of the Fourier expansion
$$ u_\mu(x) = \alpha_0 + \sum_{k\ge1}\big(\alpha_k\cos(2\pi kx) + \beta_k\sin(2\pi kx)\big), \qquad (3) $$
where
$$ \alpha_0 = \int_0^1 u_\mu(x)\,dx,\qquad \alpha_k = 2\int_0^1 u_\mu(x)\cos(2\pi kx)\,dx,\qquad \beta_k = 2\int_0^1 u_\mu(x)\sin(2\pi kx)\,dx. $$
It is also easily checked that the eigenvalues associated to the functions $x\mapsto\cos(2\pi kx)$ and $x\mapsto\sin(2\pi kx)$ are the same and behave like
$$ \sigma_k \sim c\,k^{-2}, $$
for some $c=c(\ell_{\min})>0$. It follows that the best linear approximation has a mean-square error $\kappa_m^2$ behaving like $m^{-1}$. Note that, for the corresponding manifold
$$ K := \{u_\mu \,:\, \mu=(a,b,\ell)\in[0,1]\times[0,1]\times[\ell_{\min},1-\ell_{\min}]\}, $$
one obviously has $d_m(K)_X\ge\kappa_m$, since a worst case error dominates the average error. On the other hand, it is also readily seen that the worst case approximation error by Fourier series behaves like $m^{-1/2}$, and therefore
$$ d_m(K)_X \sim m^{-1/2}. $$
We also consider the two-parameter family $K_{\ell_0}$ obtained by freezing the value $\ell=\ell_0$, and the one-parameter family $K_{b_0,\ell_0}$ obtained by freezing in addition the value $b=b_0$. It is easily checked that one has the same behaviour $m^{-1/2}$ for $\kappa_m$ and $d_m$ after such restrictions. The one-parameter family $K_{b_0,\ell_0}$ can be encoded by the data of $a$ alone, so that its nonlinear width satisfies $\delta_m(K_{b_0,\ell_0})_X = 0$ for $m>1$.
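Under the coefficient normalization assumed above ($\alpha_k = 2\int_0^1 u\cos(2\pi kx)\,dx$, $\beta_k = 2\int_0^1 u\sin(2\pi kx)\,dx$), the coefficients of a snapshot $b\chi_{[a,a+\ell]}$ have a simple closed form, which the following sketch computes and checks against quadrature:

```python
import numpy as np

def step_coeffs(a, b, l, k):
    """Closed-form Fourier coefficients (alpha_k, beta_k) of b*chi_[a,a+l] (1-periodic),
    assuming alpha_k = 2*int u cos(2 pi k x) dx and beta_k = 2*int u sin(2 pi k x) dx."""
    if k == 0:
        return b * l, 0.0
    s = np.sin(np.pi * k * l)
    phase = 2.0 * np.pi * k * (a + l / 2.0)
    return 2 * b / (np.pi * k) * s * np.cos(phase), 2 * b / (np.pi * k) * s * np.sin(phase)

# sanity check against a Riemann sum for a wrapped-around step
a, b, l, k = 0.8, 0.7, 0.35, 3
x = np.linspace(0.0, 1.0, 200001)
u = b * (((x - a) % 1.0) <= l).astype(float)
alpha_q = 2.0 * (u * np.cos(2 * np.pi * k * x)).mean()
beta_q = 2.0 * (u * np.sin(2 * np.pi * k * x)).mean()
print(step_coeffs(a, b, l, k), (alpha_q, beta_q))   # the two pairs should agree closely
```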
It is easily seen that the data of only one Fourier coefficient is not sufficient to characterize the elements of this family. Indeed, for the one-parameter family $K_{b_0,\ell_0}$,
$$ \alpha_1 = \frac{2b_0}{\pi}\sin(\pi\ell_0)\cos\big(2\pi(a+\tfrac{\ell_0}2)\big), $$
which takes the same value at the two parameters $a$ and $a'$ such that $a+a'+\ell_0\in\mathbb Z$. On the other hand, the recovery of $a$ can be done through the data of the two coefficients $\alpha_1$ and
$$ \beta_1 = \frac{2b_0}{\pi}\sin(\pi\ell_0)\sin\big(2\pi(a+\tfrac{\ell_0}2)\big), $$
since
$$ a = \frac{1}{2\pi}\,\mathrm{arg}(\alpha_1+i\beta_1) - \frac{\ell_0}{2} \pmod 1. \qquad (5) $$
Similarly, any element in the two-parameter family $K_{\ell_0}$, parametrized by $a$ and $b$, can be recovered from the data of these two coefficients, since one also has
$$ b = \frac{\pi\sqrt{\alpha_1^2+\beta_1^2}}{2\sin(\pi\ell_0)}. $$
When more coefficients are available, we note that there is not a unique reconstruction map: for example, from the three coefficients $\alpha_0$, $\alpha_1$, and $\beta_1$, we can also recover $b$ according to
$$ b = \frac{\alpha_0}{\ell_0}. $$
Finally, in the case of the three parameter family $K$, exact recovery of $(a,b,\ell)$ can be obtained by solving the nonlinear system
$$ \alpha_0 = b\ell,\qquad \alpha_1 = \frac{2b}{\pi}\sin(\pi\ell)\cos\big(2\pi(a+\tfrac\ell2)\big),\qquad \beta_1 = \frac{2b}{\pi}\sin(\pi\ell)\sin\big(2\pi(a+\tfrac\ell2)\big), \qquad (6) $$
however the exact recovery map does not anymore have an explicit form. These exact recovery procedures induce, for all $k>1$, an exact recovery map $\psi_k^*$ expressing $\alpha_k$ as a function of the measured coefficients, and similarly an exact recovery map $\tilde\psi_k^*$ for $\beta_k$. In this very simple case, the success of the learning strategy outlined in the previous section therefore depends on how these maps can be approximated by the family $\mathcal F$.
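A small sketch (same assumed normalization as above) of the explicit recovery maps for $K_{\ell_0}$: the modulus of $\alpha_1+i\beta_1$ gives $b$ and its argument gives $a$, as in (5):

```python
import numpy as np

def recover_a_b(alpha1, beta1, l0):
    """Recover (a, b) for the family K_{l0} from (alpha_1, beta_1), assuming
    alpha_1 + i beta_1 = (2b/pi) sin(pi l0) exp(2i pi (a + l0/2))."""
    b = np.pi * np.hypot(alpha1, beta1) / (2.0 * np.sin(np.pi * l0))
    theta = np.arctan2(beta1, alpha1)                  # = 2*pi*(a + l0/2) modulo 2*pi
    a = (theta / (2.0 * np.pi) - l0 / 2.0) % 1.0
    return a, b

# round trip on the closed-form coefficients of a step function
a, b, l0 = 0.62, 0.45, 0.5
z = 2 * b / np.pi * np.sin(np.pi * l0) * np.exp(2j * np.pi * (a + l0 / 2))
print(recover_a_b(z.real, z.imag, l0))                 # -> (0.62, 0.45)
```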
A simple intuition can be given by looking at the particular case of $\psi_k^*$ for the one-parameter family $K_{b_0,\ell_0}$, when $b_0=1$ and $\ell_0=\frac12$. Then, we find that
$$ \alpha_k = \frac{2}{\pi k}\sin\big(\tfrac{\pi k}{2}\big)\cos\big(2\pi k(a+\tfrac14)\big), $$
which is null for even values of $k$, and for odd values $k=2j+1$ satisfies
$$ \alpha_k = \frac{2(-1)^j}{\pi k}\, T_k\Big(\frac{\pi\alpha_1}{2}\Big), $$
where $T_k$ is the Chebychev polynomial of degree $k$. Therefore, for such values, the optimal reconstruction is exact and given by
$$ \psi_k^*(\alpha_1,\beta_1) = \frac{2(-1)^j}{\pi k}\, T_k\Big(\frac{\pi\alpha_1}{2}\Big). $$
Clearly, the function $\psi_k^*$ cannot be well approximated by polynomials of moderate degree for large values of $k$. On the other hand, it is well known that the derivative of $T_k$ has maximal norm of order $k$ over $[-1,1]$, and this implies that the functions $\psi_k^*$ are Lipschitz continuous with Lipschitz constant bounded independently of $k>0$.
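As a quick check of this Chebychev structure (again under the assumed normalization), the identity can be verified numerically:

```python
import numpy as np

k = 7                                                     # odd: k = 2j+1 with j = 3
a = np.random.default_rng(1).uniform(0.0, 1.0, size=5)    # a few random positions

alpha1 = (2 / np.pi) * np.cos(2 * np.pi * (a + 0.25))
alphak = (2 / (np.pi * k)) * np.sin(np.pi * k / 2) * np.cos(2 * np.pi * k * (a + 0.25))

Tk = np.polynomial.chebyshev.Chebyshev.basis(k)           # Chebyshev polynomial T_k
pred = (-1) ** ((k - 1) // 2) * (2 / (np.pi * k)) * Tk(np.pi * alpha1 / 2)
print(np.max(np.abs(alphak - pred)))                      # ~ machine precision
```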
This property holds in more generality, as can be seen from the following argument: the derivative of the arctan function being upper bounded by 1, the recovery of $a$ from $\alpha_1(a,\ell_0,b_0)$ and $\beta_1(a,\ell_0,b_0)$ is stable, as the partial derivatives $\frac{\partial a}{\partial\alpha_1}$ and $\frac{\partial a}{\partial\beta_1}$ are both bounded since, by construction,
$$ \alpha_1^2+\beta_1^2 = \Big(\frac{2b_0\sin(\pi\ell_0)}{\pi}\Big)^2 $$
is bounded away from zero. Hence, an error in the values of $\alpha_1$ or $\beta_1$ will induce an error of comparable size on $a$. The same holds for the determination of $b$ in the case of $K_{\ell_0}$. On the other hand, it is readily seen from the definition of the Fourier coefficients that the two maps
$$ a\mapsto\alpha_k \quad\text{and}\quad a\mapsto\beta_k $$
are Lipschitz continuous with Lipschitz constants bounded independently of $k>0$. In turn, the stable recovery of $a$ and $b$ from $\alpha_0$, $\alpha_1$ and $\beta_1$ induces recovery maps $\psi_k^*$ and $\tilde\psi_k^*$ that are Lipschitz continuous with Lipschitz constants bounded independently of $k>0$. This state of affairs explains why universally consistent methods such as random forests are well adapted to the joint approximation of $\psi_k^*$ and $\tilde\psi_k^*$, while approaches based on low order polynomials are doomed to fail. This is confirmed by the numerical tests presented in the next section.
In the perspective of recovering more general piecewise smooth functions, we expect that the low-order components are affected by the smooth pieces in addition to the jumps, while the high-order components are only affected by the jumps. Thus it is interesting to address the question of the recovery of the parameters of the step function from a few components $\alpha_k$ and $\beta_k$ for larger values of $k$.
This task appears to be more involved and to require more coefficients. For example, when recovering $a$ as in (5), but now from the pair $(\alpha_k,\beta_k)$, we find
$$ a = \frac{1}{2\pi k}\,\mathrm{arg}(\alpha_k+i\beta_k) - \frac{\ell_0}{2} \pmod{1/k}, \qquad (11) $$
up to a known shift of $\frac{1}{2k}$ when $\sin(\pi k\ell_0)<0$, so that $a$ is only determined modulo $1/k$. One possibility to lift the indeterminacy (mod $1/k$) is to combine the information coming from (11) and its analogue at frequency $k+1$, since $a=a'$ (mod $1/k$), $a=a'$ (mod $1/(k+1)$), and $a,a'\in(0,1)$ imply $a=a'$. Thus we can in principle recover the parameters $a$ and $b$ out of 4 coefficients of arbitrarily high frequencies $(k,k+1)$. However, we also observe that the stability of this recovery is deteriorated, since $\frac{\partial a}{\partial\alpha_k}$ and $\frac{\partial a}{\partial\beta_k}$ increase linearly with $k$. We may hope to improve the stability by using a larger number of coefficient values, with indices $(k,k+1,\dots,k+d)$ for some $d>1$.
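The following sketch (illustrative values, same assumed normalization) shows the frequency-combination idea for $K_{\ell_0}$: each pair $(\alpha_k,\beta_k)$ gives $a$ modulo $1/k$, and matching the candidate grids at frequencies $k$ and $k+1$ singles out $a$:

```python
import numpy as np

def coeffs(a, b, l, k):
    z = 2 * b / (np.pi * k) * np.sin(np.pi * k * l) * np.exp(2j * np.pi * k * (a + l / 2))
    return z.real, z.imag

def a_mod(alpha_k, beta_k, l0, k):
    """Value of a modulo 1/k deduced from (alpha_k, beta_k)."""
    theta = np.angle(alpha_k + 1j * beta_k)
    if np.sin(np.pi * k * l0) < 0:            # absorb the sign of sin(pi*k*l0)
        theta += np.pi
    return (theta / (2 * np.pi * k) - l0 / 2) % (1.0 / k)

a_true, b_true, l0, k = 0.8134, 0.6, 0.3, 7
ak = a_mod(*coeffs(a_true, b_true, l0, k), l0, k)
ak1 = a_mod(*coeffs(a_true, b_true, l0, k + 1), l0, k + 1)

cand_k = ak + np.arange(k) / k                 # candidates for a modulo 1/k
cand_k1 = ak1 + np.arange(k + 1) / (k + 1)     # candidates for a modulo 1/(k+1)
i, j = np.unravel_index(np.argmin(np.abs(cand_k[:, None] - cand_k1[None, :])), (k, k + 1))
print(cand_k[i])                               # ~ 0.8134
```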

Numerical illustrations
In this section, we investigate the ability of different methods to learn the mappings $\psi_k$ and $\tilde\psi_k$ that predict $(\alpha_k,\beta_k)$ from a number $m$ of observed components, namely $m=2$, $3$ and $5$, for each of the three families $K_{b_0,\ell_0}$ (Figures 1 and 2), $K_{\ell_0}$ (Figures 3 and 4), and $K$ (Figures 5 and 6), for all $2\le k\le 500$. In these figures, the average recovery errors for the $\alpha_k$ and the $\beta_k$ are presented in a symmetric manner, on the left and right side of the $x$-axis respectively.
The learning methods are:

• linear regression: F is the set of linear functions.
• quadratic regression: F is the set of polynomials of total degree 2.
• quartic regression: F is the set of polynomials of total degree 4.

• decision tree [HTF09]

• random forest
For all the numerical illustrations we used Python 3.8 and scikit-learn 1.2.2 [PVG+11] for the implementation of each of the regression methods described above. The code can be found at https://github.com/agussomacal/NonLinearRBA4PDEs.

As can be expected, linear regression gives the same results as the null forecast, and quadratic and quartic regression give the same (bad) results here, with some improvement over the null forecast only for very small values of k in certain cases (see Figures 3 and 4).
In contrast, decision trees and random forests are well suited, as can be seen for the one-parameter family in Figure 1, with improved results in Figure 2 obtained from a larger training sample (10 000 rather than 1 000). This reflects the universal consistency of these methods, which are guaranteed to converge towards the regression function as the number of samples tends to $+\infty$.
The same also holds for the two-parameter family, as seen in Figures 3 and 4: the problem is slightly more involved, but nevertheless decision trees and random forests manage to obtain a fair (resp. good) approximation after a learning phase of 1 000 (resp. 10 000) training samples. The numerical results for the three-parameter family are displayed in Figure 5 for the range $\ell\in[0.4,0.6]$, and in Figure 6 for the range $\ell\in[0.01,0.99]$, that is, $\ell_{\min}=0.4$ and $0.01$ respectively. A first observation is that all methods fail in the case of $m=2$ known components, since these are insufficient to characterize an element of $K$. Secondly, we observe that the performance deteriorates as $\ell$ is allowed to be very small, in which case the exact recovery becomes less stable in view of the multiplicative coupling between $b$ and $\ell$ in the last two equations of the system (6). These last results reveal the difficulty for these overly simple non-linear recovery methods to achieve a satisfactory recovery. We expect that more involved approaches such as deep neural networks can improve this state of affairs.
Finally, in the case of the two-parameter family ($\ell$ fixed), we study the recovery error of $(\alpha_k,\beta_k)$ for $k>10+d$ when using the information from the high frequency coefficients $(\alpha_j,\beta_j)$ for $j=10,11,\dots,10+d$. As explained at the end of the previous section, the exact recovery is feasible already for $d=1$, yet it is less stable and thus more difficult to learn. This is confirmed by Figure 7, where we see an improvement when using a larger value of $d$ and a larger training sample.

Figure 1: In this figure we plot the error obtained from different recovery methods for the family $K_{b=1,\ell=0.5}$, where we recover all coefficients $\alpha_k$ and $\beta_k$ in (3) for $2\le k\le 500$ from 2 (left), 3 (center) and 5 (right) Fourier coefficients with different approaches: linear, quadratic, quartic, tree and random forests. Note that linear, quadratic and quartic are superposed and do not improve over the trivial recovery of the missing modes by the value 0. The learning phase is based on 1 000 training samples. The $x$-axis represents (in a log scale) the index $k$ of $\alpha_k$ and the index $k$ of $\beta_k$, and the $y$-axis the mean-square reconstruction error on the mode.

Figure 3: Same test as Figure 1 for the two parameter family K ℓ=.5 , using 1 000 training samples.

Figure 4: Same test as Figure 3 for the two parameter family K ℓ=.5 , using 10 000 training samples.

Figure 5: Same test as Figure 1 for the three parameter family $K$, using 10 000 training samples and $\ell_{\min}=0.4$.

Figure 7: Recovery of the components $(\alpha_k,\beta_k)$ for $k>10+d$ for the two parameter family $K_{\ell=0.5}$, using random forests and the components $(\alpha_j,\beta_j)$ for $j=10,11,\dots,10+d$, with $d=4$ (left) and $d=9$ (right), and various numbers of training samples.