On the Christoffel function and classification in data analysis

We show that the empirical Christoffel function associated with a cloud of finitely many points sampled from a distribution can provide a simple tool for supervised classification in data analysis, with good generalization properties.


Introduction
In this note we are mainly concerned with supervised classification with noiseless deterministic labels, where the objects of interest $x \in X$ belong to $m$ classes with supports $X_j \subset X \subset \mathbb{R}^n$, $j \in [m]$ (with $[m] = \{1, \ldots, m\} =: Y$). The supports satisfy $X_i \cap X_j = \emptyset$ for all $i \neq j$. The data set consists of clouds of finitely many points $(x(i)) \subset X_j$ sampled from an underlying distribution $\phi_j$ on $X_j$, $j \in [m]$. In this situation, an exact classifier $f : X \to Y$ selects $j =: f(x)$ whenever $x \in X_j$. When constructing a classifier from a sample of data points, as e.g. in machine learning, a sensitive issue is its generalization properties when applied to a test set different from the training set. For the reader interested in recent developments on various techniques and issues in supervised and unsupervised classification, we refer to e.g. the book [1] and the many references therein.

Contribution. We first introduce a simple and natural ideal classifier $f_t : X \to Y$ with nice asymptotic properties as $t$ increases. It is based on the Christoffel function $\Lambda^\mu_t$ associated with the joint distribution $d\mu(x, y)$ on $X \times Y$. As $\mu$ is supported on the graph $\{(x, f(x)) : x \in X\}$ of the exact classifier $f$, recent results of [5], transported to our context, suggest that the classifier $f_t(x) := \arg\max_{y \in Y} \Lambda^\mu_t(x, y)$ should approximate $f$ nicely. This is indeed the case and, by a slight modification of the definition of $\Lambda^\mu_t$, we show that $f_t$ is simply expressed in terms of the Christoffel functions $\Lambda^{\phi_j}_t$ of the $\phi_j$; namely, $f_t(x) = \arg\max_k \Lambda^{\phi_k}_t(x)$. Notice that this simple form of $f_t$ provides a mathematical justification, for supervised classification, of the intuitive argument that $\Lambda^{\phi_j}_t(x) > \Lambda^{\phi_k}_t(x)$ for all $k \neq j$, whenever $t$ is sufficiently large and $x \in X_j$. Indeed, as $x \in X_j$ lies outside the support $X_k$ of $\phi_k$ for every $k \neq j$, the "score" $\Lambda^{\phi_k}_t(x)$ is close to zero for sufficiently large $t$, as it decreases exponentially fast to zero (while the decrease of $\Lambda^{\phi_j}_t(x)$ is at most polynomial in $t$).

We next consider the practical case where we only have access to a discrete sample of points in each class $X_j$ (e.g., the training set in machine learning), so that $\Lambda^{\phi_j}_t$ is not available. We provide a data-driven analogue of the previous result which, as expected, is in terms of the Christoffel functions $\Lambda^{\phi_{j,N}}_t$ associated with the discrete empirical measures $\phi_{j,N}$. Namely, the empirical discrete analogue $f^N_t$ of the classifier $f_t$ simply reads $f^N_t(x) = \arg\max_k \Lambda^{\phi_{k,N}}_t(x)$, and it has the same properties as $f_t$, but of course in an almost-sure sense with respect to the random samples. In particular it exhibits good generalization properties. Indeed, with $\varepsilon > 0$ fixed and $t$ sufficiently large, with probability 1 (with respect to random samples), $f^N_t(x) = j$ for every $j \in [m]$ and all $x \in X_j$ at distance at least $\varepsilon$ from the boundary $\partial X_j$, for sufficiently large $N$.

Finally, we also briefly discuss more general joint distributions of pairs $(x, y)$ (where possibly $X_i \cap X_j \neq \emptyset$ for some $(i, j)$), which cover practical cases where some misclassification may occur and/or some ambiguity is allowed.

Research supported by the AI Interdisciplinary Institute ANITI through the French program "Investing for the Future PI3A" under grant agreement ANR-19-PI3A-0004. The author is also affiliated with the IPAL-CNRS laboratory, Singapore.
1.1. Notation, definitions and preliminary results. Let $\mathbb{R}[x]$ denote the ring of real polynomials in the variables $x = (x_1, \ldots, x_n)$ and let $\mathbb{R}[x]_t \subset \mathbb{R}[x]$ be its subset of polynomials of total degree at most $t$. Let $\mathbb{N}^n_t := \{\alpha \in \mathbb{N}^n : |\alpha| \leq t\}$ (where $|\alpha| = \sum_i \alpha_i$), with cardinal $s(t) = \binom{n+t}{n}$. Let $v_t(x) = (x^\alpha)_{\alpha \in \mathbb{N}^n_t}$ be the vector of monomials up to degree $t$.
The support of a Borel measure $\mu$ on $\mathbb{R}^n$ is the smallest closed set $A$ such that $\mu(\mathbb{R}^n \setminus A) = 0$, and such a set $A$ is unique.

Moment matrix. Let $\phi$ be a Borel measure whose support $\Omega \subset \mathbb{R}^n$ is compact with nonempty interior. Its moment matrix of order (or degree) $t$ is the real symmetric matrix $M_t(\phi)$ with rows and columns indexed by $\mathbb{N}^n_t$, and with entries
$$M_t(\phi)(\alpha, \beta) := \int_\Omega x^{\alpha+\beta}\, d\phi(x), \quad \alpha, \beta \in \mathbb{N}^n_t.$$
Then necessarily $M_t(\phi)$ is positive semidefinite for all $t$, denoted $M_t(\phi) \succeq 0$.

Christoffel function. If $\Omega$ has nonempty interior then $M_t(\phi)$ is positive definite for all $t$, denoted $M_t(\phi) \succ 0$. Let $(P_\alpha)_{\alpha \in \mathbb{N}^n} \subset \mathbb{R}[x]$ be a family of polynomials, orthonormal with respect to $\phi$, i.e., $\int P_\alpha\, P_\beta\, d\phi = \delta_{\alpha=\beta}$. Then the Christoffel function (CF) $\Lambda^\phi_t : \mathbb{R}^n \to \mathbb{R}_+$ associated with $\phi$ is defined by
$$\Lambda^\phi_t(x)^{-1} := \sum_{\alpha \in \mathbb{N}^n_t} P_\alpha(x)^2, \quad x \in \mathbb{R}^n,$$
and recalling that $M_t(\phi)$ is nonsingular, it turns out that
$$(1.2)\quad \Lambda^\phi_t(x)^{-1} = v_t(x)^T\, M_t(\phi)^{-1}\, v_t(x), \quad x \in \mathbb{R}^n.$$
An equivalent and variational definition is also
$$(1.3)\quad \Lambda^\phi_t(x) = \min_{p \in \mathbb{R}[x]_t} \Big\{ \int p^2\, d\phi : p(x) = 1 \Big\}, \quad x \in \mathbb{R}^n.$$
One interesting and distinguishing feature of the CF is that as $t$ increases, $\Lambda^\phi_t(x) \downarrow 0$ exponentially fast for every $x \notin \mathrm{supp}(\phi)$, whereas the decrease is at most polynomial in $t$ on the support. In other words, $\Lambda^\phi_t$ identifies the support of $\phi$ when $t$ is sufficiently large. In addition, at least in dimension $n = 2$ or $n = 3$, one may visualize this property even for small $t$, as the resulting superlevel sets $\Omega_\gamma := \{x : \Lambda^\phi_t(x) \geq \gamma\}$, $\gamma \in \mathbb{R}$, capture the shape of $\Omega$ quite well; see e.g. [2].
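As a concrete illustration (a standard univariate example, not taken from this note): let $\phi$ be the uniform probability measure on $[-1, 1]$, whose orthonormal polynomials are the scaled Legendre polynomials $\sqrt{2k+1}\, P_k$. Then
$$\Lambda^\phi_t(x)^{-1} = \sum_{k=0}^{t} (2k+1)\, P_k(x)^2,$$
so that e.g. $\Lambda^\phi_t(\pm 1) = (t+1)^{-2}$ (as $P_k(\pm 1)^2 = 1$), a polynomial decay, whereas for $|x| > 1$ the quantities $P_k(x)^2$ grow geometrically in $k$, and $\Lambda^\phi_t(x) \to 0$ exponentially fast in $t$.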
1.2. Setting. Let $Y := \{1, 2, \ldots, m\}$ be the set of $m$ classes, and for each (class) $j \in Y$, let $X_j \subset \mathbb{R}^n$ be the set of points in class $j$, assumed to be open with compact closure $\bar{X}_j$. Let $X := \bigcup_{j=1}^m X_j \subset \mathbb{R}^n$ be the open set (with compact closure $\bar{X} = \bigcup_{j=1}^m \bar{X}_j$) of all points to be classified, and let $\mu$ be the joint probability distribution of $(x, y)$ on $\bar{X} \times Y$. Write
$$(1.4)\quad d\mu(x, y) = \varphi(dy \mid x)\, \phi(dx),$$
where $\mu$ has been disintegrated into its marginal $\phi$ on $\bar{X}$ and its conditional probability distribution $\varphi(dy \mid x)$ on $Y$ given $x \in \bar{X}$. Next, each point $x \in X$ belongs to only one class and therefore $X_i \cap X_j = \emptyset$ for all pairs $(i, j)$ with $i \neq j$, and so we may and will assume that $\phi(\bar{X}_i \cap \bar{X}_j) = 0$ for all pairs $(i, j)$ with $i \neq j$. Therefore one may write $\mu = \sum_{j=1}^m \mu_j$, with $d\mu_j(x, y) = \delta_j(dy)\, d\phi_j(x)$ for some marginals $\phi_j$ on $\bar{X}_j$, $j \in Y$. In particular:
$$(1.5)\quad x \mapsto f(x) := \sum_{j \in Y} j \cdot 1_{X_j}(x), \quad x \in X,$$
$$(1.6)\quad d\mu(x, y) = \sum_{j \in Y} \delta_j(dy)\, d\phi_j(x),$$
and so the joint distribution $\mu$ is supported on the graph $G := \{(x, f(x)) : x \in X\}$ of the function $f$. (For instance, with $m = 2$ classes, $f = 1_{X_1} + 2 \cdot 1_{X_2}$ and $d\mu = \delta_1(dy)\, d\phi_1(x) + \delta_2(dy)\, d\phi_2(x)$.)

The Christoffel function is a powerful tool from the theory of approximation and orthogonal polynomials, and one of its distinguishing features is its ability to identify the support of the underlying measure. So the CF $\Lambda^\mu_t$ associated with $\mu$ is an appropriate tool to approximate $f$, since the graph of $f$ is precisely the support of the measure $\mu$ in (1.6). In Marx et al. [5] the authors propose to approximate $f$ (when $\|f\|_\infty < M$) by:
$$(1.7)\quad x \mapsto f_t(x) := \arg\min_{y \in [-M, M]} \Lambda^{\mu + \varepsilon\mu_0}_t(x, y)^{-1},$$
with a small $\varepsilon > 0$ and where $\mu_0$ is a measure with a density w.r.t. Lebesgue measure, positive on $\bar{X} \times [-M, M]$. They prove nice theoretical convergence guarantees as $t$ increases; see [5] for more details. Notice that in the present supervised classification framework, the function to approximate is a step function, so that its graph is contained in a real algebraic variety, an even more specific case.

Main result
We first consider the ideal case of approximating the classifier $f$ in (1.5) via the CF of the joint distribution $\mu$ in (1.6). Then we consider the more practical setting (as in machine learning) where we only have access to a finite sample (the training set). In this case we use the empirical measures $\phi_{j,N}$ associated with the points of the sample in class $j$. In Section 2.2 we invoke results from [3] that relate the degree $t$ of the CF $\Lambda^{\phi_{j,N}}_t$ with the size $N$ of the sample, to ensure that the important asymptotic properties of $\Lambda^{\phi_{j,N}}_t$ and $\Lambda^{\phi_j}_t$ as $t$ and $N$ increase coincide.
2.1. The CF on a real variety. Let $v \in \mathbb{R}[x, y]$ be the polynomial $(x, y) \mapsto v(x, y) := \prod_{i=1}^m (y - i)$, and let $\mu$ be the probability measure on $\Omega = \bar{X} \times Y$ defined in (1.6). Its support $\Omega$ is contained in the real algebraic variety $V := \{(x, y) \in \mathbb{R}^{n+1} : v(x, y) = 0\} = \mathbb{R}^n \times \{1, \ldots, m\}$, and the ideal $I(V) \subset \mathbb{R}[x, y]$ generated by the polynomial $v$ is the ideal of polynomials that vanish on $V$.
Observe that the moment matrix $M_t(\mu)$ is singular, since the vector $\mathbf{v}$ of coefficients of the polynomial $v$ is in the kernel of $M_t(\mu)$ as soon as $t \geq m$. Indeed $\mathbf{v}^T M_t(\mu)\, \mathbf{v} = \int v^2\, d\mu = 0$ because the support of $\mu$ is contained in $V$. So the definition (1.2) of the CF is not valid any more. Denote by $L^2_t(\mu) \subset \mathbb{R}[x, y]$ the space of polynomials on $V$ of total degree at most $t$ (and degree at most $m - 1$ in the variable $y$), equipped with the inner product and norm inherited from $L^2(\Omega, \mu)$. It turns out that $L^2_t(\mu)$ is a RKHS (Reproducing Kernel Hilbert Space). Then in this context, the variational definition (1.3) of the CF associated with $\mu$ reads
$$(2.1)\quad \Lambda^\mu_t(x, y) = \min_{p \in L^2_t(\mu)} \Big\{ \int p^2\, d\mu : p(x, y) = 1 \Big\}, \quad (x, y) \in V.$$
Let $M'_t(\mu)$ be the moment matrix associated with $\mu$ in (1.6), with rows and columns indexed by all monomials $x^\alpha y^k$ of $\Gamma_t$ (and not all monomials $x^\alpha y^k$ of total degree at most $t$), e.g. listed according to the lexicographic ordering, where $\Gamma_t := \{x^\alpha y^k : |\alpha| + k \leq t,\ 0 \leq k \leq m-1\}$ is a monomial basis of $L^2_t(\mu)$. Then $M'_t(\mu)$ is nonsingular and
$$\Lambda^\mu_t(x, y)^{-1} = v'_t(x, y)^T\, M'_t(\mu)^{-1}\, v'_t(x, y),$$
where $v'_t(x, y)$ is the vector of all monomials of $\Gamma_t$. Alternatively,
$$\Lambda^\mu_t(x, y)^{-1} = \sum_{\alpha, k} \lambda_{\alpha,k}^{-1}\, \big(Q_{\alpha,k}^T\, v'_t(x, y)\big)^2,$$
where the $Q_{\alpha,k}$'s and the $\lambda_{\alpha,k}$'s are the eigenvectors and their respective eigenvalues associated with $M'_t(\mu)$.

Following [5], one may consider the perturbed measure $\mu + \varepsilon\mu_0$, where $\mu_0$ is a probability measure uniformly distributed on $\bar{X} \times [0, m]$ and $\varepsilon > 0$ is a small parameter. Then
$$\Lambda^{\mu+\varepsilon\mu_0}_t(x, y)^{-1} = v_t(x, y)^T\, M_t(\mu + \varepsilon\mu_0)^{-1}\, v_t(x, y).$$
Recall that $v_t(x, y)$ is the vector of all monomials $x^\alpha y^k$ of degree at most $t$, and $M_t(\mu + \varepsilon\mu_0)$ is the moment matrix of $\mu + \varepsilon\mu_0$ of order $t$, which is nonsingular for all $\varepsilon > 0$. In addition $\Lambda^{\mu+\varepsilon\mu_0}_t$ is defined for all $(x, y) \in \mathbb{R}^{n+1}$, whereas $\Lambda^\mu_t$ in (2.1) is defined only for $(x, y) \in V$. Then, as proved in [5], the classifier $f_t$ in (1.7) approximates $f$ as $t$ increases. Next we show that by a slight change of the vector space $L^2_t(\mu)$ in (2.1), the resulting CF has nice additional properties that can be exploited to provide the resulting classifier (1.7) with a clear and more intuitive interpretation.
A slight variant of the CF. We now introduce a slight variant $\tilde{\Lambda}^\mu_t$ of the CF $\Lambda^\mu_t$, defined by:
$$(2.2)\quad \tilde{\Lambda}^\mu_t(x, y) := \min_{p \in \tilde{L}^2_t(\mu)} \Big\{ \int p^2\, d\mu : p(x, y) = 1 \Big\},$$
where $\tilde{L}^2_t(\mu) =: \mathbb{R}[x, y]_{t, m-1}$ is the vector space of polynomials of degree at most $t$ with respect to the variable $x$ and at most $m - 1$ with respect to the variable $y$. Let $\Lambda^\mu_t$ and $\tilde{\Lambda}^\mu_t$ be as in (2.1) and (2.2) respectively. Then $\tilde{\Lambda}^\mu_t$ and $\Lambda^\mu_t$ are close but, as we next see, $\tilde{\Lambda}^\mu_t$ has an interesting additional feature. Namely, it has a nice characterization in closed form which, when exploited for classification, leads to a classifier with a clear interpretation. Let $(\theta_j)_{j \in [m]} \subset \mathbb{R}[y]_{m-1}$ be the interpolation polynomials at the points $\{1, 2, \ldots, m\}$ of $Y$, i.e.,
$$y \mapsto \theta_j(y) := \prod_{k \in [m],\, k \neq j} \frac{y - k}{j - k}, \quad \text{so that } \theta_j(k) = \delta_{j=k},\ j, k \in [m],$$
which form an orthonormal family with respect to the uniform probability measure on $Y$.
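For instance (our own illustration), with $m = 2$ classes one has $\theta_1(y) = 2 - y$ and $\theta_2(y) = y - 1$, so that by (2.5) below,
$$\tilde{\Lambda}^\mu_t(x, y)^{-1} = (2 - y)^2\, \Lambda^{\phi_1}_t(x)^{-1} + (y - 1)^2\, \Lambda^{\phi_2}_t(x)^{-1},$$
which at $y = 1$ and $y = 2$ reduces to $\Lambda^{\phi_1}_t(x)^{-1}$ and $\Lambda^{\phi_2}_t(x)^{-1}$ respectively, in agreement with (2.6).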
Theorem 2.2. For each $j \in Y$, let $(P^j_\alpha)_{\alpha \in \mathbb{N}^n} \subset \mathbb{R}[x]$ be a family of polynomials that are orthonormal with respect to the marginal probability measure $\phi_j$ of $\mu_j$, and let $\Lambda^{\phi_j}_t$ be the standard Christoffel function associated with $\phi_j$ on $X_j$. Then:
(i) The family $(\theta_j(y)\, P^j_\alpha(x))_{\alpha \in \mathbb{N}^n_t,\, j \in Y}$ is an orthonormal basis of $\tilde{L}^2_t(\mu)$.
(ii) For all $(x, y)$:
$$(2.4)\quad \tilde{\Lambda}^\mu_t(x, y)^{-1} = \tilde{v}_t(x, y)^T\, \tilde{M}_t(\mu)^{-1}\, \tilde{v}_t(x, y),$$
$$(2.5)\quad \tilde{\Lambda}^\mu_t(x, y)^{-1} = \sum_{j \in Y} \theta_j(y)^2\, \Lambda^{\phi_j}_t(x)^{-1},$$
and in particular,
$$(2.6)\quad \tilde{\Lambda}^\mu_t(x, j) = \Lambda^{\phi_j}_t(x), \quad \forall j \in Y,\ x \in \mathbb{R}^n.$$

Proof. (i) Consider any $u \in \tilde{L}^2_t(\mu)$ of the form $u(x, y) := p(y)\, q(x)$ for some $q \in \mathbb{R}[x]_t$ and some $p \in \mathbb{R}[y]_{m-1}$, arbitrary. Then, as $(P^j_\alpha)_{\alpha \in \mathbb{N}^n_t}$ generates $\mathbb{R}[x]_t$, observe that for every $j \in Y$:
$$q(x) = \sum_{\alpha \in \mathbb{N}^n_t} q^j_\alpha\, P^j_\alpha(x),$$
for some coefficients $(q^j_\alpha)_{\alpha \in \mathbb{N}^n_t}$. Next, as the polynomials $(\theta_j)_{j \in Y}$ generate $\mathbb{R}[y]_{m-1}$, write $p(y) = \sum_{j \in Y} p_j\, \theta_j(y)$ for some coefficients $(p_j)_{j \in Y}$, and therefore
$$u(x, y) = \sum_{j \in Y} p_j\, \big(\theta_j(y)\, q(x)\big) = \sum_{\alpha \in \mathbb{N}^n_t,\, j \in Y} p_j\, q^j_\alpha\, \theta_j(y)\, P^j_\alpha(x).$$
Hence the family $(\theta_j(y)\, P^j_\alpha(x))_{\alpha \in \mathbb{N}^n_t,\, j \in Y}$ generates $\tilde{L}^2_t(\mu)$.
Orthogonality. If $i \neq j$ then $\theta_i(y)\, \theta_j(y) = 0$ everywhere on the support of $\mu$, and therefore
$$\int_\Omega \theta_i(y)\, P^i_\alpha(x)\, \theta_j(y)\, P^j_\beta(x)\, d\mu(x, y) = 0, \quad \forall \alpha, \beta \in \mathbb{N}^n_t,$$
while for $i = j$, by the interpolation property of $\theta_j$ and orthonormality of $(P^j_\alpha)$ with respect to $\phi_j$,
$$\int_\Omega \theta_j(y)^2\, P^j_\alpha(x)\, P^j_\beta(x)\, d\mu(x, y) = \int P^j_\alpha\, P^j_\beta\, d\phi_j = \delta_{\alpha=\beta}.$$
Next, $\dim \tilde{L}^2_t(\mu) = m\, s(t)$, which is also the number of terms in the family $(\theta_j(y)\, P^j_\alpha(x))_{\alpha \in \mathbb{N}^n_t,\, j \in Y}$, which also generates $\tilde{L}^2_t(\mu)$. Hence $(\theta_j(y)\, P^j_\alpha(x))_{\alpha \in \mathbb{N}^n_t,\, j \in Y}$ is an orthonormal basis of $\tilde{L}^2_t(\mu)$.

(ii) Let $\tilde{M}_t(\mu)$ be the moment matrix of degree $t$ with rows and columns indexed by the monomials $(x^\alpha y^k)_{\alpha \in \mathbb{N}^n_t,\, 0 \leq k \leq m-1}$ (e.g. with lexicographic ordering), and let $\tilde{v}_t(x, y)$ be the vector of monomials $x^\alpha y^k$ listed with the same ordering. Observe that (2.2) reads
$$(2.7)\quad \tilde{\Lambda}^\mu_t(x, y) = \min_{\mathbf{p} \in \mathbb{R}^{r(t)}} \big\{\, \mathbf{p}^T \tilde{M}_t(\mu)\, \mathbf{p} : \mathbf{p}^T\, \tilde{v}_t(x, y) = 1 \,\big\},$$
where $\mathbf{p} \in \mathbb{R}^{r(t)}$ (with $r(t) = m\, s(t)$) is the vector of coefficients of $p \in \tilde{L}^2_t(\mu)$ in that basis. Then (2.7) is a convex optimization problem whose optimal solution $\mathbf{p}^*$ satisfies $2\, \tilde{M}_t(\mu)\, \mathbf{p}^* = \lambda^*\, \tilde{v}_t(x, y)$ for some scalar $\lambda^*$. Hence $\lambda^* = 2\, \tilde{\Lambda}^\mu_t(x, y)$ and $\mathbf{p}^* = \tilde{\Lambda}^\mu_t(x, y)\, \tilde{M}_t(\mu)^{-1}\, \tilde{v}_t(x, y)$, so that evaluating the constraint $\mathbf{p}^{*T}\, \tilde{v}_t(x, y) = 1$ yields
$$\tilde{\Lambda}^\mu_t(x, y)\, \tilde{v}_t(x, y)^T\, \tilde{M}_t(\mu)^{-1}\, \tilde{v}_t(x, y) = 1,$$
which is (2.4). In particular, expanding in the orthonormal basis of (i), we also retrieve that
$$\tilde{\Lambda}^\mu_t(x, y)^{-1} = \sum_{\alpha \in \mathbb{N}^n_t,\, j \in Y} \big(\theta_j(y)\, P^j_\alpha(x)\big)^2.$$
Next, (2.5) follows from the definition of the Christoffel function associated with $\phi_j$ for each $j \in Y$, and (2.6) follows from the properties of the interpolation polynomials $(\theta_j)_{j \in Y}$. $\square$
So whenever $y \in Y$, the Christoffel function $\tilde{\Lambda}^\mu_t(x, y)$ has the very simple expression (2.5), stated directly in terms of the Christoffel functions $(\Lambda^{\phi_j}_t(x))_{j=1,\ldots,m}$ associated with the classes $j = 1, \ldots, m$. This is quite natural but is specific to the CF $\tilde{\Lambda}^\mu_t$, and does not hold for the standard CF $\Lambda^\mu_t$.
An ideal classifier. Given the Christoffel function $\tilde{\Lambda}^\mu_t$ in (2.5), and inspired by (1.7), a natural candidate classifier is the function
$$(2.9)\quad x \mapsto \tilde{f}_t(x) := \arg\min_{y \in Y} \tilde{\Lambda}^\mu_t(x, y)^{-1} = \arg\max_{y \in Y} \tilde{\Lambda}^\mu_t(x, y), \quad \forall x \in X,$$
which in view of (2.6) reads:
$$(2.10)\quad x \mapsto \tilde{f}_t(x) = \arg\max_{k \in [m]} \Lambda^{\phi_k}_t(x), \quad \forall x \in X.$$
Observe that the "max" in (2.10) is over $y \in [m]$ and not over the interval $[0, m]$. This is because in supervised classification we know that $f(x) \in [m]$ for all $x \in X$.
The rationale is the following. Let $x \in X_j$ be fixed, arbitrary, so that $x \notin X_k$ for every $k \neq j$. As $t$ increases, $\Lambda^{\phi_k}_t(x)$ decreases to zero exponentially fast, while $\Lambda^{\phi_j}_t(x)$ decreases at most polynomially, not faster than $t^{-n}$. Therefore for $t$ sufficiently large, necessarily $\Lambda^{\phi_k}_t(x) < \Lambda^{\phi_j}_t(x)$ for all $k \neq j$, and so by (2.6),
$$x \in X_j \;\Rightarrow\; \exists\, t_0 \text{ s.t. } \tilde{f}_t(x) = \arg\max_{k \in [m]} \Lambda^{\phi_k}_t(x) = j, \quad \forall t \geq t_0.$$
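To see these two decay regimes concretely, the following small sketch (our own illustration, not from this note; it assumes only numpy) evaluates the CF of the uniform probability measure on $[-1, 1]$ from its Legendre expansion (the example of Section 1.1), at a point inside and a point outside the support:

```python
# Decay of Lambda_t for the uniform probability measure on [-1, 1]:
# Lambda_t(x)^{-1} = sum_{k<=t} (2k+1) P_k(x)^2  (orthonormal Legendre basis).
from numpy.polynomial.legendre import Legendre

def christoffel(x, t):
    return 1.0 / sum((2 * k + 1) * Legendre.basis(k)(x) ** 2 for k in range(t + 1))

for t in (5, 10, 20, 40):
    # inside the support (x = 0): polynomial decay; outside (x = 1.5): exponential
    print(t, christoffel(0.0, t), christoffel(1.5, t))
```

Running it, $\Lambda_t(0)$ shrinks roughly like $1/t$, while $\Lambda_t(1.5)$ collapses by many orders of magnitude as $t$ grows, in line with the rationale above.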
An even stronger almost-uniform result holds. Let $\partial X_j$ denote the boundary of the set $X_j \subset \mathbb{R}^n$.

Theorem 2.3. Let $\tilde{f}_t$ be as in (2.10) and let $X^\varepsilon_j := \{x \in X_j : d(x, \partial X_j) > \varepsilon\}$, where $\varepsilon > 0$ is fixed. Assume that for every $j \in [m]$, $\phi_j$ has a density w.r.t. the Lebesgue measure $\lambda$ restricted to $X_j$, bounded from below by $c > 0$. Then there exists $t_\varepsilon$ such that $\tilde{f}_t(x) = j$ for all $x \in X^\varepsilon_j$ and all $t \geq t_\varepsilon$.
Proof. Let $s(t) := \binom{n+t}{t}$ and denote by $\mathrm{diam}(S)$ the diameter of a bounded set $S \subset \mathbb{R}^n$. Fix $j \in [m]$ and let $x \in X^\varepsilon_j$ be arbitrary, so that $d(x, \bar{X}_k) > \varepsilon$ for every $k \neq j$. Hence by [4, Lemma 6.6] (and using that the support of $\phi_k$ is $\bar{X}_k$), $\Lambda^{\phi_k}_t(x)$ decreases to zero exponentially fast in $t$: there are constants $C_k, \gamma_k > 0$, depending only on $\varepsilon$ and $\mathrm{diam}(X_k)$, such that
$$\Lambda^{\phi_k}_t(x) \leq C_k\, \exp(-\gamma_k\, t), \quad \forall x \in X^\varepsilon_j.$$
On the other hand, as $d(x, \partial X_j) > \varepsilon$ and the density of $\phi_j$ on $X_j$ is bounded below by $c > 0$, we can invoke the lower bound of [4] on the ball $B(x, \varepsilon) \subset X_j$ to obtain
$$\Lambda^{\phi_j}_t(x) \geq \frac{c\, \delta(\varepsilon)}{s(t)}, \quad \forall x \in X^\varepsilon_j,$$
where the constant $\delta(\varepsilon) > 0$ depends only on $\varepsilon$, $n$, and $\omega_n$, the $n$-dimensional area of the unit sphere. As $s(t)^{-1}$ decreases only polynomially in $t$, clearly there exists $t_\varepsilon$ such that $\Lambda^{\phi_k}_t(x)^{-1} > \Lambda^{\phi_j}_t(x)^{-1}$ for all $k \neq j$ and all $x \in X^\varepsilon_j$, whenever $t \geq t_\varepsilon$; in particular, as the functions $\Lambda^{\phi_k}_t$ are strictly positive and continuous and $\bar{X}_j$ is compact, these bounds hold uniformly in $x \in X^\varepsilon_j$. Therefore the result follows from (2.6) and (2.9). $\square$

2.2. Application to supervised classification with noiseless deterministic labels. In supervised classification we do not have access to the CF $\Lambda^\mu_t$ or $\tilde{\Lambda}^\mu_t$. We only have access to a sample of $N$ points $\mathrm{Tr}_N = \{(x(i), y(i)) : i = 1, \ldots, N\} \subset X \times Y$ (the training data set) and a sample of test points (the test data set). For instance, in a typical Machine Learning (ML) approach one tries to learn the classifier function $f$ in (1.5) from the supervised data $\mathrm{Tr}_N$ by computing parameters of a deep neural network that minimize some loss function. Usually, the number of parameters is very large (compared to $N$), making the resulting solution sensitive to a classical overfitting phenomenon. One way to attenuate this overfitting phenomenon is to add an appropriate regularization term to the loss function in the criterion to minimize.
One reason behind this overfitting phenomenon is that in minimizing the loss function, each data point $(x(i), y(i))$ is treated separately. Ideally one should somehow consider the entire training set $\mathrm{Tr}_N$ itself, and not its members separately. This is precisely what the CF approach does. Indeed the training set $\mathrm{Tr}_N$ is used to construct the empirical (discrete) analogues $\phi_{j,N}$ of the measures $\phi_j$, and their associated empirical Christoffel functions $\Lambda^{\phi_{j,N}}_t$, now obtained from empirical moments. Remarkably, even though the geometry of the support of $\phi_{j,N}$ is quite trivial, the CF $\Lambda^{\phi_{j,N}}_t$ is still close to $\Lambda^{\phi_j}_t$ in a certain sense, and the training set $\mathrm{Tr}_N$ can still be used to infer properties of the underlying measures $\phi_j$. Hence, and importantly, even though the mathematical object $\Lambda^{\phi_{j,N}}_t$ is built from individual items, it is in fact concerned with the cloud of points of $\mathrm{Tr}_N$ in class $j$, rather than with the points $x(i)$ in that class taken separately. However, of course, for $\Lambda^{\phi_{j,N}}_t$ to recover asymptotic properties of $\Lambda^{\phi_j}_t$, the sample size $N$ and the degree $t$ cannot be chosen independently; see [3, 4] for more details.

Setting. For every $j \in [m]$, let $\mathrm{Tr}^j_N = (x(i))_{i \leq N} \subset X_j$ be a training set for class $j$, where the $x(i)$ are i.i.d. random vectors with common distribution $\phi_j$ whose support is $\bar{X}_j$. So the whole training set $\mathrm{Tr}_N$ has a total of $mN$ points, where the $N$ points in each class $j$ are sampled from $\phi_j$. For every fixed $t$, a natural approach suggested by (2.10) consists of:

• For every $j \in [m]$, compute the empirical CF
$$\Lambda^{\phi_{j,N}}_t(x)^{-1} = v_t(x)^T\, M_t(\phi_{j,N})^{-1}\, v_t(x), \quad \forall x \in \mathbb{R}^n,\ j \in [m].$$
The moments of $\phi_{j,N}$ are easily obtained by
$$\hat{\phi}_{j,N}(\alpha) := \frac{1}{N} \sum_{x(i) \in \mathrm{Tr}^j_N} x(i)^\alpha, \quad \forall \alpha \in \mathbb{N}^n,$$
and the moment matrix $M_t(\phi_{j,N})$ is nonsingular provided that $N$ is sufficiently large (relative to $s(t)$).

• Following (2.10), introduce the empirical classifier
$$(2.13)\quad x \mapsto f^N_t(x) := \arg\max_{k \in [m]} \Lambda^{\phi_{k,N}}_t(x), \quad \forall x \in X.$$

Then the empirical version of Theorem 2.3 reads as follows: with $\varepsilon > 0$ fixed, with probability 1 with respect to the random samples, $f^N_t(x) = j$ for every $j \in [m]$ and all $x \in X^\varepsilon_j$, provided that $t$ and then $N$ (depending on $t$) are sufficiently large; a sketch of a numerical implementation follows below.
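As an illustration only (not code from this note), here is a minimal Python sketch of the empirical classifier (2.13). The class name, the helper functions, and the ridge term `reg` (added because $M_t(\phi_{j,N})$ may be singular or ill-conditioned when $N$ is not large relative to $s(t)$) are our own choices:

```python
# Minimal sketch of the empirical Christoffel classifier (2.13), numpy only.
import itertools
import numpy as np

def monomial_exponents(n, t):
    """All alpha in N^n with |alpha| <= t (the index set N^n_t)."""
    return [a for a in itertools.product(range(t + 1), repeat=n) if sum(a) <= t]

def vandermonde(X, exponents):
    """Rows v_t(x(i)): the monomials x^alpha evaluated at each sample x(i)."""
    X = np.atleast_2d(X)
    return np.prod(X[:, None, :] ** np.array(exponents)[None, :, :], axis=2)

class ChristoffelClassifier:
    def __init__(self, t, reg=1e-9):
        self.t, self.reg = t, reg

    def fit(self, X, y):
        self.exponents = monomial_exponents(X.shape[1], self.t)
        self.classes = np.unique(y)
        self.Minv = {}
        for j in self.classes:
            V = vandermonde(X[y == j], self.exponents)  # N_j x s(t)
            M = V.T @ V / len(V)                        # moment matrix M_t(phi_{j,N})
            # ridge regularization: M may be singular if N_j < s(t)
            self.Minv[j] = np.linalg.inv(M + self.reg * np.eye(len(M)))
        return self

    def predict(self, X):
        V = vandermonde(X, self.exponents)
        # Lambda^{phi_{j,N}}_t(x)^{-1} = v_t(x)^T M_t(phi_{j,N})^{-1} v_t(x)
        inv_cf = np.stack([np.einsum('ik,kl,il->i', V, self.Minv[j], V)
                           for j in self.classes], axis=1)
        return self.classes[np.argmin(inv_cf, axis=1)]  # argmin of Lambda^{-1}

# Two well-separated classes in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(-1, 0, (200, 2)), rng.uniform(0.5, 1.5, (200, 2))])
y = np.repeat([1, 2], 200)
clf = ChristoffelClassifier(t=4).fit(X, y)
print((clf.predict(X) == y).mean())  # training accuracy, close to 1
```

In this sketch, the degree $t$ is fixed by hand; in line with the discussion above, $t$ should in practice grow with $N$ at a rate as analyzed in [3, 4].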