\makeatletter
\@ifundefined{HCode}
{
\documentclass[CRBIOL,Unicode,screen]{cedram}
\newenvironment{noXML}{}{}
\newenvironment{Table}{\begin{table}}{\end{table}}
\def\thead{\noalign{\relax}\hline}
\def\endthead{\noalign{\relax}\hline}
\def\tabnote#1{\vskip4pt\parbox{.9\linewidth}{#1}}
\def\xxtabnote#1{\vskip4pt\parbox{.72\linewidth}{#1}}
\def\tsup#1{$^{{#1}}$}
\def\tsub#1{$_{{#1}}$}
\RequirePackage{etoolbox}
\def\jobid{crbiol20240388}
%\graphicspath{{/tmp/\jobid_figs/web/}}
\graphicspath{{./figures/}}
\newcounter{runlevel}
\let\MakeYrStrItalic\relax
\def\refinput#1{}
\def\back#1{}
\def\rmpi{\uppi}
\usepackage{multirow}
\def\nrow#1{\@tempcnta #1\relax%
\advance\@tempcnta by 1\relax%
\xdef\lenrow{\the\@tempcnta}}
\def\morerows#1#2{\nrow{#1}\multirow{\lenrow}{*}{#2}}
\def\botline{\\\hline}
\def\hyphen{\text{-}}
\def\0{\phantom{0}}
\def\xsection#1{}
\def\og{\guillemotleft}
\def\fg{\guillemotright}
\def\xsection#1{}
\def\back#1{}
\usepackage{upgreek}
\DOI{10.5802/crbiol.162}
\datereceived{2024-05-13}
\daterevised{2024-08-28}
\dateaccepted{2024-08-30}
\ItHasTeXPublished
}
{\documentclass[crbiol]{article}
\def\CDRdoi{10.5802/crbiol.162}
\let\newline\break
\def\xxtabnote#1{\tabnote{#1}}
\makeatletter
\usepackage[T1]{fontenc}
\def\selectlanguage#1{}
}
\makeatother
\usepackage{upgreek}
\begin{document}
\begin{noXML}
\CDRsetmeta{articletype}{review}
\title{Species abundance, urn models, and neutrality}
\alttitle{Abondance des esp\`eces, mod\`eles d'urnes et neutralit\'e}
\author{\firstname{Jerome} \lastname{Chave}\CDRorcid{0000-0002-7766-1347}}
\address{CRBE, Toulouse, France}
\email{jerome.chave@cnrs.fr}
\keywords{\kwd{Biodiversity}\kwd{Ecology}\kwd{Species abundances}}
\altkeywords{\kwd{Biodiversit\'e}\kwd{Ecologie}\kwd{Abondances
d'esp\`eces}}
\thanks{Agence Nationale de la Recherche (CEBA, ref. ANR-10-LABX-25-01;
TULIP, ref. ANR-10-LABX-0041; ANAEE-France: ANR-11-INBS-0001), CNES
long-term funding, European Space Agency CCI-BIOMASS and FRM4BIOMASS
projects, French Foundation for Research on Biodiversity (FRB)}
\begin{abstract}
The neutral theory of biodiversity and biogeography has stimulated much
research in community ecology. Here, exact results are used to apply
neutral model predictions to large regional samples. Three
complementary neutral models are presented: the Ewens canonical neutral
model, a model of subdivided ecological communities, and a ``diversity
begets diversity'' neutral model. For all three models, an exact
sampling formula is provided, and a new R package \texttt{neutr}, is
presented. This package is used to fit species abundances from regional
inventories of tropical forest trees in the Amazon, tropical Africa and
Southeast Asia. It is shown that the neutral models fit well empirical
data for all but the few most abundant species (from 6 to 40 depending
on the continent). When the parameter \tralicstex{\textit{θ}}{$\theta$} is taken as an index or
regional diversity, the Amazonia and Southeast Asia emerge with similar
regional diversities (\tralicstex{\textit{θ}}{$\theta$} = 654 for Amazonia, versus \tralicstex{\textit{θ}}{$\theta$} =
726 for Southeast Asia), with a less diverse tropical African tree
flora (\tralicstex{\textit{θ}}{$\theta$} = 219). The model infers 10,141 tree species with at
least 50 individuals in Amazonia, 3477 in tropical Africa and 9915 in
Southeast Asia. The spatially subdivided neutral model provides clear
evidence for a spatial substructure in all three regional floras. These
results show how neutral models are useful to explore regional patterns
of species abundance and to provide insights about regional species
pools.
\end{abstract}
\begin{altabstract}
La th\'eorie neutre de la biodiversit\'e et de la biog\'eographie a
stimul\'e de nombreuses recherches en \'ecologie des communaut\'es.
Ici, des r\'esultats exacts sont utilis\'es pour appliquer les
pr\'edictions du mod\`ele neutre \`a de grands \'echantillons
r\'egionaux. Trois mod\`eles neutres compl\'ementaires sont
pr\'esent\'es : le mod\`ele neutre canonique d'Ewens, un mod\`ele de
communaut\'es \'ecologiques subdivis\'ees et un mod\`ele neutre dans
lequel la diversit\'e engendre la diversit\'e. Pour les trois
mod\`eles, une formule d'\'echantillonnage exacte est fournie, et un
nouveau package R appel\'e {\og}neutr {\fg} est pr\'esent\'e. Ce
package est utilis\'e pour ajuster les abondances d'esp\`eces provenant
d'inventaires r\'egionaux d'arbres de for\^ets tropicales en Amazonie,
en Afrique tropicale et en Asie du Sud-Est. Je montre que les mod\`eles
neutres s'ajustent bien aux donn\'ees empiriques pour toutes les
esp\`eces, \`a l'exception des quelques esp\`eces les plus abondantes
(de 6 \`a 40 selon le continent). Lorsque le param\`etre \tralicstex{\textit{θ}}{$\theta$} est
consid\'er\'e comme un indice de la diversit\'e r\'egionale, il
appara\^{\i}t que l'Amazonie et l'Asie du Sud-Est ont des diversit\'es
r\'egionales similaires (\tralicstex{\textit{θ}}{$\theta$} = 654 pour l'Amazonie, \tralicstex{\textit{θ}}{$\theta$} = 726
pour l'Asie du Sud-Est), avec une flore arborescente africaine
tropicale moins diversifi\'ee (\tralicstex{\textit{θ}}{$\theta$} = 219). Le mod\`ele inf\`ere le
nombre de 10 141 esp\`eces d'arbres avec au moins 50 individus en
Amazonie, 3477 en Afrique tropicale et 9915 en Asie du Sud-Est. Le
mod\`ele neutre subdivis\'e dans l'espace d\'emontre la pr\'esence
d'une sous-structure spatiale dans les trois flores r\'egionales. Ces
r\'esultats montrent que les mod\`eles neutres sont utiles pour
explorer les patrons r\'egionaux d'abondance des esp\`eces et pour
explorer les r\'eservoirs r\'egionaux d'esp\`eces.
\end{altabstract}
\maketitle
\vspace*{.75pc}
\twocolumngrid
\end{noXML}
\section{Introduction}
Patterns of species abundance and rarity are an important dimension of
biological diversity. A species can be rare because it is limited by
its physiology, by local biotic constraints on its abundance, or by its
limited geographical distribution~\cite{rabinowitz_29_1981,
kruckeberg1985biological, kunin1993biology}. These diverse causes for
rarity also reflect the fundamentally different reasons why species are
threatened. Species that are locally abundant but geographically
limited may be threatened with extinction if their habitat is
transformed, as is the case with certain species of forest tree in the
tropics. Highly specialized sap-sucking insects may be threatened
because the trees on which they feed are removed. Finally, the
degradation of restricted habitats can lead to the disappearance of
species, as for example on the island of Saint
Helens~\cite{cronk_past_1989}. Although rare species exert great
fascination, the question of why some species are abundant locally or
regionally is no less interesting. The most abundant species are most
relevant to many ecosystem processes. In a study of Amazonian rain
forest trees, ter Steege et~al.\ have shown that no more than 227 tree
species make up half of the trees in this biome, out of more than 6700
species~\cite{ter_steege_hyperdominance_2013}. These were called
hyperdominant species. Patterns of abundance for Amazonian rain forest
trees~\cite{ter_steege_hyperdominance_2013} are represented in
Figure~\ref{fig:figure1} for illustration.
\begin{figure}
\includegraphics{fig01}
{\vspace*{-.2pc}}
\caption{\label{fig:figure1}Rank abundance distribution of tropical
tree species in three continents. These empirical distributions are
scaled such that the sum of the relative abundances is equal to one.
Red: Amazon; green: tropical Africa; blue: Southeast Asia. Data
from~\cite{cooper_consistent_2024}.}
{\vspace*{-.35pc}}
\end{figure}
The motivation for this contribution is to understand the processes
underlying regional patterns of species abundance. One way to
contribute to this research is to ask to what extent empirical species
abundance distributions deviate from those of regional species pools
generated purely from \mbox{random} processes.\ Here, I provide a
self-contained treatment of neutral models of relevance in
the study of species abundance distributions, together with a code for
the algorithms in the R statistical language. Then I illustrate this
method with regional species abundance data for three tree flora
(Amazonia, Africa, Southeast Asia).
The neutral theory of biodiversity and
biogeography~\cite{hubbell_unified_2001} has emerged from the
mathematics inspired by population genetics theory of the 1970s, and it
has generated much debate in ecology. The theory has shed light on
questions such as: how are the sampled individuals distributed between
species? how many species remain to be discovered after a given number
of individuals have been sampled? How representative are samples of
larger ecological communities? When I was invited to contribute to the
pages of this journal, I took opportunity to return to a debate on
neutral models of species abundance that have animated the scientific
community in ecology. Neutral models are now in the toolkit of
ecologists for the analysis of species
abundance~\cite{ter_steege_hyperdominance_2013}. Yet, many important
results on neutral models tend to be overlooked in the modern
literature and regularly rediscovered~\cite{ewens2004mathematical}. In
addition, since the 1970s, important research in probability theory has
been developed~\cite{pitman_two-parameter_1992, durrett1999stochastic,
pitman_combinatorial_2006, wakeley2009coalescent,
crane_ubiquitous_2016, tavare_magical_2021}, some of which is relevant
to quantitative research on biodiversity.
This study explores models of species abundance that mimic the process
of species discovery in a real situation, where individual organisms
are identified to the species one at a time. The first organism is
always a new species in the sample, the second may be a representative
of the first species or of a new species, and so on. For such a model
to make sense, it is assumed that organisms are identified one by one,
through a classical taxonomic study. Bulk identification methods, using
mass sequencing of environmental DNA for microbial species provide a
different context to the study of biological diversity in that species
are substituted for molecular operational taxonomic units, and
individuals are not always observable or even clearly
defined~\cite{taberlet2018environmental}. With that limitation in mind,
it is still helpful to explore the patterns of species abundance in
discrete assemblages of organisms~\cite{durrett1994importance}. The
goal here is to return to known mathematical formulation of neutrality,
provide several representations of the model and show that these
representations are quite flexible.
Here, I first review the mathematical foundations of two standard
models of species abundance, the canonical neutral model, and a
spatially subdivided version of this model. I also discuss a model that
has not been explored in the context of species abundance
distributions, which models the hypothesis that
the addition of species to a community may increase the resources and
biotic interactions, making that community
hospitable to a greater number of species, or in short ``diversity
begets diversity'', as proposed by
Whittaker~\cite{whittaker1972evolution}\footnote{\looseness=-1 The hypothesis that
diversity begets diversity has been extremely popular in the recent
literature, yet the history of this catchy term is quite obscure. It is
generally traced back to Robert~H.\ Whittaker (1972,
p.~216)~\cite{whittaker1972evolution} who writes that ``facilitation of
increase in species number in interacting trophic levels is reciprocal.
We should thus expect diversity to increase in parallel on any adjacent
trophic levels and, in fact, throughout the various groups of
interacting species that the community comprises.''\ In fact similar
ideas are already present in Allee {et~al.}\ (1949) textbook on animal
ecology \cite[p.~695]{allee1949principles}. Vane-Wright (1978) referred
explicitly to the diversity begets diversity mechanism in an
evolutionary context~\cite{vane1978ecological}.{\vspace*{-2.5pt}}}. It
is curious that this ``diversity begets diversity'' model has never
received proper quantitative mathematical treatment in the ecological
literature and one aim of this contribution is to promote this
discussion~\cite{pitman_two-parameter_1992, pitman_combinatorial_2006}.
In Section~\ref{sec2}, I present the fascinating mathematical results
associated with three neutral models. In Section~\ref{sec3}, I
illustrate the application of these models to practical examples of
parameter inference for empirical samples of tropical forest trees.
Finally, I discuss the possible ramifications of the theory of random
partitions for the study of empirical patterns of biodiversity.
\section{Three neutral models of biodiversity} \label{sec2}
In ecology, models have been referred to as \textit{neutral} in the
sense that individuals all have the same prospects of reproduction and
mortality, whatever the species to which they
belong~\cite{chave_neutral_2004}. In probability theory, the more
general notion of \textit{exchangeability} is defined: a model is
exchangeable if the probability distribution of the class abundances
$\{n_{1},n_{2},\ldots,n_{k}\}$, where $n_{i}$ is the abundance of
species~$i$, does not depend on the labels of the
classes~\cite{kingman_random_1975, pitman_poisson-kingman_2003}. This
property is essential in order to obtain exact mathematical results on
the probability distribution of the model. This probability
distribution, when known, can then be used as a likelihood function to
compare the model to empirical species abundance distributions. In
Section~\ref{urn}, an intuitive construction is provided in the form of
urn models.
Model~1 is the canonical neutral model, for which the probability
distribution of species abundances is the Ewens sampling
formula~\cite{ewens1972sampling}, explained in Section~\ref{model1}. I
also present the species individual curve for Model~1, and the
Griffiths--Engen--McCloskey formula, which allows samples conforming to
Model~1 to be drawn in a time proportional to the number of species $k$
rather than the number of individuals~$n$. Models~2 and~3 are two
possible generalizations of Model~1. Model~2, presented in
Section~\ref{model2}, assumes a spatially subdivided system, with
limited dispersal between local sites. Under general assumptions, this
Model~2 is also associated with a probability distribution for species
abundances, and the model parameters can be estimated by maximal
likelihood estimation. The second generalization, Model~3
(Section~\ref{model3}), implements the ``diversity begets diversity''
model. It turns out to be a equivalent to a model first developed
in~\cite{pitman_two-parameter_1992}, with the same properties as the
canonical neutral model but with the addition of one parameter.
\subsection{Urn model representation} \label{urn}
The process of species discovery can be summarized in generic terms
using so-called urn models~\cite{johnson1977urn}. The ``urn''
represents the system (here, a sample, or an ecological community), and
it is populated by ``balls'', which represent objects (here,
individuals). The objects may belong to two or more classes, which are
usually represented by the color of a ball. The urn representation is
relevant when an operator picks balls and performs a number of
operations based on this sampling. A lottery is an example of a game
that can be represented as an urn model, other examples including
election systems~\cite{mowbray2007electing, fernandez2014voter} or
sports~\cite{ben2013randomness}.
In the P\'olya urn model~\cite{polya1954patterns, johnson1977urn}, an
urn is initially filled with $n_{i}$ balls of color~$i$, and the balls
are drawn one by one, being replaced in the urn after its
color has been observed. The construction process is as follows. When a
ball is drawn, it is replaced in the urn together with a new ball of
the same color. The most abundant color tends to be selected more
often, so its abundance increases more rapidly. This P\'olya urn model
resembles the species sampling process, where a color symbolizes a
species. Note that even if the process is stochastic, the abundance of
each species depends on the initial condition of the system, i.e., the
initial abundance of the species.
A slight variant of the P\'olya urn model, due to
Hoppe~\cite{hoppe_polya-like_1984}, is directly related to the problem
of species sampling. It assumes that initially the urn contains a
single black ball. The construction process is as follows. When the
black ball is picked, it is replaced in the urn together with a ball of
a completely new color. When any other ball is picked, it is replaced
in the urn together with another ball of the same color. All the
colored balls have the same chance of being chosen, which we define as
a unit ``weight''. In contrast, the black ball has a weight~$\theta$,
which is a positive real number. The parameter $\theta$ is proportional
to the probability of adding a new color in the urn per iteration. This
method yields a partition of the colored balls, and this partition
depends on the single parameter~$\theta$. This is the basis of Model~1
presented below, also called the ``canonical neutral model''. It is a
drastic simplification of reality, but is amenable to exact
probabilistic results.
In large biological assemblages of organisms, geographical structure
can become an important factor, and can invalidate the assumption of
perfect \mbox{mixing}, i.e., the assumption that there is a single urn
from which one samples the balls. For example, the Amazon is a large
collection of trees (on the order of $4\times 10^{11}$,
see~\cite{ter_steege_hyperdominance_2013}), covering more than 6
million square kilometers, and assemblages of tree species differ west
and east of the Amazon. Spatially subdivided models have been developed
to account for this effect, where local sites are considered as a
dispersal-limited sample of the regional
pool~\cite{hubbell_unified_2001, wakeley_nonequilibrium_1999,
wakeley2001coalescent}. The goal is to predict the local distribution
of individuals between species in a local community, as a function of
immigration rates and regional species
diversity~\cite{hubbell_unified_2001, etienne_new_2005,
jabot_reconciling_2008}. The influence of space on the canonical
neutral model is discussed in Model~2 below.
In Model~3, the diversity begets diversity model, the rate of species
appearance depends on the number of species in the system, which is
denoted $k_n$ (the subset $n$ is because the number of species depends
on~$n$, the number of individuals). As will be explained below
(Section~\ref{model3}), one definition of this model is through an urn
scheme, sometimes referred to as the Blackwell--MacQueen
model~\cite{blackwell1973ferguson, pitman_two-parameter_1992,
pitman_combinatorial_2006}. \mbox{After} $n - 1$ individuals have been
sampled, the probability of sampling an altogether new species (i.e.,
the probability of sampling the black ball) is equal to \mbox{$(\theta
+ \sigma k_n)/(\theta + n)$}, so for any value of $\sigma>0$, the
probability to pick a new species is proportional to the number of
existing species~$k_n$. In the special case $\sigma=0$, this model is
equivalent to the Hoppe urn model (Model~1). The probability of adding
one individual to species $i$ (i.e., the probability of sampling a
colored ball) is equal to $(n_{i}-\sigma)/(\theta+n)$, so each of the
colored balls is picked slightly less often than expected by chance,
and the rarest colors, represented by a single ball in the urn,
are counter-selected. Biologically, rare species are likely to be less
viable than expected by chance due to reproductive difficulties, both
pre- and postzygotic~\cite{lande_extinction_1987,
rhymer_extinction_1996, courchamp_inverse_1999}. Crucially, the
complete sampling theory is known for this third urn model, as
described below, so it lends itself to statistical
inference~\cite{pitman_two-parameter_1992, pitman_combinatorial_2006}.
\subsection{Canonical neutral model (model~1)} \label{model1}
The first model describes a natural partition of $n$ objects into $k$
classes. This partitioning could concern many real-life situations, and
has been applied in population genetics (number of allelic copies in a
population, \cite{ewens1972sampling}), linguistics (number of word
\mbox{occurrences} in a book~\cite{simon1955class}), and ecology (species
abundance in a sample~\cite{engen_species_1974}). The question of how
to partition discrete collections is also relevant to many other
fascinating problems in mathematics~\cite{crane_ubiquitous_2016,
tavare_magical_2021}. All the results in this section are classic but
they are still provided as they provide essential context to Models~2
and~3. An excellent introduction is found in Ewens' textbook (2004
edition, Chapters~3, 9, and~11,~\cite{ewens2004mathematical}).
In a sample of organisms, let us assume that the species have been
labeled: the species differ in detectable ways, such as a taxonomic
feature. We denote $n_{i}$ the abundance of species~$i$, and
$n=\sum_{i=1}^{k}n_{i}$ the total number of organisms sampled, where
$k$ the total number of distinct species. The sample is fully described
by the vector $\{n_{1},n_{2},\ldots,n_{k}\}$, and the system is
described by a probability distribution $p(n_{1},n_{2},\ldots,n_{k})$
to find the system in a given state. A different description of the
state of the system is as follows. Let $a_{r}$ be the number of species
with exactly $r$ individuals in the sample (which can be zero). The
total number of individuals is $n=\sum_{r=1}^{\infty}r a_{r}$. The
difference with the above description is that $a_{r}$ are random
numbers, and that the total number of species in the sample
$k=\sum_{r=1}^{\infty}a_{r}$ is the sum of random numbers, and is
therefore also a random number. In this second representation, species
are\break unlabeled.
In the Hoppe urn model, step one draws the black ball with probability
one, generating one new species. At the $n$th draw, $n-1$ individuals
have been sampled, and the probability of sampling an altogether new
class (or color) is equal to $\theta/(\theta+n)$, while the probability
of adding one object to class $i$ is equal to $n_{i}/(\theta+n)$. The
Hoppe urn model generates a population of $n$ objects, and patterns of
class abundance are parameterized only by~$\theta$. This construction
turns out to be equivalent with Fisher--Wright model of population
genetics~\cite{hoppe_polya-like_1984, ewens2004mathematical}, or
the dispersal-unlimited neutral model in
ecology~\cite{etienne_novel_2004}. Consider a time-dependent system of
$N$ individuals such that all individuals die exactly at the end of the
time step and are replaced by a multinomial draw of their offspring. In
addition, with probability $\nu$ they are replaced by a totally new
species. This means that new species can arise at a rate $N\nu=\theta$
(new species per time step). This system is assumed to be large in the
sense that any sample $n$ verifies $n\ll N$. This process soon reaches
a dynamic equilibrium where the \mbox{appearance} of new species is
balanced by the \mbox{extinction} of rare species.
Hoppe~\cite{hoppe_polya-like_1984} has shown that the urn model
generates a typical configuration of the above model at dynamic
equilibrium.
In the Fisher--Wright model, the probability $F_{2}(t)$ that two
randomly chosen individuals belong to the same species at time $t$ is
computed as follows~\cite{malecot1948, ewens2004mathematical,
chave2002spatially}. For two individuals to belong to the same species,
none of the two individuals can be a new species (probability
$(1-\nu)^{2}$), and they can descend from the same parent (probability
$1/N$) or from two different parents already of the same species at the
previous time step (probability $F_{2}(t-1)$). This reasoning is
summarized in the equation:
$F_{2}(t)=(1-\nu)^{2}(1/N+(1-1/N)F_{2}(t-1))$. At dynamic equilibrium,
$F_{2}=F_{2}(t)=F_{2}(t-1)$, and substituting in the equation, one
finds $F_{2}=1/(1-N+N(1-\nu)^{-2})$. For $N$ large and $\nu$ small,
this results in: $F_{2}={1}/({1+\theta})$. This result is equivalent
with the Hoppe urn model, since the probability of picking any one ball
in the urn is ${1}/({1+\theta})$. The same reasoning can be applied one
step further to the probability of picking three individuals of the
same species $F_{3}(t)$. Three situations can arise: (i)~all three
could descend from the same parent (probability $1/N^{2}$), (ii)~they
could descend from two parents (probability $3(N-1)$) already of the
same species at time $t-1$ (probability $F_{2}(t-1)$), or (iii)~they
could descend from three different parents (probability $(N-1)(N-2)$)
already of the same species at time $t-1$ (probability $F_{3}(t-1)$).
At dynamic equilibrium, the formula reads: $F_{3} \sim
2/({2+\theta})F_{2}$. This reasoning for two and three individuals can
be extended to compute the probability of picking $n$ individuals of
the same species~\cite{ewens1972sampling}, which is
$F_{n}={(n-1)!}/({\theta(\theta+1)\cdots(\theta+n-1)})$.
The denominator of this expression, $\theta(\theta+1)\cdots
(\theta+n-1)$, is an important mathematical quantity in this theory,
and for this reason it deserves a specific notation:
$\theta^{(n)}=\theta(\theta+1)\cdots(\theta+n-1)$, which is called the
\textit{increasing factorial power}, or sometimes the ``Pochhammer
symbol''. $\theta^{(n)}$ may be expressed in terms of the usual Gamma
function as follows: $\theta^{(n)}=\Gamma(\theta+n)/\Gamma(\theta)$. An
expansion of $\theta^{(n)}$ as a polynomial is known and it turns out
to be useful below:
{\begin{equation} \label{eq1}
\theta^{(n)}=\sum_{k=0}^{n}S(n,k)\theta^{k}
\end{equation}}\unskip
where the coefficients $S(n,k)$ are called the \textit{absolute value
of the Stirling number of the first kind}.
The above calculus on the probabilities $F_n$ suggests that the
computation of the complete probability distribution of the state
$\{n_{1},n_{2},\ldots,n_{k}\}$, denoted
$p_{\theta}(n_{1},n_{2},\ldots,n_{k})$, is possible.
Ewens~\cite{ewens1972sampling} has computed
$p_{\theta}(n_{1},n_{2},\ldots,n_{k})$ as a closed-form expression, and
Karlin and McGregor have provided a formal proof for this formula by
recurrence~\cite{karlin1972addendum}. This result is known as the Ewens
sampling formula~\cite{crane_ubiquitous_2016, tavare_magical_2021}:
{\begin{eqnarray}
p_{\theta}(n_{1},n_{2},\ldots,n_{k}) &=& \frac{\theta^{k}}{\theta(\theta
+1)\cdots(\theta+n-1)}\frac{n!}{k!\prod_{j=1}^{k}n_{j}}\nonumber\\
&=& \frac{\theta^{k}}{\theta^{(n)}}\frac{n!}{k!\prod_{j=1}^{k}n_{j}}.
\label{eq2}
\end{eqnarray}}\unskip
Intuitively, the factor $\theta^{k}$ reflects the selection of the
black ball exactly $k$ times, and the denominator $\theta^{(n)}$ is the
product of the successive masses in the urn on each of the first $n$
draws. The coefficient $n!/(k!\prod_{j=1}^{k}n_{j})$ is valid here in
the case of labeled partitions. Using the vector $\{a_{1},a_{2},
\ldots,a_{r}\}$ instead, the partitions are unlabeled and the Ewens
sampling formula is rewritten as follows:
{\begin{equation} \label{eq3}
p_{\theta}(a_{1},a_{2},\ldots,a_{r},\ldots)=\frac{\theta^k}
{\theta^{(n)}}\frac{n!}{\prod_{j=1}^n j^{a_j}a_{j}!}.
\end{equation}}\unskip
Equations~(\ref{eq2}) and (\ref{eq3}) may look scary but in fact the
dependence on $\theta$ of $p_{\theta}$ is quite simple, and the rest of
the formula is the result of a combinatorial exercise. Having these
formulas available makes it possible to explore a variety of problems,
a few of which I report here.
A first natural question is how many classes, or species, are in a
sample of $n$ individuals given the parameter~$\theta$. From the Ewens
sampling formula, the probability $P_{\theta,n}(k)$ of finding $k$
species in the sample of size $n$ must be proportional to
${\theta^{k}}/{\theta^{(n)}}$, and since $P_{\theta,n}(k)$ is a
probability distribution, then $\sum_{k=1}^{n}P_{\theta,n}(k)=1$. So it
follows that $\sum_{k=1}^{n}c_{k}\theta^{k}=\theta^{(n)}$, where
$c_{k}$ is the proportionality constant (i.e.,
$P_{\theta,n}(k)=c_{k}({\theta^{k}}/{\theta^{(n)}})$). It now becomes
clear why Equation~(\ref{eq1}) was reported above: comparing the two
formulas it appears that $c_{k}$ must be equal to $S(n,k)$, therefore:
{\begin{equation} \label{eq4}
P_{\theta,n}(k)=S(n,k)\frac{\theta^{k}}{\theta^{(n)}}
\end{equation}}\unskip
which is the first remarkable exact result from Model~1.\ Next, from the
construction of Hoppe urn model, the expected number of species in the
sample, $\overline{k}_n=\sum_{k=1}^{n}kP_{\theta,n}(k)$ increases with
$n$ as
{\begin{equation} \label{eq5}
\overline{k}_n=\sum_{j=0}^{n-1}\frac{\theta}{\theta+j}.
\end{equation}}\unskip
For large sample sizes $n$, it is useful to replace the summation term
by a closed-form expression. It turns out that the above formula is
equal to:
{\begin{equation} \label{eq6}
\overline{k}_n=\theta\psi^{(0)}(\theta+n)-\theta\psi^{(0)}(\theta)
\end{equation}}\unskip
where $\psi^{(0)}(\theta)=(\mathrm{d}/{\mathrm{d}\theta})\ln(\Gamma(\theta))$
is the digamma function (the first derivative of the log-Gamma
function). This formula must respect the initial condition that there
must be exactly one species in the sample if $n=1$, or:
$\overline{k}_1=1$. It is true, although not immediately obvious, that
$\overline{k}_1=\theta\psi^{(0)}(\theta+1)-\theta\psi^{(0)}(\theta)=1$.
This is because the Gamma function verifies the following condition:
$\Gamma(\theta+1)=\theta\Gamma(\theta)$, which itself implies:
$\psi^{(0)}(\theta+1)=({\mathrm{d}}/{\mathrm{d}\theta})\ln
(\Gamma(\theta+1))=({\mathrm{d}}/{\mathrm{d}\theta})\ln
(\theta\Gamma(\theta))=1/\theta+\psi^{(0)}(\theta)$. This demonstrates
that $\overline{k}_1=1$ for this model.
It may also be shown that the variance of $k$ is equal to
{\begin{equation} \label{eqn7}
\text{var}(k)_n=\sum_{k=1}^{n}k^2P_{\theta,n}(k)-(\overline{k}_n)^2=
\sum_{j=0}^{n-1}\frac{\theta j}{(\theta+j)^{2}}.
\end{equation}}\unskip
Now, defining the log-likelihood function for Model~1 as
$L_k(\theta)=\ln P_{\theta}(k)$ from Equation~(\ref{eq4}), the maximum
likelihood estimate of~$\theta$, $\overline{\theta}$, verifies the
following equation: $({\mathrm{d}}/{\mathrm{d}\theta})L_k
(\overline{\theta})=0$ which turns out to be exactly
Equation~(\ref{eq5}). So Equation~(\ref{eq6}) can be used to calculate
the maximum likelihood estimate of parameter $\theta$ given $n$ and $k$
in a sample. This approach also gives access to the variance
of~$\theta$, $\sigma_{\theta}^{2}$, through the formula
$\mathrm{d}^{2}L(\theta)/\mathrm{d}\theta^{2}=-1/\sigma_{\theta}^{2}$:
{\begin{equation} \label{eqn8}
\frac{1}{\sigma_{\theta}^{2}}=\frac{\overline{k}_n}{\overline{\theta}^{2}}
-\sum_{j=0}^{n-1}\frac{1}{(\overline{\theta}+j)^{2}}.
\end{equation}}\unskip
In the literature on species diversity estimation, the estimated number
of species in a sample $\overline{k}_n$ has been explored in the limit
of a large sample sizes~$n$. Using the first order approximation of the
digamma function $\psi^{(0)}(x)\sim\ln(x)$ for $x$ large, and
Equation~(\ref{eq6}), it is apparent that:
{\begin{equation} \label{eq8}
\overline{k}_n\sim\theta\ln\left(1+\frac{n}{\theta}\right).
\end{equation}}\unskip
Thus the expected number of species $\overline{k}_n$ in a sample
increases roughly as the logarithm of sample size for large samples, a
scaling relationship first proposed in the form of (\ref{eq8}) by
Fisher (1943)~\cite{fisher_relation_1943}. Note that Fisher (1943) used
the notation $\theta=\alpha$, which was later called Fisher's $\alpha$
in the biological diversity literature~\cite{pielou1975ecological,
engen_stochastic_1978}. The depth of Fisher's intuition concerning this
model has already been pointed out by
Watterson~\cite{watterson1978homozygosity} in the context of population
genetics, see also Tavar\'e's recent review~\cite{tavare_magical_2021}.
Finally, the following result turns out to be useful. A second
generative process, the residual allocation model, has been shown to
converge to the Ewens sampling formula. This construction has been
popularized under the name Griffiths--Engen--McCloskey model by
Ewens~\cite{ewens2004mathematical}, in light of the pioneering works
of~\cite{engen_species_1974, mccloskey1965model}. Let $\{W_{1},W_{2},
\ldots,W_{k}\}$ be a vector of independently identically distributed
random numbers drawn from the beta probability distribution with
parameters $(1,\theta)$:
$\text{Beta}(1,\theta)=\theta(1-W)^{\theta-1}$. Let us define the
variables $\{V_{1},V_{2},\ldots,V_{k}\}$ as follows:
{\begin{equation} \label{eq10}
V_{1}=W_{1},V_{k}=W_{k}\prod_{i=1}^{k-1}(1-W_{i}).
\end{equation}}\unskip
This model has an intuitive interpretation as a broken stick model, and
it has been explored in the species abundance
literature~\cite{macarthur_relative_1957, cohen_alternate_1968,
engen_species_1974}: a random fraction $W_{1}$ of a stick of unit
length is labeled~1. The random fraction $W_{2}$ of the unlabeled
portion of the stick, of length $1-W_{1}$, is then labeled with
species~2, with a length $W_{2}(1-W_{1})$, and so forth. The sequence
$\{V_{1},V_{2},\ldots,V_{k}\}$ can be used to generate a multinomial
sample $\{n_{1},n_{2},\ldots,n_{k}\}$ with weights
$\{V_{1},V_{2},\ldots,V_{k}\}$ it was shown that this construction
verifies the Ewens sampling formula
Equation~(\ref{eq2})~\cite{pitman_combinatorial_2006}. The
Griffiths--Engen--McCloskey is extremely helpful computationally
because it allows to generate a partition structure of $n$ objects,
with $n$ possibly very large, while drawing only on the order of $k$
random samples, with typically $k\ll n$.
\subsection{Neutral model with dispersal limitation{\hfill\break}
(model~2)} \label{model2}
Spatial extensions of Model~1 have been developed early on in the
context of population genetics~\cite{malecot1948,
wakeley_nonequilibrium_1999, rousset2004genetic}, with subsequent
applications in ecology and
biogeography~\cite{hanski_metapopulation_1998, hubbell_unified_2001}.
One possible framework is as follows: a region is considered as a
collection of $K$ local sites (``demes'' in the parlance of population
genetics), and each local site has a total size $\{N_{1},\ldots,
N_{K}\}$. Within a local site, individuals interact directly, whereas
local sites are only connected through migration.\ This framework is the
multi-deme model of population genetics~\cite{maruyama1970effective,
wakeley_nonequilibrium_1999} and the metapopulation model of population
dynamics~\cite{hanski_metapopulation_1998}. Within-site processes
(local \mbox{reproduction,} \mbox{interactions)} are the dominant processes over
small time scales compared with the genealogical processes occurring
across local sites and over much longer time
scale~\cite{wakeley_nonequilibrium_1999}.\ A similar setting arises in
spatial interacting particle systems, such as chemical
reactions {\cite{van1992stochastic,
durrett1999stochastic}}.
An historically important special case, due to
Hubbell~\cite{hubbell_unified_2001}, is one where a single site is
sampled. Conceptually, this model is parameterized by the same
parameter $\theta$ as above, which describes the regional species pool.
An additional parameter $m$ ($0\leq m \leq 1$) represents the
probability that a new individual appears in the focal community
through immigration rather than due to local reproduction. If $m\sim
1$, all the local recruits are immigrants, and the local structure
becomes irrelevant, in which case the Ewens sampling formula
Equation~(\ref{eq2}) applies, together with the results of the previous
section. In the more general case of an arbitrary parameter~$m$, a
closed-form solution of the general sampling formula does exist and it
generalizes Equation~(\ref{eq2})~\cite{etienne_novel_2004,
etienne_new_2005}. With $I={m}/({1-m})(n-1)$ a rescaled migration
parameter, the generalized version of the sampling formula is (see
Equation~(6) in~\cite{etienne_new_2005}):
{\begin{eqnarray}
&& p_{\theta,I}(n_{1},n_{2},\ldots,n_{k})\nonumber\\
&& \quad =\frac{\theta^{k}}{\theta^{(n)}}
\frac{n!}{k!\prod_{j=1}^{k}n_{j}}\sum_{j=k}^{n}\left(K(j)
\frac{\theta^{(n)}}{\theta^{(j)}}\frac{I^{j}}{I^{(n)}}\right).
\label{eq11}
\end{eqnarray}}\unskip
This sampling formula involves a summation term and series of numbers
$K(j)$ which are defined as the coefficients of the polynomial
{\begin{equation} \label{eq12}
\sum_{j=k}^{n}K(j)x^{j}=\prod_{i=1}^{k}\sum_{a_{i}=1}^{n_{i}}
\frac{S(n_{i},a_{i})S(a_{i},1)}{S(n_{i},1)}x^{a_{i}}
\end{equation}}\unskip
where $S(n,a)$ are again the unsigned Stirling numbers of the first
kind (Equation~(\ref{eq1})). However, for relatively large values
of~$n$, an exact calculation of the coefficients $K(j)$ is difficult
(but see~\cite{jabot_reconciling_2008,janzen_sampling_2015}), and is
impossible for very large sample sizes. For this reason Hubbell's
dispersal-limited neutral model is of limited use in many practical
cases.
Let us now turn to Model~2.\ One far more interesting model of spatially
subdivided populations is the $K$-deme model, with $K$ local samples
(or~demes), of size $\{N_{1},\ldots,N_{K}\}$.\ Assume that the regional
relative species abundance distribution is given by
$\{x_{1},\ldots,x_{k}\}$, where $x_i$ is the regional relative
abundance of species~$i$, and $\sum_i x_i = 1$. Denote
$\{n_{1j},\ldots,n_{kj}\}$ the species abundances in deme~$j$, such
that $\sum_i n_{ij} = N_j$. Finally, assume that each local deme $j$
has an immigration rate~$m_j$, with $m_j\in [0,1]$ (or equivalently the
rescaled immigration rate $I_j={m_j}/({1-m_j})(N_j-1)$). This model
could describe an archipelago with $K$ islands, some far away from the
continent ($m\ll 1$) and others closer, as in the insular theory of
biogeography~\cite{macarthur2001theory}. In this case, the sampling
formula $p_{\mathbf{x},\mathbf{I}}(n_{ij})$, $i\in\{1, \ldots, k\}$,
$j\in\{1, \ldots, K\}$ can be written as~\cite{jabot_reconciling_2008}:
{\begin{equation} \label{eq13}
p_{\mathbf{x},\mathbf{I}}(n_{i,j})=\prod_{j=1}^K\frac{N_j!}
{\prod_{i=1}^k n_{ij}!}\frac{\prod_{i=1}^k (I_jx_i)^{(n_{ij})}}
{I_j^{(N_j)}}.
\end{equation}}\unskip
In Model~2, the regional species abundance $\mathbf{x}$ is assumed
known, rather than resulting from a neutral regional dynamics as in
Model~1, so the parameter $\theta$ is absent. It may also be assumed
that the local species abundances $n_{ij}$ are a representative and
unbiased sample of the regional species pool, implying that the vector
$\mathbf{x}$ can be approximated by $x_i = \sum_{j=1}^K
n_{ij}/\sum_{j=1}^K N_j$~\cite{jabot_reconciling_2008}. A less
straightforward alternative consists in assuming that the region
follows Model~1, compute $\theta$ and deduce
$\mathbf{x}$~\cite{munoz2007estimating}. From Equation~(\ref{eq13}),
the likelihood function for this problem is:
{\begin{equation} \label{eq14}
L_{\mathbf{x},n_{i,j}}(\mathbf{I})=C+\sum_{j=1}^K\left(\sum_{i=1}^k\ln
\frac{\Gamma(I_jx_i +n_{ij})}{\Gamma(I_jx_i)} -\ln\frac{\Gamma(I_j+N_j)}
{\Gamma(I_j)}\right)
\end{equation}}\unskip
\par\vspace*{.5pc}\noindent
with $C$ a constant, and using again the equality
$x^{(n)}=\Gamma(x+n)/\Gamma(x)$.
Note also that $p_{\mathbf{x},\mathbf{I}}(n_{i,j})$ is the product of
the probabilities for each of the $K$ demes. The maximum likelihood
estimate of $I_j$ is obtained for the condition: $\forall j$,
${\partial L_{\mathbf{x},n_{i,j}}(\mathbf{I})}/{\partial I_j} = 0$,
and this implies that the migration parameter $I_j$ can be estimated
independently of all other model parameters through the following
equation, for all~$j$:
{\begin{eqnarray}
&& \sum_{i=1}^kx_i[\psi^{(0)}(I_jx_i+n_{ij})-\psi^{(0)}(I_jx_i)]
\nonumber\\
&& \quad = \psi^{(0)}(I_j+N_j)-\psi^{(0)}(I_j)\label{eq15b}
\end{eqnarray}}\unskip
$\psi^{(0)}(\theta)=({\mathrm{d}}/{\mathrm{d}\theta})\ln
(\Gamma(\theta))$. From Equation~(\ref{eq15b}), or by maximization of
Equation~(\ref{eq14}), the parameters $I_j$ of Model~2 can be simply
inferred at each site~$j$.
\subsection{Neutral ``diversity begets diversity'' model{\hfill\break}
(model~3)} \label{model3}{\vspace*{-.5pc}}
A second natural extension of Model~1 is one where the probability of
creating new species increases with the number of species in the
sample. The idea of this model is that species entering a community
generate the conditions for the establishment of more species than
originally possible. Formally, the rate of species appearance is not
strictly equal to $\theta$ but increases with the number of extant
species $k$ as $\theta+\sigma k$, where $\sigma$ is a new parameter in
the model compared with Model~1.
As outlined above, let us first represent this model as an urn scheme.
After $n-1$ individuals have been sampled, the probability of adding
one individual to species $i$ (i.e., sampling a colored ball) is equal
to $(n_{i}-\sigma)/(\theta+n)$, while the probability of sampling an
altogether new species (i.e., sampling the black ball) is equal to
$(\theta+\sigma k)/(\theta+n)$. In the special case $\sigma=0$, this
model is equivalent to the Hoppe urn model (Model~1). Values
$1\geq\sigma\geq0$ imply that rare species tend to be picked less, and
that more new species arise. As $\sigma\to 1$, the probability of
picking a singleton species vanishes, and at $\sigma=1$, species cannot
increase in abundance and each new individual represents a different
species. This model was introduced by Blackwell and
MacQueen~\cite{blackwell1973ferguson} in the early 1970s, then was
formally studied by Pitman and
colleagues~\cite{pitman_two-parameter_1992, perman_size-biased_1992,
pitman_exchangeable_1995}.
In Model~3, the expected number of species $k$ is given by the summed
probability of picking the black ball at each step:
{\begin{eqnarray}
\overline{k}_{n+1} &=& \sum_{j=0}^{n}\frac{\theta+\overline{k}_{j}
\sigma}{\theta+j}=\overline{k}_{n}+\frac{\theta+\overline{k}_{n}
\sigma}{\theta+n}\nonumber\\
&=& \frac{\theta}{\theta+n}+\left(1+\frac{\sigma}{\theta+n}\right)
\overline{k}_{n}.\label{eq13b}
\end{eqnarray}}\unskip
This equation has an exact solution, with the boundary condition
$\overline{k}_{1}=1$:
{\begin{equation} \label{eq17a}
\sigma \overline{k}_n=\frac{\Gamma(\theta+1)}{\Gamma(\theta+\sigma)}
\frac{\Gamma(n+\theta+\sigma)}{\Gamma(n+\theta)}-\theta.
\end{equation}}\unskip
The validity of the boundary condition $\overline{k}_1=1$ is verified
immediately from the equality: $\Gamma(1+\theta+\sigma)=
(\theta+\sigma)\Gamma(\theta+\sigma)$. The asymptotic regime at large
sample size $n$ is obtained with the Stirling formula $\Gamma(x)
\approx\sqrt{2\rmpi}x^{x-1/2} \mathrm{e}^{-x}$,
valid for large $x$ and applied on \mbox{Equation}~(\ref{eq17a}):
{\begin{equation} \label{eq18a}
\overline{k}_n\approx\frac{\Gamma(\theta+1)\mathrm{e}^{-\sigma}}
{\Gamma(\theta+\sigma)\sigma}n^{\sigma}-\frac{\theta}{\sigma}.
\end{equation}}\unskip
This complements the asymptotic regime of Equation~(\ref{eq8}) in the
case $\sigma> 0$. Interestingly, the power-law species accumulation
curve emerges from this simple generalization of Model~1. The
discussion of whether the species accumulation curve should follow a
logarithmic or a power-law shape has been much discussed in the
ecological literature for at least a
century~\cite{arrhenius1921species}.
Let us now turn to the existence of a sampling formula for Model~3.
Pitman has shown that a sampling formula analogous to that of Ewens can
also be derived in this case~\cite{perman_size-biased_1992} and that it
has the following form:
{\begin{eqnarray}
&& p_{\theta,\sigma}(n_{1},n_{2},\ldots,n_{k})\nonumber\\
&&\quad =\frac{\theta(\theta+\sigma)\cdots(\theta+(k-1)\sigma)}
{\theta^{(n)}}\prod_{j=1}^{k}(1-\sigma)^{(n_j-1)}.\qquad
\label{eq17}
\end{eqnarray}}\unskip
This formula is called the two-parameter Pitman sampling
formula~\cite{pitman_two-parameter_1992, perman_size-biased_1992,
pitman_exchangeable_1995, pitman_combinatorial_2006}. Noticing that the
numerator in the above equation can be rewritten
$\sigma^k(\theta/\sigma)^{(k)}$, it follows that:
{\begin{equation} \label{eq15a}
p_{\theta,\sigma}(n_{1},n_{2},\ldots,n_{k})=\frac{\sigma^k(\theta/
\sigma)^{(k)}}{\theta^{(n)}}\prod_{j=1}^{k}(1-\sigma)^{(n_j-1)}.
\end{equation}}\unskip
Using again the fact that the increasing factorial obeys the following
relationship: $x^{(n)}=\Gamma(x+n)/\Gamma(x)$, Equation~(\ref{eq17})
can be rewritten in terms of the Gamma function:
{\begin{eqnarray}
&& p_{\theta,\sigma}(n_{1},n_{2},\ldots,n_{k})\nonumber\\
&& \quad =\sigma^k\frac{\Gamma(\theta)}{\Gamma(\theta+n)}\frac{\Gamma
(\theta/\sigma+k)}{\Gamma(\theta/\sigma)}\prod_{j=1}^{k}\frac{\Gamma
(n_{j}-\sigma)}{\Gamma(1-\sigma)}.\label{eq19a}
\end{eqnarray}}\unskip
In empirical species abundance studies, one objective is to infer the
values of model parameters $\theta$, $\sigma$ given the observed vector
$n_{1},n_{2},\ldots,n_{k}$. In the case of the Ewens sampling formula,
the number of species $k$ and the sampling size $n$ are jointly
sufficient to estimate the parameter~$\theta$. It is no longer the case
in this two-parameter Model~3. However, it remains true that
Equation~(\ref{eq19a}) can be used to define a
\mbox{log-likelihood}
function $L_n(\theta,\sigma)=\ln
p_{\theta,\sigma}(n_{1},n_{2},\ldots,n_{k})$, which takes the form:
{\begin{eqnarray}
&& L_{\mathbf{n}}(\theta,\sigma)\nonumber\\
&&\quad =\ln\left[\frac{\sigma^k}{\Gamma(1-\sigma)^k}\frac{\Gamma
(\theta)}{\Gamma(\theta+n)}\frac{\Gamma(\theta/\sigma+k)}{\Gamma
(\theta/\sigma)}\prod_{j=1}^{k}\Gamma(n_j-\sigma)\right].
\nonumber\\ \label{eq19}
\end{eqnarray}}\unskip
The values of $\theta$, $\sigma$ such that the partial derivatives of
$L_n(\theta,\sigma)$ vanish yield the necessary conditions for the
existence of maximum likelihood estimates $\overline{\theta}$,
$\overline{\sigma}$:{\vspace*{.3pc}}
{\begin{equation} \label{eqn23}
\frac{\partial L_{\mathbf{n}}}{\partial\theta}(\overline{\theta},
\overline{\sigma})=0,\quad \frac{\partial L_{\mathbf{n}}}{\partial\sigma}
(\overline{\theta},\overline{\sigma})=0.
\end{equation}}\unskip
\par{\vspace*{.25pc}}\noindent
Finding best-fit parameters $\theta$, $\sigma$ can be obtained by
solving these two equations jointly, but it is simpler to maximize the
function $L_{\mathbf{n}}(\theta,\sigma)$ (Equation~(\ref{eq19})). The
numerical problem of finding $\theta$, $\sigma$ given the vector
$\mathbf{n}$ is therefore easily resolved (see next section for a
numerical implementation).
Importantly, species abundance distributions for Model~3 can be
generated through a random allocation model
(Griffiths--Engen--McCloskey construction) similar to that in Model~1.
Let $\{W_{1},W_{2},\ldots, W_{k}\}$ be a vector of i.i.d.\ random draws
with $W_{i}$ from the probability distribution
$\text{Beta}(1-\sigma,\theta+k\sigma)$, with
\mbox{$0\leq\sigma\leq1$}.\ Define the variables
$\{V_{1},V_{2},\ldots,V_{k}\}$ as \mbox{follows:}
{\begin{equation} \label{eq23}
V_{1}=W_{1},\quad V_{k}=W_{k}\prod_{i=1}^{k-1}(1-W_{i}).
\end{equation}}\unskip
This model generates variables $\{V_{1},V_{2},\ldots,V_{k}\}$.\ A
sample of $n$ individuals from a multinomial distribution with weights
$\{V_{1},V_{2},\ldots,V_{k}\}$ is denoted
$\{n_{1},n_{2},\ldots,n_{k}\}$ and it was
shown~\cite{pitman_exchangeable_1995, pitman_combinatorial_2006} that
this sample verifies the Pitman sampling formula Equation~(\ref{eq17}).
\subsection{Numerical analyses} \label{sec2.5}
To perform empirical analyses, I have written the package
\texttt{neutr} in the R Statistical Language~\cite{Rstatlang},
available at \url{https://github.com/jeromechave/neutr}.
Parameters can be fit against the empirically observed species
abundance data.\ For Model~1, function \texttt{optim.ewens()} optimizes
Equation~(\ref{eq2}).\ It has the empirical species abundance as an
argument and returns parameter $\theta$ and the maximal log-likelihood
value.\ For Model~2, the function \texttt{optim.multideme()} takes as
argument a matrix with entry the abundance $n_{ij}$ (species~$i$ in
deme~$j$) and it optimizes
Equation~(\ref{eq13}).~This function returns a vector of parameters
$I_j={m_j}/({1-m_j})(n_j-1)$, one per deme, and~$m_j$.\ Importantly,
the \texttt{optim.multideme()} function assumes that the regional
species abundance distribution is the sum of all local species
abundances. For Model~3, the function \texttt{optim.pitman()} takes the
empirical species abundance as an argument, plus initial \mbox{values} of
$\theta$, $\sigma$ and returns the best-fit parameters $\theta$,
$\sigma$ and the maximal log-likelihood value based on\break
Equation~(\ref{eq19}).
Another set of functions return typical species abundance distribution
generated by Models~1 and~3.\ Package \texttt{neutr} includes the
function \texttt{generate.hoppe.urn0()} that generates a species
abundance distribution according to the Hoppe urn model (Model~1). It
takes parameter value $\theta$ and sampling size $n$ as an argument,
and returns one possible species abundance distribution and the species
number $\overline{k}_n$ inferred from Equation~(\ref{eq6}). There are
at least two ways to code this process. Drawing balls one at a time
results in a relatively inefficient procedure (but function
\texttt{generate.hoppe.urn0()} does this in an efficient way). The
alternative is to generate random variables according to the residual
allocation model described in Equation~(\ref{eq10}), and to perform a
single multinomial sampling of $n$ individuals with weights
$\{V_1,V_2,\ldots,V_k\}$.\ This second ultrafast procedure is coded in
function \texttt{generate.hoppe.urn()}.\ Package \texttt{neutr} also
includes the function \texttt{generate.pitman.urn()} that generates a
species abundance distribution according to the Pitman model (Model~3).
These two last functions have been coded to run with sample sizes of
more than~$10^{12}$.
The three models can also be compared: Model~1 is nested within both
Models~2 and~3, but the number of $k_p$ parameters vary ($k_p=1$ for
Model~1, $k_p=K$ for Model~2, and $k_p=2$ for Model~3). It is possible
to compare the models based on some form of the Akaike Information
Criterion~\cite{akaike1974new}.
Package \texttt{neutr} bears resemblance with package
\texttt{untb}~\cite{hankin2007introducing}, which implements Model~1,
but not its Griffiths--Engen--McCloskey construction. Also, packages
\texttt{ecolottery}~\cite{munoz2018ecolottery} and package
GUILDS~\cite{janzen_sampling_2015} both implement Models 1 and some
forms of Model~2 based on the coalescent, as first proposed
by~\cite{etienne_novel_2004}. To my knowledge, Model~3 and its
Griffiths--Engen--McCloskey construction have never been implemented in
a R~package.
\section{Application to the tropical tree flora} \label{sec3}
{\vspace*{-1pt}}
\subsection{Datasets} \label{sec3.1}
{\vspace*{-1pt}}
An empirical application of the above theory is now provided for three
large empirical data sets taken from numerous tropical forest
inventories around the world and reproduced as a Supplementary
Information of~\cite{cooper_consistent_2024}. It contains a total of
over a million sampled trees all identified to the species, for a total
area of forest sampled of 2324~ha (23.24~$\text{km}^2$; summary in
Table~\ref{Table1}). This is a huge sampling size, although it is very
small compared with the ca. 10.7~million $\text{km}^2$ of tropical
forests~\cite{vancutsem_long-term_2021}. Since the exact species name
is not reported in this data set, it is impossible to estimate the
exact number of species in total, although the species overlap across
continents is small, and the species total is likely to be close to the
sum (9765 species). The raw data is plotted as a rank-abundance curve
in Figure~\ref{fig:figure1}.
\begin{figure*}
\includegraphics{fig02}
\caption{\label{fig:figure3}Kullback--Leibler distance between observed
and simulated distributions replicated 1000 times, after sequentially
removing 1 to 50 of the most abundant species. The minimal
Kullback--Leibler distance, number of species to remove to minimize the
distance, and $\theta$ are reported in Table~\ref{Table2}. Color codes
as in Figure~\ref{fig:figure1}.}
\end{figure*}
\begin{table*} %tab1
\caption{\label{Table1}Statistics of the data used in this study}
\tabcolsep=8pt
\begin{tabular}{ccccc}
\thead
Region & Nb trees & Nb species & Cumul. area (ha) & Nb plots\\
\endthead
Amazon & \0\,821,670 & 4670 & 1590 & 1417\\
Tropical Africa & \0\,210,313 & 1509 & \0504 & \0483\\
Southeast Asia & \0\,100,152 & 3586 & \0201 & \0230\\
Total & 1,021,974 & NA & 2324 & 2130
\botline
\end{tabular}
\xxtabnote{The table reports the total number of trees sampled, total
number of species, cumulative sampled area (in hectares, ha), and total
number of permanent sampling plots in the biome. Data from
Ref.~\cite{cooper_consistent_2024}.}
\end{table*}
Briefly, the data have been obtained using a standard method in
tropical forest science: standard areas, usually squares of one hectare
(100 $\times$ 100~m), are positioned on the ground, and all trees with
at least 10~cm in trunk diameter are mapped, tagged using a permanent
tag (in plastic or metal), and identified to the
species~\cite{blundo2021taking}. Establishing a permanent forest
inventory plot requires several days of work in the field, and complete
botanical identification is usually much more time consuming. The above
data set is therefore the result of a long term vision, and hard work
of a large scientific community from across the tropics. \looseness=-1
{\vspace*{-1pt}}
\subsection{Results} \label{sec3.2}
{\vspace*{-1pt}}
For Model~1, the estimate of parameter $\theta$ for all three data sets
is provided using the empirical values of $n$ and $k$ with
Equation~(\ref{eq6}). I used the empirical values of $\theta$ and
Equation~(\ref{eq10}) to produce 1000 neutral distributions for each of
the three empirical distributions. Figure~\ref{fig:figure1} reveals
that the fit to the data is not bad, except for a small number of the
most abundant species (see~\cite{jabot_inferring_2009} for a similar
pattern). It is therefore interesting to explore if the goodness of fit
improves when the top species are removed.
\looseness=-1
To estimate the goodness of fit, one method is to define a distance
between the observed $P_{\mathrm{obs}}$ and the theoretical
$P_{\mathrm{theor}}$ distributions.\ I~use the Kullback--Leibler
distance,
defined by $\text{KL}=\sum_{i=1}^{k1}
P_{\mathrm{theor}}\ln(P_{\mathrm{theor}}/P_{\mathrm{obs}})$, with $k1$
the minimum of the non-null values of both observed and theoretical
distributions.\ Successively removing $1,2,\ldots$ of the top species, a
new value of $\theta$ was computed, and the Kullback--Leibler distance
was \mbox{calculated.} Figure~\ref{fig:figure3} shows the shape of the
Kullback--Leibler distance against successive removals of top species.
The removal of 13, 5, and 40 top species, for the Amazon, tropical
Africa and Southeast Asia respectively, resulted in a massive
improvement of the model's goodness of fit as see in
Figure~\ref{fig:figure3} and in Table~\ref{Table2}. Removing the
ultradominant species results in a much improved fit (compare
Figures~\ref{fig:figure1} and~\ref{fig:figure4}), even though the
neutral model tends to underestimate slightly the abundance of the rare
species (right panel of Figure~\ref{fig:figure4}).
\begin{figure}
{\vspace*{-.3pc}}
\includegraphics{fig03}
{\vspace*{-.3pc}}
\caption{\label{fig:figure4}Fit of the rank abundance distributions
after removing the ultradominant species (see Table~\ref{Table2}).
Color codes are as in Figure~\ref{fig:figure1}.}
{\vspace*{-.6pc}}
\end{figure}
\begin{table*} %tab2
\caption{\label{Table2}Best fit of the canonical neutral model after
removing a number of the top species (see Figure~\ref{fig:figure3} for
an illustration)}
\begin{tabular}{ccccc}
\thead
Region & Initial $\theta$ & Min. Kullback--Leibler & Nb. of
ultradominant species & $\theta$ of best model\\
\endthead
Amazon & 654.3 & 2748.6 & 13 & 673.19\\
Tropical Africa & 219.7 & 1077.0 & \06 & 226.69\\
Southeast Asia & 726.8 & \0299.9 & 40 & 778.23
\botline
\end{tabular}
\tabnote{Empirical values of parameter $\theta$ are reported
before (first column) and after (last column) the ultradominant species
have been removed. Column ``Min. Kullback--Leibler'' reports the mean
value of the minimal Kullback--Leibler distance between model and
observations (average across 1000 values).}
\end{table*}
The multideme model (Model~2) can be fitted with the maximization of
Equation~(\ref{eq15b}). The distribution of $m$~values, which represent
the fraction of individuals drawn from the regional pool rather than
from the local site, peaks at $m=0.122$ for Amazonia, $m=0.094$ for
Africa and $m=0.127$ for Southeast Asia (Figure~\ref{fig:figure5}).
This shows that local sites are dispersal-limited in similar ways
across continents on average. Sites-specific parameters $m$ could be
interpreted as an environmental filtering effect, since $m$ measures
how dissimilar the local assemblage is from the regional
one~\cite{jabot_reconciling_2008}.
\begin{figure}
{\vspace*{-.3pc}}
\includegraphics{fig04}
{\vspace*{-.3pc}}
\caption{\label{fig:figure5}Estimation of the local $m$ parameters for
all the sites in the three continents. The figure reports the density
distribution of $m$ values. Color codes are as in
Figure~\ref{fig:figure1}.}
{\vspace*{-.8pc}}
\end{figure}
A fit of the data set against the two-parameter Model~3 is illustrated
in Figure~\ref{fig:figure6}. The fit is not improved for the Amazonia
and tropical Africa data ($\sigma$~values ${<}10^{-7}$), and barely so
for Southeast Asia ($\sigma=0.034$). Thus, for these data sets, Model~3
does not result in a better fit of the data than Model~1.
\begin{figure}
\includegraphics{fig05}
{\vspace*{-.3pc}}
\caption{\label{fig:figure6}Fit of the rank abundance distributions
(thick solid lines) against Model~3 ($n = 100$ simulations, thin solid
lines), in linear--log axes, which allow a better display of the tail
of the distribution (rare species). Compare with
Figure~\ref{fig:figure1} for a similar representation in log--log axes.
Color codes are as in Figure~\ref{fig:figure1}.}
{\vspace*{-.4pc}}
\end{figure}
The neutral model represents well tropical tree species abundances at
regional scale, and this finding is used to extrapolate species
numbers~\cite{ter_steege_hyperdominance_2013}. Assuming that the
Amazonian tropical forest covers about 6.3~million $\text{km}^2$, with
an average 500 trees per hectare, the estimated number of trees is
$N\approx 3.15 \times 10^{11}$. Using the value of $\theta$ reported in
Table~\ref{Table2} and $N$ in Equation~(\ref{eq6}) yields
$\overline{k}_N = \text{13,081}$ species. Using the improved estimate
of $\theta$ after the removal of ultra-dominant species yields
$\overline{k}_N = \text{13,440}$ species. The values for the two other
continents are reported in Table~\ref{Table3}. It is also possible to
estimate the number of species with at least 50 individuals overall (a
lower bound of the minimum viable population
size~\cite{frankham2014genetics}) to avoid the pitfall of predicting
the occurrence of species represented by a handful of individuals in a
region\cite{lande_extinction_1987} (Table~\ref{Table3}). For Amazonia,
the resulting number is $\overline{k}_{N, n>50} = \text{10,141}$
species, very close to the latest estimate of 10,071 species reported
in~\cite{ter2019towards}.\looseness=1
\begin{table*} %tab3
\caption{\label{Table3}Estimated number of species in tropical forests
based on an extrapolation of Model~1}
\begin{tabular}{ccccc}
\thead
Region & Area (M km$^2$) & Nb trees (estimated) & Nb species & Nb
species with at least 50 ind.\\
\endthead
Amazon & 6.3 & $3.15 \times 10^{11}$ & 13,081 & 10,141 $\pm$ 91\\
Tropical Africa & 2.9 & $1.45 \times 10^{11}$ & \04,462 & \03,477 $\pm$ 51\\
Southeast Asia & 1.1 & $0.55 \times 10^{11}$ & 13,186 & \09,915 $\pm$ 90
\botline
\end{tabular}
\tabnote{The estimated number of species
is provided together with the estimated number of species with at least
50 individuals.}
\end{table*}
\section{Discussion} \label{sec4}
{\vspace*{-.15pc}}
\subsection{Three neutral models} \label{sec4.1}
The three models presented here are only a few examples of the many
systems that can be framed as urn models~\cite{johnson1977urn}. The
reader may be surprised to read no mention of the coalescent theory,
even if Model~1 is the key building block of this
theory~\cite{kingman1982coalescent, wakeley2009coalescent}. The
coalescent is a powerful approach, but the choice was made here to
focus solely on species abundance distribution and efficient numerical
analyses can be performed without resorting to coalescent models. There
is no doubt that exploring further applications of the coalescent in
ecology should be a rewarding \mbox{effort}.
The three models presented here are neutral in the following sense. All
three describe a partition of the collection into disjoint subsets
(classes), such that the objects within each class are interchangeable,
and the abundance $n_i\geq 1$ within class $i$ is a random number that
fully describes the class. The models are thus fully described by the
probability distribution $p(n_1,\ldots,n_k)$ of the sequence of random
numbers $\{n_1,\ldots,n_k\}$, where $\sum_{i=1}^k=n$, and this
probability distribution function is invariant under any permutation
$\eta$ of the class labels:
$p(n_1,\ldots,n_k)=p(n_{\eta(1)},\ldots,n_{\eta(k)})$. This last
property is called {\it exchangeability}, and the probability
distribution $p(n_1,\ldots,n_k)$ is then called {\it
exchangeable}~\cite{kingman1978uses}. It turns out that there is a
formal equivalence between the statement (1)~the random partition has
the property of exchangeability and (2)~the property that Models~1
and~3 can be constructed as urn processes and as random allocation
processes (Proposition~9 in~\cite{pitman_exchangeable_1995}). This is
an important result because it provides a rigorous definition of the
concept of class equivalence in neutral models in ecology and
population genetics.
Another property common to exchangeable models is that a random
partition following Equations~(\ref{eq10}) and~(\ref{eq23}) define a
size-biased partition of $\{V_{1},V_{2},\ldots,V_{k}\}$ that can be
used to define an ordered series of relative abundances $P_i$,
$i\in\{1,\ldots,k\}$, with $P_1\geq P_2\geq \cdots \geq P_k$, such that
$\sum_{i=1}^k P_i = 1$. The probability distribution of this collection
is unchanged upon the removal of the first species $P_1$ and
normalization $P'_i = P_i/(1-P_1)$, $i>1$, since the new sequence has
exactly the same formal structure. This provides the opportunity to
generate a test of neutrality by sequentially removing the first
species, the second species, and so forth, until the empirical data
best fits the model. This idea provides an intuitive method to define a
notion related to that of species hyperdominance in a species
assemblage~\cite{ter_steege_hyperdominance_2013}. Here, a concept
related to hyperdominance, ultradominance, is defined: a top species is
ultradominant if is removal significantly improves the fit of the
neutral model. More precisely, ultradominant species are the first $U$
species such that the series $P_{U+1}\geq P_{U+2}\geq \cdots \geq P_k$
minimizes the distance between the neutral model fit and the empirical
observations (cf.~\cite{jabot_inferring_2009} for a related
discussion). This definition is relative to a choice of distance on
probability distributions, and also on the choice of the neutral model
(Model~1, 2, 3 or another variant). As shown in the
Section~\ref{sec3.2} (Figure~\ref{fig:figure3}), the number of
ultradominant species is usually far lower than that of hyperdominant
species.
The one-parameter ($\theta$) Model~1 is a special case of a more
general two-parameter ($\theta, \sigma$) Model~3. When $\sigma=0$,
Models~1 and~3 are equal. Model~3 turns out to have a biological
interpretation as a species abundance model where diversity begets
diversity: new species tend to be picked with probability
$({\theta+k\sigma})/({\theta+n})$ ($0\leq\sigma\leq1$), when $k$
species are already present, so more often than expected by chance.
With respect to the asymptotic scaling regime $n\gg 1$,
Equations~(\ref{eq8}) and~(\ref{eq18a}) are two well-known scaling
relationships in ecology and there has been much literature on whether
the species accumulation curve $\overline{k}_{n}$ should follow a
logarithmic shape $\overline{k}_{n}\sim\theta\ln(n)$ (as proposed by
Fisher 1943) or a power-law shape $\overline{k}_{n}\sim n^\sigma$.
Models~1 and~3 show that the two regimes are consistent with a single
representation of a model of exchangeable random partitions. While
Model~3 did not provide a better fit of empirical data than Model~1 for
tropical tree species, it is possible that species assemblages at
higher trophic levels are more likely to verify the conditions of
Model~3.
The dispersal-limited neutral model of~\cite{hubbell_unified_2001} is
historically important in the context of ecology and biogeography.
However, Hubbell's two-parameter $(\theta,m)$ generalization of Model~1
is not framed as an urn model or a random allocation model. An urn
representation of Hubbell's model is possible, but it is not
straightforward\footnote{Define two urns, the first with a black ball
with weight~$\theta$, the second empty. First, draw $J$ balls from the
first urn exactly as in the Hoppe urn scheme, generating the sequence
of ball colors: $n^{[1]}_1, \ldots, n^{[1]}_k$, with
$n^{[1]}=J=\sum_{i=1}^{k}n^{[1]}_i$. Here the notation ``${}^{[1]}$''
represents the first urn. Then turn to the second urn. At step~$n$,
either (1)~pick one ball from the {\it first} urn with probability
$I/(I+n)$; this ball is of color $i$ with probability $n^{[1]}_i/J$;
add a new ball of the same color $i$ in the {\it second} urn (so:
$n^{[2]}_i \to n^{[2]}_i+1$), or (2)~with probability $n/(I+n)$, pick a
ball in the second urn, and add one ball of the same color. In the
second urn, the total number of $n^{[2]}=n=\sum_{i=1}^{k}n^{[2]}_i$
(usually, $n\ll J$).}. This is a special form of a hierarchical urn
construction, such that the construction of the second urn depends on
that of the first urn, but not the other way around. Here, I define
Model~2 as a neutral model of $K$ subdivided \mbox{assemblages,} or multi-deme
model, a more useful alternative in practical situations. Though only
one result is presented and studied here, many other results have been
\mbox{obtained}, and it is an area where coalescent based numerical
analyses are most
useful~\cite{rousset2004genetic,wakeley2009coalescent}.
\subsection{Implications for tropical forests} \label{sec4.2}
\looseness=-1
A remarkable feature is that neutral models fit extremely well tree
species abundance data~\cite{hubbell_unified_2001,
ter_steege_hyperdominance_2013}. The only departure from this neutral
fit appears to be for the most abundant species, which I have here
informally called ultradominant species. Ultra-dominant species
are the most abundant species that tend to depart from the prediction
of the canonical neutral model (Model~1), and a simple empirical method
is proposed here to detect these species. When ultra-dominant species
are removed from the analysis, the neutral model reproduces empirical
observations extremely well in the three regional tropical tree data
sets explored here (Figure~\ref{fig:figure3}). The nature of
ultra-dominant species is unclear, but it would be interesting to
explore the commonalities of ultra-dominant species, and possible
biological explanations for their occurrence.
A second insight from this analysis stems from the fit of the
multi-deme model (Model~2) to the same three regional tropical tree
data sets. The finding of Figure~\ref{fig:figure4} is that most local
sites significantly depart from a hypothesized random sample of the
regional species pool, detected by $m$ values much lower than unity. In
a dispersal-limited interpretation~\cite{hubbell_unified_2001}, this
suggests that a relatively constant proportion of the reproduction
events are local, the rest being immigration events. The alternative
explanation is that biotic or abiotic effects exert a major influence
on the floristic composition of each local community, and $m<1$ values
in the multi-deme model reflect these environmental filtering
effects~\cite{jabot_reconciling_2008}.
In the Ewens canonical model, $\theta$ is a natural measure of local
diversity, which is a convenient property of the model. However,
non-parameteric methods of biodiversity estimation have also been
developed, and they are often presented as more robust than parametric
methods~\cite{bunge1993estimating, chao1984nonparametric}. It is not
the goal of the present contribution to discuss this issue at length.
One known limitation of Model~1 is that the $\theta$ parameter cannot
be estimated with confidence for small sample sizes. A method such as
Model~2 may provide a useful alternative to Model~1 and it would be
interesting to explore the scale dependence of{\break} the
$m$~parameter (i.e., how $m$ varies with decreasing sample sizes~$n$).
The notion of neutrality has generated a heated debate in
ecology~\cite{hubbell_unified_2001, chave_neutral_2004,
ricklefs_unified_2006, alonso_merits_2006, ricklefs_global_2012}. Part
of the debate has been motivated by a narrow definition of neutrality.
The questions raised relate to the fact that this model ignores so many
important features as to become useless or event
dangerous~\cite{clark2009beyond}. Species differ not only in abundance
but also in ecological traits, and individuals within species also
differ in size, metabolic capacity, and fitness. A second class of
critiques of ecological neutrality is that tests of the theory are not
robust because independent estimates of the parameters cannot be
accessed~\cite{ricklefs_unified_2006}. A last class of critiques relate
to the consistency of the ecological neutral model with respect to the
dominant theory in ecology, niche theory, which interprets patterns of
species occurrence in the light of competition between
species~\cite{chesson_mechanisms_2000}. Sampling methods that make the
assumption of neutrality (in the sense of exchangeability), are however
more general than commonly discussed in the ecology literature. One
strength of these models presented here is that they are naturally
associated with a method of exact inference, which makes it possible to
compare model and data quantitatively. By analogy, the analysis
of selectively neutral alleles in natural populations is a useful tool
set of population genetics, and helps infer past population sizes, and
phases of demographic expansions and
bottlenecks~\cite{ewens2004mathematical, li2011inference}.
In ecology, the development of the neutral mathematical tool set has
not been as rapid as in genetics. One limitation has been that, unlike
in genetics, methods for analyzing huge data sets has been relatively
less in need in ecology. The situation is changing now with the rapid
rise of DNA-based ecology~\cite{taberlet2018environmental}, in which
samples are not of individuals but of sequences, and with applications
in tropical forest research~\cite{zinger2020advances}. Provided that
sequence abundance data can be interpreted biologically, the question
of the structure of sequence abundance distributions can be explored
with the same approach as outlined here.
Why should a model as simple as Model~1 provide such an excellent fit
to regional tropical tree abundance data? The neutral hypothesis cannot
be valid over ecological and evolutionary time scales. Yet, regional
species assemblages result from such a myriad of local processes, both
biotic and abiotic, that~large enough samples of regional patterns
average over these effects. Half a century ago, Robert~H.\ Whittaker
wrote~\cite{whittaker1972evolution}: ``The enigma of the diversity of
the tropical rainforest should be expected to open itself to no single
key, and may be enigmatic to the extent we have yet to comprehend the
full implications of biotic differentiation and interaction, the
complexity [{\ldots}] that is feasible and has evolved in these
forests''. To this day, this analysis still holds, and the successes of
the neutral theory to replicate broad patterns remains a mystery. Even
if the details of species persistence and coexistence may be explained
by fundamentally different processes~\cite{kruckeberg1985biological},
the laws of large numbers provide important insights into key questions
on the regional species abundance patterns.
\section*{Declaration of interests}
The author does not work for, advise, own shares in, or receive funds
from any organization that could benefit from this article, and has
declared no affiliations other than his research institution.
\section*{Funding}
I gratefully acknowledge funding from ``Investissement d'Avenir''
grants managed by the Agence Nationale de la Recherche (CEBA, ref.
ANR-10-LABX-25-01; TULIP, ref. ANR-10-LABX-0041; ANAEE-France:
ANR-11-INBS-0001), CNES long-term funding, the European Space Agency
CCI-BIOMASS and FRM4BIOMASS projects. This research is product of the
SynTreeSys group funded by the synthesis center CESAB of the French
Foundation for Research on Biodiversity (FRB).
\CDRGrant[Agence Nationale de la Recherche]{ANR-10-LABX-25-01}
\CDRGrant[Agence Nationale de la Recherche]{ANR-10-LABX-0041}
\CDRGrant[Agence Nationale de la Recherche]{ANR-11-INBS-0001}
\CDRGrant[European Space Agency]{CCI-BIOMASS}
\CDRGrant[European Space Agency]{FRM4BIOMASS}
\back{}
\section*{Supplementary data}
Supporting information for this article is available on the journal's
website under \printDOI\ or from the author. The package
\texttt{neutr} in the R Statistical Language~\cite{Rstatlang} is
available at \url{https://github.com/jeromechave/neutr}.
%\CDRsupplementaryTwotypes{supplementary-material}{\cdrattach{crbiol-154-suppl.pdf}}
\bibliographystyle{crunsrt}
\bibliography{crbiol20240388}
\refinput{crbiol20240388-reference.tex}
\end{document}