\makeatletter
\@ifundefined{HCode}
{\documentclass[CRBIOL,Unicode,screen,biblatex,published]{cedram}
\addbibresource{crbiol20250199.bib}
\newenvironment{noXML}{}{}
\let\citep\parencite
\let\citet\textcite
\def\xcitealp#1#2{\citeauthor{#1}, \citelink{#1}{#2}}
\newcommand*{\citelink}[2]{\hyperlink{cite.\therefsection @#1}{#2}}
\def\defcitealias#1#2{}
\let\citepalias\parencite
\let\citetalias\textcite
\newenvironment{Table}{\begin{table}}{\end{table}}
\def\thead{\noalign{\relax}\hline}
\def\endthead{\noalign{\relax}\hline}
\def\tabnote#1{\vskip4pt\parbox{.83\linewidth}{#1}}
\def\tsup#1{$^{{#1}}$}
\def\tsub#1{$_{{#1}}$}
\RequirePackage{etoolbox}
\def\jobid{crbiol20250199}
%\graphicspath{{/tmp/\jobid_figs/web/}}
\graphicspath{{./figures/}}
\newcounter{runlevel}
\skip\footins30pt
\let\MakeYrStrItalic\relax
\def\refinput#1{}
\def\back#1{}
\def\hyphen{\text{-}}
\def\xmorerows#1#2{{#2}}
\def\0{\phantom{0}}
\def\xsection#1{}
\def\botline{\\\hline}
\DOI{10.5802/crbiol.183}
\datereceived{2025-03-04}
\daterevised{2025-06-25}
\dateaccepted{2025-07-07}
\ItHasTeXPublished
\def\og{\guillemotleft}
\def\fg{\guillemotright}
\makeatletter
\g@addto@macro{\UrlBreaks}{\UrlOrds}
\gappto{\UrlBreaks}{\UrlOrds}
\usepackage{hyperref}
\makeatother
}
{
\PassOptionsToPackage{authoryear}{natbib}
\documentclass[crbiol]{article}
\def\CDRdoi{10.5802/crbiol.183}
\let\newline\break
\def\selectlanguage#1{}
\usepackage[T1]{fontenc}
\def\xmorerows#1#2{\morerows{#1}{#2}}
 \def\citelink#1#2{\citeyear{#1}}
 \def\xcitealp#1#2{\citealp{#1}}
\def\href#1#2{\url[#1]{#2}}
\def\xsection#1{}
\makeatletter
\def\CDRsupplementaryTwotypes#1#2{}
}
\makeatother

\usepackage{upgreek}
%\def \sct {\mbox{SARS-CoV-2}}
%\def \Covid {\mbox{Covid-19}}
\newcommand{\sct}{\mbox{SARS-CoV-2}}
\newcommand{\Covid}{\mbox{Covid-19}}
\begin{DefTralics}
\newcommand{\sct}{\mbox{SARS-CoV-2}}
\newcommand{\Covid}{\mbox{Covid-19}}
\end{DefTralics}
\dateposted{2025-09-04}
\begin{document}



\begin{noXML}

\CDRsetmeta{articletype}{review}

\title{Theories of the origin of SARS-CoV-2 in the light of its continuing
evolution}

\alttitle{\'Evaluation de th\'eories sur l'origine du SARS-CoV-2 \`a la
lumi\`ere de son \'evolution}

\author{\firstname{Florence} \lastname{D\'ebarre}\CDRorcid{0000-0003-2497-833X}\IsCorresp}
\address{Institute of Ecology and Environmental Sciences, CNRS UMR
7618, Sorbonne Universit\'e, UPEC, IRD, INRAE, Paris, France}
\email[F. D\'ebarre]{florence.debarre@normalesup.org}

\author{\firstname{Zach} \lastname{Hensel}\CDRorcid{0000-0002-4348-6229}}
\address{ITQB NOVA, Universidade NOVA de Lisboa, Av. da Rep\'ublica,
Oeiras, Lisbon 2780-157, Portugal}
\email[Z. Hensel]{zach.hensel@itqb.unl.pt}

\keywords{\kwd{Viral evolution}
\kwd{SARS-CoV-2}
\kwd{Furin cleavage site}
\kwd{Emerging diseases}
\kwd{Conspiracy theories}}

\altkeywords{\kwd{\'Evolution virale}
\kwd{SARS-CoV-2}
\kwd{Site de clivage par la furine}
\kwd{Maladies \'emergentes}
\kwd{Th\'eories du complot}}

\begin{abstract} 
The exact details of the emergence of SARS-CoV-2, the virus causing Covid-19,
remain unknown. Scientific publications using data available to date
point to a natural origin linked to the wildlife trade at a market in
Wuhan, China. Yet, theories postulating a research-related origin of
SARS-CoV-2 abound, and currently dominate the public discussion of the
origin of the Covid-19 pandemic. Here, we attempt to characterize the
diversity of research-related origin scenarios, discuss their
characteristics and evidence base, or the lack thereof, and highlight
mutual incompatibilities between some scenarios. We then focus on a
feature of SARS-CoV-2 that is central in today's leading research-related
hypotheses, namely the insertion that led to the introduction of a
polybasic cleavage site in the spike glycoprotein. We examine various
scenarios put forward to explain this insertion in a research-related
context, and we show how SARS-CoV-2's evolution in humans has provided
examples demonstrating that such insertions happen naturally. 
\end{abstract}

\begin{altabstract}
Bien que les d\'etails exacts de l'\'emergence de SARS-CoV-2, le virus
responsable de la Covid-19, restent inconnus, les donn\'ees disponibles
\`a ce jour vont dans la direction d'une origine naturelle li\'ee au
commerce d'animaux sauvages sur un march\'e de Wuhan, en Chine.
Cependant, les th\'eories postulant une origine de SARS-CoV-2 li\'ee \`a des travaux de recherche abondent. Nous tentons ici de caract\'eriser
leur diversit\'e et d'en discuter les caract\'eristiques. Nous nous
concentrons ensuite sur une caract\'eristique de SARS-CoV-2 qui est au
c{\oe}ur des principales hypoth\`eses d'origine li\'ee \`a des travaux de
recherche, \`a savoir l'insertion qui a conduit \`a l'introduction d'un
site de clivage polybasique dans la prot\'eine Spike. Nous montrons
comment l'\'evolution de SARS-CoV-2 chez les humains a fourni des
exemples d\'emontrant que de telles insertions se produisent
naturellement.
\end{altabstract}

\maketitle

%{\vspace*{4pt}}

\twocolumngrid

\end{noXML}

\defcitealias{MilletWhittaker2015}{ibid.}
\defcitealias{Pradhan2020}{ibid.}
\defcitealias{Meselson1994}{ibid.}
\defcitealias{Aksamentov2021}{ibid.}

\section{Introduction}\label{sec1}

{\vspace*{4pt}}

\sct{}, the virus causing \Covid, was first detected in Wuhan, China,
in late December 2019 \citep{promed2019,Zhu2020,Yang2024}. The new
disease was identified via clusters of patients seeking treatment in
various hospitals in Wuhan \citep{Worobey2021}, many of whom were
vendors from the Huanan Seafood Wholesale Market (hereafter ``Huanan
market''). The market was known to sell live animals \citep{Tan2020};
wildlife trade was therefore considered as a likely source of the
outbreak \citep{Wu2020CCDC}. In an effort to control the outbreak, the
market was closed in the early hours of January 1{st}, 2020
\citep{Yang2024}. However, \sct{} was already spreading from human to
human outside of the market by the time the market was closed
\citep{Huang2020, Li2020, WHO2021}. Whether it was the only source of
the outbreak or not, closing the market on January 1{st}, 2020, was
therefore insufficient to control the outbreak, which grew out of
control, spread across the world, and caused a pandemic.

The \Covid{} pandemic was a major event in the history of the 21{st}
century, and it is therefore important to understand what originally
caused it. Available data, to date, point to a zoonotic origin linked
to the wildlife trade at the Huanan market \citep{ACC2024, Holmes2024},
a conclusion supported by multiple lines of evidence.  \sct{} is a
generalist virus, readily able to infect various mammal species
\citep{Nerpel2022, EFSA2023}, and even to be transmitted among several
of them, including raccoon dogs \citep[][shown
experimentally]{Freuling2020}.  Early human cases, with disease onset
in December 2019, were retrospectively identified, and the locations of
their residences mapped \citep{WHO2021}. This revealed a striking
pattern: the cases were centered around the Huanan market, whether they
were epidemiologically linked to it (e.g., vendors or buyers), or not
\citep{Worobey2022, DW2024SC, DW2024W} (see
Figure~\ref{fig:previousresults}A--B). The early lineages of \sct{}
(lineage B and lineage A \citep{Liu2022}) were both present in the
Huanan market, suggesting that they emerged, or at least evolved, there
(see Figure~\ref{fig:previousresults}C). Finally, live animals were
sold in the Huanan market, including some species already involved in
the 2002--2004 SARS epidemic \citep{Xiao2021, ACC2024, Liu2024}. The
stalls selling these animals were located in the southwest corner of
the west wing of the market \citep{Wu2020CCDC, WHO2021}, which was a
hotspot of \sct{} positivity \citep{Worobey2022}. Genetic material from
both \sct{} and from key animal species such as raccoon dogs and civets
was detected in samples from the same stall \citep{ACC2024}. 

\begin{figure*}
{\vspace*{-2pt}}
\includegraphics{fig01}
{\vspace*{-2pt}}
\caption{\label{fig:previousresults}Epidemiological and genomic data
point to the Huanan market. (A)~Locations of the residences of Covid-19
cases with symptom onset in December 2019 (gray dots). The green star
is the mode of the distribution of cases (i.e., the location of the
peak of a kernel density estimate of case residential locations,
computed as in \citet{DW2024SC}). Shown are the locations of the Huanan
market (red square), and of the two campuses of the Wuhan Institute of
Virology: the historical campus in Wuchang district (light pink), and
the more recent campus in Jiangxia district (dark pink), where the
biosafety level 4 (BSL4) laboratory is located. Case data from
\citet{WHO2021}, extracted by \citet{Worobey2022}, updated with Hubei
cases outside of Wuhan; figure adapted from \citet{DW2024SC}.
(B)~Zooming in on map~A near the market, showing two additional
landmarks: the Hankou railway station (olive green triangle) and the
new location of Wuhan CDC (orange diamond). (C)~Phylogeny of early
SARS-CoV-2 sequences, showing the two main early lineages, A and B.
Figure adapted and updated from \citet{ACC2024} following the removal
of duplicates and new annotations identified by
\citet{HenselDebarre2025}. Sequences linked to the market are shown in
red; dark red is for geographic links (spatial proximity) to the
market.}
{\vspace*{-2pt}}
\end{figure*}

The emergence of \sct{} shows similarities with the emergence of
SARS-CoV, 17 years prior \citep{Holmes2021, Pekar2025Cell}. Infected
animals were detected in a market in Shenzhen, in Guangdong Province, in
May 2003, i.e., months after the emergence of SARS-CoV \citep{Guan2003}.
This led to a ban on wildlife trade, which was lifted in the summer of
2003 \citep{NormileDing2003, Li2020trade}, and sampling in the same market
in the fall of that year again found SARS-CoV-positive animals:
civets, raccoon dogs, ferret badgers, hog badgers, and badgers
\citep{He2004}. Animals were also directly identified as infection
sources in a later resurgence of the virus in Guangzhou, still in
Guangdong Province, in late 2003 \citep{Wang2005}. The exact details of
how SARS-CoV originally emerged in 2002 are, however, still unknown; the
specific animals that led to those early infections in 2002 were not
identified \citep{Xu2004}.  Yet, in spite of this uncertainty, and for
lack of a reasonable alternative explanation, there is a strong
consensus that \mbox{SARS-CoV} was of zoonotic origin, and that it was
transmitted to humans via intermediate host(s) in the wildlife trade
\citep{Cui2019}. New data and analyses have continued to improve our
understanding of how sarbecoviruses diversify and spread in bats
\citep{Pekar2025Cell}.

Shortly after the emergence of SARS-CoV, two new coronaviruses
infecting humans were identified: NL63, an alphacoronavirus first
detected in the Netherlands \citep{vanderHoek2004}, and HKU1, a
betacoronavirus first detected in Hong Kong, in a patient with
pneumonia who had recently returned from Shenzhen (Guangdong; China)
\citep{Woo2005}. Related viruses have been detected in bats (NL63) and
rodents (HKU1) \citep{Corman2018}, but the exact details of the
emergences of these two coronaviruses, now endemic in humans, are
unknown---including the identities of potential intermediate hosts
between their putative reservoirs and humans \citep{Holmes2024}. Yet,
the zoonotic origins of these coronaviruses are not called into
question.  

The zoonotic origin of \sct{}, on the other hand, is contested 
\citep[e.g.,][]{Bloom2021Investigate, vanHelden2021, Berche2023}.
Research-related origin scenarios \citep{vanHelden2021} are supported by
the presence in Wuhan of virology laboratories, with one in particular
studying SARS-like coronaviruses, and by apparently unique properties
of \sct{} not observed in the known related viruses. Unlike in the
examples of previous emergences given above, there was here a credible
non-zoonotic alternative origin for \sct. Apparently unique molecular
features of \sct{} have been scrutinized since the beginning of the
pandemic, and the possibility of a research-related origin was
therefore seriously considered early on \citep{Andersen2020}. 
\looseness=-1

\begin{table*}
\caption{\label{tab:labscenarios}Exhaustive list of options for
research-related origins of SARS-CoV-2 (tentative)}
\begin{tabular}{ll}
\thead
Category & Options \\
\endthead
\xmorerows{4}{A. Nature of the virus} & A.1 Fully natural \\
& A.2 Research product; Undirected evolution \\
& A.3 Research product; Directed evolution \\
& A.4 Research product; Genetic engineering \\
& A.5 Some combination of A.2, A.3, A.4 \vspace*{5pt}\\
\xmorerows{11}{B. Location of the origin of \sct{}} & B.1 Outside of a research site \\
& B.2 Fieldwork site \\
& B.3.a WIV, Wuchang campus, BSL2 lab \\
& B.3.b WIV, Wuchang campus, BSL3 lab \\
& B.4.a WIV, Jiangxia campus, BSL2 lab \\
& B.4.b WIV, Jiangxia campus, BSL3 lab \\
& B.4.c WIV, Jiangxia campus, BSL4 lab \\
& B.5.a Wuhan CDC, Dec 2019 location near market \\
& B.5.b Wuhan CDC, old location \\
& B.6 Other lab in Wuhan \\
& B.7 Other lab in China \\
& B.8 Lab outside of China\vspace*{5pt}\\
C. Location of first infections & (Same options as B.)\vspace*{5pt}\\
\xmorerows{2}{D. Type of first infections} & D.1 Accidental \\
& D.2 Deliberate but non-malicious \\
& D.3 Deliberate and malicious \vspace*{5pt}\\
E. Timing & Specified time (until December 2019) 
\botline
\end{tabular}
\tabnote{In each category, only one option can be chosen
to build a research-related origin scenario.}
\end{table*}


``Research-related origin'', often simplified as ``lab leak'' in the
public discourse, is an umbrella term encompassing a diversity of
scenarios, which will be detailed in the first part of this review. The
second part will focus on a specific element that is prominently
featured in discussions of a potential research-related origin, namely
the presence in \sct's spike of an insert that encodes a functional
furin cleavage site. We will see in particular how the \mbox{evolution} of
\sct{} over the last five years has informed the plausibility of a
natural origin of this \mbox{insertion}.

Finally, while the discussion of the potential origins of \sct{} is a
valuable academic exercise, it is important to keep in mind that some
research-related origin theories implicate specific, identifiable
researchers as responsible for the \Covid{} pandemic, and that such
serious and consequential \mbox{accusations} should be based on evidence, not
just \mbox{speculation}.  

\section{A typology of \sct{} research-related origin scenarios}

Multiple scenarios can be described as corresponding to a
research-related origin. Here, we attempt to classify them exhaustively
by considering five factors. Two factors correspond to intrinsic
characteristics of the virus: (A) the nature of the virus (\mbox{natural} or
synthetic to some degree), and (B) the location of its origin. Three other
factors correspond to features of the first human infections by \sct{}
that led to the \Covid{} pandemic: (C) the location, (D) type
(accidental or not), and (E) timing of these first human infections. 
The proposed classification is such that an origin scenario is
constructed by choosing a single option in each of the five categories.
We will see that some combinations of options are not possible. The
various options are recapitulated in Table~\ref{tab:labscenarios}.  

\subsection{(A) Nature of the virus}

\sct{} is either a natural virus that evolved via natural selection,
or a virus modified to some degree in a laboratory. Evolution in a lab
can be \mbox{accidental}, e.g., a side effect of isolation. For \mbox{instance},
\mbox{isolation} of a \sct-related pangolin coronavirus (GX\_P2V) was
accompanied by a 104-nucleotide deletion that led to attenuation in
cell culture and \textit{in vivo} \citep{Lu2023}. Isolation and culture
of WIV1 from bat SARS-like coronavirus Rs3367 \citep{Ge2013} resulted in
two amino-acid changes in its spike, one of which was shown to increase
the virus's ability to bind to the cell receptor ACE2 and was
interpreted as an adaptation to cell culture \citep{Tse2025}.  Evolution
can be directed, for instance in the context of serial passage. In
serial passage experiments, pathogens are transferred from one host to
another (usually of the same species; the host can range from an
experimental animal to a cell culture), which can lead to adaptation to
the host on which the pathogen is passaged \citep{Ebert1998}; mutations
arise spontaneously, but their selection is artificial. For instance,
\sct{} was adapted to mice in order to generate laboratory models, as
the original version of the virus did not interact well with the mouse
ACE2 receptor \citep{Zhou2020}. Serial passage on mice led in particular
to a substitution in the receptor binding domain of the spike, N501Y
\citep{Gu2020}, which later appeared in variants of concern like Alpha
and Omicron.  Finally, \sct{} can be the product of direct genetic
engineering: here, the mutations are planned and deliberately
introduced by researchers. For instance, another mouse-adapted version
of \sct{} was generated by reverse genetics after introducing two
substitutions predicted to be key for interaction with the mouse ACE2
receptor \citep{Dinnon2020}.  It is also possible to envision
combinations of different options, for instance genetic engineering of
a previously serially-passaged virus\break (see
Table~\ref{tab:labscenarios}).

The possible nature of the virus may be informed by the location where
it is assumed to have emerged. For instance, location is the main
characteristic to differentiate a natural virus from a virus generated
by undirected evolution in a lab, as the two may be barely
distinguishable at the sequence level. Also, a genetically engineered
virus is only possible in a laboratory with the capability to conduct
genetic engineering of viruses. In other words, following our
classification attempt (Table~\ref{tab:labscenarios}), the selection of
a given option for one factor may affect the range of possible options
for others.

Scenarios involving virus manipulation in the lab (directed evolution,
genetic engineering, and \mbox{combinations} thereof) require the knowledge
and possession by the researchers of a virus that serves as progenitor.
If \sct{} is assumed to be chimeric, more than one progenitor is
required. A virus of natural origin, on the other hand, also has a
direct progenitor, but whether or not we know the identity of this progenitor
is not a limitation, because evolution occurred without human
intervention. To date, no known virus could have served as progenitor
of \sct{} \citep{Andersen2020}. The closest known cousins, RaTG13 and
then BANAL viruses, are too distant from \sct{} across the whole genome
to be its progenitor \citep{Zhou2020,Temmam2022}. After delimiting
non-recombining regions in \sct's genome, the closest relatives vary
across regions: \sct's genome is a mosaic \citep{Boni2020,
Pekar2025Cell}. Lab manipulation scenarios therefore imply the
existence of viruses kept secret---for which there is, to date, no
evidence. If it were possible to definitively demonstrate the absence of
such progenitors in the Wuhan laboratories, the discussion would stop
here. Conversely, if the existence of such a progenitor in the
collections of a laboratory were discovered, a lab origin would
immediately become much more likely. We therefore still consider these
scenarios here, keeping in mind the fundamental limitation that they
require a virus that would have been kept secret.     

The possible nature of the virus may also be informed by its genomic
sequence. The assembly of a full viral genome from smaller fragments
may or may not leave traces \citep{Almazan2014}. These techniques may
use Type IIS restriction enzymes, which cleave outside of their
recognition sequences and leave overhangs. Depending on the orientation
of the restriction sites, those may be retained in the assembled
product, or removed \citep{Almazan2014, Yount2002, CaiHuang2023} (see
Figure~S1). Seamless techniques leave no trace
and are therefore not detectable, unless a marker like a silent point
mutation is deliberately introduced \citep{Hou2020RG}.  Traditional
cloning methods, on the other hand, leave traces in the form of
restriction sites. However, because they consist of short nucleotide
sequences, restriction sites may also be present by chance. \sct's
genome contains several restriction sites; they are not regularly
spaced and are found in related viruses, i.e., they are consistent with
a natural origin \citep{CCPekar2022}.
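
As a rough illustration of how often a short site is expected to occur
by chance, consider a back-of-the-envelope estimate under assumptions
that are deliberately simplistic and introduced here only for
illustration: a genome of length $L \approx 29{,}900$ nucleotides, a
recognition sequence of length $k = 6$ nucleotides, and independent,
equally frequent bases. The expected number of occurrences of one given
site on one strand is then
\[
E = \frac{L - k + 1}{4^{k}} \approx \frac{29{,}900}{4096} \approx 7.3,
\]
and about twice that when a non-palindromic site is counted in both
orientations. Real genomes have neither uniform nor independent base
composition, so this is only an order-of-magnitude sketch; it
nonetheless suggests that finding several such sites in a genome of
this size is not remarkable in itself.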

Beyond the technique to generate a potential genetically engineered
virus, genetic engineering could be identified by the presence of
unnatural-looking segments. Suspicions of unnaturalness have been at the
core of claims of genetic engineering since the early, quickly
rebutted and withdrawn, suggestion that \sct{} may contain fragments
from HIV \citep{Pradhan2020, Sallard2021}. In Section~\ref{sec:FCS}, we
will explore in detail the claim that \sct's furin cleavage site was
inserted in a laboratory. 

The claim that \sct{} may have been generated in a laboratory also
stems from the observation that \sct{} seemed to be efficiently
transmissible from human to human early on \citep{Zhan2020}, leading to
the suggestion that it may have been somehow pre-adapted in a
laboratory. \sct{} is however a generalist virus; it transmitted well
from human to human, but could also readily infect other mammals.
Notably, it caused outbreaks in mink farms early on
\citep{OudeMunnink2021}, without having been pre-adapted to minks in a
laboratory. Pre-adaptation may simply be a consequence of the fact that
mammals, including humans, share similar features. For instance, a
recently discovered MERS-like coronavirus infecting minks in China
\citep{Zhao2024} was shown to replicate in cells expressing receptors
from minks, but also humans and even camels \citep{Wang2025}.  In
addition, \sct{} was not perfectly adapted to humans: further adaptations
took place, in particular the D614G mutation in the spike. This
mutation  stabilized the spike, preventing premature shedding of the S1
domain \citep{Zhang2021D614G, ChoeFarzan2021}, thereby increasing \sct's
infectivity \citep{Korber2020}. Detected as early as January 2020 in
patients from China \citep{Boehmer2020, Lv2024}, the D614G mutation
spread across the world and became dominant. The same mutation later
convergently occurred in lineage-A viruses \citep{Murall2021}, before
the lineage went\break extinct. 

Finally, \sct{} is a pandemic virus, and pandemics are rare; \sct{} is
necessarily an extraordinary virus, as were previous pandemic viruses,
including those that emerged before the advent of modern virology: a
laboratory origin is not a necessity to explain \sct{}'s rapid spread
in the human population. Viruses possessing features that brought a
selective advantage in humans, such as the furin cleavage site, spread
better, and these features could therefore be naturally selected. 

\subsection{(B) Location of the emergence of \sct, and (C) location of
the first human infections}

We now consider the location where the virus would have been
generated, and where the first humans were infected. Under most
research-related origin scenarios, the two locations are the same.
Discrepancies may however exist in the case of a release linked to a
vaccine challenge, or in the case of a virus generated elsewhere and
then shipped to Wuhan.  

A research-related incident could have happened in nature, with a
natural virus, in the context of fieldwork. First infections outside of
a laboratory may also happen in the context of a vaccine challenge, as
will be detailed below. 

Various laboratories have been considered as potential locations of the
emergence of \sct. The most frequently mentioned one is the Wuhan
Institute of Virology (WIV), located across two campuses: a campus in
Wuchang district, and south of it, a campus in Jiangxia district, where
the BSL4 lab is located (Figure~\ref{fig:previousresults}A). The BSL4
lab was the first of its kind in China \citep{Yuan2019}. Coronaviruses
are not typically manipulated in BSL4 conditions, but rather BSL3 or
BSL2 depending on the type of coronavirus \citep{BMBL} and type of
experiment, so the presence of a BSL4 laboratory in Wuhan is
coincidental.  The use of a BSL4 laboratory depends on local
regulations, and may not be aligned with the wider public's perception
of danger; for instance, reconstruction of 1918 pandemic influenza
virus \citep{Tumpey2005}, or experiments resulting in airborne
transmission of H5N1 between mammals \citep{Herfst2012}, did not take
place at BSL4 but at BSL3.  When the Wuhan BSL4 laboratory was put into
operation, however, ``low pathogenic coronaviruses'' were used there as
model viruses by researchers for training \citep{Cohen2020Shi}.  Yet,
discussions of a potential research-related origin most often envision
experiments carried out at a biosafety level that some deem
insufficient for viruses with uncertain potential for human infection
and onward transmission, namely BSL2 \citep[e.g.,][]{Chan2024NYT}. Under
such a scenario, Wuhan is not an exceptional location: BSL2
laboratories are common.  

Another Wuhan laboratory considered among the possible locations of
emergence of \sct{} is the Wuhan Center for Disease Control (WCDC).
WCDC moved next to the Huanan market in late 2019 \citep{WHO2021}, and
was mentioned in one of the early public preprints naming
specific laboratories as potential origins \citep{XiaoXiao2020RG}.
Pre-\Covid{} research from WCDC featuring the researcher targeted by
scenarios involving this institution \citep{XiaoXiao2020RG,
Tufekci2021} was not on coronaviruses \citep{Guo2013, Lu2017, Shi2018}.
Even a December 2019 promotional video put forward in WCDC scenarios to
incriminate the researcher and his bat-sampling activities
\citep{Tufekci2021} featured the collection of ticks from bats
\citep{TianVideo2019}, indicating a focus on other types of pathogens.
WCDC had been involved in the collection of samples, including from
bats, but did not have the facilities to conduct actual experiments and
even less so for genetic engineering \citep{Holmes2024}. If \sct{} came
from WCDC, it is therefore a natural virus brought to the laboratory,
and not an engineered virus for instance. Using nomenclature from
Table~\ref{tab:labscenarios}, options A.4 and B.5 are therefore
\mbox{incompatible}.

Wuhan hosts other research laboratories that could potentially be other
locations of emergence. Research on coronaviruses was for instance also
carried out at Huazhong Agricultural University in Wuhan
\citep{Shen2018PEDV}. The labs had, to our knowledge, no history of
experimenting on SARS-like coronaviruses, and they do not have a
geographic association with early cases. Likewise, Chinese laboratories
outside of Wuhan will not be further considered here, although their
implication has sometimes been suggested, e.g.\ for researchers in
Beijing \citep{Kadlec2024}. 

There have also been suggestions that \sct{} could have been conceived
and even generated in a laboratory outside of China, and sent to Wuhan.
One of these scenarios involves a North Carolina research laboratory
that collaborated with WIV \citep{HarrisonSachs2022, Sachs2025}. Other
scenarios, as mirrors of accusations targeting Wuhan's BSL4 laboratory,
implicate Fort Detrick in the United States \citep{HuangBest2024} and
remove any link to a Wuhan laboratory. By getting rid of any
geographical link, such scenarios could \mbox{implicate} virtually any
virology laboratory in the world, and will therefore not be further
considered here.

Case data can inform on the geographic location of the first infections
that led to the \Covid{} pandemic. The outbreak was first identified
in December 2019 because of clusters of patients suffering from
pneumonia, linked to the Huanan market, seeking care in several Wuhan
hospitals \citep{Worobey2021,Yang2024}.  Available case data, compiled
retrospectively, show that the earliest human cases were centered
around the Huanan market, whether they were epidemiologically linked to
it or not \citep{Worobey2022, DW2024SC, DW2024W}. Similar patterns were
observed for infections of healthcare workers \citep{Wang2021HCW}.
Except for a scenario involving WCDC, which had moved close to the
Huanan market in late 2019, research-related scenarios fail to provide
explanations for this striking spatial pattern. The Huanan market is
indeed not just any location in Wuhan: it was one of only four
markets reported to sell live wildlife, and the one with the largest
number of wildlife stalls among them \citep{Xiao2021}. This is at least a
striking coincidence that deserves to be accounted for, whatever the
proposed scenario for the origin of \sct.  



\subsection{(D) Type of first infections}

Three types of first human infections in a research-related context can
be distinguished. First, the first human infections may have been
accidental. Second, they may have been deliberate, but not
necessarily in a nefarious context, e.g., a vaccine challenge. While
little discussed in the context of \sct{} (for lack of any evidence
supporting the scenario), this type of release is listed here as an option
because it is considered a plausible origin of 1977 H1N1 influenza
\citep{RozoGG2015}. Such a release cannot adequately be described as
``lab leak'', because the virus is deliberately taken out of a
laboratory. (Note that this category describes the type of first human
infections and not the context of the research; accidental infections
of researchers working on the design of a vaccine would be
characterized as accidental first infections.) Finally, the first human
infections may have been
deliberate with malicious intent \citep{Nielsen2022}, like the release
of a bioweapon. Like the second type, it would not be described as a
``leak''. This option is listed here in order to be exhaustive, but is
completely unsupported \citep{ODNI2021}.  

\subsection{(E) Timing of the first human infections}

The known cases who reported the earliest symptom onset started to feel
sick around December 10--11, 2019 \citep{Worobey2021}. These early cases
were however likely not the first infections. Attempts to date the
first infections, using either only case data \citep{Jijon2024}, or a
combination of case data and genomic sequences \citep{Pekar2022},
converge towards first infections from late October to early December,
with a median in the second half of November 2019.

Research-related origin scenarios consider a whole range of possible
dates of first infection, sometimes contradictory, depending on the
external events considered to be the major drivers or signs. These
external events include for instance \citep{Kadlec2024}: the (actually
only temporary at the time) shutdown of a database at WIV in early
September 2019; a security exercise at Wuhan airport, simulating
infections by a new coronavirus, mid-September 2019; the 2019 Military
World Games in Wuhan in the second half of October 2019; a yearly
training in WIV's BSL4 laboratory in the second half of November 2019;
alleged infections of WIV scientists in November 2019 \citep{Cohen2023},
etc. Some of these events are at odds with the estimated dates of first
human infections, and even lead to temporally impossible scenarios. For
instance, the pandemic cannot have started with infections of
scientists in November 2019 and, in the same scenario, have spread
throughout the world via the Military Games in October 2019.   


We tried here to provide an exhaustive list of the different elements
composing a scenario of \sct's origin, detailing research-related origins.
Importantly, the accumulation of possible research-origin scenarios,
sometimes put forward to arouse suspicion \citep{Tufekci2021}, does not
necessarily increase the likelihood of a research origin, especially
when the proposed scenarios are mutually contradictory. Arguments about
research done at BSL2 are irrelevant in a scenario involving the BSL4
lab, and reciprocally; descriptions of personal protective equipment by
researchers doing fieldwork are irrelevant in a scenario of a
lab-engineered virus. Making explicit the envisioned scenarios---which
is rarely done---helps see their potential logical flaws.

We now focus on a specific feature of \sct, its furin cleavage site,
and on the suggestion that it could have been the product of deliberate
genetic engineering.

\section{\mbox{Dissecting a particular scenario: the} \mbox{insertion} leading to the
furin cleavage site}\label{sec:FCS}

The spike of coronaviruses is cleaved into S1 and S2 subdomains to
mediate fusion with the host cell membrane \citep{MilletWhittaker2015}.
Spike cleavage can occur at different stages of the infection cycle,
depending on the virus and host cells \citepalias{MilletWhittaker2015}:
during the production of new viruses in the producer host cell; in the
extra-cellular space; at the surface of target cells; in lysosomes
after endocytosis in target cells \citep{Li2016}. The presence of a
polybasic cleavage site allows the spike protein to be cleaved by host
enzymes like furin in the producer cell, so that the spike is already
primed when a target cell is later encountered.

Multiple betacoronaviruses have a polybasic cleavage site at the S1/S2
junction, including human pathogens like MERS-CoV, HKU1, OC43
\citep[][Figure~\ref{fig:fcsalignment}]{WuZhao2021, Holmes2021}.
Polybasic cleavage sites have repeatedly evolved through the history of
coronaviruses. 

\begin{figure*}
{\vspace*{-3pt}}
\includegraphics{fig02}
{\vspace*{-3pt}}
\caption{\label{fig:fcsalignment}\sct's polybasic cleavage site is a
unique feature among known sarbecoviruses, but not among other
betacoronaviruses.  The black triangles locate the cleavage sites; here
and in the other figures, the triangles are not repeated within
alignments. Polybasic sites are highlighted in boldface. The color
palette is borrowed from Nextclade \citep{Aksamentov2021}; colors
depend on chemical properties. The right column shows the predicted
score according to ProP \citep{Duckert2004}; a score above 0.5
corresponds to a predicted polybasic cleavage site. Figure adapted from
\citet{Holmes2021}, updated with examples from
\citet[Figure~S7]{Han2023} and \citet{Zhu2023}. Accessions of the
sequences are provided in the Methods section.}
{\vspace*{-2pt}}
\end{figure*}


While a polybasic cleavage site is not an uncommon feature among the
\textit{Betacoronavirus} genus, \sct{} is the first known sarbecovirus
\citep{Coutard2020}, and to date the only one, that possesses one at the
S1/S2 junction. This feature has been shown to contribute to its
replicability in host cells \citep{Hoffmann2020, Johnson2021} and its
transmissibility among hosts \citep{Peacock2021}.  The known related
sarbecoviruses are devoid of a furin cleavage site. SARS-CoV, notably,
does not have one---which, incidentally, illustrates that a furin
cleavage site is not essential for respiratory infections, nor for
human-to-human transmission to take place. Compared to related
sarbecoviruses, \sct's furin cleavage site appears to have been
introduced via an out-of-frame insertion (see Figure~\ref{fig:fcs}).
Four amino acids (PRRA) inserted near the S1/S2 cleavage site
contribute to forming a RXXR-type furin cleavage site (where X is any
amino acid). The rarity of furin cleavage sites among \mbox{sarbecoviruses},
and the fact that it is formed by a 12-nucleotide insertion compared to
the most closely related viruses, have led to the suggestion that the
insertion could have been artificial. 


\begin{figure*}
\includegraphics{fig03}
{\vspace*{-3pt}}
\caption{\label{fig:fcs}\sct's furin cleavage site is caused by an
insertion compared to its closest known relatives. The insert is shown
in blue in the nucleotide sequence (two inserts are possible); it leads
to a furin cleavage site in \sct{} (shown in boldface in the amino acid
sequence), but also adds a leading proline (P). The black triangle
locates the cleavage site (shown only once for each alignment group).
Compared to the known viruses, the insert is out of frame. Because of
the ``C''s at each end, both ${-}1$ and ${-}2$ out-of-frame inserts
are possible.}
{\vspace*{-2pt}}
\end{figure*}

\subsection{The features and originalities of SARS-CoV-2's furin cleavage
site are consistent with \mbox{natural} evolution}

The repeated evolution of furin cleavage sites in other coronaviruses
indicates that this feature can evolve naturally. The furin cleavage
sites in other coronaviruses are diverse, yet none seems to exactly
match \sct's, at the amino acid level, and even less so at the
nucleotide level. The original insert in \sct{} introduced a leading
proline (P) that was not part of the polybasic site, but became part of
it as the position mutated during \sct's evolution (to a histidine [H]
in Alpha and in Omicron BA.1 and BA.2, and even more so to an arginine
[R] in Delta and BA.2.86; see Figure~S2). Although a similar proline (P)
happens to be present in MERS-CoV (see Figure~\ref{fig:fcsalignment}),
it was not described as part of the polybasic site
\citep{MilletWhittaker2014}, nor as critical to it, and there would
therefore have been no reason to deem it necessary in an artificial
insertion. Instead, this is the kind of superfluous-looking element that
occurs randomly.\looseness=-1

The progenitor virus just before \sct{} (whether it evolved naturally
or not) is unknown; we can therefore only describe \sct's furin
cleavage site in relation to the other known close relatives. It is
however possible that the progenitor sequence was different, which
would affect the estimated length and position of the insert. In
particular, it is possible that an insert was already present, and
\sct's furin cleavage site evolved by {mutation} of that previously
inserted, different sequence \citep{Morgan2025}. Absent any information
on the progenitor sequence, we assume that comparison to close
relatives is representative of what actually happened. Compared to
close relatives, then, the insert is out of frame (both ${-}1$ and ${-}2$
positions are possible; see Figure~\ref{fig:fcs}). There would be no
rationale for doing an out-of-frame insertion in the lab instead of a
regular in-frame insertion; this is the kind of detail that makes the
insert look natural rather than engineered. \looseness=-1
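As a hedged illustration of the out-of-frame observation above, the
following Python sketch enumerates every 12-nucleotide window whose
removal from a \sct{} fragment gives back the corresponding fragment of
a close relative. The two fragments are short excerpts transcribed by
hand for illustration (assumed to correspond to NC\_045512.2 and
MN996532.2 around the S1/S2 junction) and should be checked against the
accessions listed in the Methods section. The function name and the
codon-aligned starting position are likewise assumptions of this sketch.

```python
# Illustrative sketch: which 12-nt windows, once removed from the
# SARS-CoV-2 fragment, give back the relative's fragment?
# Both fragments are hand-transcribed assumptions (codon-aligned at index 0).
SARS2 = "AATTCTCCTCGGCGGGCACGT"   # N S P R R A R (assumed excerpt)
RELATIVE = "AATTCACGT"            # N S R (assumed excerpt, RaTG13-like)


def possible_inserts(with_insert, without_insert):
    """Return (offset, insert, offset % 3) for every window explaining the gap."""
    length = len(with_insert) - len(without_insert)
    hits = []
    for i in range(len(without_insert) + 1):
        if with_insert[:i] + with_insert[i + length:] == without_insert:
            hits.append((i, with_insert[i:i + length], i % 3))
    return hits


for offset, insert, phase in possible_inserts(SARS2, RELATIVE):
    # phase 0 would be an in-frame insert; 1 or 2 means it splits a codon
    print(offset, insert, "splits a codon" if phase else "in frame")
```

Under these assumed fragments, two windows qualify and neither starts at
a codon boundary, which is the sense in which the insert is described as
out of frame.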

The two arginines (RR) in the original insert were both encoded by CGG,
leading to a CGGCGG sequence of nucleotides. Codon usage bias in related
coronaviruses is such that R is rarely encoded by CGG; the occurrence
of a double CGG is therefore an oddity. Yet, this oddity is still
largely present among \sct{} sequences. Although mutations have been
detected at each position of the fragment, the original nucleotides are
still present in 99.9\% of all sequences (as of August 2024
\citep{CoVSpectrum}), with the most variation present at the third
position of the first codon of the pair (T present at 0.1\%). In
other words, regardless of how bizarre it looked, CGGCGG has not yet
been purged by selection in \sct. \looseness=-1

In humans, CGG is a frequent codon for R, which has led to the
suggestion that CGGCGG was a tell-tale sign of engineering
\citep{SegretoDeigin2021, Wade2023}. We now explore this suggestion in
detail.
\vspace*{-5pt}

\subsection{CGGCGG is not evidence of engineering}

The rationale behind the suggestion that CGGCGG may be the sign of
engineering is that it would reflect codon optimization by a genetic
engineer for expression in humans. CGG would have been chosen, twice,
because it is the most frequent codon \mbox{encoding} arginine (R) in humans.
All elements of the proposition are incorrect.

\begin{figure*}
\includegraphics{fig04}
\vspace*{1pt}
\caption{\label{fig:cgg}CGGCGG is not a codon-optimized encoding of RR.
The black triangle locating the cleavage site is shown only once.
(A)~Sequences obtained via various online codon-optimization tools for
the shown amino-acid sequence. The URLs of the various tools are given
in the Methods section. (B)~Sequence fragments from mRNA vaccines,
Moderna (top) and Pfizer-BioNTech (BNT; bottom).}
\end{figure*}

CGG is not the sole frequent codon for arginine in humans, nor is it
specific to humans. Whether CGG is the most frequent or one of the most
frequent depends on the databases considered (e.g.\ Genscript vs.\ 
``Kazusa'' \url{http://www.kazusa.or.jp/codon/}). Other codons, AGA,
AGG and CGA are also frequent (listed in decreasing order). There is
therefore no rationale for selecting CGG twice rather than combining it
with other frequent codons. In addition, codon usage bias is such that
CGG is also frequent in other mammals. It is for instance the most
frequent for arginine (R) in cows (\textit{Bos taurus} in the Kazusa
database). The high frequency of CGG is therefore not specific to
humans, and the presence of CGG would not be evidence of artificial
adaptation to humans specifically.\looseness=1

Codon optimization refers to the use of synonymous codons to increase
protein expression \citep{MauroChappell2014} while ensuring sequence
stability. There would be no reason, and it would be inefficient, to
only codon-optimize two codons (or~four) in \sct's genome. In addition,
codon optimization is not done by hand by selecting the most frequent
codons for translation in a chosen organism. Software tools exist to
carry out the task, and they try to avoid an excessively high CG
content. To demonstrate
that CGGCGG would not actually be the result of codon optimization for
humans, we submitted the QTQTNSPRRARSV amino acid sequence to various
free online codon-optimization tools, for \mbox{expression} in \textit{Homo
sapiens}, with default settings. None of them proposed CGGCGG for RR
(Figure~\ref{fig:cgg}A). In addition, there exist examples of sequences
that were actually codon-optimized for humans, notably with the
sequences of the mRNA vaccines. Again, neither of them used CGGCGG for
RR (Figure~\ref{fig:cgg}B). Last but not least, if codon optimization
had actually taken place to generate a virus, one may have expected it
to be codon-optimized as a human coronavirus, not as a human sequence. 
That the double arginine (RR) in \sct's polybasic site is encoded by
CGGCGG is therefore rather a counter-argument to the claim that it
might be engineered. 

\begin{figure*}
{\vspace*{-2pt}}
\includegraphics{fig05}
{\vspace*{-3pt}}
\caption{\label{fig:experiments}Pre-2020 experiments introducing a
polybasic cleavage site in coronaviruses  (A--G), and counterfactual
equivalent of \sct's insertion in SARS-CoV (H). (A--G)~The mutated
positions are colored in blue and the resulting polybasic cleavage
sites are highlighted in boldface. The polybasic segments were
introduced by mutating the sequences in place, occasionally adding or
removing one amino acid.  (A)~\citet{Follis2006}, on SARS-CoV. 
(B)~\citet{Belouzard2009} on SARS-CoV, at S1/S2 and at S2$'$ (the
S1/S2 alignment shown here includes a K that seemed to have been
accidentally missing in the original paper).  
(C)~\citet{Watanabe2008}, on SARS-CoV at S2$'$.  
(D)~\citet{Burkard2014}, on mouse hepatitis coronavirus (MHV), a
betacoronavirus,  at S2$'$.  (E)~\citet{Li2015}, on porcine
epidemic diarrhea virus (PEDV), an alphacoronavirus, at S2$'$.  
(F)~\citet{Yang2015}, on HKU4 (related to MERS-CoV), at S1/S2 to
resemble MERS-CoV's minimal furin cleavage site RSVR. 
(G)~\citet{Cheng2019}, on Infectious bronchitis virus (IBV), a
gammacoronavirus, at S2$'$.  (H)~Counterfactual \sct-like insertion at
S1/S2, not performed by the authors of \citet{Follis2006,
Belouzard2009}, but similar to what the PRRA insert does to \sct{}
(inserting several amino-acids instead of mutating in place; adding a
leading proline P).}
{\vspace*{-2pt}}
\end{figure*}

\subsection{Previous experiments did not {insert} a furin
cleavage site}

The suggestion that \sct's furin cleavage site could have been
engineered also uses the argument that such manipulations had been done
on coronaviruses before the \Covid{} pandemic 
\citep{Follis2006,Belouzard2009,Watanabe2008,
Burkard2014,Li2015,Yang2015,Cheng2019}
(albeit, especially in the case of experiments on SARS-CoV, with
pseudotypes and not live virus). The experiment was described as
``routine'' in arguments for an artificial origin of the insert
\citep{ChanRidley2021}, while the number of studies was in fact very
limited: fewer than ten were found on coronaviruses.  

Closer inspection of the details of these experiments reveals that the
introductions of furin cleavage sites were done very differently from
how \sct's is proposed to have been generated. As described above,
compared to its known close relatives, \sct's furin cleavage site
emerged by the insertion of a 12-nucleotide sequence, encoding four amino acids
(PRRA), including one (P) that was originally not part of the polybasic
site. Previous experiments, on the other hand, did not introduce
polybasic sites via such a large insertion, but instead did so by
modifying existing sequences in place by point mutations
(Figure~\ref{fig:experiments}A--G; occasionally removing or adding one
amino acid, see Figure~\ref{fig:experiments}A).  In addition, except
when the mutations were meant to match another close coronavirus
(changing HKU4's sequence to resemble MERS-CoV's
\citep[][Figure~\ref{fig:experiments}F]{Yang2015}), the introduced
cleavage sites were canonical, i.e., R-X-(R/K)-R (where X is any amino
acid), instead of the minimal R-X-X-R version found in \sct. 
Importantly, none of these experiments introduced amino acids outside
of the polybasic site, unlike \sct's leading proline (P). Even the
experiment on HKU4, designed to \mbox{resemble} MERS-CoV's polybasic site, did
not introduce a proline (P), while there is one in MERS-CoV before its
polybasic site (compare Figure~\ref{fig:fcsalignment} and
Figure~\ref{fig:experiments}F). Figure~\ref{fig:experiments}H
illustrates what an experiment on SARS-CoV matching \sct's insertion
would have looked like.  Moreover, experiments with SARS-CoV
\citep{Follis2006, Belouzard2009, Watanabe2008} (and HKU4
\citep{Yang2015}) were done with pseudotypes, not full viruses.  These
previous experiments are therefore actually arguments against an
artificial origin of the 12-nucleotide insert in \sct. 
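The distinction between canonical and minimal sites used above can be
sketched with a simple pattern match. The following Python snippet is
only an illustration: it encodes the two motifs exactly as written in
the text (R-X-(R/K)-R and R-X-X-R), applies them to peptides quoted in
this article, and is not a cleavage predictor such as ProP. The function
and variable names are hypothetical.

```python
import re

# Patterns as described in the text: canonical R-X-(R/K)-R, minimal R-X-X-R.
CANONICAL = re.compile(r"R.[RK]R")
MINIMAL = re.compile(r"R..R")


def classify_site(peptide):
    """Label a peptide 'canonical', 'minimal' or 'none' by pattern matching."""
    if CANONICAL.search(peptide):
        return "canonical"
    if MINIMAL.search(peptide):
        return "minimal"
    return "none"


# Peptides quoted in this article
examples = {
    "SARS-CoV-2 S1/S2": "QTQTNSPRRARSV",
    "introduced in SARS-CoV": "RRSRR",
    "MERS-CoV minimal site": "RSVR",
}
for name, peptide in examples.items():
    print(name, classify_site(peptide))
```

With these inputs, the site introduced in SARS-CoV is labeled canonical,
whereas the \sct{} and MERS-CoV sites only match the minimal pattern.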

\subsection{Proposed engineering scenarios}

\sct's furin cleavage site is non-canonical \citep{Thomas2002}. Besides
its leading proline, discussed above, the polybasic site itself is
minimal and does not match classical sites in other coronaviruses. It
also does not match the RRSRR site introduced in SARS-CoV in previous
experiments \citep[][Figure~\ref{fig:experiments}A, B]{Follis2006,
Belouzard2009}. It is hard to explain why PRRA would have been
inserted, and not some more commonly known polybasic site. Several
post-hoc explanations were proposed for the choice and source of a PRRA
insertion, that we now\break detail.  

\subsubsection{HIV-1}

In late January 2020, a preprint claimed to have identified four
insertions in \sct's genome \citep{Pradhan2020}; the fourth one
contained the insertion that introduced the furin cleavage site.
Searching for the potential sources in genomic databases, the authors
\citepalias{Pradhan2020} claimed that the inserts may be coming from HIV-1.
The preprint, which went viral on social media \citep{AltmetricHIV2025},
was quickly rebutted, and it was withdrawn a few days later. First, the
authors had mischaracterized the four inserts because of improper
sequence alignments; second, the matches with HIV-1 were not
statistically significant \citep{Sallard2021}. In other words, the
matches were simply the product of chance.

\subsubsection{Moderna patent sequence}

Searching for the 12-nucleotide insert in genomic databases also
revealed that a similar sequence was present, as reverse complement, in
a 2016 patent by Moderna \citep{Moderna2017} (a biotechnology \mbox{company}
that developed one of the mRNA \Covid{} \mbox{vaccines} in 2020). In addition
to the sequence's lack of {functionality} in its original context, the match was,
once again, shown to be a coincidence \citep{DubuyLachuer2022}. 

\subsubsection{ENaC-$\alpha$}

The amino-acid sequence of \sct{} at its furin cleavage site and a few
positions beyond had been shown to match the amino-acid sequence of a
human epithelial channel protein called ENaC-$\alpha$ \citep{Anand2020}.
The comparison was brought further by suggesting that ENaC-$\alpha$
could have been the inspiration for introducing RRAR, a non-canonical
furin cleavage site \citep{HarrisonSachs2022}. Besides the amino-acid
match, the suggestion was made because a lab group at the University of
North Carolina studied ENaC-$\alpha$ (albeit the mouse version mostly),
and the same university is home to another lab group that had
collaborated with WIV. Finally, some note that there are eight amino
acids in common between ENaC-$\alpha$ and \sct. 

There is no {a priori} rationale for choosing such a polybasic site
rather than another one, and the match is essentially {post hoc}. In
addition, in spite of a match at the amino-acid level, nucleotide
sequences of ENaC-$\alpha$ and \sct{} vastly differ  \citep[see
Figure~\ref{fig:HSL}A;][]{Garry2022a}. Five of the eight amino acids in
common are also in related coronaviruses, the remaining three being in
the insertion (see Figure~\ref{fig:HSL}). The match is therefore
consistent with natural evolution. Finally, the comparison fails to
explain the presence of a leading proline in \sct.

\begin{figure*}
\includegraphics{fig06}
\caption{\label{fig:HSL}Proposed sources for the insert have different nucleotide
sequences. (A)~Source proposed by \citet{HarrisonSachs2022}. (B)~Source
proposed by \citet{Lisewski2024}. (See Methods section for accessions.)}
\end{figure*}

\subsubsection{MERS-CoV MA~30}

Another suggestion was made that \sct's insert had been chosen to match
the amino-acid sequence of a lab-adapted MERS-CoV, MA~30, in which a
point mutation changed PRSVR into PRRVR \citep[][Figure~\ref{fig:HSL}B]{Lisewski2024}.
The match with \sct{} was still imperfect and
no proper explanation was provided to explain why a valine (V) would
have been changed into an alanine (A) in \sct. In addition, while the
experiment leading to MA~30 was published in 2017 \citep{Li2017}, the
sequence was only submitted and published on Genbank in June 2020
\citep[MT576585;][]{GutierrezAlvarez2021}. Using the 2017 experiment as
inspiration would have required intimate knowledge of the paper. In
that theory, in a classical example of guilt by association, such
knowledge was speculated to be due to the fact that the lead author
\citep{Li2017}, of Chinese origin, had done his PhD at WIV between 2005
and 2010 \citep{Morin2025}; it was however omitted that his thesis was
not on coronaviruses, and carried out in another group within WIV, a
large research institute---i.e., not in the group working on SARS-like
coronaviruses. In addition, the serine (S) to arginine (R) mutation in
the furin cleavage site was not the only change in MA~30 compared to
its MERS-CoV precursor. The original \citep{Li2017} and subsequent
\citep{GutierrezAlvarez2021} experiments did not disentangle the effect
of the various other mutations that had appeared in the process of
passaging MERS-CoV in mice, and therefore could not ascribe causality
for the observed phenotypical changes specifically to the serine (S) to
arginine (R) mutation. Finally, the nucleotide sequences in MA~30 and
\sct{} are different (see Figure~\ref{fig:HSL}B), and no explanation
was provided for this difference.

The examples listed above are post-hoc attempts to rationalize the
presence of a furin cleavage site in \sct's spike. While none provides
a compelling rationale accounting for the furin cleavage site's
peculiarities, some go further by making explicit allegations against
specific researchers, despite the absence of any supporting evidence.
Rather than being the sign of the intervention of an intelligent
designer, the peculiarities of \sct's furin cleavage site are
representative of natural ``evolutionary tinkering''. 

\begin{figure*}
\includegraphics{fig07}
\caption{\label{fig:insertions}Examples of insertions in \sct{} near
the cleavage site (represented by a black triangle). (A) Out of frame
insertion detected in six genomes collected mid-2021, from Costa Rica
(2), Canada (3) and the US (Florida; 1). A fragment similar to the
insert is present in \sct's genome (nsp5), in a different reading
frame; it is shown below the alignment. (B)~Insertion detected in 31
genomes from Austria (3), Sweden (1) and Germany (27), collected in
2023. (C)~Insertion detected in two genomes collected in Spain and
France in early 2025. This insertion was first spotted and shared by
Ryan Hisner. Accessions are provided in the Methods section.}
{\vspace*{-1pt}}
\end{figure*}

\subsection{\sct's evolution informs on the \mbox{potential} sources of the
insert}

Confusion about the potential origin of the insert creating a furin
cleavage site in \sct{} also stems from an incorrect description of how
insertions happen in coronaviruses, which are sometimes (incorrectly)
seen exclusively as similar to homologous recombination
\citep{Wade2023}. Insertions in coronaviruses can happen due to template
switching \citep{Garushyants2021}, with RNA picked up from various
sources, including  the virus's own RNA, RNA from another virus
infecting the same cell, but also even host RNA \citep{Yang2022a}.

Insertions have happened repeatedly during the evolution of \sct{} in
humans, including in its spike. Prominent examples include the EPE
insertion (spike position 214) in the Omicron BA.1 variant that swept
through the world from late 2021, and the MPLF insertion (spike
position 216) in the \mbox{lineage} \mbox{descending} from Omicron BA.2.86, that
started spreading mid-2023 and is still dominant at the time of
\mbox{writing}.

In several instances during \sct's evolution, observed insertions could
be traced to the host \citep{Peacock2021Virological, Yang2022a}. While
in many cases the source is putative (and the inserts too short to be
certain), inserts were also found in controlled contexts, like in cell
culture, with the insertion of sequences from the green monkey host
cells \citep{Yang2022a}. In avian influenza virus, the origin of some
furin cleavage sites (that turn low pathogenic strains into high
pathogenic ones) was traced to their host \citep{Gultyaev2021}.
Transcripts similar to the \sct{} insert can be found in mammal hosts
\citep{Romeu2023}. While the exact source of the insert is unknown and
cannot be determined with certainty because of its short length, a host
origin is therefore within the realm of likely possibilities. 

Finally, during \sct's evolution, there were several occurrences of
insertions near or in the furin cleavage site, mimicking the insertion
that may have led to it. Some of these insertions have a high GC
content, or are also out of frame. Examples of such insertions are shown in
Figure~\ref{fig:insertions}. Their existence demonstrates that such
insertions can occur naturally, and that there is therefore nothing
suspicious {per se} in the presence of an insertion at that
location. That so many different scenarios are proposed for the source
of the furin cleavage site sequence, none of them {actually} convincing,
illustrates that \sct's \mbox{furin} cleavage site is not a ``smoking gun''
\citep{LubinskiWhittaker2023}.

\section{Conclusion}

All data available to date point to a zoonotic, natural origin of \sct,
linked to wildlife trade at the Huanan market. Early \Covid{} cases in
Wuhan were found predominantly around the Huanan market, even for cases
that had no reported epidemiological link to the market; the early
diversity of \sct{} is represented inside of the market; the market was
one of only a few places in Wuhan selling wildlife; genetic traces of
wildlife and of \sct{} were found in the same stall inside of the
market; limited viral diversity indicated a recent outbreak, consistent
with the timings of cases that were found retrospectively. The absence
of detection of infected animals is primarily due to the absence of
samples from the key animal species sold in the market. While the whole
market was closed down in the early hours of \mbox{January~1, 2020}
\citep{Yang2024}, game stalls were reportedly already closed on December
31, 2019 \citep{RedStar2019}, and disinfection was already in progress
\citep{WSJ2020disinfection}. It is possible to invoke scenarios that
make sense of these data and yet implicate a laboratory, for instance
if infected animals were brought from a lab to the market to cover
traces, but such a scenario is less parsimonious, because it requires
the additional presence of infected animals in a lab. 

Scientific work may only be based on actual data, not speculation
\citep{DH2025}. Additional \sct{} sequence data from the early months of
the pandemic were made public in the last couple of years \citep{Lv2024,
HenselDebarre2025}; they did not challenge previous conclusions
\citep{Pekar2025VE}, and even reinforced links to the Huanan market
\citep{HenselDebarre2025}.

There is a notable precedent for a change in conclusions for the origin
of an outbreak that followed a regime change, namely in the case of the
source of the 1979 anthrax outbreak in Sverdlovsk in the USSR
\citep{Meselson1994}. Local authorities proposed and promoted the idea
that the outbreak was caused by a natural source of tainted meat. The
end of the USSR, over a decade later, allowed for a better flow of
information. It was shown that the outbreak had instead its source in a
military microbiology facility \citepalias{Meselson1994}. There is however a
key difference between the 1979 anthrax outbreak and \Covid{:} the
conclusions presented in this article differ from the official ones in
China. While the Huanan market was initially considered as the source
\citep{Tan2020}, its role is now contested in publications from China
\citep{Liu2023, StateCouncil2025}---as is an origin linked to research
activities.    

Elucidating the origin of \sct{} is important from a historical point
of view. Whatever the origin, however, the next pandemic will not
necessarily follow the same pattern. It is therefore crucial to give
ourselves the best chances of mitigating the risk of evolution of
pandemic pathogens, both by monitoring and controlling lab experiments
of \mbox{potential} \mbox{pandemic} pathogens (which are not limited to viruses), and
by reducing our interactions with potentially infected animals, in
particular in dense urban centers \citep{Jones2008}. 


\section{Methods}

The various alignments presented in the \mbox{figures} were done with the msa
package in R \citep{msa}. The sources of the sequences\break (on Genbank
unless indicated otherwise) are:\unskip\break  NC\_045512.2 (\sct);  MZ937000
(BANAL-20-52);  MN996532.2 (RaTG13);  GISAID::EPI\_ISL\_412977
(RmYN02);  NC\_004718.3 (SARS-CoV);  OK017908 (CD35);  NC\_025217.1
(Zhejiang2013);  NC\_009020.1 (HKU5);  NC\_019843.3 (MERS-CoV); 
KF963241.1 (OC43);  NC\_026011.1 (HKU24);  KY370046 (JL2014); 
NC\_003045.1 (Bovine CoV);  NC\_048217.1 (MHV);  NC\_006577.2 (HKU1); 
MT576585 (MA\_30);  NM\_001159575.2 (ENaC-$\alpha$).  The BNT and
Moderna\break sequences were obtained from \citet{Jeong2021}. Data sources for
Figure~\ref{fig:insertions}:
\url{https://doi.org/10.55876/gis8.250226ym} (A),
\url{https://doi.org/10.55876/gis8.250226eb} (B),
\url{https://doi.org/10.55876/gis8.250227vy} (C).\looseness=-1

The online codon-optimization tools used in Figure~\ref{fig:cgg}A are
available at the following URLs: 
\begin{itemize}
\item Vector builder:
\url{https://en.vectorbuilder.com/tool/codon-optimization.html}
\item Genscript:
\url{https://www.genscript.com/gensmart-free-gene-codon-optimization.html}
\item IDT: \url{https://www.idtdna.com/pages/tools/codon-optimization-tool} 
\item Twist biosciences: \url{https://www.twistbioscience.com/resources/digital-tools/codon-optimization-tool}
\end{itemize}

The proportions of the different nucleotides present at positions
23606--23611 (CGGCGG in the reference genome), in all available
sequences of \sct, were estimated using CoV-Spectrum \citep{CoVSpectrum}.
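For readers who wish to reproduce this kind of tally on their own set of
sequences, a minimal Python sketch is given below. It is not the
CoV-Spectrum workflow used for this article: it simply counts, for a
list of equal-length fragments already extracted at positions
23606--23611, the proportion of each nucleotide at each position. The
function name and the toy input are hypothetical.

```python
from collections import Counter


def position_proportions(fragments):
    """Proportion of each nucleotide at each position of equal-length fragments."""
    if not fragments:
        return []
    length = len(fragments[0])
    if any(len(f) != length for f in fragments):
        raise ValueError("all fragments must have the same length")
    total = len(fragments)
    return [
        {nt: count / total for nt, count in Counter(column).items()}
        for column in zip(*fragments)
    ]


# Hypothetical toy input, NOT real surveillance data
toy = ["CGGCGG", "CGGCGG", "CGGCGG", "CGTCGG"]
for pos, props in enumerate(position_proportions(toy), start=23606):
    print(pos, props)
```

The toy input only serves to show the shape of the output; real
proportions depend on the sequence set and on the date of the query.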

The insertions presented in Figure~\ref{fig:insertions} were identified
using GISAID's search function \citep{GISAID}, and checked with
Nextclade \citep{Aksamentov2021}.  Amino acid colors follow the palette
proposed by Nextclade at
\url{https://github.com/nextstrain/nextclade/blob/master/packages/nextclade-web/src/helpers/getAminoacidColor.ts}, 
designed after amino acids' chemical properties (e.g., basic in blue,
acidic in red, hydrophobic or aliphatic in yellow, etc.).

\section*{Acknowledgments}
We thank Alex Crits-Christoph for discussions. We are grateful to the
variant tracker community, professional and citizen scientists, for
sharing their findings on Twitter and then Bluesky. We are also
grateful to all the researchers providing tools to the community to
follow the evolution of \sct, including but not limited to Nextstrain
\citepalias{Aksamentov2021}, Outbreak.info \citep{OutbreakInfo}, CoV-Spectrum
\citep{CoVSpectrum}, Gensplore \citep{Gensplore2025}, UShER
\mbox{\citep{Turakhia2021},} CoVariants \citep{CoVariants}. No specific
funding was received for this work.

\section*{Declaration of interests}
The authors do not work for, advise, own shares in, or receive funds
from any organization that could benefit from this article, and have
declared no affiliations other than their research organizations.

\section*{Supplementary materials}
Figures S1 and S2 are available on the journal's website
under \printDOI\ or from the author.  

\CDRsupplementaryTwotypes{supplementary-material}{\cdrattach{crbiol-183-suppl.pdf}}

\back{}

\printbibliography
\refinput{crbiol20250199-reference.tex}

\end{document}

\end{thebibliography}
