Origin, evolution and global spread of SARS-CoV-2

Abstract. SARS-CoV-2 is the virus responsible for the global COVID-19 pandemic. We review what is known about the origin of this virus, detected in China at the end of December 2019. The genome of this virus mainly evolves under the effect of point mutations. These are generally neutral and have no impact on virulence and severity, but some appear to influence infectivity, notably the D614G mutation of the Spike protein. To date (30/09/2020) no recombination of the virus has been documented in the human host, and very few insertions and deletions. The worldwide spread of the virus was the subject of controversies that we summarize, before proposing a new approach free from the limitations of previous methods. The results show a complex scenario with, for example, numerous introductions to the USA and returns of the virus from the USA to certain countries including France.


Introduction
SARS-CoV-2 is an RNA virus. Its genome is approximately 30,000 bases long, which places it among the longest known RNA virus genomes. In comparison, the influenza genome is 10,000 to 15,000 bases long, and HIV (a retrovirus) is about 10,000 bases long. The first sequences of the SARS-CoV-2 genome were available in late December 2019, all from Wuhan, China, with the first sequence available on 23/12, the second on 26/12, and 16 more on 30/12 [1,2]. By mid-January 2020, sequencing began outside of China, and by the end of January there were about 250 sequences available from many countries (Thailand, Nepal, Japan, Canada, USA, France, Germany, Italy, Australia...). Sequencing was initially rather slow: at the end of March 2020 only a few hundred sequences were available. It accelerated considerably in April, with lockdowns across much of the globe and tens, then hundreds, of thousands of cases reported in Europe and the United States. On certain days in April-May several thousand new sequences were deposited on the GISAID (Global Initiative on Sharing Avian Influenza Data, www.gisaid.org) website, which collects and makes public sequences from around the world. Since then, the number of genomes deposited has varied but remains high, with more than 500 genomes per day from all over the world. Today (30/09/2020) there are more than 100,000 sequences on the GISAID website from 120 countries. Some countries have not sequenced much (or did not make their sequences public). For example, until recently there were fewer than 300 Italian sequences on GISAID despite the impact of the pandemic. There are 700 today. France has not provided many sequences either, especially in comparison with the UK (about 600 versus more than 42,000 on GISAID today). In particular, only very few (11) sequences are available from Eastern France, even though it is a major source of SARS-CoV-2 spread [3].
In this context, tracing the spread of the epidemic in France as well as on a more global scale is an arduous task, even impossible in some regions.
All results on the origin, evolution and spread of SARS-CoV-2 come from computer analysis of these sequences, coupled with associated metadata such as the date and place of sequencing, the sequencing technique, etc. [4]. In some cases, it has been possible to determine the precise origin of a virus found at a given location by contact tracing. For example, this is the case for the first two Thai genomes: It was shown by following the routes of patients that they came from China. It was also possible to trace the Chinese origin of the infection of "patient one" in Codogno, Italy, in February 2020, even though we now know from the analysis of wastewater that the virus was already circulating in Lombardy in December 2019. But this information is very partial. Most of our knowledge comes from sequence analysis [4], which is based on models (for example to describe mutations, their rhythm and regularity over time) as well as on algorithms [5], which are now confronted with extraordinary masses and flows of data. It should be kept in mind that the conclusions drawn from these analyses depend on the models and approximations made by these algorithms, which follow heuristic approaches due to the mass of data and the complexity of the problems [6]. Since the evolution of the virus started less than a year ago, the strains still show few differences, which limits some analyses, for example the search for traces of adaptation [7,8]. Finally, some analyses are complicated by the poor sequencing of certain regions of the globe (see above) and the quality of certain sequences. Despite these limitations, we now have very clear answers to a number of questions, for example on the natural origin of the virus and the fact that it is not from a laboratory [9]. These advances and their limitations are reviewed here, with a special focus at the end of the article on the global spread of the pandemic using a new phylogeographic approach to correct sampling biases [10].

Origin and phylogeny
As soon as the first sequences were obtained, phylogenies were constructed to trace the origin of SARS-CoV-2 (which was not yet called that). It is a Betacoronavirus, a member of the Sarbecoviruses, a viral subgenus including the virus responsible for the SARS epidemic in 2003, named SARS-CoV-1 (for severe acute respiratory syndrome-related coronavirus). SARS-CoV-2 is also referred to as hCoV19 by some people, notably GISAID members, who rightly consider that this is probably not the second human epidemic of this type, and that others have preceded it. Sarbecoviruses infect not only humans, but also many mammals, including civets, bats and pangolins. The phylogeny of this virus and its variants (Figure 1) shows that:
• The closest known viruses to SARS-CoV-2 come from two Rhinolophidae or "horseshoe" bats found in Yunnan in 2013 (RaTG13, [1]) and 2019 (RmYN02, [11]). Genome identity is about 96% for one (RaTG13) and 93% for the other (RmYN02), but this rate of identity varies along the sequences. In particular, it is quite low (60-70%) in the receptor-binding domain (RBD, ∼60 amino acids, Spike protein), the region that binds the human protein ACE2 and allows entry into the host cell [9].
• More distant overall (90% identity) is a pangolin virus, whose RBD sequence is very close to that of SARS-CoV-2, with only one amino acid mutation, compared to about a dozen for the bat viruses [9].
Analyses show that recombinations are numerous among Sarbecoviruses [12]. These recombinations are very likely the origin of SARS-CoV-2, but to date this cannot be affirmed because no genomes or significant portions of genomes that are very close to the human form have been found in natural reservoirs. In this respect, SARS-CoV-2 is clearly different from SARS-CoV-1, which is very close (99.6%) to the civet virus [13]. Initially, it was suggested that due to the similarities observed (see above), SARS-CoV-2 would be a recombinant in the RBD region of bat and pangolin viruses. But since then, observation of the evolution of this region in human strains has shown that this region mutates rapidly, with 36 amino acid mutations present among the human strains available today (30/09/2020). The alternative hypothesis of adaptation of the bat virus in this region, rather than recombination with the pangolin virus, is therefore quite credible, especially since it has been shown that SARS-CoV-2 passes easily to mice, with only a few mutations presumably adaptive in the RBD region [14].
It has been widely reported that SARS-CoV-2 is derived from bats, and it is often believed that the transition to humans is recent. In reality there are ∼4% differences between the two genomes, i.e. about 1200 mutations. Since December 2019 we have been able to observe the evolution of the virus in humans. Based on the number of mutations observed today compared to the very first sequences, and relating this number to the time elapsed, we find that the genomes evolve at the rate of one or two mutations per month, which is slow compared to influenza or HIV. The 1200 mutations observed thus correspond to 50 to 100 years of evolution and a date between 1970 and 1995 for the common ancestor of SARS-CoV-2 and RaTG13 (100 years of evolution gives the date 1970, because the years in the two branches from the common human/bat ancestor are added together). Further Bayesian analyses indicate similar or even earlier dates, with an uncertainty range covering the whole 20th century [12]. In any case, these results indicate and confirm that between this common ancestor and SARS-CoV-2 there have been many intermediates, in bats or other mammals such as pangolins, and that these intermediates remain to be discovered. Since coronaviruses have crossed the species barrier three times (SARS-CoV-1, MERS-CoV and SARS-CoV-2) in a striking manner over the last 20 years, it is likely that they will do so again, hence the importance of searching for these animal reservoirs.
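The dating argument above is simple arithmetic, and a minimal sketch may make the halving step explicit. It assumes a strict molecular clock with the same rate on both branches and a sampling year of 2020; the function name is illustrative and not taken from the original analysis.

```python
N_MUTATIONS = 1200      # ~4% of a ~30,000-base genome (figure given in the text)
SAMPLING_YEAR = 2020    # assumption: both genomes treated as sampled "today"

def ancestor_year(mutations_per_month: float) -> float:
    """Approximate year of the SARS-CoV-2 / RaTG13 common ancestor."""
    # Total evolutionary time separating the two genomes, in years
    total_years = N_MUTATIONS / (mutations_per_month * 12)
    # Mutations accumulate along both branches from the common ancestor,
    # so each branch spans only half of the total time
    return SAMPLING_YEAR - total_years / 2

for rate in (1, 2):
    print(f"{rate} mutation(s)/month -> ancestor around {ancestor_year(rate):.0f}")
```

With one mutation per month the total is 100 years, i.e. 50 years per branch; with two mutations per month it is 50 years, i.e. 25 per branch, which gives the two bounds quoted in the text.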

A natural origin
Claims that SARS-CoV-2 is a laboratory product were found early on in the mainstream press. All the results and figures given above demonstrate the contrary. The 1200 mutations separating SARS-CoV-2 from the closest bat strain (RaTG13) are randomly distributed along the genome, whereas bioengineering would have produced an assembly of known fragments (without mutations compared to some known strains), with point mutations in strategic regions, e.g. RBD [9]. Based on local similarities between the SARS-CoV-2 genome and the HIV genome, it has also been claimed that this was an HIV vaccine attempt. But these local similarities do not explain the origin of the rest of the genome, nor are they significant. They relate to a short segment of 38 nucleotides where the two viruses have 87 percent identity. But when comparing the SARS-CoV-2 genome to other genomes, many similarities of this order can be found, for example 89% on a segment of length 44 with a plant genome. We simply see here the effect of evolutionary "bricolage", which uses and reuses the same solutions to build living things. This is by no means a mark of statistical significance, which is very difficult to estimate.

Evolution in the human host
The question of the origin, date and "patient zero" of the human pandemic was raised early on. The sequences alone do not provide complete answers to these questions. In phylogeny the external group method is used to root the group of interest; for example, to root the tree of mammals we use the genome of reptiles or birds that are the closest species. But this method does not work here, the number of mutations observed between the human and bat viruses (1200, cf. above) being out of all proportion to the number of mutations observed among human sequences (a few dozen). In other words, all the human sequences are more or less the same distance from the bat sequence, and it is not possible with this method to designate with certainty the root of the pandemic.
We have better guarantees by relying not only on sequences but also on dates and history. Indeed, on December 30, 2019, the exact same sequence was found in several Chinese patients. It was quickly found in Thailand, Japan and the USA and, for example, it was still present at the end of March in the United Kingdom. It is therefore a good candidate to be the "sequence zero". It is used as such by many teams and numerous software and websites, including GISAID, NEXTSTRAIN (https://nextstrain.org/), etc. As this sequence (WIV04/2019) has been found in different parts of the world, it is impossible to say from the sequence alone what its geographical origin is. But the history of the pandemic clearly shows a Chinese origin [1,2]. Various dating methods indicate a beginning of the spread of the epidemic between mid-October and early December 2019 [7]. It is difficult at the present time to be more certain. As with other viruses, especially HIV, the acquisition of new sequences, possibly from older samples, should allow us to refine these estimates and possibly find an earlier origin of the human pandemic.
The first sequences showed little or no difference. Today, after about 10 months of evolution, the most recent sequences are separated from the "sequence zero" by at most 30 nucleotide mutations and 15 amino acid mutations. These figures are partly uncertain, as it is sometimes difficult to distinguish mutation from sequencing or assembly errors. Some deletions, sometimes long and found in several patients, are observed, notably a deletion of 382b in ORF8 and its regulatory region, sampled about 50 times between Singapore and Taiwan [15]. This deletion has been observed in a similar form in SARS-CoV-1, where it is associated with attenuation of the virus, but to date this attenuation has not been observed in SARS-CoV-2, nor any adaptive effect. In contrast to deletions, no prominent and widely shared insertion between viruses from different patients has been found [16].
Mutations that could result from adaptation to the human host, or that could be associated with greater virulence or severity, were very quickly sought. The rarity of mutations and the short evolutionary time of the virus in humans make these tests difficult. Mutations observed in viruses over short periods of time are generally considered to have a neutral or low impact on the phenotype and are essentially the result of complex random processes related to errors in replication and subsequent spread in the human population. According to this commonly accepted hypothesis, there are no strains that are more virulent or more severe than others. At present, there is very little data to contradict this hypothesis. However, the D614G mutation of the Spike protein (transformation of a D residue into a G residue at position 614) seems to correspond to increased transmissibility, based on the increasing frequency of this mutation in the global data [17]. While the pandemic started with the D614 version and was transmitted in this version to many countries, these countries are now almost all predominantly affected by the G614 version (e.g. in France, G: ∼100%), with the notable exceptions of China (D: 60%, G: 40%), Iceland where D is returning after an almost exclusively G phase, or for example Santa Clara in California, which has been essentially D since the beginning of the pandemic (whereas California is essentially G). We can see from this last example that these results must be taken with care, as they are largely the result of founder effects whose impact can last for a long time. In the case of D614G, however, there are additional clues from in vitro experiments that indicate higher titers and infectivity of variant G; but no difference in severity is seen between the two variants, which have the same resistance to neutralization by the serum of convalescent patients [17,18].
About 100 other Spike mutations have been studied in vitro [18], some of which have an impact on infectivity and antigenic potency, but do not show a substantial increase in prevalence in the global population as seen for the G614 variant.
While coronaviruses recombine abundantly [12], no reliable markers of recombination among human strains have been found so far [4]. These could occur if co-infection by significantly different strains occurs, but the probability of co-infection is low, and the strains currently circulating are too similar for this phenomenon to be detected if it occurs at all. From this point of view, SARS-CoV-2 appears to be clearly distinguishable from influenza, which evolves by reassortment of different subtypes, which can lead to radical changes with major pandemic risks.
Finally, before concluding this section on the evolution of SARS-CoV-2, it is important to point out a tendency in its genome to replace cytosine bases (C) by uracils (U) [8]. This tendency is explained by cytosine metabolism and replication errors [19]. It is relatively weak but significant, with ∼0.1% increase in U on average at the variable sites between December 2019 and April 2020. Detailed analyses [8] show that mutational mechanisms globally produce an excess of U, but that these mutations tend to be counterselected at the level of synonymous sites and certain di-nucleotides. These evolutionary mechanisms constitute a drift rather than an adaptation to the human host, with a low impact on the proteome. However, they may be key in the design of vaccines based on attenuated forms of the virus [8].

Clades and subtypes
The existence of subtypes of SARS-CoV-2 was quickly questioned, by analogy with the subtypes of HIV or dengue virus, for example. The concept of subtype makes sense if it corresponds to clearly distinct sequences with epidemiological characteristics of interest that separate them from other subtypes. For example, HIV subtypes B and C are clearly separated in phylogeny, with strong statistical support [20], and correspond to distinct epidemics affecting mainly Africa for C and Europe and North America for B. The separation between these two subtypes is estimated to date back about 100 years [21] and they appear to differ in terms of resistance to treatment or time to AIDS (in the absence of treatment). The same type of clear separation is found for all four dengue subtypes [22].
For SARS-CoV-2 nothing like this is expected today, since the virus appeared in humans at the end of 2019. Several groups have proposed classifications, with the aim of facilitating the monitoring of the epidemic and building a nomenclature that should prove useful in the long run, for example if some strains no longer circulate or others become predominant. The most convincing distinction is associated with the D614G mutation discussed above. Sequences containing the G614 version, together with two mutations at the RNA level, constitute the G clade of GISAID [23], named B1 by the PANGOLIN system [24]. This clade has a clear phylogenetic difference with the other sequences, even if the bootstrap supports are not very high, and is of great potential epidemiological interest, since it seems to correspond to increased transmissibility and the G clade is becoming predominant in most countries and continents [17]. From the G clade are derived the GH and GR sub-clades of GISAID (B1.1 and B1.2 for PANGOLIN), also carrying the G614 mutation of the Spike, whose epidemiological interest, essentially phylogeographic, is less obvious. The sequences outside the G clade constitute the S clade (GISAID, A for PANGOLIN), which contains the "sequence zero" and the first sequences observed in December/January, as well as the L and V clades (B and B2 for PANGOLIN, respectively). The prevalence of these three clades (S, L, V) decreases in favor of G, although one must be wary of sampling bias depending on the country (see above). About 5% of the sequences are unclassified by GISAID.
Forster et al. [25], based on similar classifications (but with only 160 sequences), suggested that some clades may be better adapted to certain populations and that conversely some populations may be resistant to certain variants of the virus. This "news" was picked up by the mainstream press and then tweeted by Donald Trump. The scientific world protested against this study and its hasty interpretation, with a series of responses published in PNAS [10,26,27]. Beyond the methodological problems (see below), the study by Forster et al. ignored the impact of founder effects. It is as if, in the Santa Clara example above, one inferred that the inhabitants of this city had different genetic and phenotypic characteristics from the rest of the Californian population. It is therefore important to be cautious about interpreting these classifications and the traits that seem to be associated with them.

Virus spread and phylogeography
Phylogeographic methods are based on the phylogeny of sequences and on the geographical characters attached to them, to infer the geographical origin of the ancestral nodes of the phylogeny, from the leaves to the root of the tree. We thus obtain scenarios that explain the origin of the pandemic and the successive countries it has contaminated. Early approaches to phylogeography were based on parsimony. Today, probabilistic models of migration are used, within likelihood or Bayesian frameworks [22]. Beyond its questionable interpretations, the study by Forster et al. posed two problems in terms of phylogeography: the method for rooting the tree was unfounded, and the sampling biases, which are considerable depending on the country (see above), were not taken into account [10]. To root the tree the authors used the external group method, which consists in finding the point of the human virus tree closest to the bat virus (RaTG13). We have already explained above why this method cannot work here. As for sampling biases, these have a considerable impact on the reconstructions. For example, with 42,000 sequences from the UK versus 700 from Italy, as currently available on GISAID, one will tend to see a UK origin for most of the ancestral nodes, and conclude that the origin of the Italian epidemic comes from the UK.
Below, we describe a new approach to correct these two limitations and to offset as much as possible the relatively weak phylogenetic signal in the data, due to the very short evolution time since the origin of the pandemic. Our study focuses on the first epidemic wave. All data, programs and options, as well as the overall pipeline are available at https://github.com/evolbioinfo/phylocovid/tree/CRAS. We have used 11,316 genomes, corresponding to the totality of sequences available on GISAID as of April 25, and covering the first wave of the epidemic in most regions of the world. A total of 70 countries are represented, as well as the two cruise ships Diamond Princess and Grand Princess. To estimate sampling biases in these data, we use the number of cases reported in each country as of April 25, 2020 (www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide). The biases are considerable, with for example in Italy 0.4% of all sequences versus 7.4% of reported cases worldwide, while in the UK the same figures are 28.8% and 5.5%. To validate our reconstructions, we use patient travel and contact follow-up data, available at www.gisaid.org for 294 sequences.
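To give a feel for the size of these biases, the percentages quoted above can be turned into a representation ratio and into a per-country quota for a subsample. This is only an illustrative sketch: the function names and the proportional quota rule are assumptions, not the pipeline in the repository, and it ignores the special treatment of early sequences.

```python
TOTAL_SEQUENCES = 11_316   # genomes on GISAID as of April 25 (from the text)
TARGET_SIZE = 2_000        # approximate subsample size used in the study

# country -> (% of all sequences, % of reported cases worldwide), from the text
SHARES = {"Italy": (0.4, 7.4), "UK": (28.8, 5.5)}

def representation_ratio(seq_pct: float, case_pct: float) -> float:
    """Values below 1 mean under-sampled, above 1 over-sampled."""
    return seq_pct / case_pct

def quota(seq_pct: float, case_pct: float) -> int:
    """Sequences to keep so that the subsample mirrors the case proportions.

    Hypothetical rule: aim for the case share of the target size, capped by
    the number of sequences actually available for that country.
    """
    available = round(TOTAL_SEQUENCES * seq_pct / 100)
    wanted = round(TARGET_SIZE * case_pct / 100)
    return min(available, wanted)

for country, (seq_pct, case_pct) in SHARES.items():
    print(country,
          f"ratio={representation_ratio(seq_pct, case_pct):.2f}",
          f"quota={quota(seq_pct, case_pct)}")
```

Under these assumptions the UK must be thinned heavily, whereas Italy cannot reach its proportional share because too few Italian sequences exist, so some residual bias is unavoidable.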
Genomes are aligned by COVID-Align [16]. A first tree is inferred from the totality of the data, minus the duplicated sequences, by combining FastME [28] for the initial tree and RAxML-NG [29] to refine this first tree. This first tree is rooted with the "zero" sequence. This very simple rooting method is widely accepted and is used by others, notably GISAID and NEXTSTRAIN. Duplicated sequences are reinserted into the tree at a zero distance from their sister sequences. Outlier sequences (sequencing or dating errors) with an abnormally high rate of evolution compared to the zero sequence are removed (rate > median rate + 3 standard deviations). The tree thus obtained contains 11,262 sequences. It is poorly resolved, due to the close proximity of the sequences, and highly biased in terms of sampling density depending on the country. We will see below that the phylogeographic reconstruction based on this complete tree is poorly supported.
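The outlier rule stated above (rate > median rate + 3 standard deviations) can be sketched as follows. How the per-sequence rate is computed is not detailed here, so the input is simply assumed to be a mapping from a sequence name to a rate; the use of the population standard deviation is also an assumption.

```python
from statistics import median, pstdev

def filter_outliers(rates: dict[str, float]) -> dict[str, float]:
    """Drop sequences whose rate exceeds median + 3 standard deviations."""
    values = list(rates.values())
    threshold = median(values) + 3 * pstdev(values)
    return {name: rate for name, rate in rates.items() if rate <= threshold}

# Toy data: twenty ordinary sequences and one suspiciously fast one
rates = {f"seq{i}": 2.0 for i in range(20)}
rates["suspect"] = 50.0
kept = filter_outliers(rates)
print(len(rates) - len(kept), "sequence(s) removed")
```

Using the median rather than the mean as the centre keeps the threshold from being dragged upwards by the very outliers it is meant to detect.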
This tree is used to construct subsamples with low bias, while keeping the "phylogenetic diversity" as high as possible. This measure of biodiversity, commonly used in ecology and species conservation, is simply the sum of the branch lengths of the phylogeny studied [30]. In short, in species conservation the approach underlying this measure is to conserve the essential length of the tree of living organisms, which represents the sum of the genetic inventions carried by the species under consideration, rather than a large number of species, some of which may be genetically similar. Here, the method consists in removing duplicated or very similar sequences, while preserving the essential part of the complete epidemic tree. Steel [31] has shown the optimality of the algorithm consisting in iteratively removing the leaf associated with the shortest branch. This simple and fast algorithm allows us to find the sub-tree of greatest phylogenetic diversity for a given number of sequences, starting from a large initial tree. Here, one first calculates how many leaves from each country should be removed from the complete epidemic tree to approximate as closely as possible the proportions of reported cases. Next, the leaves associated with the shortest branches corresponding to the over-represented countries are randomly removed until a tree of the desired size is obtained. Moreover, for each country we aim to select sequences equally spread in time. However, all (263) sequences dating from December 2019 and January 2020 are kept in order to have as much information on the origin of the pandemic as possible. In this study we chose to keep about 2000 genomes in the tree, to have enough information and unbiased samples. As this algorithm has a random part, we repeated the sub-sampling five times in order to check the stability of our results on different data sets.
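The pruning step can be illustrated with a simplified sketch of the greedy idea: repeatedly remove the leaf with the shortest pendant branch among over-represented countries, merging any internal node left with a single child. The tree encoding (parent and branch-length dictionaries), the function name and the deterministic tie-breaking are assumptions made for this example; the actual pipeline breaks ties randomly, spreads sequences in time and keeps all early sequences.

```python
def debias(parent, length, country, excess):
    """Greedy pruning of the shortest pendant branches.

    parent:  node -> parent node (the root maps to None)
    length:  node -> length of the branch above the node
    country: leaf -> country
    excess:  country -> number of leaves to remove
    Returns the sorted list of leaves that are kept.
    """
    parent, length, excess = dict(parent), dict(length), dict(excess)
    children = {node: [] for node in parent}
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)
    leaves = {node for node, kids in children.items() if not kids}

    while True:
        candidates = [n for n in leaves if excess.get(country[n], 0) > 0]
        if not candidates:
            break
        # Shortest pendant branch first; ties broken by name (assumption)
        leaf = min(candidates, key=lambda n: (length[n], n))
        excess[country[leaf]] -= 1
        par = parent[leaf]
        leaves.remove(leaf)
        children[par].remove(leaf)
        del parent[leaf], length[leaf], children[leaf]
        if len(children[par]) == 1 and parent[par] is not None:
            # The parent became a unary node: merge it with its only child
            (child,) = children[par]
            grand = parent[par]
            length[child] += length[par]
            parent[child] = grand
            children[grand][children[grand].index(par)] = child
            del parent[par], length[par], children[par]
    return sorted(leaves)

# Toy tree: two internal nodes A and B under the root
parent = {"root": None, "A": "root", "B": "root",
          "uk1": "A", "uk2": "A", "uk3": "A", "it1": "B", "uk4": "B"}
length = {"A": 1.0, "B": 1.0,
          "uk1": 0.1, "uk2": 0.0, "uk3": 0.5, "it1": 0.2, "uk4": 0.05}
country = {"uk1": "UK", "uk2": "UK", "uk3": "UK", "uk4": "UK", "it1": "Italy"}
print(debias(parent, length, country, {"UK": 2}))
```

In the toy example the duplicate-like UK leaves (branch lengths 0.0 and 0.05) are removed first, so the remaining leaves retain most of the total branch length.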
For each sample of size ∼2000, we constructed and rooted a tree with the same method as for the complete tree (see above), and dated this tree using LSD [32]. Although these trees are smaller than the complete tree, they still present a relatively low resolution, with many polytomies (nodes having more than two descendants). To resolve these as much as possible, we used geographic characters, since the sequences did not provide any information. The idea is to group descendants attached to the same geographic character in the same sub-tree. For example, if a polytomy has direct descendants from France and Italy, we will create two sub-trees of this polytomy (now resolved) corresponding to the Italian and French nodes. This procedure for resolving polytomies takes place after the ancestral inference of geographic characters by PastML [33] and does not call into question this phase of inference. It induces a better resolution of the tree, which conforms to the principle of parsimony (and maximum likelihood with standard hypotheses and models). The result for one of the debiased samples is given in Figure 2 (the scenarios for the other four sub-samples are almost identical, see https://github.com/evolbioinfo/phylocovid/tree/CRAS/data/20200425/figures). To improve the readability of the figure, the PastML options are used, which consist in showing only the main epidemic clusters (i.e. a set of leaves and nodes connected in the tree and associated to the same geographical character [34]), and in grouping in the same arrow similar transmissions between two countries. This graph (Figure 2), which shows only the main epidemic clusters (size ≥ 16), does not show all the data (1239 sequences out of 1996) and the complexity of the transmission chains.
Figure 2. The nodes correspond to transmission clusters sharing the same geographical origin. We display the number of viruses sequenced in these clusters (e.g. 42 in the Italian cluster). For each cluster, we display the clades (S, V, L, G, GR and GH from GISAID) the sequences belong to. S contains the "sequence zero". The three G clades carry the G614 mutation of the Spike, GR and GH being G-derived clades. The dates are those of the origin of transmission within the cluster (for example between 29/11 and 10/12 2019 for the initial Chinese cluster). The thin arrows show the transmission by a single patient from one country to another (e.g. a Chinese origin for the English cluster of size 31). Thick arrows indicate multiple transmissions and their number (e.g. 13 transmissions from China to small US clusters, of sizes between 1 and 7). The dashed arrows indicate a polytomy whose resolution comes from geographical characters (e.g. Italian outbreak of size 42, with descendants in Spain, USA, France, Brazil and Russia). The smallest clusters (< 16 sequences) are not represented; 1239 out of 1996 sequences are included in this graph; the complete tree is available on https://github.com/evolbioinfo/phylocovid/tree/CRAS.
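The resolution of polytomies by geographic characters can be sketched on a single node. The representation below (a list of child names with their inferred country) and the choice to leave children that share the parent's country attached directly are assumptions made for illustration; they are not taken from the PastML code.

```python
from collections import defaultdict

def resolve_polytomy(parent_state, children):
    """Group the children of a polytomy by inferred country.

    children: list of (name, country) pairs.
    Returns the new list of direct descendants, where a tuple
    (country, [names]) stands for a new internal node created with a
    zero-length branch and labelled with that country.
    """
    groups = defaultdict(list)
    for name, state in children:
        groups[state].append(name)

    resolved = []
    for state, names in groups.items():
        if state == parent_state or len(names) == 1:
            resolved.extend(names)             # nothing to group
        else:
            resolved.append((state, names))    # new sub-tree for this country
    return resolved

polytomy = [("cn1", "China"), ("fr1", "France"), ("it1", "Italy"),
            ("fr2", "France"), ("it2", "Italy"), ("es1", "Spain")]
print(resolve_polytomy("China", polytomy))
```

In this toy case the six-way polytomy becomes a node with four descendants, two of which are new French and Italian sub-trees, so one transmission event per country is counted instead of two.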
As expected, at the root of the phylogeographic scenario (Figure 2), a Chinese epidemic cluster is observed, containing the four original clades: S (to which the "sequence zero" belongs), L, V and G (carrying the Spike G614 mutation). Conversely, the most recent outbreaks are all G, GR and GH, the last two being sub-clades derived from G. This scenario shows the role and diversity of the epidemic in the USA, by far the most affected country on April 25, 2020 (34% of reported cases worldwide). The first US clusters date from the end of December-beginning of January, the majority coming from China, but with a major cluster (size 168) coming from Canada, affected very early (17/12-08/01) by basal S strains. Other later USA clusters (January to March) came from Italy and France, the latter being the source of a large number of cases (341) in the USA, all from the GH sub-clade. In turn, there were transmissions from the USA to France (20), Germany (32) and Turkey (16). The main French clusters come from Italy (G and GH). Spain and Germany were both directly affected (S and L, respectively) by viruses coming from China, and by secondary epidemics, coming from Italy for Spain (G), and from the USA via France and Italy for Germany (GH). In this graph (Figure 2) showing only the main outbreaks, there is only one basal English outbreak in the UK (V, at the turn of 2019-2020, size 31), but there are many smaller and more recent ones, notably from Italy (G, 22/01-28/02/2020, size 13; GR, 21/01-22/02/2020, size 12; etc., see full tree). The global scenario (Figure 2) is consistent with more localized studies, for example on Europe [35] where it is shown, as here, that the first Italian epidemic outbreak originated in China and not in Germany as previously thought.
To confirm the accuracy of our phylogenetic reconstructions, we used available contact and travel data for 294 patients. For each, we compared the historical data (for example, a French patient known to have returned from Italy or to have been in contact with people returning from Italy), with the reconstruction produced by PastML (for example, for a French patient, the first ancestor in the tree whose predicted location differs from France). The agreement between these very different sources of information is high, about 50% for the five de-biased trees of size ∼2000. When considering the complete tree (11,269 sequences), the agreement is much lower (16%). It should be noted that full agreement is not expected due to the incompleteness of the data. A French patient may have been infected by a German, for example, even if they travelled to Italy. Similarly, a French patient who stayed in France may have been infected with an Italian strain whose carrier was not sampled. A 50% agreement is therefore particularly high and validates the approach as a whole.
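The comparison between travel history and reconstruction can be sketched as follows: for each patient, walk up the tree to the first ancestor whose inferred country differs from the patient's, and check it against the recorded exposure. The dictionary encoding of the tree and the function names are assumptions for this example, and the toy tree is invented.

```python
def inferred_source(leaf, parent, state):
    """Country of the first ancestor whose state differs from the leaf's."""
    node = parent.get(leaf)
    while node is not None:
        if state[node] != state[leaf]:
            return state[node]
        node = parent.get(node)
    return None  # no change of country on the path to the root

def agreement(travel, parent, state):
    """Fraction of patients whose recorded exposure matches the tree."""
    hits = sum(inferred_source(leaf, parent, state) == source
               for leaf, source in travel.items())
    return hits / len(travel)

# Toy tree: root (China) -> n1 (Italy) -> n2 (France) -> fr1
parent = {"root": None, "n1": "root", "n2": "n1",
          "fr1": "n2", "fr2": "n1", "fr3": "root"}
state = {"root": "China", "n1": "Italy", "n2": "France",
         "fr1": "France", "fr2": "France", "fr3": "France"}
travel = {"fr1": "Italy", "fr2": "Italy", "fr3": "Italy"}
print(f"agreement: {agreement(travel, parent, state):.0%}")
```

The third patient shows how disagreement can arise: the travel record says Italy, but the sampled tree attaches the sequence directly to a Chinese ancestor, for instance because the intermediate Italian carrier was never sequenced.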

Conclusions and perspectives
Sequence analysis very clearly indicates a natural origin of SARS-CoV-2 and no significant resemblance to HIV, as has been suggested. However, its origin remains largely unknown due to its remoteness from the closest sequenced animal viruses found in bats and pangolin. New data, from yet unexplored reservoirs or from old samples, should advance our knowledge of the origin of the virus and its date of appearance and circulation in the human population.
At the time of the second wave, work on the evolution of the virus is more important than ever. The sequences of SARS-CoV-2 are mutating and have many variants, both in nucleotides and amino acids. With rare exceptions, the mutations observed since December 2019 have so far not been shown to have an impact on virulence or severity. The most notable exception is the D614G mutation, which is increasing in prevalence worldwide and seems to increase infectivity. All mutations, while not directly affecting virulence or severity, are likely to induce variations in immune responses, which will need to be investigated in potential vaccines or new tests. A greater number of sequences covering longer periods of evolution, with a more exhaustive representation of different human populations, countries and continents, will make it possible to study these mutations (point mutations, deletions, insertions, recombinations, etc.) in terms of selection pressure, convergence, adaptation to the human host, virulence, severity and pandemic risk.
Under the pressure of this pandemic and its massive data, methods, algorithms and models are progressing rapidly, as seen above with phylogeographic analyses. This methodological work, supported by ever more abundant and exhaustive data, should establish molecular epidemiology as a key area in the study and control of future viral pandemics.