Importation, circulation, and emergence of variants of SARS-CoV-2 in the South Indian state of Karnataka

Background: As the coronavirus disease 2019 (COVID-19) pandemic continues, the selection of genomic variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) associated with higher transmission, more severe disease, re-infection, and immune escape are a cause for concern. Such variants have been reported from the UK (B.1.1.7), South Africa (B.1.351) and, Brazil (P.1/B.1.1.28). We performed this study to track the importation, spread, and emergence of variants locally. Methods: We sequenced whole genomes of SARS-CoV-2 from international travellers (n=75) entering Karnataka, South India, between Dec 22, 2020 and Jan 31, 2021, and from positive cases in the city of Bengaluru (n=108), between Nov 22, 2020- Jan 22, 2021, as well as a local outbreak. We present the lineage distribution and analysis of these sequences. Results: Genomes from the study group into 34 lineages. Variant B.1.1.7 was introduced by international travel (24/73, 32.9%). Lineage B.1.36 and B.1 formed a major fraction of both imported (B.1.36: 20/73, 27.4%; B.1: 14/73, 19.2%), and circulating viruses (B.1.36: 45/103; 43.7%,. B.1: 26/103; 25.2%). The lineage B.1.36 was also associated with a local outbreak. We detected nine amino acid changes, previously associated with immune escape, spread across multiple lineages. The N440K change was detected in 45/162 (27.7%) of the sequences, 37 of these were in the B.1.36 lineage (37/65, 56.92%) Conclusions: Our data support the idea that variants of concern spread by travel. Viruses with amino acid replacements associated with immune escape are already circulating. It is critical to check transmission and monitor changes in SARS-CoV-2 locally.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has claimed millions of lives and has affected people living in all parts of the globe 1 . The evolution of the virus did not initially alarm public health specialists or those involved in vaccine development 2 . However, the emergence of variants with distinct biological properties which include one or more mutations that confer higher infectivity, increased transmission, severe disease, re-infection, and immune escape are a cause for concern 3-9 . Such variants may influence the trend of the pandemic and are therefore broadly known as Variants of Concern (VOCs) 3-8 .
In India, the COVID-19 pandemic began with the importation of the virus in January 2020 10 . It is only after 11 million cases and over 150,000 deaths that the numbers declined, signalling the end of the first wave of SARS-CoV-2 in the country 1,10,11 . As with other countries in the world, India too started vaccination campaigns in January 2021, at about the same time that reports of VOCs were communicated from the United Kingdom (UK), Brazil, and South Africa 3,4,6,11 . The primary concern is that they may herald the second wave of SARS-CoV-2 in the county and/or undermine the vaccination drive.
Genomic studies in India have shown that several lineages of SARS-CoV-2 have been introduced, have spread, and fallen below the limit of detection since January 2020 12-22 . We have previously performed detailed genomic epidemiology of SARS-CoV-2 in the South Indian state of Karnataka, with a population of 64.1 million (Census 2014) 22 . We found multiple introductions of SARS-CoV-2 into the state and at least seven distinct lineages were already circulating in the state by May 2020. Detailed analysis of the contact network of COVID-19 cases to look at transmission within the state emphasized the role of symptomatic individuals in spreading the virus 23 . These data have contributed to our understanding of how the virus enters, spreads, and evolves in a population. In the genomic epidemiology study, no particular lineages were associated with disease severity 22 . Studies of sequences from India juxtaposed with sequences from all over the world, suggest that mutations associated with immune escape and re-infection are already circulating in the population 2,24-26 .
Multiple lineages of SARS-CoV-2 have been reported from across the world and in India 12,13,15-17, [19][20][21][22]27 . There are two ancestral lineages of SARS-CoV-2 in the PANGO classification system, A and B 28 . While viruses of both lineages are circulating across the world, viruses of lineage B are more widespread and prominent in number. The viruses responsible for the outbreak in Italy, in early 2020, with an amino acid change in the spike protein D614G were classified into lineage B.1 28 . This lineage is now the dominant lineage across the world. Several studies have now shown that viruses in this lineage transmit better, with increased infectivity in cell culture [29][30][31][32] .
Viruses of the lineage B.1 have acquired several other amino acid replacements in the Receptor Binding Domain of the Spike protein -specifically in the lineages which have been designated as VOCs, namely -B.1.1.7 (N501Y), B.1.351 (N501Y, E484K, K417T) and P.1 from the lineage B.1.1.28 (N501Y, E484K, K417T). Some of these amino acid replacements either singly or in combination have been shown to influence transmission of the virus, interfere with neutralization of the virus, and are associated with an increase in the number of hospitalizations 2,5,7,8 . The spread of these lineages, therefore, has global implications 5,33 . Early data suggests that some variants may escape neutralization by both therapeutic antibodies and antibodies induced by previous infection and vaccination 8,9,34,35 . This has implications for the efficacy of Spike sequence-based vaccines and suggests that re-infection is possible 7,36 .
Rapid sharing of genomic information enabled the global community to pick-up cases of VOCs and implement relevant public health measures 3,4,6 . A concentrated, ongoing, local approach to genomic surveillance is critical for the identification of variants and establishing epidemiological links with the trend of the outbreak 5,7,12,22 . This has also proved critical for local outbreak management and informed policy decisions across the world 5,7,37,38 .
It is in this context that we conducted genomic surveillance of COVID-19 positive international travellers to the south Indian state of Karnataka between Dec 22, 2020-Jan 31, 2021 (n=75). We also performed sequencing of SARS-CoV-2 (n=108), in samples collected between Nov 22, 2020-Jan 22, 2021) in Bengaluru city (Bengaluru Urban District) to identify and track locally circulating variants and potential VOCs. by the Institutional Ethics Committee of NIMHANS in light of the public health emergency. All samples were collected for routine diagnosis for COVID-19, as part of the State's requirement for epidemiological investigation of variants; and de-identified before analysis of data.

Samples for sequencing
On December 23, 2020, the Government of India established surveillance for detecting the importation of Variants of Concern. As part of this surveillance, nasopharyngeal and oro-pharyngeal swabs were collected from international travellers arriving at the international airport in Bengaluru between Dec 22, 2020-Jan 31, 2021. Samples testing positive by reverse transcription polymerase chain reaction (RT-PCR) and having Ct value < 30 (n=75) were included in the study. Further, the surveillance was also designed to include at least 5% of RT PCR positive samples received for routine diagnosis of COVID-19 from Nov 2020 -Jan 2021. To fulfil this samples from COVID-19 cases in Bengaluru city (n=108, 16.25% (108/664) collected between Nov 22, 2020-Jan 22, 2021) through routine surveillance and from a local outbreak in a nursing college in Bengaluru city in Feb 2021, n=14 were included in the study. Of the 42 samples collected from the local outbreak, 14 were suitable for sequencing (RT-PCR positive, Ct value < 30) and were analysed further. From previous experiments in our laboratory using a similar sequencing approach, we have ascertained that a Ct value of < 30 can inform on lineage of the virus and a Ct of < 25 was correlated with recovery of complete genomes. This was used to set the cut-offs for the two sample types.

Data for epidemiological curve
The data for confirmed new cases and numbers tested for SARS-CoV-2 in Bengaluru Urban district during the study period (Nov 22, 2020 -Jan 31, 2021) were obtained from covid19india.org, a dashboard that collated information released by the Government of Karnataka.
Nucleic Acid extraction and RT-PCR Nucleic acid extraction was performed with automated magnetic bead-based extraction method, using the Chemagic Viral DNA/RNA special H96 kit (PerkinElmer, CMG-1033-S) following manufacturer's instruction. SARS-CoV-2 detection was done using ICMR approved diagnostic kits. A total of 197 RT-PCR positive samples fulfilling the following criteria -(i) Ct values less than 30 in the case of international travellers (n=75), and local outbreak (n=14) or (ii) Ct value less than 25 for local cases (n=108), were taken for whole genome sequencing. Samples and RNA were stored at 4C for <1 week and -80C for long term storage.
Whole genome sequencing of SARS-CoV-2 Whole genome sequencing was performed using the amplicon sequencing approach described in the ARTIC Network protocol using the V3 primer set 39 . The resulting amplicons from 12-24 samples were barcoded using the native barcoding kits (NBD104/114, Oxford Nanopore Technology (ONT)) and sequencing libraries were prepared using the ligation sequencing kit (SQK-LSK109, ONT). The barcoded library was loaded on to FLO-MIN-106 flow cells and sequenced on the MinION (ONT). An average of 0.12 million (median) sequencing reads were acquired per sample with a median coverage of 1737x (see extended data S1 40 ). Raw sequencing reads have been deposited within BioProject ID: PRJNA670824.
Analysis of sequencing data and data sharing Analysis of sequencing reads was performed as described previously 22 . Briefly, sequences were basecalled and demultiplexed using guppy (v3.6) from ONT, alternatively open source basecallers such as Bonito can be used Sequences of length 100-600bp were considered for further analysis. Primer sequences were removed from the sequencing reads by trimming 25bp at the ends and additional trimming based on alignment using BBDuk (v38.37). Resulting reads were mapped to SARS-CoV-2 reference genome (NC_045512) using Minimap2 (v2.17) within Geneious Prime (Geneious Prime 2020.0.3). A consensus was created by calling the majority base (most common base) at each position, resulting in a consensus with the lowest ambiguities. Regions with coverage lower than 10 reads were called Ns. The consensus was aligned to the reference to ensure the correct reading frame, edited by manual inspection of the contigs, followed by transfer of annotations from the reference sequence. Of the 183 samples from international travellers and local cases, 176 (73/75 imported, 103/108 circulating) genomes could be used for the determination of lineage using the PANGO web application (Pangolin v2.2.2 lineages version 2021-02-12) 28 . Of the 176 genomes,162 were complete (>92% at 1X and >85% at 10X) and were deposited into the GISAID Database 41 , accession numbers for the sequences are provided as extended data (see extended data S2 40 ). Sequences have also been deposited in GenBank with accession numbers OM073810 -OM073973. Complete sequences (162) were analysed for SNPs and amino acid replacements with reference MN908947.3 (Wuhan-Hu-1) using the CoV-Glue Web Application 42 .

Phylogenetic analysis
A total of 168 genomes, including the 162 described above, and an additional 6 complete genomes from a local outbreak, were used for phylogenetic analysis with the reference NC_045512 as an outgroup. Multiple sequence alignment was performed using MUSCLE and a maximum likelihood tree was constructed using iqtree 43,44 . The GTR+F+I+G4 substitution model was found to be the best-fit model (of the 88 models tested) using the Bayesian Information Criterion. The consensus tree was constructed from 1000 bootstraps and bootstrap values over 70 were interpreted.

Results
In the study period (Nov 22, 2020 -Jan 21, 2021) an average of 510 SARS-CoV-2 cases were detected daily in the district of Bengaluru Urban in the South Indian state of Karnataka ( Figure 1A, B). We sequenced SARS-CoV-2 genomes from 197 SARS-CoV-2 positive individuals, including international travellers (n=75), local cases (n=108), and a local outbreak (n=14).
Lineage classification using the PANGO scheme was possible for 176 genomes which were either imported (73/75) or circulating (103/108), and for all 14 genomes from the local outbreak (extended data S3 40 ). The genomic surveillance for the local outbreak was carried out to identify the lineage/lineages responsible for the outbreak ( Figure 1CE).
A total of 34 lineages were detected from the 176 genomes in this study. A complete list of lineages and their frequencies are tabulated in Table 1   We carried out further analysis of the amino acid replacements in the receptor binding domain (RBD) of the spike protein ( Figure 2A, extended data, S6 40 ) and mapped them on the Maximum-Likelihood tree ( Figure 2B). We identified mutations leading to nine amino acid replacements in the RBD (Figure 2A). Of these, five (S477N Figure 2B). Of the six sequences from a cluster of cases (Outbreak), only a single sequence carried the mutation resulting in the N440K change ( Figure 2B). A single branch of the B.1.36+B.1.468 clade (n=4, 3 of which were imported) had an additional amino acid replacement F490S in the RBD ( Figure 2B). The mutations in the RBD were seen across the phylogenetic tree and clades ( Figure 2B). belonging to the B.1.36 lineage ( Figure 1CE and extended data, S3 40 ). Only one of the six sequences from the outbreak cluster had the mutation resulting the N440K replacement in the Spike protein ( Figure 2C, extended data S6).
Apart from the introduction and spread of known VOCs, the emergence of variants locally is also a cause for concern. Early in the pandemic, a single mutation in the gene encoding the Spike protein of SARS-CoV-2 resulting in a D614G amino acid change was identified to increase infectivity and transmission 2,29,32 . Viruses with this amino acid replacement dominate across the globe 31,46 . Mutations in the gene encoding the Spike protein are of particular concern due to the role of this protein and its Receptor Binding Domain (RBD) in viral binding and entry 47 .
Some of these mutations have been shown to increase infectivity, affinity to the angiotensin converting enzyme 2 (ACE-2) receptor or affect neutralization by antibodies in vitro. Viral genomes with these mutations were already circulating viruses by mid-2020 2,25,26,45,48,49 .
In the sequences from this study, nine amino acids replacements were noted in the RBD domain of the Spike protein ( Figure 2B and extended data S6 40 ). They occurred singly or in pairs (N440K+F490S) (Figure 2). All nine amino acid changes, namely N440K, S477N, V483A, E484K/Q, F490S, S494L/P, N501Y are associated with immune escape 24,25 . Viruses with some of these amino acid changes were already known to be circulating in other parts of India 16,17,24 .
Mutations in the gene encoding Spike protein that do not map to the RBD have also been described; particularly near the polybasic cleavage site at the S1/S2 boundary of the Spike protein. Towards the end of the year 2020, multiple lineages with amino acid replacements at position 677 were noted 50 . Four viruses in our study have mutations resulting in amino acid changes at this position (Q677H (n=3), Q677P (n=1)) (extended data, S6 40 ).
It is to be noted that in this study we have only included samples with Ct values less than 25 for surveillance of circulating SARS-CoV-2 genomes and Ct values less than 30 for sequencing of international travel-related cases. We have also sequenced only a fraction of cases in a limited geographical area. This may therefore present an incomplete view of circulating viruses and inflate the ones that are more readily sequenced. Also, as we have used the amplicon sequencing approach, not all regions of all lineages are well covered by sequencing reads. Others have also noted homoplasy in SARS-CoV-2, this highlights the need to be cautious while interpreting the phylogenetic relationships between SARS-CoV-2 sequences, especially in the context of outbreaks 51 .
In summary, our data highlight an increase in the frequency of the lineage B.1.36 in Bengaluru Urban, in Karnataka, and indicate an underappreciated global presence of this lineage ( Figure 1, Table 1, extended data S8 40 ). Whether this increase is because of epidemiological linkages such as increased travel, continued local transmission chains or super-spreader events remains to be determined. It is beyond the scope of this work to examine whether the lineage, contributing mutations, and amino acid changes impact transmission/infectivity of the virus. Our data emphasize that a consolidated and local approach to genomic surveillance which includes sequencing of SARS-CoV-2 from travellers, circulating variants, and outbreaks, in a continuous manner is necessary to detect VOCs. We believe that in regions where sequencing capacity is limited and sustained and continuous sequencing of local cases is a challenge, rapid and focussed sequencing of travellers and local outbreaks may serve as early warning systems for novel variants. Even as this conjecture remains to be tested, it is clear that rapid identification of such variants can aid in preparing the healthcare system for a surge in cases, suggest revisions to vaccines and diagnostic tests, inform the international community, and guide public health measures.

Data availability
Underlying data NCBI BioProject: SARS-CoV-2 Genome Sequencing. Accession number PRJNA670824; https://identifiers.org/NCBI/bioproject: PRJNA670824.  Introduction -"The viruses responsible for the catastrophic outbreak in Italy," -I have not seen this outbreak in Italy described as "catastrophic" before, and almost suggests this is the cause of the pandemic rather than the initial outbreak in China.

Extended data
○ Methods -Analysis of seq data and data sharing -this point is quite important:

○
The ONT sequencing was done using the artic protocol and artic v3 primers. The bioinformatics, however, does not use the recommended artic protocol: https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html. There needs to be more information on how the bioinformatics was done, specifically this step "A consensus genome was generated with a coverage cut-off of 10x and the 0% majority rule." ○ What tool was used to generate this? What does a 0% majority's rule mean?
○ To put into context the artic pipeline using a 40x cutoff for consensus calling, and the variant calling goes through nanopolish to address the highly problematic indels from ONT data, there are other steps as well to remove reads of abnormal length, and remove reads aligning to unexpected regions -were any of these steps done? If not are you sure of the reliability of your consensus seqs? How many of your consensus sequences have indels which broke coding sequences?

If applicable, is the statistical analysis and its interpretation appropriate? Not applicable
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
suggests this is the cause of the pandemic rather than the initial outbreak in China.
The word catastrophic has been removed.

Methods -Analysis of seq data and data sharing -this point is quite important:
The ONT sequencing was done using the artic protocol and artic v3 primers. The bioinformatics, however, does not use the recommended artic protocol: https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html. There needs to be more information on how the bioinformatics was done, specifically this step "A consensus genome was generated with a coverage cut-off of 10x and the 0% majority rule." The methods section has been changed to include a detailed bioinformatic workflow. "Briefly, sequences were basecalled and demultiplexed using guppy (v3.6) from ONT, alternatively open source basecallers such as Bonito can be used Sequences of length 100-600bp were considered for further analysis. Primer sequences were removed from the sequencing reads by trimming 25bp and additional trimming based on alignment using BBDuk (v38.37). Resulting reads were mapped to SARS-CoV-2 reference genome (NC_045512) using Minimap2 (v2.17) within Geneious Prime (Geneious Prime 2020.0.3). A consensus was created by calling the majority base (most common base) at each position, resulting in a consensus with the lowest ambiguities. Regions with coverage lower than 10 reads were called Ns. The consensus was aligned to the reference to ensure the correct reading frame, edited by manual inspection of the contigs, followed by transfer of annotations from the reference sequence."

What tool was used to generate this? What does a 0% majority's rule mean?
A mapping based assembly was performed using minimap2 to the SARS-CoV-2 reference genome (NC_045512). The consensus generation was performed using Geneious Prime® 2020.0.3. The 0% majority rule within Geneious Prime is the following -the most common base at position is called, resulting in a consensus with the lowest ambiguities. A coverage cut-off of 10X was set in order to call the consensus. Regions with coverage lower than 10 were called Ns.
We have changed this sentence in the Methods (described above) for clarity.

5.
To put into context the artic pipeline using a 40x cutoff for consensus calling, and the variant calling goes through nanopolish to address the highly problematic indels from ONT data, there are other steps as well to remove reads of abnormal length, and remove reads aligning to unexpected regions -were any of these steps done? If not are you sure of the reliability of your consensus seqs? How many of your consensus sequences have indels which broke coding sequences?
We agree with the reviewer that these are important considerations to generate a good quality consensus. As described above in the changes to the methods section -adapter and barcode removal, primer trimming and length filters were used in the analysis workflow.
Due to the computational resources needed for nanopolish, we chose to manually inspect the contigs for indels. The following rules were followed during manual inspection- The indels that were due to sequencing errors tended to be poorly supported/regions of heterogeneity and caused frameshifts. We manually inspected the contig for frameshifts. If the frameshift was due to an indel, we inspected the supporting reads. If >70% of the reads had the indel, the consensus was left unchanged. If the indel was not well supported and there were at least 10 reads without the indel, then the majority base was called. In case of a tie, the consensus was resolved to the wild-type/reference base. If less than 10 reads were present, the consensus was corrected to an N. If the indel was present at the end of reads, only the bridging reads were considered for calling the consensus using the above rules.
We have previously validated this approach using a few samples with nanopolish and are therefore confident of the consensus calls. We also noted that the indels occurred at specific positions in the genome, depending on the genomic context -usually homopolymer repeats or end of reads. These positions that needed manual editing were conserved for a particular lineage and were roughly about 29-30 positions per assembly. Also, both the phylogenetics and Nextclade QC did not suggest major issues with our consensus sequences. Additionally we also provide the sequencing reads in the SRA database for others to rebuild the consensus. We call genomes complete if consensus covers >85% of reference at 10X and >92% of reference at 1X. Some genomes were not complete by this criteria (extended data S1 and S3), however they had enough information to be assigned a lineage using the Pangolin tool. All 14 genomes had 1X coverage >74%. This now indicated in the text. We thank the reviewer for the opportunity to discuss this further. We do not have a detailed travel history for the international travellers, however, people with a history of travel to the UK were prioritized for sequencing at that time. Thank you for the opportunity to review this manuscript.

Results -"Genomic investigation of an outbreak of SARS-CoV
In this study, Nimhans and colleagues performed a whole genome sequencing on virus identified among returning travellers, local surveillance effort and one outbreak in college students. The main purpose of the study is to describe the circulation of variants in Karnataka state and identify key mutations that may have clinical and public health importance. I think the work is of high quality and the manuscript both clearly presented the research conducted and correctly cites the international literature around this issue. As the study is descriptive in nature, I found the study design to be appropriate and the methodology widely adopted in the field so readily reproducible.
The key interesting finding of the study is the identification of the B.1.36 lineage which I admit have not come across before. This lineage was both present in high enough proportion and contain significant mutations in spike to warrant attention and investigation. The authors did well to describe the extent of spread of this variant at the time, given the difficulty of the constantly changing nature SARS-CoV-2 lineage circulation. There are only a few minor suggested revisions on my part: The local epidemic situation at the time of study could provide important context for which the genomic data can be interpreted. A simple epidemic curve or a description of number of people tested and positive in the region would assist readers further gauge the significance of the findings around novel variants.

1.
The trees and phylogenetic analyses were solely based on local sequences. Even though for description purposes this is sufficient, adding related national and international sequences, especially those from potential sources of introduction, could vastly improve the interpretation.

2.
Table 3 and 4 are superfluous and can probably just be described in text. 3.
I would like to see the author discuss how this work shape their future priorities around genomic surveillance. The discussion suggested that the returning travellers are more readily sequenced than routinely diagnosed local cases. Would sequencing returning travellers from an important part of genomic surveillance? I think it would be good to add more voices in this discussion.

4.
2.The trees and phylogenetic analyses were solely based on local sequences. Even though for description purposes this is sufficient, adding related national and international sequences, especially those from potential sources of introduction, could vastly improve the interpretation.
We have performed a phylogenetic analysis using limited sequences of the B.1.36 lineage from all over India and across the world in this time period. This is included as part of extended data SF1. The following changes have been made-"Phylogenetic analysis of a subset of B.1.36 genomes (n=183) from across India and across the world showed that sequences from the study (n=70) both from circulating viruses as well as travel associated cases largely clustered together (extended data SF1 40 )." "The clustering of B.1.36 lineage from travellers with locally circulating viruses suggests either a common source of infection or infection with circulating lineages post-arrival (extended data SF1). Table 3 and 4 are superfluous and can probably just be described in text.  Table 4 has been removed, and a more detailed summary count of Lineage B.1.36 from across the world is provided in extended data S8.

3.
4. I would like to see the author discuss how this work shape their future priorities around genomic surveillance. The discussion suggested that the returning travellers are more readily sequenced than routinely diagnosed local cases. Would sequencing returning travellers from an important part of genomic surveillance? I think it would be good to add more voices in this discussion.
The resources for sequencing are not available equally in all parts of the world and given the resource limitations amplified by the pandemic, it has been hard to argue for an investment in genomic surveillance when it is not always clear what to do with the data or who truly benefits from it. Nevertheless, we believe this is important and of local and global relevance.
We have modified the last paragraph of the discussion as follows-"We believe that in regions where sequencing capacity is limited and sustained and continuous sequencing of local cases is a challenge, rapid and focussed sequencing of travellers and local outbreaks may serve as early warning systems for novel variants. Even as this conjecture remains to be tested, it is clear that rapid identification of such variants can aid in preparing the healthcare system for a surge in cases, suggest revisions to vaccines and diagnostic tests, inform the international community, and guide public health measures."