Whole genome sequencing of two human rhinovirus A types (A101 and A15) detected in Kenya, 2016-2018 [version 1; peer review: 2 approved]

Background: Virus genome sequencing is increasingly utilized in epidemiological surveillance. Genomic data allows comprehensive evaluation of underlying viral diversity and epidemiology to inform control. For human rhinovirus (HRV), genomic amplification and sequencing is challenging due to numerous types, high genetic diversity and inadequate reference sequences. Methods: We developed a tiled amplicon type-specific protocol for genome amplification and sequencing on the Illumina MiSeq platform of two HRV types, A15 and A101. We then assessed added value in analyzing whole genomes relative to the VP4/2 region only in the investigation of HRV molecular epidemiology within the community in Kilifi, coastal Kenya. Results: We processed 73 samples collected between 2016-2018, and 48 yielded at least 70% HRV genome coverage. These included all A101 samples (n=10) and 38 (60.3%) A15 samples.  Phylogenetic analysis revealed that the Kilifi A101 sequences interspersed with global A101 genomes available in GenBank collected between 19992016. On the other hand, our A15 sequences formed a monophyletic group separate from the global genomes collected in 2008 and 2019. Improved phylogenetic resolution was observed with the genome phylogenies compared to the VP4/2 phylogenies. Conclusions: We present a type-specific full genome sequencing approach for obtaining HRV genomic data and characterizing infections. Open Peer Review


Introduction
Genomic surveillance of respiratory viruses is important for (i) developing molecular diagnostics 1,2 , (ii) investigating transmission and evolution 3,4 , and (iii) development of vaccines and therapeutic drugs 5 . Human rhinovirus (HRV) is the most common cause of upper respiratory infections 6,7 and is also occasionally associated with lower respiratory infections 8 . It is a highly diverse virus, with over 160 distinct types identified globally 9 . This diversity presents a challenge in developing a sequencing protocol that works well across the different HRV types 10 . Most HRV molecular epidemiology studies utilize partial genome sequences, which offer lower resolution in identifying epidemiologically linked infections 11 . Viral genome sequencing can take one of the two approaches available: a targeted/enrichment approach or an agnostic/ metagenomics approach 12 . Theoretically, the viral metagenomics approach is an unbiased way to obtain all viral genetic content in a sample as it does not require prior knowledge of their sequences 13 . However, this approach requires high viral titers to succeed 14 , which are not always available in a clinical sample 15 . Furthermore, most clinical samples especially those relevant to HRV sequencing are dominated by host and bacterial nucleic acids 13 . Nonetheless, this challenge can be overcome by target enrichment, for example, polymerase chain reaction (PCR) to bulk up for the target virus before sequencing 15-17 . We describe a target enrichment sequencing approach of two HRV types, A15 and A101, using type-specific primers and compare the phylogenetic inferences between partial and whole genome sequences.

Study population
The study utilized nasopharyngeal swabs collected during two previous studies in Kilifi County, coastal Kenya: outpatient surveillance of nine health dispensaries within the Kilifi Health and Demographic Surveillance System (KHDSS) between January and December 2016 7 and primary school surveillance between May 2017 and April 2018 6 . The school was situated in Junju, a location within the KHDSS. All samples were collected from symptomatic individuals (mild symptoms of acute respiratory tract infection) of varied age (one month -49 years) and archived at -80°C.

Study design
Samples were screened for HRV and typed as previously described 8 . A cycle threshold (Ct) value < 35.0 was used to define positives. HRV positives underwent VP4/2 sequencing to characterize the diversity, spatial and temporal occurrence of HRV in the two settings. Comparison of the HRV diversity within the school in Junju and the health dispensary located in Junju revealed 12 common types, and the most frequent common type was HRV-A101 (n=5 in each setting) 9,11 .
For this study, we purposively selected two types from the two studies: the most frequent type observed in the KHDSS health dispensary surveillance (A15, n=63) and the most frequent common type identified at the Junju health dispensary and school (A101, n=10) for whole genome sequencing (WGS).

Ethics statement
Sample collection was undertaken following an informed written consent provided by parents or guardians for persons <18 years or by participating individuals if aged >17 years. Children whose parents consented were also asked for individual assent to participate. The study protocols were reviewed and approved by the University of Warwick Biomedical and Scientific Research Ethics Committee (BSREC #REGO-2016-1858 and #REGO-2015-6102) and the KEMRI-Scientific Ethics Review Unit (KEMRI-SERU #3332 and #3103).

Primer design
We retrieved nine type A101 genomes and three type A15 genomes, all >6000 nt long from GenBank 18 on 30 th September 2019. Geneious Prime 2019.2.1 (https://www.geneious.com) was used to design eight overlapping primer pairs across the ~7.2 kb HRV genome. The primers targeted eight amplicons 0.9-1.6 kb in size, with overlaps varying in size from 300 to 800 bases, Table 1.
RNA extraction, reverse transcription and PCR Viral RNA was extracted from 140 μl sample using QIAamp Viral RNA kit (Qiagen, USA) as per the manufacturer's recommendations. Reverse transcription was carried out using random hexamers and the Superscript III First-Strand Synthesis System (Invitrogen, United Kingdom). Genome-wide amplification using HRV-specific primers was done using the Q5 High-Fidelity 2X Master Mix (New England Biolabs, United Kingdom). PCR success was assessed by electrophoresing the products on a 1% agarose gel. Once suitable PCR conditions per amplicon were established, a duplex PCR of non-consecutive amplicons of similar conditions was set up (Protocol doidx.doi.org/10.17504/protocols.io.bukxnuxn).
Sequencing PCR products were purified with 1X AMPure XP beads (Beckman Coulter Inc.), quantified with Qubit dsDNA High Sensitivity Assay (Invitrogen, United Kingdom), pooled per sample and normalized to 0.2 ng/uL. Sequencing libraries were prepared using the Nextera XT Sample Preparation Kit (Illumina, CA) and sequencing performed on Illumina MiSeq platform (200 bp × 2) per sample.

Sequence assembly
The raw reads were quality checked using FastQC v0.11.9 and trimmed (Phred score >30) using Trimmomatic v0.39 19 . HRV reads were identified by mapping to the respective reference strains (https://www.picornaviridae.com/sg3/enterovirus/ rv-a/rv-a_seqs.htm) and subsequently assembled into contigs using SPAdes v3.12.0. The contigs were checked for completeness and assembled to a consensus sequence using Sequencher v5.4.6 (www.genecodes.com). We defined sequencing success as obtaining HRV reads covering at least 70% of the genome (>5040 bases). Sequencing depth was visualized using the deepTools 20 package.  Figure 1A and B. Besides, samples that failed sequencing did not have unique phylogenetic clustering patterns based on their previously generated VP4/2 sequences. Sequencing depth was comparable across Ct-values, with the mean depth coverage per genome ranging from 351 -13356 reads per base pair, Figure 1C and D.
The ends of the 5' untranslated region (UTR) and 3' UTR were not amplified due to lack of suitable primers. Genetic diversity was observed across the entire genome and not within a particular genomic region for both types, as shown in Figure 3.

Phylogenetic resolution
We compared the phylogenetic bifurcation patterns and statistical uncertainty of VP4/2 and WGS for the two types. The depiction of sister taxa was comparable across the two trees. However, WGS resolved phylogenetic polytomies/unresolved branches observed previously in the VP4/2 phylogenies. For example, using VP4/2 sequences, all viruses collected in the school formed one polytomy, which was now fully bifurcated using WGS. Similarly, for A15, the four polytomies observed on VP4/2 phylogeny were well resolved using WGS, effectively distinguishing one sample from the other. Although the mean pairwise distances across VP4/2 and WGS were close, the standard error of pairwise distance calculations was notably less (about a tenth) in WGS than VP4/2, Figure 4.
Overall higher branching posterior probabilities in Bayesian phylogenetic trees were observed using WGS than VP4/2  sequences. In VP4/2 Bayesian trees, 17.5% of A15 and 64.7% of A101 nodes had a posterior probability greater than 0.7 compared to 64.3% and 88.2% nodes in WGS trees, respectively, as illustrated in Figure 5.
The improved resolution was further depicted by haplotype networks that displayed notably more alleles using WGS than VP4/2, e.g., in HRV-A101, school sequences that were considered a single allele using VP4/2 sequences resolved into five alleles when using whole-genome sequences, Figure 6. Identical samples at the VP4/2 region had a median of 3 nt changes for A101 and 5 nt changes for A15 across the whole genome.

Recombination analysis
Recombination scans identified breakpoints within the VP3 of one A101 sequence (KEN_Rhinovirus_7018), with both parents belonging to A101 type (p-value < 1.922E-2), Figure 6C.  Recombination within HRV structural regions has been shown to be rare and sporadic 30 .

Discussion
This study presents a type-specific whole genome sequencing protocol for two HRV types. A101 had a higher success (100%) than A15 (60.3%). We attribute the higher A101 sequencing success to the higher number of sequences (n=9) that were available for primer design, which captured more intratype variation, compared to A15 (n=3). Having more genomes contributing to the consensus sequence used in primer design increased genetic variation, and subsequently, the likelihood that the local and contemporaneous diversity was captured.
Whole-genome sequencing provided greater phylogenetic resolution and less statistical uncertainty to partial sequencing.
Polytomies are a product of inadequate data and are, therefore, a potential source of bias. They also result in reduced statistical power due to increased uncertainty. The loss of terminal phylogenetic resolution may result in two opposing predictions: the underestimation (due to unresolved taxa) or overestimation (due to increased total tree length) of diversity 31 . Due to the short size of VP4/2 (~420nt), insufficient data results in reduced phylogenetic resolution and increased uncertainty evidenced by a higher standard error in phylogenetic distance. Unresolved phylogenies are a challenge in epidemiology as one cannot distinguish infections from one individual to another for transmission inference.
Posterior probabilities summarize the uncertainties about a parameter and indicate confidence in the evidence 32 . High posterior probabilities indicate high confidence, and the reverse is also true. Whole genomes consistently provided higher confidence across the two genotypes assessed in this study.
With pathogen sequencing now an established tool to track viral infections 2,12 , it is crucial to compare the resolution of different sequence analysis. As the huge antigenic diversity of HRV continues to pose a challenge in vaccine development, efforts should be directed towards understanding and mitigating transmission. Our study shows that HRV WGS is better suited for transmission inference to the commonly used VP4/2 sequences.
The sequencing approach we developed has some limitations. First, it requires prior genotyping of the HRV positive samples, commonly done by VP4/2 or VP1 sequencing. It is therefore unsuitable for sequencing new or highly divergent types due to the requirement of matching primer sequencing. Second, having to create primer sets for each type is cumbersome and relies on adequate number of pre-existing genomes to design conserved primers. In addition, an amplicon-based target enrichment does not work well for low complexity regions such as the 5'UTR and 3' UTR. Notwithstanding, the new method can successfully enrich for human rhinovirus in archived samples of varying virus titers. It can also effectively capture intratype recombinant regions enabling detailed study of viral dynamics.

Conclusions
With HRV being the most common respiratory virus, it is surprising that we have such few publicly available whole genomes to allow detailed intra-type analysis. We describe a new protocol for the whole genome sequencing of two HRV types and enrich the public database of HRV genomes. The protocol can be adapted for other HRV types. Our study also shows that WGS is more informative than VP4/2 sequencing in studying HRV dynamics as it maximizes resolution and reduces phylogenetic uncertainty.

Yewande Nejo
Microbiology Programme, College of Agriculture, Engineering and Science, Bowen University, Iwo, Nigeria Overall comments: This is an important and well executed study by Luka et al. (2021). It provides detailed informative data on the type-specific whole genome sequencing method designed to detect human rhinovirus. The study is able to present target enrichment sequencing approach which addresses the hurdles in genome amplification and sequencing of human rhinovirus posed by the high genetic diversity of the virus and inadequate information of reference sequences. The study is properly designed; however, it will require few additions.

Abstract:
The abstract is a good summary of the research article but please indicate in the abstract the type of samples that were collected.

Introduction:
The introduction provides information that is significant to the research and is well referenced. The importance and purpose of the study is well explained and justifiable.

Methodology:
The methodology segment is well designed and all analyses were carried out with details and sound scientific merit. The author should however specify in this section the type of research carried out.

Results:
The results are well stated and illustrated in a comprehensive manner. The data is meticulously analyzed

Discussion and Conclusion:
The results of this study are well discussed and the data supports the conclusion. The limitations of the study are well stated. However, the reasons for selecting only two types of HRV for whole genome sequencing out of 12 types were not properly indicated. Can the author also explain what could be responsible for the sequencing failure of 25 samples? What could be the implications of not sequencing 5'UTR and 3' UTR regions in the study of viral dynamics?
Is the work clearly and accurately presented and does it cite the current literature? Yes Is the study design appropriate and is the work technically sound? Yes

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes relatively inadequate reference sequences. The study is well designed, with detailed analysis and well-reported conclusions.
expertise to confirm that it is of an acceptable scientific standard.