Whole genome sequencing of two human rhinovirus A types (A101 and A15) detected in Kenya, 2016-2018

Background: Virus genome sequencing is increasingly utilized in epidemiological surveillance. Genomic data allows comprehensive evaluation of underlying viral diversity and epidemiology to inform control. For human rhinovirus (HRV), genomic amplification and sequencing is challenging due to numerous types, high genetic diversity and inadequate reference sequences. Methods: We developed a tiled amplicon type-specific protocol for genome amplification and sequencing on the Illumina MiSeq platform of two HRV types, A15 and A101. We then assessed added value in analyzing whole genomes relative to the VP4/2 region only in the investigation of HRV molecular epidemiology within the community in Kilifi, coastal Kenya. Results: We processed 73 nasopharyngeal swabs collected between 2016-2018, and 48 yielded at least 70% HRV genome coverage. These included all A101 samples (n=10) and 38 (60.3%) A15 samples. Phylogenetic analysis revealed that the Kilifi A101 sequences interspersed with global A101 genomes available in GenBank collected between 1999-2016. On the other hand, our A15 sequences formed a monophyletic group separate from the global genomes collected in 2008 and 2019. An improved phylogenetic resolution was observed with the genome phylogenies compared to the VP4/2 phylogenies. Conclusions: We present a type-specific full genome sequencing approach for obtaining HRV genomic data and characterizing infections.


Introduction
Genomic surveillance of respiratory viruses is important for (i) developing molecular diagnostics 1,2 , (ii) investigating transmission and evolution 3,4 , and (iii) development of vaccines and therapeutic drugs 5 . Human rhinovirus (HRV) is the most common cause of upper respiratory infections 6,7 and is also occasionally associated with lower respiratory infections 8 . It is a highly diverse virus, with over 160 distinct types identified globally 9 . This diversity presents a challenge in developing a sequencing protocol that works well across the different HRV types 10 . Most HRV molecular epidemiology studies utilize partial genome sequences, which offer lower resolution in identifying epidemiologically linked infections 11 . Viral genome sequencing can take one of the two approaches available: a targeted/enrichment approach or an agnostic/ metagenomics approach 12 . Theoretically, the viral metagenomics approach is an unbiased way to obtain all viral genetic content in a sample as it does not require prior knowledge of their sequences 13 . However, this approach requires high viral titers to succeed 14 , which are not always available in a clinical sample 15 . Furthermore, most clinical samples especially those relevant to HRV sequencing are dominated by host and bacterial nucleic acids 13 . Nonetheless, this challenge can be overcome by target enrichment, for example, polymerase chain reaction (PCR) to bulk up for the target virus before sequencing 15-17 . We describe a target enrichment sequencing approach of two HRV types, A15 and A101, using type-specific primers and compare the phylogenetic inferences between partial and whole genome sequences. We used a combination of genomic, methoddevelopment, descriptive and retrospective approaches to achieve this.

Study population
The study utilized nasopharyngeal swabs collected during two previous studies in Kilifi County, coastal Kenya: outpatient surveillance of nine health dispensaries within the Kilifi Health and Demographic Surveillance System (KHDSS) between January and December 2016 7 and primary school surveillance between May 2017 and April 2018 6 . The school was situated in Junju, a location within the KHDSS. All samples were collected from symptomatic individuals (mild symptoms of acute respiratory tract infection) of varied age (one month -49 years) and archived at -80°C.

Study design
Samples were screened for HRV and typed as previously described 8 . A cycle threshold (Ct) value < 35.0 was used to define positives. HRV positives underwent VP4/2 sequencing to characterize the diversity, spatial and temporal occurrence of HRV in the two settings. The most frequent type observed in the KHDSS health dispensary surveillance was A15 (n=63). Comparison of the HRV diversity within the school (located in Junju) and the Junju outpatient clinic revealed 12 common types, and the most frequent common type was HRV-A101 (n=10, with a frequency of n=5 in each setting) 9,11 .
For this study, we purposively selected two types from the two studies: A15 from the KHDSS surveillance and A101 from both the Junju health dispensary and school for whole genome sequencing (WGS). These two types were selected due to their high frequency of occurrence at the various scales of observation studied.

Ethics statement
Sample collection was undertaken following an informed written consent provided by parents or guardians for persons <18 years or by participating individuals if aged >17 years. Children whose parents consented were also asked for individual assent to participate. The study protocols were reviewed and approved by the University of Warwick Biomedical and Scientific Research Ethics Committee (BSREC #REGO-2016-1858 and #REGO-2015-6102) and the KEMRI-Scientific Ethics Review Unit (KEMRI-SERU #3332 and #3103).

Primer design
We retrieved nine type A101 genomes and three type A15 genomes, all >6000 nt long from GenBank 18 on 30 th September 2019. Geneious Prime 2019.2.1 (https://www.geneious.com) was used to design eight overlapping primer pairs across the ~7.2 kb HRV genome. The primers targeted eight amplicons 0.9-1.6 kb in size, with overlaps varying in size from 300 to 800 bases, Table 1.
RNA extraction, reverse transcription and PCR Viral RNA was extracted from 140 μl sample using QIAamp Viral RNA kit (Qiagen, USA) as per the manufacturer's recommendations. Reverse transcription was carried out using random hexamers and the Superscript III First-Strand Synthesis System (Invitrogen, United Kingdom). Genome-wide amplification using HRV-specific primers was done using the Q5

Amendments from Version 1
Revisions have been made to the manuscript to address reviewers' comments. The changes in the new version are listed below.

Abstract:
We specify the type of samples used in the Abstract section Methods: We mention the type of research approaches used in the study.
We restructure the sentences to better explain the rationale behind choosing two rhinovirus types for this study.

Discussion:
We address the reviewers' concerns on (i) possible reasons for failed sequencing of 25 samples, and (ii) implications of not sequencing the extreme sections of the 5' and 3' non-coding regions.

Funding:
We have listed all funding agencies that supported this work, which were missed out in the previous version.
Other minor changes include the addition of suitable references to support the changes above.
Any further responses from the reviewers can be found at the end of the article

Sequence analysis
Sequences were aligned using MAFFT v7.271 21 . Recombination scans were done using RDP5 22 and visualized on SimPlot 23 . Nucleotide substitutions across the genomes were visualized using a python script to examine genetic diversity across the genome. POPART 24 was used to construct haplotype networks using the Minimum Spanning Network model. The best-fitting model and maximum likelihood trees were inferred using IQ-TREE, v1.6.0 25 . Branch support for phylogenetic trees was assessed using bootstrapping of 1000 iterations. MegaX 26 was used to calculate mean pairwise distances, and the respective standard errors were assessed using 100 iterations.
Bayesian phylogeny was used to create time-structured phylogenetic trees using BEAST v.  Figure 1A and B. Besides, samples that failed sequencing did not have unique phylogenetic clustering patterns based on their previously generated VP4/2 sequences. Sequencing depth was comparable across Ct-values, with the mean depth coverage per genome ranging from 351 -13356 reads per base pair, Figure 1C and D.
The ends of the 5' untranslated region (UTR) and 3' UTR were not amplified due to lack of suitable primers. Genetic diversity was observed across the entire genome and not within a particular genomic region for both types, as shown in Figure 3.

Phylogenetic resolution
We compared the phylogenetic bifurcation patterns and statistical uncertainty of VP4/2 and WGS for the two types. The depiction of sister taxa was comparable across the two trees. However, WGS resolved phylogenetic polytomies/unresolved branches observed previously in the VP4/2 phylogenies. For example, using VP4/2 sequences, all viruses collected in the school formed one polytomy, which was now fully bifurcated using WGS. Similarly, for A15, the four polytomies observed on VP4/2 phylogeny were well resolved using WGS, effectively distinguishing one sample from the other. Although the mean pairwise distances across VP4/2 and WGS were close, the standard error of pairwise distance calculations was notably less (about a tenth) in WGS than VP4/2, Figure 4.
Overall higher branching posterior probabilities in Bayesian phylogenetic trees were observed using WGS than VP4/2 sequences. In VP4/2 Bayesian trees, 17.5% of A15 and 64.7% of A101 nodes had a posterior probability greater than 0.7 compared to 64.3% and 88.2% nodes in WGS trees, respectively, as illustrated in Figure 5.
The improved resolution was further depicted by haplotype networks that displayed notably more alleles using WGS than VP4/2, e.g., in HRV-A101, school sequences that were considered a single allele using VP4/2 sequences resolved into five alleles when using whole-genome sequences, Figure 6. Identical samples at the VP4/2 region had a median of 3 nt changes for A101 and 5 nt changes for A15 across the whole genome.

Recombination analysis
Recombination scans identified breakpoints within the VP3 of one A101 sequence (KEN_Rhinovirus_7018), with both parents belonging to A101 type (p-value < 1.922E-2), Figure 6C. Recombination within HRV structural regions has been shown to be rare and sporadic 30 .

Discussion
This study presents a type-specific whole genome sequencing protocol for two HRV types. A101 had a higher success (100%) than A15 (60.3%). We attribute the higher A101 sequencing success to the higher number of sequences (n=9) that were available for primer design, which captured more intratype variation, compared to A15 (n=3). Having more genomes contributing to the consensus sequence used in primer design increased genetic variation, and subsequently, the likelihood that the local and contemporaneous diversity was captured. While it's not clear what the cause of sequencing failure of the 25 samples was, we speculate that either (i) their genomic diversity was not captured in primer design resulting in primer mismatches, (ii) the sample quality had deteriorated over time or (iii) there was presence of PCR inhibitors/ nuclease enzymes in the sample.
Whole-genome sequencing provided greater phylogenetic resolution and less statistical uncertainty to partial sequencing.
Polytomies are a product of inadequate data and are, therefore, a potential source of bias. They also result in reduced statistical power due to increased uncertainty. The loss of terminal phylogenetic resolution may result in two opposing predictions: the underestimation (due to unresolved taxa) or overestimation (due to increased total tree length) of diversity 31 . Due to the short size of VP4/2 (~420nt), insufficient data results in reduced phylogenetic resolution and increased uncertainty evidenced by a higher standard error in phylogenetic distance. Unresolved phylogenies are a challenge in epidemiology as one cannot distinguish infections from one individual to another for transmission inference.
Posterior probabilities summarize the uncertainties about a parameter and indicate confidence in the evidence 32 . High posterior probabilities indicate high confidence, and the reverse is also true. Whole genomes consistently provided higher confidence across the two genotypes assessed in this study.
With pathogen sequencing now an established tool to track viral infections 2,12 , it is crucial to compare the resolution of different sequence analysis. As the huge antigenic diversity of HRV continues to pose a challenge in vaccine development, efforts should be directed towards understanding and mitigating  transmission. Our study shows that HRV WGS is better suited for transmission inference to the commonly used VP4/2 sequences.
The sequencing approach we developed has some limitations. First, it requires prior genotyping of the HRV positive samples, commonly done by VP4/2 or VP1 sequencing. It is therefore unsuitable for sequencing new or highly divergent types due to the requirement of matching primer sequencing. Second, having to create primer sets for each type is cumbersome and relies on adequate number of pre-existing genomes to design conserved primers. In addition, an amplicon-based target enrichment does not work well for low complexity regions such as the 5'UTR and 3' UTR. Although the 5'UTR alone does not offer adequate resolution to confidently distinguish HRV types 33 , it is speculated to be a hotspot for recombination 30,33 . Not sequencing these extreme regions may therefore result in missing out on important evolutionary/phylogenetic signal. Notwithstanding,  the new method can successfully enrich for human rhinovirus in archived samples of varying virus titers. It can also effectively capture intra-type recombinant regions enabling detailed study of viral dynamics.

Conclusions
With HRV being the most common respiratory virus, it is surprising that we have such few publicly available whole genomes to allow detailed intra-type analysis. We describe a new protocol for the whole genome sequencing of two HRV types and enrich the public database of HRV genomes. The protocol can be adapted for other HRV types. Our study also shows that WGS is more informative than VP4/2 sequencing in studying HRV dynamics as it maximizes resolution and reduces phylogenetic uncertainty.

Yewande Nejo
Microbiology Programme, College of Agriculture, Engineering and Science, Bowen University, Iwo, Nigeria Overall comments: This is an important and well executed study by Luka et al. (2021). It provides detailed informative data on the type-specific whole genome sequencing method designed to detect human rhinovirus. The study is able to present target enrichment sequencing approach which addresses the hurdles in genome amplification and sequencing of human rhinovirus posed by the high genetic diversity of the virus and inadequate information of reference sequences. The study is properly designed; however, it will require few additions.

Abstract:
The abstract is a good summary of the research article but please indicate in the abstract the type of samples that were collected.

Introduction:
The introduction provides information that is significant to the research and is well referenced. The importance and purpose of the study is well explained and justifiable.

Methodology:
The methodology segment is well designed and all analyses were carried out with details and sound scientific merit. The author should however specify in this section the type of research carried out.

Results:
The results are well stated and illustrated in a comprehensive manner. The data is meticulously analyzed

Discussion and Conclusion:
The results of this study are well discussed and the data supports the conclusion. The limitations of the study are well stated. However, the reasons for selecting only two types of HRV for whole genome sequencing out of 12 types were not properly indicated. Can the author also explain what could be responsible for the sequencing failure of 25 samples? What could be the implications of not sequencing 5'UTR and 3' UTR regions in the study of viral dynamics?
Is the work clearly and accurately presented and does it cite the current literature? Yes Is the study design appropriate and is the work technically sound? Yes

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes genome sequences

Discussion and Conclusion
The reasons for selecting only two types of HRV for whole genome sequencing out of 12 types were not properly indicated.
The current study was nested within a larger epidemiological study to study pathways of respiratory virus disease within different scales which included: (i) outpatient surveillance of nine health dispensaries within the Kilifi Health and Demographic Surveillance System (KHDSS) [1] and (ii) a primary school surveillance [2]. The most frequent type observed in the KHDSS health dispensary surveillance was A15 (n=63). The second social scale, the school, was situated in Junju, a location within the KHDSS. Comparison of the HRV diversity within the school and the Junju outpatient clinic revealed 12 common types [3], and the most frequent common type was A101 (n=5 in each setting). These two types were identified as of interest due to their high frequency of occurrence at the various scales of observation studied.
Can the author also explain what could be responsible for the sequencing failure of 25 samples? While it's not clear what the cause of sequencing failure of the 25 samples was, we speculate that either (i) their genomic diversity was not captured in primer design resulting in primer mismatches, (ii) the sample quality had deteriorated over time or (iii) there was presence of PCR inhibitors/ nuclease enzymes in the sample.

What could be the implications of not sequencing 5'UTR and 3' UTR regions in the study of viral dynamics?
Analysis of the 5'UTR across HRV shows high conservation of certain regions [4], making it a suitable target for HRV detection assays [5]. Contrastingly, the ~50 bp long 3'UTR is not universally conserved across HRV species [4]. Although the 5'UTR alone has not been proved to offer adequate resolution to confidently distinguish HRV types [6], it is speculated to be a hotspot for recombination [7]. Not sequencing these extreme regions may therefore result in missing out on important evolutionary/phylogenetic signal.