A new Plasmodium vivax reference sequence with improved assembly of the subtelomeres reveals an abundance of pir genes

Plasmodium vivax is now the predominant cause of malaria in the Asia-Pacific, South America and Horn of Africa. Laboratory studies of this species are constrained by the inability to maintain the parasite in continuous ex vivo culture, but genomic approaches provide an alternative and complementary avenue to investigate the parasite’s biology and epidemiology. To date, molecular studies of P. vivax have relied on the Salvador-I reference genome sequence, derived from a monkey-adapted strain from South America. However, the Salvador-I reference remains highly fragmented with over 2500 unassembled scaffolds. Using high-depth Illumina sequence data, we assembled and annotated a new reference sequence, PvP01, sourced directly from a patient from Papua Indonesia. Draft assemblies of isolates from China (PvC01) and Thailand (PvT01) were also prepared for comparative purposes. The quality of the PvP01 assembly is improved greatly over Salvador-I, with fragmentation reduced to 226 scaffolds. Detailed manual curation has ensured highly comprehensive annotation, with functions attributed to 58% core genes in PvP01 versus 38% in Salvador-I. The assemblies of PvP01, PvC01 and PvT01 are larger than that of Salvador-I (28-30 versus 27 Mb), owing to improved assembly of the subtelomeres. An extensive repertoire of over 1200 Plasmodium interspersed repeat ( pir) genes were identified in PvP01 compared to 346 in Salvador-I, suggesting a vital role in parasite survival or development. The manually curated PvP01 reference and PvC01 and PvT01 draft assemblies are important new resources to study vivax malaria. PvP01 is maintained at GeneDB and ongoing curation will ensure continual improvements in assembly and annotation quality.


Introduction
Infection with Plasmodium vivax is associated with significant direct and indirect morbidity that impacts on the poorest communities of malarious countries, with an estimated annual global cost of $1-2.7 billion [1][2][3] . Accumulating reports of drug-resistant infection and life-threatening disease underscore the urgency to reduce the burden of P. vivax and ensure its ultimate elimination 4-8 . Efforts to contain P. vivax are constrained by a limited understanding of the parasite's basic biology, in part owing to the inability to maintain this species in continuous ex vivo culture. Genetic studies provide an alternative approach to gain novel insights into the parasite from which epidemiological tools and therapeutic approaches can be developed for clinical application 9-17 . The rapidly declining costs of massively parallel sequencing technologies have made it feasible to undertake whole genome sequencing of hundreds of Plasmodium isolates, with recent population genomic studies of P. vivax revealing novel antimalarial drug resistance and vaccine candidates amongst other biological features of the parasite 16,17 . However, in order to achieve a comprehensive understanding of the structure and composition of the P. vivax genome, and to improve read mapping efforts to characterise genetic polymorphisms, a high quality reference genome(s) representative of naturally occurring patient isolates is essential.
The sequences of 5 monkey-adapted strains including the Salvador-I reference 14 and drafts of Brazil-I, India-VII, North Korea and Mauritania-I 13 have provided important resources for the vivax research community to investigate the core genome of P. vivax. However, over 60% of the genes in the published Salvador-I reference 14 (prior to curation by the authors) had unknown function, limiting insight into underlying biological mechanisms. Furthermore, assembly of the subtelomeric regions is highly fragmented in these strains, with Salvador-I comprising >2500 scaffolds. A subsequent draft assembly of a Cambodian patient isolate (C127) revealed 792 genes not present in Salvador-I, including 366 new pir (Plasmodium interspersed repeat) genes 11 . The pir genes are a highly variable multigene family present in all Plasmodium genomes investigated to date 18 . To address the need of the vivax research community for a P. vivax reference with more comprehensive assembly and annotation, we used Illumina genomic data to establish a reference from a Papua Indonesian patient isolate (PvP01). Since P. vivax exhibits marked regional variation in phenotypes such as duration of the dormant liver-stage, drug resistance and disease severity, we compared PvP01 to C127 and the 5 monkey-adapted strains, and generated draft assemblies of patient isolates from Thailand (PvT01) and central China (PvC01). Our sampling focuses on the Asia-Pacific region, where a large burden of P. vivax infection lies 24 . The Indonesian reference provides representation of the island of Papua -the epicentre of multidrug resistance emergence in P. vivax 8 . The draft references from Thailand and Central China provide respective representation of the Mekong region, and the temperate north where long latency phenotypes prevail 25 .

Samples
Three P. vivax field isolates that were judged to be clonal infections following preliminary genomic analysis within the framework of a separate study 17 were selected for assembly. . Post-assembly genome improvements were undertaken using a range of automated configuration tools including ABACAS (version 2), IMAGE (version 2, iterating k-mers from 71 down 31, 7 iterations), Gapfiller (version 1-11, 14 iteration, parameter n=31) and iCORN (version 2, 7 iterations). PAGIT (version 1) and REAPR (version 1.0.17) were employed to detect assembly errors 32-38 . This was followed by visual inspection using ACT 39 to identify any further assembly anomalies. Annotation was undertaken initially using the automated algorithms, RATT (version 1) and Augustus (version 2.7, trained on 500 manually curated gene models) 38,40,41 and further improved by detailed manual inspection performed by an experienced genome curator. PvT01 and PvC01 were annotated using Companion, a new automated annotation tool 42 . RNA-Seq data from asexual blood stage preparations of 4 P. vivax patient isolates from Cambodia (unpublished report, Jessica Hostetler, Lia Chappell, Chanaki Amaratunga, Seila Suon, Thomas D. Otto, Rick Fairhurst and Julian C. Rayner; Accession number ERP017542) was used as supporting evidence to aid the improvement of gene models in PvP01 by manual curation.
For comparative analyses, genome assemblies and gene annotations were sourced for 6 additional P. vivax strains; Salvador-I, C127, Brazil-I, India-VII, Mauritania-I and North Korea 9,13,14 . The published version of Salvador-I 14 presented in PlasmoDB release 9 was selected for comparison of gene annotations as the additional improvements in release 10 reflected curations performed by the authors. Companion was also used to update the annotation of four previously published genomes (Brazil-I, India-VII, Mauritania-I and North Korea).

OrthoMCL and pir analysis
Comparisons of predicted protein-coding genes between the 9 P. vivax assemblies and P. falciparum 3D7 (Pf3D7) (geneDB.org) were undertaken using OrthoMCL version 1.4 43 using the default parameter settings. We determined core genes as 1-1 orthologous between P. vivax P01 and Pf3D7, in total 4465.
Cluster analysis based on structural and sequence homology was undertaken to compare the subfamily organization of the pirs in the partial (Salvador-I) versus more complete (PvP01) reference. All PIR encoded protein sequences in Salvador-I and PvP01 with length greater than 150 amino acids and not flagged as pseudogenes were included in the analysis. Low complexity regions were excluded using the SEG program 44 . The relatedness between sequences was assessed using BLASTp (parameters -F F -e 1e-6), and the results were visualized as a network constructed in Gephi 45 . After provisional assessment of cluster resolution at different thresholds, a cutoff of 25% of the global similarity was selected for distinguishing different clusters (subfamilies). To aid comparison against the new PIRs identified in PvP01, the Salvador-I PIRs were colour-coded according to the subfamily classification proposed by Lopez et al 23 .
Further investigation of the diversity and relatedness amongst the PIRs was undertaken using the PIR sets from PvP01, PvT01, PvC01, Salvador-I and Brazil-I. Exclusion of proteins with less than 150 amino acids, filtering of low complexity sequences and relatedness analysis using BLASTp were performed as described above. A network was constructed from the BLAST output using tribeMCL with an inflation of 1.5 46 . To aid visualization, clusters with less than 15 PIRs were excluded.

Dataset validation
The PvP01 assembly was generated as a new reference sequence and is thus a higher quality, more accurately annotated assembly than PvC01 and PvT01, which were both created as draft assemblies for comparative purposes. The PvP01 assembly quality is greatly improved over the previous Salvador-I reference genome, with fragmentation reduced to <250 scaffolds amongst other features (Table 1). At 29 megabases (Mb), the assembly is notably larger than Salvador-I (27 Mb), mainly due to newly assembled subtelomeric sequences. A complete mitochondrial sequence (5 kb) and partial apicoplast sequence (29.6 kb) are also available. As in P. falciparum 47 , the apicoplast reference will facilitate efforts to identify geographic surveillance markers for P. vivax. Whilst the assembly quality in the core region is high in Salvador-I 14 , PvP01 displays improved gene models and has more complete subtelomeres. Figure 1 provides a schematic of the right-hand end of chromosome 12 from PvP01 and Salvador-I, illustrating the generally greater extension into the subtelomeric regions of chromosomes in PvP01. Furthermore, owing to detailed manual curation and continuous maintenance within the GeneDB framework, the level of gene annotation in the core genome of PvP01 greatly exceeds that of the other available P. vivax assemblies. The asexual stage P. vivax RNA-Seq data enabled correction of the structure of 377 genes. Of the 4577 core P. vivax genes with 1:1 orthologues in P. falciparum, 3318 genes were transcribed with RPKM (reads per kilobase of transcript per million mapped reads) values greater than 15, and contained a total of 4887 splice sites. Of these splice sites, a total of 4845 (99.1%) were confirmed by ≥ 10 reads, highlighting the high quality of the structural annotation. Whereas the published Salvador-I reference includes functions attributed to a total of 1783 (38.0%) core genes 14 , we have been able to expand this to 2848 (58.6%) in PvP01, as of the latest GeneDB release (1st September 2016). Ongoing curation on PvP01 will yield further improvements to the annotation statistics, and progress is highlighted in Table 2, which summarizes annotation changes over a 12 month period between GeneDB releases in 2015 and 2016. To date, a total of 1209 genes have been identified in PvP01 that were either completely absent from Salvador-I or have arisen by splitting gene structures that were falsely joined previously (Table 1). Although the majority of newly identified genes belong to subtelomeric gene families, we confirmed the recently identified EBP2 (erythrocyte binding protein 2, PVP01_0102300) and RBP2e (reticulocyte binding protein 2e, PVP01_0700500) genes 11 . These genes are members of families encoding proteins implicated in host cell recognition during red blood cell (RBC) invasion, and present potential vaccine targets 48-51 .   Table 3, the comparatively high assembly quality in the subtelomeres of PvP01 greatly expanded the repertoire of genes belonging to multigene families in these chromosome regions. Notably, more than 1200 pir genes were identified in PvP01 versus 346 in Salvador-I. To generate a snapshot of the diversity and structural organization of this expanded gene family in P. vivax, we conducted cluster analysis of the PIRs in PvP01 with comparison to previous homology classifications performed by Lopez et al on the partial set of PIRs from Salvador-I 23 .

As summarised in
As illustrated in the network diagram in Figure 2a, the main subfamily clusters defined in earlier classifications are expanded but, on addition of the new PvP01 PIRs, the clusters remained moderately stable with no pooling between or sub-structure within subfamilies. However, the new PvP01 PIRs reveal several large subfamilies containing just 1-4 Salvador-I genes that were   Figure 2b). The analysis conducted here provides a broad overview of the diversity and relatedness amongst the expanded P. vivax pir gene sets, however further investigation beyond the scope of this study will be required to provide detailed characterisation of this family and its contribution to virulence and pathophysiology.
The PvP01 reference is an important new resource for the vivax research community. It will support studies of the complex subtelomeric regions and provide insights into the mechanisms by which the gene families in this region contribute to virulence-associated functions. It will also allow investigation of an array of other biological functions that will expand with continual improvements in annotation in the core genome. PvP01, PvC01 and PvT01 add new geographic locations to the collection of P. vivax assemblies, facilitating biological studies of the diversity of this phenotypically divergent species.

Data availability
The  This section will be updated with accession numbers for PvP01 chromosomes onces available.
Author contributions SA, CIN, MB, RNP and TDO conceived the study. QG and FN provided essential resources for the data generation. MS managed the sequencing of the samples. SA, HT and TDO performed analyses. SS performed automated annotation and UB is maintaining the manual annotation and generated statistics on the annotation. JH generated the RNA-Seq data. SA and TDO prepared the first draft of the manuscript. All authors were involved in the revision of the draft manuscript and have agreed to the final content.

Competing interests
No competing interests were disclosed. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Table 1 presented comparison of the genomes the three new sequences with that of Sal I. The PvC01 and PvT01 sequences contained more assigned scaffolds -are these located mostly in the telomeric regions?

Grant information
A more detailed comparison of the temperate strain PvC01 with the tropical strains would be more useful. A big-picture type perspective on the C01 and T01 would be nice. Figure 1 illustrates the extension of the assembled sequences in the subtelomeric region of chrom12 as compared to that in SalI. Are the gap junctions verified by PCR? Also, the PvP01 also has quite some gaps -how are these assembled and verified?
The network presentation of the Pir genes is interesting -A link to the alignment of the sequences or a phylogenetic tree-type of presentation (as supplements) would be very useful.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.

Competing Interests:
perhaps relevant to discuss them as a gene family. It could even be worthwhile to perform a cluster analysis on this "gene family" similar to the one performed on PIR proteins.
On Figure 2B it would be useful to indicate the correspondence between the cluster numbers of this study and the former classification (A-K). Similarly it would be informative to indicate the cluster numbers on Figure 2A.
From Figure 2B it seems that cluster 5 PIR gene subfamily has expanded (substantially more numerous) in the Brazilian isolate. Something perhaps worthwhile mentioning/discussing.