The genome sequence of the small white, Pieris rapae (Linnaeus, 1758) [version 1; peer review: awaiting peer review]

We present a genome assembly from an individual female Pieris rapae (the small white; Arthropoda; Insecta; Lepidoptera; Pieridae). The genome sequence is 256 megabases in span. The majority of the assembly is scaffolded into 26 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled. Gene annotation of this assembly on Ensembl has identified 12,390 protein-coding genes.


Introduction
Pieris rapae, commonly known as the small white or the small cabbage white, is a widespread butterfly found in Europe, north-west Africa, and Asia, as well as in North America, Australia, and New Zealand, where it has been introduced (and is known as the "imported cabbage worm"). Found throughout the British Isles, this bivoltine butterfly can be seen on the wing from spring until autumn. In warmer localities this species extends its flight period and the number of generations. The caterpillars feed on a range of Brassicaceae, and the species overwinters as pupae. Pieris rapae has a long history of consuming agricultural crops and has spread as a human commensal (Ryan et al., 2019). Despite recent improvements, it has declined overall in abundance and occurrence in the UK over the last 50 years (Fox et al., 2015), but it is listed as Least Concern in the IUCN Red List (Europe) (van Swaay et al., 2010). Pieris rapae has 25 pairs of chromosomes, a genome size of approximately 245.9 Mb (Shen et al., 2016), and is female heterogametic (WZ). We note the recent production of a high-quality genome assembly for P. rapae (Shen et al., 2016), and believe the sequence described here, generated as part of the Darwin Tree of Life project, will further aid understanding of the biology and ecology of this butterfly.

Genome sequence report
The genome was sequenced from a single female P. rapae collected from East Linton, East Lothian, Scotland, UK (latitude 55.977161, longitude -2.667545) (Figure 1). A total of 56-fold coverage in Pacific Biosciences single-molecule long reads (N50 14 kb) and 157-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected five missing/misjoins and removed one haplotypic duplication, reducing the scaffold number by 9.30%.
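As a rough illustration of what the fold-coverage figures above imply, the sketch below converts coverage into an approximate total of sequenced bases. It assumes coverage is expressed relative to the 256 Mb assembly span; the source does not state which genome size was used as the denominator, and the helper names `fold_coverage` and `bases_for_coverage` are hypothetical.

```python
def fold_coverage(total_bases, genome_size):
    """Return fold coverage: sequenced bases divided by genome size (both in bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def bases_for_coverage(coverage, genome_size):
    """Return the number of sequenced bases implied by a given fold coverage."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return coverage * genome_size


# Assumption: the 256 Mb assembly span is used as the genome size.
ASSEMBLY_SPAN_BP = 256_000_000

pacbio_bases = bases_for_coverage(56, ASSEMBLY_SPAN_BP)   # PacBio long reads
tenx_bases = bases_for_coverage(157, ASSEMBLY_SPAN_BP)    # 10X Genomics read clouds

print(f"PacBio: ~{pacbio_bases / 1e9:.1f} Gb")  # ~14.3 Gb
print(f"10X:    ~{tenx_bases / 1e9:.1f} Gb")    # ~40.2 Gb
```

These totals are back-calculated estimates only; the actual read yields are given by the raw data accessions rather than by this arithmetic.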
The final assembly has a total length of 256 Mb in 40 sequence scaffolds with a scaffold N50 of 11 Mb (Table 1). Of the assembly sequence, 99.8% was assigned to 26 chromosomal-level scaffolds, representing 24 autosomes (numbered by sequence length), and the W and Z sex chromosomes (Figures 2 to 5; Table 2). The assembly has a BUSCO (Simão et al., 2015) v5.1.2 completeness of 98.8% (single 98.4%, duplicated 0.4%, fragmented 0.2%, missing 1.0%) using the lepidoptera_odb10 reference set. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
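For readers unfamiliar with the scaffold N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed: scaffolds are sorted from longest to shortest, and N50 is the length of the scaffold at which the running total first reaches half of the assembly length. The function name `n50` and the toy lengths are illustrative assumptions, not the actual scaffold lengths of this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy scaffold lengths (arbitrary units), not real data.
toy_lengths = [40, 30, 20, 10]
print(n50(toy_lengths))  # 30: 40 + 30 = 70 reaches half of the total (50)
```

Applied to the real scaffold lengths, which could be read from a FASTA index of the deposited assembly, the same procedure should give the value reported in Table 1.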

Gene annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Pieris rapae assembly (GCA_905147795.1, see https://rapid.ensembl.org/Pieris_rapae_GCA_905147795.1/; Table 1). The annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019) and OrthoDB (Kriventseva et al., 2008). Prediction tools, CPC2 (Kang et al., 2017) and RNAsamba (Camargo et al., 2020), were used to aid determination of protein-coding genes.

Sample acquisition and nucleic acid extraction
A single female P. rapae was collected from East Linton, Scotland (latitude 55.977161, longitude -2.667545) using a net by Konrad Lohse, University of Edinburgh, who also identified the sample. The sample was snap-frozen in liquid nitrogen.
DNA was extracted from the whole organism of ilPieRapa1 at the Wellcome Sanger Institute (WSI) Scientific Operations core using the Qiagen MagAttract HMW DNA kit, according to the manufacturer's instructions. RNA (also from the whole organism) was extracted in the Tree of Life Laboratory at the WSI using TRIzol, according to [...] HiSeq 4000 (RNA-Seq) instruments. Hi-C data were also generated from the whole organism using the Arima v1.0 kit and sequenced on HiSeq X.

Genome assembly
Assembly was carried out with HiCanu (Nurk et al., 2020). Haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align, calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

Ethical/compliance issues
The materials that have contributed to this genome note were supplied by a Tree of Life collaborator. The Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible.
The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material;
• Legality of collection, transfer and use (national and international).
Each transfer of samples is undertaken according to a Research Collaboration Agreement or Material Transfer Agreement entered into by the Tree of Life collaborator, Genome Research Limited (operating as the Wellcome Sanger Institute) and in some circumstances other Tree of Life collaborators.
The genome sequence is released openly for reuse. The P. rapae genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. Raw data and assembly accession identifiers are reported in Table 1.