The genome sequence of the large white, Pieris brassicae (Linnaeus, 1758)

We present a genome assembly from an individual female Pieris brassicae (the large white; Arthropoda; Insecta; Lepidoptera; Pieridae). The genome sequence is 292 megabases in span. The majority of the assembly is scaffolded into 16 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled. Gene annotation of this assembly on Ensembl has identified 12,229 protein-coding genes.


Introduction
The large white, Pieris brassicae, is a Palearctic butterfly species that is common in Europe, North Africa, and Asia. P. brassicae larvae typically feed on Brassicaceae species, including agriculturally important cultivated species such as Brassica oleracea. It has been unintentionally introduced to New Zealand, Chile, and South Africa, although it was eradicated from New Zealand in 2016 (Phillips et al., 2020). It overwinters as a pupa and is multivoltine. While P. brassicae has been listed as Least Concern in the IUCN Red List (Europe), the Madeiran large white, P. wollastoni (previously considered a subspecies of P. brassicae), has not been observed since 1986 and is possibly extinct (IUCN, 2009). P. brassicae has 15 pairs of chromosomes, with the female being heterogametic (Bigger, 1975). This karyotype is unusual, as species in the genus Pieris typically possess between 24 and 28 pairs of chromosomes (Robinson, 1971). Its genome size has been estimated with flow cytometry at approximately 260 Mb (Mackintosh et al., 2019).

Genome sequence report
Figure 1. Fore and hind wings of Pieris brassicae specimens from which the genome was sequenced. (A) Dorsal surface view of wings from specimen SC_PB_1356 (ilPieBrab1) from East Linton, used to generate Pacific Biosciences and 10X Genomics data. (B) Ventral surface view of wings from specimen SC_PB_1356 (ilPieBrab1) from East Linton, used to generate Pacific Biosciences and 10X Genomics data. (C) Dorsal surface view of wings from specimen SC_PB_1357 (ilPieBrab2) from East Linton, used to generate RNA-Seq data. (D) Ventral surface view of wings from specimen SC_PB_1357 (ilPieBrab2) from East Linton, used to generate RNA-Seq data. (E) Dorsal surface view of wings from specimen SC_PB_1358 (ilPieBrab3) from East Linton, used to generate Hi-C data. (F) Ventral surface view of wings from specimen SC_PB_1358 (ilPieBrab3) from East Linton, used to generate Hi-C data.
The genome was sequenced from a single female P. brassicae (ilPieBrab1) collected from East Linton, Scotland (latitude 55.977161, longitude -2.667545); Hi-C data were generated from a male P. brassicae (ilPieBrab3) collected from the same location (Figure 1). A total of 92-fold coverage in Pacific Biosciences single-molecule long reads (N50 14 kb) and 138-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data, which were obtained from a different individual (ilPieBrab3). Manual assembly curation corrected 25 missing joins, reducing the scaffold number by 5.87% and increasing the scaffold N50 by 1.21%.
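As a rough illustration of what these fold-coverage figures imply, the sketch below converts coverage into an approximate number of sequenced bases. It assumes that coverage was computed relative to the 292 Mb assembly span; the note does not state which genome size was used as the denominator, so the function name and the resulting totals are illustrative only.

```python
def total_bases_for_coverage(fold_coverage: float, genome_span_bp: int) -> float:
    """Approximate number of sequenced bases implied by a fold-coverage figure."""
    return fold_coverage * genome_span_bp


# Assumption: coverage is expressed relative to the 292 Mb assembly span.
ASSEMBLY_SPAN_BP = 292_000_000

pacbio_bases = total_bases_for_coverage(92, ASSEMBLY_SPAN_BP)   # PacBio long reads
tenx_bases = total_bases_for_coverage(138, ASSEMBLY_SPAN_BP)    # 10X read clouds

print(f"PacBio: ~{pacbio_bases / 1e9:.1f} Gb")  # ~26.9 Gb
print(f"10X:    ~{tenx_bases / 1e9:.1f} Gb")    # ~40.3 Gb
```

If coverage were instead expressed against the 260 Mb flow cytometry estimate, the implied totals would be proportionally smaller.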
The final assembly has a total length of 292 Mb in 402 sequence scaffolds with a scaffold N50 of 22 Mb (Table 1). Of the assembly sequence, 95.46% was assigned to 16 chromosomal-level scaffolds, representing 14 autosomes (numbered by sequence length) and the W and Z sex chromosomes (Figure 2-Figure 5; Table 2). The W chromosome is fragmented because the Hi-C data used for scaffolding came from an individual of a different sex (ilPieBrab3). The assembly has a BUSCO (Simão et al., 2015) completeness of 99.0% using the lepidoptera_odb10 reference set. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
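For readers unfamiliar with the summary statistics quoted here, the following minimal sketch shows how a scaffold N50 and the fraction of sequence held in the largest scaffolds can be computed from a list of scaffold lengths. The lengths used are hypothetical toy values, not the P. brassicae scaffolds, and the function names are illustrative rather than taken from any tool used in this note.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def fraction_in_largest(lengths, k):
    """Fraction of total bases contained in the k longest sequences."""
    ordered = sorted(lengths, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical toy scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [40, 30, 20, 5, 3, 2]

print(n50(toy_scaffolds))                     # 30
print(fraction_in_largest(toy_scaffolds, 3))  # 0.9
```

Applied to real data, the same calculation would take the lengths of all 402 scaffolds as input, with k set to 16 for the chromosomal-level scaffolds.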
BlobToolKit blob and cumulative sequence plots show that the W chromosome has regions of microsporidian origin (Figure 3, Figure 4). However, these regions in the read sets are short, do not match across the rest of the scaffold and do not contain any contigs with microsporidian ribosomal subunits. This indicates that the feature is unlikely to be contamination and more likely reflects integration of microsporidian sequence into the genome.

Gene annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Pieris brassicae assembly (GCA_905147105.1, see https://rapid.ensembl.org/Pieris_brassicae_GCA_905147105.1/; Table 1). The annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019) and OrthoDB (Kriventseva et al., 2008). Prediction tools, CPC2 (Kang et al., 2017) and RNAsamba (Camargo et al., 2020), were used to aid determination of protein-coding genes.

Sample acquisition and nucleic acid extraction
One female (ilPieBrab1) and two male (ilPieBrab2, ilPieBrab3) P. brassicae specimens (Figure 1) were collected from East Linton, Scotland (latitude 55.977161, longitude -2.667545) using a net by Konrad Lohse, University of Edinburgh, who also identified the samples. The samples were snap-frozen in liquid nitrogen from live.
DNA was extracted from the whole organism of ilPieBrab1 at the Wellcome Sanger Institute (WSI) Scientific Operations core using the Qiagen MagAttract      (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

Ethical/compliance issues
The materials that have contributed to this genome note were supplied by a Tree of Life collaborator. The WSI employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible.
The overarching areas of consideration are:
- Ethical review of provenance and sourcing of the material;
- Legality of collection, transfer and use (national and international).

Will Nash
The Earlham Institute, Norwich, UK
This article describes the assembly and annotation of the genome of Pieris brassicae, the Large White butterfly. Three individuals were used to generate the data contributing to the assembly and annotation: the first contributed high molecular weight DNA, which was used for PacBio HiFi and 10X sequencing; the second contributed RNA-seq data, which were used in the annotation process; and the third was used to generate Hi-C chromosome conformation data.
The rationale for this project is made clear in a well-written introduction, which guides the reader through the agricultural importance of the species as well as its status as an introduced species in several countries.
The protocols described are exemplary and in line with contemporary studies. The sequencing strategy is appropriate for an invertebrate species with a genome the size of that of P. brassicae. The bioinformatic pipeline used to assemble the HiFi reads, purge haplotigs and polish with 10X data is cutting edge, and is combined with the SALSA2 software to scaffold with Hi-C. The final assembly is well quality controlled, using an appropriate BUSCO database, and BlobToolKit provides interesting information on the potential integration of microsporidian sequence into the genome.
The genome note is of sufficient detail throughout, but I wonder if it would help reproducibility to include the specific settings used for the bioinformatic tools?
Analyses of the datasets generated are presented clearly in large, well-coloured figures with informative captions. Interactive versions of some figures are also available. The assembly data are accessible and linked in the manuscript.

Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others?