The genome sequence of the black-veined white butterfly, Aporia crataegi (Linnaeus, 1758) [version 1; peer review: 1 approved, 1 not approved]

We present a genome assembly from an individual male Aporia crataegi (the black-veined white; Arthropoda; Insecta; Lepidoptera; Pieridae). The genome sequence is 230 megabases in span. The complete assembly is scaffolded into 26 chromosomal pseudomolecules, with the Z sex chromosome assembled. Gene annotation of this assembly on Ensembl has identified 10,860 protein coding genes.


Background
The black-veined white (Aporia crataegi) is a large butterfly with distinctive venation on its wings. This species is oligophagous with a larval host plant preference for Prunus and Crataegus spp. and is often considered a pest species in orchards (Jugovic et al., 2017; Manley, 2008). It is found in a wide variety of habitats including dry grassland, woodland edges, and shrubland (Tolman & Lewington, 2008). Aporia crataegi is found across the Palaearctic, with populations present in northwest Africa, as well as across Europe and Asia. The butterfly disappeared from the British Isles around 1925, and the last British specimens were collected from Herne Bay in Kent during the 1920s (Todisco et al., 2020). It is not understood why the species disappeared from the British Isles; however, climate variability along with other concurrent detrimental conditions, such as parasites, disease, or predation, have been suggested as potential reasons (Pratt, 1983). Several reintroductions have been attempted, but all have been unsuccessful (Asher et al., 2001), including one purportedly by Winston Churchill after the end of World War II. Given the butterfly's wide Palaearctic distribution, it remains listed as a species of least concern, but more recently it has been reported as extinct in the Czech Republic, the Netherlands (Van Swaay et al., 2010), and likely South Korea (Kim et al., 2015). Additionally, abundance and/or range are declining in Austria, Luxembourg, Romania, Ukraine, Albania, France, Latvia, Norway and Serbia (Van Swaay et al., 2010). No clear consensus exists on the reasons for these declines. We expect that the assembly reported here will facilitate conservation genomic approaches, shedding light on this species' current status (Todisco et al., 2020). In particular, it will be a valuable resource for any future reintroductions, monitoring, and other local conservation efforts.

Genome sequence report
The genome was sequenced from a single female A. crataegi (Figure 1) collected from Planoles Station, Catalunya, Spain (latitude 42.3136, longitude 2.0996). A total of 101-fold coverage in Pacific Biosciences single-molecule circular consensus (HiFi) long reads and 147-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 4 missing/misjoins and removed 5 haplotypic duplications, reducing the assembly length by 0.37% and the scaffold number by 7.14%. The final assembly has a total length of 230 Mb in 26 sequence scaffolds with a scaffold N50 of 25.5 Mb (Table 1). The complete assembly sequence was assigned to 26 chromosomal-level scaffolds, representing 25 autosomes (numbered by sequence length), and the Z sex chromosome (Figure 2-Figure 5; Table 2). The assembly has a BUSCO v5.1.2 (Manni et al., 2021) completeness of 98.5% (single 97.8%, duplicated 0.6%) using the lepidoptera_odb10 reference set (n=5286). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
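The summary statistics above can be illustrated with a minimal Python sketch. It is not the pipeline used for this assembly: the scaffold lengths are invented toy values, the pre-curation count of 28 scaffolds is an assumption that happens to be consistent with the reported 7.14% reduction to 26, and the read yield assumes that fold coverage was computed against the 230 Mb assembly span.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total span."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def percent_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical scaffold lengths in Mb (illustrative only, not the real assembly).
toy_lengths = [30, 25, 20, 15, 10]
toy_n50 = n50(toy_lengths)

# Assumption: 28 scaffolds before curation, 26 after (26 is from the report).
scaffold_drop = percent_reduction(28, 26)

# Assumption: coverage is relative to the 230 Mb assembly span.
hifi_yield_gb = 101 * 230 / 1000   # approximate HiFi bases, in Gb
tenx_yield_gb = 147 * 230 / 1000   # approximate 10X bases, in Gb
```

Under these assumptions the toy N50 is 25 Mb, the scaffold reduction rounds to 7.14%, and the approximate yields are about 23 Gb of HiFi data and about 34 Gb of 10X data. The true data volumes would need to be taken from the sequencing run records.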

Genome annotation report
The ilApoCrat1.1 genome has been annotated using the Ensembl rapid annotation pipeline (Table 1; https://rapid.ensembl.org/Aporia_crataegi_GCA_912999735.1/). The resulting annotation includes 17,867 transcribed mRNAs from 10,860 protein-coding genes. [...] (RNA-Seq) instruments. Hi-C data were also generated from remaining whole organism tissue of ilApoCrat1 using the Arima v2 Hi-C kit and sequenced on an Illumina NovaSeq 6000 instrument.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021); haplotypic duplication was identified and removed with [...] et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align, calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination as described previously (Howe et al., 2021). Manual curation (Howe et al., 2021) was performed using HiGlass (Kerpedjiev et al., 2018) and Pretext. The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2021), which performed annotation using MitoFinder (Allio et al., 2020). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

Olli-Pekka Smolander
Department of Chemistry and Biotechnology, Tallinn University of Technology, Tallinn, Estonia

The authors present the documentation of a genome sequencing and assembly of a single individual of A. crataegi. In general the work and results are reported with good rigour. However, there are a few points which in my opinion should be improved.
○ In the abstract it is said that the assembly is from an individual male A. crataegi. However, in the Genome sequence report, it is said that the genome was sequenced from a single female. As it clearly cannot be both, please correct one or the other. To me it seems that the male would be the correct one.
○ It might be beneficial to also report the length distributions of the transcripts.
○ As manufacturers' instructions may change, or there may be several options to choose from, how the sequencing libraries were made should be reported in more detail. In the current form, it is not possible to replicate the work with certainty.
○ For the RNA-Seq and Hi-C, it would be interesting to see the amounts of the generated data. Similarly, for the PacBio and 10x read clouds, the amount of data would be a more exact way to report than the average coverage.
○ Genome assembly methodology should be reported in somewhat increased detail. If there are modifications to default parameters for used software, those should be reported, or if default parameters are used, it should also be mentioned. Similarly, it would be good to report how many contigs or scaffolds were produced in each stage. From the description it is not entirely clear how the polishing was made after calling the variants, i.e., how was the consensus sequence produced?

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound?
annotation result.

2. Do all the software tools use default parameters? If not, specific parameter settings should be given.
Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes