The genome sequence of the orange-tip butterfly, Anthocharis cardamines (Linnaeus, 1758)

We present a genome assembly from an individual female Anthocharis cardamines (the orange-tip; Arthropoda; Insecta; Lepidoptera; Pieridae). The genome sequence is 360 megabases in span. The majority (99.74%) of the assembly is scaffolded into 31 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled. Gene annotation of this assembly on Ensembl has identified 12,477 protein-coding genes.


Background
The orange-tip butterfly (Anthocharis cardamines) is a member of the Anthocharidini, a tribe within the Pierinae (Wahlberg et al., 2014), with a Palearctic distribution, including throughout the British Isles, where two subspecies are recognised, britannica (mainland Britain) and hibernica (Ireland and Isle of Man). The English population exhibits a reduced number of chromosomes (n = 30) compared to specimens from continental Europe (n = 31), implying a fusion event since the separation of England from the continent ~7,000 years ago (Bigger, 1978). Following range contraction on the British mainland in the late 19th century, which left disjunct populations in England and Northeast Scotland, the species began recolonizing these regions in the mid-20th century (Long, 1979), and has shown an increasing trend to the present (Fox et al., 2015). A. cardamines is listed as Least Concern in the IUCN Red List (Europe) (van Swaay et al., 2010). The spring flight period is phenologically responsive to temperature and climate change (Prieto & Destouni, 2015). The species is polyphagous on Brassicaceae, usually Cardamine pratensis and Alliaria petiolata in Britain (Courtney & Duggan, 1983), inhabiting flowery meadows, woodland borders, riverbanks, hedgerows and gardens. Host-plant use influences adult emergence schedule, size and dispersal behaviour (Davies & Saccheri, 2013).

Genome sequence report
The genome was sequenced from a single female A. cardamines (Figure 1) collected from Carrifran Wildwood, Scotland (latitude 55.4001, longitude -3.3352). A total of 69-fold coverage in Pacific Biosciences single-molecule circular consensus (HiFi) long reads and 97-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 50 missing joins or misjoins and removed 2 haplotypic duplications, reducing the assembly length by 0.53% and the scaffold number by 40.00%, and increasing the scaffold N50 by 5.62%.
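As a rough sanity check on these figures, fold coverage is total sequenced bases divided by genome span, and a pre-curation value can be back-calculated from a reported percentage change. The sketch below assumes the 360 Mb span and the 54 final scaffolds reported in this note; the function names are illustrative, and the derived values are approximations, not figures taken from the original pipeline.

```python
def bases_for_coverage(fold_coverage: float, genome_span_bp: float) -> float:
    """Total sequenced bases implied by a fold coverage over a given span."""
    return fold_coverage * genome_span_bp


def value_before_change(final_value: float, fractional_change: float) -> float:
    """Invert a relative change: final = before * (1 + fractional_change)."""
    return final_value / (1.0 + fractional_change)


# Assumption: coverage is expressed relative to the 360 Mb assembly span.
GENOME_SPAN_BP = 360e6

hifi_bases = bases_for_coverage(69, GENOME_SPAN_BP)   # roughly 24.8 Gb of HiFi data
tenx_bases = bases_for_coverage(97, GENOME_SPAN_BP)   # roughly 34.9 Gb of 10X data

# A 40.00% reduction ending at 54 scaffolds implies about 90 before curation.
scaffolds_before_curation = value_before_change(54, -0.40)

print(f"HiFi bases: {hifi_bases / 1e9:.1f} Gb")
print(f"10X bases: {tenx_bases / 1e9:.1f} Gb")
print(f"Scaffolds before curation: {scaffolds_before_curation:.0f}")
```

If coverage was instead computed against an estimated genome size rather than the assembly span, the implied base counts would shift accordingly.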
The final assembly has a total length of 360 Mb in 54 sequence scaffolds with a scaffold N50 of 12.5 Mb (Table 1). The majority, 99.74%, of assembly sequence was assigned to 31 chromosomal-level scaffolds, representing 29 autosomes (numbered by sequence length), and the W and Z sex chromosomes (Figure 2-Figure 5; Table 2). The assembly has a BUSCO
Figure 1. Forewings and hindwings of the Anthocharis cardamines specimen from which the genome was sequenced. Dorsal (top left) and ventral (top right) surface view of wings from specimen SC_AC_1156 (ilAntCard3) and dorsal (bottom left) and ventral (bottom right) surface view of wings from specimen SC_AC_1154 (ilAntCard2) from Scotland, UK. ilAntCard3 was used to generate Pacific Biosciences and 10X Genomics data and ilAntCard2 was used to generate Hi-C data.
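The scaffold N50 quoted above is the length of the scaffold at which the cumulative length, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name and the toy scaffold lengths are illustrative assumptions and are not the lengths of the ilAntCard3.1 scaffolds.

```python
def scaffold_n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    # Walk scaffolds from longest to shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the sum to >= 50% of total.
        if running * 2 >= total:
            return length


# Toy example (hypothetical lengths): total 150, half 75.
# Cumulative sums are 50, then 90, so the N50 is the second scaffold, 40.
toy_lengths = [10, 50, 30, 40, 20]
toy_n50 = scaffold_n50(toy_lengths)
print(toy_n50)
```

Because N50 depends only on the length distribution, joining scaffolds during curation can raise it even when the total assembly length falls slightly.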

Genome annotation report
The ilAntCard3.1 genome has been annotated using the Ensembl rapid annotation pipeline (Table 1; https://rapid.ensembl.org/Anthocharis_cardamines_GCA_905404175.1/). The resulting annotation includes 28,207 transcribed mRNAs from 12,477 protein-coding and 4,279 non-coding genes. There are 1.82 coding transcripts per gene and 8.41 exons per transcript.
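Summary ratios of this kind can in principle be recomputed from a GFF3 annotation file by counting mRNA features per parent gene and exon features per parent mRNA. The sketch below is one possible approach, assuming standard GFF3 `ID` and `Parent` attributes; the function name and the toy records are hypothetical, and because the exact definitions used by the Ensembl pipeline (for example, which transcripts count as coding) are not stated here, the output need not reproduce the figures reported above.

```python
from collections import defaultdict


def annotation_ratios(gff3_lines):
    """Return (mean mRNAs per gene, mean exons per mRNA) from GFF3 lines.

    Only genes with at least one mRNA child contribute to the first mean.
    """
    mrnas_per_gene = defaultdict(int)
    exon_counts = defaultdict(int)
    mrna_ids = set()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        feature_type, attributes = cols[2], cols[8]
        attrs = dict(f.split("=", 1) for f in attributes.split(";") if "=" in f)
        # GFF3 allows several comma-separated parents for one feature.
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        if feature_type == "mRNA":
            mrna_ids.add(attrs.get("ID"))
            for gene_id in parents:
                mrnas_per_gene[gene_id] += 1
        elif feature_type == "exon":
            for transcript_id in parents:
                exon_counts[transcript_id] += 1
    if not mrnas_per_gene:
        raise ValueError("no mRNA features with a Parent attribute found")
    mean_mrnas = sum(mrnas_per_gene.values()) / len(mrnas_per_gene)
    # Restrict exon counts to mRNAs so non-coding transcripts are excluded.
    mean_exons = sum(exon_counts[m] for m in mrna_ids) / len(mrna_ids)
    return mean_mrnas, mean_exons


def row(feature_type, attributes):
    """Build a toy nine-column GFF3 line (all coordinates are placeholders)."""
    return "\t".join(["chr1", "toy", feature_type, "1", "100", ".", "+", ".", attributes])


# Hypothetical records: gene g1 has two mRNAs, gene g2 has one.
toy = [
    "##gff-version 3",
    row("gene", "ID=g1"),
    row("mRNA", "ID=t1;Parent=g1"),
    row("exon", "ID=e1;Parent=t1"),
    row("exon", "ID=e2;Parent=t1"),
    row("exon", "ID=e3;Parent=t1,t2"),
    row("mRNA", "ID=t2;Parent=g1"),
    row("exon", "ID=e4;Parent=t2"),
    row("gene", "ID=g2"),
    row("mRNA", "ID=t3;Parent=g2"),
    row("exon", "ID=e5;Parent=t3"),
]
toy_ratios = annotation_ratios(toy)
print(toy_ratios)
```

In the toy data the three mRNAs carry 3, 2 and 1 exons, so the function reports 1.5 mRNAs per gene and 2.0 exons per mRNA.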

Sample acquisition and nucleic acid extraction
A single female A. cardamines specimen (ilAntCard3; genome assembly) and a single male A. cardamines specimen (ilAntCard2; Hi-C) were collected with a net from Carrifran Wildwood, Scotland (latitude 55.4001, longitude -3.3352) by Sam Ebdon, Gertjan Bisshop and Konrad Lohse (all University of Edinburgh). The samples were identified by Konrad Lohse and were snap-frozen at -80°C.
DNA was extracted at the Scientific Operations Core, Wellcome Sanger Institute. The ilAntCard3 sample was weighed and dissected on dry ice. Abdomen tissue was disrupted by  DNA was sheared into an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics read cloud DNA sequencing libraries were constructed according to the manufacturers' instructions. Sequencing was performed by the Scientific Operations core at the WSI on Pacific Biosciences SEQUEL II (HiFi) and Illumina HiSeq X (10X) instruments. Hi-C data were also generated from whole organism tissue of ilAntCard2 using the Qiagen Hi-C kit and sequenced on an Illumina HiSeq X (10X) instrument.

Genome assembly
Assembly was carried out with HiCanu (Nurk et al., 2020); haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align, calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation (Howe et al., 2021) was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2021), which performed annotation using MitoFinder (Allio et al., 2020). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

Gene annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Anthocharis cardamines assembly (GCA_905404175.1). Annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019).

Data availability
European