The genome sequence of the red admiral, Vanessa atalanta (Linnaeus, 1758) [version 1; peer review: awaiting peer review]

We present a genome assembly from an individual female Vanessa atalanta (the red admiral; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 370 megabases in span. The majority of the assembly (99.44%) is scaffolded into 32 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled. Gene annotation of this assembly on Ensembl has identified 12,493 protein-coding genes.


Background
The red admiral, Vanessa atalanta (Linnaeus, 1758), owes its name to the majesty of its colours: striking orange, dark brown and white. It has a disjunct distribution in the Holarctic, occurring in the west Palaearctic (up to approximately 85 degrees longitude) and in North and Central America (Williams, 1930). The American populations differ slightly in appearance from the Eurasian populations, and are referred to as subspecies rubria (Fruhstorfer, 1909; Vane-Wright & Hughes, 2007). The sister species of V. atalanta is the Kamehameha butterfly (V. tameamea), a species endemic to Hawaii (Wahlberg & Rubinoff, 2011).
Red admirals are well known for their migratory movements: in Europe and North America they migrate latitudinally between the southern parts of their range, where the majority of individuals overwinter as adults and/or larvae, and the northern areas, which are colonized during spring and summer (Brattström et al., 2010; Brattström et al., 2018; Scott, 1992; Walker, 2001; Williams, 1930). The species is listed as "Least concern" according to the IUCN Red List (Europe) (van Swaay et al., 2010). Roy & Sparks (2000) report changes in migration time and length due to global warming, which could be resulting in a northward shift of overwintering latitudes (Fox et al., 2010).
The red admiral is polyvoltine across its migratory range. Females lay their eggs on nettles in the genera Urtica, Boehmeria, Laportea, and Parietaria. Males are markedly territorial, especially during late afternoon. The genome will further aid evolutionary studies of behavioral traits such as migration or male territoriality and the evolution of diapause. Vanessa atalanta has 31 pairs of chromosomes and an estimated genome size of 326 Mb (Mackintosh et al., 2019).

Genome sequence report
The genome was sequenced from a single female V. atalanta (Figure 1A, B) collected from Carrifran Wildwood, Dumfries and Galloway, Scotland (latitude 55.400132, longitude -3.3352). A total of 34-fold coverage in Pacific Biosciences single-molecule long reads (N50 11 kb) and 95-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 16 missing joins or misjoins and removed 59 haplotypic duplications, reducing the assembly size by 0.27% and the scaffold number by 32.86%, and increasing the scaffold N50 by 1.71%.
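As a rough illustration of what "34-fold coverage" means, fold coverage is the total number of sequenced bases divided by the genome span. The sketch below is not part of the original analysis; the function names and the toy read set are hypothetical, and only the 370 Mb span and the 34-fold figure come from this note.

```python
def fold_coverage(read_lengths, genome_size):
    """Mean sequencing depth: total sequenced bases divided by genome span."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def bases_for_coverage(coverage, genome_size):
    """Total bases needed to reach a given mean depth over a genome."""
    return coverage * genome_size


# Hypothetical toy data: 40 reads of 10 kb over a 100 kb genome.
toy_reads = [10_000] * 40
print(fold_coverage(toy_reads, 100_000))  # 4.0

# Arithmetic only (not a figure reported in this note): bases implied by
# 34-fold coverage of a 370 Mb assembly.
print(bases_for_coverage(34, 370_000_000))  # 12580000000
```

The second call is plain arithmetic on the two reported numbers and should be read as an order-of-magnitude guide, since the actual yield depends on the genome size used when the coverage was estimated.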
The final assembly has a total length of 370 Mb in 142 sequence scaffolds with a scaffold N50 of 13 Mb (Table 1). Of the assembly sequence, 99.44% was assigned to 32 chromosomal-level scaffolds, representing 30 autosomes (numbered by sequence length), and the W and Z sex chromosomes (Figure 2-Figure 5; Table 2). The assembly has a BUSCO (Simão et al., 2015) v5.1.2 completeness of 98.8% (single 98.7%, duplicated 0.1%) using the lepidoptera_odb10 reference set. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
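For readers unfamiliar with the N50 statistic quoted for both reads and scaffolds, a minimal sketch of the usual definition follows: the length L such that sequences of length L or longer contain at least half of the total bases. The function name and the toy lengths are hypothetical and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the summed length; the
    length of the sequence that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy scaffold lengths (total 20): 6 + 5 = 11 >= 10, so N50 is 5.
print(n50([2, 3, 4, 5, 6]))  # 5
```

Because the statistic depends only on the length distribution, removing many short scaffolds during curation can change the scaffold count substantially while moving the N50 only slightly.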

Genome sequence report
The ilVanAtal1.1 genome was annotated using the Ensembl annotation pipeline (Table 1; https://rapid.ensembl.org/Vanessa_atalanta_GCA_905147765.1/). The resulting annotation includes 57,591 transcribed mRNAs from 12,493 protein-coding and 2,614 non-coding genes. There are 2.25 transcripts per gene and 11.26 exons per transcript. The earlier version of the assembly, ilVanAtal1.1, was annotated, but the changes made to the assembly for ilVanAtal1.2 were minor and will not affect the annotation.

Methods
Sample acquisition and nucleic acid extraction
Two female V. atalanta specimens (ilVanAtal1 and ilVanAtal2; Figure 1) were collected from Carrifran Wildwood, Dumfries and Galloway, Scotland (latitude 55.400132, longitude -3.3352) by Konrad Lohse, University of Edinburgh, using a net. The samples were identified by the same individual and snap-frozen in liquid nitrogen.
DNA was extracted from whole organism tissue of ilVanAtal1 at the Wellcome Sanger Institute (WSI) Scientific Operations core using the Qiagen MagAttract HMW DNA kit, according to the manufacturer's instructions. RNA was extracted from whole organism tissue of ilVanAtal2 in the Tree of Life Laboratory at the WSI using TRIzol (Invitrogen), according to the manufacturer's instructions. RNA was then eluted in 50 μl RNAse-free water and its concentration assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. RNA integrity was assessed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics Chromium read cloud sequencing libraries were constructed according to the manufacturers' instructions. Poly(A) RNA-Seq libraries were constructed using the NEB Ultra II RNA Library Prep kit. Sequencing was performed by the Scientific Operations core at the Wellcome Sanger Institute on Pacific Biosciences SEQUEL II (HiFi), Illumina HiSeq X (10X) and Illumina HiSeq 4000 (RNA-Seq) instruments. Hi-C data were generated from head tissue using the Arima v1 Hi-C kit and sequenced on HiSeq X.

Genome assembly
Assembly was carried out with HiCanu (Nurk et al., 2020). Haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align and calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2021), which performed annotation using MitoFinder (Allio et al., 2020). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

Gene annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for version 1 of the Vanessa atalanta assembly (GCA_905147765.1). The annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select

Ethical/compliance issues
The materials that have contributed to this genome note were supplied by a Tree of Life collaborator. The Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible.
The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material;
• Legality of collection, transfer and use (national and international).

Comments on this article Version 1
Reader Comment 10 Jan 2022
Jacques Dainat, CNRS, France
I don't want to cast stones, because all papers describing genome annotations do the same, but I find it a pity that BUSCO results for assembly completeness are always mentioned while annotation completeness (gene predictions) never is. Based on the translation of the protein-coding genes, the BUSCO annotation completeness is very useful for giving a picture of how complete the annotation is. The comparison between the assembly completeness score and the annotation completeness score shows the potential room for improvement of the annotation. It makes it easy to see the proportion of genes present in the assembly that are not present in the final gene build. In your case, the BUSCO for the assembly is 98.8% but the BUSCO for the annotation might be around 50%. That type of result can be alarming, which is why researchers do not like to show it, but such a result does not mean that the annotation is bad. It reflects a choice of annotation approach. Evidence-based annotations can provide really good gene models (a low number of false positives and a high number of high-confidence gene models, for protein-coding as well as non-coding genes), but can miss a lot of genes. Other approaches, such as ab initio approaches, might miss fewer genes but can, at the same time, predict a lot of false positives. Some people do not care about false positives and want to know whether they can trust blastP on the proteome when it says that a gene/protein is missing. In such a case, if we had the BUSCO annotation score, we would know whether it would be better to run tBlastn on the assembly instead of blastP on the proteome.
I think that the Darwin Tree of Life Projects should include that information by default.

Competing Interests:
No competing interests were disclosed.