The genome sequence of the small tortoiseshell butterfly, Aglais urticae (Linnaeus, 1758) [version 1; peer review: 1 approved, 1 approved with reservations]

We present a genome assembly from an individual female Aglais urticae (also known as Nymphalis urticae; the small tortoiseshell; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 384 megabases in span. The majority of the assembly is scaffolded into 32 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled.


Introduction
Aglais urticae (also known as Nymphalis urticae), the small tortoiseshell, is a widespread butterfly found in temperate regions from western Europe to Japan. Occasionally, individuals are reported from Eastern North America. Known as slige an t-sligeanaich bheag in Scottish Gaelic, it is ubiquitous in the British Isles despite facing declines in population size over the past 50 years (Fox et al., 2015). Adults can be seen on the wing from spring to autumn or overwintering in sheds and outhouses. As indicated by its name, the caterpillars can be seen feeding on nettles (Urtica dioica and U. urens) over two generations in the summer (with the exception of the Scottish populations, which regularly have just one annual generation). Variation in wing morphology has led to suggestions of a suite of subspecies across the range although evidence for evolutionary lineages using mitochondrial markers is inconclusive (Vandewoestijne et al., 2004). This species is listed as Least Concern in the IUCN Red List (Europe) (van Swaay et al., 2010). A. urticae has 31 pairs of chromosomes (Beliajeff, 1930) and the female is heterogametic (WZ).

Genome sequence report
Figure 1. Fore and hind wings of Aglais urticae specimens from which the genome was sequenced. (A) Dorsal surface view of wings from specimen ilAglUrti1 (SC_AU_1387) from Carrifran Wildwood, Scotland used to generate Pacific Biosciences and 10X genomics data. (B) Ventral surface view of wings from specimen ilAglUrti1 (SC_AU_1387) from Carrifran Wildwood, Scotland used to generate Pacific Biosciences and 10X genomics data. (C) Dorsal surface view of wings from specimen ilAglUrti2 (SC_AU_1351) from Falkland, Scotland used to generate Hi-C data. (D) Ventral surface view of wings from specimen ilAglUrti2 (SC_AU_1351) from Falkland, Scotland used to generate Hi-C data.

The genome was sequenced from one female A. urticae (ilAglUrti1) collected from Carrifran Wildwood, Dumfries and Galloway, Scotland (latitude 55.400132, longitude -3.3352); Hi-C data were obtained from a second female A. urticae (ilAglUrti2) collected from Falkland, Fife, Scotland (latitude 56.25567, longitude -3.210498) (Figure 1). A total of 58-fold coverage in Pacific Biosciences single-molecule long reads (N50 15 kb) and 92-fold coverage in 10X Genomics read clouds (from molecules with an estimated N50 of 41 kb) were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 12 missing/misjoins and removed one haplotypic duplication, reducing the assembly length by 0.01% and the scaffold number by 12.82%, and increasing the scaffold N50 by 0.15%. The final assembly has a total length of 384 Mb in 35 sequence scaffolds with a scaffold N50 of 13.17 Mb (Table 1). The assembly sequence was assigned to 32 chromosomal-level scaffolds, representing 30 autosomes (numbered by sequence length), and the W and Z sex chromosomes (Figure 2 to Figure 5; Table 2). The assembly has a BUSCO (Simão et al., 2015) v5.1.2 completeness of 98.8% using the lepidoptera_odb10 reference set. While not fully phased, the assembly deposited is of one haplotype.
Contigs corresponding to the second haplotype have also been deposited.
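For readers less familiar with the statistic, the scaffold N50 quoted above is the length L such that scaffolds of length L or longer together contain at least half of the total assembly span. The following is a minimal Python sketch of that definition only. The function name `n50` and the toy scaffold lengths are illustrative assumptions; they are not the A. urticae scaffolds and not part of the assembly pipeline described in this note.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half the total span.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_span = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_span:
            return length


# Hypothetical scaffold lengths in Mb (made up for illustration).
toy_scaffolds = [20, 15, 10, 5, 5]
print(n50(toy_scaffolds))  # prints 15
```

In the toy example the total span is 55, so half is 27.5; the two longest scaffolds (20 + 15 = 35) are the first to reach it, which gives an N50 of 15.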
DNA was extracted from the whole organism of ilAglUrti1 at the Wellcome Sanger Institute (WSI) Scientific Operations core using the Qiagen MagAttract HMW DNA kit, according [...] MitoHiFi (Uliano-Silva et al., 2021). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

The materials that have contributed to this genome note were supplied by a Tree of Life collaborator. The Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible.
The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material;
• Legality of collection, transfer and use (national and international).
The genome sequence is released openly for reuse. The A. urticae genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. The genome will be annotated using the RNA-Seq data and presented through the Ensembl pipeline at the European Bioinformatics Institute. Raw data and assembly accession identifiers are reported in Table 1.

Reuben Nowell
Department of Zoology, University of Oxford, Oxford, UK

The manuscript of Bishop et al. presents the high quality genome data and assembly for the small tortoiseshell butterfly Aglais urticae.
It is concise and easy to read, with clear links to the raw datasets and the final assembly. I particularly like the use of interactive data visualisations. Overall I am convinced as to the high quality of the work and I'm sure the data will be a great resource to the community.
I have a few small comments for improvements:
1. I think it would be useful to see a bit more detail given for some of the methods. For example, it is stated "Manual assembly curation corrected 12 missing/misjoins and removed one haplotypic duplication", but there are no details (beyond the tools used) as to what these manual steps involved. Similarly, it is not clear exactly how polishing was performed, beyond the fact that the FreeBayes program was used. It would be helpful to the community to include the actual program commands and/or parameters and flags etc. that were executed for these steps. Perhaps these could be added to Table 3, which already helpfully provides the versioned software used? I don't think every last detail is required, just any non-standard steps (e.g., FreeBayes is a variant-calling tool, so it's unclear to me from the manuscript how this was used to polish).
2. Figure 1 is a bit blurry; I can only just read the text. The same goes for Figure 5, although perhaps this is less important (no text).
3. Any idea what those two high GC, high(ish) coverage small scaffolds are on the blobplot?
Is the rationale for creating the dataset(s) clearly described?
even in a standard fashion that can be generic and used for all genomes shown. For Figure 5, it would greatly help to have the assigned chromosomes indicated along one axis, as I'm left wondering what the chromosomal group in the middle is likely to be (should be W, no?).
My final comment is that while all of the data files and software used are clearly reported, none of the bioinformatic commands used for the assembly were reported. I find this rather unfortunate and a missed opportunity. While these may be default implementations of simple command line operations, the reporting of these would help standardize method deployment within the genomics community and thus I strongly suggest that the Darwin Tree of Life consortium refer to versioned command line operations used for their assemblies.
Minor comments: Figure 1 should be in higher resolution, as this is an important image as voucher information for the specimen.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes