The genome sequence of the small skipper, Thymelicus sylvestris (Poda, 1761) [version 1; peer review: awaiting peer review]

We present a genome assembly from an individual male Thymelicus sylvestris (the small skipper; Arthropoda; Insecta; Lepidoptera; Hesperiidae). The genome sequence is 471 megabases in span. The majority of the assembly (99.97%) is scaffolded into 27 chromosomal pseudomolecules, with the Z sex chromosome assembled. The mitochondrial genome was also assembled and is 17.1 kilobases in length.


Background
The small skipper (Thymelicus sylvestris) is a butterfly within the skipper family Hesperiidae. The skippers are named for their characteristic quick, darting flight. The common name of T. sylvestris is a clear reference to its small size: the adult wingspan ranges from 27 to 34 mm (Tomlinson & Still, 2002). However, it is not the smallest of the skippers, with four other British species being of equivalent size or smaller. Like other skippers, T. sylvestris has golden-orange wings with clear sex brands on males, but it can be distinguished by the lack of coloured patches on its wings and the dull brown or orange colouration of its antennae (Tomlinson & Still, 2002).
Thymelicus sylvestris is widespread across the European continent with a habitat range paralleling that of other skipper species. This range encompasses the northernmost reaches of Morocco and Algeria all the way to the bordering regions between the Baltic states and Russia (Tolman & Lewington, 2008). However, it is noticeably absent from northern Scandinavia, Corsica and Sardinia (Tolman & Lewington, 2008). In the British Isles the small skipper is found across most of Wales and England with recent trends showing a northward expansion in range, beyond the England-Scotland border.
Recently, T. sylvestris individuals have also been observed in Ireland, where they had not been reported previously (Harding & Jacob, 2013). Thymelicus sylvestris populations appear stable and it is listed as a species of least concern by the IUCN (van Swaay et al., 2010).
The small skipper is a habitat generalist (Louy et al., 2007) and can be found in open areas with long grass, such as rough grasslands and roadside verges (Tomlinson & Still, 2002). It is most associated with Yorkshire fog (Holcus lanatus), its main food plant, on which it often basks and lays eggs from June to July. Females are known to be meticulous with their choice of oviposition sites, spending up to 15 minutes inspecting potential host plants prior to laying eggs (Tolman & Lewington, 2008). After approximately a month, eggs hatch into caterpillars which develop through 5 instar stages. Come winter, caterpillars spin cocoons within which they undergo diapause. The caterpillars re-emerge in spring, constructing a 'leaf tube' by joining together the ends of a leaf, where they live and feed, moving to new leaves as necessary. Small skipper caterpillars usually pupate by June, with adult butterflies emerging in July, to spend their remaining days in tall grassland until the summer's end in September.

Genome sequence report
The genome was sequenced from a single male T. sylvestris collected from Ruan Minor, Cornwall, UK (latitude 49.9942295, longitude -5.1974720) (Figure 1). A total of 40-fold coverage in Pacific Biosciences single-molecule long reads and 63-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 9 missing/misjoins and removed 3 haplotypic duplications, reducing the assembly size by 0.06% and scaffold number by 20.00%.
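The curation and coverage figures above can be turned into approximate absolute numbers. The following is a minimal sketch, not part of the authors' pipeline: the function names are hypothetical, and it assumes that the reported percentage reductions are relative to the pre-curation assembly and that the final assembly has 32 scaffolds and a span of about 471 Mb, as reported for the final assembly. Because the published figures are rounded, the results are estimates only.

```python
def value_before_reduction(final_value: float, percent_reduction: float) -> float:
    """Back-calculate a starting value from its final value and the
    percentage by which it was reduced (assumed relative to the start)."""
    if not 0 <= percent_reduction < 100:
        raise ValueError("percent_reduction must be in the range [0, 100)")
    return final_value / (1.0 - percent_reduction / 100.0)


def sequenced_gigabases(assembly_span_mb: float, fold_coverage: float) -> float:
    """Estimate total sequenced bases (in Gb) as span (Mb) x coverage / 1000."""
    return assembly_span_mb * fold_coverage / 1000.0


if __name__ == "__main__":
    # Assumed inputs taken from the rounded figures in the text.
    final_scaffolds = 32
    final_span_mb = 471.0

    scaffolds_before = value_before_reduction(final_scaffolds, 20.0)
    span_before_mb = value_before_reduction(final_span_mb, 0.06)

    print(f"Estimated scaffolds before curation: {scaffolds_before:.0f}")
    print(f"Estimated span before curation: {span_before_mb:.2f} Mb")
    print(f"Approximate PacBio data: {sequenced_gigabases(final_span_mb, 40):.1f} Gb")
    print(f"Approximate 10X data: {sequenced_gigabases(final_span_mb, 63):.1f} Gb")
```

The coverage estimate uses the assembly span as a stand-in for genome size, which is itself an approximation.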
The final assembly has a total length of 471 Mb in 32 sequence scaffolds with a scaffold N50 of 17 Mb (Table 1). Of the assembly sequence, 99.97% was assigned to 28 chromosomal-level scaffolds, representing 27 autosomes (numbered by sequence length), and the Z sex chromosome (Figure 2-Figure 5; Table 2). The assembly has a BUSCO (Simão et al., 2015) v5.1.2 completeness of 98.5% (single 98.1%, duplicated 0.5%) using the lepidoptera_odb10 reference set. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
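For readers unfamiliar with the scaffold N50 statistic quoted above, the sketch below shows one common way to compute it: sort the scaffolds from longest to shortest and report the length of the scaffold at which the running total first reaches half of the assembly span. The function names and the toy lengths are hypothetical and are not taken from the assembly. The second helper assumes, purely for illustration, that the chromosomal scaffolds are the longest ones.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the span is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def fraction_in_longest(lengths: Sequence[int], n: int) -> float:
    """Fraction of the total span contained in the n longest scaffolds."""
    ordered = sorted(lengths, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


if __name__ == "__main__":
    # Toy example with made-up lengths (arbitrary units), not real data.
    toy_lengths = [10, 8, 6, 4, 2]
    print("N50:", n50(toy_lengths))
    print("Fraction in 3 longest:", fraction_in_longest(toy_lengths, 3))
```

Applied to the real scaffold lengths (for example, read from the index of the deposited FASTA file), the same calculation should reproduce the reported N50 up to rounding.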

Methods
Specimen acquisition and nucleic acid extraction
Three male T. sylvestris specimens (ilThySylv1, ilThySylv2 and ilThySylv3) were collected from Ruan Minor, Cornwall, UK (latitude 49.9942295, longitude -5.1974720) using a net by Alex Hayward in May 2019. The specimens were identified by the collector and snap-frozen on dry ice.

Biosciences SEQUEL II (HiFi), Illumina HiSeq X (10X) and Illumina HiSeq 4000 (RNA-Seq) instruments. Hi-C data were generated from head tissue of ilThySylv3 in the Tree of Life Laboratory using the Arima Hi-C+ kit and sequenced on an Illumina NovaSeq 6000 instrument.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021). Haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align, calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2021). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.
The genome sequence is released openly for reuse. The T. sylvestris genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. The genome will be annotated using the RNA-Seq data and presented through the Ensembl pipeline at the European Bioinformatics Institute. Raw data and assembly accession identifiers are reported in Table 1.

Table 3. Software tools used.
Software tool Version Source