The genome sequence of the yellow-tail moth, Euproctis similis (Fuessly, 1775)

We present a genome assembly from an individual male Euproctis similis (the yellow-tail; Arthropoda; Insecta; Lepidoptera; Lymantriidae). The genome sequence is 508 megabases in span. Over 99% of the assembly is scaffolded into 22 chromosomal pseudomolecules, with the Z sex chromosome assembled. The complete mitochondrial genome, 15.5 kb in length, was also assembled.


Introduction
Euproctis similis, the yellow-tail moth, is widespread across temperate Europe and Asia. In the UK, the moth is relatively common across much of England and Wales, with scattered records from southern Scotland and Northern Ireland. The larvae of E. similis feed on a range of deciduous trees and shrubs, including Crataegus, Prunus, and Betula, and in some situations become a pest of ornamental and fruit trees. The larvae are also notable for bearing long hairs that can cause skin irritation in humans, although the effects are rarely as serious as those caused by larvae of the closely related Euproctis chrysorrhoea (brown-tail). A genome sequence for E. similis may therefore have agricultural and biomedical relevance, in addition to its use in evolutionary biology, ecology and genome biology. The karyotype of E. similis has previously been recorded as n=22 or 23 (Belyakova & Lukhtanov, 1994). This is not unexpected, since Lepidoptera exhibit considerable variation in chromosome number, although n=31 is the most common karyotype (Ahola et al., 2014). The genome of E. similis was sequenced as part of the Darwin Tree of Life Project, a collaborative effort to sequence all of the named eukaryotic species in the Atlantic Archipelago of Britain and Ireland. Here we present a chromosomally complete genome sequence for E. similis, based on one male specimen from Wytham Woods, Oxfordshire (biological vice-county: Berkshire), UK.

Amendments from Version 1
The Introduction has been expanded to include further information about the habitat and distribution of the species and the potential uses for the genome assembly.
Details of the RNAseq data accession, which were omitted in v1, have been included, alongside details of the intended use for these data in the Data availability section. The legends to Figure 2 and Figure 5 (formerly Figure 1 and Figure 4) have been expanded to aid understanding.
Other minor changes requested by reviewers have been made.
An image of the E. similis specimen has been included as Figure 1.
Any further responses from the reviewers can be found at the end of the article.

Figure 1. Image of the Euproctis similis specimen (ilEupSimi1) used for genome sequencing. Image captured during preservation and processing. Specimen is shown below a FluidX storage tube 43.9 mm in length.

Genome sequence report
The genome was sequenced from a single male E. similis (Figure 1) collected from Wytham Woods, Oxfordshire (biological vice-county: Berkshire), UK (latitude 51.772, longitude -1.338). A total of 70-fold coverage in Pacific Biosciences single-molecule long reads (N50 17 kb) and 78-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 40 missing/misjoins and removed 3 haplotypic duplications, reducing the assembly length by 0.10% and the scaffold number by 42.00%, and increasing the scaffold N50 by 14.24%. The final assembly has a total length of 508 Mb in 30 sequence scaffolds with a scaffold N50 of 24 Mb (Table 1). Over 99.9% of the assembly sequence was assigned to 22 chromosomal-level scaffolds, representing 21 autosomes (numbered by sequence length) and the Z sex chromosome (Figures 2-5; Table 2). The assembly has a BUSCO (Simão et al., 2015) v5.1.2 completeness of 98.6% using the lepidoptera_odb10 reference set. The complete, unbroken mitochondrial genome was assembled and is 15.5 kb in length. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
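For readers less familiar with the contiguity statistic quoted above, the scaffold N50 is the length L such that scaffolds of length L or longer together make up at least half of the total assembly span. The following is a minimal sketch of that calculation, not the code used to produce the reported figures. The function name `n50` and the toy scaffold lengths are illustrative assumptions; the real scaffold lengths are those of the deposited assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_span = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_span:
            return length


# Hypothetical toy scaffold lengths in Mb (NOT the E. similis scaffolds).
toy_scaffolds_mb = [30, 26, 24, 20, 10, 5, 2]
print(n50(toy_scaffolds_mb))
```

Because the statistic depends only on the sorted lengths, removing short scaffolds or joining scaffolds during curation can change the N50 even when the total span barely changes.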

Methods
A single male E. similis, ilEupSimi1, was collected from Wytham Woods, Oxfordshire (biological vice-county: Berkshire), UK (latitude 51.772, longitude -1.338) by Douglas Boyes, University of Oxford, using a light trap. The specimen was snap-frozen in dry ice using a CoolRack before being transferred to the Wellcome Sanger Institute (WSI).
DNA was extracted at the Tree of Life laboratory, WSI. The ilEupSimi1 sample was weighed and dissected on dry ice, with tissue set aside for RNA extraction and Hi-C sequencing. Thorax/abdomen tissue was cryogenically disrupted to a fine powder using a Covaris cryoPREP Automated Dry Pulveriser, receiving multiple impacts. Fragment size analysis of 0.01-0.5 ng of DNA was then performed using an Agilent FemtoPulse. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Low molecular weight DNA was removed from a 200-ng aliquot of extracted DNA using a 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared to an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.
RNA was extracted from thorax/abdomen tissue in the Tree of Life Laboratory at the WSI using TRIzol (Invitrogen), [...] (10X) and Illumina HiSeq 4000 (RNA-Seq) instruments. Hi-C data were generated from head tissue using the Qiagen EpiTect Hi-C kit and sequenced on HiSeq X.
Assembly was carried out with HiCanu (Nurk et al., 2020); haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). The assembly was polished with the 10X Genomics Illumina data by aligning the reads to the assembly with longranger align and calling variants with freebayes (Garrison & Marth, 2012). One round of Illumina polishing was applied. Scaffolding with Hi-C data (Rao et al., 2014) was carried out with SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2021). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020).
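The sequence of steps described above can be summarised as an ordered record of stages and tools. The sketch below is only a restatement of the paragraph as a data structure: the `Step` class and `describe` helper are hypothetical names introduced for illustration, the order follows the text rather than a confirmed execution order, and no command lines or parameter settings are implied, since none are given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One stage of the workflow as described in the Methods text."""
    stage: str
    tools: tuple


# Order follows the Methods paragraph; settings are not stated there,
# so none are recorded here.
PIPELINE = (
    Step("assembly", ("HiCanu",)),
    Step("haplotypic duplication removal", ("purge_dups",)),
    Step("polishing, one round with 10X Illumina reads",
         ("longranger align", "freebayes")),
    Step("Hi-C scaffolding", ("SALSA2",)),
    Step("contamination check and correction", ("gEVAL",)),
    Step("manual curation", ("gEVAL", "HiGlass", "Pretext")),
    Step("mitochondrial genome assembly", ("MitoHiFi",)),
    Step("analysis and BUSCO scoring", ("BlobToolKit", "BUSCO")),
)


def describe(pipeline):
    """Return one numbered, human-readable line per stage."""
    return [
        f"{i}. {step.stage}: {', '.join(step.tools)}"
        for i, step in enumerate(pipeline, start=1)
    ]


for line in describe(PIPELINE):
    print(line)
```

A record of this kind could be extended with software versions and non-default settings where those are known.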

Data availability
European Nucleotide Archive: Euproctis similis (yellow-tail). Accession number PRJEB42127: https://identifiers.org/ena.embl:PRJEB42127. The genome sequence is released openly for reuse. The E. similis genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. The genome will be annotated using RNAseq data and presented through the Ensembl pipeline at the European Bioinformatics Institute.
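As an illustration of how the deposited runs might be listed programmatically, the sketch below builds a query URL for the project accession. It assumes the ENA Portal API `filereport` endpoint and the field names shown, which should be checked against the current ENA documentation; the helper name `filereport_url` is hypothetical. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API; verify against the ENA documentation.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession,
                   fields=("run_accession", "library_strategy", "fastq_ftp")):
    """Build a file-report query URL listing the read runs of a project.

    The field names are assumptions chosen for illustration.
    """
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


# Project accession taken from the Data availability statement.
print(filereport_url("PRJEB42127"))
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, would be expected to return a tab-separated table with one row per run.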
Raw data and assembly accession identifiers are reported in Table 1.

I only have three small suggestions for additional information, although I also see that none of these are commonly supplied in the notes coming from the Darwin Tree of Life project.

First, as a biologist, I would be interested in knowing a little bit more about the organism. For example, that it is a night-active moth, widespread across the Eurasian continent, that they are active from August to June, and that they are associated with both urban and non-urban habitats and with several host plants.

Second, in the presentation of the methods there are no details about the bioinformatic analyses beyond the programs that were used. It would be good to specify any deviation from default settings or even to have a brief summary of the commands used to perform the analyses. This could be done in a separate file archived along with the note or in a table, possibly integrated in Table 3.

Third, the data presentation can benefit from brief expansion of the results. The interactive figures are nice, because some explanation of what is shown can also be found at the corresponding blobtoolkit repository. However, there is no text accompanying these figures beyond a single sentence referencing the number of scaffolds and citing figures 1 through 4. Some expansion of the genome assembly statistics seems desirable. And figure 4 could use a legend as well as axis labels with the chromosome numbers.

Again, I do see that other examples of notes on genomes coming from this project also do not necessarily contain these additional pieces of information, so I guess it is up to the authors to decide whether that continuity matters more or whether the details are simply not necessary.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Partly
Are the datasets clearly presented in a useable and accessible format?

Major updates:
We have updated the Introduction section to include information about abundance, habitat and distribution of the species, and a description of potential uses for the genome. Also included is a brief mention of lepidopteran karyotypes; a paper discussing lepidopteran chromosome evolution using the sequences generated by this project is forthcoming.
We addressed the issue of the missing RNA-Seq data, which will be used for annotation by Ensembl in the near future as part of the Darwin Tree of Life project pipeline. We have also included details of the method of library preparation.
An image of the specimen used for genome sequencing has been included as Figure 1. The legends of Figures 2-5 have been expanded to make them easier to understand.