The genome sequence of the clay, Mythimna ferrago (Fabricius, 1787)

We present a genome assembly from an individual female Mythimna ferrago (the clay; Arthropoda; Insecta; Lepidoptera; Noctuidae). The genome sequence is 861 megabases in span. The majority of the assembly (99.98%) is scaffolded into 32 chromosomal pseudomolecules, with the W and Z chromosomes assembled. The complete mitochondrial genome was also assembled and is 15.3 kilobases in length. Gene annotation of this assembly on Ensembl has identified 14,075 protein coding genes.


Background
The Clay, Mythimna ferrago (Fabricius, 1787), is a common, nocturnal, non-pest macro-moth species that occurs across the Palearctic. In Great Britain, M. ferrago has been assessed against the International Union for Conservation of Nature (IUCN) Red List criteria, and categorised as a resident species of Least Concern (Fox et al., 2021). The larvae feed on grasses (Robinson et al., 2010), and overwinter as small larvae. The adult flight period is July and August. Mythimna ferrago can be found in a range of open habitats, including woodland, grassland, scrub, heathland, gardens and farmland.
Moths are important indicators of land-use and climate change (Wagner et al., 2021), and are used to study site-specific stressors such as light pollution (Boyes et al., 2021), pesticide use, and the effectiveness of agri-environment management schemes (Botham et al., 2015; Merckx et al., 2009; Staley et al., 2016). Within the genus Mythimna, M. ferrago is unusual in that, on the basis of DNA barcoding data, it forms two distinct clusters across its entire range that do not correspond with a geographical pattern (Huemer et al., 2019). Analyses at the genomic level could help to elucidate what drives this intra-specific variation and to identify key genes under selection.

Genome sequence report
The genome was sequenced from a single female M. ferrago, collected from Wytham Woods, Berkshire, UK (Figure 1). A total of 31-fold coverage in Pacific Biosciences single-molecule HiFi long reads and 51-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 15 missing joins or misjoins and removed one haplotypic duplication, reducing the assembly size by 0.99% and the scaffold number by 18.52%, and increasing the scaffold N50 by 1.95%.
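The curation percentages above are relative changes against the pre-curation assembly. A minimal sketch of that arithmetic follows. The pre-curation scaffold count of 54 is not reported in this note; it is inferred here from the final count of 44 scaffolds and the stated 18.52% reduction, and should be read as an assumption.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent (negative = reduction)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Assumption: 54 scaffolds before curation (inferred, not reported in the note).
# 44 is the final scaffold count reported for the assembly.
scaffolds_before = 54
scaffolds_after = 44

change = percent_change(scaffolds_before, scaffolds_after)
print(f"Scaffold number changed by {change:.2f}%")
```

The same function applies to the assembly size and scaffold N50 figures, given the corresponding pre-curation values.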
The final assembly has a total length of 861 Mb in 44 sequence scaffolds with a scaffold N50 of 27.9 Mb (Table 1). The majority, 99.98%, of the assembly sequence was assigned to 32 chromosomal-level scaffolds, representing 30 autosomes (numbered by sequence length) and the W and Z sex chromosomes (Figure 2-Figure 5; Table 2).
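For readers unfamiliar with the statistic, scaffold N50 is the length L such that scaffolds of length L or greater together contain at least half of the total assembly. The sketch below shows the standard calculation; the scaffold lengths are hypothetical toy values, not the lengths of this assembly.

```python
def scaffold_n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk scaffolds from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy lengths (arbitrary units), for illustration only.
toy_lengths = [40, 30, 20, 10]
print(scaffold_n50(toy_lengths))  # prints 30
```

In the toy example the total is 100; the longest scaffold (40) covers less than half, and adding the second (30) brings the running sum to 70, so the N50 is 30.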
The assembly has a BUSCO v5.1.2 (Manni et al., 2021) completeness of 98.9% (single 98.1%, duplicated 0.8%) using the lepidoptera_odb10 reference set (n=954). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
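The BUSCO completeness figure is the sum of the single-copy and duplicated categories. A small consistency check on the reported percentages might look like the following; the function names are illustrative and are not part of the BUSCO software.

```python
def busco_complete(single_pct, duplicated_pct):
    """Complete BUSCOs (%) = single-copy (%) + duplicated (%), to one decimal place."""
    return round(single_pct + duplicated_pct, 1)


def busco_not_complete(complete_pct):
    """Remainder (%): fragmented plus missing BUSCOs combined."""
    return round(100.0 - complete_pct, 1)


complete = busco_complete(98.1, 0.8)
print(complete)                      # prints 98.9
print(busco_not_complete(complete))  # prints 1.1
```

The remaining 1.1% covers fragmented and missing BUSCOs together; the note does not report how it is split between the two categories.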

Sample acquisition and nucleic acid extraction
A single adult female M. ferrago specimen (ilMytFerr1) was collected using a light trap from Wytham Woods, Berkshire, UK (latitude 51.772, longitude -1.338) by Douglas Boyes (University of Oxford). The specimen was identified by Douglas Boyes and snap-frozen on dry ice.
DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute. The ilMytFerr1 sample was weighed and dissected on dry ice with tissue set aside for Hi-C sequencing. Abdomen tissue was cryogenically disrupted to a fine powder using a Covaris cryoPREP Automated Dry Pulveriser, receiving multiple impacts. Fragment size analysis of 0.01-0.5 ng of DNA was then performed using an Agilent FemtoPulse. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Low molecular weight DNA was removed from a 200-ng aliquot of extracted DNA using a 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared to an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics Chromium read cloud sequencing libraries were constructed according to the manufacturers' instructions. Sequencing was performed by the Scientific Operations core at the Wellcome Sanger Institute on Pacific Biosciences SEQUEL II (HiFi) and Illumina NovaSeq 6000 (10X) instruments. Hi-C data were generated in the Tree of Life laboratory from head/thorax tissue of ilMytFerr1 using the Arima v2 kit and sequenced on a NovaSeq 6000 instrument.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021); haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with

Niclas Backström
Department of Evolutionary Biology, Evolutionary Biology Centre (EBC), Uppsala University, Uppsala, Sweden
The article describes the statistics of a genome assembly of the clay (Mythimna ferrago) and is written in standard format for the DToL genome notes. The paper is well written and the results are clearly presented. Accession information for all data sets is also available in the note. The quality of the assembly is impressive and the data will definitely be useful for forthcoming intraspecific population genomic studies and for comparative genomics studies in Lepidoptera in general.
Minor comments: 1) It's a bit unclear to me what average length of coding sequence refers to (Table 1). Is this the average length of the entire gene (introns and UTR included) or the sum of lengths of all coding exons (probably not since it is > 20 kb), or something else? Maybe a short note in the table legend can be added to make this clear.
2) Methods: Section 1, first sentence. "using a light trap from Wytham Woods" sounds a bit like the light trap was from WW and not that the sample site was WW?
3) Methods: Genome annotation paragraph. Unclear if the transcriptomic data are from the same species?
4) Methods: Genome annotation paragraph. Maybe specify which set of proteins from UniProt that was used for the alignments?

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes