The genome sequence of the marbled white butterfly, Melanargia galathea (Linnaeus, 1758)

We present a genome assembly from an individual female Melanargia galathea (the marbled white; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 606 megabases in span. The majority (99.97%) of the assembly is scaffolded into 25 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled.


Background
The marbled white Melanargia galathea is a common butterfly of flower-rich meadows and other grassy habitats in central and southern Europe, Caucasus, Transcaucasia and northern and central parts of Asia Minor. The species is notably absent from most Mediterranean islands, from most of the Iberian peninsula (where it is replaced by the closely related Melanargia lachesis) and from northwestern Africa (where it is replaced by Melanargia lucasi) (Habel et al., 2017). Melanargia galathea is univoltine, with a flight period from late May to September depending on latitude. Early instar larvae overwinter and can feed on a wide range of grasses. Wing patterns vary throughout the range and several discrete varieties have been described as forms, in particular darker forms (f. procida and f. magdalenae) and specimens with the hind wing underside uniformly white, unmarked (f. leucomelas) (Tolman & Lewington, 2008).
Melanargia galathea is currently listed as a species of Least Concern in the IUCN Red List of Europe (van Swaay et al., 2010). UK populations feed mainly on Red Fescue (Festuca rubra) and have been classified as ssp. serena, based on subtle wing pattern differences (Verity, 1913). While M. galathea is restricted to England and Wales in the UK, it has expanded its range rapidly northwards in recent decades (Fox et al., 2015). Successful introductions to previously unoccupied sites in Northern England suggest that the species lags behind its current climatic niche at the range margin (Willis et al., 2009). The species has a karyotype of 24 chromosomes (Bigger, 1960; Lorković, 1941).

Genome sequence report
The genome was sequenced from a single female M. galathea (Figure 1) collected near Cluj-Napoca, Romania (latitude 46.834, longitude 23.629). A total of 29-fold coverage in Pacific Biosciences single-molecule circular consensus (HiFi) long reads and 55-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 69 missing joins or misjoins and removed 7 haplotypic duplications, reducing the assembly length by 0.63% and the scaffold number by 45.36%, and increasing the scaffold N50 by 23.78%.
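The curation percentages above can be inverted to estimate the pre-curation figures. The following is a minimal sketch, assuming each percentage is a change relative to the pre-curation value; the post-curation values (53 scaffolds, 606 Mb, 25.5 Mb N50) are the ones reported in the text, and the helper name `value_before` is hypothetical. Because the inputs are rounded, the length and N50 results are approximate.

```python
def value_before(after: float, pct_change: float) -> float:
    """Invert a percentage change: after = before * (1 + pct_change / 100)."""
    return after / (1.0 + pct_change / 100.0)


# Post-curation values and percentage changes as reported in the text.
scaffolds_before = value_before(53, -45.36)    # scaffold number fell by 45.36%
length_before_mb = value_before(606, -0.63)    # assembly length fell by 0.63%
n50_before_mb = value_before(25.5, +23.78)     # scaffold N50 rose by 23.78%

print(round(scaffolds_before))      # 97
print(round(length_before_mb, 1))   # 609.8
print(round(n50_before_mb, 1))      # 20.6
```

Under this assumption the scaffold count is internally consistent: a drop from 97 to 53 scaffolds is 44/97, or 45.36%.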
The final assembly has a total length of 606 Mb in 53 sequence scaffolds with a scaffold N50 of 25.5 Mb (Table 1). The majority, 99.97%, of assembly sequence was assigned to 25 chromosomal-level scaffolds, representing 23 autosomes (numbered by sequence length), and the W and Z sex chromosomes (Figure 2-Figure 5; Table 2). The assembly has a BUSCO v5.1.2 (Manni et al., 2021) completeness of 98.3% (single 97.8%, duplicated 0.5%) using the lepidoptera_odb10 reference set (n=5286). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
Figure 1. Fore and hind wings of the Melanargia galathea specimen from which the genome was sequenced. Dorsal (left) and ventral (right) surface view of wings from specimen RO_MG_799 (ilMelGala2) from Cluj-Napoca, Romania, used to generate Pacific Biosciences and 10X Genomics data. Dorsal (left) and ventral (right) surface view of wings from specimen RO_MG_790 (ilMelGala1) from Cluj-Napoca, Romania, used to generate Hi-C data.
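For readers unfamiliar with the scaffold N50 statistic quoted above, the sketch below shows the usual definition: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly span. The function name `n50` and the toy lengths are illustrative assumptions, not the assembly's actual scaffold lengths or the tool used by the authors.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running total
        # to at least half of the total span.
        if running * 2 >= total:
            return length
    return ordered[-1]  # not reached for positive lengths


# Toy example with hypothetical lengths (arbitrary units).
toy_lengths = [8, 5, 4, 3]
print(n50(toy_lengths))  # 5
```

In the toy example the total is 20; the longest scaffold (8) covers less than half, and adding the second (5) brings the running total to 13, so the N50 is 5.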

Sample acquisition and nucleic acid extraction
Two M. galathea specimens (ilMelGala2, genome assembly; ilMelGala1, Hi-C, additional HiFi and 10X reads not used in genome assembly) were collected near Cluj-Napoca, Romania (latitude 46.834, longitude 23.629) using a net by Konrad Lohse, Alex Hayward, Dominik Laetsch and Roger Vila, who also identified the samples. A third specimen (ilMelGala4) was used for RNA extraction. The samples were [...] was cryogenically disrupted to a fine powder using a Covaris cryoPREP Automated Dry Pulveriser, receiving multiple impacts. Whole organism tissue of ilMelGala1 was disrupted using a Nippi Powermasher fitted with a BioMasher pestle. Fragment size analysis of 0.01-0.5 ng of DNA was then performed using an Agilent FemtoPulse. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Low molecular weight DNA was removed from a 200-ng aliquot of extracted DNA using the 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared to an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.
RNA (from the whole organism of ilMelGala4) was extracted in the Tree of Life Laboratory at the WSI using TRIzol, according to the manufacturer's instructions. RNA was then eluted in 50 μl RNase-free water and its concentration was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. RNA integrity was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics read cloud DNA sequencing libraries were constructed according to the manufacturers' instructions. Poly(A) RNA-Seq libraries were constructed using the NEB Ultra II RNA Library Prep kit. DNA and RNA sequencing was performed by the Scientific Operations core at the WSI on Pacific Biosciences SEQUEL II (HiFi), Illumina NovaSeq 6000 (ilMelGala2, 10X), HiSeq X (ilMelGala1, 10X) and Illumina HiSeq 4000 (RNA-Seq) instruments. Hi-C data were also generated from remaining whole organism tissue of ilMelGala1 using the Arima v1 Hi-C kit and sequenced on HiSeq X.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021); haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align and calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation (Howe et al., 2021) was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2021), which performed annotation using MitoFinder (Allio et al., 2020). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

Linda Neaves
Australian National University, Canberra, Australia
The article presents the genome assembly for the marbled white butterfly, Melanargia galathea (Linnaeus, 1758). The authors generated long-read Pacific Biosciences single-molecule circular consensus and 10X Genomics data from a single female, while a second female from the same location was used to generate Hi-C data. A third individual, from a different location, was used for RNA extraction. The authors assembled the genome sequence spanning 606 megabases, with the majority (99.97%) of the assembly scaffolded into 25 chromosomal pseudomolecules, including the W and Z sex chromosomes.
The article clearly articulates the rationale for creating the dataset, and the methods are clearly presented and appear appropriate. I think it could be a little clearer in the main text of the report section that separate individuals were used for the PacBio and 10X Genomics data and for the Hi-C data. It is clear in Figure 1 and later in the methods sections that two individuals were used.

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Genomics, Lepidoptera, Molecular Biology, NGS, Evolution
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.