The genome sequence of Gymnosoma rotundatum (Linnaeus, 1758), a parasitoid ladybird fly

We present a genome assembly from an individual male Gymnosoma rotundatum (Arthropoda; Insecta; Diptera; Tachinidae). The genome sequence is 779 megabases in span. The majority of the assembly (97.07%) is scaffolded into six chromosomal pseudomolecules, with the X sex chromosome assembled.


Background
The tachinid flies (Diptera: Tachinidae) are one of the largest families of flies. The entire family is parasitic, with the larvae developing as internal parasites in a range of hosts, mostly insects. Gymnosoma rotundatum (Diptera: Tachinidae) is a small fly, 5-6 mm long, with a dark thorax dusted with gold in males, and a globular red or orange abdomen decorated with dark markings along the midline. This shape and colouration have given rise to the common name "ladybird flies" for various Gymnosoma species.
Gymnosoma rotundatum is known from Britain and Ireland (Belshaw, 1993). In Ireland, the species has been recorded from only a few localities in the south, with the most recent record from County Kerry in 2015. In Britain, the species has historically been regarded as rare, and was accorded Red Data Book 3 status by Falk (1991). Morris (1997) summarised the known British records and distribution of the species up to 1996, noting that G. rotundatum appeared to be "seemingly confined to a narrow corridor from the West Sussex coast through Surrey and parts of North Hampshire". Gymnosoma rotundatum appears to be one of the species benefiting from the warming climate in the UK, and since 1996 it has been increasingly recorded away from its restricted historical range. It is now known from a large number of sites in central southern and south-east England, with a few recent records from East Anglia.
Gymnosoma rotundatum is a parasite of shieldbugs (Hemiptera: Pentatomidae), though specific host details are limited. Tschorsnig & Herting (1994) cite only "Pentatomidae" and Belshaw (1993) lists Palomena spp. as hosts, although there are no confirmed British rearing records. Adult flies are on the wing from late April until early October, with records peaking in August. The species is most often recorded from warm, dry sites, where it visits a range of open, shallow flowers such as hogweed (Heracleum sphondylium), yarrow (Achillea millefolium) and mayweeds (Tripleurospermum sp.).

Genome sequence report
The genome was sequenced from a single male G. rotundatum (Figure 1) collected from Hartslock Reserve, Oxfordshire, UK (latitude 51.511263, longitude -1.112222). A total of 31-fold coverage in Pacific Biosciences single-molecule long reads and 32-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 191 missing joins or misjoins and removed 3 haplotypic duplications, reducing the assembly size by 0.11% and the scaffold number by 23.88%, and increasing the scaffold N50 by 14.22%. The final assembly has a total length of 779 Mb in 392 sequence scaffolds with a scaffold N50 of 137.8 Mb (Table 1). The majority of the assembly sequence (97.07%) was assigned to 6 chromosomal-level scaffolds, representing 5 autosomes (numbered by sequence length) and the X sex chromosome (Figure 2-Figure 5; Table 2). The X chromosome was identified on the basis of its read coverage, which is half that of the diploid autosomes. A large number of scaffolds remain unassigned and may belong to X or Y; we are uncertain whether the karyotype is X0 or XY. The assembly has a BUSCO v5.2.2 (Manni et al., 2021) completeness of 98.8% (single 98.3%, duplicated 0.4%) using the diptera_odb10 reference set. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
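Two of the quantities in this paragraph, the scaffold N50 and the half-coverage signal used to identify the X chromosome, can be illustrated with a minimal Python sketch. This is not the pipeline used for this assembly. The function names, the 15% tolerance and the per-chromosome depth values in the usage example are hypothetical and chosen only for illustration.

```python
from statistics import median


def scaffold_n50(lengths):
    """Return the scaffold N50: the length L such that scaffolds of
    length >= L together contain at least half of the assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


def flag_half_coverage(mean_depth, tolerance=0.15):
    """Return names of scaffolds whose mean read depth is close to half
    the median depth across all scaffolds supplied.

    mean_depth: dict mapping scaffold name -> mean read depth.
    tolerance:  hypothetical cut-off on the depth ratio (assumption).
    The median is used as the diploid baseline, which assumes that most
    of the supplied scaffolds are autosomal.
    """
    baseline = median(mean_depth.values())
    return [
        name
        for name, depth in mean_depth.items()
        if abs(depth / baseline - 0.5) <= tolerance
    ]


if __name__ == "__main__":
    # Invented depths for illustration only; not measured values.
    depths = {"1": 31.2, "2": 30.8, "3": 31.0, "4": 30.5, "5": 31.4, "X": 15.6}
    print(flag_half_coverage(depths))
    print(scaffold_n50([5, 4, 3, 2, 1]))
```

In a male with an X0 or XY karyotype the X chromosome is present in a single copy, so its depth is expected to be about half the autosomal depth. Y-linked scaffolds would show the same signal, which is why coverage alone cannot distinguish between the two karyotypes.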

Sample acquisition and nucleic acid extraction
A male G. rotundatum (idGymRotn1) was collected from Hartslock Reserve, Oxfordshire, UK (latitude 51.511263, longitude -1.112222) by Matt Smith, independent researcher, who also identified the specimen. The specimen was collected from grassland using a net and snap-frozen in liquid nitrogen.
DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute. The idGymRotn1 sample was weighed and dissected on dry ice, with tissue set aside for Hi-C sequencing. Thorax tissue was disrupted using a Nippi Powermasher fitted with a BioMasher pestle. Fragment size analysis of 0.01-0.5 ng of DNA was then performed using an Agilent FemtoPulse. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Low molecular weight DNA was removed from a 200-ng aliquot of extracted DNA using a 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared to an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a NanoDrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics Chromium read cloud sequencing libraries were constructed according to the manufacturers' instructions. Sequencing was performed by the Scientific Operations core at the Wellcome Sanger Institute on Pacific Biosciences SEQUEL II and Illumina NovaSeq 6000 instruments. Hi-C data were generated from head tissue of idGymRotn1 using the Arima Hi-C+ kit and sequenced on a NovaSeq 6000 instrument.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021); haplotypic duplication was identified and removed with
The genome sequence is released openly for reuse. The G. rotundatum genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. The genome will be annotated and presented through the Ensembl pipeline at the European Bioinformatics Institute. Raw data and assembly accession identifiers are reported in Table 1.

enough detail to convince me that it really represents our current understanding of the taxon under the name Gymnosoma rotundatum. Taxonomically the publication is also on safe ground: if synonyms are established in the future, G. rotundatum is the oldest name in the genus. There is also a comprehensive host catalogue for Palearctic Tachinidae here: http://www.nadsdiptera.org/Tach/WorldTachs/CatPalHosts/Home.html, which could be cited for the host records. As there are many different host species, they could be summarised simply as "suitable sized shield bugs (Pentatomidae) in the fly's habitat".

Genome sequence report:
You might be able to infer the existence of a Y chromosome indirectly (most calyptrate flies follow an XY system of sex determination) by looking for the dominant male-determining factor in your sequence data (e.g. PLoS Biol. 13(4)