The genome sequence of the drone fly, Eristalis tenax (Linnaeus, 1758) [version 1; peer review: awaiting peer review]

We present a genome assembly from an individual female Eristalis tenax (the drone fly; Arthropoda; Insecta; Diptera; Syrphidae). The genome sequence is 487 megabases in span. The majority of the assembly (96.50%) is scaffolded into six chromosomal pseudomolecules, with the X sex chromosome assembled.


Background
The drone fly, Eristalis tenax (Figure 1), is perhaps the most widespread of the hoverflies, with a cosmopolitan distribution. It is common in the United Kingdom and Ireland, where it can be found in the springtime following the emergence of females from overwintering in sheltered cavities in caves or buildings (Ball & Morris, 2013). It is at its most numerous in the UK during the late summer and autumn, when the local population is augmented by migratory influxes from mainland Europe. Large southward migrations have been observed during the autumn in the UK, Europe and North America (Aubert et al., 1976; Owen, 1956; Shannon, 1926). E. tenax is a fairly large hoverfly, separated from others in the genus Eristalis by the presence of a broad, black longitudinal face-stripe, longitudinal dark stripes on its eyes, and a distinctly enlarged and curved hind tibia. Like its sister species Eristalis pertinax, E. tenax is a visual, acoustic and behavioural mimic of the honeybee, Apis mellifera (Golding et al., 2001; Moore & Hassall, 2016). Males are territorial, aggressively defending patches of flowers from conspecifics and other flying insects (Wellington & Fitzpatrick, 1981). The larvae, colloquially known as rat-tailed maggots, live in a wide array of organically rich pools, where they feed on decaying organic matter. They therefore play an important ecological role in decomposition (Hurtado et al., 2008). In addition, E. tenax is an important pollinator, visiting a wide range of crops and wild plants, and its role as a migratory pollinator may be particularly important for geographically isolated plant populations (Doyle et al., 2020; Pérez-Bañón et al., 2007; Rader et al., 2020). E. tenax can be reared and has been the subject of numerous investigations into its biology, including studies on population structure, flight, mimicry, vision and behaviour (Francuski & Milankov, 2015; Golding et al., 2001; Lunau et al., 2018; Nicholas et al., 2018; Straw et al., 2006).
This is the first high-quality genome assembly produced for Eristalis tenax, and we believe that the sequence described here, generated as part of the Darwin Tree of Life project, will further aid understanding of the biology and ecology of this hoverfly.

Genome sequence report
The genome was sequenced from a single female E. tenax collected from Wytham Great Wood, Oxfordshire, UK (latitude 51.769, longitude -1.33). A total of 36-fold coverage in Pacific Biosciences single-molecule long reads (N50 12 kb) and 60-fold coverage in 10X Genomics read clouds (from molecules with an estimated N50 of 60 kb) were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 71 missing joins or misjoins and removed 11 haplotypic duplications, reducing the assembly length by 1.92% and the scaffold number by 31.28%, and increasing the scaffold N50 by 72.23%.
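As a rough illustration of what the fold-coverage figures imply, the short Python sketch below multiplies each coverage value by the assembly span to estimate the sequencing yield. It assumes that coverage was calculated against a genome size close to the 487 Mb assembly span, which the text does not state, so the results are approximations only. The function name `implied_yield_gb` is hypothetical and introduced purely for this example.

```python
# Figures quoted in the text; the use of the assembly span as the
# denominator for coverage is an assumption made for this sketch.
ASSEMBLY_SPAN_BP = 487_000_000   # 487 Mb assembly span
PACBIO_FOLD = 36                 # PacBio long-read coverage
TENX_FOLD = 60                   # 10X Genomics read-cloud coverage


def implied_yield_gb(fold_coverage: float, genome_size_bp: int) -> float:
    """Approximate sequenced bases (in Gb) implied by a fold-coverage value."""
    return fold_coverage * genome_size_bp / 1e9


pacbio_gb = implied_yield_gb(PACBIO_FOLD, ASSEMBLY_SPAN_BP)
tenx_gb = implied_yield_gb(TENX_FOLD, ASSEMBLY_SPAN_BP)
print(f"PacBio: ~{pacbio_gb:.1f} Gb; 10X: ~{tenx_gb:.1f} Gb")
```

Under that assumption, the two data sets correspond to roughly 17.5 Gb and 29.2 Gb of sequence, respectively.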
The final assembly has a total length of 487 Mb in 157 sequence scaffolds with a scaffold N50 of 77.1 Mb (Table 1). The majority, 96.50%, of the assembly sequence was assigned to six chromosomal-level scaffolds, representing five autosomes (numbered by sequence length) and the X sex chromosome (Figures 2 to 5; Table 2). The assembly has a BUSCO (Simão et al., 2015) completeness of 96.6% using the diptera_odb10 reference set. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
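For readers unfamiliar with the N50 statistic quoted above, the following is a minimal Python sketch of how it is conventionally computed: the sequences are sorted from longest to shortest, and the N50 is the length of the sequence at which the running total first reaches half of the total span. The lengths in `toy_lengths_mb` are hypothetical values chosen for illustration and are not the scaffold lengths of this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together cover at least half of the total span.
    """
    ordered = sorted(lengths, reverse=True)
    half_span = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_span:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in Mb (illustrative only).
toy_lengths_mb = [80, 70, 50, 30, 20, 10]
print(n50(toy_lengths_mb))
```

Because the statistic depends only on the sorted lengths, joining scaffolds during curation can raise the N50 even while the total assembly length decreases slightly.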

Sample acquisition and nucleic acid extraction
A female (idEriTena2) and a male (idEriTena3) E. tenax were collected from Wytham Great Wood, Oxfordshire, UK (latitude 51.769, longitude -1.33) by Will Hawkes, University of Exeter, who also identified the samples. The samples were collected using a net, snap-frozen on dry ice and stored in a CoolRack. DNA was extracted from the head/thorax of idEriTena2 at the Wellcome Sanger Institute Scientific Operations core using the Qiagen MagAttract HMW DNA kit, according to the manufacturer's instructions. RNA was extracted from head/thorax tissue of idEriTena3 in the Tree of Life Laboratory at the Wellcome Sanger Institute using TRIzol (Invitrogen), according to the manufacturer's instructions. RNA was then eluted in 50 μl RNAse-free water and its concentration assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. RNA integrity was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics Chromium read cloud sequencing libraries were constructed according to the manufacturers' instructions. Poly(A) RNA-Seq libraries were constructed using the NEB Ultra II RNA Library Prep kit. Sequencing was performed by the Scientific Operations core at the Wellcome Sanger Institute on Pacific Biosciences SEQUEL II (HiFi), Illumina HiSeq X (10X) and Illumina HiSeq 4000 (RNA-Seq) instruments. Hi-C data were generated from the abdomen tissue of idEriTena2 using the Arima v1 Hi-C kit and sequenced on HiSeq X.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021); haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020) with the -e flag. One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align and calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2021). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

The genome sequence is released openly for reuse. The E. tenax genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. The genome will be annotated using the RNA-Seq data and presented through the Ensembl pipeline at the European Bioinformatics Institute. Raw data and assembly accession identifiers are reported in Table 1.