The genome sequence of the hawthorn shieldbug, Acanthosoma haemorrhoidale (Linnaeus, 1758)

We present a genome assembly from an individual male Acanthosoma haemorrhoidale (hawthorn shieldbug; Arthropoda; Insecta; Hemiptera; Acanthosomatidae). The genome sequence is 866 megabases in span. The majority of the assembly (99.98%) is scaffolded into 7 chromosomal pseudomolecules with the X and Y sex chromosomes assembled. The complete mitochondrial genome was also assembled and is 18.9 kilobases in length.


Background
The hawthorn shield bug, Acanthosoma haemorrhoidale, is a large Pentamoid shield bug, easily recognisable by their size (typically 13 mm or more in length) and bright green and red coloration. The species is common on hawthorn (Crataegus monogyna), where the berries comprise their principal food source, but are also found in mixed woodland and will feed on leaves of oak, hazel, and other deciduous trees and shrubs. Adults overwinter in leaf litter or under bark, and sometimes in buildings, and emerge in spring. Eggs are laid in several batches in late spring to early summer, and females exhibit no maternal care, unlike other members of the Acanthosomatidae. Originally classified as Cimex haemorrhoidalis by Linnaeus in 1758, the genus Acanthosoma (acantho- = spiny, -soma = body) was raised by Curtis in 1824 for the spined keel on the ventral surface (Curtis, 1824). The species name references the blood red coloration and appearance of discharging blood, particularly from the tip of the abdomen. The species has a trans-palaearctic distribution and comprises at least three currently-recognised subspecies: A. h. haemorrhoidale, Linnaeus 1758; A. h. angulatum, Jakovlev 1880; A. h. ouchii, Ishihara 1950 (Tsai & Rédei, 2015).
The classic work by Southwood and Leston on British land and water bugs (Southwood & Leston, 1959) describes a distribution across much of England and Wales, with only recent colonisation of Northern England. Whilst sporadic records for Scotland are found from the mid-20th century, it appears that a gradual northward range expansion has been underway from at least the mid-1990s, and that the species is now well-established across northern England and is reasonably common up to the central belt of Scotland, with more scattered reports from further north (Ramsay, 2014). A similar northern expansion appears to have occurred in Finland in the mid-20th century (Ramsay, 2014; Southwood & Leston, 1959), and it may be interesting to investigate parallel behavioural or physiological changes in these northward-bound populations. Development is temperature sensitive, with high mortality at 30°C (Hori et al., 1993), and more southern parts of the species range may therefore become unsuitable in the future.
In contrast to groups like the Lepidoptera, where females produce pheromones to attract males, in the Pentatomoidea it seems to be males that produce pheromones, most likely to avoid parasitoids utilising female pheromones to find hosts, and male A. haemorrhoidale possess extensive abdominal sternal glands (Staddon, 1990). The genome sequence will facilitate identification of biosynthetic pathways underlying pheromone production and reception in this species. Similarly, genomic data will shed light on host-symbiont relationships, including not only characterisation of bacterial symbionts themselves, but also the anatomical and behavioural mechanisms for storing and transmitting them to the next generation, such as the midgut crypts and the lubricating organs of females (

Genome sequence report
The genome was sequenced from a single male A. haemorrhoidale collected from Wytham Woods, Berkshire, UK (Figure 1). A total of 26-fold coverage in Pacific Biosciences single-molecule HiFi long reads and 223-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 321 missing/misjoins and removed 4 haplotypic duplications, reducing the assembly size by 0.08% and the scaffold number by 65.5%, and increasing the scaffold N50 by 112.02%.
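As a rough illustration of what the fold-coverage figures imply, the sketch below converts them into approximate sequencing yields. It assumes coverage is expressed relative to the reported 866 Mb assembly span, which is not stated in the text; the function name `implied_yield_gb` is hypothetical, and the resulting yields are back-of-envelope estimates, not reported values.

```python
def implied_yield_gb(fold_coverage: float, assembly_span_mb: float) -> float:
    """Approximate sequenced bases (in Gb) implied by a fold-coverage figure.

    Assumption: coverage = total sequenced bases / assembly span.
    """
    return fold_coverage * assembly_span_mb / 1000.0


ASSEMBLY_SPAN_MB = 866.0  # assembly span reported in this Data Note

# Fold-coverage values reported in the text.
hifi_gb = implied_yield_gb(26, ASSEMBLY_SPAN_MB)
tenx_gb = implied_yield_gb(223, ASSEMBLY_SPAN_MB)

print(f"PacBio HiFi: ~{hifi_gb:.1f} Gb; 10X Genomics: ~{tenx_gb:.1f} Gb")
```

If coverage was instead calculated against an independent genome size estimate, the implied yields would scale accordingly.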
The final assembly has a total length of 866 Mb in 72 sequence scaffolds with a scaffold N50 of 33.6 Mb (Table 1). The majority, 99.98%, of the assembly sequence was assigned to 7 chromosomal-level scaffolds, representing 5 autosomes (numbered by sequence length) and the X and Y sex chromosomes (Figure 2-Figure 5; Table 2).
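For readers unfamiliar with the scaffold N50 statistic, a minimal sketch of the standard definition follows: N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly. The scaffold lengths used here are hypothetical toy values, not the lengths in this assembly, and this is not the code used by the assembly pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are taken from longest to shortest until their cumulative
    length reaches at least half of the total; the length of the last
    scaffold added is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one scaffold length")


# Hypothetical toy lengths (arbitrary units), for illustration only.
toy_lengths = [50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

In the toy example the total is 150, so the two longest scaffolds (50 + 40 = 90) are the first to reach half of the total, and the N50 is 40.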
The assembly has a BUSCO v5.2.2 (Manni et al., 2021) completeness of 99.2% (single 97.4%, duplicated 1.8%) using the hemiptera_odb10 reference set (n=954). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute. The ihAcaHaem1 sample was weighed and dissected on dry ice with tissue set aside for Hi-C sequencing. Abdomen tissue was cryogenically disrupted to a fine powder using a Covaris cryoPREP Automated Dry Pulveriser, receiving multiple impacts. Fragment size analysis of 0.01-0.5 ng of DNA was then performed using an Agilent FemtoPulse. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Low molecular weight DNA was removed from a 200-ng aliquot of extracted DNA using 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared into an average fragment size between 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and Qubit Fluorometer and Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.

Is the rationale for creating the dataset(s) clearly described?
As the Darwin Tree of Life (DToL) genome sequencing project continues to produce high-quality (chromosome-scale) genome assemblies for eukaryotic species of the British Isles, I welcome here the documentation of a genome assembly for the shieldbug Acanthosoma haemorrhoidale. The bugs (the Hemiptera, including aphids, cicadas, and the true bugs) are the most species-rich order of hemimetabolous insects. Increasing genomic resources for this major animal group will support ongoing research on biodiversity and many aspects of insect biology and ecological interactions, extending resources beyond the already well-sampled Holometabola and pest species of Hemiptera.
I understand that DToL prioritizes rapid dissemination of genome data according to a standard template, and the Wellcome Open Research format of the Data Note is a great fit for this. However, having been asked to review this specific contribution, I find that there are some aspects of presentation for dataset reproducibility, rigor, and readability that should be improved.

Concern leading to status of "approval with reservations":
Documentation of species identification is insufficient. In the Data Note, it is based solely on a low-magnification image of the sequenced individual (Figure 1) and a methods statement that the specimen was collected and identified "by Liam Crowley (University of Oxford)". This is insufficient, as there is no mention of a type specimen accession, barcoding identification, or a cited reference for taxonomic expertise. At present, identification of wild-caught individuals for future research is not reproducible, unless Liam Crowley is a resource available on demand to the scientific community! Also, please ensure that the Figure 1 image is of the highest quality, provide a scale bar (should be possible, given the inclusion of a standard collection tube in the image), and crop unneeded white space to maximize size of the insect within the image.

Readability issues, including technical documentation, general clarity, and completeness (rigor) of scientific presentation:
The final paragraph of the Background states that the karyotype consists of ten autosomes and the X and Y sex chromosomes, and this is supported by identification of seven unique chromosomes in the current assembly (Figure 5), but it would be easier to reconcile these facts if the Background comment were reworded slightly (suggestion in all caps) to "the diploid (2N) karyotype of A. haemorrhoidale to be 12, comprising FIVE AUTOSOMAL PAIRS and two sex chromosomes…". Also, include the year immediately after the author names for clear attribution at the beginning of this sentence, particularly since multiple references are cited together at the end.
In the "Sequencing" section, please specify reagent (chemistry) versions for the HiFi, 10X, and Hi-C work, similar to the software details provided in Table 3. Who is the manufacturer of the Arima kit?
The presentation of Figures 2-5 is inadequate for a non-bioinformatic audience, such as entomologists and molecular geneticists. In-text mention is confined to a single stub, batch parenthetical citation early in the "Genome sequence report" section. The figures and their legends are devoid of biological information and apparently only present pipeline-generated statistical visualization features with no wider context or attempt to customize appearance in a fashion appropriate to this species' assembly values. A few sentences in the main text should make clear what each figure presents. Even if the figures themselves may have interactive online versions, they should be annotated and intelligible in the Data Note itself, with consistent and legibly sized in-figure text legends and text labels that are not merely pipeline designations (e.g., "CAKNEZ01" in Figures 3 and 4).
In Figure 2, there were many elements that I could not interpret. The lower left "Scale" is of unclear value. The upper right BUSCO pie chart is too small and disproportionately tiny compared to the main pie chart. Customarily, these elements should be distinguished as figure panels (a) and (b), not simply dumped in the same graphical space. For the BUSCO chart, there seems to be an inappropriate distinction between complete and duplicated, as the latter is a subset of the former in Table 1. Also, when multiple features of the BUSCO pie chart will be impossibly small to visually distinguish (2 features each <1%), what is the value of this chart in the first place, compared to the clear information already provided in Table 1 and with an interactive hyperlink in the Table 1 footnote? For the main chart, the multiple instances of red and grey referred to in the figure legend were ambiguous, and I could not reconcile these elements with the actual image appearance: please provide in-figure text labels. For example, I cannot tell which is the "dark grey" for chromosome lengths and which is the "pale grey spiral" for cumulative chromosome counts.
The outer blue tracks to indicate GC content may be a standard pipeline output, but it is of no clear visual value here. Rather, the Figure 2 in-figure legend for this feature should be explicitly given for GC content statistics in Table 1.
Elsewhere, there appears to be a typo, as the main text and Table 1
Compared to a simple report that GC content is 35.5%, what is the value of Figure 3? In its present state, it appears to be an unaltered pipeline output, but unhelpful for the uninitiated (see comments above).
Please amend Figure 4 so that the plot itself and in-figure text are appropriately sized. At present the chart is unnecessarily large for its complexity and content, while the in-figure text is too small for the axes and legend (and see above on using appropriate legend text labels). To make this chart useful, in the legend or main text please report what percent of the total assembly can be designated as arthropod-specific, attributable to another phylum (does the pipeline support identification of microbial content? See below), or with no assigned phylum. Also, as BUSCO focuses on protein coding genes and this assembly has not yet been formally annotated with an official gene set, please comment on what fraction of the assembly is assessed by this method (the technical reference to "buscogenes taxrule" is not informative for me).
I value Figure 5, but there is a complete absence of axis labels or heatmap color code legend for these Hi-C data. It would also be helpful if in-figure annotations indicated the X and Y chromosomes. Alternatively, if this is correct, state in the figure legend that "linkage groups corresponding to chromosomes are presented from top left to lower right in order of descending size, from Chromosome 1 to Y, as listed in Table 2, with the mitochondrial assembly not shown". Also, the link to the interactive version of this figure seems to be broken ("No such uuid"); please amend.
For the "Genome assembly" section, the main text link to "Pretext" on GitHub requires more detailed documentation and citation to indicate which exact methods were used (which versions) in a Data Note generated at a specific point in time and that has a permanent DOI. For the main text citation of Table 3, what does "where appropriate" include or exclude (or, what is the purpose of this caveat)?
Please report the actual assembly metrics associated with the analyses for contamination and mitochondrial genome assembly. As noted by Andrew J. Mongue in his peer review report (5 August 2022), it is a notable omission, after introductory information on interest in microbiomes, that there is no main text reporting on these findings. I was interested to note that the abdomen, the predominant location of microbiome components associated with reproductive and digestive anatomy, was sequenced by HiFi but excluded for Hi-C, which presumably provides a basis for at least an initial explicit statement on which fraction of the unplaced assembly (17.6 Mb) may be due to specimen heterozygosity, microbial content, or other sources.

Taxonomic corrections:
In the article Background, first paragraph, please correct the spelling of the superfamily name "Pentatomoid" (the "-to-" is missing).
The peer review report of this article by Andrew J. Mongue refers to aphids and mealybugs as "true bugs", but this is incorrect. The term "true bugs" applies to the monophyletic Heteroptera, which does include the species presented here (Acanthosoma haemorrhoidale), while aphids and mealybugs belong to the distinct lineage Sternorrhyncha, sometimes (formerly) regarded as a part of the paraphyletic Homoptera (see, for example, Figure 2 in: Panfilio et al., 2018).
The authors report the genome assembly of the hawthorn shield bug with brief background and detailed methodology of sequencing and assembly. I understand this reporting format is designed to be concise yet informative and I commend the authors on the level of rigor in describing the process from DNA extraction through to finished assembly. As this and other recent genome assembly papers have demonstrated, the combination of PacBio HiFi and HiC sequencing is an almost guaranteed success for generating highly contiguous insect genomes. Given that this approach seems to be so powerful, it is all the more important that researchers document the methodology in detail so that others can replicate this success. I appreciated the table listing tools and their versions as an example of how to effectively highlight methodological details. Likewise, the accession numbers appropriately link to data that are now available, so these resources have immediate value to the community as well.
I have only two comments for clarification and these focus on the species background context.
Firstly, the authors contrast shield bug pheromone production to that of Lepidoptera, stating that female Lepidoptera produce pheromones to attract males. While this is true of some groups within the order, many species behave much like the bugs described here: males produce pheromones to entice females (see milkweed butterflies as a specific example and generally species in which males have a hair-pencil organ). Please clarify the text on this point.
Secondly, the authors mention the potential for sequencing data to reveal symbiont relationships that the hawthorn shield bug has with bacteria. Host-microbe interactions are particularly well-studied in true bugs like aphids and mealybugs, so I can see the value in this new datapoint for comparison. As such, I was surprised that symbiont screening is never directly mentioned again in the methods or results. Perhaps this is a structural choice from the Tree of Life initiative and these data will be reported elsewhere. Given the interest in hemipteran symbionts from the research community, however, I feel it would be good to include at least a follow-up sentence directing the interested reader to the appropriate resources.
I also noticed two small grammatical errors to correct:
○ "The hawthorn shield bug…is a large Pentamoid shield bug, easily recognisable by their size" should be "easily recognisable by its size".
○ "The species is common on hawthorn…but are also found in mixed woodland" should be "but is also found in mixed woodland".

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes