The genome sequence of the clouded yellow, Colias crocea (Geoffroy, 1785)

We present a genome assembly from an individual female Colias crocea (also known as Colias croceus; the clouded yellow; Arthropoda; Insecta; Lepidoptera; Pieridae). The genome sequence is 325 megabases in span. The complete assembly is scaffolded into 32 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled. Gene annotation of this assembly on Ensembl has identified 13,803 protein-coding genes.


Background
Colias crocea (or croceus), the clouded yellow, is a butterfly found in Europe, the Middle East and North Africa. This continuously brooded migratory species visits the UK in late spring and summer, supplementing a small breeding population in the south. The larvae feed on a wide variety of leguminous plants, such as clovers (Trifolium spp.), alfalfa (Medicago sativa) and vetches (Vicia spp.). Despite recent declines, C. crocea has seen a large increase in both abundance and occurrence in the British Isles over the last 50 years (Fox et al., 2015) and is listed as Least Concern in the IUCN Red List (Europe) (van Swaay et al., 2010). A white polymorphism known as Alba (form helice) is associated with an alternative life-history strategy, in which females reallocate wing pigment resources to somatic and reproductive development. This polymorphism is associated with the insertion of a transposable element downstream of the homeobox transcription factor BarH-1 (Woronik et al., 2019). Colias crocea has 31 pairs of chromosomes, a genome size of approximately 318.6 Mb (Woronik et al., 2019), and is female heterogametic (WZ). We note the recent production of a high-quality genome assembly for C. crocea (Woronik et al., 2019), and believe the sequence described here, generated as part of the Darwin Tree of Life project, will further aid understanding of the biology and ecology of this butterfly.

Genome sequence report
The genome was sequenced from a single female C. crocea collected from Bujaruelo, Aragon, Spain (latitude 42.7, longitude -0.1) (Figure 1). A total of 68-fold coverage in Pacific Biosciences single-molecule long reads and 91-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 6 missing/misjoins, reducing the assembly length by 0.8% and the scaffold number by 13.5%.
The final assembly has a total length of 325 Mb in 33 sequence scaffolds with a scaffold N50 of 11 Mb (Table 1). Of the assembly sequence, 100% was assigned to 32 chromosomal-level scaffolds, representing 30 autosomes (numbered by sequence length) and the W and Z sex chromosomes (Figure 2–Figure 5; Table 2). The assembly has a BUSCO (Simão et al., 2015) v5.1.2 completeness of 99.0% (single 98.7%, duplicated 0.3%, fragmented 0.2%, missing 0.8%) using the lepidoptera_odb10 reference set. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
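For readers unfamiliar with the scaffold N50 statistic quoted above, the following is a minimal sketch of how it is conventionally computed: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The function name `n50` and the toy lengths are illustrative assumptions, not values or code from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Scaffolds are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the summed length; the
    length of the scaffold that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy scaffold lengths (in Mb); NOT the real C. crocea scaffolds.
toy_scaffolds = [15, 12, 11, 10, 8, 4]
print(n50(toy_scaffolds))  # prints 11
```

In the toy example the total is 60 Mb, so the threshold is 30 Mb; the three longest scaffolds sum to 38 Mb, and the third of these (11 Mb) is the N50.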

Gene annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Colias crocea assembly (GCA_905220415.1, Table 1). The annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019) and OrthoDB (Kriventseva et al., 2008). Prediction tools, CPC2

Sample acquisition and nucleic acid extraction
A female (ilColCroc2) and a male (ilColCroc3) C. crocea were collected from Bujaruelo, Aragon, Spain (latitude 42.7, longitude -0.1) by Sam Ebdon, Alex Macintosh (both University of Edinburgh), Alex Hayward and Karl Wotton (both University of Exeter). Samples were collected using a net and snap-frozen in liquid nitrogen.
DNA was extracted at the Wellcome Sanger Institute (WSI) Scientific Operations core from the thorax of ilColCroc2 using … RNA was then eluted in 50 μl RNAse-free water and its concentration assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. RNA integrity was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics read cloud sequencing libraries were constructed according … out proportionate to the nature of the materials themselves, and the circumstances under which they have been or are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of the receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible.
The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material;
• Legality of collection, transfer and use (national and international).

For this genome assembly, the authors used a combination of sequence data from an orange female, providing detailed information about the sequencing and assembly steps. The authors also used RNAseq data sampled from the thorax of a single adult male to generate an annotation for this assembly, which may limit its scope for describing both the protein-coding and noncoding features of this assembly. Overall, I agree with the previous reviewer that a genome assembly of this quality is highly beneficial (and has likely already been useful) for the study of C. crocea and for comparative research in butterfly genomics.
Critiques: I support the critiques of the first reviewer: (1) that more details should be provided about the parameters and options for software used to generate both the assembly and the annotation; and (2) that the manuscript would benefit from a more detailed explanation of the differences between this assembly and existing genomic resources.
Details are lacking about the annotation. Given that the annotation was made using RNA from a single male thorax, many important coding and noncoding transcripts may have been missing from the sample. The protein prediction tools subsequently used to improve the annotation may have filled these gaps, but I think it is important that the authors both acknowledge these limitations and include some evaluation of annotation quality. The authors could run BUSCO on the protein set or could compare orthologs between this and other well-annotated lepidopteran gene sets. Also, summary statistics (e.g., average length of protein-coding sequence) provided in Table 1 should include some measure of variation (95% CI, etc.).
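As an illustration of the last suggestion, the sketch below reports a mean together with a 95% confidence interval. It assumes a normal approximation (mean ± 1.96 × standard error), which is only one of several reasonable choices; the function name `mean_with_ci95` and the toy coding-sequence lengths are hypothetical and are not taken from the annotation.

```python
import math
import statistics


def mean_with_ci95(values):
    """Return (mean, lower, upper) for a normal-approximation 95% CI.

    The interval is mean +/- 1.96 * (sample standard deviation / sqrt(n)).
    This is an assumption made for illustration; a t-based or bootstrap
    interval may be more appropriate for small or skewed samples.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical coding-sequence lengths in bp; NOT values from Table 1.
toy_cds_lengths = [900, 1200, 1500, 1800, 2100]
mean, low, high = mean_with_ci95(toy_cds_lengths)
print(f"mean {mean:.1f} bp (95% CI {low:.1f} to {high:.1f})")
```

Gene and coding-sequence lengths are typically right-skewed, so a median with interquartile range might be a more informative companion statistic than a symmetric interval.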

Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Partly
Are the datasets clearly presented in a useable and accessible format? Yes