Polysomal mRNA Association and Gene Expression in Trypanosoma brucei [version 2; peer review: 1 approved, 3 approved with reservations]

Background: The contrasting physiological environments of Trypanosoma brucei procyclic (insect vector) and bloodstream (mammalian host) forms necessitates deployment of different molecular processes and, therefore, changes in protein expression. Transcriptional regulation is unusual in T. brucei because the arrangement of genes is polycistronic; however, genes which are transcribed together are subsequently cleaved into separate mRNAs by trans-splicing. Following pre-mRNA processing, the regulation of mature mRNA stability is a tightly controlled cellular process. While many stage-specific transcripts have been identified, previous studies using RNA-seq suggest that changes in overall transcript level do not necessarily reflect the abundance of the corresponding protein. Methods: To better understand the regulation of gene expression in T. brucei, we performed a bioinformatic analysis of RNA-seq on total, sub-polysomal, and polysomal mRNA samples. We further crossreferenced our dataset with a previously published proteomics dataset to identify new protein coding sequences. Results: Our analyses showed that several long non-coding RNAs are more abundant in the sub-polysome samples, which possibly implicates them in regulating cellular differentiation in T. brucei. We also improved the annotation of the T.brucei genome by identifying new putative protein coding transcripts that were confirmed by mass spectrometry data. Conclusions: Several long non-coding RNAs are more abundant in the sub-polysome cellular fractions and might pay a role in the regulation of gene expression. We hope that these data will be of wide general interest, as well as being of specific value to researchers studying gene regulation expression and life stage transitions in T. brucei. Open Peer Review


Introduction
Trypanosoma brucei, a protozoan parasite transmitted by the tsetse fly, causes human African trypanosomiasis (HAT) and nagana in cattle 1 . The parasite undergoes a complex lifecycle between its insect vectors and mammalian hosts 2 : Slender bloodstream form (BSF) parasite proliferate predominantly in the blood and lymph of the infected mammalian host in the first stage of the disease and the second neurological stage of the disease occurs when these parasites cross the blood-brain barrier. Some of the slender BSF parasites differentiate into non-replicative stumpy forms in the bloodstream and these are pre-adapted for transformation into replicating procyclic form in the testse vector midgut. Procyclic forms further differentiate into replicating epimastigote and then non-dividing metacyclic trypomastigote forms during parasite migration to the tsetse salivary glands. The metacyclic parasites are transferred to a new host during a bloodmeal and after differentiation into slender BSF parasites, the lifecycle in complete. The BSF and PCF parasites are the easiest to propagate in the laboratory and are the most studied.
Transcription is particularly interesting in T. brucei because the arrangement of its genes is polycistronic. Thus, RNA Polymerase II (RNA Pol II) transcribes protein-coding genes into large polycistrons containing several transcripts. However, the polycistron does not linger as it is co-transcriptionally processed into individual mRNAs 3 . The processing of the transcription unit occurs by trans-splicing coupled to cleavage of the 3´ end by the polyadenylation machinery for poly(A) addition 4,5 . During trans-splicing, a capped 39-nucleotide (nt) spliced leader (SL) mini-exon is added to the 5′ termini of mRNAs. The SL sequence was first discovered when two different VSG transcripts were found with an identical leader sequence at their 5′ ends, which was not evident in their genomic sequence 6-8 . This mini-exon is independently transcribed from a tandem array of 140-nt spliced leader (SL) RNA genes 9,10 .
Recent studies using RNA-seq have greatly improved our understanding of the T. brucei transcriptional landscape across the BSF and PCF life stages 2,11-15 . These studies have found new transcripts, many non-coding RNAs, and facilitated the correction of numerous annotations across the T. brucei genome. While several aspects of translational control have been investigated in T. brucei, there are only a few examples of polysome profile analysis that have explored the efficiency of translation between BSF and PCF parasites 12,16 . Numerous 80S ribosomes can be translating an mRNA transcript at the same time, producing so-called 'polysomes' 17 . The number of ribosomes on an mRNA generally reflects that transcript's rate of translation under given conditions 18 . Further, a particular mRNA's higher or lower than average association with ribosomes indicates the potential involvement of gene-specific regulatory mechanisms 19 .
To make a contribution to our understanding of the regulation of gene expression in trypanosomes, we investigated mRNA recruitment to ribosomes with RNA-seq of total polyA+, sub-polysomal, and poly-ribosomal mRNA purified from BSF and PCF life stages of T. brucei. Catalog Number: A6279; 1 µM TLCK Sigma Catalog Number: 90182; 1 mM PMSF Sigma Catalog Number: 10837091001; 1mg/ml cycloheximide Sigma Catalog Number: C4859). The detergent n-octylglycoside (NOG) was chosen because it does not absorb at 254 nm. The lysates were loaded on top of 10 ml sucrose (Sigma Catalog Number: S0389) gradients (5 increments, 2ml each: 10%-50% sucrose) and centrifuged for 2 h at 38,000 rpm at 4°C in a Beckman ultracentrifuge using a SW41Ti rotor. Gradients were fractionated (0.5 ml fractions) and analysed for nucleic acid content by a Nanodrop spectrophotometer at 254 nm. RNA was purified using RNeasy kits (Qiagen, Catalog Number: 74104) from pooled sub-polysome and poly-ribosomal fractions. Gradient analysis was also performed using a gradient collector (Teledyne) with continuous monitoring at 254 nm. Individual fractions were collected with a Foxy Jr. (Teledyne) fraction collector. Following collection, the RNA from each sample was purified as above and pooled according to the sub-polysomal and polysomal fractions identified in the absorbance trace.
Total RNA was extracted from bloodstream and procyclic form T. brucei using the RNeasy Mini Extraction Kit (Qiagen,Catalog Number: 74104). The protocol was carried out according to manufacturer's instructions with a few deviations for T. brucei. Cells were centrifuged for 10 min, 800 x g at room temperature, media was aspirated and the cell pellets were resuspended in buffer RLT (Qiagen, Catalog Number: 79216) and β-mercaptoethanol (Sigma Catalog Number: 444203) was added at a 1:100 dilution. One volume of 70% ethanol (Sigma Catalog Number: 51976) was added to the lysate and the mixture was transferred to the provided column. RNA was bound to the column by centrifugation for 15 sec, 10,000 x g. The column was then washed with Buffer RWI and twice with Buffer RPE (Qiagen, Catalog Number: 1018013). Following the washes, the column was transferred to a sterile (RNAse free) Eppendorf tube (Thermofisher, Catalog Number: AM12400), and the RNA was eluted in 50 µl RNase-free H 2 O (Thermofisher, Catalog Number: AM9916). The RNA concentration was then estimated from the A 260 value using a Nanodrop 2000c spectrophotometer (Thermo) with path length settings adjusted for RNA (40). Following quantitation, the purified RNA was subsequently used for RNA-seq cDNA library preparation.
Preparation of cDNA libraries for RNA-seq Total RNA, sub-polysomal, and poly-ribosomal RNA was isolated from BSF and PCF T. brucei followed by poly(A) mRNA enrichment with poly-T oligomers attached magnetic beads (Illumina). The mRNA was then fragmented into 200 nt fragments using Covaris Adaptive Focused Acoustics process with the following operating conditions: Sample volume 130 µl, duty cycle 10%, intensity 5, cycles per burst 200, processing time 60 s, water bath temperature 4°C, power mode frequency sweeping, degassing mode continuous. Fragmented mRNA was concentrated by ethanol precipitation and measured on an RNA Pico chip (Agilent 2100 Bioanalyzer). The first strand of cDNA was synthesized using reverse transcriptase (Invitrogen Life Technologies, Catalog Number: 18064-022) and random primers (Invitrogen Life Technologies, Catalog Number: 1880007) using a Omnigene thermal cycler (25°C for 10 min. 42°C for 50 min, 70°C for 15 min), followed by second strand cDNA synthesis using a Omnigene thermal cycler (16°C for 60 min), producing double-stranded cDNA (NEBNext mRNA library kit for Illumina, NEB, Catalog Number: E6100. To blunt-end the DNA fragments, an end repair reaction was performed with Klenow polymerase (NEB, Catalog Number: M0210L), T4 DNA polymerase (NEB, Catalog Number: M0203L), and T4 polynucleotide kinase (NEB,Catalog Number: M0201L). A single 3´ adenosine overhang was added to the cDNA allowing the ligation of Illumina adaptors. These adaptors contain primer sites both for sequencing and complimentary annealing onto the Illumina flow cell surface (Top adapter: 5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3' Bottom adapter 5′-GATCGGAAGAGCGGTTCAGCAGGAATGCCG AG-3'). Adaptor ligated cDNA fragments were measured on an Agilent DNA chip. The final cDNA libraries were sequenced on a HiSeq2000 (Illumina).

Bioinformatic analysis
The software versions of the packages used for the bioinformatic analysis are listed in the file named "package_versions.txt" and deposited in the zenodo repository mtinti/polysome. The FASTQ files of technical replicates were concatenated together. The forward and reverse paired-end reads of the biological replicates (B_tot: 1 to 3, B_pol: 1 to 3, B_sub: 1 to 3, P_tot: 1 to 3, P_pol: 1 to 3, P_sub: 1 to 3, where B=BSF, P=PCF, tot=total, pol=polysomal, sub=sub-polysomal) were aligned to the reference genome v46 of T. brucei clone TREU927 and 427_2018 downloaded from TriTrypDB 23 using Bowtie2 24 , with the 'very-sensitive-local' pre-set alignment option. The alignments were converted to BAM format, reference sorted and indexed with SAMtools 25 . The genome coverage of the aligned reads was extracted from the BAM files using bedtools 26 with the -bg option to output bedGraph files. Fragment counts were determined from the BAM files using featureCounts 27 with parameters: -p (pair end) -B (both ends successfully aligned) -C (skip fragments that have their two ends aligned to different chromosome) -M (count multi-mapping) -O (match overlapping features) -t transcript (count level) -g gene_id (summarization level).

Assembly of Poly A and Spliced Leader Tracks
Alignments with properly paired reads were extracted with SAMtool view using the -f 2 option and parsed with a custom python script to extract the paired reads containing the last 14 bases of the spliced leader sequence (GTGAGGCCTCGCGA) in forward or reverse complement orientation. We used the last 14 bases as they are unique 28 . The same script was used to extract reads containing poly(A) tracts of at least 10 bases that are often found at the intergenic regions of T. brucei 29 . The aligned reads were saved in BAM format and used to create genomic track coverage in bedGraph format.
Assembling T. brucei transcripts The GFF annotation file for v46 of T. brucei clone TREU927 was downloaded from TriTrypDB and converted to GTF with gffread 30 . The gene annotation file was supplemented with a recent prediction of long non-coding RNAs 31 (doi: https://doi.org /10.1101/2020.05.03.074625). Hypothetical new transcripts were predicted using Trinity 32 and Scallop 33 . First, we identified new predicted genes with Scallop that was run for each biological replicate. The scallop predictions in GTF format were filtered to include only genes in intergenic regions that did not have any overlap with previously annotated genes. To achieve this, the GTF prediction files and the GTF reference file were converted to bed format with gtf2bed and intersected using bedops 34 . The filtered regions were converted back to GTF format, merged in a set of unique prediction with StringTie 35 and added to the reference GTF file. In a second run, we used Trinity that was executed with the genome guided and jaccard clip parameters for each biological replicate. The predicted Trinity gene sequences were aligned to the TREU927 genome with gmap 36 and the GFF output files of gmap were converted to GTF with gffread 30 . From this point, the same filtering methods used for the Scallop predictions were applied to the Trinity predictions that were added to the reference GTF file. We also downloaded from GenBank 37 the genomic sequences and GFF annotation files for the entries: M94286 (maxicircle sequences), FM162566 427 VSG bloodstream form expression site 1 (BES1) locus, FM162567 427 BES2 locus and the minicircle sequences L25588, L25589, L25590, M15321. The GFF downloaded from GenBank were converted to GTF files with Biopython 38 . We also constructed a synthetic chromosome of VSG 427 gene transcripts with the sequences deposited at http://tryps.rockefeller.edu/ using the link http://129.85.245.250/Downloads/ vsgs_tb427_all_atleast150aas_cds.txt. The VSG sequences were concatenated with random DNA sequences of 50 base pairs to produce the synthetic chromosome (named fake_vsgs) and a GTF annotation file was produced. All the GTF annotation files were concatenated together as well as the gene sequences to produce a new assembly named tb927_5 (tb927_5.gtf).

Quality control
The quality of alignments were evaluated with Qualimap2 39 using the bamqc and rnaseq options. The Qualimap2 output files, and the outputs of fastp, bowtie2, Picard Mark Duplicates, SAMtools flagastat, SAMtools stats and featureCounts were aggregated with MultiQC 40 , inspected and made available at https://polysome-qc.onrender.com. Dimensionality reduction was performed with the MDS algorithm implemented in SciPy 41 after log2 transform of the read counts of the top 500 expressed gene. The length and GC content of the predicted transcripts were extracted using bedtools nuc function after converting the GTF annotation file to bed format. The GC and length content biases were assessed with the cqn package for R 42 after removing genes with low counts using edgeR 43 . RPKM values for the dataset visualization were extracted using the cqn package for R.

Dataset visualization
Zero counts were replaced by the minimum value counts column-wise. The ANOVA-like test in edgeR was used to retain genes that differ in abundance in at least one of the samples with a false discovery rate <1%.

RadViz
The RadViz function implemented in the pandas python library 44 was modified and used for the visualization. For each gene the median value of the three biological replicates was computed for each experiment (B_tot: 1 to 3, B_pol: 1 to 3, B_sub: 1 to 3, P_tot: 1 to 3, P_pol: 1 to 3, P_sub: 1 to 3). For visualization, each gene was colour coded and assigned to one of the six experiments (B_tot, B_pol, B_sub, P_tot, P_pol, P_sub) where it showed the maximum abundance value.

Clustering
The dataset was normalized raw-wise with a standard scale approach, by subtracting the minimum value and dividing by the maximum value minus the minimum value, for each gene count. The optimal number of clusters was determined with the elbow approach using the KElbowVisualizer function implemented in the yellobrick python package 45 . The dataset was divided in 4 clusters using the K-means algorithm implemented in the scikit-learn python package 46 .

lncRNAs enrichment
The first spreadsheet "Ksplice lncRNAs" in Supplemental Table 1 of doi: https://doi.org/10.1101/2020.05.03.074625 31 was used to extract the hypothetical long non-coding mRNAs. The hypergeometric test implemented in scipy stats 41 was used to compute the enrichment p-value for long non-coding genes in each cluster.

mRNA half-life
The "BS mRNA half-life (min)" and "PC mRNA half-life (min)" columns from Table S5 of Antwi et al., 2016 12 were used to extract the mRNA half-lives. The gene IDs were converted to those of version 46 of the TREU927 genome using TryTripDB.

GO term enrichment
The GO enrichment analysis was performed with the goatools python package 47 . The go-basic.obo file was downloaded with the goatools python package. The gaf association file was downloaded from TritrypDB. Enriched go term p-values were corrected with the Bonferroni option in goatools and filtered at 1% false discovery rate. For visualization, the GO terms were further filtered to include terms appearing uniquely in one of the clusters. The enriched GO terms in each cluster were sorted according to the adjusted p-value and the top-5 GO terms retained.

Identification of new protein coding genes
The Raw files described in our protein half-lives paper 48 were processed in MaxQuant with the same parameters used to compute the iBAQ values, except that the predicted amino acidic sequences from the open reading frames downloaded from TriTrypDB version 46 were used. The start and end coordinates of the identified peptides were retrieved from the peptides.txt output files and organized in bed format. The coverage values of the genomic peptide coordinates in the bed file were set to 1. The file was sorted with the sort-bed function in bedtools. We then extracted the new gene predictions from our assembled GTF file and converted them to bed format. Subsequently, we used the bedextract function in bedtools to extract the peptides mapped to new predicted transcripts. The web interface of the phobius program 49 was used to search for transmembrane domains and the web interfece ot the signalP algoritms 3.1 and 5.0 50 were used to search for signal peptides. The blast 51 searches were performed with the web interfaces implemented at the NCBI or TriTrypDB. The Clustal Omega analysis were performed with the web interfaces implemented at EMBL-EBI 52 .
Coverage Visualisation. The software versions of the packages used for the visualisation of the bedGraph files are listed in the file named "package_versions.txt" and deposited in the zenodo repository mtinti/polysome_coverage. The bedGraph files were visualized with the svist4get python package 53 .
Comparison with previous work Transcription competency. Table S5 from Antwi et al., 2016 12 was downloaded and the Ribosomes/kb on polysomes values were extracted from spreadsheet 1 (PCF) and spreadsheet 2 (BSF). Gene names were mapped to the version 46 of TREU927 genome using the gene search service at TriTrypDB 23 . Fragment counts for our dataset were determined from the BAM files using featureCounts 27 with parameters:p -B -C -M -T 8 -t CDS -f to count only reads mapped to CDS regions. The read counts were filtered for low counts and normalized using edgeR 43 . Before computing the fraction of transcripts in polysomes, the polysome read counts were divided by 0.7 and the sub-polysome read counts were divided by 0.3 to correct for the total amount of mRNA found in polysome (70%) and sub-polysomal fractions (30%) 12 . The median of the fraction of transcripts in polysomal fraction was computed for the three biological replicates of BSF and PCF life stages and compared to the values reported in Antwi et al., 2016 12 . The Pearson correlation coefficients between samples were computed with the python package pandas 41 .

Ribosome profile
The fastq files for the ribosome profile experiment were downloaded from the ENA archive 54 with accession number PRJEB4801 and processed in a similar way as reported in Vasquex et al. 2014 2 . Briefly, the fastq files for the BSF and PCF biological replicates samples were concatenated together and the Illumina adaptor sequences were trimmed with the fastp package 55 . Sequences shorter than 20 bases were removed with the fastp package 55 . Reads were aligned, counted, and normalized as described above. The aligned reads in BAM format were used to create genomic track coverage in bedGraph format.
Sub-polysome / polysome differential abundance analysis and Grumpy Like Genes Differential abundance analyses were carried out with edgeR using generalized linear models (GLM) and the correction factors provided by the cqn package. In this study, we tested the differential abundance between the sub-polysome and polysome samples of the BSF and PCF life stages. To identify the Grumpy Like genes 31 we created a third model to study the differential transcript abundance between the sub-polysomal samples (mixed model of BSF and PCF) against the subpolysomal samples (mixed model of BSF and PCF). The p-values of the test were corrected with the topTags function in R using the Benjamini-Hochberg method.
The code to reproduce the analysis pipeline and the figures, the raw data and additional python scripts used for this study are available at GitHub.

Results
In our study, cells were treated with the antibiotic cycloheximide to prevent polysome run-off during sample preparation. Cycloheximide binds to the 60S ribosomal subunit and arrests translation elongation by inhibiting release of the deacylated tRNA from the ribosome E site, thereby stalling the ribosomes on mRNA in a polysomal state 56 . The high protein content of polysomes allows them to be separated throughout a sucrose gradient according to the number of ribosomes attached to the mRNA ( Figure 1). To prepare samples for RNA-seq, cDNA libraries were generated from both total mRNA, polysomeassociated mRNA and sub-polysomal mRNA transcripts. It is important to note that our procedure enabled the libraries to be completed without PCR amplification, therefore eliminating sample bias associated with variable amplification. In all, three (1 to 3) biological and three technical replicates of total (tot), sub-polysomal (sub), and polysomal (pol) mRNA RNA-seq experiments were performed for BSF (B) and PCF (P) life stages.

Assembling a reference transcriptome
Whole transcriptome experiments offer valuable resources to detect new genes and improve gene models. For this reason, we decided to create a complete TREU927 transcriptome assembly before assigning our reads to the reference gene set. To this end, we first added a set of newly predicted genes described by Guegan, et al. 31 encoding mostly long non-coding RNA (lncRNA). Subsequently, we implemented a genome-guided approach to annotating new genes discovered from our dataset. This strategy consisted of mapping reads along the reference genome, followed by gene prediction (Methods). This final step allowed us to extend the number of transcribed genome loci from 11725 to 15743 (an increase of 34%).
To aid the visualization of the newly predicted genes and assess the quality of the transcript boundaries, we extracted from all the samples the reads containing a spliced leader sequence and poly(A) genomic tract of >9 bases. The spliced leader sequence is present at the beginning of all mature trypanosome transcripts and can be used to determine the exact 5' boundary of the gene. The poly(A) genomic tracts are often present in intergenic regions and can help to determine the 3' gene boundaries. It is useful to note that the script we used to select the poly(A) genomic tracts also selects reads with poly(A) mRNA tails. However, we did not distinguish between poly(A) mRNA tails or poly(A) genomic tracts as both are useful to define gene boundaries 57 .

Quality control
The RNA-seq reads were aligned to the TREU927 reference genome, and the numbers of fragments mapping to our assembled gene list were computed. We evaluated the quality of our dataset at several levels. First, we used multidimensional scaling (MDS) to visualise the similarity between the different RNA-seq samples ( Figure 2). The MDS analyses confirmed the high reproducibility of all biological replicates that cluster closer together within each sample type than between sample types. We also evaluated the reliability of our dataset by visualizing the read coverage of the only two known intron-containing genes in the T. brucei genome: Tb927.3.3160 (Nuclear poly(A) polymerase 1) and Tb927.8.1510 (ATPdependent RNA helicase DBP2B). The visualisations in Figure 3 and Figure 4 show that the intron containing regions of the two genes have a sudden drop with little or no coverage in the polysomal samples (yellow tracks) relative to the total and sub-polysomal samples (blue and purple tracks) in both the BSF and PCF samples.

Comparison with previous work
We compared our results with those of Antwi et al. 12 that describes a similar approach to that used in this study. We first analyzed our dataset by counting reads aligned to coding sequence regions (CDS) only. After read normalization in edgeR, we computed the percentages of transcripts bound by the polysome for each gene. These values were then corrected for the relative proportions of mRNA found in polysome fractions (70%) and sub-polysomal fractions (30%) to mimic the analysis pipeline described in 7 as closely as possible. The percentage of transcripts bound by polysomes from our study was then compared with those reported in Table S1 of 7 ( Figure 5). The comparison showed a stronger correlation in the PCF life stage (R 2 =0.92) than in the BSF life stage (R 2 =0.71).

Bias correction
Before further analysing our datasets, we examined GC content bias and length bias in our read counts as those have been reported to affect RNA-seq experiments 42,58,59 . The data in ( Figure 6 and Figure 7) show that GC content and length biases affect our dataset in a sample-specific way, especially between the sub-polysomal samples (green) relative to the polysomal (blue) and total (grey) samples. We corrected the read counts for these biases and normalized the read counts using the conditional quantile normalization method implemented in the cqn R package 42 .

Differential abundance analysis
Before proceeding to the differential abundance analysis, we visualized the whole dataset with a dimensionality reduction technique. Using an ANOVA-like test implemented in edgeR, we found transcripts that are differentially abundant between any of the groups, without biasing before-hand which groups might be different. We then took the median value of each biological replicate for each gene and applied a radial visualization plot that uses a polar coordinate system to visualize the dataset. Sample types are like hours on the clock-face (i.e. related to the angle of the polar coordinate system) and the orthogonal axis (i.e. the distance from the centre) relates to the     To try to gain insight into this signature, we performed a cluster analysis. We first determined the optimal number of clusters (n=4) with the elbow approach (Figure 9), and then applied a k-means clustering algorithm to divide our dataset into 4 clusters (Extended data: Table 3 60 ). Cluster 1: gene transcripts that are more abundant in PCF versus BSF samples. Cluster 2: gene transcripts that are more abundant in BSF and PCF sub-polysomal samples than in all other samples. Cluster 3: gene transcripts that are more abundant in BSF versus PCF samples. Cluster 4: gene transcripts that are less abundant in BSF and PCF sub-polysomal samples than in all other samples. This clustering analysis confirmed the presence of a group of genes (Cluster 2, n= 3356) with the highest read counts in the BSF and PCF sub-polysomal samples relative to all other samples ( Figure 10).
To assign possible biological functions to the clusters, we performed a GO-term enrichment analysis across the four clusters. We only retained GO terms that were enriched in at most two of the four clusters, and those with false discovery rates of >1%. This analysis, visualized in (Figure 11), showed that the transcripts in Cluster 2 (C2) are highly enriched for those encoding mRNA binding proteins. Interestingly, the average half-life of the transcripts in Cluster 2 are the shortest in the BSF and the PCF life stages, when compared to the mRNA half-lives of the transcripts in the other clusters (Table 1 and Figure 12). We then asked if any of the clusters are particularly enriched for the long non-coding genes identified in 31 and found they are mostly enriched in Cluster 2 (Table 2). Cluster 2 also has the highest number of two other classes of non-coding mRNAs: the snoRNAs and H/ACA-like snoRNAs ( Table 2).
We then focused on the analysis of the transcripts enriched in the sub-polysomal samples. We created two models to test for differential abundance between the sub-polysomal and polysomal samples in the BSF (Extended data: Table 4 60 ) and PCF (Extended data: Table 5 60 ) life stages. As illustrated in Figure 13, several long non-coding genes are more abundant in the sub-polysomal samples with respect to the polysomal samples, including the grumpy transcript ( Figure 14) that sits at the 5' end of RBP7A (Tb927.10.12080) and has been shown to be important for the progression from the slender form to the stumpy form of the parasite 26 . The grumpy transcript made us wonder which other sub-polysome enriched transcripts might have a lncRNA at the 5' end and be associated with this life stage transition. We identified two candidate genes: RBP10 (Tb927.8.2780) with the lncRNA KS17gene_1750a ( Figure 15) and REG9.1 (Tb927.11.14220) with the lncRNA KS17gene_4296a (Figure 16), both of which have been previously associated with the transition between the BSF and PCF life stages 61,62 . We then asked which other genes, involved in RNA processing, have a 5' lncRNA that is more abundant in the sub-polysome samples. To this aim, we created a new model to test for differential abundance between sub-polysomal and polysomal samples (mixed BSF and PCF). After extracting the lncRNAs that are more abundant in the subpolysomal samples (grumpy-like lncRNAs) we then reported the genes at their 3' ends and describe these as "grumpy-like genes". Finally, we intersected these grumpy-like genes with a list of 322 potential post-transcriptional regulators in T. brucei 63 to identify 31 new proteins that may contribute to T. brucei life stage transitions (Extended data: Table 6 60 ).

Identification of new protein coding genes
We were interested in evaluating whether there is proteomic evidence for the new hypothetical protein-coding genes identified in our dataset. To achieve this, we analyzed our protein half-life dataset 64 by running MaxQuant with a database of open reading frames (ORFs) for the TREU927 genome downloaded from TryTripDB. The genomic coordinates of the ORF peptides were then intersected with the genomic coordinates of the hypothetical new protein coding genes. Further, we filtered out unannotated genes in the main 11 chromosomes of T. brucei, without a splice leader site and without ribo-seq data. This analysis led to the identification of 11 new hypothetical protein coding genes reported in Extended data: Table 7  the protein product returned low percentage identity (<50%) matches with genes in T. grayi (DQ04_00451000), T. conorhini (accession: XP_029230363.1) and T. theileri (TM35_ 000192250). Synteny analysis of the TRY.375 locus performed at TryTripDB revealed another gene (TevSTIB805.7.3380) in the T. evansi genome with 100% homology with the predicted TRY.375 gene product. Also, a tblastn search of the TRY.375 predicted gene identified 2 more hits with 100% homology in the genomes of T. brucei 427_2018, 427 (Tb427) and T. brucei gambiense DAL972 (Tbg972), corresponding to unannotated regions in these genomes. We propose that TRY.375 is a novel transmembrane-protein coding gene present in T. brucei and T. evansi.

MSTRG.94
Peptides corresponding to potential new gene MSTRG.94 ( Figure 18) mapped with high confidence to 6 regions within the span Tb927_02_v5.1:592500..617500. Investigation of this section of chromosome 2 revealed it is highly repetitive and contains 6 copies of a 65kDa Invariant Surface Glycoprotein gene with a pairwise protein Identity computed by Clustal Omega between 73% and 99%. This suggests that what had previously been assumed to be untranslated intergenic regions of DNA may in fact encode for protein. SL and PA mapping allowed us to define 6 MSTRG.94 gene boundaries as described in Extended data: Table 7 60 . All of these putative gene regions were identical and we have designated them MSTRG.94_1 through MSTRG.94_6. The putative MSTRG.94 genes contain a predicted ORF of 378 base pairs encoding for a protein of 125 amino acids (14.17 kDa). The predicted protein does not contain any transmembrane domains or signal peptides. A tblastn search with the ORF sequence against trypanosome genomes revealed matching sequences in the genomes of Tb427 and T. evansi. As Figure 9. Determining the optimal number of clusters. A plot of the number of clusters tested (K) on the x-axis and the clustering distortion score (the sum of square distances from each point to its assigned cluster center) on the y-axis. The figure also displays the amount of time needed to train the clustering model per K as a dashed green line. If the line chart resembles an arm, then the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point (https://www.scikit-yb.org/en/latest/ api/cluster/elbow.html).
with Tb927, the sequences appear between copies of 65kDa Invariant Surface Glycoprotein genes in chromosome 2. In Tb427 the sequences are annotated as hypothetical proteins and in T. evansi as unspecified products, while in Tbg972 the regions are unnanotated.

Discussion
In this paper we present RNA-seq data on the total, polysomal and sub-polysomal mRNA content of T. brucei bloodstream and procyclic form life stages. Comparison with similar experiments performed earlier by Antwi et al. 12 showed better experimental reproducibility between PCF life stage data (r 2 =0.9) than BSF life stage data (r 2 =0.7) ( Figure 5). A possible source of discrepancy may be different cell culture protocols for the BSF cells. Nevertheless, our dataset showed very good reproducibility (Figure 2), and we were successful in identifying a pool of efficiently transcribed and spliced mRNAs. This is demonstrated by the virtual absence of reads covering the By using clustering and dimensionality reduction techniques (Figure 8 and Figure 10), we were able to identify the sub-polysome samples as the most diverse in our dataset. In particular, we found the presence of several long non-coding mRNAs in the sub-polysomal fractions of both BSF and PCF samples (Extended data: Table 3 60 ). This class of mRNA has been overlooked in T. brucei until recently, and one particular long non-coding mRNA (grumpy) has been shown to regulate the transformation from the slender to the stumpy life stage of the parasite 31 . Interestingly, the RNA-binding protein RBP10 (Tb927.8.2780), that has been shown to bind mRNAs and promote their degradation, acts as a molecular switch whereby RBP10 expression in BSF causes differentiation to PCF, while the overexpression in PCF causes differentiation to BSF 61 . While RBP10 itself was not found in our sub-polysome enriched transcript list, the lncRNA (KS17gene_1749a) which is predicted to be at the 5' end of RBP10 may have a similar regulatory function as the grumpy lncRNA transcript. Therefore, we have assembled a list of other putative RNA regulatory proteins that may be of interest for those working on T. brucei life stage transitions (Extended data: Table 6 60 ).
It should be noted that several lncRNAs are located next to other annotated genes in the genome. We speculate that those lncRNAs might represent transcript isoforms of genes that are  actively regulated; the lncRNAs themselves might be retained and/or removed from a transcript to promote/decrease transcript stability. The excised lncRNA might then be targeted for degradation, and this could be the reason why the lncRNAs are highly abundant in the sub-polysomal fraction. This hypothesis might also explain why the shortest half-life mRNAs are more abundant in the sub-polysome fractions. Once a lncRNA has been excised from the parent transcript, both the lncRNA and the parent transcript may be targeted for degradation. In any case, we anticipate that the study of lncRNAs transcripts that are more abundant in the sub-polysomal fraction may uncover new mechanisms of transcript stability and regulation in T. brucei. To facilitate such studies and to shortlist candidate genes for further experiments, we provided a table of protein coding genes involved in mRNA processing with the closest lncRNA that is more abundant in the subpolysome fractions (Extended data: Table 5 60 ).
Another class of RNA we found to be enriched in the sub-polysomal fractions are snoRNAs. The presence of this class of RNA in the sub-polysomal fraction might also be explained by a degradation mechanism. For example, snoRNAs guide the peculiar trypanosome rRNA maturation events, facilitating the methylation and pseudouridylation modification of rRNA 66,67 . Because polyadenylation by snoRNA is a way of marking the RNA for degradation in yeast and humans 68,69 , it   is possible that a similar mechanism acts in T. brucei, and that our poly-A enrichment step has captured this class of RNAs before they have been targeted to the exosome for degradation 70 .
Finally, we hope that our dataset will be useful for the annotation of the T. brucei genome, as demonstrated by the identification of 30 new hypothetical protein-coding genes.

Data availability
Underlying data All FASTQ files data are deposited at the NCBI SRA database 71 under the bioproject accession number PRJNA634997 Analysis pipeline, links to the raw data and code used to generate the paper figures are available at https://github.com/ mtinti/polysome, reproducible using the mybinder badge in GitHub and archived in Zenodo.
Zenodo: mtinti/polysome: Fix Table 6    The code and the data used to generate the paper figures that visualise the RNA-seq coverage are available at https://github. com/mtinti/polysome_coverage, https://github.com/mtinti/polysome, reproducible using the mybinder badge in github and archived in zenodo. The QC output is avaiable at github https://github.com/mtinti/ polysome_qc, visualizable at https://polysome-qc.onrender.com and archived in zenodo.  The table also reports the predicted cluster identification number (label), a binary column reporting whether the gene is identified or not in the (is_ks), the gene description (desc), a binary column reporting whether the gene is annotated as an H/ACA-like snoRNA, a binary column reporting whether the gene is annotated as a snoRNA and a binary column reporting whether the gene is annotated as non-coding (Noncoding) RNA.
· Table  4. Polysome/sub-polysome transcript differential abundance in BSF cells. Comparison between the polysome and sub-polysome samples in the bloodstream form life stage: logFC, the log fold-change for each gene in the two groups being compared. logCPM, the log-average abundance for each gene in the two groups being compared. LR, likelihood ratio statistic. PValue, exact p-value for differential expression test. FDR, the p-value adjusted for multiple testing with the Benjamini-Hochberg method (false discovery rate).
· Table 5. Polysome / Sub-polysome Transcript Differential Abundance in PCF. Comparison between the polysome and subpolysome samples in the procyclic form life stage the: logFC, the logabundance ratio, i.e. fold change, for each gene in the two groups being compared; logCPM, the logaverage concentration/abundance for each gene in the two groups being compared; LR, likelihood ratio statistics; PValue, exact p-value for differential expression test; FDR, the p-value adjusted for multiple testing with the Benjamini-Hochberg method.
· Table 6. Grumpy-Like lncRNA Genes. Comparison between the polysome and sub-polysome samples for the 'grumpy-like' lncRNAs (gene_ks) reporting the: logFC, the log fold-change for each gene in the two groups being compared. logCPM, the log-average abundance for each gene in the two groups being compared. LR, likelihood ratio statistic. PValue, exact p-value for differential expression test. FDR, the p-value adjusted for multiple testing with the Benjamini-Hochberg method. The table also shows the predicted affected genes (gene_sensitive ) at the 3' of the 'grumpy-like' genes and the description (desc) of the affected gene and the regulation type (reg_ type) reported in 63 for potential post-transcriptional regulators in T. brucei.

Open Peer Review
The manuscript by Tinti et al. sought to generate profiles of mRNAs associated with polysomal and non-polysomal fractions from the parasitic protozoa "Trypanosoma brucei"; polysomal mRNAs indicating those being actively translated and used for protein synthesis. The study compares the profiles seen for the two major life forms of "T. brucei" (bloodstream and procyclic) using as biological material fractions of cycloheximide-treated cells. Pooled samples consisting of polysomal and non-polysomal fractions, separated by sucrose gradient, were submitted to RNA extraction, cDNA synthesis and next generation sequencing in order to qualitatively and quantitatively define mRNAs (or polyadenylated RNAs) bound or not bound to the polysomes. In all, the authors investigated three different samples for each parasite life form: total polyA+ mRNAs, sub-polysomal mRNAs and polysomal mRNAs. A large range of bioinformatic tools were used to analyze the results and these were also compared with previous published data.
The data generated by the reported research, with the comparison between polysomal and sub-polysomal mRNAs from the two different life forms, expands on previous works assessing polysome bound mRNAs in T. brucei, increases the number of experimentally validated and mapped mRNAs and identifies and investigates non-coding mRNAs in the sub-polysomal fraction, with possible roles in regulation of specific messages. It constitutes an important tool for a large number of investigators working on mRNA translation and gene expression regulation in these and related protozoa. The approach is especially relevant considering that these are eukaryotes with very peculiar mechanisms for regulation of gene expression, with previous evidence highlighting a strong role for this regulation during mRNA translation, therefore being model organisms for mechanisms targeting regulation of translation. Nevertheless, for publication, the manuscript needs improvements regarding several issues, as detailed below.

Abstract:
The abstract should be thoroughly revised with an inclusion of more relevant results. As detailed further below, a focus only on the presence of long non-coding RNA in the sub-polysomal fraction (stated twice) is misleading since these should be mainly found in this fraction to begin with. More relevant results include: the proper identification/confirmation of long non-coding, polyadenylated RNAs (with the 5' SL), with substantial number of reads and their mapping, highlighting possible roles regarding the regulation of specific protein coding genes (it is not clear by the text to which extent novel lncRNAs were identified, but if so this has to be highlighted as well as the association reported between lcnRNAs and neighboring protein coding transcripts); the substantial increase of genome loci found to be transcribed, as reported in the main text; and the identification of novel hypothetical proteins.

Methods:
The approach applied in the manuscript was based on several bioinformatic tools and computational steps. Some of the tools used are quite old, such as GMAP, but more relevant was the use of several computational tools and analyses without a proper justification for their use. For example, why did the authors decide to use two different de novo assemblers (Trinity and Scallop) for the transcriptomic data? A consensus between the transcript predictions from both tools and including prior predictions was then expected, but it is not clear if this is what the authors did and should be clarified. At the end of the "de novo" approach, a new genome annotation file was generated, but it is not clear if it was used for the final read counts using the "featureCounts" tool and this also needs clarification. Regarding the description of the cDNA library preparation, important details such as read length and which type of library was employed cannot be found in the methods section. Base on the text it is understood that the libraries used were paired-end, but then for quality control, the authors mentioned they used RPKM as normalization step, which is applicable to single-end libraries. These details need to be better clarified.
An important point of concern is the use of two different strains, one for each of the two T. brucei life forms. No considerations or comments are made on how this can impact on the differences found between the same fractions from the two different life forms investigated. To what extent could the differences seen between the two life forms, and related to the results from figure 5, could be associated with the use of different strains?

Results:
The very large number of figures can be a distraction and keep the reader from focusing on what is important. The authors should consider reducing those. Some suggestions: Figures 1 is not necessary; for Figures 6 and 7, only one representative figure could be kept, or both could be removed altogether; Figure 9 also does not need to be shown.
For validation purposes, wouldn´t it be relevant mentioning or showing the profile of known, stage specific genes, which would preferentially be present in the polysomal fractions of either bloodstream or procyclic forms? For instance, known surface antigens? precursor or maybe nuclear mRNAs within the sub-polysomal fractions that need to be considered somehow?
A relevant observation from the results shown in Figure 4 is the overlap in the UTRs from the Tb927.8.1500 and Tb927.8.1510. How unique is this? Likewise, in Figure 3, for the Tb927.3.3160 gene, its intron might be associated with poly-A tracts. Again, what does this mean and is it seen elsewhere? These issues should be considered in the text.
The authors mention an enrichment of lncRNAs in cluster 2, composed mainly by sub-polysomal mRNAs from bloodstream and procyclic forms. Biologically speaking, isn't that to be expected, as no ribosomes should be attached to non-translatable lncRNAs? The authors propose a mechanism where a lncRNA would be part of a transcript, and its presence or absence could define different transcript isoforms. But which isoform would be translated, with or without the lncRNA segment? In this case would the lncRNA segment be considered a long non-coding RNA? Why would the lncRNA have a Spliced Leader and poly-A, as indicated by the results? These issues were discussed only superficially in the Discussion.
The authors also proposes that the location of the segments encoding several lncRNA might indicate a role in regulation of neighboring genes. What is the basis for this? Has it been shown elsewhere or in Tryps? It seems quite speculative and not much related to this is seen in the discussion.
Finally, it would be nice to have some data regarding the abundance of different functional classes of mRNA coding proteins when forms and fractions were compared, however, there is nothing on that in the discussion.

Discussion:
Regarding the discussion, the work has several interesting results, but few of them are discussed sufficiently. For instance, the results mention a 34% increment in the number of transcribed loci, but there is no consideration regarding this very relevant results in the discussion, and no details can be found regarding function and curation of these loci.

Minor Points:
What is the difference between PCA and MDS? Do the axes have weights? This can change the interpretation of the result. Why didn't authors use PCA? 1.
In the methods, in the sentence "Before computing the fraction of transcripts in polysomes, the polysome read counts were divided by 0.7 and the sub-polysome read counts were divided by 0.3 …", shouldn't the counts be multiplied by 0.7 or 0.3, instead of divided by? 2.
The following sentence is not understandable and needs to be clarified: "To identify the Grumpy Like genes we created a third model to study the differential transcript abundance between the sub-polysomal samples (mixed model of BSF and PCF) against the subpolysomal samples (mixed model of BSF and PCF)" present at "Sub-polysome / polysome differential abundance analysis and Grumpy Like GenesSub-polysome / polysome differential abundance analysis and Grumpy Like Genes" topic. 3.
The first sentences of the Results section explaining in detail the use and properties of cycloheximide are not necessary and can be shortened since the cycloheximide use for polysomal gradients is common.

4.
The statement "The poly(A) genomic tracts are often present in intergenic regions and can help to determine the 3' gene boundaries" needs a reference.

5.
In various Figure   For Figure 10, review the order that the results are shown. It makes no sense having the two different sub-polysomal samples followed by the polysomal and total samples from procyclics and then polysomal and total samples from bloodstream. Either you alternate between each of the procyclic and bloodstream samples or show all procyclic and then all bloodstream samples.

8.
The following sentence needs to be modified "The grumpy transcript made us wonder which other sub-polysome enriched transcripts might have a lncRNA at the 5' end….". Based on the "Grumpy" example, it refers to polysome enriched transcripts which have a subpolysome enriched lncRNA at their 5' end.
In the sentence "Synteny analysis of the TRY.375 locus performed at TryTripDB revealed another gene (TevSTIB805.7.3380) in the T. evansi genome with 100% homology with the predicted TRY.375 gene product.", the homology concept is misused as it is a binary concept bringing the idea of evolutionary relationship, so you cannot have degree of homology between two loci of different species. Either they are homologs or re not. The degree of similarity or identity should be informed.

11.
The novel MSTRG.94 gene and neighboring sequences seem to be transcribed only in Bloodstream cells and this should be highlighted.

12.
In the Discussion, the statement "This is demonstrated by the virtual absence of reads covering the intron regions of the two experimentally validated intron containing genes" is not valid for the sub-polysomal transcripts and should be revised.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Partly identified 11 novel protein coding genes. Since both NGS and mass spectrometry rely heavily on databases, I find this manuscript a helpful contribution to the continuous curation of the T brucei genome. I also celebrate how detailed the Methods section is. I am supportive of publication, though make some suggestions for possible improvement of the manuscript.

Major criticism:
lncRNAs: The author found that several lncRNA are in close proximity (typically at the 5'-end) to genes associated with the transition between the BSF and PCF life stages or potential posttranscriptional regulators. Is this statically significant? How does it compare to different functional gene categories or among developmentally vs non-developmentally-regulated genes? The authors show the FC between different fractions but I would like to see how levels of the protein-coding gene and its proximal lncRNA vary in PF vs BF. For instance, what is going on with the lncRNA immediately upstream of the RBP10 gene in PF? Since it seems to be embedded into the 3'UTR of the upstream gene (Tb927.8.2770), is its level independently regulated or follow the Tb927.8.2770 pattern? Most of the lncRNAs are found in subpolysomal fractions. For the nuclear grumpy it is expected. However, RBP7A seems to be cytosolic (also RBP10 for which a subpolysomal lncRNA is apparently linked), so how does the author imagine the crosstalk between the coding genes and their proximal lncRNAs? 1.
In the Discussion, the author hypothesized that the excised lncRNA might be targeted for degradation, and this could be the reason why the lncRNAs are enriched in the subpolysomal fractions. Fig 15 shows that the lncRNA present upstream of the RBP10 coding region is indeed detected in the ribosome profiling data in both PF and BF stages (although no peptides detected). In human cells, cytoplasmic lncRNAs may indeed be recruited to ribosomes for degradation (PMID: 27090285). 1 Can this also apply for T. brucei and what is observed are lncRNA that are for instance either nuclear (like grumpy) or on the way to ribosomes? The authors should discuss all the possible options and do comment about the presence of lncRNA-derived reads on ribosome profiling data.

2.
Novel proteins: Are the novel proteins developmentally regulated? Could the authors go back to the RITseq from Alsford et al. and check whether those novel genes are required for proliferation?
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.