Polysomal mRNA Association and Gene Expression in Trypanosoma brucei

Background: The contrasting physiological environments of Trypanosoma brucei procyclic (insect vector) and bloodstream (mammalian host) forms necessitates deployment of different molecular processes and, therefore, changes in protein expression. Transcriptional regulation is unusual in T. brucei because the arrangement of genes is polycistronic; however, genes which are transcribed together are subsequently cleaved into separate mRNAs by trans-splicing. Following pre-mRNA processing, the regulation of mature mRNA stability is a tightly controlled cellular process. While many stage-specific transcripts have been identified, previous studies using RNA-seq suggest that changes in overall transcript level do not necessarily reflect the abundance of the corresponding protein. Methods: To better understand the regulation of gene expression in T. brucei, we performed a bioinformatic analysis of RNA-seq on total, sub-polysomal, and polysomal mRNA samples. We further cross-referenced our dataset with a previously published proteomics dataset to identify new protein coding sequences. Results: Our analyses showed that several long non-coding RNAs are more abundant in the sub-polysome samples, which possibly implicates them in regulating cellular differentiation in T. brucei. We also improved the annotation of the T.brucei genome by identifying new putative protein coding transcripts that were confirmed by mass spectrometry data. Conclusions: Several long non-coding RNAs are more abundant in the sub-polysome cellular fractions and might pay a role in the regulation of gene expression. We hope that these data will be of wide general interest, as well as being of specific value to researchers studying gene regulation expression and life stage transitions in T. brucei.


Introduction
Trypanosoma brucei, a protozoan parasite transmitted by the tsetse fly, causes human African trypanosomiasis (HAT) and nagana in cattle 1 . The parasite undergoes a complex lifecycle between its insect vectors and mammalian hosts 2 : Slender bloodstream form (BSF) parasite proliferate predominantly in the blood and lymph of the infected mammalian host in the first stage of the disease and the second neurological stage of the disease occurs when these parasites cross the blood-brain barrier. Some of the slender BSF parasites differentiate into non-replicative stumpy forms in the bloodstream and these are pre-adapted for transformation into replicating procyclic form in the testse vector midgut. Procyclic forms further differentiate into replicating epimastigote and then non-dividing metacyclic trypomastigote forms during parasite migration to the tsetse salivary glands. The metacyclic parasites are transferred to a new host during a bloodmeal and after differentiation into slender BSF parasites, the lifecycle in complete. The BSF and PCF parasites are the easiest to propagate in the laboratory and are the most studied.
Transcription is particularly interesting in T. brucei because the arrangement of its genes is polycistronic. Thus, RNA Polymerase II (RNA Pol II) transcribes protein-coding genes into large polycistrons containing several transcripts. However, the polycistron does not linger as it is co-transcriptionally processed into individual mRNAs 3 . The processing of the transcription unit occurs by trans-splicing coupled to cleavage of the 3´ end by the polyadenylation machinery for poly(A) addition 4,5 . During trans-splicing, a capped 39-nucleotide (nt) spliced leader (SL) mini-exon is added to the 5′ termini of mRNAs. The SL sequence was first discovered when two different VSG transcripts were found with an identical leader sequence at their 5′ ends, which was not evident in their genomic sequence 6-8 . This mini-exon is independently transcribed from a tandem array of 140-nt spliced leader (SL) RNA genes 9,10 .
Recent studies using RNA-seq have greatly improved our understanding of the T. brucei transcriptional landscape across the BSF and PCF life stages 2,[11][12][13][14][15] . These studies have found new transcripts, many non-coding RNAs, and facilitated the correction of numerous annotations across the T. brucei genome. While several aspects of translational control have been investigated in T. brucei, there are only a few examples of polysome profile analysis that have explored the efficiency of translation between BSF and PCF parasites 12,16 . Numerous 80S ribosomes can be translating an mRNA transcript at the same time, producing so-called 'polysomes' 17 . The number of ribosomes on an mRNA generally reflects that transcript's rate of translation under given conditions 18 . Further, a particular mRNA's higher or lower than average association with ribosomes indicates the potential involvement of gene-specific regulatory mechanisms 19 .
To make a contribution to our understanding of the regulation of gene expression in trypanosomes, we investigated mRNA recruitment to ribosomes with RNA-seq of total polyA+, sub-polysomal, and poly-ribosomal mRNA purified from BSF and PCF life stages of T. brucei. Gradients were fractionated (0.5 ml fractions) and analysed for nucleic acid content by a Nanodrop spectrophotometer at 254 nm. RNA was purified using RNeasy kits (Qiagen, Catalog Number: 74104) from pooled sub-polysome and poly-ribosomal fractions. Gradient analysis was also performed using a gradient collector (Teledyne) with continuous monitoring at 254 nm. Individual fractions were collected with a Foxy Jr. (Teledyne) fraction collector. Following collection, the RNA from each sample was purified as above and pooled according to the sub-polysomal and polysomal fractions identified in the absorbance trace.
Total RNA was extracted from bloodstream and procyclic form T. brucei using the RNeasy Mini Extraction Kit (Qiagen, Catalog Number: 74104). The protocol was carried out according to manufacturer's instructions with a few deviations for T. brucei. Cells were centrifuged for 10 min, 800 x g at room temperature, media was aspirated and the cell pellets were resuspended in buffer RLT (Qiagen, Catalog Number: 79216) and β-mercaptoethanol (Sigma Catalog Number: 444203) was added at a 1:100 dilution. One volume of 70% ethanol (Sigma Catalog Number: 51976) was added to the lysate and the mixture was transferred to the provided column. RNA was bound to the column by centrifugation for 15 sec, 10,000 x g. The column was then washed with Buffer RWI and twice with Buffer RPE (Qiagen, Catalog Number: 1018013). Following the washes, the column was transferred to a sterile (RNAse free) Eppendorf tube (Thermofisher, Catalog Number: AM12400), and the RNA was eluted in 50 µl RNase-free H 2 O (Thermofisher, Catalog Number: AM9916). The RNA concentration was then estimated from the A 260 value using a Nanodrop 2000c spectrophotometer (Thermo) with path length settings adjusted for RNA (40). Following quantitation, the purified RNA was subsequently used for RNA-seq cDNA library preparation.

Preparation of cDNA libraries for RNA-seq
Total RNA, sub-polysomal, and poly-ribosomal RNA was isolated from BSF and PCF T. brucei followed by poly(A) mRNA enrichment with poly-T oligomers attached magnetic beads (Illumina). The mRNA was then fragmented into 200 nt fragments using Covaris Adaptive Focused Acoustics process with the following operating conditions: Sample volume 130 µl, duty cycle 10%, intensity 5, cycles per burst 200, processing time 60 s, water bath temperature 4°C, power mode frequency sweeping, degassing mode continuous. Fragmented mRNA was concentrated by ethanol precipitation and measured on an RNA Pico chip (Agilent 2100 Bioanalyzer

Bioinformatic analysis
The software versions of the packages used for the bioinformatic analysis are listed in the file named "package_versions.txt" and deposited in the zenodo repository mtinti/polysome. The FASTQ files of technical replicates were concatenated together. The forward and reverse paired-end reads of the biological replicates (B_tot: 1 to 3, B_pol: 1 to 3, B_sub: 1 to 3, P_tot: 1 to 3, P_pol: 1 to 3, P_sub: 1 to 3, where B=BSF, P=PCF, tot=total, pol=polysomal, sub=sub-polysomal) were aligned to the reference genome v46 of T. brucei clone TREU927 and 427_2018 downloaded from TriTrypDB 23 using Bowtie2 24 , with the 'very-sensitive-local' pre-set alignment option. The alignments were converted to BAM format, reference sorted and indexed with SAMtools 25 . The genome coverage of the aligned reads was extracted from the BAM files using bedtools 26 with the -bg option to output bedGraph files. Fragment counts were determined from the BAM files using featureCounts 27 with parameters: -p (pair end) -B (both ends successfully aligned) -C (skip fragments that have their two ends aligned to different chromosome) -M (count multi-mapping) -O (match overlapping features) -t transcript (count level) -g gene_id (summarization level).

Assembly of Poly A and Spliced Leader Tracks
Alignments with properly paired reads were extracted with SAMtool view using the -f 2 option and parsed with a custom python script to extract the paired reads containing the last 14 bases of the spliced leader sequence (GTGAGGCCTCGCGA) in forward or reverse complement orientation. We used the last 14 bases as they are unique 28 . The same script was used to extract reads containing poly(A) tracts of at least 10 bases that are often found at the intergenic regions of T. brucei 29 . The aligned reads were saved in BAM format and used to create genomic track coverage in bedGraph format.
Assembling txt. The VSG sequences were concatenated with random DNA sequences of 50 base pairs to produce the synthetic chromosome (named fake_vsgs) and a GTF annotation file was produced. All the GTF annotation files were concatenated together as well as the gene sequences to produce a new assembly named tb927_5 (tb927_5.gtf).

Quality control
The quality of alignments were evaluated with Qualimap2 39 using the bamqc and rnaseq options. The Qualimap2 output files, and the outputs of fastp, bowtie2, Picard Mark Duplicates, SAMtools flagastat, SAMtools stats and featureCounts were aggregated with MultiQC 40 , inspected and made available at https://polysome-qc.onrender.com. Dimensionality reduction was performed with the MDS algorithm implemented in SciPy 41 after log2 transform of the read counts of the top 500 expressed gene. The length and GC content of the predicted transcripts were extracted using bedtools nuc function after converting the GTF annotation file to bed format. The GC and length content biases were assessed with the cqn package for R 42 after removing genes with low counts using edgeR 43 . FPKM values for the dataset visualization were extracted using the cqn package for R.

Dataset visualization
Zero counts were replaced by the minimum value counts column-wise. The ANOVA-like test in edgeR was used to retain genes that differ in abundance in at least one of the samples with a false discovery rate <1%.

RadViz
The RadViz function implemented in the pandas python library 44 was modified and used for the visualization. For each gene the median value of the three biological replicates was computed for each experiment (B_tot: 1 to 3, B_pol: 1 to 3, B_sub: 1 to 3, P_tot: 1 to 3, P_pol: 1 to 3, P_sub: 1 to 3). For visualization, each gene was colour coded and assigned to one of the six experiments (B_tot, B_pol, B_sub, P_tot, P_pol, P_sub) where it showed the maximum abundance value.

Clustering
The dataset was normalized raw-wise with a standard scale approach, by subtracting the minimum value and dividing by the maximum value minus the minimum value, for each gene count. The optimal number of clusters was determined with the elbow approach using the KElbowVisualizer function implemented in the yellowbrick python package 45 . The dataset was divided in 4 clusters using the K-means algorithm implemented in the scikit-learn python package 46 . The columns were clustered as well using the clustermap function implemented in seaborn.

lncRNAs enrichment
The first spreadsheet "Ksplice lncRNAs" in Supplemental Table 1 of doi: https://doi.org/10.1101/2020.05.03.074625 31 was used to extract the hypothetical long non-coding mRNAs. The hypergeometric test implemented in scipy stats 41 was used to compute the enrichment p-value for long non-coding genes in each cluster.

mRNA half-life
The "BS mRNA half-life (min)" and "PC mRNA half-life (min)" columns from Table S5 of Antwi et al., 2016 12 were used to extract the mRNA half-lives. The gene IDs were converted to those of version 46 of the TREU927 genome using TryTripDB.

GO term enrichment
The GO enrichment analysis was performed with the goatools python package 47 . The go-basic.obo file was downloaded with the goatools python package. The gaf association file was downloaded from TritrypDB. Enriched go term p-values were corrected with the Bonferroni option in goatools and filtered at 1% false discovery rate. For visualization, the GO terms were further filtered to include terms appearing uniquely in one of the clusters. The enriched GO terms in each cluster were sorted according to the adjusted p-value and the top-5 GO terms retained.

Identification of new protein coding genes
The Raw files described in our protein half-lives paper 48 were processed in MaxQuant with the same parameters used to compute the iBAQ values, except that the predicted amino acidic sequences from the open reading frames downloaded from TriTrypDB version 46 were used. The start and end coordinates of the identified peptides were retrieved from the peptides.txt output files and organized in bed format. The coverage values of the genomic peptide coordinates in the bed file were set to 1. The file was sorted with the sort-bed function in bedtools. We then extracted the new gene predictions from our assembled GTF file and converted them to bed format. Subsequently, we used the bedextract function in bedtools to extract the peptides mapped to new predicted transcripts. The web interface of the phobius program 49 was used to search for transmembrane domains and the web interfece ot the signalP algoritms 3.1 and 5.0 50 were used to search for signal peptides. The blast 51 searches were performed with the web interfaces implemented at the NCBI or TriTrypDB. The Clustal Omega analysis were performed with the web interfaces implemented at EMBL-EBI 52 .
Coverage Visualisation. The software versions of the packages used for the visualisation of the bedGraph files are listed in the file named "package_versions.txt" and deposited in the zenodo repository mtinti/polysome_coverage. The bedGraph files were visualized with the svist4get python package 53 .
Comparison with previous work Transcription competency. Table S5 from Antwi et al., 2016 12 was downloaded and the Ribosomes/kb on polysomes values were extracted from spreadsheet 1 (PCF) and spreadsheet 2 (BSF). Gene names were mapped to the version 46 of TREU927 genome using the gene search service at TriTrypDB 23 . Fragment counts for our dataset were determined from the BAM files using featureCounts 27 with parameters:p -B -C -M -T 8 -t CDS -f to count only reads mapped to CDS regions. The read counts were filtered for low counts and normalized using edgeR 43 . Before computing the fraction of transcripts in polysomes, the polysome read counts were multiplied by 0.7 and the sub-polysome read counts were multiplied by 0.3 to correct for the total amount of mRNA found in polysome (70%) and sub-polysomal fractions (30%) 12

Ribosome profile
The fastq files for the ribosome profile experiment were downloaded from the ENA archive 54 with accession number PRJEB4801 and processed in a similar way as reported in Vasquez et al. 2014 2 . Briefly, the fastq files for the BSF and PCF biological replicates samples were concatenated together and the Illumina adaptor sequences were trimmed with the fastp package 55 . Sequences shorter than 20 bases were removed with the fastp package 55 . Reads were aligned, counted, and normalized as described above. The aligned reads in BAM format were used to create genomic track coverage in bedGraph format.
Sub-polysome / polysome differential abundance analysis Differential abundance analyses were carried out with edgeR using generalized linear models (GLM) and the correction factors provided by the cqn package. In this study, we tested the differential abundance between the sub-polysome and polysome samples of the BSF and PCF life stages. To study the effect of the lncRNAs on the surrounding genes, we mapped again the data using the annotations of the lncRNAs and the CDS of protein coding genes. We then created a third model to identify the transcripts with differential abundance between the sub-polysomal samples (BSF and PCF) against the polysomal samples (BSF and PCF). The p-values of the test were corrected with the topTags function in R using the Benjamini-Hochberg method.
For the McNemar's test of paired samples we counted A) the lncRNAs more abundant in the polysome fraction (log fold change > 0); B) the lncRNAs more abundant in the sub-polysomal fraction (log fold change < 0); C) The 5' genes respect to the lncRNAs more abundant in the polysome fraction (log fold change > 0); D) The 5' genes respect to the lncRNAs more abundant in the sub-polysome fraction (log fold change < 0). We than used the McNemar's test implemented in the statsmodels python package as [[A+C , A+D] , [B+C , B+D]]. The same test was performed using the lncRNAs and the genes at the 3' of the lncRNAs. Only genes and lncRNAs with an FDR <0.01% were considered. The regplot function of the seaborn python package was used for the LOWESS fitting.
The code to reproduce the analysis pipeline and the figures, the raw data and additional python scripts used for this study are available at GitHub.

Results
In our study, cells were treated with the antibiotic cycloheximide to prevent polysome run-off during sample preparation. Cycloheximide binds to the 60S ribosomal subunit and arrests translation elongation by inhibiting release of the deacylated tRNA from the ribosome E site, thereby stalling the ribosomes on mRNA in a polysomal state 56 . The high protein content of polysomes allows them to be separated throughout a sucrose gradient according to the number of ribosomes attached to the mRNA ( Figure 1). To prepare samples for RNA-seq, cDNA libraries were generated from both total mRNA, polysomeassociated mRNA and sub-polysomal mRNA transcripts. It is important to note that our procedure enabled the libraries to be completed without PCR amplification, therefore eliminating sample bias associated with variable amplification. In all, three (1 to 3) biological and three technical replicates of total (tot), sub-polysomal (sub), and polysomal (pol) mRNA RNA-seq experiments were performed for BSF (B) and PCF (P) life stages.

Assembling a reference transcriptome
Whole transcriptome experiments offer valuable resources to detect new genes and improve gene models. For this reason, we decided to create a complete TREU927 transcriptome assembly before assigning our reads to the reference gene set. To this end, we first added a set of newly predicted genes described by Guegan, et al. 31 encoding mostly long non-coding RNA (lncRNA). Subsequently, we implemented a genome-guided approach to annotating new genes discovered from our dataset. This strategy consisted of mapping reads along the reference genome, followed by gene prediction (see Methods). This final step allowed us to extend the number of transcribed genome loci from 11725 to 15743 (an increase of 34%).
To aid the visualization of the newly predicted genes and assess the quality of the transcript boundaries, we extracted from all the samples the reads containing a spliced leader sequence and poly(A) genomic tract of >9 bases. The spliced leader sequence is present at the beginning of all mature trypanosome transcripts and can be used to determine the exact 5' boundary of the gene. The poly(A) genomic tracts are often present in intergenic regions and can help to determine the 3' gene boundaries 29 . It is useful to note that the script we used to select the poly(A) genomic tracts also selects reads with poly(A) mRNA tails. However, we did not distinguish between poly(A) mRNA tails or poly(A) genomic tracts as both are useful to define gene boundaries 57 .

Quality control
The RNA-seq reads were aligned to the TREU927 reference genome, and the numbers of fragments mapping to our assembled gene list were computed. We evaluated the quality of our dataset at several levels. First, we used multidimensional scaling (MDS) to visualise the similarity between the different RNA-seq samples ( Figure 2). The MDS analyses confirmed the high reproducibility of all biological replicates that cluster closer together within each sample type than between sample types. We also evaluated the reliability of our dataset by visualizing the read coverage of the only two known intron-containing genes in the T. brucei genome: Tb927.3.3160 (Nuclear poly(A) polymerase 1) and Tb927.8.1510 (ATPdependent RNA helicase DBP2B). The visualisations in Figure 3 and Figure 4 show that the intron containing regions of the two genes have a sudden drop with little or no coverage in the polysomal samples (yellow tracks) relative to the total and sub-polysomal samples (blue and purple tracks) in both the BSF and PCF samples.

Comparison with previous work
We compared our results with those of Antwi et al. 12 that describes a similar approach to that used in this study. We first analyzed our dataset by counting reads aligned to coding sequence regions (CDS) only. After read normalization in edgeR, we computed the percentages of transcripts bound by the polysome for each gene. These values were then corrected for the relative proportions of mRNA found in polysome fractions (70%) and sub-polysomal fractions (30%) to mimic the analysis pipeline described in 12 as closely as possible. The percentage of transcripts bound by polysomes from our study was then compared with those reported in Table S1 of 12 ( Figure 5). The comparison showed a stronger correlation in the PCF life stage (R 2 =0.91) than in the BSF life stage (R 2 =0.74).

Bias correction
Before further analysing our datasets, we examined GC content bias and length bias in our read counts as those have been reported to affect RNA-seq experiments 42,58,59 . The data in (Figure 6 and Figure 7) show that GC content and length biases affect our dataset in a sample-specific way, especially between the sub-polysomal samples (green) relative to the polysomal (blue) and total (grey) samples. We corrected the read counts for these biases and normalized the read counts using the conditional quantile normalization method implemented in the cqn R package 42 .

Differential abundance analysis
Before proceeding to the differential abundance analysis, we visualized the whole dataset with a dimensionality reduction technique. Using an ANOVA-like test implemented in edgeR, we found transcripts that are differentially abundant between any of the groups, without biasing before-hand which groups might be different. We then took the median value of each biological replicate for each gene and applied a radial visualization plot that uses a polar coordinate system to visualize the dataset. Sample types are like hours on the clock-face (i.e. related to the angle of the polar coordinate system) and the orthogonal axis (i.e. the distance from the centre) relates to the relative abundance of a gene across the samples. This analysis showed a strong signature for the BSF and PCF sub-polysomal samples, where many transcripts showed the greatest differential abundance relative to all of the other samples (Figure 8, blue and orange gene dots).
To try to gain insight into this signature, we performed a cluster analysis. We first determined the optimal number of clusters (n=4) with the elbow approach (Figure 9), and then applied a k-means clustering algorithm to divide our dataset into 4 clusters (Extended data: Table 3 Figure 10).
To assign possible biological functions to the clusters, we performed a GO-term enrichment analysis across the four clusters. We only retained GO terms that were enriched in at most two of the four clusters, and those with false discovery rates of >1%. This analysis, visualized in (Figure 11), showed that the transcripts in Cluster 2 (C2) are highly enriched for those encoding mRNA binding proteins. Interestingly, the average half-life of the transcripts in Cluster 2 are the shortest in the BSF and the PCF life stages, when compared to the mRNA half-lives of the transcripts in the other clusters (Table 1 and Figure 12). We then asked if any of the clusters are particularly enriched for the long non-coding genes identified in 31 and found they are mostly enriched in Cluster 2 ( Table 2).
Cluster 2 also has the highest number of two other classes of non-coding mRNAs: the snoRNAs and H/ACA-like snoRNAs ( Table 2).
We then focused on the analysis of the (cluster 2) transcripts enriched in the sub-polysomal samples. We created two models to test for differential abundance between the sub-polysomal and polysomal samples in the BSF (Extended data: Table 4        and PCF (Extended data: Table 5 60 ) life stages. As illustrated in Figure 13, several long non-coding genes are more abundant in the sub-polysomal samples with respect to the polysomal samples, including the grumpy transcript ( Figure 14) that sits at the 5' end of RBP7A (Tb927.10.12080) and has been shown to be important for the progression from the slender form to the stumpy form of the parasite 26 . The grumpy transcript made us wonder which other sub-polysome enriched transcripts might have a lncRNA at the 5' end and might be associated with this life stage transition. We identified two candidate genes: RBP10 (Tb927.8.2780) with the lncRNA KS17gene_1749a ( Figure 15) and REG9.1 (Tb927.11.14220) with the lncRNA KS17gene_4296a (Figure 16), both of which have been previously associated with the transition between the BSF and PCF life stages 61,62 .

Expression Analysis of lncRNAs and surrounding genes
We then asked if we could find evidence of coregulation between the lncRNAs and the genes at their 5' or 3'. To this aim, we first mapped again our dataset to the T. brucei genome considering only the coding sequence (CDS) of protein coding genes and the lncRNAs. We decided to consider only the CDSs for two reasons. First, several UTRs are not well annotated in T. brucei and, second, multiple lncRNAs overlap with UTR regions. We than created a new model to test for differential abundances between BSF and PCF sub-polysomal samples versus BSF and PCF polysomal samples. Finally, we reported the log fold change of lncRNAs (sub-polysomal vs polysomal samples) along with the log fold changes of the transcripts of genes at their 5' or 3'. This allowed us to use a McNemar's test and observe a statistical significant association between Figure 9. Determining the optimal number of clusters. A plot of the number of clusters tested (K) on the x-axis and the clustering distortion score (the sum of square distances from each point to its assigned cluster center) on the y-axis. The figure also displays the amount of time needed to train the clustering model per K as a dashed green line. If the line chart resembles an arm, then the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point (https://www.scikit-yb.org/en/latest/ api/cluster/elbow.html).
the differential abundance of the lncRNAs and the genes at their 5' (pval 1e -16 ). In particular, we observed that lncRNAs that are more abundant in the polysomal fractions are more likely to have a gene at their 5' that is more abundant in the polysomal fractions as well (Figure 17). We could find a similar association between the lncRNAs and the genes at their 3', but several order of magnitude weaker (pval 1e -3 ). The GO term analysis of those genes at the 5' of lncRNAs, where both the lncRNAs and the 5' genes are more abundant in the polysomal fractions, showed an enrichment for the following GO terms: posttranscriptional regulation of gene expression; cytoplasm; glycosome and mRNA binding. Since the GO term enrichment analysis highlighted a possible role of lncRNAs in regulating transcripts involved in posttranscriptional regulation of gene expression, we intersected the lncRNAs surrounding genes with a list of 322 potential post-transcriptional regulators in T. brucei 63 (Extended data: Table 6     . The predicted protein does not contain any transmembrane domains or signal peptides. A tblastn search with the ORF sequence against trypanosome genomes revealed matching sequences in the genomes of Tb427 and T. evansi. As with Tb927, the sequences appear between copies of the ISG65 genes in chromosome 2. In Tb427 the sequences are annotated as hypothetical proteins and in T. evansi as unspecified products, while in Tbg972 the regions are unannotated. The transcript seems to be preferentially expressed in BSF form ( Figure 19).

Discussion
In this paper we present RNA-seq data on the total, polysomal and sub-polysomal mRNA contents of T. brucei bloodstream and procyclic form life stages. Comparison with similar experiments performed earlier by Antwi et al. 12 showed good experimental reproducibility between the PCF life stage data (r 2 =0.9) and BSF life stage data (r 2 =0.7) ( Figure 5). A possible source of discrepancy may be different cell culture protocols for the BSF cells. Nevertheless, our datasets showed very is a novel transmembrane-protein coding gene present in T. brucei and T. evansi.

MSTRG.94
Peptides corresponding to potential new gene MSTRG.94 ( Figure 19) mapped with high confidence to 6 regions within the span Tb927_02_v5.1:592500..617500. Investigation of this section of chromosome 2 revealed it is highly repetitive and contains 6 copies of a 65kDa Invariant Surface Glycoprotein (ISG65) gene with a pairwise protein Identities computed by Clustal Omega between 73% and 99%. This suggests By using clustering and dimensionality reduction techniques (Figure 8 and Figure 10), we were able to identify the sub-polysome samples as the most diverse in our dataset. In particular, we found the presence of several long non-coding mRNAs in the sub-polysomal fractions of both BSF and PCF samples (Extended data: Table 3 60 ). However, some lncRNAs    were also found enriched in the polysomal fractions as already identified in human cells 65 . This class of mRNA has been overlooked in T. brucei until recently, and one particular long non-coding mRNA (grumpy) has been shown to regulate the transformation from the slender to the stumpy life stage of the parasite 31 . Interestingly, the RNA-binding protein RBP10 (Tb927.8.2780), that has been shown to bind mRNAs and promote their degradation, acts as a molecular switch whereby RBP10 expression in BSF causes differentiation to PCF, while the overexpression in PCF causes differentiation to BSF 61 . While RBP10 itself was not found in our sub-polysome enriched transcript list, the lncRNA (KS17gene_1749a) which is predicted to be at the 5' end of RBP10 may have a similar regulatory function as the grumpy lncRNA transcript. Intrigued by these findings, we have assembled a list of lncRNAs along with their surrounding genes at their 5' and 3' ends and reported their differential expression values between the polysomal and sub-polysomal samples (Extended data: Table 6 60 ). The analysis of these data highlighted a possible co-regulation between the lncRNAs and the genes at their 5' ends, such that when a lncRNA is more abundant in the polysome fractions relative to the sub-polysomal fractions, the same is true of the gene at its 5' end ( Figure 17). It is possible that the lncRNAs might influence the transcription efficiency of the proximal genes at their 5' ends, as observed in other organisms 66 . It is possible that lncRNAs might bind to the gene transcript at its 5' end to stabilize it or promote transcription 66 . It may be that these lncRNAs, more abundant in the polysomal fraction of BSF and PCF, regulate genes that are important for the maintenance of such life stages, while the lncRNAs that regulate life stage transitions (like the grumpy gene) are targeted to the subpolysomal fractions, possibly for degradation. In any case, we anticipate that the study of lncRNAs transcripts that show differential abundance between the sub-polysomal and polysomal fractions may uncover new mechanisms of transcript stability and regulation in T. brucei.
Another class of RNA we found to be enriched in the sub-polysomal fractions are snoRNAs. However some snoRNAs were also detected in the polysomal fraction. The presence of this class of RNA in the polysomal fraction could be explained by contamination, but also by a degradation mechanism. For example, snoRNAs guide the peculiar trypanosome rRNA maturation events, facilitating the methylation and pseudouridylation modification of rRNA 67,68 . Because polyadenylation by snoRNA is a way of marking the RNA for degradation in yeast and humans 69,70 , it is possible that a similar mechanism acts in T. brucei, and that our poly-A enrichment step has captured this class of RNAs before they have been targeted to the exosome for degradation 71 .
Finally, we hope that our dataset will be useful for the annotation of the T. brucei genome. Our approach to discover new transcripts in T. brucei detected several new transcribed loci.  While most of the transcribed loci represent miss-annotation of putative gene transcripts, we used an unbiased proteomic approach to detect at least 30 new hypothetical protein-coding genes, two of which were further manually annotated here (Figure 18 and Figure 19 19)

Data availability
Underlying data All FASTQ files data are deposited at the NCBI SRA database 72 under the bioproject accession number PRJNA634997 Analysis pipeline, links to the raw data and code used to generate the paper figures are available at https://github.com/ mtinti/polysome, reproducible using the mybinder badge in GitHub and archived in Zenodo.
Zenodo: mtinti/polysome: Fix Table 6 The table also reports the predicted cluster identification number (label), a binary column reporting whether the gene is identified or not in the (is_ks), the gene description (desc), a binary column reporting whether the gene is annotated as an H/ACA-like snoRNA, a binary column reporting whether the gene is annotated as a snoRNA and a binary column reporting whether the gene is annotated as non-coding (Noncoding) RNA.
· Table  4. Polysome/sub-polysome transcript differential abundance in BSF cells. Comparison between the polysome and sub-polysome samples in the bloodstream form life stage: logFC, the log fold-change for each gene in the two groups being compared. logCPM, the log-average abundance for each gene in the two groups being compared. LR, likelihood ratio statistic. PValue, exact p-value for differential expression test. FDR, the p-value adjusted for multiple testing with the Benjamini-Hochberg method (false discovery rate).
· Table 5. Polysome / Sub-polysome Transcript Differential Abundance in PCF. Comparison between the polysome and subpolysome samples in the procyclic form life stage the: logFC, the logabundance ratio, i.e. fold change, for each gene in the two groups being compared; logCPM, the logaverage concentration/abundance for each gene in the two groups being compared; LR, likelihood ratio statistics; PValue, exact p-value for differential expression test; FDR, the p-value adjusted for multiple testing with the Benjamini-Hochberg method.
· Table 6. lncRNA and Surrounding Genes. Comparison between the polysome and sub-polysome samples for the lncRNAs (gene_ks) and the genes at their 5' (gene_ sensitive_at_5prime) or 3' (gene_sensitive_at_3prime) reporting the: logFC, the log fold-change for each lncRNA in the two groups being compared. FDR, the p-value adjusted for multiple testing with the Benjamini-Hochberg method for the lncRNAs. logFC_5p, the log fold-change for the genes at the 5' of the lncRNAs. FDR_5p, the p-value adjusted for multiple testing with the Benjamini-Hochberg method for the genes at the 5' of the lncRNAs. logFC_3p, the log fold-change for the genes at the 3' of the lncRNAs. FDR_3p, the p-value adjusted for multiple testing with the Benjamini-Hochberg method for the genes at the 3' of the lncRNAs. Desc_5p, the gene description for the genes at the 5' of the lncRNAs. Desc_3p, the gene description for the genes at the 3' of the lncRNAs.

Open Peer Review
The manuscript by Tinti et al. sought to generate profiles of mRNAs associated with polysomal and non-polysomal fractions from the parasitic protozoa "Trypanosoma brucei"; polysomal mRNAs indicating those being actively translated and used for protein synthesis. The study compares the profiles seen for the two major life forms of "T. brucei" (bloodstream and procyclic) using as biological material fractions of cycloheximide-treated cells. Pooled samples consisting of polysomal and non-polysomal fractions, separated by sucrose gradient, were submitted to RNA extraction, cDNA synthesis and next generation sequencing in order to qualitatively and quantitatively define mRNAs (or polyadenylated RNAs) bound or not bound to the polysomes. In all, the authors investigated three different samples for each parasite life form: total polyA+ mRNAs, sub-polysomal mRNAs and polysomal mRNAs. A large range of bioinformatic tools were used to analyze the results and these were also compared with previous published data.
The data generated by the reported research, with the comparison between polysomal and sub-polysomal mRNAs from the two different life forms, expands on previous works assessing polysome bound mRNAs in T. brucei, increases the number of experimentally validated and mapped mRNAs and identifies and investigates non-coding mRNAs in the sub-polysomal fraction, with possible roles in regulation of specific messages. It constitutes an important tool for a large number of investigators working on mRNA translation and gene expression regulation in these and related protozoa. The approach is especially relevant considering that these are eukaryotes with very peculiar mechanisms for regulation of gene expression, with previous evidence highlighting a strong role for this regulation during mRNA translation, therefore being model organisms for mechanisms targeting regulation of translation. Nevertheless, for publication, the manuscript needs improvements regarding several issues, as detailed below.

Abstract:
The abstract should be thoroughly revised with an inclusion of more relevant results. As detailed further below, a focus only on the presence of long non-coding RNA in the sub-polysomal fraction (stated twice) is misleading since these should be mainly found in this fraction to begin with. More relevant results include: the proper identification/confirmation of long non-coding, polyadenylated RNAs (with the 5' SL), with substantial number of reads and their mapping, highlighting possible roles regarding the regulation of specific protein coding genes (it is not clear by the text to which extent novel lncRNAs were identified, but if so this has to be highlighted as well as the association reported between lcnRNAs and neighboring protein coding transcripts); the substantial increase of genome loci found to be transcribed, as reported in the main text; and the identification of novel hypothetical proteins.

Methods:
The approach applied in the manuscript was based on several bioinformatic tools and computational steps. Some of the tools used are quite old, such as GMAP, but more relevant was the use of several computational tools and analyses without a proper justification for their use. For example, why did the authors decide to use two different de novo assemblers (Trinity and Scallop) for the transcriptomic data? A consensus between the transcript predictions from both tools and including prior predictions was then expected, but it is not clear if this is what the authors did and should be clarified. At the end of the "de novo" approach, a new genome annotation file was generated, but it is not clear if it was used for the final read counts using the "featureCounts" tool and this also needs clarification. Regarding the description of the cDNA library preparation, important details such as read length and which type of library was employed cannot be found in the methods section. Base on the text it is understood that the libraries used were paired-end, but then for quality control, the authors mentioned they used RPKM as normalization step, which is applicable to single-end libraries. These details need to be better clarified.
An important point of concern is the use of two different strains, one for each of the two T. brucei life forms. No considerations or comments are made on how this can impact on the differences found between the same fractions from the two different life forms investigated. To what extent could the differences seen between the two life forms, and related to the results from figure 5, could be associated with the use of different strains?

Results:
The very large number of figures can be a distraction and keep the reader from focusing on what is important. The authors should consider reducing those. Some suggestions: Figures 1 is not necessary; for Figures 6 and 7, only one representative figure could be kept, or both could be removed altogether; Figure 9 also does not need to be shown.
For validation purposes, wouldn´t it be relevant mentioning or showing the profile of known, stage specific genes, which would preferentially be present in the polysomal fractions of either bloodstream or procyclic forms? For instance, known surface antigens? polysomal or polysomal fractions of presumably cytoplasmic and mature mRNAs, selected through their poly-A tails. So, very low or no intron mapping would be expected overall and not only for the polysomal fractions. Wouldn't the presence of introns indicate a substantial amount of precursor or maybe nuclear mRNAs within the sub-polysomal fractions that need to be considered somehow?
A relevant observation from the results shown in Figure 4 is the overlap in the UTRs from the Tb927.8.1500 and Tb927.8.1510. How unique is this? Likewise, in Figure 3, for the Tb927.3.3160 gene, its intron might be associated with poly-A tracts. Again, what does this mean and is it seen elsewhere? These issues should be considered in the text.
The authors mention an enrichment of lncRNAs in cluster 2, composed mainly by sub-polysomal mRNAs from bloodstream and procyclic forms. Biologically speaking, isn't that to be expected, as no ribosomes should be attached to non-translatable lncRNAs? The authors propose a mechanism where a lncRNA would be part of a transcript, and its presence or absence could define different transcript isoforms. But which isoform would be translated, with or without the lncRNA segment? In this case would the lncRNA segment be considered a long non-coding RNA? Why would the lncRNA have a Spliced Leader and poly-A, as indicated by the results? These issues were discussed only superficially in the Discussion.
The authors also proposes that the location of the segments encoding several lncRNA might indicate a role in regulation of neighboring genes. What is the basis for this? Has it been shown elsewhere or in Tryps? It seems quite speculative and not much related to this is seen in the discussion.
Finally, it would be nice to have some data regarding the abundance of different functional classes of mRNA coding proteins when forms and fractions were compared, however, there is nothing on that in the discussion.

Discussion:
Regarding the discussion, the work has several interesting results, but few of them are discussed sufficiently. For instance, the results mention a 34% increment in the number of transcribed loci, but there is no consideration regarding this very relevant results in the discussion, and no details can be found regarding function and curation of these loci.

Minor Points:
What is the difference between PCA and MDS? Do the axes have weights? This can change the interpretation of the result. Why didn't authors use PCA? 1.
In the methods, in the sentence "Before computing the fraction of transcripts in polysomes, the polysome read counts were divided by 0.7 and the sub-polysome read counts were divided by 0.3 …", shouldn't the counts be multiplied by 0.7 or 0.3, instead of divided by? 2.
The following sentence is not understandable and needs to be clarified: "To identify the Grumpy Like genes we created a third model to study the differential transcript abundance between the sub-polysomal samples (mixed model of BSF and PCF) against the subpolysomal samples (mixed model of BSF and PCF)" present at "Sub-polysome / polysome 3. differential abundance analysis and Grumpy Like GenesSub-polysome / polysome differential abundance analysis and Grumpy Like Genes" topic.
The first sentences of the Results section explaining in detail the use and properties of cycloheximide are not necessary and can be shortened since the cycloheximide use for polysomal gradients is common.

4.
The statement "The poly(A) genomic tracts are often present in intergenic regions and can help to determine the 3' gene boundaries" needs a reference.

5.
In various Figure   For Figure 10, review the order that the results are shown. It makes no sense having the two different sub-polysomal samples followed by the polysomal and total samples from procyclics and then polysomal and total samples from bloodstream. Either you alternate between each of the procyclic and bloodstream samples or show all procyclic and then all bloodstream samples.

8.
The following sentence needs to be modified "The grumpy transcript made us wonder which other sub-polysome enriched transcripts might have a lncRNA at the 5' end….". Based on the "Grumpy" example, it refers to polysome enriched transcripts which have a subpolysome enriched lncRNA at their 5' end.
In the sentence "Synteny analysis of the TRY.375 locus performed at TryTripDB revealed another gene (TevSTIB805.7.3380) in the T. evansi genome with 100% homology with the predicted TRY.375 gene product.", the homology concept is misused as it is a binary concept bringing the idea of evolutionary relationship, so you cannot have degree of homology between two loci of different species. Either they are homologs or re not. The degree of similarity or identity should be informed.

11.
The novel MSTRG.94 gene and neighboring sequences seem to be transcribed only in Bloodstream cells and this should be highlighted.

12.
In the Discussion, the statement "This is demonstrated by the virtual absence of reads covering the intron regions of the two experimentally validated intron containing genes" is not valid for the sub-polysomal transcripts and should be revised.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes Either you alternate between each of the procyclic and bloodstream samples or show all procyclic and then all bloodstream samples. This is the order determined by the column clustering, that we would rather keep as it is. We acknowledge that we did not describe the clustering of the columns that is now added to the materials and methods.
The following sentence needs to be modified "The grumpy transcript made us wonder which other sub-polysome enriched transcripts might have a lncRNA at the 5' end….". Based on the "Grumpy" example, it refers to polysome enriched transcripts which have a sub-polysome enriched lncRNA at their 5' end.

1.
We have rephrased this sentence. We don't find it confusing, we cordially ask the reviewer to specify better where the confusion arises.
In the sentence "Synteny analysis of the TRY.375 locus performed at TryTripDB revealed another gene (TevSTIB805.7.3380) in the T. evansi genome with 100% homology with the predicted TRY.375 gene product.", the homology concept is misused as it is a binary concept bringing the idea of evolutionary relationship, so you cannot have degree of homology between two loci of different species. Either they are homologs or re not. The degree of similarity or identity should be informed.

1.
We now report on the degree of similarity.
The novel MSTRG.94 gene and neighboring sequences seem to be transcribed only in Bloodstream cells and this should be highlighted.

1.
We have highlighted this.
In the Discussion, the statement "This is demonstrated by the virtual absence of reads covering the intron regions of the two experimentally validated intron containing genes" is not valid for the sub-polysomal transcripts and should be revised.

1.
We have rephrased this sentence, referring only to the polysomal samples.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer Report 11 June 2021 https://doi.org/10.21956/wellcomeopenres.18079.r44184 © 2021 Erben E. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

IIBIO-UNSAM, Buenos Aires, Argentina
In Tinti and co-workers, the authors look for novel coding sequences and potential lncRNAs by sequencing total RNA and different polysomal fractions from both PCF and BSF-trypanosome forms. The authors found that the grumpy lncRNA (the only lncRNA from T. brucei partially characterised so far potentially involved in the regulation of the slender-to-stumpy differentiation) is enriched in subpolysomal fractions in both life forms. They then looked for new lncRNAs displaying similar polysomal distribution finding a bunch of them associated or in close proximity to genes involved in stumpy formation or post-transcriptional regulators suggesting an interesting and yet unexplored functional link between them. If this holds true, it would also mean that tagging some specific proteins from the endogenous locus -a popular approach in this field -(where UTRs are disrupted) may have unforeseen functional consequences. The authors also identified 11 novel protein coding genes. Since both NGS and mass spectrometry rely heavily on databases, I find this manuscript a helpful contribution to the continuous curation of the T brucei genome. I also celebrate how detailed the Methods section is. I am supportive of publication, though make some suggestions for possible improvement of the manuscript.

Major criticism:
lncRNAs: The author found that several lncRNA are in close proximity (typically at the 5'-end) to genes associated with the transition between the BSF and PCF life stages or potential posttranscriptional regulators. Is this statically significant? How does it compare to different functional gene categories or among developmentally vs non-developmentally-regulated genes? The authors show the FC between different fractions but I would like to see how levels of the protein-coding gene and its proximal lncRNA vary in PF vs BF. For instance, what is going on with the lncRNA immediately upstream of the RBP10 gene in PF? Since it seems to be embedded into the 3'UTR of the upstream gene (Tb927.8.2770), is its level independently regulated or follow the Tb927.8.2770 pattern? Most of the lncRNAs are found in subpolysomal fractions. For the nuclear grumpy it is expected. However, RBP7A seems to be cytosolic (also RBP10 for which a subpolysomal lncRNA is apparently linked), so how does the author imagine the crosstalk between the coding genes and their proximal lncRNAs? 1.
In the Discussion, the author hypothesized that the excised lncRNA might be targeted for degradation, and this could be the reason why the lncRNAs are enriched in the subpolysomal fractions. Fig 15 shows that the lncRNA present upstream of the RBP10 coding 2.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.