Brain de novo transcriptome assembly of a toad species showing polymorphic anti-predatory behavior

Sample collection and RNA preparation

We analyzed 6 adult yellow-bellied toad individuals representative of distinct behavioral profiles, i.e. prolonged unken-reflex display vs no unken-reflex display (thereafter referred as “ + ” and “-“, respectively). Behavioral profiles were scored as in Chiocchio et al.¹²: 3 toads showed prolonged unken-reflex (+), whereas the other 3 did not show unken-reflex (−), as reported in Table 1. Sampling procedures were approved by the Italian Ministry of Ecological Transition and the Italian National Institute for Environmental Protection and Research (ISPRA; permit number: 20824, 18-03-2020). After dissection, brain tissue was immediately stored in RNAprotect Tissue Reagent (Quiagen) until RNA extraction. RNA extractions were performed using the RNeasy Plus Kit (Quiagen), according to the manufacturer’ instructions. RNA quality and concentration were assessed by means of both a spectrophotometer and a Bioanalyzer (Agilent Cary60 UV-vis and Agilent 2100, respectively – Agilent Technologies, Santa Clara, USA).

Table 1 Summary of the 6 libraries deposited in the Sequence Read Archive (SRA) of NCBI, in terms of number of raw and trimmed reads per sample.

Full size table

Library preparation and sequencing

Library preparation and RNA sequencing were performed by NOVOGENE (UK) COMPANY LIMITED using Illumina NovaSeq platform. Library construction was carried out using the NEBNext® Ultra ™ RNA Library Prep Kit for Illumina®, following manufacturer instructions. Briefly, after the quality control check, the mRNA sample was isolated from the total RNA by using magnetic beads made of oligos d(T)25 (i.e. polyA-tail mRNA enrichment). Subsequently, mRNA was randomly fragmented, and a cDNA synthesis step proceeded using random hexamers and the reverse transcriptase enzyme. Once the synthesis of the first chain has finished, the second chain was synthesized with the addition of the Illumina buffer, dNTPs, RNase H and polymerase I of E.coli, by means of the Nick translation method. Then, the resulting products went through purification, repair, A-tailing and adapter ligation. Fragments of the appropriate size were enriched by PCR, the indexed P5 and P7 primers were introduced, and the final products were purified. Finally, the Illumina Novaseq 6000 sequencing system was used to sequence the libraries, through a paired-end 150 bp (PE150) strategy. We obtained on average 52.7 million reads for each library. The sequencing data are available at the NCBI Sequence Read Archive (project ID PRJNA764013²⁰).

Pre-assembly processing stage

A total of 316,329,573 pairs of reads was generated by Illumina sequencing. All of them went to a cleaning analytic step. The quality of the raw reads was assessed with the FastQC 0.11.5 tool (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc), in order to estimate the RNAseq quality profiles. The quality estimators were generated for both the raw and trimmed data. The quality assessment metrics for trimmed data were aggregated across all samples into a single report for a summary visualization with MultiQC software tool²¹ v.1.9 (see Fig. 1). To remove low quality bases and adapter sequences, raw reads were also analyzed through a quality trimming step with Trimmomatic²², v.0.39 (setting the option SLIDINGWINDOW: 4: 15, MINLEN: 36, and HEADCROP: 13). All the unpaired reads were discarded. After the cleaning step and removal of low-quality reads, 297,354,405 clean reads (i.e. 94% of raw reads) were maintained for building the de novo transcriptome assembly (see Table 1).

Fig. 1

The cleaned reads from all samples were assessed with FastQC and visualized with MultiQC. (a) Read count distribution for mean sequence quality. (b) Mean quality scores distribution. (c) Read length distribution. (d) Per Sequence GC Content.

Full size image

De novo transcriptome assembly and quality assessment

As there is no reference genome for B. pachypus, we performed a de novo transcriptome assembly procedure. The workflow of the bioinformatic pipelines is shown in Fig. 2. All the described bioinformatics analyses were performed on the high-performance computing systems provided by ELIXIR-IT HPC@CINECA²³.

Fig. 2

Workflow of the bioinformatic pipeline, from raw input data to annotated contigs, for the de novo transcriptome assembly of B. pachypus.

Full size image

To construct an optimized de novo transcriptome, avoiding chimeric transcripts, we employed rnaSPAdes²⁴, a tool for de novo transcriptome assembly from RNA-Seq data implemented in the SPAdes v.3.14.1 package. rnaSPAdes automatically detected two k-mer sizes, approximately one third and half of the maximal read length (the two detected k-mer sizes were 45 and 67 nucleotides, respectively). At this stage, a total of 1,118,671 assembled transcripts were generated by rnaSPAdes runs, with an average length of 689.41 bp and an N50 of 1474 bp (Table 2).

Table 2 Similarity rate of newly assembled transcripts versus the de novo transcriptome of B. pachypus.

Full size table

Results from the assembly procedures were validated through three independent validator algorithms implemented in BUSCO²⁵ v.4.1.4, DETONATE²⁶ v.1.11 and TransRate²⁷ v.1.0.3. These tools generate several metrics used as a guide to evaluate error sources in the assembly process and provide evidence about the quality of the assembled transcriptome. Busco provides a quantitative measure of transcriptome quality and completeness, based on evolutionarily-informed expectations of gene content from the near-universal, ultra-conserved eukaryotic proteins (eukaryota_odb9) database. Detonate (DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation) is a reference-free evaluation method based on a novel probabilistic model that depends only on the assembly and the RNA-Seq reads used to construct it. Transrate generates standard metrics and remapping statistics. No reference protein sequences were used for the assessment with Transrate. The main metrics resulted from the assembly validators are shown in Table 2 (“Before CD-HIT-est” column). After this triple assessment validation step, the result of the assembly procedure become the input for the CD-HIT-est v.4.8.1²⁸ program, a hierarchical clustering tool used to avoid redundant transcripts and fragmented assemblies common in the process of de novo assembly, providing unique genes. CD-HIT-est was run using the default parameters, corresponding to a similarity of 95%. Subsequently, a second validation step was launched on the CD-HIT-est output file. To refine the final transcriptome dataset, a further hierarchical clustering step was performed by running CORSET v1.06²⁹. Then, the output of CORSET was validated by BUSCO, and quality assessment was performed with HISAT2^30,31 by mapping the trimmed reads to the reference transcriptome (unigenes). Results from all validation steps are shown in Table 2 and discussed in the “Technical Validation” paragraph.

Finally, the CORSET output was run on TransDecoder^32,33, the current standard tool that identifies long open read frames (ORFs) in assembled transcripts, using default parameters. TransDecoder by default performs ORF prediction on both strands of assembled transcripts regardless of the sequenced library. It also ranks ORFs based on their completeness, and determines if the 5 ‘end is incomplete by looking for any length of AA codons upstream of a start codon (M) without a stop codon. We adopted the “Longest ORF” rule and selected the highest 5 AUG (relative to the inframe stop codon) as the translation start site.

Transcriptome annotation

We employed different kinds of annotations for the de novo assembly. We introduced DIAMOND³⁴, an open-source algorithm based on double indexing that is 20,000 times faster than BLASTX on short reads and has a similar degree of sensitivity. Like BLASTX, DIAMOND attempts to determine exhaustively all significant alignments for a given query. Most sequence comparison programs, including BLASTX, follow the seed-and-extend paradigm. In this two-phase approach, users search first for matches of seeds (short stretches of the query sequence) in the reference database, and this is followed by an ‘extend’ phase that aims to compute a full alignment. The following parameter settings were applied: DIAMOND-fast DIAMOND BLASTX-t 48 -k 250 -min-score 40; DIAMOND-sensitive: DIAMOND BLASTX -t 48 -k 250 -sensitive -min-score 40.

Contigs were aligned with DIAMOND on Nr, SwissProt and TrEMBL to retrieve the corresponding best annotations. An annotation matrix was then generated by selecting the best hit for each database. Following the analysis of BLASTX against Nr, SwissProt and TremBL, we obtained respectively: 123,086 (64.57%), 77,736 (40.78%), 122,907 (64.48%) contigs. The results obtained following the analysis with BLASTP against Nr, SwissProt and TrEMBL were 96,321 (50.53%), 57,877 (30.36%) and 97,256 (51.02%) contigs respectively. All the information on the resulting datasets is resumed in Table 3.

Table 3 Summary of homology annotation hits on the different databases queried in this study.

Full size table

The output obtained by the BLASTX annotation consisted in a total of 77391 sequences simultaneously mapped on the three queried databases (i.e., Nr, SwissProt and TrEMBL). The output obtained following the BLASTP annotation consisted in a total of 57704 sequences simultaneously mapped on the three databases. Venn diagrams are presented in Fig. 3, showing the redundancy of the annotations in the different databases for both DIAMOND BLASTX (Fig. 3a) and DIAMOND BLASTP (Fig. 3b). Furthermore, the ten most represented species and the ten hits of the gene product obtained respectively with BLASTX and BLASTP by mapping the transcripts against the reference database Nr are shown in Figs. 4 and 5. Since BLASTX translated nucleotide sequence searches against protein sequences the BLASTX results are more exhaustive than BLASTP results. Contigs were also processed with InterProScan³⁵ to detect InterProScan signatures. The InterPro database (http://www.ebi.ac.uk/interpro/) integrates together predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. The obtained InterProScan results for all the unigenes are available on Figshare in the form of Tab Separated Values (tsv) file format, which includes the GO and KEGG annotated contigs, respectively.

Fig. 3

Venn diagrams for the number of contigs annotated with DIAMOND (BLASTX (a) and BLASTP (b) functions) against the three databases: Nr, SwissProt, TREMBL.

Full size image

Fig. 4

Most represented species and gene product hits. Top 10 best species (a) and protein (b) hits present in the reference database (Nr, BLASTX).

Full size image

Fig. 5

Most represented species and gene product hits. Top 10 best species (a) and protein (b) hits present in the reference database (Nr, BLASTP).

Full size image

Comparison with Bombina orientalis brain transcriptome

We compared the brain de novo transcriptome of B. pachypus with the brain de novo transcriptome of B. orientalis, recently produced in the frame of a prey-catching conditioning experiment^17,18. The B. orientalis transcriptome resource was downloaded from GEO archive of NCBI (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171766). To make the datasets comparable, we first performed ORF prediction on B. orientalis trascriptome using Transdecoder, using default settings. Results from the B. orientalis trascriptome ORF prediction are available in Figshare at the following link https://doi.org/10.6084/m9.figshare.20319633/). We also applied the makedb function implemented in DIAMOND to create the protein database index. Then, we aligned the B. pachypus predicted coding sequences and proteins (query files) against the B. orientalis protein database (reference) using DIAMOND BLASTX and BLASTP, respectively. We obtained 167041 matches from BLASTX and 156248 matches for BLASTP. Results from the BLASTX and BLASTP comparisons, and the most matched proteins, are available on Figshare³⁶ (link available in next paragraph).

Brain de novo transcriptome assembly of a toad species showing polymorphic anti-predatory behavior

Sample collection and RNA preparation

Library preparation and sequencing

Pre-assembly processing stage

De novo transcriptome assembly and quality assessment

Transcriptome annotation

Comparison with Bombina orientalis brain transcriptome

Substantial differences in soil viral community composition within and among four Northern California habitats

Wildfires disproportionately affected jaguars in the Pantanal

ITALIAN LANGUAGE

ENGLISH LANGUAGE