Genome-wide characterisation of SSRs
We identified and mapped a total of 428,342 microsatellites across the 47,828 scaffolds of the unpublished draft genome sequence of T. erytreae using the GMATA software35. The SSR frequency was estimated at 765.6 SSRs/Mb, which means 1 SSR for every 1.09 Kb. SSRs identified in silico were distributed among ten types of tandemly repeated motifs (from di- to deca-nucleotides). Analysis of SSR distribution revealed that the di-nucleotide motifs (340,227) were the most abundant SSRs, with a frequency of 79.4%. Tetra- (20,902) and tri-nucleotide (61,839) repeats comprised about 5% and 15% of the total, respectively (Fig. 1A; Supplementary Data 1). The remaining motifs, from hepta- to deca-nucleotides, comprised less than 1.5% of the total SSRs identified in this study (Fig. 1A). Because the orientation of the DNA strands in the Tery6 draft genome sequence of T. erytreae is unknown, SSRs were further characterised by grouping the repeat motifs into pairs of complementary sequences. On this basis, GA/TC (36.6%) and CT/AG (31.9%) were the most frequent motif pairs, with a combined frequency of 68.5% (Fig. 1B). The grouped motif pairs GC/GC (0.05%) and CG/CG (~0.02%) were the least abundant di-nucleotide motifs. In decreasing order, the most abundant tri-nucleotide motif pairs were ATT/AAT, ATA/TAT, ACA/TGT, TAA/TTA, AAC/GTT, TTG/CAA, and AAG/CTT, which together encompassed 9.8% of all identified grouped motif pairs. The remaining grouped motifs, including the rest of the tri-nucleotides and those from tetra- to deca-nucleotides (552 in total), accounted for less than 11% of all motif pairs (Fig. 1B). SSRs of 10 bp were the most frequent, accounting for about 10% of all SSRs identified in this study. Overall, the frequency of SSRs in the T. erytreae genome gradually decreases as their length increases (Fig. 1C).
Frequency distribution of different classes of SSR repeat units in the Trioza erytreae genome. (A) Frequency of motif types by unit length (K-mers). (B) Frequency of grouped repeated motifs by nucleotide composition. (C) Length distribution of SSRs (the total number of SSRs of each length is shown at the top of the bars).
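The grouping of motifs into complementary pairs described above can be illustrated with a short sketch. It is not the GMATA implementation. It assumes that a motif is paired only with its reverse complement (which matches pairs such as GA/TC and ATT/AAT reported here) and that cyclic permutations such as GA and AG remain distinct. The function names and the alphabetical ordering of the pair label are illustrative choices, so CT/AG appears as AG/CT.

```python
from collections import Counter

# Translation table for complementing a DNA string (hypothetical helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Return the reverse complement of a DNA motif, e.g. GA -> TC."""
    return motif.upper().translate(_COMPLEMENT)[::-1]


def motif_pair(motif: str) -> str:
    """Label a motif together with its reverse complement.

    The two members are sorted alphabetically so that GA and TC
    both map to the same label, "GA/TC".
    """
    m = motif.upper()
    return "/".join(sorted((m, reverse_complement(m))))


def group_motif_counts(counts: dict) -> Counter:
    """Sum per-motif SSR counts into counts per complementary pair."""
    grouped = Counter()
    for motif, n in counts.items():
        grouped[motif_pair(motif)] += n
    return grouped


if __name__ == "__main__":
    # Toy counts, not data from this study.
    toy = {"GA": 3, "TC": 2, "CT": 1, "GC": 4}
    for pair, n in group_motif_counts(toy).most_common():
        print(pair, n)
```

Self-complementary motifs such as GC map to a pair with two identical members (GC/GC), as in the grouping reported above.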
SSR marker development for T. erytreae
Fifteen SSRs chosen from the repeated motifs identified in silico in this study (Table 1) were used as potential markers to investigate the genetic diversity, structure and phylogeography of T. erytreae individuals from populations in mainland Europe and the archipelagos of Macaronesia. Scaffolds Tery6_s00034 (274,710 bp), Tery6_s02825 (48,689 bp) and Tery6_s07841 (26,739 bp) were randomly selected based on their sequence length (long, medium, and short scaffolds, respectively). SSRs were selected on the basis of their type of repeat motif (di-, tri-, tetra- and penta-nucleotides), nucleotide composition and length (number of tandemly repeated motifs) (Table 1; Supplementary Data 2). For the scaffold Tery6_s00034, 11 SSR loci were chosen from the total of 106 SSRs identified in silico; three were chosen for Tery6_s07841 and one for Tery6_s02825. The selected scaffolds were further examined to determine whether the SSR loci mapped to coding or non-coding regions (inter-genic or intron sequences). Although gene annotation of the T. erytreae draft genome is not yet complete, it was possible to obtain this information for most of the selected SSR loci (data not shown). In the scaffold Tery6_s00034, SSR loci Tery06, -07, -13 and -14 were found in inter-genic regions, while Tery08, -12 and -15 mapped to introns. For Tery05, -09, -10 and -11 it was not possible to establish whether they targeted coding or non-coding regions. SSR loci Tery01, -02 and -03 were found in intron regions in the scaffold Tery6_s07841, and SSR locus Tery04 in an inter-genic region in the scaffold Tery6_s02825. For amplification of the SSR loci, specific PCR primers were designed on the sequences flanking the tandemly repeated motifs. A BLAST search of the different amplicons against the T. erytreae draft genome sequence indicated that each primer pair would specifically amplify its target SSR locus.
Experimental validation of the PCR primers was carried out on a test panel of individuals collected at different locations in the Canary Islands and South Africa. Primer pairs for SSR loci Tery04, -05, -06, -08, -09, -10, -11, -12, -13 and -15 yielded DNA fragments of the expected size and were chosen to carry out further population genetic analysis. These loci contain eight di-nucleotide motifs (AC, AG, GA, CA, GT, TC, TA and TG), one tri-nucleotide motif (TGA), and three tetra-nucleotide motifs (CATA, CTAC and TACC), arranged in microsatellites of different lengths (from 5 to 30 tandemly repeated motifs) (Table 1). Five SSR loci (Tery01, -02, -03, -07 and -14) were not amplified efficiently, and the corresponding primer pairs were excluded from further analysis.
The individuals of T. erytreae collected at different geographical locations on the west coast of mainland Spain and Portugal, in the Canary Islands and Madeira, and in South Africa and Kenya (Table 2) were analysed using the 10 selected SSR markers designed in this study. The scored allelic data for each SSR marker are summarised in Table 3. The analysis showed that all SSR markers were polymorphic. Seventy alleles were detected over the ten selected SSR loci, and the average number of alleles per locus (Na) was seven. SSR markers Tery08 and Tery11 had the highest number of alleles (12 and 20 alleles, respectively), whereas Tery13 had the lowest (only two alleles). The expected (He) and observed (Ho) heterozygosity per locus in the entire population ranged from 0.20 to 0.77 and from 0.03 to 0.84, respectively. SSRs Tery11 and Tery08 displayed the highest diversity (He of 0.77 and 0.72, respectively), and Tery09 and Tery13 (He of 0.20 and 0.22, respectively) were the least informative markers. Most of the SSR markers used in this work showed He values higher than 0.5, apart from Tery05, -09 and -13 (with values of 0.39, 0.20 and 0.22, respectively). With the exception of Tery04 and Tery15, He was higher than Ho for the analysed SSRs. The whole population also displayed a deficit of average Ho (0.31) compared with the He value (0.51) expected under Hardy–Weinberg equilibrium. This observation agrees with the positive value of Wright’s fixation index (Fw) estimated for all analysed SSR markers over the whole population (Fw = 0.41). The SSR markers Tery12 and Tery13 showed Fw values close to 1.0 (0.81 and 0.85, respectively), suggesting that their alleles were largely fixed in the population.
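The per-locus statistics reported above (Na, Ho, He and the fixation index) can be illustrated with a minimal sketch. It assumes diploid genotypes scored as pairs of allele sizes, the uncorrected expected heterozygosity He = 1 − Σp², and F = 1 − Ho/He. The software and any sample-size correction actually used for Table 3 are not stated in this section, so the function name and formulas below are assumptions for illustration only.

```python
from collections import Counter


def locus_stats(genotypes):
    """Summarise one SSR locus.

    genotypes: list of (allele_1, allele_2) tuples, one per diploid
    individual; individuals with missing data should be dropped first.
    Returns the number of alleles (Na), observed heterozygosity (Ho),
    expected heterozygosity (He, uncorrected) and F = 1 - Ho/He.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("no genotypes supplied")
    allele_counts = Counter(a for g in genotypes for a in g)
    total_alleles = 2 * n
    # Expected heterozygosity under Hardy-Weinberg: 1 - sum of squared frequencies.
    he = 1.0 - sum((c / total_alleles) ** 2 for c in allele_counts.values())
    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / n
    f = 1.0 - ho / he if he > 0 else float("nan")
    return {"Na": len(allele_counts), "Ho": ho, "He": he, "F": f}


if __name__ == "__main__":
    # Toy genotypes (allele sizes in bp), not data from this study.
    toy = [(100, 100), (100, 102), (102, 102), (100, 100)]
    print(locus_stats(toy))
```

A positive F, as in the toy example, indicates fewer heterozygotes than expected under Hardy–Weinberg equilibrium.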
Population structure based on T. erytreae SSR data
To assess the differentiation and genetic diversity among the local populations of T. erytreae sampled in newly invaded areas of Spain and Portugal, including Madeira and the Canary Islands, and those from the previously invaded areas in Africa (South Africa and Kenya), we used a Bayesian clustering method to analyse the SSR multi-locus genotyping data. The STRUCTURE analysis according to the method of ΔK36 showed that the overall genetic profile of all the individuals sampled could be described by two or three hypothetical original populations, corresponding to the highest ΔK values (Fig. 2). That is, the most likely numbers of genetic clusters (K) are 2 or 3. Nevertheless, Pritchard’s method37 showed that the posterior probability of the data was highest at K = 7 (Fig. 2): the estimated likelihood increased from K = 1 to K = 7 and then started to decrease. This implied that seven, the smallest value of K at which the likelihood reached its maximum, was the most likely number of inferred populations in our data set. Interestingly, the value of K at which the likelihood distribution reached its maximum coincided with a further peak of the ΔK statistic at K = 7, suggesting a more complex hierarchical structure of the T. erytreae populations (Fig. 2). Consequently, we plotted the clustering results for K = 2, K = 3 and K = 7 (Fig. 3). We first considered an initial structure of two populations (K = 2), as suggested by the method of ΔK36, in which most of the analysed individuals were assigned with high probability (Q > 0.90) to one of two clusters (Fig. 3). Cluster 1 (in green) was formed exclusively by individuals from newly invaded areas in Spain and Portugal, including those from the archipelagos of Madeira and the Canary Islands. Cluster 2 (in beige), on the other hand, mainly comprised individuals from Africa, but also included individuals from Camacha (Madeira). 
The exceptions to this pattern were three locations in Madeira (Quebradas, Camacha and Moreno), Pretoria (South Africa), and Homa Bay (Kenya), where almost all individuals consistently had significant membership in both clusters. In the K = 3 plot, the Bayesian clustering analysis resolved Cluster 1 into two groups by reassigning some individuals to Cluster 3 (in purple). Almost all individuals from Moreno, Poiso, and Farrobo (in Madeira and Porto Santo, respectively) were entirely reassigned to Cluster 3, along with several individuals from the Canary Islands and Galicia (Spain). In addition, individuals from Vairão (Porto) and São Vicente de Pereira Jusã (Aveiro) (both on the northwest coast of Portugal) were also assigned to Cluster 3, while individuals sampled from southern locations up to Sobreda (Setúbal) were assigned to Cluster 1. The exceptions to this pattern were the individuals from Ribamar (Ericeira), which were assigned to Cluster 3. Most notably, samples from Kenya were genetically different from those of South Africa and grouped in Cluster 1. At K = 7 the population structure was more hierarchical, but 73% of all individuals (108 out of 147) could be assigned to one of the seven clusters with more than 90% probability (Q > 0.9). About half of the remaining individuals (21 out of 39) could be assigned with more than 70% probability (Q > 0.7). Among the different groups, Clusters 1 (in green) and 2 (in beige) are restricted to the populations of South Africa and Kenya, respectively, with almost no individuals from any of the newly invaded areas. Clusters 3 (in purple) and 4 (in pink) are mostly exclusive to individuals from Madeira and mainland Portugal, although with some membership in the Canary Islands and Galicia. 
Cluster 5 (in light blue) and Cluster 6 (in orange) are represented by individuals from Madeira, the Canary Islands and Galicia, while the individuals from Camacha (Madeira), the only ones that were collected from Casimiroa edulis La Llave & Lex. (Rutaceae: Toddalioideae), exclusively form Cluster 7 (in dark blue). Remarkably, Q fractions corresponding to Cluster 7 are present in the individuals from Nelspruit and Tzaneen, and in some from Pretoria.
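The assignment rule used above, in which an individual belongs to a cluster only when its membership coefficient Q exceeds a threshold (0.9 or 0.7), can be sketched as follows. The function names and the representation of the Q matrix as one list of K membership coefficients per individual are assumptions made for illustration. STRUCTURE output would need to be parsed into this form first.

```python
from typing import Optional, Sequence


def assign_cluster(q: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """Return the 1-based cluster whose Q exceeds `threshold`, else None.

    Individuals returning None are treated as admixed at this threshold.
    """
    best = max(range(len(q)), key=lambda i: q[i])
    return best + 1 if q[best] > threshold else None


def assigned_fraction(q_matrix: Sequence[Sequence[float]], threshold: float = 0.9) -> float:
    """Fraction of individuals assigned to some cluster at `threshold`."""
    if not q_matrix:
        raise ValueError("empty Q matrix")
    assigned = sum(assign_cluster(q, threshold) is not None for q in q_matrix)
    return assigned / len(q_matrix)


if __name__ == "__main__":
    # Toy membership coefficients for K = 2, not data from this study.
    toy_q = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92], [0.75, 0.25]]
    print([assign_cluster(q) for q in toy_q])
    print(assigned_fraction(toy_q, 0.9), assigned_fraction(toy_q, 0.7))
```

Applied to the counts reported above, 108 of 147 individuals passing the 0.9 threshold corresponds to a fraction of about 0.73.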
Inference of the number of unique genetic clusters (K) from STRUCTURE simulations derived from ten SSR markers. Diagrams of the posterior probability of the SSR data were obtained according to the methods of Evanno et al.36 and Pritchard et al.37. The likelihood of the data given K (ln Pr(X|K), open circles) and ΔK (the standardised second-order rate of change of the likelihood function with respect to K, filled circles) are plotted as functions of K. Error bars for ln Pr(X|K) indicate standard deviations, but they are too small to be seen in the plot.
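The ΔK statistic defined in this caption can be computed from replicate STRUCTURE runs with a few lines of code. The sketch below assumes the commonly used form ΔK = |L(K+1) − 2L(K) + L(K−1)| / sd[L(K)], where L(K) is the mean of ln Pr(X|K) over replicate runs and sd is the standard deviation across those runs. The function name and the input layout (a dictionary from K to a list of replicate log-likelihoods) are illustrative, and the number of replicates used in this study is not stated here.

```python
from statistics import mean, stdev


def delta_k(lnp_runs):
    """Compute Evanno's delta K for every K that has both neighbours.

    lnp_runs: dict mapping K -> list of ln Pr(X|K) values from
    replicate runs (at least two replicates per K).
    Returns a dict mapping K -> delta K.
    """
    means = {k: mean(v) for k, v in lnp_runs.items()}
    result = {}
    for k in sorted(lnp_runs):
        if k - 1 not in means or k + 1 not in means:
            continue  # delta K is undefined at the smallest and largest K
        sd = stdev(lnp_runs[k])
        if sd == 0:
            continue  # avoid division by zero when replicates are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result


if __name__ == "__main__":
    # Toy log-likelihoods, not data from this study.
    toy = {1: [-100, -102], 2: [-80, -82], 3: [-78, -80], 4: [-77, -79]}
    dk = delta_k(toy)
    print(dk, "best K by delta K:", max(dk, key=dk.get))
```

Because ΔK is undefined for the smallest and largest K tested, it cannot support K = 1, which is one reason to inspect ln Pr(X|K) alongside it, as done in this figure.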
Bayesian clustering analysis of individuals genotyped with ten SSR markers in 23 populations of T. erytreae sampled in Africa, Spain, and Portugal. The assignment of individuals to genetic clusters inferred from STRUCTURE37 simulations is based on the average membership coefficient (Q). Estimated membership fractions for each individual and population are shown for K = 2, 3 and 7. Selection of the number of clusters was based both on the K value at which the likelihood distribution began to decrease and on the peak values of ΔK. Each individual is represented by a single vertical bar, and the colouring of each bar represents the stacked proportions of assignment probabilities to each genetic cluster. For K = 7, clusters 1, 2, 3, 4, 5, 6 and 7 are shown in green, beige, purple, pink, light blue, orange, and dark blue, respectively. Black vertical lines separate sample sites. Labels identify T. erytreae populations from old invaded areas in Africa and newly invaded areas in the Iberian Peninsula and Macaronesia.
Genetic diversity analysis using T. erytreae SSR allelic data
The genetic diversity of T. erytreae populations was also assessed by means of a distance-based clustering method. The scored SSR allelic data obtained from the ten SSR loci developed in this study were used to calculate a genetic dissimilarity matrix and to compute a Neighbor Joining (NJ) tree. A preliminary dendrogram constructed using only the African populations of T. erytreae showed that the individuals from South Africa grouped together into a single cluster clearly separated from the Kenyan population. The robustness of the tree clustering was supported by the high bootstrap values obtained for nearly all branches (Fig. 4). To confirm the results obtained from the STRUCTURE analysis, an NJ tree under topological constraints was inferred, using as the initial tree the one built from individuals from all the sampled areas with Q > 0.7. The remaining individuals were then positioned (constrained) on that topology. Inspection of the constrained tree topology revealed seven clusters that were congruent with the population structure at K = 7 suggested by the STRUCTURE analysis (Fig. 5). It is noteworthy that Cluster 7 emerged as a paraphyletic group at the base of African Cluster 2. The cluster assignments of individuals with low membership coefficients (Q < 0.7) performed well in our distance-based clustering analysis.
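As an illustration of how a genetic dissimilarity matrix can be derived from SSR allelic data, the sketch below uses a simple allele-sharing (simple matching) dissimilarity for diploid genotypes: one minus the proportion of alleles shared, averaged over the loci scored in both individuals. The dissimilarity index and software used in this study are not named in this section, so this choice, along with the function names and the convention of marking missing loci as None, is an assumption.

```python
from collections import Counter


def shared_alleles(g1, g2):
    """Number of alleles two genotypes have in common (multiset intersection)."""
    return sum((Counter(g1) & Counter(g2)).values())


def dissimilarity(ind_a, ind_b, ploidy=2):
    """Allele-sharing dissimilarity between two multi-locus genotypes.

    ind_a, ind_b: lists with one genotype tuple per locus, in the same
    locus order; a locus with missing data is given as None and skipped.
    """
    pairs = [(a, b) for a, b in zip(ind_a, ind_b) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no locus scored in both individuals")
    shared = sum(shared_alleles(a, b) for a, b in pairs)
    return 1.0 - shared / (ploidy * len(pairs))


def dissimilarity_matrix(individuals):
    """Symmetric matrix of pairwise dissimilarities (list of lists)."""
    n = len(individuals)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dissimilarity(individuals[i], individuals[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    # Toy genotypes at two loci (allele sizes in bp), not data from this study.
    a = [(100, 102), (200, 200)]
    b = [(100, 100), (200, 204)]
    for row in dissimilarity_matrix([a, b, a]):
        print(row)
```

A matrix of this kind is the usual input to a Neighbor Joining routine; the tree-building step itself is not shown here.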
NJ consensus tree showing the phylogenetic relationships among the analysed individuals from Trioza erytreae populations sampled in South Africa and Kenya. The consensus tree is the result of 10,000 iterations of the allelic data obtained for the ten SSR markers selected in this work. Bootstrap values over 50% are indicated.
NJ tree under topological constraints showing the phylogenetic relationships among the populations of T. erytreae sampled in newly invaded areas in Spain and Portugal and those from old invaded areas in Kenya and South Africa. The dendrogram is the result of 10,000 iterations of the allelic data obtained for the ten SSR loci developed in this work. The tree inferred from the allelic data of individuals with Q > 0.7 according to STRUCTURE37 was used as the initial tree, and the remaining individuals were positioned (constrained) on this topology. Spain: Aldán (A), Areeiro (AR), Gran Canaria (GC), Los Rodeos (LR), Oratava (O), Portonovo (PN), Tacoronte (T). Portugal: Areeiro-Lisbon (AR-Lis), Barreiralva (B), Camacha (C), Farrobo (F), Moreno (M), Paião (P), Poiso (PO), Quebradas (Q), Ribamar (R), Sobreda (S), São Vicente de Pereira Jusã (SV), Vairão (V). South Africa: Nelspruit (N), Pretoria (PR), Tzaneen (TZ). Kenya: Homa Bay (HB). Genetic clusters for K = 7 are indicated. Admixed individuals with Q < 0.7 are shown in black.
Phylogenetic analysis using mtDNA-based barcoding
The maternal phylogenetic relationship among the T. erytreae individuals collected in this study was assessed using mitochondrial DNA (mtDNA) barcoding, that is, sequence analysis of nucleotide variation in the 5’ region of the Cytochrome C Oxidase I gene (COI)38. Comparison of the nucleotide sequences of the COI barcode fragments from this study with other T. erytreae GenBank accessions demonstrated that all sequences are highly conserved (Supplementary Data 3). With the sole exception of the fragments amplified from Kenyan individuals, all sequences showed absolute identity (100%) to the accessions that we previously deposited in GenBank39, corresponding to COI barcode sequences from T. erytreae individuals collected in the Canary Islands (MK285551-MK285553), Galicia (MK285548-MK285550), Madeira (MK285558) and South Africa (MK285554-MK285557, MK285559, MK285560) (Supplementary Data 4). Sequences amplified in this study from individuals collected in Homa Bay (Kenya) shared absolute identity with those extracted from the complete mitochondrial genome sequences from Ethiopia and Uganda27, and 97–98% identity with the COI barcode fragment sequences from individuals from other locations in Kenya39.
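The identity values quoted above are pairwise percentages over aligned barcode fragments. A minimal sketch of that calculation is given below, assuming the two sequences are already aligned to equal length and that columns containing a gap in either sequence are ignored. The function name and the gap-handling rule are illustrative assumptions rather than the procedure used in this study.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percentage of identical positions between two aligned sequences.

    Columns where either sequence has a gap character are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy aligned fragments, not sequences from this study.
    print(percent_identity("ACGTACGT", "ACGTACGA"))
```

For scale, if a full 657-position alignment were compared, 97–98% identity would correspond to roughly 13 to 20 mismatched positions.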
The nucleotide sequences of the COI barcode fragments generated in this study (n = 39) and some previously deposited in GenBank (n = 37) were used to analyse the maternal phylogenetic relationship of the T. erytreae populations that have invaded Spain and Portugal with those from South Africa and Kenya (Fig. 6). Of all COI sequences used in this study, 38.2% were obtained from Spain and Portugal (including Madeira and the Canary Islands), 36.8% from South Africa, 19.7% from Kenya, and 5.3% from other African countries (Cameroon, Ethiopia, Tanzania and Uganda). In accordance with the high level of identity of their COI barcode nucleotide sequences, the NJ tree generated from these sequences showed that the individuals from Spain and Portugal, including those from Madeira and the Canary Islands, formed a monophyletic group with the individuals from South Africa (Pretoria, Nelspruit, and Tzaneen). Our phylogenetic analysis reveals a clear differentiation between this monophyletic group and the individuals from Homa Bay (Kenya), as well as from the individuals previously reported from other locations in Kenya39. The individuals from Spain and Portugal formed a paraphyletic group with those from Pretoria (Fig. 6), while the remaining South African individuals from Nelspruit and Tzaneen formed a separate clade. Furthermore, our analysis demonstrated the presence of two different T. erytreae lineages in Tzaneen, as most of its individuals formed a paraphyletic group with those from Nelspruit, while the remainder formed a clade with four individuals from West Acres (South Africa)39. The few exceptions to this observation were three South African individuals, one from Pretoria (Pretoria-100) and two from West Acres (TeSA1 and TeSA7), which may correspond to migrants from Nelspruit or Tzaneen, and Pretoria, respectively. 
In a sister clade position to the South African clade, the GenBank accessions of Kenyan and Tanzanian COI sequences39 included in this phylogenetic analysis formed a monophyletic group. The COI barcode sequences from Homa Bay (Kenya) clustered separately as an outgroup, in a different clade with the corresponding fragments extracted from the mitochondrial genome sequences from Ethiopia and Uganda (MT416551 and MT416549, respectively)27, and Cameroon (MG989238)40 present in GenBank.
Phylogenetic tree based on COI barcode sequences of Trioza erytreae individuals from invaded areas in Spain and Portugal, and African local populations. The evolutionary history was inferred by means of the Maximum Likelihood method and the Tamura-Nei model41. The analysis involved 76 nucleotide sequences, including those generated in this study and 37 available GenBank accessions of T. erytreae. The tree with the highest log likelihood (-1037.95) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. Codon positions included were 1st + 2nd + 3rd + noncoding. There was a total of 657 positions in the final dataset. Evolutionary analyses were conducted in MEGA X42. GenBank accession numbers are shown in brackets. Coloured dots beside the COI accessions generated in this study correspond to the sampling location on the map: Kenya (green), South Africa (brown), Madeira (red), Canary Islands (deep purple), mainland Portugal (magenta) and mainland Spain (blue). Maps were taken and modified from www.outline-worldmap.com.
Source: Ecology - nature.com