Environmental palaeogenomic reconstruction of an Ice Age algal population

Site description, chronology, and sampling

A detailed description of the site, coring methods, age-depth model reconstruction, and sampling strategy can be found in Alsos et al.⁴¹. Briefly, Lake Øvre Æråsvatnet is located on Andøya, Northern Norway (69.25579°N, 16.03517°E) (Fig. 1a, b). In 2013, two cores were collected from the deepest sediments, AND10 and AND11, which were stored at 4 °C prior to sampling. Macrofossil remains were dated, with those from AND10 all dating to within the LGM. For the longer core AND11, a Bayesian age-depth model was required to estimate the age of each layer⁴¹. In this study, we selected one sample of LGM sediments from each of the two cores. According to the Bayesian age-depth model, sample Andøya_LGM_B, from 1102 cm depth in AND11, was dated to a median age of 17,700 (range: 20,200–16,500) cal yr BP. The age of Andøya_LGM_A, from 938 cm depth in AND10, was estimated at 19,500 cal yr BP, based on the interpolated median date between two adjacent macrofossils (20 cm above: 19,940–18,980 cal yr BP, 30 cm below: 20,040–19,000 cal yr BP). As Andøya_LGM_A falls within the age range of Andøya_LGM_B, we consider the samples to be broadly contemporaneous.

Sampling, DNA extraction, library preparation, and sequencing

The two cores were subsampled at the selected layers under clean conditions, in a dedicated ancient DNA laboratory at The Arctic University Museum of Norway in Tromsø. We extracted DNA from 15 g of sediment following the Taberlet phosphate extraction protocol²⁹ in the same laboratory. We shipped a 210 µL aliquot of each DNA extract to the ancient DNA dedicated laboratories at the Centre for GeoGenetics (University of Copenhagen, Denmark) for double-stranded DNA library construction. The extracts were concentrated to 80 µL using Amicon Ultra-15 30 kDa centrifugal filters (Merck Millipore, Darmstadt, Germany) and half of each extract (40 µL, totalling between 31.7 and 36.0 ng of DNA) was converted into Illumina-compatible libraries using established protocols¹⁰. Each library was dual-indexed via 12 cycles of PCR. The libraries were then purified using the AmpureBead protocol (Beckman Coulter, Indianapolis, IN, USA), adjusting the volume ratio to 1:1.8 library:AmpureBeads, and quantified using a BioAnalyzer (Agilent, Santa Clara, CA, USA). The indexed libraries were pooled equimolarly and sequenced on a lane of the Illumina HiSeq 2500 platform using 2 × 80 cycle paired-end chemistry.

Raw read filtering

For each sample, we merged and adapter-trimmed the paired-end reads with SeqPrep (https://github.com/jstjohn/SeqPrep/releases, v1.2) using default parameters. We only retained the resulting merged sequences, which were then filtered with the preprocess function of the SGA toolkit v0.10.15 (ref. ⁵⁷) by the removal of those shorter than 35 bp or with a DUST complexity score > 1.

Metagenomic analysis of the sequence data

We first sought to obtain an overview of the taxonomic composition of the samples and therefore carried out a BLAST-based metagenomic analysis on the two filtered sequence datasets. To make the datasets more computationally manageable, we subsampled the first and last one-million sequences from the filtered dataset of each sample and analysed each separately. The data subsets were each identified against the NCBI nucleotide database (release 223) using the blastn function from the NCBI-BLAST+ suite v2.2.18+⁵⁸ under default settings. For each sample, the results from the two subsets were checked for internal consistency, merged into one dataset, and loaded into MEGAN v6.12.3 (ref. ⁵⁹). Analysis and visualization of the Last Common Ancestor (LCA) was carried out for the taxonomic profile using the following settings: min score = 35, max expected = 1.0E−5, min percent identity = 95%, top percent = 10%, min support percentage = 0.01, LCA = naive, min percent sequence to cover = 95%. We define sequences as the reads with BLAST hits assigned to taxa post-filtering, thus ignoring “unassigned” and “no hit” categories.

Alignment to reference genome panels

We mapped our filtered data against three different reference panels to help improve taxonomic identifications and provide insight into the sequence abundance of the identified taxa (Supplementary Data 2 and 3). The first reference panel consisted of 42 nuclear genomes that included genera expected from Northern Norway, exotic/implausible taxa for LGM Andøya, human, six Nannochloropsis species, and four strains of Mycobacterium. The inclusion of exotic taxa was to give an indication of the background spurious mapping rate, which can result from mappings to conserved parts of the genome and/or short and damaged ancient DNA molecules^22,23. We included Nannochloropsis, Mycobacterium, and human genomes, due to their overrepresentation in the BLAST-based metagenomic analysis. The other two reference panels were based on either all 8486 mitochondrial or 2495 chloroplast genomes on NCBI GenBank (as of January 2018). The chloroplast dataset was augmented with 247 partial or complete chloroplast genomes generated by the PhyloNorway project⁶⁰ for 2742 chloroplast genomes in total. The filtered data were mapped against each reference genome or organellar genome set individually using bowtie2 v2.3.4.1 (ref. ⁶¹) under default settings. The resulting bam files were processed with SAMtools v0.1.19 (ref. ⁶²). We removed unmapped sequences with SAMtools view and collapsed PCR duplicate sequences with SAMtools rmdup.

For the nuclear reference panel, we reduced potential spurious or nonspecific sequence mappings by comparing the mapped sequences to both the aligned reference genome and the NCBI nucleotide database using NCBI-BLAST+, following the method used by Graham et al.⁹, as modified by Wang et al.¹¹. The sequences were aligned using the following NCBI-BLAST+ settings: num_alignments = 100 and perc_identity = 90. Sequences were retained if they had better alignments, based on bit score, to reference genomes as compared to the NCBI nucleotide database. If a sequence had a better or equal match against the NCBI nucleotide database, it was removed, unless the LCA of the highest NCBI nucleotide bit score was from the same genus as the reference genome (based on the NCBI taxonID). To standardize the relative mapping frequencies to genomes of different size, we calculated the number of retained mapped sequences per Mb of genome sequence.

The sequences mapped against the chloroplast and mitochondrial reference panels were filtered and reported in a different manner than the nuclear genomes. First, to exclude any non-eukaryotic sequences, we used NCBI-BLAST+ to search sequence taxonomies and retained sequences if the LCA was, or was within, Eukaryota. Second, for the sequences that were retained, the LCA was calculated and reported in order to summarize the mapping results across the organelle datasets. LCAs were chosen as the reference sets are composed of multiple genera.

Within the Nannochloropsis nuclear reference alignments, the relative mapping frequency was highest for N. limnetica. In addition, the relative mapping frequency for other Nannochloropsis taxa was higher than those observed for the exotic taxa. This could represent the mapping of sequences that are conserved between Nannochloropsis genomes or suggest the presence of multiple Nannochloropsis taxa in a community sample. We therefore cross-compared mapped sequences to determine the number of uniquely mapped sequences per reference genome. First, we individually remapped the filtered data to six available Nannochloropsis nuclear genomes, the accession codes of which are provided in Supplementary Data 2. For each sample, we then calculated the number of sequences that uniquely mapped to, or overlapped, between each Nannochloropsis genome. We repeated the above analysis with six available chloroplast sequences (Supplementary Data 2) to get a comparable overlap estimation for the chloroplast genome.

Reconstruction of the Andøya Nannochloropsis community organellar palaeogenomes

To place the Andøya Nannochloropsis community taxon into a phylogenetic context, and provide suitable reference sequences for variant calling, we reconstructed environmental palaeogenomes for the Nannochloropsis mitochondria and chloroplast. First, the raw read data from both samples were combined into a single dataset and re-filtered with the SGA toolkit to remove sequences shorter than 35 bp, but retain low complexity sequences to assist in the reconstruction of low complexity regions in the organellar genomes. This re-filtered sequence dataset was used throughout the various steps for environmental palaeogenome reconstruction.

The re-filtered sequence data were mapped onto the N. limnetica reference chloroplast genome (NCBI GenBank accession: NC_022262.1) with bowtie2 using default settings. SAMtools was used to remove unmapped sequences and PCR duplicates, as above. We generated an initial consensus genome from the resulting bam file with BCFtools v1.9 (ref. ⁶²), using the mpileup, call, filter, and consensus functions. For variable sites, we produced a majority-rule consensus using the –variants-only and –multiallelic-caller options, and for uncovered sites the reference genome base was called. The above steps were repeated until the consensus could no longer be improved. The re-filtered sequence data was then remapped onto the initial consensus genome sequence with bowtie2, using the above settings. The genomecov function from BEDtools v2.17.0 (ref. ⁶³) was used to identify gaps and low coverage regions in the resulting alignment.

We attempted to fill the identified gaps, which likely consisted of diverged or difficult-to-assemble regions. For this, we assembled the re-filtered sequence dataset into de novo contigs with the MEGAHIT pipeline v1.1.4 (ref. ⁶⁴), using a minimum k-mer length of 21, a maximum k-mer length of 63, and k-mer length increments of six. The MEGAHIT contigs were then mapped onto the initial consensus genome sequence with the blastn tool from the NCBI-BLAST+ toolkit. Contigs that covered the gaps identified by BEDtools were incorporated into the initial consensus genome sequence, unless a blast comparison against the NCBI nucleotide database suggested a closer match to non-Nannochloropsis taxa. We repeated the bowtie2 gap-filling steps iteratively, using the previous consensus sequence as reference, until a gap-free consensus was obtained. The re-filtered sequence data were again mapped, the resulting final assembly was visually inspected, and the consensus was corrected where necessary. This was to ensure the fidelity of the consensus sequence, which incorporated de novo-assembled contigs that could potentially be problematic, due to the fragmented nature and deaminated sites of ancient DNA impeding accurate assembly⁶⁵.

Annotation of the chloroplast genome was carried out with GeSeq v1.77 (ref. ⁶⁶), using the available annotated Nannochloropsis chloroplast genomes (accession codes provided in Supplementary Table 7). The resulting annotated chloroplast was visualized with OGDRAW v1.3.1 (ref. ⁶⁷).

The same assembly and annotation methods outlined above were used to reconstruct the mitochondrial palaeogenome sequence, where the initial mapping assembly was based on the N. limnetica mitochondrial sequence (NCBI GenBank accession: NC_022256.1). The final annotation was carried out by comparison against all available annotated Nannochloropsis mitochondrial genomes (accession codes provided in Supplementary Table 7).

If the Nannochloropsis sequences derived from more than one taxon, then alignment to the N. limnetica chloroplast genome could introduce reference bias, which would underestimate the diversity of the Nannochloropsis sequences present. We therefore reconstructed Nannochloropsis chloroplast genomes, but using the six available Nannochloropsis chloroplast genome sequences, including N. limnetica, as seed genomes (accession codes for the reference genomes are provided in Supplementary Table 3). The assembly of the consensus sequences followed the same method outlined above, but with two modifications to account for the mapping rate being too low for complete genome reconstruction based on alignment to the non-N. limnetica reference sequences. First, consensus sequences were called with SAMtools, which does not incorporate reference bases into the consensus at uncovered sites. Second, neither additional gap filling nor manual curation was implemented.

Analysis of ancient DNA damage patterns

We checked for the presence of characteristic ancient DNA damage patterns for sequences aligned to three nuclear genomes: human, Nannochloropsis limnetica and Mycobacterium avium. We further analysed damage patterns for sequences aligned to both the reconstructed N. limnetica composite organellar genomes. Damage analysis was conducted with mapDamage v2.0.8 (ref. ⁶⁸) using the following settings: –merge-reference-sequences and –length = 160.

Assembly of high- and low-frequency variant consensus sequences

The within-sample variants in each reconstructed organellar palaeogenome was explored by creating two consensus sequences, which included either high- or low-frequency variants at multiallelic sites. For each sample, the initial filtered sequence data were mapped onto the reconstructed Nannochloropsis chloroplast palaeogenome sequence with bowtie2 using default settings. Unmapped and duplicate sequences were removed with SAMtools, as above. We used the BCFtools mpileup, call, and normalize functions to identify the variant sites in the mapped dataset, using the —skip-indels, –variants-only, and –multiallelic-caller options. The resulting alleles were divided into two sets, based on either high- or low-frequency variants. High-frequency variants were defined as those present in the reconstructed reference genome sequence. Both sets were further filtered to only include sites with a quality score of 30 or higher and a coverage of at least half the average coverage of the mapping assembly (minimum coverage: Andøya_LGM_A = 22×, Andøya_LGM_B = 14×). We then generated the high- and low-frequency variant consensus sequences using the consensus function in BCFTools. The above method was repeated for the reconstructed Nannochloropsis mitochondrial genome sequence in order to generate comparable consensus sequences of high- and low-frequency variants (minimum coverage: Andøya_LGM_A = 16×, Andøya_LGM_B = 10×).

Phylogenetic analysis of the reconstructed organellar palaeogenomes

We determined the phylogenetic placement of our high- and low-frequency variant organellar palaeogenomes within Nannochloropsis, using either full mitochondrial and chloroplast genome sequences or three short loci (18S, ITS, rbcL). We reconstructed the 18S and ITS1-5.8S-ITS2 complex using DQ977726.1 (full length) and EU165325.1 (positions 147:1006, corresponding to the ITS complex) as seed sequences following the same approach that was used for the organellar palaeogenome reconstructions, except that the first and last 10 bp were trimmed to account for the lower coverage due to sequence tiling. We then called high and low variant consensus sequences as described above.

We created six alignments using available sequence data from NCBI GenBank (Supplementary Data 4) with the addition of: (1 + 2) the high- and low-frequency variant chloroplast or mitochondrial genome consensus sequences, (3) an ~1100 bp subset of the chloroplast genome for the rbcL alignment, (4 + 5) ~1800 and ~860 bp subsets of the nuclear multicopy complex for the 18S and ITS alignments, respectively, and (6) the reconstructed chloroplast genome consensus sequences derived from the alternative Nannochloropsis genome starting points. Full details on the coordinates of the subsets are provided in Supplementary Data 4. We generated alignments using MAFFT v7.427 (ref. ⁶⁹) with the maxiterate = 1000 setting, which was used for the construction of a maximum likelihood tree in RAxML v8.1.12 (ref. ⁷⁰) using the GTRGAMMA model and without outgroup specified. We assessed branch support using 1000 replicates of rapid bootstrapping.

Nannochloropsis variant proportions and haplogroup diversity estimation

To estimate major haplogroup diversity, we calculated the proportions of high and low variants in the sequences aligned to our reconstructed Nannochloropsis mitochondrial and chloroplast genomes. For each sample, we first mapped the initial filtered sequence data onto the high- and low-frequency variant consensus sequences with bowtie2. To avoid potential reference biases, and for each organellar genome, the sequence data were mapped separately against both frequency consensus sequences. The resulting bam files were then merged with SAMtools merge. We removed exact sequence duplicates, which may have been mapped to different coordinates, from the merged bam file by randomly retaining one copy. This step was replicated five times to examine its impact on the estimated variant proportions. After filtering, remaining duplicate sequences—those with identical mapping coordinates—were removed with SAMtools rmdup. We then called variable sites from the duplicate-removed bam files using BCFTools under the same settings as used in the assembly of the high- and low-frequency variant consensus sequences. We restricted our analyses to transversion-only variable positions to remove the impact of ancient DNA deamination artifacts. For each variable site, the proportion of reference and alternative alleles was calculated, based on comparison to the composite N. limnetica reconstructed organellar palaeogenomes. We removed rare alleles occurring at a proportion of <0.1, as these may have resulted from noise.

To ensure that the variants detected are not driven by contaminating modern DNA, we repeated the above variant calling process using a restricted dataset that only included damaged sequences. We used PMDtools v0.60 (ref. ⁷¹) to extract sequences from the above high- and low-frequency bam files for each sample and organellar genome with a DNA damage score threshold of ≥3. We then repeated the SAMtools merge and variant calling process, as described above, to confirm that the variants identified could be recovered based only on reads exhibiting ancient DNA damage. The variant calling was altered to use a minimum SNP quality threshold of three and minimum coverage of three, in order to recover calls from the lower coverage dataset.

To infer the minimum number of haplogroups in each reconstructed organellar genome sequence, we inspected the phasing of adjacent variable sites that were linked by the same read in the duplicate-removed bam files, akin to the method used by Søe et al.²⁵ (Supplementary Fig. 13). For this, we first identified all positions, from both samples, where two or more transversion-only variable sites occurred within 35 bp windows. For each window, we then examined the allelic states in mapped sequences that fully covered the linked positions. We recorded the combination and frequency of alleles for each window of linked positions to calculate the observed haplotype diversity. We removed low-frequency haplotypes, which were defined as those with <3 sequences or <15% of all sequences that covered a linked position, and the remaining haplotypes were scored. These steps were repeated for each set of linked positions across both organellar genomes for both of the metagenomic datasets.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source: Ecology - nature.com