Handling of targeted amplicon sequencing data focusing on index hopping and demultiplexing using a nested metabarcoding approach in ecology

Targeted amplicon sequencing (TAS), also called targeted analysis sequencing, is a method for sequencing specific amplicons and genes. The approach is technologically rooted in next-generation sequencing (NGS), also called high-throughput sequencing (HTS) or massively parallel sequencing, and offers the possibility to read millions of sequences in one sequencing run. The rapid evolution of NGS technology, with constant increases in sample numbers and data output per sequencing run and associated decreases in costs, has led to this approach becoming widely used in many areas of research. Spanning epigenome, genome and transcriptome sequencing, NGS is applied across biological disciplines (e.g., botany, ecology, evolutionary biology, genetics, medical sciences, microbiology and zoology)1,2,3,4,5,6,7,8. In addition to its use in studies of gene regulation and expression, the characterization of mRNA in transcriptome analyses, the development of molecular markers and genome assembly, another application of NGS in the context of TAS is the investigation of genetic variation. TAS applications range widely, including variant detection and tumour profiling in cancer research, the detection of somatic mutations or those associated with susceptibility to disease, new findings in phylogenetic and taxonomic studies, and the discovery of useful genes for molecular breeding2,3,9,10. In the field of environmental sciences, TAS is becoming increasingly important, as it facilitates the assessment of the taxonomic composition of environmental samples through metabarcoding approaches such as environmental DNA (eDNA) based biomonitoring or food web studies11,12,13.

Although NGS-based TAS is a powerful approach, different errors and biases can be introduced into such data sets. Sequencing errors have already been documented in medical studies, wherein factors such as sample handling, polymerase errors and PCR enrichment steps were identified as potential sources of bias14,15. Similarly, factors such as variation in sequencing depth between individual samples, sequencing error rates and index hopping can play an important role in the analysis of NGS data. The difficulty is that there are currently no general standards requiring detailed reports and explanations to correct such potential errors, and very few studies have addressed this issue. Moreover, access to NGS platforms, provided by sequencing companies, core facilities and research institutes, is ever increasing16,17. NGS services often provide only the sequencing data, while general information on the particular NGS run, the demultiplexing efficiency of individual samples and other relevant parameters is usually not passed on. The lack of such information, and of a precise description of the bioinformatic data processing, makes it difficult to assess the quality of a given NGS run and the subsequent data processing, which in turn complicates the comparison of results across studies. Here, we show that specific aspects of library and data preparation have a critical influence on the assignment of sequencing results, and how these problems can be addressed using a carabid beetle trophic data set as a case study system.

Currently, a widely used approach for studying large sample numbers is the analysis of pooled samples, combining DNA from multiple individuals into one sample of the NGS library and thereby precluding the backtracking of specific sequences to an individual sample (no individual tagging)18,19,20. In ecological studies (e.g., in biodiversity research and functional ecology), the analysis of such pooled samples may lead to a decreased estimate of the diversity of the identified species compared to an individual-based analysis21. Aside from this potential loss of information, pooled samples make it impossible to assign a given sample to its specific collection site and thus to relate results to habitat-related differences. For individual-level analyses, the ‘nested metabarcoding approach’22 offers a promising solution to problems of complexity and cost. It is a cost-efficient NGS protocol that scales to hundreds of individual samples, making it ideal for any study that relies on high sample numbers or that analyses samples which need to be tagged individually, such as patient samples in the medical field. In the nested metabarcoding approach, each sample is tagged with a unique combination of four indexes. Sequencing errors within the index regions can complicate the demultiplexing process and thus the identification of the sample affiliation of individual reads. For a precise assignment of reads to samples via their index combinations, sequencing errors must be considered in the analysis so that a maximum number of reads can be assigned.
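As a minimal illustration of error-tolerant demultiplexing (not the pipeline used in this study), the sketch below assigns a read to a sample only if each of its four observed index reads lies within one mismatch (Hamming distance) of that sample's expected index; all index sequences and sample names are hypothetical.

```python
# Error-tolerant matching of a four-index combination to a sample.
# All index sequences and sample names below are hypothetical examples.

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

# Each sample is defined by a combination of four indexes
# (e.g. two library-level and two PCR-level indexes).
SAMPLES = {
    ("ATCACG", "CGATGT", "AAGCTA", "GTCAGT"): "beetle_001",
    ("TTAGGC", "TGACCA", "CCTGAA", "AGGTCA"): "beetle_002",
}

def assign_read(observed: tuple, max_mismatch: int = 1):
    """Return the sample name if every observed index is within
    max_mismatch of one sample's expected indexes, else None."""
    for expected, sample in SAMPLES.items():
        if all(hamming(o, e) <= max_mismatch
               for o, e in zip(observed, expected)):
            return sample
    return None  # read remains 'undetermined'

# A read with one sequencing error in the first index is still assigned:
print(assign_read(("ATCACT", "CGATGT", "AAGCTA", "GTCAGT")))  # beetle_001
# Two errors in one index exceed the tolerance:
print(assign_read(("ATCAGT", "CGATGT", "AAGCTA", "GTCAGT")))  # None
```

The mismatch tolerance is the key parameter: set too low, correctable reads are discarded; set too high, reads can be assigned to the wrong sample if index sequences are too similar.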

Besides sequencing errors within the index regions, which render read assignment difficult, a well-known but often ignored problem is ‘index hopping’. This phenomenon, also called index switching or swapping, describes the mis-assignment of indexes between multiplexed libraries; its rate rises with the amount of free adapters or primers present in the prepared NGS library23,24. Illumina therefore differentiates between combinatorial dual indexing and unique dual indexing, and offers special kits with unique dual index sequences (sets of 96 primer pairs) to counter the problem of index hopping and the pitfalls of demultiplexing. This is an option for low sample numbers, which can still be combined with unique dual indexes (UDIs). If several hundred samples are to be tagged individually in one run, unique dual indexing can be difficult to implement because of the high number of samples and for cost reasons. Here, the nested metabarcoding approach offers a convenient solution for analysing a large number of individual samples at comparatively low cost. However, care must be taken regarding index hopping, since more indexes are used in the nested metabarcoding approach than in pooling approaches. For instance, in silico cross-contamination between samples from different studies, and altered or falsified results, can occur if a flow cell lane is shared and reads are incorrectly assigned. Even where samples are run exclusively on a single flow cell, index hopping may result in barcode switching events between samples that lead to the mis-assignment of reads.
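To make the index-hopping problem concrete, the following sketch (with hypothetical index sequences and read counts) estimates a hopping rate for a combinatorially dual-indexed library: reads whose i7 and i5 indexes are both individually valid but were never paired on the sample sheet are counted as putative hopping events.

```python
# Estimating an index-hopping rate in a combinatorially dual-indexed library.
# Index sequences and read counts are hypothetical, for illustration only.
from collections import Counter

# Expected (i7, i5) pairs as defined on the sample sheet.
expected_pairs = {
    ("ATCACG", "AGGCTT"),
    ("CGATGT", "TCCGAA"),
}
valid_i7 = {p[0] for p in expected_pairs}
valid_i5 = {p[1] for p in expected_pairs}

# Observed (i7, i5) combinations with their read counts.
observed = Counter({
    ("ATCACG", "AGGCTT"): 98000,  # expected combination
    ("CGATGT", "TCCGAA"): 95000,  # expected combination
    ("ATCACG", "TCCGAA"): 450,    # both valid, wrong pairing -> putative hop
    ("CGATGT", "AGGCTT"): 380,    # both valid, wrong pairing -> putative hop
    ("NNNNNN", "AGGCTT"): 1200,   # unreadable index -> ignored here
})

hopped = sum(n for (i7, i5), n in observed.items()
             if i7 in valid_i7 and i5 in valid_i5
             and (i7, i5) not in expected_pairs)
assignable = sum(n for (i7, i5), n in observed.items()
                 if i7 in valid_i7 and i5 in valid_i5)
rate = hopped / assignable
print(f"putative index-hopping rate: {rate:.3%}")
```

With unique dual indexes this diagnostic is what makes hopped reads removable: any read carrying a valid but unexpected i7/i5 combination can simply be discarded, whereas with combinatorial indexing the hopped combination may coincide with another real sample.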

For library preparations of Illumina NGS runs, two indexes are usually used to tag the individual samples (dual indexing)25. Illumina offers the option to do the demultiplexing and convert the sequenced data into FASTQ file formats using the supplied ‘bcl2fastq’ or ‘bcl2fastq2’ conversion software26. This demultiplexing is a crucial step, as it is here that the generated DNA sequences are assigned to the samples. In most cases, the data is already provided demultiplexed after the NGS run by the sequencing facility, especially if runs were shared between different studies/sample sets. Researchers starting the bioinformatic analysis with demultiplexed data assume that the assignment of the sequences to samples was correct. Verifying this is extremely difficult because the provided data sets lack all the information on the demultiplexing settings and, above all, on the extent of sequencing errors within indexes and of index hopping. As a consequence, sequences can be incorrectly assigned to samples and, in the case of a shared flow cell, even across sample sets. These steps of bioinformatic analysis are very often outsourced to companies and details on demultiplexing are seldom reported, showing that the problem of read mis-assignment has received little attention so far. However, it is known that demultiplexing errors occur and depend on various factors such as the Illumina sequencing platform, the library type used and the index combinations23,24,25,27,28,29,30. The few existing studies investigating index hopping in more detail give rates of 0.2–10%24,31,32,33,34. This indicates the importance of being able to estimate the extent of index hopping for a specific library. The problem of sequencing errors within indexes and index hopping can become particularly significant if, due to the large number of individual samples, libraries were constructed with two index pairs instead of one, as is the case in the nested metabarcoding approach35. Then, one is inevitably confronted with the effect of sequencing errors and index hopping on demultiplexing and, subsequently, on the data output.
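Reporting the exact demultiplexing call would remove much of this ambiguity. As an illustrative bcl2fastq invocation (all paths are hypothetical), the `--barcode-mismatches` option controls how many mismatches per index are tolerated during demultiplexing and is exactly the kind of setting that should be stated in a methods section:

```shell
# Illustrative bcl2fastq call; run folder, output and sample sheet
# paths are hypothetical placeholders.
# --barcode-mismatches sets the per-index mismatch tolerance;
# reporting this value is essential for reproducible demultiplexing.
bcl2fastq \
  --runfolder-dir /data/runs/HiSeq_run_01 \
  --output-dir /data/fastq/HiSeq_run_01 \
  --sample-sheet /data/runs/HiSeq_run_01/SampleSheet.csv \
  --barcode-mismatches 1
```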

After each NGS run, a combination of computational power and background knowledge in bioinformatics is needed to ensure time-efficient and successful data analysis36. But even for natural scientists with considerable bioinformatic experience, there is a lack of know-how, or even of rules of thumb, in this still nascent field. It is well known that specific decisions have a marked impact on the outcome of a study, with both the sequencing platform and the software tools significantly affecting the results and thereby the interpretation of the sequencing information37. Knowledge of the individual data processing steps, such as the demultiplexing, is also often missing or poorly described, and information on how to minimize data loss within the individual steps of NGS data preparation is mostly not provided. Given this lack of detail, it is a challenge to understand what was done during sample processing and data analysis, and impossible to compare the outcomes of different studies. To date, published NGS studies, such as TAS or DNA metabarcoding studies, are difficult to compare or evaluate because this essential information on data processing is lacking. This is particularly important as NGS is increasingly being carried out by external service providers. As a consequence, there is a pressing need for comprehensive protocols that detail the aspects that need to be considered during analysis.

Using a case study on the dietary choice of carabid beetles (Coleoptera: Carabidae) in arable land, we detail a comprehensive protocol that describes an entire workflow targeting ITS2 fragments, using an Illumina HiSeq 2500 system and applying the nested metabarcoding approach22 to identify the species of weed seeds consumed by individual carabids. We demonstrate a concept that employs bioinformatic tools for targeted amplicon sequencing in a defined order. By analysing the effects of sequencing errors and index hopping on demultiplexing and data trimming, we show the importance of describing the software and pipeline used, including version numbers, and of specifying software configurations and threshold settings for each TAS data set in order to obtain a realistic data output per sample. Without this information, samples may be assigned incorrectly, or the maximum, or at least a sufficient, number of sequences may not be recovered, which in turn would compromise the results.

The concept described below can be used to analyse a large number of samples, here to identify food items at the species level, and to address the problems that may arise in NGS data processing. We identify problems to overcome and potential solutions by examining: (i) the variation in sequencing depth of individually tagged samples and the effect of library preparation on the data output; (ii) the influence of sequencing errors within index regions and its consequences for demultiplexing; and (iii) the effect of index hopping. In doing so, we highlight the benefits of a detailed protocol for the bioinformatic analysis of a given data set, and the importance of reporting bioinformatic parameters, especially for the demultiplexing, and the thresholds to be used for meaningful data interpretation.


Source: Ecology - nature.com
