Human gut microbiome data and preprocessing
The publicly available data that we re-analyzed here were generated by David et al.32 accessible on the European Nucleotide Archive (ENA) under the accession number ERP006059, and by Hsiao et al.31 on the NCBI Short Read Archive (SRA) under the accession number PRJEB6358. The downloaded reads were trimmed with V-xtractor version 2.146 a HMM scan based method of isolating variable regions from 16S rRNA sequences) to ensure the amplicon sequences could be aligned across consistent fractions of the 16S rRNA variable regions. Trimmed reads were then clustered into OTUs using usearch v9.2.6447 with a minimum cluster size of two. Representative sequences from each OTU were classified using mothur v1.36.148 and the RDP reference 16S rRNA sequences v1649.
Prochlorococcus data
Data from Malstrom et al.33 was obtained from the Biological and Chemical Oceanography Data Management Office (https://www.bco-dmo.org), accession number 3381.
Mapper
Conceptually, the Mapper algorithm accepts as input a matrix of distances or dissimilarities between data, and aims to represent the shape of the distribution of data points in high-dimensional phase space as an undirected graph. In this graph, vertices represent neighborhoods of phase space spanned by subsets of adjacent data points, and edges represent connectivity between neighborhoods. In brief, it does this by dividing the data into overlapping subsets that are similar according to the output of at least one filter function that assigns a scalar value to each data point, performing local clustering on each subset, and representing the result as an undirected graph, where each vertex represents a local cluster of data points, and edges between vertices represent at least one shared data point between clusters.
Distance matrix
We interpreted microbiome relative abundances to be probability distributions, and thus used the square root of the Jensen-Shannon divergence as a metric50. However, it is important to note that any other metric can be used in place of the Jensen-Shannon distance, such as the Aitchison distance51, calculated from centered10 or isometric12 log-transformed relative abundances.
Filter functions and binning
For the filter functions used by Mapper to bin data points, we performed principal coordinate analysis (PCoA, also known as classical multidimensional scaling) in two dimensions on the pairwise distance matrix, and used the ranked values of principal coordinates (PCo) 1 and 2 as the first and second filter values for Mapper, following Rizvi et al.28. PCo ranks are an appropriate filter for our purposes, as it assigns similar filter values to points that are relatively close together in the original phase space. We wish to note that while PCoA leads to loss of information, the following local clustering step is performed using subsets of distances from the original distance matrix, and is thus not affected. The data points were then binned by overlapping intervals of the two ranked principal coordinates. For hyperparameters specifying these bins and their overlaps, see Table 1.
Local clustering
The algorithm first performs hierarchical clustering from all pairwise distances between data points within a bin of filter values. Then, it creates a histogram of branch lengths using a predefined number of bins, and uses the first empty bin in the histogram as a cutoff value, separating the hierarchical tree into single-linkage clusters. The algorithm thus finds a separation of length scales within each neighborhood of phase space represented by a bin of the filter values. We used the default number of histogram bins, 10, for each data set (Table 1).
Creating the undirected Mapper graph
The final output is produced by representing each local cluster of data points as a vertex, and drawing an edge between each pair of vertices that share at least one data point. When plotting, the size of each vertex represents the number of data points therein. Layout and visualization of the Mapper graph may be performed with any graph layout algorithm; we used the Fruchterman-Reingold force-directed layout algorithm52. It is important to note that the visualized shape of the Mapper graph depends on the algorithm used, and may not be deterministic. When performing a Mapper analysis, one should rely on the connectivity of the graph rather than the overall shape.
Selection of hyperparameters
The Mapper algorithm is relatively new, and there are currently no standard protocols to optimize the values of the hyperparameters. For our purposes, it was important that the algorithm achieved a sufficiently high resolution in partitioning data, but also adequately represented connections between regions of phase space. We thus used the following heuristic to set the number of intervals and percent overlap for each data set.
- 1.
The largest vertex in the resultant Mapper graph should represent no more than ≈10% of the total number of data points in the set;
- 2.
the number of connected components representing only one data point should be minimized.
We acknowledge that a heuristic determination of appropriate hyperparameter values leaves much to be desired; as such, we recommend future in-depth theoretical explorations of how the Mapper output depends on the choice of hyperparameters.
Density estimation
We estimated the inverse density for each vertex by calculating the k-nearest neighbors (kNN) distance53 for each constituent data point i.
We first define the k-neighborhood N(k)i of a point i, to be the set of k nearest neighbors of i, choosing k equal to 10% of the number of samples in each data set, rounded to the nearest integer. Then the kNN distance of point i is defined as:
$${rm{kNN}}(i,k)=frac{{sum }_{jin N{(k)}_{i}}{d}_{ij}}{k}$$
(1)
where dij is the distance between points i and j.
For a vertex V representing n points, we define its inverse density as
$${D}_{{rm{inv}}}(V)=frac{{sum }_{iin V}{rm{kNN}}(i,k)}{{n}^{2}}$$
(2)
The n2 term in the denominator compensates for the differing sizes of vertices. Finally, we invert the inverse density to obtain the estimated density:
$$D(V)=frac{1}{{D}_{{rm{inv}}}}$$
(3)
State assignment
We then defined states as topological features of the density surrounding local maxima of D. We designated each vertex with higher D than its neighbors to be a local maximum of the potential. Connected vertices tied for maximum D were each assigned to be a local maximum. To approximate a gradient, we converted the undirected Mapper graph to a directed graph, with each edge pointing from the vertex with lower D to the one with higher D. For each non-maximum vertex, we found the graph distance dg to each local maximum constrained by edge direction. We defined the state Bx of a maximum Vx as the set of vertices V with uniquely shortest graph distance to Vx:
$$Vin {B}_{x},{rm{if}},{d}_{g}(V,{V}_{x}),<,{d}_{g}(V,{V}_{y})$$
(4)
for all y ≠ x and Vy ∈ M, where M is the set of all local maxima (Fig. 1b). Vertices equidistant to multiple maxima were defined to be unstable regions unassigned to any state. Multiple connected maxima were defined as belonging to the same state. Notably, one data point may be associated with multiple vertices and states, or an unstable region and at least one state: we interpreted this to mean that the point is near a saddle point separating states, and as the ‘true’ coordinates of the saddle point are unknown, the data point is assigned to all such states and/or an unstable region with uniform weight.
Calculating the temporal correlation function
Given that a system occupied state Bx at time t, we defined the temporal correlation to be the probability that it will still (or again) occupy state Bx at time t + τ:
$${f}_{x}(t+tau )=left{begin{array}{ll}1&,{text{if}}, {text{system}}, {text{is}}, {text{associated}}, {text{with}}, {text{state}},{B}_{x},,{text{at}}, {text{time}},,t+tau 0&{rm{otherwise}}.end{array}right.$$
(5)
$${{rm{corr}}}_{x}(tau )=langle {f}_{x}(t+tau )rangle$$
(6)
We calculated the correlation function for each state x visited by a subject during a characteristic period and for all sampled intervals between pairs of samples of length τ, where the subject was in state Bx in the sample at the start of the interval. For the cholera data set, we calculated correlation functions for each state visited by each subject over the disease period. For the data set of two healthy adult males, we calculated correlation functions for each state visited by each subject in each healthy period, either before or after infection. For the Prochlorococcus data set, we calculated correlation functions for each state at each depth fraction at either site. Where a data point is associated with multiple states, we weigh the association with each state as ({f}_{x}^{prime}(t)=frac{1}{p}{f}_{x}(t)), with p the total number of unique states associated with the system at time t, with the unassigned/unstable state regarded as a single distinct state. Notably, this means ({f}_{x}^{prime}(t+tau )) can have values of (1,frac{1}{2},frac{1}{3}ldots)
Rarefaction test
We created random subsets of each data set representing 90%, 50%, and 10% of the original data points, repeating 10 times for each data set and downsampling ratio. We then created Mapper graphs representing the rarefied data using the same hyperparameters as for each of the full data sets. We colored the vertices to indicate the same features as for the full data sets: for the cholera data set, by fraction of samples belonging to the diarrhea or recovery phase; for the two healthy adult gut microbiomes data set, by fraction of samples obtained from each subject; and for the Prochlorococcus data set, by the mean depth from which samples originated. We ordered the vertices by feature value and used a circularized linear layout algorithm, such that vertices with similar feature values are adjacent. Finally, we used shading to display edge densities.
Reporting summary
Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.
Source: Ecology - nature.com