Let us first describe the setting and introduce notation. The main object of analysis is a matrix A (a contingency table with (n_r) rows and (n_c) columns) that contains the counts of two variables. A common example from ecology is that (A_{ij}) contains some measure of abundance of species i (rows) in sampling site j (columns). The matrix A can also be a binary incidence matrix, containing either the presence (1) or absence (0) of species in sites. The matrix A can be interpreted as the bi-adjacency matrix of a bipartite network that connects species to sites. The network contains (n_r) nodes on one side (the species, given by the rows of A, indexed by i), and (n_c) nodes on the other side (the sites, given by the columns of A, indexed by j). In general, we will refer to the two sets of nodes as row nodes and column nodes, respectively. The degree of a row node i is defined by the row sum (r_i = sum _j A_{ij}), which gives the total abundance of species i in all sites. Likewise, the degree of a column node j is defined as the column sum (c_j = sum _i A_{ij}), which gives the total abundance of species in a site j. The degrees of the row and column nodes are given by the vectors (mathbf {r}= (r_1, r_2, dots , r_{n_r})^T) and (mathbf {c}= (c_1,c_2,dots ,c_{n_c})^T). We further define two square matrices, (D_r) ((n_r times n_r)) and (D_c) ((n_c times n_c)) as the diagonal matrices that have (mathbf {r}) and (mathbf {c}) on the diagonal, respectively. The sum (n = sum _{ij} A_{ij}) gives the total number of occurrences in the table (in the case of a species-site example, the total abundance of species).
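As a point of reference, the quantities defined above can be computed with a minimal NumPy sketch, using a small hypothetical species-by-site abundance matrix (the numbers are illustrative only and not taken from any dataset):

```python
# Minimal sketch of the notation above, using a small hypothetical
# species-by-site abundance matrix (the numbers are illustrative only).
import numpy as np

A = np.array([[3, 0, 1],
              [0, 2, 2],
              [1, 1, 0]], dtype=float)   # n_r = 3 species (rows), n_c = 3 sites (columns)

r = A.sum(axis=1)            # row degrees r_i (total abundance of species i)
c = A.sum(axis=0)            # column degrees c_j (total abundance in site j)
D_r = np.diag(r)             # (n_r x n_r) diagonal matrix with r on the diagonal
D_c = np.diag(c)             # (n_c x n_c) diagonal matrix with c on the diagonal
n = A.sum()                  # total number of occurrences
```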
CA as canonical correlation analysis
One of the first derivations of CA was obtained by applying canonical correlation analysis to categorical variables12,19,22. Here we follow the derivation in Ref.1 (Chapter 9), where CA is obtained by applying canonical correlation analysis to a bipartite network; we refer the reader there for further details. For ease of explanation, we will assume the network is defined by a binary presence-absence matrix (i.e. the network is unweighted), but the result generalizes to any contingency table (i.e. weighted bipartite networks).
The aim is to assign a ‘score’ to each node in the network, under the assumption that row and column nodes with similar scores connect to each other. Hence, connected nodes get assigned similar scores, and the scores can be thought of as a latent variable that drives the formation of links in the network. In ecology, such latent variables are referred to as gradients2,3. Considering a bipartite network describing the occurrence of species in a set of sites, for example, the resulting scores may reflect some variable determining why species occur in specific sites, such as the temperature preference of a species and the temperature at a site. In practice, the interpretation of a gradient resulting from application of CA can be verified by correlating it with known environmental variables (e.g. data on the temperature of each site).
Mathematically, such gradients can be inferred from the edges of the bipartite network. Recall that for a presence-absence matrix, the total number of edges in the bipartite network is given by (n = sum _{ij} A_{ij}). Let us construct a vector (mathbf {y}_r) of length n that contains, for each edge, the scores of the row node it connects to, and a vector (mathbf {y}_c) of length n that contains, again for each edge, the score of the column node it connects to. Given the assumption that edges connect row nodes and column nodes with similar scores, the node scores can be found by maximizing the correlation between (mathbf {y}_r) and (mathbf {y}_c), so that the row- and column scores for each edge are as similar as possible. Denoting the vector of length (n_r) containing the row scores by (mathbf {v}) and the vector of length (n_c) containing the column scores by (mathbf {u}), this leads to the optimization problem
$$\begin{aligned} \max_{\mathbf{v}, \mathbf{u}} \, \mathrm{corr}(\mathbf{y}_r, \mathbf{y}_c). \end{aligned}$$
(1)
In order to obtain standardized scores, the constraints that (mathbf {y}_r) and (mathbf {y}_c) have zero mean and unit variance need to be added. Solving this problem using Lagrangian optimization, the solution is given by
$$\begin{aligned} D_r^{-1} A D_c^{-1} A^T \mathbf{v} &= \lambda \mathbf{v}, \\ D_c^{-1} A^T D_r^{-1} A \mathbf{u} &= \lambda \mathbf{u}. \end{aligned}$$
(2)
The score vectors (mathbf {v}) and (mathbf {u}) can thus be found by solving an eigenvector problem. Following Ref.1, they are subject to the constraint that (|mathbf {v}|=|mathbf {u}|=1). The general interpretation of the elements of (mathbf {v}) and (mathbf {u}) is as follows. Each row node (of A) is represented in (mathbf {v}), and each column node is represented in (mathbf {u}). The smaller the difference between the values of two row (column) nodes in (mathbf {v}) ((mathbf {u})), the more similar these nodes are. The similarity among row nodes reflected in (mathbf {v}) arises because they are connected to a similar set of column nodes in the original bipartite network (and vice versa for similarities in (mathbf {u})). In the literature, this is referred to as row nodes being similar because of their similar ‘profile’11,23, and reciprocal averaging defines exactly how the scores are calculated in terms of the profiles and reduces to the same set of equations15. Both matrices on the left-hand side of Eq. (2) are row-stochastic and positive definite, and have identical eigenvalues that are real and take values between 0 and 1. Assuming that we have a connected network, sorting the eigenvalues in decreasing order leads to (1=lambda _1 > lambda _2 ge dots ge 0).
It can be shown that the correlation between (mathbf {y}_r) and (mathbf {y}_c) for a given set of eigenvectors (mathbf {v}) and (mathbf {u}) is given by their corresponding eigenvalue, so that (lambda = mathrm {corr}^2(mathbf {y}_{r},mathbf {y}_{c})). Note that the correlation between the row and column vectors can be negative, meaning that only the absolute value of the correlation between (mathbf {y}_r) and (mathbf {y}_c) is related to the square root of the eigenvalues. Iterative approaches to extract potential negative correlations exist in the literature24. The node scores leading to the highest correlation are thus given by the eigenvectors associated with the largest eigenvalue. However, the eigenvectors corresponding to (lambda _1) have constant values and represent the trivial solution in which all row nodes and all column nodes have equal scores (leading to a perfect correlation). This trivial solution does not satisfy the condition that the scores have to be centered, and thus it must be rejected. The solution to Eq. (1) is thus given by the eigenvectors (mathbf {v}_2) and (mathbf {u}_2), corresponding to the second largest eigenvalue (lambda _2), whose square root gives the (maximized) correlation. We notice here that this derivation leads to results analogous to those obtained in classical derivations of CA, where the matrix A is centered both with respect to the rows and to the columns.
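To make this concrete, the sketch below (NumPy only; it reuses the toy matrix A and the degree matrices from the sketch above) solves the eigenvector problem of Eq. (2), extracts the non-trivial scores, and checks numerically that the squared correlation between the edge-level vectors (mathbf {y}_r) and (mathbf {y}_c) reproduces (lambda _2):

```python
# Sketch: solve Eq. (2) for the toy matrix A defined above and verify that
# lambda_2 equals the squared correlation of the edge-expanded score vectors.
D_r_inv = np.diag(1.0 / A.sum(axis=1))
D_c_inv = np.diag(1.0 / A.sum(axis=0))

lam_r, V = np.linalg.eig(D_r_inv @ A @ D_c_inv @ A.T)   # row-score problem
lam_c, U = np.linalg.eig(D_c_inv @ A.T @ D_r_inv @ A)   # column-score problem

# Sort by decreasing eigenvalue; index 0 is the trivial (constant) solution.
V = V[:, np.argsort(-lam_r.real)].real
U = U[:, np.argsort(-lam_c.real)].real
lam = np.sort(lam_r.real)[::-1]

v2, u2 = V[:, 1], U[:, 1]        # the non-trivial CA scores (sign is arbitrary)

# Expand the node scores onto the n individual occurrences: y_r and y_c contain,
# for every occurrence of row node i in column node j, the scores v2[i] and u2[j].
rows, cols = np.nonzero(A)
counts = A[rows, cols].astype(int)
y_r = np.repeat(v2[rows], counts)
y_c = np.repeat(u2[cols], counts)

# Up to the sign ambiguity of the eigenvectors, corr(y_r, y_c)^2 matches lambda_2.
print(np.corrcoef(y_r, y_c)[0, 1] ** 2, lam[1])
```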
The second eigenvectors (mathbf {v}_2) and (mathbf {u}_2) hold the unique scores such that row- and column nodes with similar scores connect to each other. The second eigenvalue (lambda _2) indicates to what extent the row- and column scores can be ‘matched’, where high values (close to 1) indicate a high association between the inferred scores (the gradient) and the structure of the network.
The higher-order eigenvectors in Eq. (2) and their eigenvalues are solutions to Eq. (1) with the additional constraint that (mathbf {y}_r) and (mathbf {y}_c) are orthogonal to the other solutions. The vectors (mathbf {v}_3) and (mathbf {u}_3), for example, may represent other variables that drive the formation of links (e.g. precipitation, primary productivity, etc.) on top of the gradients described by (mathbf {v}_2) and (mathbf {u}_2). We note that, in contrast to the notation in some of the CA literature, we here denote the k-th non-trivial eigenvector with the subscript k+1.
CA as a clustering algorithm
A completely different approach shows that the eigenvectors (mathbf {v}_2) and (mathbf {u}_2) (i.e. the second eigenvectors in Eq. (2)) can also be interpreted as cluster labels, obtained when identifying clusters in the network of similarities that is derived from the bipartite network.
A similarity network can be constructed from a bipartite network by ‘projecting’ the bipartite network onto one of its layers (either the row nodes or the column nodes) through stochastic complementation18. Projecting the bipartite network defined by A onto its row layer leads to the (n_r times n_r) similarity matrix (S_r = A D_c^{-1} A^T). The entries of (S_r) represent pairwise similarities between row nodes of A, based on how many column nodes they share links with, weighted by the degree of each column node. Similarly, the (n_c times n_c) similarity matrix (S_c = A^T D_r^{-1} A) defines the pairwise similarities between the column nodes of A.
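As a brief illustration, both projections can be computed directly, reusing the toy matrix A and the degree matrices from the earlier sketches:

```python
# Sketch: similarity matrices obtained by stochastic complementation of the
# toy bipartite network defined by A above.
S_r = A @ np.linalg.inv(D_c) @ A.T   # (n_r x n_r) similarities between row nodes
S_c = A.T @ np.linalg.inv(D_r) @ A   # (n_c x n_c) similarities between column nodes
```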
Identifying clusters in the similarity network can be done by minimizing the so-called ‘normalized cut’20. The normalized cut assigns, for a given partition of a network into K clusters, a score that represents the strength of the connections between the clusters for that partition. A partition can be described by assigning a discrete cluster label to each node. Hence, minimizing the normalized cut is equivalent to assigning a cluster label to each node in the network in such a way that the clusters are minimally connected. Finding the discrete cluster labels that minimize the normalized cut in large networks is in general not possible20. However, a solution of a related problem can be obtained when the cluster labels are allowed to take continuous values as opposed to discrete values. Solutions of this ‘relaxed’ problem can be interpreted as continuous approximations of the discrete cluster labels.
Minimizing the normalized cut in (S_r) leads to the generalized eigensystem20
$$\begin{aligned} (D_r - S_r) \mathbf{v} = \tilde{\lambda} D_r \mathbf{v}, \end{aligned}$$
(3)
where the entries of the generalized eigenvector (mathbf {v}_2) corresponding to the second smallest eigenvalue (tilde{lambda }_2) of Eq. (3) hold the approximate cluster labels of the optimal partition into two clusters. It is easily shown that the generalized eigenvectors in Eq. (3) are exactly the eigenvectors of Eq. (2), with the eigenvalues related by (tilde{lambda }_k = 1 - lambda _k) for (k=1,2,dots ,n_r) (see “Suppl. Material A”).
The matrix (L_r = D_r-S_r) is known as the Laplacian matrix of the similarity network defined by (S_r), and is well known in spectral graph theory25. The number of eigenvalues of (L_r) for which (tilde{lambda } = 0) (or equivalently (lambda = 1) in Eq. (2)) equals the number of disconnected clusters in the network. Each of these ‘trivial’ eigenvalues has a corresponding generalized eigenvector that has constant values for the nodes in a particular cluster, indicating cluster membership.
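A short numerical check of the stated relation between Eqs. (2) and (3), assuming SciPy is available and reusing the quantities from the earlier sketches:

```python
# Sketch: solve the generalized eigenproblem of Eq. (3) for the toy data and
# check the relation tilde_lambda_k = 1 - lambda_k with the eigenvalues of Eq. (2).
from scipy.linalg import eigh

L_r = D_r - S_r                     # Laplacian of the row-node similarity network
tilde_lam, V_gen = eigh(L_r, D_r)   # symmetric-definite generalized eigenproblem
# eigh returns eigenvalues in increasing order, so tilde_lam should match 1 - lam
# (with lam sorted in decreasing order, as computed after Eq. (2)).
print(np.allclose(tilde_lam, 1 - lam))
```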
The situation changes when the clusters are weakly connected. The optimal solution for partitioning the similarity network into two clusters is given by the eigenvector (mathbf {v}_2) associated with the eigenvalue (lambda _2). Although continuous, the entries of (mathbf {v}_2) can be interpreted as approximations to cluster labels, which indicate for each row node to which cluster it belongs. In other words, nodes with high values in this eigenvector (i.e., high scores) belong to one cluster, and nodes with low scores to the other. A discrete partition can then be obtained from the approximate (continuous) cluster labels by discretizing them, for example by assigning all negative values to one cluster and all positive values to the other26. The corresponding eigenvalue (lambda _2) represents the quality of the partitioning, as determined by the normalized cut criterion. High values are indicative of a network that can be well partitioned into two clusters (two totally disconnected clusters would yield eigenvalues (lambda _1 = lambda _2=1)), whereas lower values correspond to a network that is less easily grouped into two clusters (i.e. the resulting clusters are more interconnected).
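Continuing the running sketch, a discrete two-way partition of the row nodes can be read off by splitting the second (generalized) eigenvector on its sign, the simple thresholding choice mentioned above:

```python
# Sketch: discrete two-way partition from the approximate cluster labels.
labels_two_way = (V_gen[:, 1] > 0).astype(int)   # cluster 0 / cluster 1 per row node
```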
Finding a partitioning into multiple, say K, clusters is more involved, where (K le n_r) (or (K le n_c) if working with column variables). Minimizing the normalized cut for K clusters yields a trace minimization problem whose relaxed solution is given by the first K eigenvectors in Eq. (2)27. Because the first eigenvector in Eq. (2) is trivial, in practice we only require (K-1) eigenvectors (i.e., the 2nd, 3rd, … up to the Kth). The discrete cluster labels can then be obtained, for example, by running a k-means algorithm on the matrix consisting of those (K-1) eigenvectors, a technique that is also known as spectral clustering28,29. How well the network can be partitioned into K clusters is given by the average value of the first K eigenvalues, i.e. (frac{1}{K} sum _{k=1}^K lambda _k)27.
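A minimal sketch of this K-way procedure, assuming scikit-learn is available for the k-means step (the choice K = 2 and the toy data are illustrative only):

```python
# Sketch: K-way spectral clustering of the row nodes using the eigenvectors of Eq. (2).
from sklearn.cluster import KMeans

K = 2
X_cluster = V[:, 1:K]                              # the K-1 non-trivial eigenvectors
labels = KMeans(n_clusters=K, n_init=10).fit_predict(X_cluster)
quality = lam[:K].mean()                           # average of the first K eigenvalues
```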
The clustering approach thus brings an alternative interpretation to CA results. A key observation is that the eigenvalues and eigenvectors in Eq. (2) are directly related to the generalized eigenvectors of the Laplacian of the similarity matrix (S_r), and thus hold information on the structure of the similarity network. The entries of the second eigenvector (mathbf {v}_2) can be interpreted as the approximate cluster labels of a two-way partitioning of the similarity network defined by (S_r). Although at first sight the interpretation of CA scores as cluster labels may seem different from the interpretation as a latent variable described above in “CA as canonical correlation analysis”, note that cluster labels can be seen as latent variables, albeit discrete rather than continuous.
CA as a graph embedding technique
A third interpretation of the eigenvectors and eigenvalues in Eq. (2) arises from a so-called graph embedding of the similarity matrix (S_r) (or (S_c)). Graph embeddings provide a way to obtain a low-dimensional representation of a high-dimensional network, and are used, for example, for graph drawing. A graph embedding represents the nodes of a graph as node vectors in a space, such that nodes that are ‘close’ in the network are also close in terms of their distance in the embedding. A key feature of these embeddings is that their dimensionality can be reduced in order to obtain a low-dimensional representation of the data, while retaining its most important structural properties (see Ref.1, chapter 10 for an overview of graph embedding techniques).
As noted by several authors, CA is equivalent to graph embedding in the case of a similarity matrix obtained through stochastic complementation. For example, computing a 1-step diffusion map of the similarity matrix (S_r) leads exactly to the eigenvectors of Eq. (2)18,30. Also, the graph embedding using the Laplacian eigenmap has been shown to be equivalent to graph partitioning using the normalized cut, which is in turn equivalent to CA31. The type of embedding specific to CA is based on the chi-square statistic and is thus Euclidean.
Embedding the similarity network (S_r) in a ((K-1))-dimensional space yields an ‘embedding matrix’ (X_r in mathbb {R}^{n_r times (K-1)}) (known in CA-related literature as ‘principal coordinates’). Each row of (X_r) represents a node of (S_r) as a ‘node vector’ in the embedding. The rows of (X_r) hold the components of each node along the ((K-1)) basis vectors that span the embedding; these basis vectors are identical to what are referred to as the ‘axes’ in CA. Every entry (X_{i,k}) represents the coordinate of row node i on the k’th basis vector, and can be seen as the ‘score’ of i on the k’th CA axis. An embedding matrix of (S_r) can be defined as (X_r = [sqrt{lambda _2} mathbf {v}_2, dots , sqrt{lambda _K} mathbf {v}_{K}]), where the vectors (mathbf {v}_k) are the eigenvectors defined in Eq. (2), and each of them is weighted by the square root of their corresponding eigenvalue. We will refer to the columns of the embedding matrix as ‘CA axes’, given by (mathbf {x}_k = sqrt{lambda _k}mathbf {v}_k) (with (mathbf {x}_2) being the ‘first CA axis’, and so on).
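Continuing the running sketch, the embedding matrix for the row nodes can be assembled directly from the eigenpairs computed after Eq. (2) (the choice K = 3, i.e. a two-dimensional embedding, is illustrative):

```python
# Sketch: principal coordinates of the row nodes, i.e. the CA axes x_k = sqrt(lambda_k) * v_k.
K = 3
X_r = V[:, 1:K] * np.sqrt(lam[1:K])   # columns are sqrt(lam_k) * v_k for k = 2, ..., K
```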
The axes are constructed in such a way that they capture the largest amount of ‘variation’ or ‘inertia’ in the data, which is given by their corresponding eigenvalue11. The sum of all the eigenvalues gives the total variation in the data (in CA, this is referred to as the total inertia). CA decomposes the total variation in such a way that the first axis captures a maximal part of the variation, the second a maximal part of the remaining variation, and so on. A low-dimensional embedding that preserves the maximal amount of variation can thus be obtained by discarding the eigenvectors corresponding to smaller eigenvalues. The ‘quality’ of the embedding can then be expressed as the share of the total variation that is preserved in the embedding.
A typical way of presenting CA results is by showing the first two coordinates of each row (or column) node, i.e. plotting (mathbf {x}_2) against (mathbf {x}_3), which is usually referred to as a ‘correspondence plot’. Since the first two axes capture a maximal amount of inertia, such a plot is in a way the optimal two-dimensional representation of the data that captures the relations between the rows (or columns) of A. The distances between points in the correspondence plot approximate the similarities between nodes. How well the correspondence plot represents the similarities is given by the percentage of variation explained by the first two axes.
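For completeness, a sketch of such a correspondence plot for the row nodes, assuming matplotlib is available; here we take the sum of the non-trivial eigenvalues as the total inertia when computing the percentage of variation per axis.

```python
# Sketch: correspondence plot of the row nodes (first CA axis vs. second CA axis).
import matplotlib.pyplot as plt

share = lam[1:] / lam[1:].sum()      # share of (non-trivial) inertia captured per axis
fig, ax = plt.subplots()
ax.scatter(X_r[:, 0], X_r[:, 1])
for i, (x, y) in enumerate(X_r[:, :2]):
    ax.annotate(f"row {i}", (x, y))
ax.set_xlabel(f"CA axis 1 ({100 * share[0]:.1f}% of inertia)")
ax.set_ylabel(f"CA axis 2 ({100 * share[1]:.1f}% of inertia)")
plt.show()
```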
Each axis can be interpreted as a latent variable that accounts for part of the total variation in the data. Since the axes in the embedding are given by a scaled version of the eigenvectors discussed above in “CA as canonical correlation analysis”, the interpretation of the eigenvalues as the amount of variation explained is complementary to the interpretation as the correlation between row and column scores which we also introduced above in “CA as canonical correlation analysis”. Furthermore, the axes spanning the ((K-1))-dimensional embedding are exactly the non-trivial generalized eigenvectors that follow from minimizing the normalized cut for K clusters31. Indeed, when there are clear clusters in the similarity network, they will show up in the embedding space as separate groups of points.
Summarizing, we find three interpretations of CA axes and their corresponding eigenvalues: as latent variables that drive the formation of links in the bipartite network, as approximate cluster labels of a bi-partition of the similarity network, and as coordinates of an embedding of the similarity network. The different derivations of CA and their interpretations are summarized in Table 1.