
BioCPPNet: automatic bioacoustic source separation with deep neural networks

Our novel approach to bioacoustic source separation involves an end-to-end pipeline consisting of multiple discrete steps, including (1) synthesizing a dataset, (2) developing and training a separator network to disentangle the input mixture, and (3) constructing and training a classifier model to serve as a downstream evaluation task. This workflow requires few hyperparameter modifications to account for unique vocal behavior across different biological taxa but is otherwise general and makes no species-level assumptions about the spectrotemporal structure of the source calls. We develop a complete framework for bioacoustic source separation in a permutation-invariant mode using overlapping waveforms drawn from the same class of signals. We apply BioCPPNet to macaques, dolphins, and Egyptian fruit bats, and we consider two or three concurrent “speakers”. Note that we henceforth refer to non-human animal signalers as “speakers” for consistency with the human speech separation literature2. We address both the closed speaker regime, in which the training and evaluation data subsets contain calls produced by individuals drawn from the same distribution, as well as the open speaker regime, in which the model is tested on calls generated by individuals not present in the training dataset.

Bioacoustic data

We investigate a set of species with dissimilar vocal behaviors in terms of spectral and temporal properties. We apply BioCPPNet to a macaque coo call dataset47 consisting of 7285 coos produced by 8 unique individuals; a bottlenose dolphin signature whistle dataset26 comprised of 400 signature whistles generated by 20 individuals, of which we randomly select 8 for the purposes of this study; and an Egyptian fruit bat vocalization dataset48 containing a heterogeneous distribution of individuals, call types, and call contexts. In the case of the bat dataset, we extract the data (31399 calls) corresponding to the 15 most heavily represented individual bats, reserving 12 individuals (27586 calls) to address the closed speaker regime and the remaining 3 individuals (3813 calls) to evaluate model performance in the open speaker scenario.

Datasets

The mixture dataset is generated from a species-specific corpus of bioacoustic recordings containing signals annotated according to the known identity of the signaler. Motivated by WSJ0-2mix2, a preeminent reference dataset used for human single-channel acoustic source separation, we adopt a similar approach of constructing bioacoustic datasets by temporally overlapping and summing ground truth individual-specific signals to enable supervised training of our model. For macaques and dolphins, the mixture waveforms by design contain discrete source calls that overlap in the time domain. For bats, mixtures are constructed by adding signal streams, each of which may exhibit one or more temporally separated sequential vocal elements. In all cases, the mixtures operate under the assumption that, without loss of generality, the constituent sources vary in the degree of spectral overlap due to differential spectrotemporal properties of the sources, in accordance with the DUET principle (i.e., the mixtures contain approximately disjoint sources that rarely coincide in dominant frequency)19. The resultant dataset consists of an input array of the composite mixture waveforms, a target array containing the separated ground truth waveforms corresponding to the respective mixtures, and a class label array denoting the identities of the vocalizing animals responsible for generating the signals. In the case of macaques, we here consider closed speaker set mixtures of two and three simultaneous speakers, but our method places no functional limit on the number of sources (N) it can handle. For dolphins, we consider the closed speaker regime with two overlapping calls, and for bats, we consider the closed and open speaker scenarios with two sources.

We first extract the labeled waveforms, either truncating or zero-padding them so that all samples have a fixed duration. We select the number of frames either by computing the mean plus three standard deviations of the durations of the calls contained in the corpus from which we draw the signals, by selecting the maximum duration of all calls, or by choosing a fixed value. For macaques, dolphins, and bats, we use 23156 frames (0.95 s), 290680 frames (3.03 s), and 250000 frames (1.0 s), respectively. We then randomly select vocalizations from N different speakers drawn from the distribution of individuals used in the study (8 macaques, 8 dolphins, 12 bats for the closed speaker regime, 3 bats for the open speaker regime) and mix them additively, randomly shifting the relative overlaps to simulate a more plausible scenario and to introduce asynchronous start times, an important acoustic cue that has been suggested as a mechanism with which the animal brain can solve the CPP1. Despite higher computational and memory costs, we opt to use native sampling rates, since certain animal vocalizations may reach frequencies near the native Nyquist frequency. With this in mind, however, our method does provide for resampling when the vocalizations of the particular species of interest are amenable to downsampling. Explicitly, for the three species we consider (macaques, dolphins, and bats), we use sampling rates of 24414 Hz, 96 kHz, and 250 kHz, respectively. For the closed speaker regime, the training and evaluation subsets contain calls produced by the same distribution of individuals to ensure a closed speaker set. We segment the original nonoverlapping vocalizations into 80/20 training/validation subsets: we generate the mixture training waveforms using 80% of the calls, and we construct the mixture validation subset using the remaining 20% of calls held out from the training data. In the case of overlapping bat calls (for which the corpus of bioacoustic recordings contains \(\mathscr{O}(10\text{ hours})\) of data, as opposed to \(\mathscr{O}(10^{-1}\text{ hours})\) for macaques and dolphins), we also address the open speaker source separation problem by constructing a further testing data subset of mixtures of calls of additional vocalizers not contained in the training distribution. For macaques, we construct a training data subset comprised of 12k samples and a validation subset with 3k samples, all of which contain calls drawn from 8 animals. For dolphins, we randomly select 8 individuals and construct training/validation subsets with 8k and 2k samples, respectively. For bats, we select 15 individuals, randomly reserving 12 for the closed speaker problem and the remaining 3 for the open speaker situation. We train the bat separator model on 24k mixtures. We evaluate performance in both the closed and open speaker scenarios using data subsets consisting of 6k mixtures containing unseen vocalizations produced by the appropriate distribution of individuals according to the regime under consideration. We repeat the bat training using a larger mixture dataset (denoted by +) containing 72k samples. We here report validation metrics to ensure that we are evaluating model performance on unseen mixtures of unseen calls in the closed speaker regime and on unseen mixtures of unseen calls of unseen individuals in the open speaker regime.
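The mixing step can be summarized by the following minimal sketch, assuming fixed-length NumPy waveforms; the helper names (fix_length, make_mixture, max_shift) are hypothetical, and the exact onset-shift distribution used in BioCPPNet is not specified here.

```python
import numpy as np

def fix_length(wav, n_frames):
    """Truncate or zero-pad a 1D waveform to a fixed number of frames."""
    if len(wav) >= n_frames:
        return wav[:n_frames]
    return np.pad(wav, (0, n_frames - len(wav)))

def make_mixture(source_waveforms, n_frames, max_shift, rng=np.random):
    """Additively mix N single-speaker calls with random onset shifts.

    Returns the mixture waveform and the (shifted) ground-truth sources
    used as targets for permutation-invariant separation training.
    """
    shifted = []
    for wav in source_waveforms:
        wav = fix_length(wav, n_frames)
        shift = rng.randint(0, max_shift + 1)   # random onset asynchrony
        wav = np.roll(wav, shift)
        wav[:shift] = 0.0                       # avoid wrap-around artifacts
        shifted.append(wav)
    sources = np.stack(shifted)                 # (N, T) ground-truth targets
    mixture = sources.sum(axis=0)               # (T,) composite input
    return mixture, sources
```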

For the downstream classification task, we extract vocalizations annotated according to the individual identity, and we segment the calls into an 80/20 training/testing split to ensure that we are evaluating model performance on unseen calls. For both the training and evaluation data subsets, we employ an augmentation scheme in which we apply random temporal shifts to call onsets to better reflect plausible real-world scenarios.

Classification models

In an effort to provide a more physically interpretable evaluation metric to supplement the commonly-implemented SI-SDR used in human speech separation studies, we develop CNN-based classifier models to label the individual identity of the separated vocalizations as a downstream task. This requires training classification networks to predict the speaker class label of the original unmixed waveforms. For each species we consider, we design and train custom simple and lightweight CNN-based architectures largely motivated by previous work24, tailored to accommodate the unique vocal behavior of the given species.

The first layer in the model is an optional high pass filter constructed using a nontrainable 1D convolution (Conv1D) layer with frozen weights determined by a windowed sinc function49,50 to eliminate low-frequency background noise. We omit this computationally intensive layer for macaques and Egyptian fruit bats, but we implement a high pass filter for the dolphin dataset, selecting an arbitrary cutoff frequency of 4.7 kHz and a transition bandwidth of 0.08 to remove background noise without impinging on the region of support for dolphin whistles. The optional filter is followed by an encoder layer that performs on-the-fly feature extraction. We experimented with a fully learnable free Conv1D filterbank, a spectrogram, and a log-magnitude spectrogram and observed optimal performance using a non-decibel (dB)-scaled STFT layer computed with an nfft window width, a hop window shift, and a Hann window, where nfft and hop are species-dependent variables. For macaques, we select nfft=1024 and hop=64, corresponding to temporal scales on the order of 40 ms and frequency resolutions on the order of 20 Hz. We choose nfft=1024 and hop=256 for dolphins and nfft=2048 and hop=512 for bats, corresponding to temporal resolutions of ~ 10 ms and ~ 8 ms and frequency resolutions of ~ 90 Hz and ~ 120 Hz, respectively.
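For illustration, a comparable front end could be assembled in PyTorch/torchaudio as sketched below; the windowed-sinc kernel construction (Blackman-windowed low-pass design with spectral inversion), kernel length, and layer composition are our assumptions rather than the exact BioCPPNet implementation.

```python
import numpy as np
import torch
import torch.nn as nn
import torchaudio

def highpass_sinc_kernel(cutoff_hz, sample_rate, transition_bw=0.08):
    """Windowed-sinc high-pass FIR kernel (low-pass design + spectral inversion)."""
    n_taps = int(np.ceil(4.0 / transition_bw))
    n_taps += (n_taps + 1) % 2                      # force odd kernel length
    fc = cutoff_hz / sample_rate
    t = np.arange(n_taps) - (n_taps - 1) / 2
    lowpass = np.sinc(2 * fc * t) * np.blackman(n_taps)
    lowpass /= lowpass.sum()
    highpass = -lowpass
    highpass[(n_taps - 1) // 2] += 1.0              # spectral inversion
    return torch.tensor(highpass, dtype=torch.float32)

class FrontEnd(nn.Module):
    """Optional frozen high-pass Conv1D filter followed by a non-dB STFT encoder."""
    def __init__(self, n_fft, hop, sample_rate, highpass_hz=None):
        super().__init__()
        self.hpf = None
        if highpass_hz is not None:
            kernel = highpass_sinc_kernel(highpass_hz, sample_rate)
            self.hpf = nn.Conv1d(1, 1, kernel_size=len(kernel),
                                 padding=len(kernel) // 2, bias=False)
            self.hpf.weight.data.copy_(kernel.view(1, 1, -1))
            self.hpf.weight.requires_grad = False   # frozen, nontrainable weights
        self.stft = torchaudio.transforms.Spectrogram(
            n_fft=n_fft, hop_length=hop,
            window_fn=torch.hann_window, power=1)   # magnitude, non-dB

    def forward(self, wav):                         # wav: (batch, 1, T)
        if self.hpf is not None:
            wav = self.hpf(wav)
        return self.stft(wav)                       # (batch, 1, freq, frames)
```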

Following the built-in feature engineering, the architecture includes 4 convolutional blocks, each of which consists of two sequential 2D convolution (Conv2D) layers with leaky ReLU activation and a max pooling layer with pool size 4. Next is a dense fully connected layer with leaky ReLU activation, followed by another linear layer with log softmax activation that outputs the V log probabilities (i.e., confidences), where V is the number of individual vocalizers used in the study (8, 8, and 12 for macaques, dolphins, and bats, respectively). We also include dropout regularization with p=0.25 for the macaque classifier and p=0.5 for the dolphin and bat classifiers to address potential overfitting. With these architectures, the macaque, dolphin, and bat classifier models have 230k, 279k, and 247k trainable parameters, respectively.
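A minimal sketch of such a classifier is given below, assuming PyTorch; the channel widths, kernel sizes, and hidden-layer size are illustrative placeholders, while the block structure (two Conv2D layers with leaky ReLU, pool size 4, dropout, log softmax over V classes) follows the description above.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two Conv2D layers with leaky ReLU followed by max pooling (pool size 4)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(),
            nn.MaxPool2d(4))

    def forward(self, x):
        return self.block(x)

class Classifier(nn.Module):
    """Spectrogram classifier: 4 conv blocks, a hidden dense layer, log-softmax output."""
    def __init__(self, n_classes, channels=(8, 16, 32, 64), hidden=128, dropout=0.25):
        super().__init__()
        chans = (1,) + channels
        self.features = nn.Sequential(
            *[ConvBlock(chans[i], chans[i + 1]) for i in range(4)])
        self.head = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(hidden), nn.LeakyReLU(),
            nn.Dropout(dropout), nn.Linear(hidden, n_classes),
            nn.LogSoftmax(dim=-1))

    def forward(self, spec):                    # spec: (batch, 1, freq, frames)
        return self.head(self.features(spec))   # (batch, n_classes) log-probabilities
```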

For all species, we minimize the negative log-likelihood objective loss function using the Adam optimizer51 with learning rate lr = 3e−4. For macaques, dolphins, and bats, respectively, we train for 100, 50, and 100 epochs with batch sizes 32, 8, and 8. We serialize the model after each epoch and select the top-performing models. We opt not to carry out hyperparameter optimization since the classification task is of secondary importance and is used solely as a downstream task.
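The training procedure might look roughly as follows, assuming PyTorch; evaluate_accuracy is a hypothetical helper, and device handling, data formats, and logging are simplified.

```python
import torch
from torch.utils.data import DataLoader

def train_classifier(model, train_set, val_set, epochs=100, batch_size=32, lr=3e-4):
    """Minimize NLL with Adam; keep the checkpoint with the best validation accuracy."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.NLLLoss()              # expects log-probabilities
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        for spec, label in loader:
            spec, label = spec.to(device), label.to(device)
            optimizer.zero_grad()
            loss = criterion(model(spec), label)
            loss.backward()
            optimizer.step()
        acc = evaluate_accuracy(model, val_set, device)   # hypothetical helper
        if acc > best_acc:                                # serialize top-performing model
            best_acc = acc
            torch.save(model.state_dict(), "classifier_best.pt")
```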

Figure 1

(a) Schematic overview of the BioCPPNet pipeline. Source vocalization waveforms are overlapped in time and mixed additively. BioCPPNet operates on the mixture waveform, yielding predictions for the separated waveforms, which are compared to the source ground truths, up to a permutation. The estimated waveforms are classified by the identity classification model24 (ID) to compute the downstream classification accuracy metric. (b) Block diagram of the BioCPPNet architecture. The input mixture waveform is transformed to a learnable or handcrafted representation (Rep), which then passes through a 2-dimensional U-Net52 composed of a contracting encoder path and an expanding decoder path with skip connections. The encoder path consists of sequential downsampling convolutional blocks, each of which is constructed using two convolutional layers (Conv2D) with leaky ReLU activation and batch normalization (BatchNorm) followed by max pooling. The decoder path employs upsampling convolutional blocks, consisting of upsampling and skip-connection concatenation followed again by the Conv2D layers with leaky ReLU and BatchNorm. The U-Net predicts masks (Mask 0 and Mask 1), the number of which is determined by the number of sources (N), that are multiplicatively applied to the original mixture representation. The predicted time-frequency representations of the separated waveforms are inverted with learnable or handcrafted inverse transforms (iRep) to output raw waveforms. All schematic diagrams were created using Affinity Designer (version 1.8.1) https://affinity.serif.com/en-us/designer/.


Separation models

BioCPPNet (Fig. 1) is a lightweight and modular architecture with a modifiable representation encoder, a 2D U-Net core, and an inverse transform decoder, which acts directly on raw audio via on-the-fly learnable or handcrafted transforms. The structure of the network is designed to provide for extensive experimentation, optimization, and enhancement across a range of species with variable vocal behavior. We construct and train a separation model for each species and each number N of sources contained in the input mixture.

Figure 2

Schematic diagram demonstrating the application of BioCPPNet to dolphin signature whistles using handcrafted STFT-based encoders and decoders. The source waveforms produced by N speakers of unique identity (e.g. T. truncatus 0 and T. truncatus 1) are overlapped in time, summed, and transformed to time-frequency space using an STFT layer, resulting in the mixture time-frequency representation (Mixture TFR). The U-Net predicts masks (Mask 0 and Mask 1) that are applied to the mixture representation. The separated spectrogram estimations (TFR 0 and TFR 1) are inverted using an iSTFT layer to yield the model’s predictions for the separated raw waveforms, which are compared to the ground truth waveforms and classified according to predicted identity using the classification model.


Model architecture

As with the classifier model, the network’s encoder consists of a feature engineering block, the initial layer of which is an optional high pass filter. This is followed by the representation transform, for which there are several options: the Conv1D free encoder, the STFT filterbank, and the log-magnitude (dB) STFT filterbank. We choose the same kernel size (nfft) and stride (hop) parameters defined in the classifier model. The feature extraction encoder is followed by a 2D U-Net core. This architecture consists of B (4 for macaques, 3 for dolphins, and 4 for bats) downsampling convolutional blocks, a middle convolutional block, and B upsampling convolutional blocks. The downsampling blocks consist of two 2D convolutional layers with leaky ReLU activation and a filter count that increases with model depth, followed by max pooling with pool size 2, 6, and 3 for macaques, dolphins, and bats, respectively. The middle block contains two 2D convolutional layers with leaky ReLU activation. The upsampling blocks include bilinear upsampling with a scale factor corresponding to the pool size used during downsampling, followed by skip connections in which the corresponding levels of the contracting and expanding paths are concatenated before passing through two 2D convolutional layers with leaky ReLU activation. All convolutional layers in the downsampling, middle, and upsampling blocks include batch normalization after the activation function to stabilize and expedite training and to promote regularization. Though our default implementation is phase-unaware, we also offer the option of a parallel U-Net pathway operating directly on phase information, which has been shown to improve performance in other applications53,54,55. The final layer in the U-Net core is a 2D convolutional layer with N channels, which are split prior to entering the inverse transform decoder. For the inverse transform, we again provide numerous choices, including a free filterbank decoder based on a 1D convolutional transpose (ConvTranspose1D) layer, an iSTFT layer, an iSTFT layer accepting dB-scaled inputs, and a multi-head convolutional neural network (MCNN) for fast spectrogram inversion56. In detail, the U-Net returns N masks that are multiplied by the original encoded representation of the mixture waveform. The separated representations are then passed into the inverse transform layer to yield the raw waveforms corresponding to the model’s predictions for the separated vocalizations. We initialize all trainable weights using the Xavier uniform initialization. In the case of macaques, we experiment across all combinations of representation encoders and inverse transform decoders, and we find optimal performance using the handcrafted non-dB STFT/iSTFT layers operating in the time-frequency domain. Since the model with the fully learnable Conv1D-based encoder/decoder uniquely operates in the time domain, we report evaluation metrics for this model as well. For dolphins and bats, we here report metrics using exclusively the non-dB STFT/iSTFT technique.
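The sketch below illustrates the mask-predicting U-Net core described above, assuming PyTorch. The kernel size, base channel width, uniform square pooling, unactivated (linear) masks, and omission of the optional phase pathway are simplifying assumptions; in practice the pool sizes are species-dependent (2, 6, or 3) and the input is the encoder's STFT magnitude representation.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two Conv2D layers, each with leaky ReLU activation followed by batch normalization."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(), nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.net(x)

class MaskUNet(nn.Module):
    """U-Net core that predicts N masks applied multiplicatively to the mixture representation."""
    def __init__(self, n_sources, depth=4, base_ch=16, pool=2):
        super().__init__()
        chans = [base_ch * 2 ** i for i in range(depth + 1)]
        self.down = nn.ModuleList([DoubleConv(1 if i == 0 else chans[i - 1], chans[i])
                                   for i in range(depth)])
        self.pool = nn.MaxPool2d(pool)
        self.mid = DoubleConv(chans[depth - 1], chans[depth])
        self.upsample = nn.Upsample(scale_factor=pool, mode="bilinear", align_corners=False)
        self.up = nn.ModuleList([DoubleConv(chans[i + 1] + chans[i], chans[i])
                                 for i in reversed(range(depth))])
        self.mask_head = nn.Conv2d(base_ch, n_sources, kernel_size=1)

    def forward(self, mix_rep):
        # mix_rep: (batch, 1, freq, frames); assumes both spatial dims are
        # divisible by pool**depth (crop or pad the representation otherwise)
        skips, x = [], mix_rep
        for block in self.down:                            # contracting path
            x = block(x)
            skips.append(x)
            x = self.pool(x)
        x = self.mid(x)
        for block, skip in zip(self.up, reversed(skips)):  # expanding path
            x = self.upsample(x)
            x = block(torch.cat([x, skip], dim=1))         # skip connection
        masks = self.mask_head(x)                          # (batch, N, freq, frames)
        return masks * mix_rep                             # masked source representations
```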

BioCPPNet (Fig. 1) is designed as a lightweight fully convolutional model in order to efficiently process large amounts of bioacoustic data sampled at high sampling rates while minimizing computational cost and the likelihood of overfitting. For the macaque separators, the networks consist of 1.2M parameters (for the STFT, iSTFT combination), 2.5M parameters (for the STFT, iSTFT combination with parallel phase pathway), or 2.8M parameters (for the Conv1D free filterbanks). For the dolphin separator (Fig. 2), the model has 304k parameters, while the bat separator model has 1.2M parameters. This is to be contrasted with the comparatively heavyweight default implementations of models commonly used in human speech separation problems, such as Conv-TasNet3, which has 5.1M parameters; DPTNet4 with 2.7M parameters; or Wavesplit5 with 29M parameters. Despite the lower complexity of BioCPPNet, the model achieves comparable performance to, or even outperforms, reference human speech separator models while still being lightweight enough to train on a single NVIDIA P100 GPU.

Model training objective

The model training objective aims to optimize the reconstruction of separated waveforms from the aggregated composite input signal. We adopt a permutation-invariant training (PIT)57 scheme in which the model’s predicted outputs are compared with the ground truth sources by searching over the space of permutations of source orderings. This fundamental property of our training objective reflects that the order of estimations and their corresponding labels from a mixture waveform is not expressly germane to the task of acoustic source separation, i.e. separation is a set prediction problem independent of speaker identity ordering5.

Source separation involves training a separator model f to reconstruct the source single-channel waveforms given a mixture \(x=\sum_{i=1}^N s^i\) of N sources, where each source signal \(s^i\) for \(i \in [1, N]\) is a real-valued continuous vector with fixed length T, i.e., \(s^i \in \mathbb{R}^{1 \times T}\). The model outputs the predicted waveforms \(\{\hat{s}^i\}_{i=1}^N\), where \(\forall i,\ \hat{s}^i = f^i(x)\), and a loss function is evaluated by comparing the predictions to the ground truth sources \(\{s^i\}_{i=1}^N\) up to a permutation. Explicitly, we consider a permutation-invariant objective function5,

$$\begin{aligned} \mathscr{L}(\hat{s}, s) = \min_{\sigma \in S_N} \frac{1}{N} \sum_{i=1}^N \ell(\hat{s}^{\sigma(i)}, s^i) \qquad \text{where } \forall i,\ \hat{s}^i = f^i(x) \end{aligned}$$

Here, \(\ell(\cdot, \cdot)\) represents the loss function computed on an (output, target) pair, \(\sigma\) indicates a permutation, and \(S_N\) is the space of permutations. In certain scenarios, we include the L2 regularization term,

$$\begin{aligned} \mathscr{L} \mapsto \mathscr{L} + \lambda \sum_{j=1}^P \beta_j^2 \end{aligned}$$

where \(\beta_j\) are the model parameters, P denotes the model complexity (i.e., the number of parameters), and \(\lambda\) is a hyperparameter empirically selected to minimize overfitting (i.e., to enhance convergence of training and evaluation losses and metrics).
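A brute-force sketch of this permutation-invariant objective, assuming PyTorch tensors of shape (batch, N, T) and a per-example single-channel loss (the terms \(\ell\) are described below), is given here; for the small N considered in this work (2 or 3), explicitly searching all N! permutations is inexpensive.

```python
import itertools
import torch

def pit_loss(estimates, targets, loss_fn):
    """Permutation-invariant loss: per-mixture minimum over source orderings, batch-averaged.

    estimates, targets: tensors of shape (batch, N, T).
    loss_fn: maps a pair of (batch, T) tensors to a per-example loss of shape (batch,).
    """
    n_sources = targets.shape[1]
    per_perm = []
    for perm in itertools.permutations(range(n_sources)):
        pair_losses = [loss_fn(estimates[:, p], targets[:, i])
                       for i, p in enumerate(perm)]
        per_perm.append(torch.stack(pair_losses).mean(dim=0))   # (batch,)
    per_perm = torch.stack(per_perm, dim=-1)                    # (batch, N!)
    return per_perm.min(dim=-1).values.mean()

# Example usage with a per-example L1 waveform loss:
# loss = pit_loss(est, src, lambda e, t: (e - t).abs().mean(dim=-1))
```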

For the single-channel loss function \(\ell\), we consider a linear combination of several loss terms that compute the error in the estimated waveform reconstructions \(\{\hat{s}^i\}_{i=1}^N\) relative to the ground truth waveforms \(\{s^i\}_{i=1}^N\).

  • L1 Loss

    $$\begin{aligned} | \hat{s}^{\sigma(i)} - s^{i} | \end{aligned}$$

    This represents the absolute error on raw time domain waveforms.

  • STFT L1 Loss

    $$\begin{aligned} | \text{STFT}(\hat{s}^{\sigma(i)}) - \text{STFT}(s^{i}) | \end{aligned}$$

    This term functions to minimize absolute error on time-frequency space representations. Empirically, the inclusion of this contribution enhances the reconstruction of signal harmonicity.

  • Spectral Convergence Loss

    $$\begin{aligned} \Vert \text{STFT}(\hat{s}^{\sigma(i)}) - \text{STFT}(s^{i}) \Vert_F \, / \, \Vert \text{STFT}(s^{i}) \Vert_F \end{aligned}$$

    where \(\Vert \cdot \Vert_F\) denotes the Frobenius norm over time and frequency. This term emphasizes high-magnitude spectral components56.

We also experimented with additional terms including L1 loss on log-magnitude spectrograms to address spectral valleys and negative SI-SDR (nSI-SDR), but the inclusion of these contributions did not yield empirical improvements in results.
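For illustration, the retained loss terms might be combined as in the sketch below, assuming PyTorch; the weights and STFT parameters are placeholders rather than the values used in BioCPPNet. The returned per-example loss can be plugged directly into the pit_loss sketch above.

```python
import torch

def single_channel_loss(est, ref, n_fft=1024, hop=64, w_time=1.0, w_stft=1.0, w_sc=1.0):
    """Weighted sum of time-domain L1, STFT-magnitude L1, and spectral convergence losses.

    est, ref: (batch, T) waveforms. Returns a per-example loss of shape (batch,).
    """
    window = torch.hann_window(n_fft, device=est.device)
    def mag(x):
        return torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                          return_complex=True).abs()
    est_mag, ref_mag = mag(est), mag(ref)
    l1_time = (est - ref).abs().mean(dim=-1)                    # L1 on raw waveforms
    l1_stft = (est_mag - ref_mag).abs().mean(dim=(-2, -1))      # L1 on magnitude STFTs
    sc = (torch.linalg.norm(est_mag - ref_mag, dim=(-2, -1))    # spectral convergence
          / torch.linalg.norm(ref_mag, dim=(-2, -1)).clamp_min(1e-8))
    return w_time * l1_time + w_stft * l1_stft + w_sc * sc
```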

For macaques, we modify the training algorithm according to the representation transform and inverse transform built into the model. For the model with the fully learnable Conv1D encoder and decoder, we train using the AdamW58 optimizer with a learning rate of 3e-4 and a batch size of 16 for 100 epochs. To stabilize training and avoid local minima when using handcrafted STFT and iSTFT filterbanks, we begin training the models for 3 epochs with batch size 16 using stochastic gradient descent (SGD) with Nesterov momentum 0.6 and learning rate 1e-3 before switching to the AdamW optimizer until reaching 100 epochs.

For dolphins, we provide the model with the original mixture as input but use high-pass-filtered source waveforms as the target, which means the separation model must additionally learn to denoise the input. We again initialize training with 3 epochs and batch size 8 using SGD with Nesterov momentum 0.6 and learning rate 1e-3 before switching to the AdamW optimizer with learning rate 3e-4 for the remaining 97 epochs. We use a similar training scheme for bats, initially training with SGD for 3 epochs before switching to AdamW via the optimizer switcher callback to complete 100 epochs.
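The optimizer warm-up schedule can be sketched as follows, assuming PyTorch; checkpointing, device handling, and validation are omitted for brevity, and loss_fn stands for the permutation-invariant loss described above.

```python
import torch

def train_separator(model, loader, loss_fn, epochs=100, warmup_epochs=3):
    """SGD (Nesterov momentum) warm-up for the first few epochs, then AdamW."""
    sgd = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.6, nesterov=True)
    adamw = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for epoch in range(epochs):
        optimizer = sgd if epoch < warmup_epochs else adamw
        for mixture, sources in loader:
            optimizer.zero_grad()
            estimates = model(mixture)            # (batch, N, T) separated predictions
            loss = loss_fn(estimates, sources)    # permutation-invariant objective
            loss.backward()
            optimizer.step()
```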

Model evaluation metrics

We assess reconstruction performance by computing evaluation metrics using an expression given by5,

$$\begin{aligned} \mathscr{M}(\hat{s}, s) = \max_{\sigma \in S_N} \frac{1}{N} \sum_{i=1}^N m(\hat{s}^{\sigma(i)}, s^i) \qquad \text{where } \forall i,\ \hat{s}^i = f^i(x) \end{aligned}$$

where \(m(\cdot, \cdot)\) is the single-channel evaluation metric computed on permutations of (output, target) pairs.

Specifically, we implement two evaluation metrics to assess reconstruction quality: (1) SI-SDR and (2) downstream classification accuracy. We consider the signal-to-distortion ratio (SDR)2, defined as the negative log squared error normalized by the reference signal energy5. However, as is common in the human speech separation literature, we instead compute the scale-invariant SDR (SI-SDR), which disregards prediction scale by searching over gains5,40. Explicitly, \(\text{SI-SDR}(\hat{s}, s) = -10\log_{10}(\Vert \hat{s} - \alpha s \Vert^2) + 10\log_{10}(\Vert \alpha s \Vert^2)\) for the optimal scaling factor \(\alpha = \hat{s}^T s / \Vert s \Vert^2\).
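A minimal SI-SDR implementation consistent with this definition, assuming PyTorch tensors of shape (batch, T), is sketched below; the permutation search of the metric wrapper above is omitted.

```python
import torch

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for (batch, T) waveforms."""
    alpha = (estimate * reference).sum(dim=-1, keepdim=True) / (
        reference.pow(2).sum(dim=-1, keepdim=True) + eps)      # optimal scaling factor
    target = alpha * reference
    error = estimate - target
    return 10 * torch.log10(target.pow(2).sum(dim=-1) /
                            (error.pow(2).sum(dim=-1) + eps))
```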

Additionally, to provide a physically interpretable metric, we evaluate the performance of the trained classifier models in labeling separated waveforms according to the predicted identity of the vocalizer. This metric assumes that the classification accuracy on a downstream task reflects the fidelity of the estimated signal relative to the ground truth source and thus serves as a proxy for reconstruction quality.

