The Bayesian VMD Method we developed can classify pulsed signals with similar frequency content in low-SNR underwater acoustic recordings. The Method consists of two parts. The first part scans the incoming audio data for segments that potentially contain signals of interest by detecting energy peaks. It then uses the start and end of the energy peaks to isolate those areas of interest from non-signal areas of the audio file. The second part classifies the detected signals into separate categories based on their frequency content. The algorithms of our Detector and Classifier steps are self-developed, but some key components in them were inspired by previous work39,40,41.
Detector
The proposed detector takes full audio files that are 4.5 s long, sampled at 100 kHz, and finds the segments within each file where potential signals of interest exist.
For a given audio file, denoted by \(\hat{x}(n)\), where \(n = 1, \dots, N\), and \(N\) is the total number of samples, the Laplacian Differential Operator (LDO) is applied to \(\hat{x}(n)\), resulting in an enhanced version of the audio file denoted by \(y(n)\), as follows:
$$\begin{aligned} y(n) = \frac{1}{4}\frac{\partial ^2 \hat{x}}{\partial n^2} \end{aligned}$$
(1)
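In discrete time, Eq. (1) reduces to a scaled second-order central difference. The sketch below is a minimal illustration of that operator (the function name `laplacian_enhance` is ours, not the authors'); the endpoints are left at zero since the central difference is undefined there.

```python
import numpy as np

def laplacian_enhance(x):
    """Discrete form of Eq. (1): second-order central difference scaled
    by 1/4. Acts as a high-pass filter that sharpens transient edges."""
    y = np.zeros_like(x, dtype=float)
    y[1:-1] = (x[2:] - 2.0 * x[1:-1] + x[:-2]) / 4.0
    return y
```

Because differentiation amplifies high frequencies, this single step both emphasizes transients and suppresses the low-frequency band that is irrelevant here.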
The LDO enhances transient signals (edge detection) and filters out the low frequencies (\(< 10\) kHz), which are not needed for Gg and Lo pulsed signal classification. The enhanced signal \(y(n)\) is then transformed into a time-frequency representation using the short-time Fourier transform (STFT). The STFT was implemented on 1024 samples with 90% overlap and a 1024-point Hanning window. The magnitude of the STFT matrix \(s(n,f)\) is given as \(\hat{S}(n,f)\).
$$\begin{aligned} \hat{S}(n,f) = \begin{bmatrix} |s_{11}| & \dots & |s_{1N}| \\ \vdots & \ddots & \\ |s_{M1}| & & |s_{MN}| \end{bmatrix} \end{aligned}$$
(2)
where \(N\) is the length of the input segment and \(M\) is the number of frequency bins. The dimensionality of the matrix \(\hat{S}(n,f)\) is reduced from 2-D to 1-D as follows:
$$\begin{aligned} S_{d}(n) = \sum _{f=1}^{M} \hat{S}(n,f) \end{aligned}$$
(3)
The resulting temporal sequence is an accumulated sum over all frequency bins of \(\hat{S}(n,f)\), so scaling is applied, as follows:
$$\begin{aligned} S_{d}(n) = \frac{S_{d}(n)}{\max \{S_{d}(n)\}} \end{aligned}$$
(4)
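Eqs. (2)-(4) can be sketched in a few lines with SciPy's STFT. This is an illustrative reconstruction (the function name `detection_sequence` is ours); the window length and 90% overlap follow the paper, while boundary padding is SciPy's default.

```python
import numpy as np
from scipy.signal import stft

def detection_sequence(y, fs=100_000):
    """Magnitude STFT (1024-point Hann window, 90% overlap) collapsed to
    the scaled 1-D temporal sequence S_d(n) of Eqs. (2)-(4)."""
    _, _, Z = stft(y, fs=fs, window='hann', nperseg=1024,
                   noverlap=int(1024 * 0.9))
    S = np.abs(Z)           # |STFT|, shape (frequency bins, time frames)
    Sd = S.sum(axis=0)      # accumulate over frequency bins (Eq. 3)
    return Sd / Sd.max()    # scale to [0, 1] (Eq. 4)
```

Collapsing the spectrogram this way turns broadband energy bursts into peaks of a single time series, which is what the adaptive threshold below operates on.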
After finding \(S_{d}(n)\) from Eq. (4), the mean of \(S_{d}(n)\) is subtracted. Then, to determine the boundaries of the acoustic signal, an adaptive threshold is applied. The first step in developing the threshold is to vectorize the matrix \(\hat{S}(n,f)\) in column order into a vector called \(S_{r}(n)\):
$$\begin{aligned} S_{r}(n) = \overrightarrow{\hat{S}(n,f)} \end{aligned}$$
(5)
Then, \(S_r(n)\) is scaled similarly to \(S_{d}(n)\) and sorted into ascending order, denoted by \(\hat{S}_r(n)\). The change point where the root-mean-square level of the sorted curve \(\hat{S}_r(n)\) changes the most is obtained by minimizing Eq. (6)39,40,42
$$\begin{aligned} J(k) = \sum _{i=1}^{k-1} \Delta (\hat{S}_{r,i}; \chi ([\hat{S}_{r,1} \dots \hat{S}_{r,k-1}])) + \sum _{i=k}^{N} \Delta (\hat{S}_{r,i}; \chi ([\hat{S}_{r,k} \dots \hat{S}_{r,N}])) \end{aligned}$$
(6)
where \(k\) and \(N\) are the index of the change point and the length of the sorted curve \(\hat{S}_r(n)\), respectively, and
$$\begin{aligned} \sum _{i=u}^{v} \Delta (\hat{S}_{r,i}; \chi ([\hat{S}_{r,u} \dots \hat{S}_{r,v}])) = (v-u+1)\log \left( \frac{1}{v-u+1}\sum _{n=u}^{v}\hat{S}_{r,n}^{2}\right) \end{aligned}$$
(7)
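The cost in Eqs. (6)-(7) can be evaluated for every candidate split with cumulative sums of squares, then minimized by a direct scan. The sketch below is our illustrative reconstruction (the function name `adaptive_threshold` is hypothetical), assuming the sorted curve contains no zero-valued segments so that the logarithms are defined.

```python
import numpy as np

def adaptive_threshold(s_sorted):
    """Scan all split points k of the ascending-sorted curve, minimising the
    total residual cost J(k) of Eqs. (6)-(7); each segment contributes
    (#samples) * log(mean square). Returns k and the threshold lambda."""
    s2 = np.cumsum(np.asarray(s_sorted, dtype=float) ** 2)
    N = len(s_sorted)
    best_k, best_J = 1, np.inf
    for k in range(1, N):
        left, right = s2[k - 1], s2[-1] - s2[k - 1]
        J = k * np.log(left / k) + (N - k) * np.log(right / (N - k))
        if J < best_J:
            best_J, best_k = J, k
    return best_k, s_sorted[best_k]
```

On a curve that sits near the noise floor and then rises sharply, the minimizer lands at the knee, so the returned value estimates the noise floor as the paper describes.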
The threshold, \(\lambda\), is the value of \(\hat{S}_r(k)\), which equals the noise floor estimate, and the detection test can be represented as follows:
$$\begin{aligned} {\mathcal {H}}_{0}&: S_d(n) < \lambda \\ {\mathcal {H}}_{1}&: S_d(n) \ge \lambda \end{aligned}$$
(8)
where \({\mathcal {H}}_0\) and \({\mathcal {H}}_1\) are the hypotheses that the activity was below or above the threshold, respectively. The calculated threshold can vary for each file, thus making it adaptable if ambient noise conditions change between files. The threshold \(\lambda\) is then projected onto the temporal sequence \(S_{d}(n)\) to extract the boundaries of the regions of the acoustic signal that comprise the detected energy peak. The start and end points of each acoustic signal are determined as the first and last points whose amplitudes exceed \(\lambda\).
The boundaries of the detected segments are scaled by the sampling rate to obtain start and end times, which are used to extract the audio file segments from the original data file in the classification step. Figure 4 illustrates the layout of the proposed detector.
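The projection of \(\lambda\) onto the temporal sequence and the conversion of frame indices to times can be sketched as follows. This is an illustration under our own naming (`signal_boundaries`, `hop`); with a 1024-sample window at 90% overlap, the hop is about 102 samples.

```python
import numpy as np

def signal_boundaries(Sd, lam, fs, hop):
    """Return (start, end) times in seconds for each contiguous run of
    S_d(n) at or above the threshold lambda. `hop` is the STFT hop size
    in samples, used to map frame indices back to sample times."""
    above = Sd >= lam
    edges = np.diff(above.astype(int))
    starts = np.where(edges == 1)[0] + 1   # rising edges
    ends = np.where(edges == -1)[0]        # falling edges
    if above[0]:
        starts = np.r_[0, starts]          # run begins at the first frame
    if above[-1]:
        ends = np.r_[ends, len(Sd) - 1]    # run extends to the last frame
    return [(s * hop / fs, e * hop / fs) for s, e in zip(starts, ends)]
```

Each returned pair delimits one detected energy peak, ready to be cut from the original recording for classification.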
Classifier
Once segments with energy peaks were identified, they were scanned by the team’s bioacoustics expert, and any segments confirmed to contain only Gg or Lo signals were set aside for use in testing the accuracy of the Bayesian VMD Method classifier.
In this paper, a metric called the weight was defined for classification purposes. The weight for a parameter \(\varvec{\theta _i}\) given its measurement \(\varvec{y_i}\) is defined as
$$\begin{aligned} w(\varvec{\theta _i} \mid \varvec{y_i}) = P_{\varvec{\Theta \mid Y}}(\varvec{\theta _i} \mid \varvec{y_i}) * \varvec{p_i} \end{aligned}$$
(9)
where \(\varvec{\theta _i}\) is the probability density function (PDF) of \(\varvec{y_i}\), \(\varvec{y_i}\) is one measurement in the measurement vector \(\varvec{y}\), \(P_{\varvec{\Theta \mid Y}}(\varvec{\theta _i} \mid \varvec{y_i})\) is the posterior probability of the parameter \(\varvec{\theta _i}\) given the measurement \(\varvec{y_i}\), and \(\varvec{p_i}\) is the scaled prominence value of \(\varvec{y_i}\).
When a detected audio file segment is fed into the Bayesian VMD classifier, the classification process starts with a feature extraction step. During this step, peak and notch frequencies and their prominence values are obtained from the VMD-Hilbert spectrum of the segment. The prominence measures how much a peak stands out by its intrinsic height, or a notch by its depth, relative to surrounding peaks or notches. In general, peaks that are taller and more isolated have a higher prominence \(p\) than peaks that are shorter or surrounded by other peaks.
In the feature extraction step, VMD decomposed the input audio segment into a set of IMFs. The HHT was then applied to all IMFs to create a Hilbert spectrum with a frequency resolution of 50 Hz. The Hilbert spectrum is a matrix, \(H(n,f)\), that contains the instantaneous energies, \(h(n,f)\).
$$\begin{aligned} H(n,f) = \begin{bmatrix} h_{11} & \dots & h_{1R} \\ \vdots & \ddots & \\ h_{Q1} & & h_{QR} \end{bmatrix} \end{aligned}$$
(10)
where \(R\) is the length of the input segment and \(Q\) is the number of frequency bins in \(H\).
The matrix \(H(n,f)\) is then converted from a 2-D array to a 1-D spectral representation by summing all instantaneous energy values in each frequency bin, as follows:
$$\begin{aligned} H(f) = \sum _{n=1}^{R} H(n,f) \end{aligned}$$
(11)
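Given a set of IMFs (assumed already produced by a VMD implementation of choice), the accumulation of Eqs. (10)-(11) can be sketched as below. This is our illustrative reconstruction, not the authors' exact code; the function name `hilbert_spectrum_marginal` and the energy definition (squared instantaneous amplitude) are our assumptions.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_spectrum_marginal(imfs, fs, df=50.0):
    """Accumulate instantaneous energy into 50 Hz frequency bins (Eqs.
    10-11). `imfs` is a (K, N) array of intrinsic mode functions, assumed
    precomputed by VMD; the Hilbert transform supplies each IMF's
    instantaneous amplitude and frequency."""
    n_bins = int(fs / 2 / df)
    H = np.zeros(n_bins)
    for imf in imfs:
        z = hilbert(imf)                              # analytic signal
        amp = np.abs(z)                               # instantaneous amplitude
        inst_f = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)
        inst_f = np.clip(inst_f, 0.0, fs / 2 - 1e-9)  # keep in [0, Nyquist)
        bins = (inst_f // df).astype(int)
        np.add.at(H, bins, amp[:-1] ** 2)             # sum energy per bin
    return H
```

The result plays the role of \(H(f)\): a 1-D spectral profile whose peaks and notches carry the species-specific structure.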
The energy summation sequence was converted to a base-10 logarithmic scale and then smoothed with a 17-point median filter followed by an 11-point moving average filter to ease feature extraction. All peaks and notches in the sequence whose prominence values exceeded the threshold of 0.5 were located, and their frequency and prominence values were then stored as features extracted from the input signal (see Fig. 5).
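This smoothing-and-picking step can be sketched with SciPy's prominence-based peak finder; notches are found by negating the sequence. The function name `extract_features` is ours, and the small offset added before the logarithm is an assumption to avoid log of zero.

```python
import numpy as np
from scipy.signal import find_peaks, medfilt

def extract_features(Hf, df=50.0, prom=0.5):
    """Log-scale and smooth the marginal spectrum (17-point median filter,
    11-point moving average), then extract peak and notch frequencies with
    prominences above `prom`."""
    logH = np.log10(Hf + 1e-12)                          # base-10 log scale
    sm = medfilt(logH, kernel_size=17)                   # 17-pt median filter
    sm = np.convolve(sm, np.ones(11) / 11, mode='same')  # 11-pt moving avg
    pk, pk_props = find_peaks(sm, prominence=prom)
    nt, nt_props = find_peaks(-sm, prominence=prom)      # notches = peaks of -sm
    return (pk * df, pk_props['prominences'],
            nt * df, nt_props['prominences'])
```

The prominence threshold filters out ripples left after smoothing, keeping only spectral features that genuinely stand out from their surroundings.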
To test the effectiveness of the VMD feature extractor, a second set of features was extracted from the FFT-based power spectrum of the same input signals using Welch’s algorithm. The FFT-based spectrum was calculated on 2048 samples with 50% overlap and a 2048-point Hanning window, giving a 48.82 Hz frequency resolution. The power spectral density sequence was then converted to dB and passed through a 21-point median filter and a 15-point moving average filter. Feature extraction followed the same strategy as the VMD feature extractor, except with a prominence threshold of 2 dB.
Next, the measured features, the frequencies (Hz) of the peaks and notches (henceforth referred to as “measured peaks and notches”), were matched with the probability density functions (PDFs) of peaks and notches (henceforth referred to as “parameter peaks and notches”) from Soldevilla et al. (2008). The matching between measured and parameter peaks and notches was done in preparation for the weight calculations, and it was implemented for both Gg and Lo. There are four Gaussian PDFs for parameter peaks and three for parameter notches for each species in Soldevilla et al. (2008) (Table 2). A 95% confidence interval of a Gaussian PDF was used here as a frequency range, defined as 1.96 standard deviations to the left and right of its mean value. When measured peaks and notches were matched to parameter peaks and notches, only those that fell within a 95% confidence interval were kept. Any peaks or notches outside the 95% confidence intervals were discarded.
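The confidence-interval matching can be sketched as follows. This is an illustration under our own naming (`match_to_pdfs`); the means and standard deviations would be the species' parameter values from Soldevilla et al. (2008), Table 2, which are not reproduced here.

```python
def match_to_pdfs(measured, pdf_means, pdf_stds):
    """Keep only measured peak/notch frequencies (Hz) that fall inside the
    95% confidence interval (mean +/- 1.96 sd) of some parameter PDF.
    Returns a dict mapping PDF index -> list of matched frequencies."""
    matches = {}
    for f in measured:
        for i, (mu, sd) in enumerate(zip(pdf_means, pdf_stds)):
            if mu - 1.96 * sd <= f <= mu + 1.96 * sd:
                matches.setdefault(i, []).append(f)
    return matches
```

Measured features matching no interval simply never enter the dictionary, which implements the discard rule above.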
Because there are overlaps between the 95% confidence intervals of the 22.4 kHz and 25.5 kHz parameter peaks of Gg and between the 33.7 kHz and 37.3 kHz parameter peaks of Lo (see Table 2), it is likely that some measured peaks will fall in the overlapping areas. In this paper, maximum a posteriori (MAP) estimation41 was used to determine which PDF generated a measured peak in an overlapping area. For a measured peak that falls into an overlapping area, the two parameter peaks’ PDFs are plugged into the MAP estimation equation sequentially, and the measured peak is then matched with the PDF that maximizes its posterior probability given the measured peak.
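For a peak in an overlapping region, the MAP decision reduces to comparing likelihood times prior across the candidate PDFs. A minimal sketch, assuming Gaussian PDFs and priors as in Soldevilla et al. (2008) (the function name `map_assign` and the tuple layout are ours):

```python
import numpy as np

def map_assign(f, pdfs):
    """Resolve a measured peak frequency f lying in overlapping 95% CIs by
    MAP: pick the PDF maximising Gaussian likelihood * prior. `pdfs` is a
    list of (mean, sd, prior) tuples."""
    def posterior(mu, sd, prior):
        like = np.exp(-0.5 * ((f - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        return like * prior
    return max(range(len(pdfs)), key=lambda i: posterior(*pdfs[i]))
```

With equal priors this falls back to nearest-mean (in standard-deviation units) assignment, which matches the intuition of the rule.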
After the preliminary match, if more than one measured peak or notch remains in any one PDF confidence interval, the one with the highest prominence value is selected as the real measured peak or notch of this PDF, and the redundant ones are discarded. Finally, the remaining peak prominence values and notch prominence values were each scaled to lie between 0 and 1.
Once peak and notch matching and selection was finished, Bayesian weights were calculated to select the most likely species. From Bayes’s rule, the posterior probability of a parameter given its measurement is proportional to the product of the likelihood function of the measurement given the parameter and the prior probability of the parameter41, as shown in Eq. (12).
$$\begin{aligned} P_{\varvec{\Theta \mid Y}}(\varvec{\theta _i} \mid \varvec{y_i}) \propto f_{\varvec{Y \mid \Theta }}(\varvec{y_i} \mid \varvec{\theta _i}) P_{\varvec{\Theta }}(\varvec{\theta _i}) \end{aligned}$$
(12)
Therefore, substituting the posterior probability into Eq. (9) yields
$$\begin{aligned} w(\varvec{\theta _i} \mid \varvec{y_i}) = f_{\varvec{Y \mid \Theta }}(\varvec{y_i} \mid \varvec{\theta _i}) * P_{\varvec{\Theta }}(\varvec{\theta _i}) * \varvec{p_i} \end{aligned}$$
(13)
With all PDFs and a priori probabilities from Soldevilla et al. (2008), the weight values for Gg and Lo given a set of measurements, \(\varvec{y}\), were obtained by Eqs. (13) and (14)
$$\begin{aligned} w(Gg \mid \varvec{y}) = \sum _{\forall i} w(\varvec{\theta _i} \mid \varvec{y_i}) \qquad w(Lo \mid \varvec{y}) = \sum _{\forall j} w(\varvec{\theta _j} \mid \varvec{y_j}) \end{aligned}$$
(14)
where \(\varvec{y_i}\) and \(\varvec{y_j}\) are the measured peaks and notches remaining matched with Gg’s PDFs and Lo’s PDFs after the matching and selection step. The feature matching and selection results and the weight calculation process are shown in Fig. 6.
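Eqs. (13)-(14) can be sketched as a single accumulation per species. A minimal illustration, assuming Gaussian PDFs as in Soldevilla et al. (2008); the function name `species_weight` and the tuple layout are our assumptions.

```python
import numpy as np

def species_weight(matched):
    """Sum Bayesian weights (Eqs. 13-14) over a species' matched features.
    `matched` is a list of (y, mu, sd, prior, prominence) tuples: the
    measured frequency, its matched PDF's mean and sd, the a priori
    probability, and the scaled prominence."""
    w = 0.0
    for y, mu, sd, prior, p in matched:
        like = np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        w += like * prior * p   # likelihood * prior * prominence (Eq. 13)
    return w
```

Running this once with Gg's matched features and once with Lo's yields the two weights whose comparison drives the final label.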
The last step was a comparison between the weight values for Gg and Lo. If \(w(Lo \mid \varvec{y}) > w(Gg \mid \varvec{y})\), the signal was labeled an Lo signal; otherwise, it was labeled a Gg signal. The classifier is illustrated in Fig. 7. The weight values are reported to three significant digits because weights are normally smaller than 1.000, and three significant digits were sufficient for comparing all calculated weight values for these audio files. In the case that the weights are equal to three significant digits (although this never happened in these 174 signals), the Bayesian VMD algorithm automatically classifies the input as a Gg signal, given that the Bayesian VMD Method achieved its highest precision (85.91%) on Gg.