How to analyse overlapping sounds in the marine environment using supervised multi-label classification


Abstract

Marine soundscapes commonly contain concurrent acoustic events, and identifying all components is required to understand interactions between sound classes, for example between sonar and marine mammal vocalisations. Machine learning techniques previously applied to such data primarily aimed to associate a single event type with a specific time interval. Here, we demonstrate how multi-label classification (MLC) systems can automatically identify overlapping (polyphonic) sound combinations. We develop a methodology that uses an input representation previously shown effective for marine sound classification, together with a curated dataset with balanced class distributions and various combinations of sound events in the training data. Using this framework, we test four supervised MLC systems, which yielded comparable performance, with Binary Relevance and Neural Discriminative Dimensionality Reduction multi-task learning systems showing marginal advantages. System performances were affected by the number of overlapping sound classes, the representation of combinations in the training dataset and the complexity of interactions between sound classes. Across systems, vessel noise and delphinid clicks were most accurately classified, while high energy sonar signals, particularly from high frequency echosounders, were frequently misclassified as delphinid tonals. These findings highlight the importance of system design and dataset structure when developing multi-label classification models for marine soundscapes.

Introduction

Passive acoustic monitoring (PAM) is a non-invasive tool that can be used for long-term monitoring of the underwater environment1,2,3. In addition to naturally occurring sounds in the ocean, human activity contributes to the complex underwater soundscape, e.g. through vessel noise, active sonar and pile driving4. PAM is frequently employed to study cetaceans, whose vocalisations provide the opportunity to identify the presence of species, or species groups, in an area, to analyse specific behaviours5,6,7, to estimate abundance8,9,10 and assess the impact of human activities11,12,13.

The majority of cetacean species are highly vocal14,15, utilising sound for social purposes and all odontocetes (toothed whales) also use it to sense their environment through echolocation16,17. Cetacean vocalisations are usually broadly categorised as: burst pulses and whistles, which are used for social interactions, and echolocation clicks used for sensing their environment17,18.

The increasingly common use of PAM has led to a growing need for automated detection systems capable of distinguishing sounds and advances in machine learning have greatly improved this capability19,20. One common strategy is to divide a time series into a set of segments of fixed duration and compute a single label for each segment. Some methods are simple detectors, implementing binary classification21,22,23 whereas others seek to perform single-label multi-class classification24,25,26. In an acoustic environment where multiple audio events co-occur within a single segment, multi-label classification (MLC) systems, in which multiple labels can be assigned to one segment, are a natural tool.

In terrestrial acoustics, MLC systems have been frequently used for analysing overlapping sounds from different sources27,28,29. However, there is a lack of corresponding research in marine bioacoustics, where the interactions of anthropogenic sounds and the vocalisations of aquatic fauna are widely studied4,30,31. This is partly due to challenges such as the limited availability of labelled datasets and spectral overlap between anthropogenic and biophonic sound sources, as well as the greater complexity of underwater background geophonic noise. When designing architectures for MLC, there is a spectrum of approaches that aim to balance the use of shared network elements with overall network complexity. At one end lies binary relevance (BR), where independent binary classifiers are constructed for each class32,33,34,35. At the other end are conventional multi-label classifiers (CMLC), which employ a single shared network27,28,36,37.

CMLC systems are optimised to learn general features across all classes27. However, this generalisation may compromise performance compared to the tailored features learned in class-specific BR networks37. BR systems, while capable of creating bespoke features for individual tasks, are trained independently and so cannot exploit inter-class relationships such as feature co-occurrences32. Consequently, in the field of MLC, intermediate architectures that lie between the extremes of BR and CMLC are often implemented using multi-task learning (MTL) strategies, which aim to balance feature sharing with task-specific specialisation33,38,39,40. Two primary forms of MTL are commonly used41. The first is hard parameter sharing (HPS), where some, or all, of the feature extraction process is performed by a single network shared across all tasks42. The second is soft parameter sharing (SPS), which maintains distinct layers for the feature extraction process but facilitates information exchange between them41. There is a wide variety of methods for realising such information exchange43,44. In this study, we use neural discriminative dimensionality reduction (NDDR) layers45 as one such mechanism, integrating them within an SPS MTL framework, which we refer to as the NDDR MTL system.

Class-specific binary detectors (BR) often achieve strong performance32,34,35,37, although their computational requirements do not scale well as the number of classes increases, since a new network is required for each new class. Moreover, because the classifiers operate independently, BR cannot exploit correlations or naturally co-occurring structure between classes, which one might expect to degrade performance. Conversely, BR models are more flexible, since classes can be added or removed from a problem without necessarily retraining.

Accordingly, we compare the performance of four systems, BR, CMLC, HPS MTL, and NDDR MTL, which span the range of MLC approaches, for underwater sound classification using data from the west of Scotland.

Results

A comparison of the performance of MLC systems

The performance of any classifier should be assessed with reference to some baseline non-informative system46. For MLC with imbalanced data, one approach is to consider a system that predicts zero prevalence for all classes across audio segments46. The resulting baseline accuracy using our validation dataset is 58.2% and the baseline exact match is 31.4%.
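
As a minimal sketch of how such a non-informative baseline could be scored, the Python below evaluates an all-zero predictor on a toy label matrix. It assumes that accuracy is the fraction of individual label decisions that are correct and that exact match is the fraction of segments whose whole label vector is correct; the function names and the toy data are illustrative and are not taken from the study.

```python
def label_accuracy(y_true, y_pred):
    """Fraction of individual (segment, class) decisions that are correct."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        t == p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)
    )
    return correct / total


def exact_match(y_true, y_pred):
    """Fraction of segments whose full label vector is predicted correctly."""
    return sum(rt == rp for rt, rp in zip(y_true, y_pred)) / len(y_true)


def zero_baseline(y_true):
    """Predict 'no event' for every class in every segment."""
    return [[0] * len(row) for row in y_true]


# Hypothetical ground truth, columns ordered [C, T, B, V, S].
y_true = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
y_zero = zero_baseline(y_true)
baseline_accuracy = label_accuracy(y_true, y_zero)
baseline_exact_match = exact_match(y_true, y_zero)
```

Under these assumptions, the baseline exact match equals the share of segments containing no events, and the baseline accuracy equals one minus the mean class prevalence.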

Table 1 shows the accuracy and exact match scores for the four systems considered. The performance metrics are computed for each of the folds, with the mean and standard deviation across the four folds presented in the table along with the baseline scores. It is evident that all systems substantially outperform the baseline. Overall, the four systems offer similar levels of performance, with small differences in mean values (within 0.03). For the single network approaches (i.e. excluding BR), introducing partial layer sharing (as in HPS MTL and NDDR MTL) provides small improvements in average performance, consistent with previous MTL studies40,41,47,48. The BR system achieves the highest mean performance across both metrics; however, the differences relative to the other systems are small, which shows that all the systems have comparable performance under the present experimental setup, despite the larger overall model size of BR.

Table 1 The performance of four different multi-label classification systems on the validation dataset, reported as mean ± standard deviation across the 4 folds (see Tables S1 and S2)

System performances under varying event concurrency

In multi-label sound classification tasks, different combinations of concurrent events can cause varying levels of difficulty40,49. There are at least two factors affecting this performance: one is the number of different event types occurring concurrently within a sample (we refer to this as the concurrency level) and the other is the number of times a particular combination occurs in the dataset. One might hypothesise that performance on samples containing many different event types is poorer than on samples with fewer event types and, further, that performance on rarely occurring combinations is poorer than on combinations that occur frequently. The following results aim to explore these hypotheses.

To study the effect of event concurrency, we computed the macro-averaged exact match values for each concurrency level; the results are shown in Fig. 1a for each of the networks. The data were grouped based on their concurrency level and the exact match scores were averaged across these groups. Exact match values of the systems for each unique combination are given in Fig. S1.
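
The grouping described above can be sketched as follows. This is an illustrative reading rather than the study's implementation: it assumes that the macro average is taken over the unique combinations that share a concurrency level, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def exact_match_by_concurrency(y_true, y_pred):
    """Macro-averaged exact match per concurrency level.

    Exact match is first computed per unique ground-truth combination and
    then averaged over the combinations that share a concurrency level
    (the number of events present in the label vector).
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        combo = tuple(truth)
        counts[combo] += 1
        hits[combo] += int(tuple(pred) == combo)

    per_level = defaultdict(list)
    for combo, n in counts.items():
        per_level[sum(combo)].append(hits[combo] / n)

    return {level: sum(v) / len(v) for level, v in sorted(per_level.items())}


# Hypothetical labels, columns ordered [C, T, B, V, S].
y_true = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 0, 1, 0),
          (1, 0, 0, 0, 0), (1, 1, 0, 1, 0)]
y_pred = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 0, 0, 0),
          (1, 0, 0, 0, 0), (1, 1, 0, 1, 1)]
result = exact_match_by_concurrency(y_true, y_pred)
```

Averaging per combination first prevents a single frequent combination from dominating its concurrency level.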

Fig. 1: Effect of event concurrency on multi-label classification performance and combination prevalence in the validation datasets.

a Macro-averaged exact match values (%) across multi-label classification systems as a function of concurrency level (number of co-occurring events) in the validation datasets. Blue bars indicate Binary Relevance (BR), red bars Neural Discriminative Dimensionality Reduction multi-task learning (NDDR MTL), yellow bars Hard Parameter Sharing multi-task learning (HPS MTL), and green bars Conventional Multi-Label Classifier (CMLC). b Counts of unique audio event combinations in the validation dataset displayed on a logarithmic scale; concurrency levels are indicated with different coloured dashed boxes. Each label vector lists the events present in the combination using the acronyms: delphinid clicks (C), delphinid tonal (T), delphinid burst pulses (B), vessel noise (V), and sonar (S); events not present are indicated as 0. A total of 18 combinations are present in the datasets used in this study. Note that the proportion of audio segments per combination in the validation dataset is similar to that in the training dataset. See Fig. S2 for the counts per combination in the training dataset.

There is no consistent trend in the performance with increasing concurrency level (Fig. 1a). Whilst the best performance is obtained for the concurrency level of zero (no event present, i.e. ambient noise), there is not a monotonic decrease in performance from that point: the next best performing case (for 3 of the 4 classifiers) is for concurrency level 4. This does not support the hypothesis that higher concurrency levels lead to poorer performance.

To shed light on the reasons for this, Fig. 1b shows the number of events in the dataset for each combination, with the concurrency levels indicated. Of the 32 possible combinations, only 18 are represented in this dataset. The correlation between the presence of the event types leads to an uneven distribution of data across concurrency levels. For instance, because three of the classes are associated with the presence of dolphins, they frequently co-occur in some combinations; this is one reason why the higher concurrency levels (3 and 4) are more common in the dataset. This greater representation coincides with better system performance, especially evident at concurrency level 4, suggesting that the prevalence of combinations in the dataset is a more important factor affecting performance than the concurrency level.
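
To make the count of 32 possible combinations concrete, the short sketch below enumerates every binary label vector over the five event types and groups them by concurrency level, using the [C,T,B,V,S] notation of the figures. The helper names are illustrative only.

```python
from itertools import product

EVENTS = ("C", "T", "B", "V", "S")  # clicks, tonal, burst pulses, vessel, sonar


def label_vector(bits):
    """Format a binary tuple in the figure notation, e.g. [C,0,0,V,S]."""
    return "[" + ",".join(e if b else "0" for e, b in zip(EVENTS, bits)) + "]"


# Every possible presence/absence pattern over five event types: 2**5 = 32.
all_combinations = list(product((0, 1), repeat=len(EVENTS)))

# Group the formatted label vectors by concurrency level (events present).
by_level = {}
for bits in all_combinations:
    by_level.setdefault(sum(bits), []).append(label_vector(bits))

level_sizes = {level: len(vectors) for level, vectors in sorted(by_level.items())}
```

The possible combinations per concurrency level follow the binomial coefficients (1, 5, 10, 10, 5, 1), so levels 2 and 3 have the most possible combinations.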

Further evidence of the impact of the amount of data on system performance is provided by Fig. 2, which shows the exact match scores for each combination, averaged across the networks, against how many times that combination is represented in the validation dataset. There is a clear positive correlation between classification performance and the number of audio segments for specific combinations present in the data, highlighting the critical role of the availability of sufficient audio segments for each combination in enabling accurate MLC.

Fig. 2: Relationship between combination prevalence and multi-label classification performance.

Average exact match values across all folds and systems for each unique combination of audio events in relation to their representation in the dataset. The x-axis represents the counts for each combination on a logarithmic scale, and the y-axis shows the corresponding average exact match values (%). The binary vector for each combination follows notation [Delphinid Clicks (C), Delphinid Tonal (T), Delphinid Burst Pulses (B), Vessel Noise (V), Sonar(S)]. Note the reduced exact match value for combination [C,0,0,V,S] (highlighted by the red box) discussed in main text.

One notable outlier in Fig. 2 is the label vector [C,0,0,V,S], which corresponds to the co-occurrence of delphinid clicks, vessel noise and sonar signals. Within our dataset, delphinid clicks naturally co-occur with other delphinid vocalisations more frequently than with sonar signals. Furthermore, sonar signals, particularly echosounders, which typically occupy a similar frequency band to delphinid tonals, introduce spectral and temporal patterns similar to those of delphinid whistles. As a result, systems often (falsely) associate these sonar signals with delphinid tonals, leading to false insertions.

Confusion patterns across audio events

Identifying confusion patterns between audio events provides insights into the underlying reasons for system difficulties26,50. In MLC, confusion must be examined at the combination level, as multiple events can co-occur within the same segment. To analyse these patterns, we computed a normalised confusion matrix showing the conditional probability of predicting a given combination for each ground truth combination (Fig. 3).

Fig. 3: Confusion patterns between ground truth and predicted multi-label combinations.

Normalised confusion matrix demonstrating the conditional probability of predicting each label combination (x-axis) given the corresponding ground truth combination (y-axis). Cell colours indicate prediction probabilities, as shown by the colour scale, with darker colours representing higher probabilities. Only ground truth combinations with sufficient representation (>100 training segments) are included. Predicted combinations were shown if the entries (probabilities in the columns) exceeded 0.1 probability. Label vector notation follows [Delphinid Clicks (C), Delphinid Tonal (T), Delphinid Burst Pulses (B), Vessel Noise (V), Sonar(S)], where events present are indicated by their letter and absent events by 0.

The confusion matrix was computed by first identifying all segments belonging to each ground truth combination across all systems and folds, pooling their predictions and counting how frequently each possible combination was predicted. These counts were then divided by the number of occurrences of the ground truth combination. To reduce visual clutter, only results for the commonly occurring ground truth combinations, specifically those which occurred more than 100 times in the training dataset, are shown. Further, we removed rarely occurring predictions, i.e. columns in which the prediction probabilities never exceeded 0.1 for any ground truth combination, so that the results shown contain only the interpretable and non-negligible prediction patterns.
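
The procedure can be sketched in a few lines of Python. This is a simplified illustration, not the authors' code: it filters ground-truth combinations by their count in the evaluated data rather than in the training dataset, and the function name, parameters and toy predictions are assumptions.

```python
from collections import Counter, defaultdict


def combination_confusion(y_true, y_pred, min_truth_count=1, min_prob=0.1):
    """Row-normalised confusion between label combinations.

    Returns {truth_combo: {pred_combo: P(pred | truth)}}, keeping only
    ground-truth combinations seen at least `min_truth_count` times and
    predicted combinations whose probability exceeds `min_prob` in at
    least one kept row.
    """
    rows = defaultdict(Counter)
    for truth, pred in zip(y_true, y_pred):
        rows[tuple(truth)][tuple(pred)] += 1

    probs = {}
    for truth, preds in rows.items():
        n = sum(preds.values())
        if n >= min_truth_count:
            probs[truth] = {p: c / n for p, c in preds.items()}

    # Columns to display: predictions that are non-negligible somewhere.
    keep = {p for row in probs.values() for p, v in row.items() if v > min_prob}
    return {
        t: {p: v for p, v in row.items() if p in keep}
        for t, row in probs.items()
    }


# Hypothetical pooled predictions for one ground-truth combination [C,0,0,V,S].
truth = (1, 0, 0, 1, 1)
y_true = [truth] * 10
y_pred = [truth] * 5 + [(1, 1, 0, 1, 1)] * 4 + [(0, 0, 0, 1, 1)]
matrix = combination_confusion(y_true, y_pred)
```

In this toy case the prediction [0,0,0,V,S] has probability exactly 0.1, so it does not exceed the threshold and its column is dropped.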

The resulting confusion matrix has a strong leading diagonal, whose values correspond to the probability of an exact match for the given combination. This matrix provides greater insight into the outlier [C,0,0,V,S] highlighted in Fig. 2: 51% of the segments with this combination were predicted as combinations that include delphinid tonal (T), with most of those (36%) being predicted as [C,T,0,V,S], confirming the high rate of false positives for delphinid tonals. Further, the systems are mainly confused when sonar occurs without tonals: for the combinations [0,T,0,V,S] and [C,T,0,V,S], in which sonar and tonals co-occur, the majority of the segments were correctly identified.

The delphinid vocalisations, namely clicks (C), tonals (T) and burst pulses (B), frequently trigger insertions of one another. For example, 28% of [0,T,B,V,0] segments were predicted as [C,T,B,V,0], 14% of [C,0,B,V,0] were shifted to [C,T,B,V,0], and for [0,T,0,V,S] the systems inserted delphinid clicks in 40% of the cases and burst pulses in 11% of the cases. These interactions are also evident in false negatives, where one delphinid component suppresses another. For example, 33% of [C,T,0,0,0] samples were missing clicks and 12% were missing tonals. Similarly, 25% of [C,0,B,V,0] samples failed to detect burst pulses. This is mainly because delphinid vocalisations often naturally co-occur in varying combinations, which can confuse the systems when identifying event patterns.

While the insertion of tonals in the presence of sonar is a noticeable issue, the opposite pattern does not occur. When tonals are present, sonar is rarely inserted (there is only 4% insertion of sonar in total when delphinid tonal is present in the segments). Similarly, when sonar and tonals co-occur, the systems miss both the tonal and sonar together in 5% of the cases for both [0,T,0,V,S] and [C,T,0,V,S]. Therefore, the systems’ ability to identify sonar is not affected by the presence of tonals, in contrast to the presence of sonar causing the systems to incorrectly insert tonals. A prominent insertion of sonar occurs for ambient noise, where 17% of ambient noise was predicted as sonar. The main reason is the presence of sonar segments that propagate from long distances (especially sonar signals occurring without vessel noise), which exhibit very faint spectral patterns, causing the systems to identify some ambient noise segments as sonar.

Vessel noise does not show a strong confusion pattern and is mostly identified correctly. The only prominent miss rate is 7% for [C,0,B,V,0]. In other cases, vessel noise is not missed on its own but only as part of broader misclassifications.

Segment-based PR curve analysis (per-class performance analysis)

To analyse the per-class performance of the systems and its variability across the folds, we computed precision-recall (PR) curves with area under the curve (AUC) values (Fig. 4). All the systems achieve AUC values above 84.0% across the audio events, exceeding 90.0% when delphinid tonals are excluded and reaching above 98.0% for vessel noise. The greatest variability in performance is observed for the delphinid tonal class. This was caused by the uneven distribution of echosounder signals: in particular, the validation datasets for fold 1 and fold 2 have a high representation of this signal type, causing a high false positive (FP) rate. A sharp drop in precision at low recall values, driven by these FPs, yields the low AUC values.
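
The effect of confident false positives on the PR curve can be illustrated with a small, dependency-free sketch. It assumes a step-wise (average precision) estimate of the area under the curve, which may differ from the integration rule used in the study, and it requires at least one positive segment per class; the function names, scores and labels are hypothetical.

```python
def pr_curve(y_true, scores):
    """Precision and recall at every distinct score threshold (descending)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    precision, recall = [], []
    tp = fp = 0
    for rank, i in enumerate(order):
        tp += y_true[i]
        fp += 1 - y_true[i]
        # Emit a point only after the last segment sharing this score.
        last_of_tie = (
            rank + 1 == len(order) or scores[order[rank + 1]] != scores[i]
        )
        if last_of_tie:
            precision.append(tp / (tp + fp))
            recall.append(tp / n_pos)
    return precision, recall


def average_precision(precision, recall):
    """Step-wise area under the PR curve: sum of precision * recall increment."""
    area, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        area += p * (r - prev_recall)
        prev_recall = r
    return area


scores = [0.9, 0.8, 0.7, 0.6, 0.2]      # hypothetical tonal-class scores
clean_fold = [1, 0, 1, 1, 0]            # top-scoring segment is a true tonal
echosounder_fold = [0, 1, 1, 1, 0]      # top-scoring segment is a false positive

ap_clean = average_precision(*pr_curve(clean_fold, scores))
ap_echo = average_precision(*pr_curve(echosounder_fold, scores))
```

With the same scores, moving a single false positive to the top of the ranking forces precision to zero at the lowest recall and lowers the area, mirroring the mechanism described for folds 1 and 2.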

Fig. 4: Segment-based precision-recall performance of multi-label classification systems for all the audio events used in this study.

Precision-recall (PR) curves showing per-class classification performance for delphinid clicks, tonal, burst pulses, vessel noise and sonar. In each PR curve, the horizontal axis represents recall and the vertical axis represents precision. Solid lines indicate the mean PR curves computed across four validation folds. Shaded regions denote the minimum and maximum PR values observed across the folds to show performance variability. Binary Relevance (BR) is shown with blue lines, Neural Discriminative Dimensionality Reduction multi-task learning (NDDR MTL) with red lines, Hard Parameter Sharing multi-task learning (HPS MTL) with grey lines, and the Conventional Multi-Label Classifier (CMLC) with green lines. Area under the curve (AUC) values are provided for each PR curve to summarise overall per-class performance.

Across all systems, vessel noise is the most accurately classified event, with AUCs exceeding 98.0% and the lowest variability. This low variability highlights this class’s predictability, indicating its less complex acoustic characteristics regardless of the system type. Three factors contribute to this: vessel noise, as a persistent tonal sound, is particularly consistent across the audio segments; it is well represented, along with delphinid clicks (at least one of these two audio events is present in ~45.0% of the entire dataset); and the MS-PCEN input representation used in this study provides higher resolution at lower frequencies. This shows the ability of all systems to identify discriminative and well-represented events.

System model size effect

Experiments were conducted to understand the effect of the models’ size by systematically adjusting the number of trainable parameters across a range, from approximately 5 thousand to 12 million. This was realised by varying the number of convolutional kernels and recurrent units, while keeping all other architectural components and training procedures consistent with those described in Systems and Model Training Settings.

It is important to note that there are two common strategies to compare the performance of BR with other MLC systems in such experiments: (1) normalising BR’s total trainable parameters37 or (2) keeping each BR network the same as that in the other systems35,51. We followed the second approach because each BR network operates independently, meaning it does not share representations across labels like the other systems. Reducing BR’s total parameter count would not align with its independent-task structure, as each classifier learns separately without benefiting from shared features. Hence, we compare the systems as they naturally are, considering both total parameter counts and network structures.

The average accuracy values with min-max error bars across the four folds are shown in Fig. 5, where error bars illustrate the spread of performance across folds for each model size. As model size increases, all systems show improved performance and the performance differences between systems diminish. A key observation is that after 410 K trainable parameters, all systems reach a performance plateau. The differences between the 5 K and 410 K configurations are as follows: BR shows a 5.0% change in accuracy; NDDR MTL changes by 8.4%; HPS MTL by 10.7%; and CMLC by 13.8%. These results indicate that BR is particularly robust to variations in model size.

Fig. 5: Effect of model size on multi-label classification performance across systems.

This figure shows MLC performances of the evaluated systems, Binary Relevance (BR) with blue circular markers, Neural Discriminative Dimensionality Reduction multi-task learning (NDDR MTL) with red square markers, Hard Parameter Sharing multi-task learning (HPS MTL) with grey triangular markers and Conventional Multi-Label Classifier (CMLC) with green diamond markers under varying model sizes, measured by accuracy in percentages. Each mark represents the average performances of the systems across four folds, with error bars indicating the minimum and maximum values. The x-axis shows the total number of trainable parameters on a logarithmic scale. The cluster of points for each parameter size has been horizontally separated to improve clarity.

When comparing systems, BR and NDDR MTL generally achieve higher accuracy than CMLC and HPS MTL across nearly all model sizes. Notably, performance differences between systems become more apparent under reduced model capacity, where BR consistently maintains higher accuracy compared to the other approaches. The performance variability, as indicated by min-max error bars, shows the differences in the systems’ robustness. In general, variability increases as model size decreases. Overall, the CMLC system has the highest variability among the systems. For example, at the smallest model size (5 K), CMLC experiences a sharp drop in accuracy (6.8%) compared to the next configuration (10 K), with the widest min-max spread of 20.0%. This shows the CMLC system’s sensitivity to the data distribution across folds and suggests that it may struggle in domain adaptation or cross-site classification tasks.

Discussion

This study provides an effective methodology for analysing marine polyphonic soundscapes that contain both biophonic and anthropogenic signals using MLC systems. By systematically comparing four supervised deep learning systems (CMLC, BR, HPS MTL, and NDDR MTL), our work highlights the strengths and limitations of each approach under realistic classification scenarios. We evaluated performance with respect to the number of concurrent event types present, the diversity of co-occurring event combinations, and model size.

The BR system has previously shown success in classifying a number of distinct sonar signals37. Our results extend this finding to a more complex underwater task involving concurrent biophonic and anthropogenic audio events. Overall, the BR system achieved the best MLC performance, closely followed by NDDR MTL. The success of these two systems shows the value of class-specific network structures for accurately classifying co-occurring underwater sound events.

Although BR produced the best classification performance, its computational properties introduce practical limitations for large-scale PAM applications. Each BR classifier has training and inference times comparable to the full CMLC network because it uses the same underlying architecture (defined in ‘Methods’). BR’s total computational cost increases almost linearly with the number of classes; in our five-label task, this resulted in an approximately five-fold increase in inference time relative to the integrated systems (about four times longer than NDDR MTL). This limitation is expected to become increasingly critical as the number of target classes grows (e.g. applications involving larger acoustic taxonomies). For example, extending the current framework to 15–20 audio events would require a proportional increase in the number of independent BR classifiers, which substantially increases computational cost, whereas integrated MLC systems would maintain a single network with shared layer(s). The model size experiments provide additional insight into this trade-off. When the total number of parameters is reduced, BR exhibits the smallest decrease in performance, even at relatively small model sizes (e.g. the 5k configuration, corresponding to 25k total parameters for the system). This shows that the class-specific nature of BR classifiers allows them to retain strong performance under limited model capacity. However, this robustness at the classifier level does not completely offset the fundamental scalability limitation of BR at the system level, since achieving comparable per-class capacity requires maintaining one independent network per audio event, and normalising the total BR system size implies proportionally smaller binary classifiers. 
At the same time, we note an important practical advantage of BR: introducing a new audio event requires training only an additional binary classifier, whereas integrated systems require reconfiguring and retraining the entire network. On the other hand, integrated systems generate predictions for all classes with a single forward pass, which makes them computationally efficient when operating on long-duration PAM datasets or when scaling to a larger number of event types, particularly given that the performance gap between NDDR-MTL (10k model size) and BR (single binary classifier with 205k parameters) remains below 5% accuracy.
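
The scaling argument can be made concrete with a back-of-the-envelope sketch. It assumes an idealised, strictly linear cost model in which every BR classifier has the same size and needs its own forward pass, and it ignores any fixed overheads; the function name is illustrative.

```python
def br_cost(params_per_classifier, n_classes):
    """Total size and forward passes per segment for Binary Relevance.

    Idealised model: one independent, equally sized network per class.
    """
    return {
        "total_parameters": params_per_classifier * n_classes,
        "forward_passes_per_segment": n_classes,
    }


# The 5k configuration with the five classes used in this study.
five_class = br_cost(5_000, 5)

# A hypothetical larger taxonomy of 20 audio events at the same per-class size.
twenty_class = br_cost(5_000, 20)
```

For the five-class case this reproduces the 25k total parameters quoted above, while an integrated system would still need only one forward pass per segment regardless of the number of classes.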

A key insight from our study is that the diversity of co-occurrence types and their representation in the training dataset have a greater impact on classification performance than the number of co-occurring classes alone. This contrasts with prior studies showing monotonic decreases in performance with increased concurrency levels40,49, which used artificially mixed datasets. Our dataset only includes naturally co-occurring signals, and our findings emphasise the importance of co-occurrence frequency in training data. This is mainly because artificially mixed signals often lack coherent acoustic patterns, as events are combined without preserving their original environmental or behavioural contexts. Such combinations can result in unnatural overlaps where spectral components of distinct signals mask or distort each other, which makes them harder to disentangle. As a result, increasing the concurrency level in artificially mixed datasets consistently decreases the networks’ ability to distinguish between them.

We further observed consistent misclassification of delphinid tonal signals in the presence of high-energy echosounders occurring within a similar frequency band. This issue was not observed for sonar signals in lower frequency bands or for lower energy echosounders operating within the same frequency range. These findings support the hypothesis that automated systems may misclassify sonar signals as delphinid whistles, especially in regions where both signal types co-occur. Given that whistles play critical roles in identifying whistling animals and studying species behaviour52,53, such misclassifications could hinder ecological analyses. Therefore, care must be taken in data collection to avoid regions dominated by such anthropogenic signals or to take additional measures to improve classifier robustness.

These observations can be further interpreted in terms of how the systems learn from the data. In the evaluated systems, each class is associated with its own output and loss, and therefore, the models are primarily trained to learn class-specific spectro-temporal features from the input representations. However, the results also show that model behaviour is influenced by the co-occurrence structure present in the dataset, as demonstrated in both the confusion patterns and the representation of different combinations in the data. These effects appear to vary across architectures. The BR system relies on independent classifiers and does not model shared representations across classes, whereas integrated systems (CMLC, HPS MTL, and NDDR MTL) incorporate shared or partially shared features. Lower performance of CMLC for the delphinid tonal class, as observed in the PR curve analysis, suggests that fully shared representations may not sufficiently capture the spectro-temporal characteristics required to distinguish acoustically similar events, potentially due to feature interference when events share similar spectro-temporal patterns. This effect appears to be more evident in CMLC, which also shows greater sensitivity to reduced model capacity in the model size experiments. In contrast, architectures that allow either class-specific feature learning (BR) or controlled feature sharing (e.g. NDDR MTL) appear better able to distinguish between spectro-temporally similar events.

Beyond the system-level findings, this study also highlights the broader value of multi-label approaches for marine bioacoustics. Many ecological questions require understanding not only the presence of individual sound types, but also how biophonic and anthropogenic signals co-occur within the same acoustic scene. Unlike single-label or binary detection frameworks, MLC systems can characterise these polyphonic conditions by providing a more realistic and diverse representation of marine soundscapes. Such capabilities are important for future ecological analyses, particularly in scenarios where the types or combinations of anthropogenic signals are unknown a priori. The framework established in this study can therefore support subsequent applications, including population- or behaviour-level studies, by providing a foundation for analysing complex acoustic mixtures. At Tolsta, the frequent co-occurrence of vessel noise with delphinid vocalisations shows that both sound sources are persistently present in the same acoustic scenes. This may suggest a regular overlap between anthropogenic activity and cetacean presence in the region, rather than short-term avoidance behaviour. Future work may extend this approach to new regions and/or incorporate additional event categories.

Our choice of Tolsta, a site reflective of Western Scotland26, enabled this study to capture a range of co-occurring biophonic and anthropogenic sounds. However, system generalisation to other locations remains untested. Domain adaptation and cross-site validation have gained prominence as problems in underwater sound classification50,54 and extending our study to include other datasets will be essential to fully assess generalisability. Nevertheless, our findings, particularly the observed variability in the CMLC system under different folds, already hint at potential differences in how well each system may generalise across acoustic environments. Additionally, although we evaluated five sound types (three delphinid vocalisations, vessel noise, and sonar), the framework is adaptable to additional biophonic, anthropogenic, and geophonic sources (e.g. baleen whale vocalisations, seismic airgun pulses, rain), contingent on annotated data availability.

We also note that we curated the full 101k-segment dataset to allow a controlled and fair comparison of the four systems, as class imbalance is known to adversely affect both training stability and the interpretation of results55,56. The original dataset (Table 2) is heavily dominated by vessel noise, which would have masked system-level differences and made the comparison largely reflective of dataset bias rather than model behaviour. The curated dataset therefore had a specific methodological purpose, as outlined in Data Preparation and Experimental Setup: reducing extreme imbalance, improving the representation of rare but ecologically meaningful combinations, and ensuring that all systems were trained under the same experimental conditions (same input representation, training settings, and balanced label distribution). Although loss-function adjustments (e.g. weighted loss) can also address imbalance, such approaches are not straightforward for multi-label problems because segments may contain multiple events and class frequencies are not independent. Developing imbalance-aware loss strategies is an important direction for future work, but lies beyond the scope of the present study, which focuses on comparing system behaviour under controlled and reproducible conditions. We note that system performance and relative comparisons may vary when evaluated on the full imbalanced dataset.

Table 2 Distribution of the labelled data segments across classes and seasons

The overall performance and stability of both the BR and NDDR multi-task learning (NDDR MTL) systems across different numbers of concurrent event types, varying representation of label combinations in the data and different model sizes suggest they are better suited for real-world polyphonic analysis. BR’s success supports the utility of class-specific modelling in class-imbalanced, complex environments, while NDDR MTL’s competitive performance highlights the promise of soft parameter sharing (SPS) MTL systems. Together, these findings provide a foundation for MLC in marine bioacoustics, particularly in complex and ecologically significant situations.

Methods

Multi-labelled data

This study uses a subset of the PAM data collected during the COMPASS project (EU INTERREG). The data were collected in January, April, June and November 2019 at a single site, Tolsta (58.392° N, −6.008° W) (Fig. 6). The COMPASS mooring comprises a single omni-directional broadband recorder (Ocean Instruments SoundTrap 300 HF) positioned in 100 m water depth55. The recording protocol followed a 20/40 min on/off duty cycle at a sampling rate, \(f_{s}\), of 96 kHz.

Fig. 6: Location of the hydrophone mooring at the Tolsta site and surrounding special areas of conservation (SAC).

The hydrophone mooring deployed during the COMPASS project is indicated by a red circle symbol. Special Areas of Conservation boundaries are shown using purple hatched regions. The inset map shows the location of the study area within the United Kingdom.

Multi-labelling process

The MLC is based on the analysis of 3 s segments extracted from each 20-min recording. Five labels are considered: delphinid tonal, delphinid clicks, delphinid burst pulses, vessel noise and sonar, where sonar is used to represent anthropogenic, modulated, narrow-band signals, including sources such as echosounders and pingers. A subset of the available recordings was labelled based on visual inspection of the spectrograms with linear and mel frequency scales (2048-sample FFT size with 0.75 overlap) and audio playback, while also considering data outside of the 3 s segment to provide context. More than one label can be applied to a segment. Several challenges arose during labelling and were resolved using some basic criteria. Delphinid clicks were distinguished from other transient sounds by their short duration and by having energy only above 5 kHz. In rare instances delphinid whistles and sonar were challenging to separate; whistles could usually be distinguished by their higher degree of frequency and amplitude modulation.

All primary labelling was conducted by the lead author. To ensure reliability, labelled data was independently cross-checked by a second analyst. Any uncertain or ambiguous segments identified during the cross-check were excluded from the labelled dataset if there was no consensus between analysts; in total, fewer than 0.1% of all segments were removed through this process.

Labels were encoded as 5-digit binary vectors, one for each 3 s segment, using the order [Delphinid Clicks, Delphinid Tonal, Delphinid Burst Pulses, Vessel Noise, Sonar], where ‘1’ indicates presence and ‘0’ absence of that class. For example, a vector of [1, 0, 1, 1, 0] would indicate the presence of clicks, burst pulses, and vessel noise, with no tonal or sonar. The binary vector [0, 0, 0, 0, 0] represents the ambient noise class.
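As a minimal illustration of this encoding, the mapping between class names and binary vectors could be written as follows. This is a sketch only: the helper names `encode_labels` and `decode_labels` are hypothetical and are not taken from the study's code.

```python
# Class order as given in the text.
CLASSES = [
    "Delphinid Clicks",
    "Delphinid Tonal",
    "Delphinid Burst Pulses",
    "Vessel Noise",
    "Sonar",
]


def encode_labels(present):
    """Return the 5-element binary vector for a set of class names."""
    unknown = set(present) - set(CLASSES)
    if unknown:
        raise ValueError(f"Unknown class names: {sorted(unknown)}")
    return [1 if name in present else 0 for name in CLASSES]


def decode_labels(vector):
    """Return the class names marked present; all zeros means ambient noise."""
    if len(vector) != len(CLASSES):
        raise ValueError("Label vector must have one entry per class")
    names = [name for name, flag in zip(CLASSES, vector) if flag == 1]
    return names or ["Ambient Noise"]


example = encode_labels({"Delphinid Clicks", "Delphinid Burst Pulses", "Vessel Noise"})
print(example)                          # [1, 0, 1, 1, 0]
print(decode_labels([0, 0, 0, 0, 0]))   # ['Ambient Noise']
```

Because the all-zero vector is reserved for ambient noise, no sixth output is needed to represent it.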

A total of 101,535 segments (approximately 85 h of data) were annotated, yielding 192,209 individual labels (Table 2). The breakdown clearly shows that the dataset is dominated by vessel noise, which occurs roughly 2.5 times more frequently than the next most common event type, delphinid clicks. Several seasonal patterns are also apparent: delphinid vocalisations are most prevalent during the winter months (January and November); sonar detections increase in April, coinciding with an annual naval exercise in the region; and in November, no segments were labelled as ambient noise due to the near-constant presence of other classes, primarily those associated with delphinids. Four multi-labelled exemplars are given in Fig. 7.

Fig. 7: Examples of co-occurring underwater acoustic events in linear frequency spectrograms.

Linear frequency spectrograms of four different 3-second segments illustrating various co-occurrences of underwater acoustic events. Each signal type is labelled with arrows of different colours for clarity. The corresponding binary vectors indicating the presence (1) or absence (0) of different sound classes in the format [Delphinid Clicks, Delphinid Tonal, Delphinid Burst Pulses, Vessel Noise, Sonar] are: a [1, 0, 1, 1, 0], b [1, 0, 0, 1, 1], c [1, 1, 1, 1, 0], d [1, 1, 1, 1, 0].

Input representation

Systems benefit from using an input representation that is robust to changes in the background noise23,26,57. Herein, we apply per-channel energy normalisation (PCEN)58,59 to the Mel-spectrogram (MS) to provide dynamic range compression and stabilise signal energy levels, resulting in what we refer to as the MS-PCEN55.

MS-PCEN is calculated for each 3 s audio segment based on a short-time Fourier transform (STFT) using a 2048-point (21 ms) Hanning window with 0.75 overlap. The MS is computed using a Mel-filter bank with 64 bands. PCEN is then applied as described in refs. 58,59. The PCEN hyperparameters we used were: offset, \(\delta = 0.05\); gain parameter, \(\alpha = 0.98\); scaling factor, \(r = 0.5\); numerical stability parameter, \(\epsilon = 1.4\); and smoothing coefficient, \(\gamma = 0.967\), following55. The resulting representation contains 64 Mel bands and 559 time frames for each 3 s segment. MS-PCEN is provided as a three-channel input by repeating the single-channel spectrogram. Although MS-PCEN can also be represented as a single-channel input, the three-channel format was retained to remain consistent with the formulation applied in previous work55, yielding a final segment dimension of 64 × 559 × 3.
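The stated segment dimension can be checked with a short calculation. The sketch below assumes that the STFT frames are not padded at the segment edges and that the hop size equals the window length multiplied by (1 − overlap); neither detail is stated explicitly above, and the function name `to_three_channels` is hypothetical. It does not compute the Mel-spectrogram or PCEN itself, only the array shapes.

```python
import numpy as np

FS = 96_000        # sampling rate in Hz
SEGMENT_S = 3      # segment duration in seconds
N_FFT = 2048       # STFT window length in samples
OVERLAP = 0.75     # fractional overlap between windows
N_MELS = 64        # number of Mel bands

hop = int(N_FFT * (1 - OVERLAP))            # 512 samples (assumed hop size)
n_samples = FS * SEGMENT_S                  # 288,000 samples per segment
n_frames = 1 + (n_samples - N_FFT) // hop   # frames without edge padding (assumption)
window_ms = 1000 * N_FFT / FS               # window duration in milliseconds


def to_three_channels(ms_pcen):
    """Repeat a (mel, time) array along a new last axis to give (mel, time, 3)."""
    return np.repeat(ms_pcen[..., np.newaxis], 3, axis=-1)


placeholder = np.zeros((N_MELS, n_frames), dtype=np.float32)  # stands in for MS-PCEN
model_input = to_three_channels(placeholder)

print(round(window_ms, 1))   # 21.3
print(n_frames)              # 559
print(model_input.shape)     # (64, 559, 3)
```

Under these assumptions the frame count matches the 559 time frames quoted in the text; a centred, padded STFT would give a different count.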

Systems background

Four different MLC systems are considered for analysing concurrent (polyphonic) acoustic data. Rather than focusing on the design or optimisation of new deep learning models, our primary objective is to examine how different MLC frameworks operate and perform. All the systems in this study share a common foundation based on a Convolutional Recurrent Neural Network (CRNN) architecture, which is widely used for similar tasks27,28,37,60 and is detailed in the following.

In our study, the CRNN architecture consists of three convolutional blocks followed by three recurrent blocks. Each convolutional block comprises a 2D convolutional layer followed by Batch Normalisation (BN)61, a Rectified Linear Unit (ReLU) activation function to introduce non-linearity62, 2D non-overlapping max pooling (MP), and a dropout layer with a rate of 0.25 to help mitigate overfitting63. We use 5 × 5 kernels across all convolutional layers, zero-pad the inputs to the convolutional layers and adopt pooling sizes of [(5, 1), (4, 1), (2, 1)], respectively. All the CNN outputs are reshaped using time-distributed flatten layers so that each timestep’s feature map is individually flattened for processing by the subsequent recurrent layers.
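The effect of the three pooling stages on the feature-map size can be traced with simple arithmetic. The sketch below assumes that the first input axis is the Mel (frequency) axis and the second is time, that zero-padded convolutions preserve the spatial size, and that non-overlapping pooling discards any remainder (floor division). These are assumptions about the implementation rather than details confirmed in the text, and `trace_pooling` is a hypothetical helper.

```python
def trace_pooling(freq_bins, time_frames, pool_sizes):
    """Return the (frequency, time) shape after each non-overlapping pooling stage."""
    shapes = [(freq_bins, time_frames)]
    for pool_f, pool_t in pool_sizes:
        freq_bins //= pool_f      # floor division: leftover rows are dropped
        time_frames //= pool_t
        shapes.append((freq_bins, time_frames))
    return shapes


POOL_SIZES = [(5, 1), (4, 1), (2, 1)]
shapes = trace_pooling(64, 559, POOL_SIZES)
for stage, shape in enumerate(shapes):
    print(stage, shape)

# With 512 kernels in the last convolutional layer (the CMLC configuration),
# each timestep would be flattened to final_freq * 512 features.
final_freq, final_time = shapes[-1]
features_per_step = final_freq * 512
```

Under these assumptions the frequency axis is reduced from 64 to 1 while all 559 time frames are preserved, so the recurrent layers would see one feature vector per STFT frame.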

Each recurrent block incorporates bidirectional gated recurrent units (GRUs) with 128 units. Each GRU uses a tanh activation function, with a dropout rate of 0.2 applied to the input connections. The outputs of the bidirectional GRUs are combined by averaging64. A temporal MP (TMP) layer is applied before the final dense output layer, which uses sigmoid activation to implement binary logistic regression function(s) (BLRF) for the MLC task. This architecture is used uniformly across all systems.

System architectures

Here we provide details of the four MLC architectures considered; these are illustrated in Fig. 8 and summarised in Table 3, together with training and inference times. All the systems considered are based on networks of 4.1 M trainable parameters, with the BR system employing K (=5) such networks. The exception is the System Model Size Effect experiment, where the number of trainable parameters is systematically varied.

Fig. 8: Architectural designs of the multi-label classification systems evaluated in this study.

This figure illustrates the convolutional recurrent neural network (CRNN) architectures used for the Conventional Multi-Label Classifier (CMLC), Binary Relevance (BR), Hard Parameter Sharing multi-task learning (HPS MTL), and Neural Discriminative Dimensionality Reduction multi-task learning (NDDR MTL) systems. CMLC and BR systems are illustrated together because each BR classifier uses the same CRNN architecture as CMLC; however, BR consists of K independent binary classifiers, each producing a single output, and these outputs are concatenated to form the final label vector. CMLC architecture consists of three convolutional blocks, each containing a two-dimensional (2D) convolutional layer, Batch Normalisation (BN), Rectified Linear Unit (ReLU) activation function, max pooling (MP) layers with sizes [(5, 1), (4, 1), (2, 1)], and dropout with a rate of 0.25. These are followed by three recurrent blocks, each with bidirectional GRUs (128 units) using tanh activation and a dropout rate of 0.2 applied to the input connections. The convolutional neural network (CNN) and recurrent neural network (RNN) blocks are followed by temporal max pooling (TMP) and binary logistic regression function (BLRF) layers. HPS MTL architecture extends the CMLC design by incorporating class-specific recurrent blocks after the shared convolutional layers. The NDDR MTL architecture uses class-specific convolutional and recurrent layers while incorporating NDDR layers for feature sharing in the CNN part. The final one by one (1 × 1) convolutional layer in NDDR MTL, which combines outputs from all NDDR layers, uses the same configuration as the NDDR layers but applies a diagonal initialiser prioritising the final NDDR layer output, with a weight of 0.6 assigned to it, while the two earlier layers are weighted at 0.2 each.

Table 3 System architecture configurations, trainable parameter counts, and training/inference times

The CMLC network shares a single CRNN structure across all labels. A dense layer is then used to produce the output vector, which contains one element for each class. The convolutional layers in CMLC use 160, 192 and 512 kernels. The network is trained to output predictions in the form of the K element binary label vector representing absences (0) or presences (1) for each of the K labels.

The BR system consists of K independent binary classifiers, each trained to detect a particular audio event. The output predictions of these K classifiers are concatenated to form the output vector. Each binary classifier in the BR system uses the same architecture used in the CMLC network but with a single output unit. Note that this means that the BR system has roughly K times as many parameters as in the corresponding CMLC network.

Instead of building separate models for each class (as in BR) or using completely shared network layers (as in CMLC), MTL architectures balance shared and task-specific components within a unified network. We implement two common types of MTL structures, which are examples of hard parameter sharing (HPS) and soft parameter sharing (SPS). The distinction between these approaches lies in the degree to which features are shared across classes during learning.

The HPS MTL system employs convolutional blocks that are shared across all classes, with distinct recurrent blocks used within each class. In our implementation, the shared convolutional layers have 96, 128 and 192 kernels, after which each class has its own bidirectional GRU layers.

Within the SPS MTL system, each class employs its own convolutional and recurrent blocks, but the blocks can exchange information within the network. We use 64 kernels across the convolutional layers to keep the size of the network similar to the other networks considered. For feature exchange, we employ NDDR layers with shortcuts in the convolutional blocks45. At the end of each class-specific convolutional block, feature maps from all blocks are concatenated along the channel dimension and passed through a BN layer for stabilisation before being processed by the NDDR layers. The concatenation combines discriminative features from all classes, which results in feature maps with K times the number of channels of a single block. NDDR layers use 1 × 1 convolutional layers after the BN layer to fuse features across blocks while reducing the channel dimension to match a single block's output. \(\ell_{2}\) weight decay is applied to the 1 × 1 convolutional weights, with a moderate regularisation factor of 0.01, and each NDDR layer employs class-specific diagonal matrix initialisers to prioritise class-relevant features (values used are: 0.6 for self, 0.1 for others).
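One way to picture the class-specific diagonal initialiser is as a stack of scaled identity matrices, one per class, forming the initial weights of the 1 × 1 convolution. The NumPy sketch below is an illustration under that assumption; the function names `nddr_diagonal_init` and `apply_1x1` are hypothetical, and the weight layout (concatenated channels by output channels) is assumed rather than taken from the study's code.

```python
import numpy as np


def nddr_diagonal_init(class_index, n_classes=5, channels=64,
                       self_weight=0.6, other_weight=0.1):
    """Initial 1x1-conv weights of shape (n_classes * channels, channels).

    The block belonging to `class_index` is self_weight * I; every other
    block is other_weight * I.
    """
    identity = np.eye(channels, dtype=np.float32)
    blocks = [
        (self_weight if k == class_index else other_weight) * identity
        for k in range(n_classes)
    ]
    return np.concatenate(blocks, axis=0)


def apply_1x1(features, weights):
    """Apply a bias-free 1x1 convolution as a matrix product over channels.

    features: (freq, time, n_classes * channels); returns (freq, time, channels).
    """
    return features @ weights


weights = nddr_diagonal_init(class_index=2)
rng = np.random.default_rng(0)
single = rng.normal(size=(12, 559, 64)).astype(np.float32)
concatenated = np.tile(single, (1, 1, 5))   # same features from all 5 classes
fused = apply_1x1(concatenated, weights)
print(weights.shape, fused.shape)
```

With the quoted values the weights for each output channel sum to 0.6 + 4 × 0.1 = 1.0, so at initialisation the layer returns a weighted average that favours the class's own features.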

After each NDDR block, bilinear interpolation is used to ensure the outputs of all the layers are resized to match the spatial dimensions of the output of the final NDDR layer45. For each class, resized outputs from the NDDR layers are concatenated along the channel dimension. Skip connections are constructed individually for each class, and the resultant feature maps, after concatenation, are subsequently reduced to the original channel dimension by an additional 1 × 1 convolutional layer with the same configuration as the NDDR layers, with the exception of the diagonal initialiser. In this case, the initialiser assigns a weight of 0.6 to the final NDDR layer and 0.2 to each of the two earlier layers, which ensures that skip connections emphasise recent information while retaining contributions from prior layers. These convolutional layers are followed by class-specific recurrent layers to process each classification path accordingly. This SPS MTL structure with NDDR layers using skip connections is referred to as NDDR MTL in this study. Both HPS MTL and NDDR MTL systems produce K binary outputs, with each class-specific path terminating in a dense layer with one output unit. This design yields binary label vectors across the segments, as in the CMLC system.

Model training settings

A training batch size of 16 is used throughout. An early stopping criterion was not used in this study, as it can lead to some networks stopping earlier than others. This can cause incomplete training and inconsistencies, because the criterion depends on validation (or test) error curves that frequently exhibit multiple local minima. In preliminary work, the error curves were monitored to determine a suitable number of epochs; the value used was 100, which allowed all the networks to converge.

We use the adaptive moment estimation (Adam) optimiser with exponential learning rate decay65. The exponential decay schedule used an initial rate of 0.001 and a decay rate of 0.75 applied every 90 steps with a staircase function for stable training26,50,55. The binary cross-entropy loss function is applied for all the systems. Glorot uniform initialisation66 is used for the weights and zero initialisation for the biases of all layers in the systems, except for the NDDR layers, which use the diagonal initialiser for the weights.
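The staircase schedule described above can be written as a one-line function. The sketch below assumes that a "step" is one optimiser update (one batch) and that the rate is multiplied by 0.75 after every complete block of 90 steps. This is the behaviour of Keras's `ExponentialDecay` schedule with `staircase=True`, although the text does not confirm that this class was used. The function name `staircase_lr` is hypothetical.

```python
def staircase_lr(step, initial_lr=1e-3, decay_rate=0.75, decay_steps=90):
    """Learning rate after `step` optimiser updates under staircase decay."""
    return initial_lr * decay_rate ** (step // decay_steps)


# The rate is constant within each block of 90 steps and drops at block edges.
for step in (0, 89, 90, 179, 180, 270):
    print(step, staircase_lr(step))
```

With the staircase form the rate stays at 0.001 for steps 0 to 89, drops to 0.00075 at step 90, and continues to fall by a factor of 0.75 at every multiple of 90.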

All experiments are implemented in Python using the Keras library67 and executed in the Google Colab Pro+ cloud environment with 52 GB of system RAM and a Tesla T4 GPU with 15,360 MiB of memory.

Data preparation and experimental setup

The imbalance in the labelled dataset (see ‘Multi-labelling process’) potentially causes significant challenges in system evaluation, as severe overrepresentation of a certain class can lead to biased training and testing. This imbalance is not as readily addressed in MLC problems as it is in single-label classification.

To mitigate this issue, we applied a systematic data arrangement strategy (for more details, see Supplementary Material D). The aim of this process is not to eliminate imbalance entirely, but to substantially reduce extreme dominance effects (particularly vessel noise) and to improve the representation of rare event combinations while preserving seasonal trends. The process was carried out stepwise by prioritising underrepresented events and combinations. The initial steps targeted rare combinations, such as delphinid burst pulses co-occurring with sonar signals. Then, segments with sonar and burst pulses without vessel noise were selected to reduce the risk of over-representing vessel noise. Next, delphinid tonal and click signals were included, with priority given to segments that did not contain vessel noise.

Following this, the dataset was expanded to include segments with high co-occurrence rates among the events. This ensured that the added segments contributed to a balanced increase in the counts of multiple events. Ambient noise segments were also included to provide a realistic baseline and ensure that the dataset adequately represents the full soundscape. Although ambient noise is presented as one of six audio event classes in our datasets, it is not treated as a separate class during training or inference; it is simply a record of when no class was detected. The curated dataset incorporates 15,520 audio segments; the distribution of the audio events within it is shown in Fig. 9.

Fig. 9: Distribution of audio events per season and their co-occurrences in the arranged dataset.

a Number of audio segments per sound class across four seasons: January, April, June, November. Stacked bars represent the contribution of each season, with different colours indicating individual seasons as shown in the legend. The total number of segments for each sound class is indicated above each bar. b Co-occurrence patterns between audio events in the dataset. Each panel shows the number of segments in which a given target event co-occurs with other audio events. The horizontal axes indicate the co-occurring sound classes, and the vertical axes represent the number of co-occurring segments. Each co-occurrence pair is shown only once across the panels to avoid duplication.

The curated data was randomly split into four folds. The number of training segments in each fold is 2904 ± 1, with 976 ± 1 segments in the validation data. This is arranged so that the percentage of labels in each fold and between the training and validation datasets is kept approximately consistent. Specifically, the percentages of labels in the ambient, delphinid tonal, clicks, burst pulse, vessel and sonar classes are, respectively, roughly 13.2%, 18.0%, 22.7%, 11.8%, 22.7% and 11.6% (the differences between the folds are <0.02% and the differences between the training and validation datasets are all <0.2%). For the number of audio segments in which each audio event is present across the folds, see Table S3.

All training and validation experiments reported in this study are conducted on the curated dataset described above. The full data mentioned in ‘Multi-labelling process’ is not used directly for model training but is presented to characterise the original data distribution (Table 2) and to motivate the data arrangement strategy. Accordingly, the reported results correspond to validation performance, as no independent hold-out test set was used.

Evaluation metrics

In MLC systems, the system outputs are vectors of predicted probabilities across the labels. Evaluation is carried out by comparing predicted label vectors and target labels in the validation datasets. In MLC, a prediction can be partially correct, meaning it may correctly predict some, but not all, of the true labels. To measure the performance of MLC systems, we use accuracy and exact match46.

Accuracy measures the proportion of correctly predicted individual class labels to the total number of labels in the label vectors across the audio segments.

$$\mathrm{Accuracy}(\mathbf{y},\hat{\mathbf{y}})=\frac{1}{NK}\sum_{s=1}^{N}\sum_{l=1}^{K}\left[y_{s}^{(l)}=\hat{y}_{s}^{(l)}\right]$$
(1)

where \(\mathbf{y}=\left\{\mathbf{y}_{s}\right\}\) is the set of target labels in the validation data comprising N segments, and \(\mathbf{y}_{s}\) is the binary vector for the \(s^{\mathrm{th}}\) segment indicating the presence or absence of each of the K classes. \(y_{s}^{(l)}\) represents the \(l^{\mathrm{th}}\) element of the label vector \(\mathbf{y}_{s}\), corresponding to the label for the \(l^{\mathrm{th}}\) class in segment s. The ‘hat’, as in \(\hat{y}_{s}^{(l)}\), denotes predictions generated from the systems. The Iverson bracket \(\left[\cdot\right]\) is the indicator function returning 1 if its argument is true and 0 otherwise.

An exact match defines the proportion of segments where the predictions for all the classes match the target labels exactly. It can be defined as:

$$\mathrm{ExactMatch}(\mathbf{y},\hat{\mathbf{y}})=\frac{1}{N}\sum_{s=1}^{N}\prod_{l=1}^{K}\left[y_{s}^{(l)}=\hat{y}_{s}^{(l)}\right]$$
(2)

The product over all classes ensures that the value is 1 only when all labels are predicted correctly for the segment; otherwise, it is 0. The exact match and accuracy can be related to each other under two assumptions: if the accuracy is the same for every class, denoted α, and the decisions for each class in a segment are statistically independent, then the exact match score is given by \(\alpha^{K}\). For consistency, a probability threshold of 0.5 is applied to the system outputs when making classification decisions for computing both accuracy and exact match scores.
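Both metrics follow directly from Eqs. (1) and (2) and can be sketched in a few lines of NumPy. The example data below are invented purely for illustration, the function names are hypothetical, and treating a probability of exactly 0.5 as a positive decision is an assumption (the text only states that 0.5 is used as the decision point).

```python
import numpy as np


def binarise(probabilities, threshold=0.5):
    """Convert predicted probabilities to 0/1 decisions (>= threshold is assumed)."""
    return (np.asarray(probabilities) >= threshold).astype(int)


def accuracy(y_true, y_pred):
    """Eq. (1): fraction of individual labels predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def exact_match(y_true, y_pred):
    """Eq. (2): fraction of segments whose whole label vector is correct."""
    return float(np.mean(np.all(np.asarray(y_true) == np.asarray(y_pred), axis=1)))


# Invented example: N = 3 segments, K = 5 classes.
y_true = np.array([[1, 0, 1, 1, 0],
                   [0, 0, 0, 0, 0],
                   [1, 1, 0, 1, 0]])
probs = np.array([[0.9, 0.2, 0.7, 0.8, 0.1],
                  [0.1, 0.3, 0.2, 0.6, 0.1],
                  [0.8, 0.4, 0.1, 0.9, 0.2]])
y_pred = binarise(probs)

acc = accuracy(y_true, y_pred)     # 13 of 15 labels correct
em = exact_match(y_true, y_pred)   # 1 of 3 segments fully correct

# Under the independence assumption, a per-class accuracy of 0.95 with K = 5
# would correspond to an exact match of 0.95 ** 5.
expected_em = 0.95 ** 5
```

In this example two segments each have one wrong label, so accuracy stays high (13/15) while exact match falls to 1/3, which illustrates why exact match is the stricter metric.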

Performance is also evaluated using segment-based precision-recall (PR) curves68, which are computed by averaging PR curves across the folds; the maxima and minima across the folds can also be computed to assess variability. The area under the curve (AUC, also known as average precision) for each class is also computed by averaging across the folds.
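For a single class, a PR curve and its average precision can be sketched as below. This is an illustrative NumPy implementation and is not the study's code: it treats every sample as its own threshold (tied scores are not merged), it assumes at least one positive target, and it uses the step-wise sum of precision weighted by recall increments as the area. The names `pr_curve` and `average_precision` are hypothetical and the scores are invented.

```python
import numpy as np


def pr_curve(y_true, scores):
    """Precision and recall after each sample, in order of decreasing score."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")
    ranked = y_true[order]
    true_pos = np.cumsum(ranked)
    false_pos = np.cumsum(1 - ranked)
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / ranked.sum()   # assumes at least one positive target
    return precision, recall


def average_precision(y_true, scores):
    """Area under the PR curve as a sum of precision times recall increments."""
    precision, recall = pr_curve(y_true, scores)
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(recall_steps * precision))


# Invented per-fold targets and scores for one class.
folds = [
    ([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.4, 0.2]),
    ([1, 1, 0, 0, 1], [0.9, 0.8, 0.3, 0.2, 0.7]),
]
per_fold_ap = [average_precision(y, s) for y, s in folds]
mean_ap = float(np.mean(per_fold_ap))   # fold-averaged value for this class
print(per_fold_ap, mean_ap)
```

Averaging the curves themselves, rather than only their areas, would additionally require interpolating each fold's precision onto a common recall grid.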

Data availability

The datasets generated and/or analysed during the current study are not publicly available due to data access restrictions associated with the COMPASS project, but are available from the corresponding author on reasonable request.

Code availability

Custom scripts were developed in Python for data preprocessing, model training, and evaluation. The code is maintained in a private repository and is not publicly available at this time. Access to the code can be provided by the corresponding author upon reasonable request. Model development and training were carried out using TensorFlow (v2.19.0) with the Keras API, together with standard scientific Python libraries including NumPy, SciPy, librosa, OpenCV, and scikit-learn.

References

  1. Mellinger, D., Stafford, K., Moore, S., Dziak, R. & Matsumoto, H. An overview of fixed passive acoustic observation methods for Cetaceans. Oceanography 20, 36–45 (2007).

  2. Baumgartner, M. F., Stafford, K. M. & Latha, G. Near real-time underwater passive acoustic monitoring of natural and anthropogenic sounds. In Springer Oceanography 203–226 (Springer, 2017).

  3. Todd, N. R. E. et al. Using passive acoustic monitoring to investigate the occurrence of cetaceans in a protected marine area in northwest Ireland. Estuar. Coast. Shelf Sci. 232, 106509 (2020).

  4. Hildebrand, J. A. Impacts of anthropogenic sound. In Marine Mammal Research: Conservation beyond Crisis (eds Reynolds, J. E. III, Perrin, W. F., Reeves, R. R., Montgomery, S. & Ragen, T. J.) 101–124 (The Johns Hopkins Univ. Press, 2005).

  5. Hastie, G. D., Wilson, B. & Thompson, P. M. Diving deep in a foraging hotspot: acoustic insights into bottlenose dolphin dive depths and feeding behaviour. Mar. Biol. 148, 1181–1188 (2005).

  6. Lin, T.-H., Yu, H.-Y., Chen, C.-F. & Chou, L.-S. Passive acoustic monitoring of the temporal variability of odontocete tonal sounds from a long-term marine observatory. PLoS ONE 10, e0123943 (2015).

  7. Todd, N. R. E., Jessopp, M., Rogan, E. & Kavanagh, A. S. Extracting foraging behavior from passive acoustic monitoring data to better understand harbor porpoise (Phocoena phocoena) foraging habitat use. Mar. Mammal. Sci. 38, 1623–1642 (2022).

  8. Marques, T. A., Thomas, L., Ward, J., DiMarzio, N. & Tyack, P. L. Estimating cetacean population density using fixed passive acoustic sensors: an example with Blainville’s beaked whales. J. Acoust. Soc. Am. 125, 1982–1994 (2009).

  9. Küsel, E. T. et al. Cetacean population density estimation from single fixed sensors using passive acoustics. J. Acoust. Soc. Am. 129, 3610–3622 (2011).

  10. Davis, G. E. et al. Exploring movement patterns and changing distributions of baleen whales in the western North Atlantic using a decade of passive acoustic data. Glob. Change Biol. 26, 4812–4840 (2020).

  11. Ellison, W. T., Southall, B. L., Clark, C. W. & Frankel, A. S. A new context-based approach to assess marine mammal behavioral responses to anthropogenic sounds. Conserv. Biol. 26, 21–28 (2011).

  12. Weir, C. R. & Dolman, S. J. Comparative review of the regional marine mammal mitigation guidelines implemented during industrial seismic surveys, and guidance towards a worldwide standard. J. Int. Wildl. Law Policy 10, 1–27 (2007).

  13. Heiler, J., Elwen, S. H., Kriesell, H. J. & Gridley, T. Changes in bottlenose dolphin whistle parameters related to vessel presence, surface behaviour and group composition. Anim. Behav. 117, 167–177 (2016).

  14. Branstetter, B. K. & Mercado, E. Sound localization by cetaceans. Int. J. Comp. Psychol. 19, 26–61 (2006).

  15. Janik, V. M. Cetacean vocal learning and communication. Curr. Opin. Neurobiol. 28, 60–65 (2014).

  16. Whitmore, F. C. & Sanders, A. E. Review of the Oligocene Cetacea. Syst. Zool. 25, 304 (1976).

  17. Sayigh, L. S. Cetacean Acoustic Communication 275–297 (Springer eBooks, 2013). https://doi.org/10.1007/978-94-007-7414-8_16.

  18. Herzing, D. L. Clicks, whistles and pulses: Passive and active signal use in dolphin communication. Acta Astronaut. 105, 534–537 (2014).

  19. Usman, A. M., Ogundile, O. O. & Versfeld, D. J. J. Review of automatic detection and classification techniques for cetacean vocalization. IEEE Access 8, 105181–105206 (2020).

  20. Yang, H., Lee, K., Choo, Y. & Kim, K. Underwater acoustic research trends with machine learning: general background. J. Ocean Eng. Technol. 34, 147–154 (2020).

  21. Harvey, M. Acoustic detection of humpback whales using a convolutional neural network. Google AI Blog (2018). Available at: https://ai.googleblog.com/2018/10/acoustic-detection-of-humpback-whales.html.

  22. Zhong, M. et al. Beluga whale acoustic signal classification using deep learning neural network models. J. Acoust. Soc. Am. 147, 1834–1841 (2020).

  23. Allen, A. N. et al. A convolutional neural network for automated detection of humpback whale song in a diverse, long-term passive acoustic dataset. Front. Mar. Sci. 8, 165 (2021).

  24. Belghith, E. H., Rioult, F. & Bouzidi, M. Acoustic diversity classifier for automated marine big data analysis. HAL (Le Centre pour la Communication Scientifique Directe) 130–136. https://doi.org/10.1109/ictai.2018.00029 (2018).

  25. Ibrahim, A. K. et al. Transfer learning for efficient classification of grouper sound. J. Acoust. Soc. Am. 148, EL260–EL266 (2020).

  26. White, E. L. et al. More than a whistle: automated detection of marine sound sources with a convolutional neural network. Front. Mar. Sci. 9, 879145 (2022).

  27. Cakir, E., Parascandolo, G., Heittola, T., Huttunen, H. & Virtanen, T. Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 25, 1291–1303 (2017).

  28. Hou, Y., Kong, Q., Wang, J. & Li, S. Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units. In Detection and Classification of Acoustic Scenes and Events (2018).

  29. Chan, T. K. & Chin, C. S. A comprehensive review of polyphonic sound event detection. IEEE Access 8, 103339–103373 (2020).

  30. Moore, S. E. et al. A new framework for assessing the effects of anthropogenic sound on marine mammals in a rapidly changing Arctic. BioScience 62, 289–295 (2012).

  31. Wright, A. J. et al. Do marine mammals experience stress related to anthropogenic noise? Int. J. Compar. Psychol. 20, 274–316 (2007).

  32. Luaces, O., Díez, J., Barranquero, J., del Coz, J. J. & Bahamonde, A. Binary relevance efficacy for multilabel classification. Prog. Artif. Intell. 1, 303–313 (2012).

  33. Ganda, D. & Buch, R. A survey on multi label classification. Recent Trends Program. Lang. 5, 19–23 (2018).

  34. Douibi, K., Settouti, N., Chikh, M. A., Read, J. & Benabid, M. M. An analysis of ambulatory blood pressure monitoring using multi-label classification. Australas. Phys. Eng. Sci. Med. 42, 65–81 (2018).

  35. Yang, Z. & Emmert-Streib, F. Optimal performance of binary relevance CNN in targeted multi-label text classification. Knowl. Based Syst. 284, 111286 (2024).

  36. Cakir, E., Heittola, T., Huttunen, H. & Virtanen, T. Polyphonic sound event detection using multi label deep neural networks. In Proc. 2015 International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/ijcnn.2015.7280624 (2015).

  37. Ibrahim, M., Sagers, J. D., Ballard, M. S., Le, M. & Koutsomitopoulos, V. Evaluating machine learning architectures for sound event detection for signals with variable signal-to-noise-ratios in the Beaufort Sea. J. Acoust. Soc. Am. 154, 2689–2707 (2023).

  38. Imoto, K. et al. Sound event detection by multitask learning of sound events and scenes with soft scene labels. In Proc. ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 621–625 (IEEE, 2020). https://doi.org/10.1109/icassp40776.2020.9053912

  39. Zhang, X., Zhang, Q.-W., Yan, Z., Liu, R. & Cao, Y. Enhancing label correlation feedback in multi-label text classification via multi-task learning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 1190–1200 (2021).

  40. Phan, H., Nguyen, T. N. T., Koch, P. & Mertins, A. Polyphonic Audio Event Detection: Multi-Label or Multi-Class Multi-Task Classification Problem? In ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8877–8881 https://doi.org/10.1109/icassp43922.2022.9746402 (2022).

  41. Ruder, S. An Overview of Multi-Task Learning in Deep Neural Networks. Preprint at https://arxiv.org/abs/1706.05098 (2017).

  42. Caruana, R. Multitask learning: a knowledge-based source of inductive bias. In Proc. Tenth International Conference on Machine Learning 41–48 (Morgan Kaufmann, 1993).

  43. Misra, I., Shrivastava, A., Gupta, A. & Hebert, M. Cross-Stitch Networks for Multi-task Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3994–4003. https://doi.org/10.1109/CVPR.2016.433 (2016).

  44. Ruder, S., Bingel, J., Augenstein, I. & Søgaard, A. Latent Multi-Task Architecture Learning. In Proc. AAAI Conference on Artificial Intelligence, Vol. 33, 4822–4829 (2019).

  45. Gao, Y., Ma, J., Zhao, M., Liu, W. & Yuille, A. L. NDDR-CNN: Layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3200–3209 (2019).

  46. Zhang, L., Towsey, M., Xie, J., Zhang, J. & Roe, P. Using multi-label classification for acoustic pattern detection and assisting bird species surveys. Appl. Acoust. 110, 91–98 (2016).

  47. Zhang, Y. & Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 5, 30–43 (2017).

  48. Zhang, Y., Wei, Y. & Yang, Q. Learning to multitask. Advances in Neural Information Processing Systems 31 (ACM, 2018).

  49. Meire, M., Karsmakers, P. & Vuegen, L. The impact of missing labels and overlapping sound events on multi-label multi-instance learning for sound event classification. Lirias (KU Leuven) 159–163 (2019).

  50. White, E. L., Klinck, H., Bull, J. M., White, P. R. & Risch, D. One size fits all? Adaptation of trained CNNs to new marine acoustic environments. Ecol. Inform. 78, 102363 (2023).

  51. Read, J., Pfahringer, B., Holmes, G. & Frank, E. Classifier chains for multi-label classification. Mach. Learn. 85, 333–359 (2011).

  52. Oswald, J. N. et al. Species information in whistle frequency modulation patterns of common dolphins. Philos. Trans. R. Soc. B: Biol. Sci. 376, 20210046 (2021).

  53. Quick, N. J. & Janik, V. M. Whistle rates of wild bottlenose dolphins (Tursiops truncatus): Influences of group size and behavior. J. Comp. Psychol. 122, 305–311 (2008).

  54. Napoli, A. & White, P. R. Unsupervised domain adaptation for the cross-dataset detection of humpback whale calls. In Detection and Classification of Acoustic Scenes and Events (DCASE 2023) (Tampere, Finland, 2023).

  55. Olcay, A. et al. Sounds of the deep: how input representation, model choice, and dataset size influence underwater sound classification performance. J. Acoust. Soc. Am. 157, 3017–3032 (2025).

  56. Buda, M., Maki, A. & Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018).

  57. Yang, S., Xue, L., Hong, X. & Zeng, X. A lightweight network model based on an attention mechanism for ship-radiated noise classification. J. Mar. Sci. Eng. 11, 432 (2023).

  58. Wang, Y., Getreuer, P., Hughes, T., Lyon, R. F. & Saurous, R. A. Trainable frontend for robust and far-field keyword spotting. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5670–5674. https://doi.org/10.1109/icassp.2017.7953242 (2017).

  59. Lostanlen, V. et al. Per-channel energy normalization: why and how. IEEE Signal Process. Lett. 26, 39–43 (2018).

  60. Cakir, E. & Virtanen, T. End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input. In IEEE International Joint Conference on Neural Networks (IJCNN), 1–7. (IEEE, 2018). https://doi.org/10.1109/ijcnn.2018.8489470

  61. Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. Machine Learning Research, Vol. 37, 448–456 (JMLR.org, 2015).

  62. Fukushima, K. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Trans. Syst. Sci. Cybern. 5, 322–333 (1969).

  63. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).

  64. Schuster, M. & Paliwal, K. K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681 (1997).

  65. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proc. International Conference on Learning Representations, 1–13 (ICLR, 2015).

  66. Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. PMLR 9, 249–256 (2010).

  67. Chollet, F. Keras: the Python deep learning library. Astrophysics Source Code Library (Scientific Research, 2018).

  68. Mesaros, A., Heittola, T. & Virtanen, T. Metrics for polyphonic sound event detection. Appl. Sci. 6, 162 (2016).

Acknowledgements

This research was supported by the COMPASS project, which is funded by the EU’s INTERREG VA Programme, managed by the Special EU Programmes Body. The views and opinions expressed in this document do not necessarily reflect those of the European Commission or the Special EU Programmes Body (SEUPB). We thank Suzanne Beck (Agri-Food and Biosciences Institute), Susanna Quer and Ewan Edwards (Marine Scotland Science) for their contributions to data planning and acquisition. Additionally, Abdullah Olcay acknowledges the financial support provided by the Ministry of National Education of Türkiye.

Funding

The authors received no specific funding for this work.

Author information

Contributions

A.O., P.W. and J.B. conceptualised the study. A.O. organised the dataset for model training, wrote the code, and analysed the results. At every stage of the study, P.W. and J.B. contributed to data arrangement, method development and analyses. A.O. wrote the first manuscript draft, with P.W., J.B., E.W. and B.D. contributing to all sections of the manuscript. D.R. conceptualised the COMPASS project and carried out data collection and survey design. All authors reviewed the manuscript.

Corresponding author

Correspondence to Abdullah Olcay.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary (download PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Olcay, A., White, P.R., Bull, J.M. et al. How to analyse overlapping sounds in the marine environment using supervised multi-label classification. npj Acoust. 2, 22 (2026). https://doi.org/10.1038/s44384-026-00060-x

Source: Ecology - nature.com
