    Sperm whale acoustic abundance and dive behaviour in the western North Atlantic

Data collection

Between June 27 and August 25, 2016, 6600 km of simultaneous visual and passive acoustic line transect surveys were completed on the National Oceanic and Atmospheric Administration (NOAA) ship Henry B. Bigelow [5]. Survey effort was distributed along sawtooth track lines spanning the continental slope from Virginia (US) to the southern tip of Nova Scotia (Canada) (36–42° N) and on several larger track lines over the abyssal plain. Two teams of visual observers independently recorded sightings of marine mammals using high-powered Fujinon binoculars (25 × 150; Fujifilm, Valhalla, NY) and logged environmental conditions (e.g. sea state) every 30 min.

The speed of sound in water was measured three times each day (morning, noon, evening) by recording conductivity, temperature, and depth (CTD) at specific intervals in the water column. The sound speed closest to the depth of the towed hydrophone array was extracted. On alternating survey days, Simrad EK60 single beam scientific echosounders operating at frequencies of 18, 38, 70, 120 and 200 kHz were used to collect active acoustic data.

When possible during daylight hours (06:00–18:00 ET), passive acoustic data were collected continuously using a custom-built linear array composed of eight hydrophone elements and a depth sensor (Keller America Inc. PA7FLE, Newport News, VA) within two oil-filled modular sections separated by 30 m of cable (Fig. 1). The array was towed 300 m behind the vessel at approximately 5–10 m depth while the vessel was in waters more than 100 m deep and underway at speeds of 16–20 km/h. For more details see DeAngelis et al. [31]; the only change was that two APC hydrophones and one Reson hydrophone in the aft section were replaced with HTI-96-Min hydrophones (High Tech, Inc., Long Beach, MS). The HTI hydrophones had a flat frequency response from 1 to 30 kHz (−167 dB re V/µPa ± 1.5 dB). Recordings were made using the acoustical software PAMGuard (v.1.15.02) [34].
This analysis used the data recorded by the last two hydrophones in the array sampled at 192 kHz (MF5 and MF6).

Figure 1. The linear towed array included eight hydrophone elements and a depth sensor within two oil-filled modular sections separated by 30 m of cable. Six hydrophones sampled at 192 kHz (MF1–MF6) and two sampled at 500 kHz. The hydrophones were connected to two National Instruments sound cards (NI-USB-6356). A high pass filter of 1 kHz was applied by the recording system to reduce the amount of vessel noise in the recordings. This analysis used the passive acoustic data from MF5 and MF6. The schematic is not to scale.

Click detection and 2D event localization

The passive acoustic data were filtered using a Butterworth band pass filter (4th order) between 2 and 20 kHz and decimated to 96 kHz to improve sperm whale click resolution. Clicks were automatically detected using the PAMGuard (v.2.01.03) general sperm whale click detector with a trigger threshold of 12 dB.

Using PAMGuard’s bearing time display, all detections were reviewed to classify click types and mark click trains as “events” based on consistent changes in bearing, audible sound, ICI and spectral characteristics. Each event was marked to an individual level, tracking a whale from the first to the last detected click [15,35]. All events containing usual clicks were localized with the 2D simplex optimization algorithm of PAMGuard’s Target Motion Analysis (TMA) module. For further analysis, events were truncated at a slant range of 6500 m (Supplementary Fig. S1).

Echosounder analysis

A regression analysis was run using the R package MASS [36]. To account for overdispersion, a negative binomial generalized linear model (GLM) with a log link function was applied to a dataset of the daily acoustic detections [33].
Echosounder state (active versus passive), month (June, July, August), and habitat type (slope or abyssal) were included as covariates, with the total number of daily detections as the response variable. The track line distance covered per day was used as an offset for effort. The best fitting model was selected based on backwards stepwise selection using Akaike’s information criterion (AIC) and the single-term deletion method using Chi-squared goodness-of-fit tests [37].

3D localization

Extracting a .wav clip for each click and attributing metadata

An automated process was developed using the R package PAMpal [38] (v. 0.14.0) to extract the time of each click in the marked events from PAMGuard databases, generate a .wav clip for each click, and attribute all metadata (e.g., event 2D localization, array depth, radial distance, sea state, and sound speed) necessary for estimating the click depth.

Slant delay

Using the methods established by DeAngelis et al. [31] and custom Matlab R2021a (MathWorks Inc., Natick, MA) scripts, the multipath arrivals of clicks and surface reflected echoes were used to mathematically convert the linear array into a 2D planar array and estimate 3D localizations. Using the .wav clips exported from PAMpal, the time delay between the click and the corresponding surface reflected echo, known as the slant delay, was measured via autocorrelation. Within the autocorrelation solution’s envelope of correlation values, the optimal slant delay was measured using the peak with the highest correlation value above a threshold of 0.02 and within an expected time window after the direct click of 0.0005–0.015 s. Although theoretically a surface reflected echo could have arrived less than a millisecond ( 5 min) were categorized as U shaped, and as shallow ( 1600 m) based on the maximum click depth (Fig.
3).

Figure 3. Example of click depths (m) over time (min) for events categorized as (a) U shaped and shallow ( 1600 m).

Click depths were then binned at 400 m intervals to account for an animal’s unknown horizontal movement over time as well as uncertainty in the estimated click depths, and the total time an animal spent within each depth bin was calculated. For each event with a U shaped click depth pattern, the depth bin in which the bottom phase occurred [6] was determined. Finally, to assess if a whale was diving in the water column or close to the seafloor, the depth bin in which the 90th percentile of the click depths was recorded was compared to the bin including the seafloor depth. If the whale was more than 400 m above the seafloor, it was determined to be diving in the water column.

Distance sampling

Depth-corrected average horizontal perpendicular distances

For each event, a depth-corrected average horizontal perpendicular distance was calculated using the TMA derived perpendicular slant range and the average depth or an assumed depth in the Pythagorean theorem [19,31]. The weighted mean, first quartile, and third quartile of the average depths were tested as assumed depths for events excluded from 3D localization. If depth was greater than or equal to the slant range, the perpendicular distance was coerced to 0, indicating the whale was diving directly below the track line. The resulting distribution of depth-corrected perpendicular distances that aligned most with distance sampling theory was used in the final distance analysis.

Acoustic density and abundance estimation

The R package Distance [41,42] (v.1.0.4) was used to estimate two separate detection functions based on the uncorrected slant ranges and the depth-corrected perpendicular distances. Half-normal, uniform, and hazard rate key functions were tested with cosine, simple polynomial, and Hermite polynomial adjustment terms.
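As a rough illustration of the depth correction and of two of the key functions named above, the following Python sketch applies the Pythagorean theorem to a slant range and a depth, then evaluates a half-normal and a hazard rate key function in their textbook forms. It is not the authors' code: the analysis itself was run with the R package Distance, and the function names, the scale `sigma_m`, the shape `shape_b`, and the example values are hypothetical.

```python
import math


def depth_corrected_distance(slant_range_m: float, depth_m: float) -> float:
    """Horizontal perpendicular distance from a slant range and a depth.

    Uses the Pythagorean theorem; if the depth is greater than or equal to
    the slant range, the distance is coerced to 0 (whale below the track line).
    """
    if depth_m >= slant_range_m:
        return 0.0
    return math.sqrt(slant_range_m ** 2 - depth_m ** 2)


def half_normal_key(distance_m: float, sigma_m: float) -> float:
    """Half-normal key function g(x) = exp(-x^2 / (2 sigma^2))."""
    return math.exp(-distance_m ** 2 / (2.0 * sigma_m ** 2))


def hazard_rate_key(distance_m: float, sigma_m: float, shape_b: float) -> float:
    """Hazard rate key function g(x) = 1 - exp(-(x / sigma)^(-b))."""
    if distance_m <= 0.0:
        return 1.0  # detection is assumed certain on the track line
    return 1.0 - math.exp(-((distance_m / sigma_m) ** (-shape_b)))


# Hypothetical event: 5000 m slant range, 3000 m average click depth.
x = depth_corrected_distance(5000.0, 3000.0)
print(x)                                   # horizontal distance in metres
print(half_normal_key(x, sigma_m=2000.0))  # detection probability at x
```

Ignoring depth would place this hypothetical whale 5000 m from the track line instead of 4000 m, which is why the slant ranges and the depth-corrected distances give different detection functions.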
The best fitting models were selected based on the AIC, the Kolmogorov–Smirnov (K–S) test, the Cramer-von Mises (CvM) test, quantile–quantile plots, and visual review of the fitted models [43].

    A real-time rural domestic garbage detection algorithm with an improved YOLOv5s network model

Attention combination mechanism

The attention combination mechanism addresses three problems: the difficulty of extracting features from target areas in images, the high computational cost of the model, and low detection accuracy. As shown in Fig. 3, we introduce the lightweight feedforward convolutional attention module CBAM after the Focus module of the backbone network of the YOLOv5s network model. The SE-Net (Squeeze and Excitation Networks) channel attention module is placed at the end of the backbone network. We propose an attention combination mechanism based on the YOLOv5s network model and name the improved network model YOLOv5s-CS. The CBAM module has a channel number of 128, a convolutional kernel size of 3 and a step size of 2; the SELayer has a channel number of 1024 and a step size of 4.

Figure 3. YOLOv5 backbone network structure before and after improvement.

Convolutional block attention module network

In 2018, Woo et al. [25] proposed the lightweight feedforward convolutional attention module CBAM. CBAM attends to feature information along both the channel and spatial dimensions and combines the two to obtain more comprehensive and reliable attention information [26]. CBAM consists of two submodules, the channel attention module (CAM) and the spatial attention module (SAM), and its overall module structure is shown in Fig. 4a.

Figure 4. Principle of CBAM.

The working principle of the CAM is shown in Fig. 4b. First, the feature map F is fed to the module. Second, global max pooling and global average pooling are applied over the width and height of the feature map to obtain two feature maps of the same size. Third, the two feature maps are fed to the shared-parameter network MLP at the same time.
Finally, the two outputs of the shared-parameter network are summed and passed through a sigmoid activation function to obtain the channel attention features \(\text{M}_{\text{c}}\).

The channel attention module CAM is calculated as shown in Formula (1):

$$\text{M}_{\text{c}}(\text{F}) = \sigma\left(\text{MLP}(\text{AvgPool}(\text{F})) + \text{MLP}(\text{MaxPool}(\text{F}))\right) = \sigma\left(\text{W}_{1}(\text{W}_{0}(\text{F}_{\text{avg}}^{\text{c}})) + \text{W}_{1}(\text{W}_{0}(\text{F}_{\text{max}}^{\text{c}}))\right)$$
where σ represents the sigmoid function, MLP represents the shared-parameter network, \(\text{W}_{0}\) and \(\text{W}_{1}\) represent the shared weights, \(\text{F}_{\text{avg}}^{\text{c}}\) is the result of feature map F after global average pooling, and \(\text{F}_{\text{max}}^{\text{c}}\) is the result of feature map F after global max pooling.

The working principle of SAM is shown in Fig. 4c. The feature map F′ is the input of the SAM. F′ is obtained by multiplying the feature map F element-wise by the output of CAM. First, max pooling and average pooling are applied along the channel dimension of the feature map to obtain two feature maps of the same size. Second, the two pooled feature maps are concatenated along the channel dimension and reduced to a single channel by a convolution to obtain a new feature map. Finally, the spatial attention features \(\text{M}_{\text{s}}\) are generated using the sigmoid activation function.

The spatial attention module (SAM) is calculated as shown in Formula (2):

$$\text{M}_{\text{s}}(\text{F}) = \sigma\left(\text{f}^{7 \times 7}\left(\left[\text{AvgPool}(\text{F}); \text{MaxPool}(\text{F})\right]\right)\right) = \sigma\left(\text{f}^{7 \times 7}\left(\left[\text{F}_{\text{avg}}^{\text{s}}; \text{F}_{\text{max}}^{\text{s}}\right]\right)\right)$$
where σ is the sigmoid function, \(\text{f}^{7 \times 7}\) denotes the convolution operation with a filter size of 7 × 7, \(\text{F}_{\text{avg}}^{\text{s}}\) is the result of the feature map after average pooling, and \(\text{F}_{\text{max}}^{\text{s}}\) is the result of the feature map after max pooling.

Squeeze and excitation network

In 2018, Hu et al. [27] proposed the single-path attention network structure SE-Net. SE-Net uses an attention mechanism to model the relationships between feature maps, adaptively learns the importance of each feature map [28], and then reweights the original feature maps according to that importance. In this way, SE-Net pays more attention to the features that are useful for the target task while suppressing useless feature information, and allocates computational resources rationally across channels so that the trained model achieves better results.

The SE-Net attention module is mainly composed of two parts: the squeeze operation and the excitation operation. The structure of the SE-Net module is shown in Fig. 5.

Figure 5. The SE-Net module structure.

First, the squeeze operation uses global average pooling to encode all spatial features of a channel as a global feature. Second, each feature map is thereby compressed into a real number that carries global information on that feature map. Finally, the squeeze results of all feature maps are combined into a vector that serves as the weights of each group of feature maps. It is calculated as shown in Eq. (3):

$$\text{z}_{\text{c}} = \text{F}_{\text{sq}}\left(\text{u}_{\text{c}}\right) = \frac{1}{\text{H} \times \text{W}}\sum_{\text{i}=1}^{\text{H}}\sum_{\text{j}=1}^{\text{W}}\text{u}_{\text{c}}\left(\text{i},\text{j}\right)$$
where H is the height of the feature map, W is the width of the feature map, u is the result after convolution, z is the global attention information of the corresponding feature map, and the subscript c indexes the channel.

After the squeeze operation has produced the channel information, the feature vector is subjected to the excitation operation. First, it passes through two fully connected layers. Second, the sigmoid function is applied. Finally, the output weights are assigned to the original features. It is calculated as shown in Eqs. (4) and (5):

$$\text{s} = \text{F}_{\text{ex}}\left(\text{z},\text{W}\right) = \sigma\left(\text{g}\left(\text{z},\text{W}\right)\right) = \sigma\left(\text{W}_{2}\,\delta\left(\text{W}_{1}\text{z}\right)\right)$$
$$\widetilde{\text{x}}_{\text{c}} = \text{F}_{\text{scale}}\left(\text{u}_{\text{c}}, \text{s}_{\text{c}}\right) = \text{s}_{\text{c}}\,\text{u}_{\text{c}}$$
where σ is the sigmoid activation function, δ represents the ReLU activation function, and \(\text{W}_{1}\) and \(\text{W}_{2}\) represent two different fully connected layers. The vector s represents the set of feature mapping weights obtained through the fully connected layers and the activation functions. \(\widetilde{\text{x}}_{\text{c}}\) is the rescaled feature map of channel c, \(\text{s}_{\text{c}}\) is a weight, and \(\text{u}_{\text{c}}\) is a two-dimensional matrix.

Target detection layer

Garbage in rural areas often consists of small targets with few pixel features, such as capsules and button batteries. Therefore, we insert a small target detection layer into the head network structure of the original YOLOv5s network model to reduce the missed detections of small objects in the original network model. The YOLOv5s network structure with the addition of the small target detection layer is shown in Fig. 6 and named YOLOv5s-STD.

Figure 6. The YOLOv5s-STD network structure.

In the seventeenth layer of the neck network, operations such as upsampling are performed on the feature maps so that the feature maps continue to expand. Meanwhile, in the twentieth layer, the feature maps obtained from the neck network are fused with the feature maps extracted from the backbone network. We insert a detection layer capable of predicting small targets in the thirty-first layer. To improve the detection accuracy, we use a total of four detection layers for the output feature maps, which are capable of detecting smaller target objects. In addition to the three initial sets of anchor values of the original model, an additional set of anchor values is added to detect smaller targets.
The anchor values of the improved YOLOv5s network model are set to [5, 6, 8, 14, 15, 11], [10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119] and [116, 90, 156, 198, 373, 326].

Bounding box regression loss function

The loss function is an important indicator of the generalization ability of a model. In 2016, Yu et al. [29] proposed the IoU loss function for bounding box prediction. IoU stands for intersection over union, which is a frequently used metric in target detection. It is used not only to determine the positive and negative samples, but also to determine the similarity between the predicted bounding box and the ground truth bounding box. It can be described as shown in Eq. (6):

$$\text{IoU} = \frac{\left|\text{A} \cap \text{B}\right|}{\left|\text{A} \cup \text{B}\right|}$$
where IoU takes values in [0, 1], and A and B are two arbitrary regions. Additionally, when IoU is used as a loss function, it is scale invariant, as shown in Eq. (7):

$$\text{IoU\_Loss} = 1 - \frac{\left|\text{A} \cap \text{B}\right|}{\left|\text{A} \cup \text{B}\right|}$$
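Eqs. (6) and (7) can be sketched in a few lines of Python for axis-aligned boxes. This is a minimal illustration, not the YOLOv5 implementation; the `(x1, y1, x2, y2)` corner format and the function names are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(box_a, box_b):
    """IoU loss as in Eq. (7): 1 - IoU."""
    return 1.0 - iou(box_a, box_b)


a, b = (0, 0, 2, 2), (1, 1, 3, 3)
print(iou(a, b))                          # overlap 1, union 7
print(iou((0, 0, 20, 20), (10, 10, 30, 30)))  # same boxes scaled by 10
print(iou_loss((0, 0, 1, 1), (5, 5, 6, 6)))   # disjoint boxes
```

Scaling both boxes by the same factor leaves the value unchanged, which is the scale invariance mentioned above, while any pair of disjoint boxes gives a loss of exactly 1 regardless of how far apart they are.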
However, when the prediction bounding box and the ground truth bounding box do not intersect, namely IoU = 0, the distance between regions A and B is not reflected. The loss function then provides no useful gradient and cannot be used to optimize the two disjoint bounding boxes. Likewise, when two boxes intersect at different positions with the same overlapping area but in different directions, IoU cannot distinguish between the cases.

To address these issues, Rezatofighi et al. proposed GIoU (Generalized Intersection over Union) [30] in 2019, which adds the smallest rectangle C enclosing A and B. Suppose the prediction bounding box is B, the ground truth bounding box is A, the area where A and B intersect is D, and the area containing the two bounding boxes is C, as shown in Fig. 7.

Figure 7. GIoU evaluation chart.

Then, the GIoU calculation, as shown in Formula (8), is:

$$\text{GIoU} = \text{IoU} - \frac{\left|\text{C} - \left(\text{A} \cup \text{B}\right)\right|}{\left|\text{C}\right|}$$
The GIoU_Loss is calculated as (9):

$$\text{GIoU\_Loss} = 1 - \text{GIoU} = 1 - \text{IoU} + \frac{\left|\text{C} - \left(\text{A} \cup \text{B}\right)\right|}{\left|\text{C}\right|}$$
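A small Python sketch of Eqs. (8) and (9), taking GIoU_Loss = 1 − GIoU, shows how the enclosing box C restores a distance signal for disjoint boxes. The `(x1, y1, x2, y2)` box format, the function names, and the example boxes are assumptions made for illustration; boxes are assumed to have positive area.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned rectangle C enclosing both boxes.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (c_area - union) / c_area


def giou_loss(box_a, box_b):
    """GIoU loss: 1 - GIoU."""
    return 1.0 - giou(box_a, box_b)


truth = (0, 0, 1, 1)
near = (2, 0, 3, 1)   # disjoint, one unit away
far = (9, 0, 10, 1)   # disjoint, eight units away
print(giou_loss(truth, near), giou_loss(truth, far))
```

Both predictions have IoU = 0, yet the farther box receives the larger GIoU loss, so the loss can still guide a disjoint prediction towards the ground truth.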
The original YOLOv5 algorithm uses GIoU_Loss as the loss function. Comparing Eqs. (6) and (8), it can be seen that GIoU adds a new penalty term \(\frac{\left|\text{C} - \left(\text{A} \cup \text{B}\right)\right|}{\left|\text{C}\right|}\) to IoU, as illustrated in Fig. 7.

Although the GIoU loss function addresses the missing gradient of the IoU loss function for non-intersecting boxes and its insensitivity to the direction of overlap, it still has disadvantages, as shown in Fig. 8.

Figure 8. Comparison of loss values.

Figure 8 shows three different positional relationships that arise when the predicted bounding box lies completely inside the ground truth bounding box. The green ground truth bounding box has a length-to-width ratio of 1:2, and the red predicted bounding box has the same aspect ratio but only one-half the size of the ground truth bounding box. In this situation the GIoU degenerates to the IoU, and the GIoU value and IoU value are 0.45 in all three cases. The GIoU loss function therefore does not directly reflect the distance between the prediction bounding box and the ground truth bounding box. Therefore, we introduce the CIoU (Complete Intersection over Union) [31] loss function to replace the original GIoU loss function in the YOLOv5 algorithm and continue to optimize the prediction bounding box.

The CIoU is calculated as (10):

$$\text{CIoU} = \text{IoU} - \frac{\rho^{2}\left(\text{b},\text{b}^{\text{gt}}\right)}{\text{c}^{2}} - \alpha \text{v}$$
where b and \(\text{b}^{\text{gt}}\) denote the centroids of the prediction bounding box and the ground truth bounding box, respectively, \(\rho\) is the Euclidean distance between the two centroids, and c is the diagonal length of the smallest enclosing region formed by the prediction bounding box and the ground truth bounding box. \(\alpha\) is the parameter used to balance the scale, and v measures the consistency of the aspect ratio between the prediction bounding box and the ground truth bounding box, as shown in Eqs. (11) and (12).

$$\alpha = \frac{\text{v}}{\left(1 - \text{IoU}\right) + \text{v}}$$
$$\text{v} = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{\text{gt}}}{\text{h}^{\text{gt}}} - \arctan\frac{\omega^{\text{p}}}{\text{h}^{\text{p}}}\right)^{2}$$
Therefore, the expression of CIoU_Loss can be obtained according to Eqs. (10), (11) and (12).

$$\text{CIoU\_Loss} = 1 - \text{CIoU} = 1 - \text{IoU} + \frac{\rho^{2}\left(\text{b},\text{b}^{\text{gt}}\right)}{\text{c}^{2}} + \alpha \text{v}$$
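The CIoU loss can be sketched directly from the definitions of \(\rho\), c, \(\alpha\) and v. The Python below is an illustrative sketch rather than the YOLOv5 code: the `(x1, y1, x2, y2)` box format, the function name, the guard for the zero-denominator case in \(\alpha\), and the example boxes are assumptions, and boxes are assumed to have positive width and height.

```python
import math


def ciou_loss(pred, gt):
    """CIoU loss: 1 - IoU + rho^2 / c^2 + alpha * v, boxes as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / union

    # Squared distance between the two box centres (rho^2).
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # Squared diagonal of the smallest enclosing box (c^2).
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 \
       + (max(py2, gy2) - min(py1, gy1)) ** 2

    # Aspect-ratio consistency v and its balancing weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    denom = (1 - iou) + v
    alpha = v / denom if denom > 0 else 0.0  # identical boxes: no penalty

    return 1 - iou + rho2 / c2 + alpha * v


gt = (0, 0, 2, 4)              # 1:2 ground truth box
centred = (0.5, 1, 1.5, 3)     # same aspect ratio, centred inside gt
corner = (0, 0, 1, 2)          # same size, pushed into a corner of gt
print(ciou_loss(centred, gt), ciou_loss(corner, gt))
```

Both hypothetical predictions lie inside the ground truth box with the same IoU, so their GIoU losses coincide, but the centre-distance term gives the corner placement the larger CIoU loss.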
Optimization algorithm

After optimizing the loss function of the network model, the next step is to optimize the parameters of the network model. The function of the optimizer is to adjust the parameters to the most appropriate values while making the loss function converge as far as possible [32]. In the target detection algorithm, the optimizer is mainly used to calculate the gradient of the loss function and to iteratively update the parameters.

The optimizer used in YOLOv5 is stochastic gradient descent (SGD). Since many problems in deep learning satisfy the strict saddle property, nearly all the local optima obtained are almost as good as the global optimum. The SGD algorithm therefore does not become trapped at saddle points and is widely applicable. However, SGD converges slowly and needs many iterations, which leaves room for improvement. The Adam algorithm combines the first-order momentum of the SGD algorithm with the second-order momentum of the AdaGrad and AdaDelta algorithms (adaptive plus momentum). The Adam formulas can be described as follows:

$${m}_{t} = {\beta}_{1}{m}_{t-1} + \left(1 - {\beta}_{1}\right){g}_{t}$$
$${v}_{t} = {\beta}_{2}{v}_{t-1} + \left(1 - {\beta}_{2}\right){g}_{t}^{2}$$

$${\widehat{m}}_{t} = \frac{{m}_{t}}{1 - {\beta}_{1}^{t}}$$

$${\widehat{v}}_{t} = \frac{{v}_{t}}{1 - {\beta}_{2}^{t}}$$
where \({\beta}_{1}\) and \({\beta}_{2}\) are hyperparameters, g is the current gradient value of the error function, \({m}_{t}\) is the first-order momentum (first-moment estimate of the gradient) and \({v}_{t}\) is the second-order momentum (second-moment estimate of the gradient).

Adam is a first-order optimization algorithm for stochastic objective functions based on adaptive estimates of low-order moments. It can replace the traditional stochastic gradient descent process. It updates the weights of the neural network adaptively based on the data seen during the iterative training process. The Adam optimizer occupies few memory resources during training and is suitable for problems with sparse gradients and large fluctuations in loss values [33]. Therefore, we use the Adam optimization algorithm instead of the SGD optimization algorithm to train the network model based on the YOLOv5s network model. The calculation is shown in Table 3.

Table 3. Computing method of the Adam optimizer.

where \(\alpha\) is a factor controlling the learning rate of the network, \(\beta^{\prime}\) is the exponential decay rate of the first-order moment estimate, \(\beta^{\prime\prime}\) is the exponential decay rate of the second-order moment estimate, and \(\varepsilon\) is a very small constant that keeps the denominator from being zero.
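The moment estimates and bias corrections above can be put together in a short Python sketch. The final parameter update is the standard Adam rule and is an assumption here, since the contents of Table 3 are not reproduced in this text. The function name, the default hyperparameter values, and the toy objective f(x) = x² are likewise hypothetical choices for illustration.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad        # first-order momentum m_t
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-order momentum v_t
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    # Standard Adam update (assumed; Table 3 is not available here).
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem: minimise f(x) = x^2, whose gradient is 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.1)
print(x)  # close to the minimiser at 0
```

Because the step is normalised by the square root of the second-moment estimate, the first update moves the parameter by roughly the learning rate whatever the gradient scale, which is the adaptive behaviour described above.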


