Toward autonomous weed management systems in sugarcane crops and an assessment of technological readiness


Abstract

Weeds compete with crop plants for space, nutrients, sunlight, and soil moisture, reducing crop yields during the early weeks after emergence. Controlling weeds in perennial crops like sugarcane is challenging and typically addressed by herbicides and mechanical tillage. This work focuses on weed detection in sugarcane crops. We provide an in-field dataset as a benchmark, evaluate deep learning architectures for object detection and classification, and conduct a bounding-box-guided segmentation study with both qualitative and quantitative evaluation. For detection, a 44.2% AP50 score was achieved by combining RTMDet, an architecture employing large-kernel depth-wise convolution, a loss function incorporating geometric constraints, and feature pyramid networks. For classification, we leveraged a Swin Transformer with self-supervised pre-training, achieving 99% accuracy. We compared segmentation performance among SAM, ExGR, and S2C approaches, both qualitatively and quantitatively, using manually annotated pixel-level ground truth on the test set. Although significant progress was made, precisely detecting weeds in perennial crops remains unsolved under real-world conditions.


Introduction

Sugarcane is the most important row crop grown in Louisiana and is currently produced on over 214,000 ha of land with an economic impact of over USD 1.5 billion to producers and processors1. Sugarcane producers are always searching for ways to increase production efficiency due to rising input costs. Weed control costs account for a significant portion of a sugarcane grower’s yearly expenses. Furthermore, weed competition can decrease crop production by up to 34%2. Therefore, it is imperative that herbicides or other weed control methods are applied accurately and efficiently, so that these expenses can be minimized without crop loss. Moreover, given increased public concerns regarding the use of pesticides and farmers’ concerns about herbicide resistance, it is important that herbicides are applied judiciously, such that the minimum amounts are applied to achieve the desired level of weed control. Achieving these goals requires new systems that accurately target weed species developing among the crop of interest. The successful development of such a system would allow agricultural producers to reduce their use of agrochemicals, in particular herbicides, and would represent an important step towards sustainable agriculture.

The most prevalent method of weed control in modern agriculture entails uniformly spraying an entire field with herbicides, which can be unnecessarily costly, impact non-target species, and often poses health concerns to local populations. One can also possibly under-spray areas with high weed populations and over-spray regions with low weed populations, which may adversely affect crop yields. Currently, limited commercial options exist to conduct targeted weed spraying in an automated, low-cost manner. Precision agriculture using autonomous robotics allows for many improvements to current agricultural practices, such as increasing crop yields, reducing resource requirements and hazardous chemicals, and lowering human errors.

Experimental commercial agricultural robotic systems, such as Bosch’s BoniRob3, are under development. These systems face several challenges in navigation, speed and efficiency, manipulation of end-effector tools, and robust weed detection. Effective classification and detection of weeds is a fundamental prerequisite, and currently the primary bottleneck, for a functional robotic weeding system. While an autonomous platform requires navigation and actuation, these downstream components rely entirely on the perception module’s accuracy. Therefore, to answer the question posed in our title, “Are we close?” we focus strictly on evaluating the readiness of computer vision architectures for this task.

Numerous vision-based learning systems have been proposed for distinguishing weed from crop. These methods are typically effective for annual crops, with accuracy rates usually above 90%, but much less so for perennial crops. Luo et al.4 investigated several Convolutional Neural Networks (CNNs) to identify 140 species of weed seeds in a controlled environment, with classification accuracy surpassing 94%. Hasan et al.5 proposed a three-step approach for crop and weed classification, in which generative networks first enhance images before they are divided into overlapping patches. The most informative patches are selected using the Laplacian’s variance and the Fast Fourier Transform’s mean frequency. The proposed pipeline was evaluated on cotton, corn, and tomato, among others, with near-perfect results.

Before classification, we must work on weed detection, which is far more challenging. While classification operates on selected regions, detection must analyze the entire image. Classification assumes that detection has correctly delimited the sample, which may not be true, and while we present both weed classification and detection experiments, we are more interested in the latter.

Rahman et al.6 evaluated thirteen deep learning architectures for weed detection in cotton crops. RetinaNet7 achieved the highest overall detection accuracy (mAP@0.50 of 79.98%) despite its longer inference time. On the other hand, YOLOv5n8 showed the best trade-off between efficiency and effectiveness, with the fastest inference time while achieving comparable detection accuracy (mAP@0.50 of 76.58%). Rai et al.6 reviewed 60 papers on deep-learning-based weed detection, with the following findings: (i) transfer learning is widely used, (ii) no model is acknowledged to perform best on all datasets, (iii) the literature lacks studies on inference at the edge, and (iv) YOLO (mostly v3) is broadly used (in our work we show that recent neural models can perform better). Interestingly, none of the prior work considered weed detection in sugarcane fields. We hypothesize that the perennial sugarcane crop is among the most challenging crops for weed detection. As a perennial grass of the family Poaceae, sugarcane’s plant architecture changes rapidly during the period when weed control is typically implemented, making the differentiation of crop and weed quite challenging.

Although several works have been proposed for annual crops, there is limited literature on weed detection in perennial crops. Perennial systems are particularly challenging, as dense coverage can lead to the so-called “sod-bound” condition. The long lifecycle of such crops facilitates weed proliferation, requiring persistent and often integrated management approaches. This review focuses on contributions from the past four years to capture the paradigm shift from conventional CNNs to Vision Transformers and the progression from green-on-brown to complex green-on-green detection scenarios.

Singh and Singh9 used Unmanned Aerial Vehicles (UAV) imagery and Random Forests to classify patches in sugarcane fields, reporting comparable performance between human and automatic labeling. Singh et al.10 emphasized difficulties in classification due to spectral similarity between weeds and sugarcane, noting camouflage-like behavior and patch-level spread. Although they used deep networks for feature extraction, detection was not addressed, and the limitations of UAVs were acknowledged, including regulatory, technical, and operational bottlenecks.

Sun et al.11 employed a Transformer-based model leveraging low-level semantics for segmentation, achieving 96.97% mean accuracy and 94.13% IoU. Their dataset, however, included predominantly green-on-brown scenarios (e.g., weeds over bare soil), which simplifies the discrimination task.

Romero and Heenkenda12 proposed a multimodal weed mapping method combining UAV multispectral imaging with LiDAR-based height estimation. While reporting 87% accuracy, the method incurs high deployment costs. Zhang et al.13 surveyed robotic approaches for weed detection, noting data annotation as a key bottleneck and identifying self-supervised learning as a promising direction.

Ajayi and Ashi14 analyzed performance variations in a region-based CNN over multi-crop UAV datasets, including sugarcane. Verçosa et al.15 discussed cross-modality generalization, demonstrating that CNN models trained on one acquisition method (e.g., UAV) may not generalize to others (e.g., satellite or ground-based).

Santiago et al.16 used a bag-of-features model with SVMs, achieving 76.1% accuracy, substantially below deep learning benchmarks. Andrade and Ramires17 achieved 97% classification accuracy using Random Forests over color spectral data, leading to a 60.4% reduction in herbicide application area.

Souza et al.18 reached 99.5% accuracy distinguishing weeds and crops using spectral features but did not discuss deployment feasibility in field sensors. Verçosa et al.15 classified crops, weeds, and forest using CNNs with over 98% accuracy, yet conflated classification with detection. Thamoonlest et al.19 applied low-resolution UAV imagery for weed presence estimation to support machinery allocation, reporting 84% accuracy.

Relevant efforts on sugarcane analysis using artificial intelligence mainly focus on disease classification or insect pest damage. Little work addresses weed management on the ground, with intelligent sensors embedded in agricultural machinery. Decisions on whether and when to apply herbicides must occur in real time to avoid wasted resources. Some sensors are more cost-effective than UAVs, which depend on favorable weather and trained personnel. Sensor networks can operate collaboratively even under challenging conditions, such as dust and poor illumination.

Indeed, a careful look at recent studies on weed detection in sugarcane reveals a very limited number of works. For example, Modi et al.20 investigated six deep learning-based methods for weed identification in sugarcane crops, DarkNet53 being the most accurate one (>99%). That said, Modi et al. assume the images are well-behaved, i.e., the weeds are not mixed with the crop. In recent work, Sun et al.11 employed Transformers for weed identification. Their dataset does not appear to be challenging, as we can observe in the results (>98%). A visual inspection shows images containing weeds in their well-known green color against brown soil in the background.

In this paper, we shed light on a more realistic field scenario, where weeds may vary in size, color, and degree of overlap with sugarcane crops. We evaluated several recent and state-of-the-art deep learning architectures and detectors for weed identification and showed that none come close to delivering reasonable detection accuracy. Of note, to benchmark our present (and future) efforts, we curated a new in-the-field dataset, with image-level annotations for classification purposes and bounding boxes for weed detection tasks (https://github.com/MIT-The77Lab/SugarcaneWeedDataset).

Table 1 situates our contribution by comparing existing sugarcane weed datasets. Most prior works rely on UAV-based classification, an inherently simpler task than detection, which presumes prior localization. UAV deployment also involves practical limitations: (i) payload and size, (ii) power constraints, (iii) real-time integration, and (iv) regulatory restrictions. Our dataset captures images from ground-level sensors, simulating integration into field machinery for real-time weed detection and classification.

Table 1 Summary and comparison of datasets used in recent weed detection/classification studies in sugarcane crops

We are interested in lightweight weed detectors that can be embedded in specific devices for rolling out green-on-green spot spraying. This scenario also features dust, mud, rain, and varying visibility and light levels, making the challenge daunting. To our knowledge, none of the approaches proposed so far have been able to identify and control grass weeds among a perennial grass crop like sugarcane, particularly when the shoots are between 15 and 35 in. tall and intertwined with weeds. Most of the literature validates unrealistic scenarios, such as those where weeds are not mixed with the crop, or are imaged against bare soil in simpler green-on-brown contexts. Our dataset comprises images with multiple weed specimens in proportions common to production fields. One may find many small weed shoots or widespread areas covered with weeds, making it difficult to refine the detector.

This paper reports on three primary contributions:

  • A realistic and in-the-field dataset for weed detection and classification with object- and image-level annotations.

  • A methodology for weed detection comprising a neural backbone, a detector head, and a loss function that adapts to different object sizes.

  • A comprehensive experimental analysis covering weed detection and classification, alongside qualitative and quantitative assessment of instance segmentation using zero-shot and weakly supervised approaches, supported by manually annotated pixel-level ground truth.

The remainder of the manuscript presents the results, discussion, conclusions, and methodology of this work.

Results

Overview of findings

In this study, we evaluated several tasks related to precision agriculture associated with weed identification among perennial sugarcane crops, a challenging scenario due to the visual similarity between crops and weeds. The experimental results comprise three primary scenarios: (i) detection, (ii) classification, and (iii) segmentation. In this section, we present our findings in each scenario and discuss some fundamental aspects concerning each task revealed by our findings, with methodological details described in the Methods Section.

Detection

We evaluated several state-of-the-art models for weed detection in a sugarcane field. Table 2 presents the outcomes. The task is challenging and requires accurate models to distinguish subtle visual differences between crops and invasive plants, complicated by overlapping foliage, inconsistent scaling, and imperfect annotations. One can observe that RTMDet-based detectors achieved the highest detection accuracy, especially when considering a ConvNeXt backbone (44.2% on the AP50 metric), outperforming their transformer counterparts by 6.2%.

Table 2 Object detection architectures considered in this paper include backbone types, pretraining strategies, feature fusion methods, and preprocessing techniques

Adopting the Complete Intersection over Union (CIoU) loss improved detection AP50 by 2.9% compared to standard IoU, as this approach aims to obtain geometrically plausible predictions even when ground truth boxes are oversized or inconsistently grouped. Meanwhile, the hybrid TransConv-RTMDet model, which concatenates SwinViT and ConvNeXt features, achieved moderate detection performance (42.1% AP50) but failed to surpass standalone ConvNeXt (44.2%).

To assess the real-time performance of each model, we measured inference time in seconds under standardized hardware conditions. All evaluations were conducted on a system equipped with an Intel(R) Xeon(R) Silver processor, 384 GB of RAM, and an NVIDIA A40 GPU. As shown in Table 3, inference times varied considerably across architectures. The YOLOv11 family demonstrates exceptional throughput, achieving inference times as low as 0.007 s per image; while this ultra-low latency makes them theoretically suitable for certain scenarios, their detection quality is still low.

Table 3 Inference times (seconds per image) for detection models under standardized hardware conditions

Table 4 compares the results of our best architecture across variations in input resolution after training for 120 epochs. Increasing input resolution beyond 640 × 640 paradoxically reduced detection accuracy. At 2560 × 2560, AP50 fell to 13.4%, likely because convolutional kernels pre-trained on lower-resolution images fail to generalize to the fine-grained details present at extreme scales without substantially longer training.

Table 4 Detection performance (AP50) of our best architecture (RTMDet-ConvNeXt) across different input resolutions, where bold highlights the most accurate scenario

Classification

Table 5 presents the results regarding the classification task, in which ViT-B achieved the highest F1-score (98.23%), followed by the ResNet-18 architecture. It is worth noting that SwinViT-B with MAE pretraining achieved the highest accuracy metric (99.05%) alongside an 89.71% F1-score. EfficientNet provided a practical alternative, delivering a 94.06% F1-score with fewer training epochs, a pragmatic choice for imbalanced datasets.

Table 5 Classification outcomes (accuracy, precision, recall, F1-score) for CNN and transformer models on the sugarcane-weed dataset, where bold values indicate the best result for each metric

Comparing transformer performance, ViT-B with MAE pretraining reached 99.04% accuracy, nearly matching SwinViT-B with MAE at 99.05%. The MAE approach, specifically UM-MAE, substantially benefited SwinViT-B across all metrics; the F1-score, for example, improved from 87.93% to 89.71%. For ViT-B, however, MAE boosted accuracy but lowered its otherwise leading F1-score, precision, and recall.

To further investigate the error distribution, Fig. 1 presents the confusion matrices for four representative architectures, revealing distinct failure modes across model families. SwinViT-B shows the best efficacy after MAE pretraining, with the most robust decision boundaries and negligible cases of confusion.

Fig. 1: Confusion matrices for four representative classifiers: ResNet-18, SwinViT-B (MAE), ViT-B, and ConvNeXt-v2, covering both convolutional and transformer-based architectures.

These results summarize class-specific behavior across the three categories and highlight differences in error distribution among model families.

Segmentation

Concerning the segmentation, our results highlight the complementary nature of the ExGR-based method, SAM, and S2C (Fig. 2). While the ExGR-based approach produced well-defined boundaries for isolated weeds, it struggled in complex scenarios, misclassifying background elements with similar spectral properties. SAM demonstrated superior structural awareness, particularly in cluttered scenes, yet, due to prompting limitations, this led to inconsistencies in delineating individual foliage structures. The S2C method also presented mixed performance, leading to complete and cohesive segmentations for broadleaf weeds but struggling with some dense grassy weeds, and exhibiting a notable tendency to over-segment ground regions.

Fig. 2: Qualitative comparison of weed foliage segmentation approaches: ExGR-based method, SAM, and S2C.

Row (iv) shows the “green-on-green” challenge where weeds blend with similarly colored crop foliage. Row (v) demonstrates “foreground visibility” challenges, including detection of thin leaf structures and accurate segmentation of target weeds against complex backgrounds.

Some key challenges faced by our chosen segmentation methods are also illustrated in Fig. 2. Specifically, in the green-on-green scenario (row iv), accurately segmenting weeds that blend in with the background proves particularly challenging. While ExGR tends to produce coarser results and SAM struggles to isolate the weeds, the weakly supervised approach, S2C, achieves clearer segmentations of the target weeds. This suggests that while S2C may be weaker in the optimal scenarios discussed earlier, it has greater potential in certain challenging situations.

A similar challenge arises when foreground visibility is poor, such as when capturing thin leaves or the entire weed (row v). In this case, SAM struggled significantly by segmenting the entire background instead of the foreground. Although ExGR provides a more cohesive solution, it does not adhere to precise markings. On the other hand, S2C offers a middle ground, able to distinguish between foreground and background. However, it tends to segment too much additional information, most likely due to the CAM maps used for training the classification model.

To quantitatively validate our qualitative observations, we manually annotated all test images containing detection bounding boxes with pixel-level ground truth. Table 6 presents the standard segmentation metrics for our evaluated zero-shot and weakly supervised approaches without dedicated fine-tuning. ExGR achieved the most balanced performance, while SAM demonstrated the highest structural accuracy and precision. The weakly supervised S2C model achieved the highest recall but exhibited the lowest precision due to its tendency to over-segment the ground regions.

Table 6 Quantitative evaluation of segmentation strategies. Metrics were calculated on the test set, which contained detection bounding boxes

Qualitative analysis of bounding boxes

We removed images containing oversized bounding boxes (boxes spanning more than 80% of the image) from the training set to address annotation inconsistencies. The best-performing model (RTMDet-ConvNeXt) was retrained on the cleaned dataset defined in the Methods Section. To isolate the effect of data quality, this model was trained from scratch using ImageNet initialization and the same hyperparameter protocol as the main experiments, rather than fine-tuned from the previous noisy weights. The qualitative evaluation revealed improvements in the localization of the excluded images, where the bounding boxes were poorly annotated, as shown in Fig. 3. The refined model produced smaller, tightly cropped boxes around individual weed leaves, even in scenes where the original annotations grouped multiple plants into one box. This suggests that oversized annotations introduce ambiguity during training, as there is no consistency between weed clusters and single entities, degrading localization granularity.

Fig. 3: Qualitative analysis of our model’s robustness to the large bounding boxes in the original annotations.

The left column shows the model’s prediction, while the right column shows the original annotation.

Discussion

Regarding the evaluation of weed detection, the results displayed in Table 2 show a clear discrepancy in effectiveness between convolutional and transformer-based architectures. While ViT-driven approaches traditionally perform much better in generic scenarios, the ConvNeXt backbone proved to be a notable exception, as it uses convolutions with increased kernel sizes to mimic some of the improvements in global feature integration present in visual transformers. This suggests that global information and larger receptive fields may extract more relevant features for the data we need to detect, making it possible to capture even large clusters of weeds.

Regarding the loss function, an initial hypothesis was that the intertwined nature of sugarcane foliage would lead to annotation inconsistencies, specifically where ground truth boxes encompass dense clusters rather than individual plants. The observed improvement with CIoU loss supports this hypothesis and suggests that standard IoU metrics penalize geometrically plausible predictions against ambiguous ground truth, whereas the geometric constraints of CIoU allowed the model to mitigate the noise introduced when annotators were forced to draw coarse boxes around overlapping weeds.

Another noteworthy observation is that the hybrid TransConv-RTMDet model fails to surpass standalone architectures, suggesting that naively combining transformer and CNN features introduces noise rather than complementary signals. While transformers excel at global context modeling and CNNs at local texture extraction, their feature maps operate at divergent spatial granularities. Attempts to harmonize these via simple concatenation proved insufficient, likely because the detector head could not reconcile patch-based embeddings with channel-wise activations without explicit spatial regularization.

This outcome is also related to an initial hypothesis of this work, which states that self-supervised tasks, such as image reconstruction using Masked Autoencoders (MAE), could improve prediction quality by finding more representative features in the backbone. Instead, the detection model struggled with feature distribution shifts caused by MAE pretraining, as evidenced by a drop in the AP50 metric for the SwinViT-B backbone to 11.93%, compared to 37.96% with standard ImageNet initialization. While classification fine-tuning reduced this gap, it remained less effective than the baseline, suggesting that the masking strategy intrinsic to reconstructive pretraining disrupts the fine-grained spatial relationships critical for crop-weed discrimination.

This poor performance of MAE highlights a broader issue: self-supervised methods optimized for generic image reconstruction can degrade task-specific feature learning. When applied to SwinViT-B, MAE’s masking strategy corrupted the spatial coherence of stem-leaf structures, making detection heads trained on ImageNet features ineffective.

From an operational standpoint, the analysis of model latency, displayed in Table 3, identifies a clear hierarchy for deployment. The ConvNeXt-based RTMDet achieved the optimal trade-off, registering a processing time of 0.099 s/image while maintaining top-tier AP50 scores. This performance not only rivals the speed of smaller YOLO variants but also outperforms both transformer-based counterparts (SwinViT-RTMDet: 0.103 s) and traditional CNN detectors like Mask R-CNN (ResNet50: 0.154 s). Although YOLOv11L (0.147 s) lagged slightly behind ConvNeXt-RTMDet, it remained 32% faster than similarly sized Mask R-CNN variants, reinforcing the potential of modern CNNs for lightweight deployment.

However, the same could not be said for transformer-based architectures, which incurred substantial latency penalties: the SwinViT-L backbone in DiNO21 slowed inference to 0.268 s/image, 2.7× slower than ConvNeXt. Even when paired with a faster detection head, such as RTMDet, transformer mechanisms remain slower, and their latency is difficult to justify without proportional accuracy gains. This is especially important considering that these architectures will be placed in embedded systems for agricultural needs, where every delay must be justified by an increase in accuracy.

Regarding the effects of image resolution on prediction, as shown in Table 4, the observed degradation implies that resolution gains must be balanced against pre-trained feature compatibility. Since working at a higher resolution scale may increase the latency, the 640 × 640 resolution seems optimal.

We attribute this degradation to the specific environmental conditions in perennial crops. In these scenarios, weeds and crops share nearly identical local textures (e.g., color and leaf venation), making high-frequency details ambiguous. We hypothesize that the primary discriminative signal may rely on tracing the global morphology of leaves back to their stems, rather than local texture. At high resolutions, the network’s effective receptive field shrinks relative to the plant’s size, forcing the model to rely on these ambiguous local textures rather than the structural context needed to resolve the intertwining foliage. Understanding why ConvNeXt succeeds where transformers fall short remains an open question; thus, future work employing attribution visualizations such as Grad-CAM22 could shed light on the specific feature-level mechanisms at play.

The classification outcomes detailed in Table 5 highlight the nuanced impact of reconstruction-based pretraining. The high accuracy observed in MAE-based models suggests that while reconstruction supervision may be helpful for the classification task, i.e., asserting whether there is a weed in the scene, the model may still overfit to dominant sugarcane features, reducing sensitivity to minority weed classes. This distinction suggests that the standard ViT architecture, which captures long-range dependencies using global self-attention, may develop more balanced discriminative features for this task than SwinViT’s hierarchical windowing.

The nature of the features learned explains this divergence. The MAE reconstruction learned general scene features useful for accuracy, such as differentiating between ground and foliage, which might have subtly altered the nuanced features that ViT-B originally learns for an optimal precision-recall balance. This suggests that MAE’s utility might be task-dependent, as it improves classification accuracy but disrupts critical spatial relationships in detection tasks, as evidenced by SwinViT-B’s AP50 dropping to 33.31%.

Finally, the error distribution analysis in Fig. 1 reinforces these findings, confirming that reconstruction-based pretraining is good at separating feature representations between weed and sugarcane, even for visually similar classes. In contrast, while ViT-B achieves high accuracy, it exhibits a bias toward frequently misclassifying weed instances as a mixture of sugarcane and weed, which suggests that while ViT-B correctly detects vegetative features, it struggles to rule out the presence of sugarcane in dense weed patches. Additionally, ConvNeXt-v2 displays significant inter-class confusion, particularly between ‘Sugarcane’ and ‘Weed’, indicating that its feature extraction failed to overcome the spectral similarity of the green-on-green scenario.

The qualitative assessment of segmentation strategies reveals distinct architectural biases. While the S2C method provided cohesive segmentations for broadleaf weeds, its tendency to over-segment ground regions suggests a limitation in its weak supervision signal. This may be caused by the CAMs obtained during classification training, which may highlight discriminative areas rather than precise object boundaries. Consequently, the model focuses on features that minimize classification loss rather than those that adhere to geometric coherence.

Comparing the approaches visualized in Fig. 2, no single method proved universally effective, with ExGR, SAM, and S2C each exhibiting context-dependent strengths and specific weaknesses, especially without dedicated supervised fine-tuning for pixel-level segmentation. The observed deficiencies, however, can be complementary. For instance, SAM might struggle with small or sparse foliage, whereas ExGR captures more greenness. Meanwhile, both ExGR and S2C (particularly in ground regions, as noted) can be prone to over-segmentation. In contrast, SAM might offer more structurally aware boundaries, though sometimes at the cost of overall completeness, suggesting that a hybrid approach may be ideal.

Finally, regarding the constraints of this evaluation, generating large-scale, pixel-perfect annotations for intertwined foliage presents an extreme labor cost. However, to provide a rigorous baseline, we manually annotated the entire test set that contained detection bounding boxes. As detailed in Table 6, this quantitative evaluation confirms our qualitative hypotheses. The metrics highlight that no single method is universally effective for green-on-green scenarios without supervised fine-tuning. SAM’s higher precision pairs well with S2C’s superior recall, reinforcing our conclusion that a hybrid or ensemble approach is required to guide future zero-shot labeling pipelines.

In terms of real-life applicability, the obtained results indicate that ConvNeXt-based detectors provide a strong solution for sugarcane weed detection, balancing accuracy and latency, which in turn suggests that ConvNeXt can be effectively applied to precision agriculture tasks, such as real-time targeted herbicide deployment. Compared to transformer-based models with higher computational costs and latency, CNN-based architectures like ConvNeXt remain more suitable for this application.

In practice, however, several challenges remain. The dataset contains poorly annotated samples, limiting the model’s ability to generalize effectively. Additionally, the visual similarity between sugarcane and certain weed species requires models to handle intra-class variation and inter-class confusion. Another issue is the lack of annotations and images representing edge cases in this problem, such as variations in climate, illumination, and camera settings that may affect detection.

This analysis highlights a fundamental challenge in the real-world deployment of models for perennial crops, where visual boundaries are often ambiguous even to human experts. As shown in Fig. 3, the initial annotations frequently grouped plants into coarse clusters due to the nature of the field and the similarity between different types of foliage. The model trained on the filtered subset, however, successfully predicted tighter, instance-specific bounding boxes on the validation set, effectively correcting the coarse ground truth. In other words, while obtaining perfect field annotations is physically difficult, the model can learn to disentangle vegetative structure when provided with a consistently supervised subset, a critical trait for robust real-world applicability. As such, we now have a starting point for tackling the problem of weed detection in perennial crops.

We also acknowledge the geographic and geometric constraints of the dataset. The images were acquired from a single site using a fixed viewing angle. While this setup intentionally simulates the rigid sensor mounting of a tractor, where camera height and pitch are mechanically constrained, it leaves the question of cross-site generalization open. Variations in soil color, regional weed species, and sensor mounting positions remain untested variables. Furthermore, this study does not currently account for extreme environmental factors, such as severe weather conditions, dramatic seasonal variations, or shifting lighting dynamics, all of which could significantly impact model robustness in real-world deployments. Consequently, this benchmark validates the feasibility of deep learning for green-on-green discrimination in sugarcane, but future multi-site studies are required to establish robustness.

Fundamentally, this work addressed weed management in perennial crops by providing a new in-the-field dataset that reflects the challenging green-on-green nature of the problem, unlike many existing datasets.

Beyond the dataset itself, our experiments suggest there is much to do in weed detection, whereas classification accuracy is nearly perfect once weeds are detected. Although segmentation was evaluated without dedicated pixel-level training, the manually annotated test masks enabled a quantitative assessment, which showed promising results.

The dataset has limitations, indeed. We need to improve the annotation further, for there are regions with multiple weeds labeled as a single one. We showed how to mitigate the impact of such inconsistencies with a loss function considering geometric distortions. However, even with state-of-the-art architectures, weed detection appears challenging in sugarcane crops.

Here, we highlighted the importance of neural architectures that offer a trade-off between efficiency and effectiveness, since detection and classification modules will likely be deployed on storage- and processing-limited devices onboard tractors and chemical sprayers.

Methods

Experimental design overview

Weed detection in sugarcane crops presents significant challenges under actual field conditions, particularly when weeds share visual characteristics with the crop and display variable scaling and overlapping spatial patterns. To investigate this issue, particularly in the case of perennial crops, such as sugarcane, we structured our investigation around three key objectives: (i) weed detection to enable targeted herbicide application, (ii) scene-level classification to identify weed presence, and (iii) bounding-box-guided instance segmentation of weed leaves. To support these objectives, we conceived a purpose-built dataset that captures the inherent complexity of sugarcane fields.

Our evaluation methodology was designed to address the three objectives outlined above. Specifically, for weed detection (i), performance was quantified via the AP50 metric. For scene-level classification (ii), we employed variants of ResNet, ConvNeXt, and Vision Transformers. Finally, the approach to segmentation (iii) involved both qualitative and quantitative analysis of bounding-box-guided instance segmentation, comparing the zero-shot Segment Anything Model (SAM)23, the Excess Green Ratio (ExGR) vegetation index24, and the SAM to CAMs (S2C) weakly supervised model25 against manually annotated pixel-level ground truth on the detection test set.

In this section, we detail aspects of dataset acquisition, including field selection criteria, imaging protocols, ground-truth annotation procedure, our experimental design, and some insights regarding the qualitative evaluation. To provide a clear visual summary of our approach, Fig. 4 illustrates our complete experimental workflow, demonstrating the path from image acquisition and preprocessing to the final downstream tasks: classification, detection, and segmentation. The following subsections will provide a detailed breakdown of each stage and the specific models evaluated.

Fig. 4: An overview of the introduced sugarcane-weed dataset and experimental pipeline.

The upper half shows examples of images from the new dataset, illustrating the visual similarity and challenging conditions for detection. The dataset includes categories for ‘Sugarcane’ (left), ‘Sugarcane and Weeds’ mixed (center), and ‘Weeds’ only (right). The lower half illustrates the architectures of the well-performing models chosen for each downstream task. After image acquisition and preprocessing, the workflow addresses the three main objectives: (i) classification for scene-level analysis; (ii) detection of weed bounding boxes; and (iii) segmentation, using a bounding-box-guided instance approach.

Dataset

The dataset comprises 2139 high-resolution (4608 × 3072) RGB images captured from sugarcane fields at the USDA/ARS, Sugarcane Research Unit’s Ardoyne Research Farm in Schriever, Louisiana, USA in April 2020. These images were collected across different lighting conditions, soil composition, and plant growth stages to ensure robustness against natural variability. They were manually captured at approximately chest height using a Nikon D3100 camera, framing the scene to exclude operator artifacts (e.g., footwear, tools) and capture only vegetation and soil. The dataset is structured into three categories: (i) sugarcane-only (869 images), (ii) weed-only (855 images), and (iii) mixed (415 images), where both sugarcane and weeds are present within the same scene. Images containing sugarcane and weeds were classified by a sugarcane weed scientist. This categorization supports the development of models capable of distinguishing sugarcane from visually similar weed species.

While the dataset initially possessed image-level labels for classification, it lacked instance-level data. For the detection benchmark, we randomly sampled a representative subset of 285 images from the “mixed” category and had them annotated by agricultural experts, using the VGG Image Annotation tool26, enabling the training and evaluation of supervised object detection models. The dataset prioritizes challenging cases where weeds exhibit high visual similarity to sugarcane in both color and texture.

It is important to note the division of data usage across our experiments. The complete, original dataset of 2139 images was used solely for the scene-level classification task. However, a targeted filtering process was applied for the detection benchmark. Specifically, images containing acquisition artifacts, such as operator footwear, motion blur, or near-duplicate frames, were excluded from the detection subset, as they lacked the corresponding high-quality bounding box annotations required for precise localization. Consequently, only this filtered subset of 285 images was used for training and evaluating the detection models.

To systematically assess model performance, three experimental scenarios are considered:

  • Detection: this task uses the manually annotated subset of bounding boxes to train a detection model capable of localizing individual weed instances within an image by predicting bounding boxes that encapsulate their regions.

  • Classification: the objective is to assign a single label to an image based on its dominant vegetation type, i.e., weed, sugarcane, or a combination of both.

  • Segmentation: this scenario involves both qualitative and quantitative assessment, where weed leaves are segmented from the background using bounding-box-guided instance approaches. Pixel-level ground truth was manually annotated for all test images containing detection bounding boxes, enabling the computation of standard segmentation metrics.

Image preprocessing and data augmentation

To ensure consistency and a uniform analysis across all subsequent experiments, the same image preprocessing and data augmentation steps were applied universally across all three tasks. While these tasks operate at different semantic levels, they share the same fundamental photometric challenges, such as variable field illumination and low contrast between crops and weeds, making the preprocessing beneficial regardless of the downstream task.

Thus, our data augmentation strategy addresses field-specific challenges by enhancing image contrast, mitigating illumination variations, and improving the separability between vegetation and background. This strategy incorporates Contrast Limited Adaptive Histogram Equalization (CLAHE) and the ExGR vegetation index after contrast equalization.

CLAHE normalizes contrast variations by adaptively redistributing intensity values within local image regions, preventing noise overamplification while preserving fine-grained details. We applied CLAHE to the Value channel of the HSV color space, ensuring that contrast enhancement operates independently of hue and saturation. This approach preserves the original color distribution while compensating for the non-uniform lighting conditions commonly encountered in field imagery.
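As an illustration, a minimal OpenCV sketch of this step is given below; the clip limit and tile grid size are assumed defaults rather than our exact configuration.

```python
import cv2

def clahe_on_value(img_bgr, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Apply CLAHE to the Value channel of HSV only, leaving hue and
    saturation untouched; parameter values are illustrative."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    v_eq = clahe.apply(v)  # locally clipped adaptive histogram equalization
    return cv2.cvtColor(cv2.merge((h, s, v_eq)), cv2.COLOR_HSV2BGR)
```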

The ExGR vegetation index is employed after CLAHE to enhance the distinction between plants and non-plant regions, and it is defined as follows:

$$\mathrm{ExGR} = 2G - R - B,$$
(1)

where R, G, and B correspond to the red, green, and blue color channels. By amplifying the green component while suppressing red and blue, ExGR highlights vegetation while reducing interference from soil and other non-vegetative elements. Since this transformation results in a grayscale map emphasizing plant structures, we integrate it into the original image using a subtle colormap as a transparency mask with an alpha value of 0.2. This strategy enhances visibility without overwhelming the original color information, facilitating better feature extraction for subsequent detection tasks.
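A minimal sketch of this enhancement follows, with an OpenCV colormap standing in for the subtle colormap mentioned above; the specific colormap choice is an assumption.

```python
import cv2
import numpy as np

def exgr_overlay(img_bgr, alpha=0.2):
    """Compute ExGR = 2G - R - B (Eq. (1)), rescale it to [0, 255], and
    blend a colormapped version over the original image with alpha = 0.2."""
    b, g, r = cv2.split(img_bgr.astype(np.float32))
    exgr = 2.0 * g - r - b
    exgr = cv2.normalize(exgr, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    heat = cv2.applyColorMap(exgr, cv2.COLORMAP_SUMMER)  # assumed "subtle" colormap
    return cv2.addWeighted(heat, alpha, img_bgr, 1.0 - alpha, 0)
```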

Detection task

We investigate the effectiveness of distinct state-of-the-art detection architectures through a systematic evaluation framework. The architectural assessment spans various detection paradigms, including RetinaNet7, YOLOv11L27, and RTMDet28 implementations using ConvNeXt29 and SwinViT30 backbones, alongside Mask R-CNN31 variants. Note that as our dataset does not contain pixel-wise segmentation masks, we disabled the segmentation head in these architectures. We retained the ‘Mask R-CNN’ designation to indicate the use of RoIAlign, but the models effectively operate in a detection-only mode. To examine the effects of hybrid feature extraction, we implement a custom RTMDet architecture that combines SwinViT and ConvNeXt as joint backbones, aiming to assess the complementary strengths of convolutional and transformer-based representations in field conditions.

Architectural setup and fine tuning strategies for detection

We initialized all architecture backbones with ImageNet32 pre-trained weights, while the downstream components, including the feature pyramid necks and detection heads, were randomly initialized using Kaiming initialization33, and trained from scratch using AdamW with an initial learning rate of 4e-5 and a cosine learning rate schedule over 120 epochs. We employed a standardized hyperparameter configuration across all architectures to ensure a controlled comparison, isolating the model architecture as the primary variable. The dataset is stratified into 60% training, 20% validation, and 20% testing sets. The primary input resolution is set to 640 × 640 pixels, but an auxiliary analysis evaluates resolution dependencies up to 2560 × 2560 pixels for selected architectures.
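To make this shared protocol concrete, the following PyTorch sketch assembles the initialization, optimizer, and schedule; the stand-in head module is hypothetical, and the detector wiring is left to the framework.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def kaiming_init(module):
    # Necks and heads are trained from scratch; backbones keep ImageNet weights.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

head = nn.Conv2d(256, 4, kernel_size=1)  # hypothetical stand-in for a detection head
head.apply(kaiming_init)

optimizer = AdamW(head.parameters(), lr=4e-5)
scheduler = CosineAnnealingLR(optimizer, T_max=120)  # cosine decay over 120 epochs
# for epoch in range(120): train_one_epoch(...); scheduler.step()
```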

While ImageNet weights provide a solid foundation for image pretraining, several advanced techniques have emerged to further enhance models’ generalization capabilities. One such method involves intermediate fine-tuning tasks, including classification and Masked Autoencoders (MAE)34, which have shown promising results for improving feature extraction in various domains. MAE, in particular, is a self-supervised learning approach in which a portion of the input image is masked and the model is trained to predict the missing regions. The Uniform Masking Masked Autoencoder (UM-MAE)35 is one such masked autoencoding technique, proposed to introduce a more efficient and generalized approach for pyramid-based Vision Transformers, such as SwinViT, by applying uniform masking strategies that significantly improve pretraining efficiency.
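As a rough illustration of the uniform masking idea, the sketch below keeps one patch per 2 × 2 block (25% visible, matching the 75% masking ratio shown later in Fig. 6); this reflects our reading of UM-MAE’s sampling stage and omits its secondary masking step.

```python
import numpy as np

def uniform_sampling_mask(grid_h, grid_w, rng=None):
    """From every non-overlapping 2x2 block of patches, keep exactly one
    at random, so 75% of patches are masked in a layout that remains
    regular enough for pyramid-based ViTs. True marks a masked patch."""
    rng = rng or np.random.default_rng()
    mask = np.ones((grid_h, grid_w), dtype=bool)
    for i in range(0, grid_h, 2):
        for j in range(0, grid_w, 2):
            di, dj = rng.integers(0, 2, size=2)  # the one patch kept visible
            mask[i + di, j + dj] = False
    return mask

print(uniform_sampling_mask(4, 4).astype(int))  # e.g., a 4x4 patch grid
```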

We hypothesize that fine-tuning the backbone on a classification task before detection improves the model’s ability to extract relevant features. To evaluate this, we compared three training strategies for the transformer-based SwinViT-RTMDet architecture: (i) ImageNet-only pretraining, (ii) ImageNet with UM-MAE pretraining, and (iii) ImageNet with UM-MAE pretraining followed by classification fine-tuning. UM-MAE introduces a structured masking approach to improve feature generalization, particularly in complex agricultural environments.

Given the results of both transformer and convolutional architectures, we pondered whether the two model families extract distinct sets of features that could enhance overall prediction. We evaluated this scenario using the hybrid TransConv model, which performs feature fusion on the best detector backbones of each architecture type, in this case ConvNeXt and SwinViT, considering various approaches for feature fusion, such as concatenation, addition, dot product, and an attention layer between both features. This model is evaluated under two conditions: ImageNet-only pretraining and ImageNet with classification fine-tuning.
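For concreteness, a minimal PyTorch sketch of the concatenation variant follows; the channel widths and projection layers are illustrative assumptions, not the exact TransConv implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFusion(nn.Module):
    """Project CNN and transformer feature maps to a common width, align
    spatial sizes, and fuse by concatenation plus a 1x1 mixing layer."""
    def __init__(self, c_cnn=768, c_vit=1024, c_out=256):
        super().__init__()
        self.proj_cnn = nn.Conv2d(c_cnn, c_out, 1)
        self.proj_vit = nn.Conv2d(c_vit, c_out, 1)
        self.mix = nn.Conv2d(2 * c_out, c_out, 1)

    def forward(self, f_cnn, f_vit):
        f_vit = F.interpolate(f_vit, size=f_cnn.shape[-2:], mode="bilinear",
                              align_corners=False)
        return self.mix(torch.cat([self.proj_cnn(f_cnn),
                                   self.proj_vit(f_vit)], dim=1))

# fused = ConcatFusion()(convnext_feats, swin_feats)  # both NCHW tensors
```

Addition, dot product, and attention-based fusion replace the concatenation step with the corresponding operation.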

We employed Average Precision (AP) as the primary measure for assessing object detection performance. AP integrates the precision-recall curve computed at a given Intersection over Union (IoU) threshold, where IoU measures the overlap between predicted and ground truth bounding boxes. The AP50 metric refers to the average precision calculated with an IoU threshold of 0.5, meaning a prediction is considered correct if its IoU with a ground truth box is at least 50%. AP50 is a suitable quantitative measure in this case, as it provides a balanced trade-off between false positives and false negatives, making it appropriate for evaluating detection quality in our problem.
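To make the matching criterion concrete, the minimal function below computes the IoU of two axis-aligned boxes; under AP50, a prediction counts as a true positive when this value reaches 0.5.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... -> no match at 0.5
```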

Bounding box challenges

A key limitation of the dataset is the presence of poorly annotated bounding boxes, particularly those with excessive size variation. To mitigate this, we employed Complete Intersection over Union (CIoU) loss36, which extends IoU by incorporating geometric constraints:

$$\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(\mathbf{b}, \mathbf{b}^{g})}{c^{2}} + \alpha v,$$
(2)

where $\rho(\mathbf{b}, \mathbf{b}^{g})$ stands for the distance between the centers of the predicted and ground truth boxes, $c$ for the diagonal of the smallest enclosing box, and $v$ for an aspect ratio penalty computed as follows:

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{g}}{h^{g}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v}.$$
(3)

To further refine regression, we modify the overall loss by weighting bounding box localization higher than classification. The final loss is defined as follows:

$$\mathcal{L} = 4\,\mathcal{L}_{\mathrm{CIoU}} + \mathcal{L}_{\mathrm{cls}},$$
(4)

where $\mathcal{L}_{\mathrm{cls}}$ is the classification loss. This adjustment forces the model to prioritize accurate bounding box regression over classification. Since the primary source of label noise in our dataset is geometric rather than categorical, we empirically determined that a higher regression weight (λ = 4) was necessary to prevent convergence on trivial, loose localizations, improving localization quality in the presence of annotation inconsistencies.
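For concreteness, a minimal PyTorch sketch of Eqs. (2)-(4) is given below; it is illustrative rather than our exact training code, and frameworks such as torchvision expose an equivalent complete_box_iou_loss.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for box tensors in (x1, y1, x2, y2) format, Eqs. (2)-(3)."""
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared center distance; c^2: squared enclosing-box diagonal
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio penalty v and trade-off weight alpha, Eq. (3)
    wp = pred[..., 2] - pred[..., 0]
    hp = (pred[..., 3] - pred[..., 1]).clamp(min=eps)
    wt = target[..., 2] - target[..., 0]
    ht = (target[..., 3] - target[..., 1]).clamp(min=eps)
    v = (4 / math.pi ** 2) * (torch.atan(wt / ht) - torch.atan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

# Eq. (4): total = 4 * ciou_loss(pred_boxes, gt_boxes).mean() + cls_loss
```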

Figure 5 illustrates such annotation inconsistencies. To further mitigate them, we performed a best-case scenario analysis: we filtered out the images where bounding boxes cover the majority of the image, then used the highest-performing model retrained on the cleaned dataset to qualitatively assess its predictions for such cases. After filtering, we obtained 115 training images with 480 training annotations and 264 validation annotations, compared to the original 154 training images with 525 training annotations and 275 validation annotations.

Fig. 5: Some annotation challenges faced in the dataset’s detection task include bounding boxes covering the entire image instead of isolating individual weed leaves, missing or nested bounding boxes, and inconsistent annotation patterns, where some regions are annotated as a group.

In contrast, others are annotated with greater detail.

To quantify this, we analyzed the distribution of bounding box scales before and after our filtering protocol, as shown in Table 7. In this analysis, the “Normalized Area” corresponds to the ratio of the bounding box area to the total image resolution, providing a scale-invariant measure of weed size. The analysis shows that the raw dataset contained significant outliers, with the 90th percentile of normalized box area reaching 54.6%, indicating single annotations encompassing large, undifferentiated clusters.

Table 7 Detailed annotation statistics comparing the Raw and Cleaned datasets across all partitions

Following the removal of these coarse samples, the cleaned dataset retained 93% of the original instances (998/1069) while reducing the 90th percentile area to 21.7%. The median normalized area remained consistent (~4.0%), confirming that the filtering process selectively removed ambiguous outliers while preserving the core instance-level data.
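A minimal sketch of this filtering rule follows; the data layout (a mapping from image id to corner-format boxes) is a hypothetical simplification of our annotation files.

```python
def filter_oversized(annotations, image_wh=(4608, 3072), max_frac=0.80):
    """Drop images containing any box whose normalized area (box area over
    image area) exceeds max_frac, per the 80% rule described above."""
    w_img, h_img = image_wh
    keep = {}
    for img_id, boxes in annotations.items():
        fracs = [(x2 - x1) * (y2 - y1) / (w_img * h_img)
                 for x1, y1, x2, y2 in boxes]
        if all(f <= max_frac for f in fracs):
            keep[img_id] = boxes
    return keep
```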

Classification task

Regarding the classification task, we evaluated the performance of distinct models across multiple classification metrics: accuracy, precision, recall, and F1-score. Transformer-based architectures form a core component of our evaluation framework. We employed ViT-B37, MobileViT38, DeiT39, and SwinViT-B models, testing the SwinViT-B also with the UM-MAE pretraining. These models can use their self-attention mechanisms to capture long-range dependencies in the input features, offering advantages in recognizing complex patterns across the entire image space. The hierarchical structure of SwinViT-B, in particular, provides multi-scale feature representation that proves beneficial for classification tasks with varying levels of feature granularity.

CNN-based architectures serve as strong comparative baselines. Our evaluation includes established ResNet40 variants (ResNet-18, ResNet-50, ResNet-101, and ResNext-101) and more recent architectures, including ConvNeXt-v241 and EfficientNet42. All models are initialized with ImageNet pre-trained weights to exploit the transfer learning benefits of this task. The training protocol maintains consistency with 10 epochs for most architectures, while the architecturally complex MobileViT, DeiT, and ConvNeXt models require 30 epochs to reach convergence.

Regarding the UM-MAE pretraining process using Swin-ViT and ViT-B as backbone architectures, we employed 100 epochs. An extra 10 epochs were required for fine-tuning. The visual reconstructions produced by the Swin-ViT UM-MAE model reveal interesting properties of the learned representations, as shown in Fig. 6. While these reconstructions do not capture fine-grained details of the images, they demonstrate that the model can preserve the overall color distribution across spatial coordinates. This is particularly evident in the model’s capacity to differentiate between ground and foliage elements in natural scenes, validating our hypothesis that this might help the model in some tasks.

Fig. 6: Reconstruction process of the UM-MAE technique applied to the Swin-ViT architecture.

For this case, a reconstruction mask was used to mask 75% of the original image, and the model was trained to reconstruct these gaps. This figure depicts the original, masked, and reconstructed image.

Segmentation task

Segmentation is performed using a bounding-box-guided instance segmentation approach. We qualitatively and quantitatively compare three distinct methodologies, i.e., (i) the SAM model, (ii) our ExGR-enhanced preprocessing based on Otsu’s thresholding technique, and (iii) a Weakly Supervised Semantic Segmentation method (S2C), to assess whether bounding-box-based segmentation strategies can effectively capture the structural characteristics of weeds in agricultural imagery. To enable a quantitative evaluation without dedicated segmentation training, we manually annotated all test images containing detection bounding boxes with pixel-level ground truth, examples of which are shown in Fig. 7. This allowed us to extract and report standard segmentation evaluation metrics, specifically Intersection over Union (IoU), the Dice similarity coefficient, Precision, Recall, F1-score, and pixel-wise Accuracy.

Fig. 7: Examples of manually annotated pixel-level ground truth masks used for quantitative evaluation of segmentation strategies.

Each pair shows the original field image with detected bounding boxes (left) alongside its corresponding binary segmentation mask (right). The top row illustrates broadleaf weed instances, while the bottom row shows grass weed instances, mixed in between sugarcane foliage.
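For reference, the pixel-level metrics reported here can be computed from a predicted and a ground-truth binary mask as in the sketch below; note that for binary masks the Dice coefficient and F1-score coincide.

```python
import numpy as np

def seg_metrics(pred, gt, eps=1e-9):
    """IoU, Dice/F1, precision, recall, and accuracy from boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    prec = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    f1 = 2 * prec * rec / (prec + rec + eps)
    return {"IoU": tp / (tp + fp + fn + eps), "Dice": f1, "Precision": prec,
            "Recall": rec, "F1": f1, "Accuracy": (tp + tn) / pred.size}
```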

For the first setting, we employ SAM, a foundation model for image segmentation. SAM is designed to perform a universal and generic segmentation task in a zero-shot manner, meaning it does not require network refinement for new tasks or datasets. To achieve this, SAM-based networks use mechanisms to combine the latent space representation of the image, obtained through an encoder network, with contextual information from a prompt. This prompt is encoded in the same latent space before the image is decoded into masks. The prompt itself can be a pixel, a region of the image corresponding to the object to be segmented, or even text describing what should be segmented.

In our task, we exploit the model’s prompt-based architecture using the detected bounding boxes from our best model as spatial prompts. The process begins with our object detection network generating bounding box coordinates for each weed instance. These coordinates are then forwarded to SAM as rectangular prompts, which guide the model’s attention to specific image regions, resulting in the desired segmentation mask.
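A minimal sketch of this box-prompting flow with the public segment_anything package follows; the image path, checkpoint path, and example box are illustrative placeholders for the outputs of our acquisition and detection stages.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_rgb = cv2.cvtColor(cv2.imread("field_image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)  # embed the image once, then prompt repeatedly

detected_boxes = [(120, 80, 480, 400)]  # (x1, y1, x2, y2) from the detector
for box in detected_boxes:
    masks, scores, _ = predictor.predict(
        box=np.array(box),
        multimask_output=False,  # one mask per box prompt
    )
```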

Our second approach uses the ExGR, described in Eq. (1), combined with adaptive thresholding. Similar to the SAM pipeline, this method also begins by identifying weed regions using the bounding boxes predicted by our best detection model. Once a weed-containing region is defined by a bounding box, several steps are performed within that specific cropped area. First, the ExGR index is applied to the original image patch corresponding to the bounding box. This transformation enhances the contrast between green vegetation and non-vegetative background elements like soil. Following the ExGR transformation, which results in a grayscale image that emphasizes vegetation, we employ Otsu’s adaptive thresholding method within each detected bounding box to determine an optimal intensity threshold that separates pixels into foreground (weeds) and background classes. After thresholding, a series of morphological operations, such as erosion followed by dilation, are applied to refine the resulting binary segmentation boundaries and remove noise, resulting in coherent instance masks that preserve the structural characteristics of the target weeds.
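The sketch below condenses this per-box pipeline with OpenCV; the morphological kernel size is an illustrative choice rather than a tuned setting.

```python
import cv2
import numpy as np

def segment_box_exgr(img_bgr, box, kernel_size=5):
    """ExGR + Otsu thresholding + opening within one detected bounding box."""
    x1, y1, x2, y2 = box
    patch = img_bgr[y1:y2, x1:x2].astype(np.float32)
    b, g, r = cv2.split(patch)
    exgr = cv2.normalize(2 * g - r - b, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # Otsu selects the intensity threshold separating vegetation from background.
    _, mask = cv2.threshold(exgr, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # erosion then dilation
```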

Finally, our third approach employs a weakly supervised semantic segmentation method named S2C. S2C improves Class Activation Map (CAM) quality by transferring knowledge from SAM to the classifier during training through two main components: (i) SAM-Segment Contrasting, which uses SAM’s “segment-everything” output to refine the classifier’s features via contrastive learning, and (ii) a CAM-based Prompting Module, in which CAM-derived peaks serve as point prompts for SAM to generate improved class-specific segmentation masks, which in turn act as self-supervision to further refine the CAMs.
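To illustrate the prompting component only (not the full S2C training loop with contrastive feature refinement), the sketch below shows one simple way CAM activations could be converted into foreground point prompts for SAM. The peak-selection heuristic here, taking the strongest activations above a threshold, is an assumption simplified from the method’s actual procedure.

```python
import numpy as np

def cam_peaks_to_prompts(cam: np.ndarray, thresh: float = 0.7, max_points: int = 5):
    """Pick the strongest CAM activations as foreground point prompts for SAM."""
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-7)  # normalize to [0, 1]
    ys, xs = np.where(cam >= thresh)                          # candidate pixels
    order = np.argsort(cam[ys, xs])[::-1][:max_points]        # strongest first
    point_coords = np.stack([xs[order], ys[order]], axis=1)   # (N, 2) as (x, y)
    point_labels = np.ones(len(order), dtype=int)             # 1 = foreground
    return point_coords, point_labels

# The prompts could then be passed to a SamPredictor, e.g.:
# masks, _, _ = predictor.predict(point_coords=point_coords,
#                                 point_labels=point_labels,
#                                 multimask_output=False)
```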

Data availability

All data needed to evaluate the conclusions in the paper are present in the paper. The datasets are available for download at https://github.com/MIT-The77Lab/SugarcaneWeedDataset and the code is available at https://github.com/MIT-The77Lab/sugarcane-weed-detection.

References

  1. Gravois, K. Sugarcane summary for crop year 2023. Sugarcane Research Annual Progress Report. Baton Rouge: Louisiana State University AgCenter 1–5 (2023).

  2. Gao, J. et al. Fusion of pixel and object-based features for weed mapping using unmanned aerial vehicle imagery. Int. J. Appl. Earth Observ. Geoinf. 67, 43–53 (2018).

  3. Ruckelshausen, A. et al. BoniRob: an autonomous field robot platform for individual plant phenotyping. In van Henten, E., Goense, D. & Lokhorst, C. (eds.) Precision agriculture ’09, 841–847 (Wageningen Academic, 2009).

  4. Luo, T. et al. Classification of weed seeds based on visual images and deep learning. Inf. Process. Agric. 10, 40–51 (2023).

  5. Hasan, A. S. M. M., Diepeveen, D., Laga, H., Jones, M. G. K. & Sohel, F. Image patch-based deep learning approach for crop and weed recognition. Ecol. Inform. 78, 102361 (2023).

  6. Rahman, A., Lu, Y. & Wang, H. Performance evaluation of deep learning object detectors for weed detection for cotton. Smart Agric. Technol. 3, 100126 (2023).

  7. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42, 318–327 (2020).

  8. Jocher, G. et al. ultralytics/yolov5: v7.0 – YOLOv5 SOTA Realtime Instance Segmentation https://doi.org/10.5281/zenodo.7347926 (2022).

  9. Singh, V. & Singh, D. Development of an approach for early weed detection with UAV imagery. In IGARSS 2022 – 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 4879–4882. https://doi.org/10.1109/IGARSS46834.2022.9883564 (2022).

  10. Singh, V., Singh, D. & Kumar, H. Efficient application of deep neural networks for identifying small and multiple weed patches using drone images. IEEE Access 12, 71982–71996 (2024).

  11. Sun, C., Zhang, M., Zhou, M. & Zhou, X. An improved transformer network with multi-scale convolution for weed identification in sugarcane field. IEEE Access 12, 31168–31181 (2024).

  12. Romero, K. F. & Heenkenda, M. K. Developing site-specific prescription maps for sugarcane weed control using high-spatial-resolution images and light detection and ranging (LiDAR). Land 13 (2024).

  13. Zhang, W., Miao, Z., Li, N., He, C. & Sun, T. Review of current robotic approaches for precision weed management. Curr. Robot. Rep. 3, 139–151 (2022).

  14. Ajayi, O. G. & Ashi, J. Effect of varying training epochs of a faster region-based convolutional neural network on the accuracy of an automatic weed classification scheme. Smart Agric. Technol. 3, 100128 (2023).

  15. Verçosa, J. P. S. et al. Early detection of weed in sugarcane using convolutional neural network. Int. J. Innov. Educ. Res. 10, 210–226 (2022).

  16. Santiago, W. E. et al. Evaluation of bag-of-features (BoF) technique for weed management in sugarcane production. Aust. J. Crop Sci. 13, 1819–1825 (2019).

  17. R., A. & Ramires, T. Precision agriculture: Herbicide reduction with AI models (2022).

  18. Souza, M. F., Amaral, L. R., Oliveira, S. R. M., Coutinho, M. C. N. & Ferreira Netto, C. Spectral differentiation of sugarcane from weeds. Biosyst. Eng. 190, 41–46 (2020).

  19. Thamoonlest, W. et al. Forecasting gaps in sugarcane fields containing weeds using low-resolution UAV imagery based on a machine-learning approach. Smart Agric. Technol. 10, 100780 (2025).

  20. Modi, R. U. et al. An automated weed identification framework for sugarcane crop: a deep learning approach. Crop Prot. 173, 106360 (2023).

  21. Caron, M. et al. Emerging properties in self-supervised vision transformers (2021).

  22. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization (2017).

  23. Kirillov, A. et al. Segment anything (2023).

  24. Meyer, G. E. & Neto, J. C. Verification of color vegetation indices for automated crop imaging applications. Comput. Electron. Agric. 63, 282–293 (2008).

  25. Kweon, H. & Yoon, K.-J. From SAM to CAMs: exploring Segment Anything Model for weakly supervised semantic segmentation (2024).

  26. Dutta, A. & Zisserman, A. The VIA annotation software for images, audio and video (2019).

  27. Khanam, R. & Hussain, M. YOLOv11: an overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725 (2024).

  28. Lyu, C. et al. RTMDet: an empirical study of designing real-time object detectors. arXiv preprint arXiv:2212.07784 (2022).

  29. Liu, Z. et al. A ConvNet for the 2020s (2022).

  30. Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows (2021).

  31. He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN (2017).

  32. Deng, J. et al. Imagenet: A large-scale hierarchical image database (2009).

  33. He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: surpassing human-level performance on imagenet classification (2015).

  34. He, K. et al. Masked autoencoders are scalable vision learners. In Proc. of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009 (2022).

  35. Li, X., Wang, W., Yang, L. & Yang, J. Uniform masking: enabling MAE pre-training for pyramid-based vision transformers with locality. arXiv:2205.10063 (2022).

  36. Zheng, Z. et al. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 52, 8574–8586 (2021).

  37. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. Int. Conf. Learn. Represent. (2021).

  38. Mehta, S., & Rastegari, M. Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. Int. Conf. Learn. Represent. (2022).

  39. Touvron, H. et al. Training data-efficient image transformers & distillation through attention (2021).

  40. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition (2016).

  41. Woo, S. et al. ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders (2023).

  42. Tan, M. & Le, Q. EfficientNet: rethinking model scaling for convolutional neural networks (2019).

Acknowledgements

J.P.P. was partially funded by the São Paulo Research Foundation (FAPESP), Brazil, under grants#2013/07375-0, #2023/14427-8, and #2024/00202-7, and also by the Brazilian National Council for Scientific and Technological Development (CNPq) grant #308529/2021-9. J.R.R.M. was partially funded by the São Paulo Research Foundation (FAPESP) grant #2024/00789-8.

Author information

Contributions

All authors, including J.P.P., J.R.R.M., M.S., C.J., J.H., D.J.S., R.M.J., and H.I.K., contributed to the experimental design, paper writing, and review. In addition, J.P.P. and J.R.R.M. were in charge of coding. M.S. and C.J. were in charge of annotation. R.M.J. and H.I.K. were in charge of data collection. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hermano I. Krebs.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Papa, J.P., Manesco, J.R.R., Schoder, M. et al. Toward autonomous weed management systems in sugarcane crops and an assessment of technological readiness.
npj Artif. Intell. 2, 40 (2026). https://doi.org/10.1038/s44387-026-00096-0


