Advancing agricultural research using machine learning algorithms

Two databases including yield, management, and weather data for maize (n = 17,013) and soybean (n = 24,848) involving US crop performance trials conducted in 28 states between 2016 to 2018 for maize and between 2014 to 2018 for soybean, were developed (Fig. 1). Crop yield and management data were obtained from publicly available variety performance trials which are typically performed yearly in several locations across each state (see methods for more information). Final databases were separated in training (80% of database) and testing (20% of database) datasets using stratified sampling by year, use of irrigation, and soil type. For each crop, an extreme gradient boosting (XGBoost, see methods for more information) algorithm to estimate yield based on soil type and weather conditions (E), seed traits (G) and management practices (M) was developed (see variables listed in Tables S1 and S2 for maize and soybean, respectively, and data science workflow in Fig. S1).

Figure 1

Locations where maize and soybean trials were performed during the examined period. The map was developed in ArcGIS Pro 2.8.0 (https://www.esri.com).

Full size image

The developed algorithms exhibited a high degree of accuracy when estimating yield in independent datasets (test dataset not used for model calibration) (Fig. 2). For maize, the root mean square error (RMSE) and mean absolute error (MAE) was a respective 4.7 and 3.6% of the dataset average yield (13,340 kg/ha). For soybean, the respective RMSE and MAE was 6.4 and 4.9% of the dataset average yield (4153 kg/ha). As is evident in the graphs (Fig. 2), estimated yields exhibited a high degree of correlation with actual yields for both algorithms in the independent datasets. For maize and soybean, 72.3 and 60% of cases in the test dataset deviated less than 5% from actual yields, respectively. Maximum deviation for maize and soybean reached 43 and 70%, respectively. Data points with deviations greater than 15% from actual yield were 1.5% in maize and 3.6% in soybean databases. These results suggest that the developed algorithms can accurately estimate maize and soybean yields utilizing database-generated information involving reported environmental, seed genetic, and crop management variables.

Figure 2

Actual versus algorithm-derived maize (left) and soybean (right) yield in test datasets. Black solid line indicates y = x, red short-dashed lines, black dashed lines, and red long-dashed lines indicate ± 5, 10, and 15% deviation from the y = x line. RMSE, root mean square error; MAE, mean absolute error; r², coefficient of determination; n = number of observations. Each observation corresponds to a yield of an individual cropping system in a specific environment (location-year).

Full size image

In contrast to statistical models, ML algorithms can be complex, and the effect of single independent variables may not obvious. However, accumulated local effects (ALE) plots¹⁴ can aid the understanding and visualization of important and possibly correlated features in ML algorithms. For both crops, indicatively important variables included sowing date, seeding rate, nitrogen fertilizer (for maize), row spacing (for soybean) and June to September cumulative precipitation (Fig. 3). Across the entire region and for both crops, the algorithm-derived trends suggest that above average yields occur in late April to early May sowing dates, but sharply decrease thereafter. Similar responses have been observed in many regional studies across the US for both, maize^15,16,17,18 and soybean¹⁹. Similarly, simulated yield curves due to increasing seeding rate are in close agreement with previous maize^20,21 and soybean²² studies. The maize algorithm has captured the increasing yield due to increasing N fertilizer rate. The soybean algorithm suggests that narrower row spacing resulted in above average yield compared to wider spacing. Such response has been observed in many regions across the US²³. Season cumulative precipitation between 400 and 700 mm resulted in above average yields for both crops.

Figure 3

Accumulated local effect plots for maize sowing date (A), seeding rate (B), Nitrogen fertilizer rate (C), and cumulative precipitation between June and September (mm) (D), and soybean sowing date (E), seeding rate (F), row spacing (G), and cumulative precipitation between June and September (mm) (H).

Full size image

The responses in the ALE plots (Fig. 3) suggest that these algorithms have captured the general expected average responses for important single features. Nevertheless, our databases include hundreds of locations with diverse environments across the US and site-specific crop responses which may vary due to components of the G × E × M interaction. We argue that, instead of examining a single or low-order management interactions, site-specific evaluation of complex high order interactions (a.k.a. cropping systems) can reveal yield differences that current research approaches cannot fully explore and quantify. For example, sowing date exerts a well-known impact on maize and soybean yield. For each crop separately, by creating a hypothetical cropping system (a single combination of all management and traits in Tables S1 and S2) in a randomly chosen field in south central Wisconsin (latitude = 43.34, longitude = -89.38), and by applying the developed algorithms, we can generate estimates of maize and soybean yield. For that specific field and cropping system (out of the vast number of management combinations a farmer can choose from), maize yield with May 1st sowing was 711 kg/ha greater (6% increase) than June sowing (Fig. 4A). By creating scenarios with 256 background cropping system choices (Table S3), the resultant algorithm-derived yield estimate difference for the same sowing date contrast (averaged across varying cropping systems) was smaller but still positive (3% increase), although the range of possible yield differences was wider (Fig. 4B). However, when comparing, instead of averaging, the estimated yield potential among the simulated cropping systems, a 2903 kg/ha yield difference (25% difference) was observed (Fig. 4C). Interestingly, when focusing on the early sown fields that were expected to exhibit the greatest yield, the same yield difference was observed (Fig. 4D). This result shows that sub-optimal background management can mitigate the beneficial effect of early sowing (Table S4).

Figure 4

Maize yield difference (in kg/ha and percentage) due to sowing date (May 1st vs. June 1st) for a single identical background cropping system (A), maize yield difference due to sowing date when averaged across 256 (3 years × 256 cropping systems = 768 year-specific yields) (B), maize yield variability in each of the 256 cropping systems (C), and maize yield variability in each of the 128 cropping systems with early sowing (D). Soybean yield difference due to sowing date (May 1st vs June 1st) for a single identical background cropping system (E), soybean yield difference due to sowing date when averaged across 128 (5 years × 128 cropping systems = 640 year-specific yields) (F), soybean yield in each of the 128 cropping systems (G) and soybean yield variability due in each of the 64 cropping systems with early sowing (H). Within each panel, the horizontal red and grey lines indicate the boxplot with maximum and minimum yield, respectively. In the left four panels, boxes delimit first and third quartiles; solid lines inside boxes indicate median and green triangles indicate means. Upper and lower whiskers extend to maximum and minimum yields. Each maize and soybean cropping system is a respective 8-way and a 7-way interaction of management practices in a randomly chosen field in Wisconsin, USA (Table S3 and S5, respectively).

Full size image

In the case of soybean, a May 1st sowing resulted in greater yield (588 kg/ha; a 14% increase) than a June 1st in the single background cropping system (Fig. 4E). The result was consistent when yield differences due to sowing date were averaged across 128 background cropping system choices (Table S5) (Fig. 4F). Similar to what was observed in maize, among all cropping systems, yield varied by 1704 kg/ha (44% difference) (Fig. 4G). When focusing only on the early sown fields, a 1181 kg/ha yield difference (27% yield increase) was observed (Fig. 4H). In agreement with maize, this result highlights the importance of accounting for sub-optimal background management which can mitigate the beneficial effect of early sowing (Table S6).

We note here the ability of farmers to change management practices can be limited due to an equipment constraint (e.g., change planter unit row width) or simply impossible (e.g., change the previous year’s crop). Thus, recommended management practices that were evaluated in studies that used specific background management may not be applicable in some instances. The benefits of the foregoing approach, which involves extensive up-to-date agronomic datasets and high-level computational programing, can have important and immediate implications in future agricultural trials. Our approach allows for more precise examination of complex management interactions in specific environments (soil type and growing season weather) across the US (region covered in Fig. 1). The ability to extract single management practice information (even across cropping systems) is also possible by utilizing ALE plots, or by calculation of the frequency at which a given level/rate of a management practice appeared among the highest yielding cropping systems (Tables S4 and S6).

Among all available 30-d weather variables, many were strongly correlated in both crop databases (Figs. S2 and S3 for maize and soybean, respectively). Models using all 30-d interval variables with r < 0.7 (Tables S8 and S9) showed minimal to no performance gain compared to the final more parsimonious models that included season-long weather variables (Fig. S4). Thus, we consider the length of periods we chose to represent well the approximate successive 60-d pre-sowing, 120-d in-season, and 60-d post-harvest segments of growing season in the US (Fig. S7). Season-long weather conditions have been used in previous studies^13,24, and it has been shown that choice of growing season does not affect climate-related effects on crop yield^25,26.

As an additional sensitivity analysis, we developed ALE plots for the algorithms using the aforementioned 30-d weather variables (Fig. S8). For major management practices, there were no differences in simulated responses between the algorithms that used multiple 30-d weather variables and the final chosen algorithms that used longer intervals (Fig. 3). Repeating the analysis for the same hypothetical cropping system in the same Wisconsin location using the algorithms developed with the 30-d weather conditions, the observed trends were consistent with the season-long weather algorithms, although the simulated yields were numerical different (Fig. S9). Nevertheless, across all representations of weather conditions (algorithms with 30-d intervals and season-long), the levels/rates of management practices in the 5% highest and lowest yielding maize and 5% highest soybean cropping systems with early sowing date were identical, apart from manure use in maize. Based on these results, we consider the algorithm-derived yield estimates robust to different representations of seasonal weather variability.

It appears that several different cropping systems can result in similar high yield for both crops (Fig. 4C,D,G,H). This is in agreement with other agricultural decision maker tools²⁷. Moreover, it is common for neighboring farms to attain similar crop yield despite the use of a different cropping system, suggesting that a single optimal solution does not necessarily exist and that different combinations of management practices, when they interact with environment, can still result in similar high yields. Since the effect of environment is ever-changing, the high level of complexity of synergies between G × E × M suggests that long-term optimization of single management factor may not be possible²⁸, which further highlights the importance of accounting for the effect of the entire cropping system at the field level.

The approach we present here should not be considered as a crop yield forecasting exercise. There have been several attempts to forecast crop yields using deep neural network methods (e.g.,^29,30). In contrast, the algorithms we present here can generate hypothetical experimental data that can be used to rapidly examine G × E × M interaction for both maize and soybean across the US. Of the millions of possible G × E × M combinations, our ML algorithms can identify hidden complex patterns between G × E × M combinations for yield optimization that may be non-obvious, but once identified, worthy of field test confirmation. Farmers can use the algorithms to gain insights about optimum management interactions in their location-specific environment (known soil type × expected weather conditions), and to identify farm factors that may be too costly to alter without a priori reason (generated by the model) for doing so. Researchers can compare expected yield across thousands of hypothetical cropping systems and use the results as a guide to design more efficient future field-based crop management practice evaluation experiments.

We note that this approach should not be considered as a substitute of replicated trials. To the contrary, replicated field trials performed by Universities are continually needed to serve as an excellent source of high-quality unbiased data which can be used to train even more comprehensive algorithms. The major issue with current performance trial data is that a great amount of management information is not reported. Usually, only information relevant to the examined management factors in each trial are reported, which inevitably results in missing values (Tables S1 and S2), or even in absence of important variables (e.g., number and dates of split fertilizer application). As we have highlighted here, the high order and complex background management interactions should not be considered as irrelevant.

Advancing agricultural research using machine learning algorithms

Global hunter-gatherer population densities constrained by influence of seasonality on diet composition

Landscape genetics and the genetic legacy of Upper Paleolithic and Mesolithic hunter-gatherers in the modern Caucasus

ITALIAN LANGUAGE

ENGLISH LANGUAGE