# Predicting the risk of pipe failure using gradient boosted decision trees and weighted risk analysis

### Receiver operating characteristic curve and area under the curve

The receiver operating characteristic (ROC) curve is used to visualise how the model performs independently of the decision threshold, providing a useful tool for visualising how well the classifier avoids false classifications [32]. The ROC plot shows a trade-off between the True Positive Rate (TPR) or sensitivity, the fraction of observed failures that are correctly classified as failures, calculated in Eq. (1) as

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

(1)

where TP is True Positive and FN is False Negative, and the False Positive Rate (FPR), equal to 1 − specificity, the fraction of observed non-failures that are incorrectly classified as failures, calculated from the False Positives (FP) and True Negatives (TN) in Eq. (2) as

$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$$

(2)

A point with a TPR of 100% and an FPR of 0% represents perfect discriminatory ability. This is graphically represented by the ROC curve passing through the upper left-hand corner of the plot. A curve lying along the diagonal y = x represents a model that is no better than a random guess [33]. The Area Under the Curve (AUC) is an aggregated measure of performance across all classification thresholds and represents the measure of separability by describing the capability of the predictions in distinguishing between the classes. An AUC measure is returned between zero and one, with zero representing a perfectly inaccurate test, 0.5 a random guess, and one a perfect test. In general, an AUC of 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and >0.9 is outstanding [34]. Figure 1 shows the ROC curve for the test dataset close to the top left-hand corner and an AUC value of 0.89, suggesting the model has an excellent discriminative ability to distinguish between the classes, and the TPR and FPR appear robust enough to predict failures on the unseen test data.
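The construction of the ROC curve and the AUC described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the study's implementation: the function names `roc_points` and `auc` and the labels and probabilities are hypothetical, and the area is computed with the trapezoidal rule.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Visit observations from the highest to the lowest predicted probability.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all observations tied at this score are counted.
        is_last = i == len(ranked) - 1
        if is_last or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative labels (1 = failure) and predicted probabilities, not study data.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.05, 0.10, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90]
curve = roc_points(y_true, y_score)
print(auc(curve))
```

In this toy example, 15 of the 16 failure/non-failure pairs are ranked correctly, so the AUC is 15/16.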

The calibration curve provides a means of observing how close the predicted probabilities are to the observed outcomes. Since the outcome in this model is the probability of failure between 0 and 1, it is appropriate to use a binning method. Binning is advantageous since it averages the probability of failure for each bin, which provides a useful graphical representation of how well the model is calibrated. The mean predicted probability is then compared to the frequency of observed failures in each bin. In this case, a fixed-width binning approach is used, where the data is partitioned into ten bins, known as decile analysis, an approach used in similar studies [35]. A reliability curve provides a means of visualising this comparison, whereby perfectly calibrated probabilities would lie on a diagonal line through the middle of the plot. The Brier score is a useful measure of accuracy for probabilistic predictions and is equivalent to the mean squared error, whereby the cost function minimises to zero for a perfect model and maximises to 1 for a model with no accuracy [4]. The Brier score (BS) is calculated in Eq. (3) as

$$\mathrm{BS} = \frac{1}{N}\sum\limits_{i = 1}^{N} (P_i - O_i)^2$$

(3)

where N is the total number of observations, \(P_i\) is the predicted probability and \(O_i\) is the event outcome (1 for failure, 0 for no failure). Figure 2 shows the calibration plot for the model and suggests the model is well calibrated for the lower and upper deciles, since most bins fit the diagonal. The upper-middle deciles do not fit the diagonal, with the calibration curve departing from it, suggesting the predictions have a lower probability than those seen in the data. The Brier score of 0.007 is low, suggesting accurate predictions overall.
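As a hedged illustration of Eq. (3) and the fixed-width binning described above, the sketch below computes a Brier score and the per-bin comparison that a reliability curve plots. The function names and the probabilities and outcomes are hypothetical and are not taken from the study.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome (Eq. 3)."""
    n = len(probs)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n


def calibration_bins(probs, outcomes, n_bins=10):
    """Group predictions into fixed-width bins.

    Returns one (mean predicted probability, observed failure rate, count)
    tuple per non-empty bin, ordered from the lowest to the highest bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, o))
    rows = []
    for members in bins:
        if members:  # skip empty bins
            mean_pred = sum(p for p, _ in members) / len(members)
            obs_rate = sum(o for _, o in members) / len(members)
            rows.append((mean_pred, obs_rate, len(members)))
    return rows


# Illustrative predictions and outcomes (1 = failure), not study data.
probs = [0.02, 0.04, 0.08, 0.15, 0.55, 0.65, 0.92, 0.95]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
print(brier_score(probs, outcomes))
for mean_pred, obs_rate, count in calibration_bins(probs, outcomes):
    print(round(mean_pred, 3), round(obs_rate, 3), count)
```

A well-calibrated model gives rows in which the mean predicted probability and the observed failure rate are close to each other, which corresponds to points near the diagonal of the reliability curve.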

### Confusion matrix and accuracy

The confusion matrix describes the frequency of classification outcomes by explicitly defining the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). The decision to convert a predicted probability into a class label is determined by a probability threshold such that the value of the response is \(y_i = \left\{ \begin{array}{l} \mathrm{no}\,\mathrm{failure}\,\mathrm{if}\,P_i \le \mathrm{threshold} \\ \mathrm{failure}\,\mathrm{if}\,P_i > \mathrm{threshold} \end{array} \right.\). The default probability threshold within the model is 0.5 [36]. By this definition, there remains a practical need to optimise the probability threshold specifically to the behaviour of pipe failures within the imbalanced test data. An optimal probability threshold typically strikes a balance between sensitivity and specificity. However, there is a trade-off between TPR and FPR when altering the threshold, where increasing or decreasing the TPR typically results in the same for the FPR and vice versa. Probability threshold optimisation is an important step in the decision-making process and is specific to each problem. In the case of pipe replacement, expert judgement should be used by reasoning that water companies would seek to avoid unnecessarily replacing pipes that may have a longevity of several decades more, resulting in wasted maintenance effort and cost. Furthermore, only 0.5–1% of the network is typically replaced each year due to budget constraints [37]. It is therefore important to only identify pipes with the highest probability of failure. Considering this, the optimal threshold is set to reduce the FPs (i.e., pipes predicted to fail when they have not). This also reduces the number of TPs predicted, as discussed above, but targets those pipes most likely to fail.
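The decision rule above can be illustrated with a short sketch that counts the four confusion matrix cells at a given threshold. The function name `confusion_counts` and the data are hypothetical; the only element taken from the text is the rule that a pipe is labelled a failure when its predicted probability is strictly greater than the threshold.

```python
def confusion_counts(probs, outcomes, threshold=0.5):
    """Return (TP, FP, TN, FN) when 'failure' is predicted for P_i > threshold."""
    tp = fp = tn = fn = 0
    for p, o in zip(probs, outcomes):
        predicted_failure = p > threshold
        if predicted_failure and o == 1:
            tp += 1
        elif predicted_failure and o == 0:
            fp += 1
        elif not predicted_failure and o == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Illustrative predictions and outcomes (1 = failure), not study data.
probs = [0.05, 0.20, 0.30, 0.45, 0.55, 0.60, 0.80, 0.90]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
print(confusion_counts(probs, outcomes, threshold=0.5))
print(confusion_counts(probs, outcomes, threshold=0.7))
```

In this toy data, raising the threshold from 0.5 to 0.7 removes the single false positive but also lowers the true positives from three to two, which is the trade-off the paragraph describes.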

A factorial experimental design was used, whereby the threshold was iterated from 0.01 through to 0.99, observing each threshold to reveal the point where the highest accuracy meets the lowest FP value. The Matthews Correlation Coefficient (MCC) was used to measure accuracy and is useful for imbalanced data since it accounts for the difference in class size and only returns a high accuracy score if all four confusion matrix categories are accurately represented. For this reason, Chicco (2017) argues that it is the correct measure for imbalanced data sets. The MCC ranges from a worst value of −1 to a best value of +1, with 0 indicating predictions no better than random, and is calculated as shown in Eq. (4):

$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}$$

(4)
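A minimal sketch of the threshold sweep with Eq. (4) follows. Several choices here are assumptions made for illustration rather than details from the study: the helper names, the toy data, returning an MCC of 0 when the denominator is zero, and breaking ties in MCC in favour of fewer non-failing pipes flagged as failures, which is one reading of the stated aim of avoiding unnecessary replacement.

```python
import math


def confusion_counts(probs, outcomes, threshold):
    """Return (TP, FP, TN, FN) when 'failure' is predicted for P_i > threshold."""
    tp = fp = tn = fn = 0
    for p, o in zip(probs, outcomes):
        if p > threshold:
            if o == 1:
                tp += 1
            else:
                fp += 1
        elif o == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient as in Eq. (4)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


def sweep_thresholds(probs, outcomes):
    """Evaluate thresholds 0.01, 0.02, ..., 0.99 and record MCC and the counts."""
    rows = []
    for k in range(1, 100):
        threshold = k / 100
        tp, fp, tn, fn = confusion_counts(probs, outcomes, threshold)
        rows.append((threshold, mcc(tp, fp, tn, fn), tp, fp, tn, fn))
    return rows


# Illustrative predictions and outcomes (1 = failure), not study data.
probs = [0.05, 0.20, 0.30, 0.45, 0.55, 0.60, 0.80, 0.90]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
rows = sweep_thresholds(probs, outcomes)
# Highest MCC first; ties go to the row with fewer false positives (assumption).
best = max(rows, key=lambda row: (row[1], -row[3]))
print(best)
```

Because `max` returns the first maximal row, any remaining ties resolve to the lowest threshold in the sweep.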

Table 1 shows a small range of the thresholds for brevity. The optimal threshold in this instance has been identified firstly with the highest MCC accuracy and then the lowest FP. The MCC of 0.27 suggests the model is better than random, but a low MCC value also represents a high percentage of false negatives (i.e., failures incorrectly identified as non-failure). The balanced accuracy is also a good measure of accuracy for imbalanced classes, where 1 is high and 0 is low. The balanced accuracy for this model is 0.65. In practical terms, the results are helpful for water companies to target areas for further investigation and potential replacement since they focus on those pipes having the highest probability of failure, yet there are still incorrect predictions that could lead to the unnecessary replacement of pipes. The model predicts 20.20% of all failures occurring in the WDN, found in 7.83% of the WDN pipe network. The results show that approximately 32.80% of the observed pipe failures were correctly predicted as failures, whilst approximately 67.20% of the observed pipe failures were falsely predicted as no failure. If desired, water companies could choose an alternative threshold, one that eliminates FP predictions; however, the number of TP predictions will also reduce.
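For reference, balanced accuracy (BA) is conventionally defined as the mean of sensitivity and specificity. That definition is an assumption here, since the text does not state which formula was used. Under it, and assuming the reported 32.80% sensitivity applies at the same threshold, a balanced accuracy of 0.65 would imply a specificity of roughly 0.97, subject to rounding of the reported values:

$$\mathrm{BA} = \frac{1}{2}\left( \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \right), \qquad \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \approx 2 \times 0.65 - 0.328 = 0.972$$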

### Relative variable influence

The relative variable influence shows the empirical improvement \(I_t^2\) accounted for by splits on variable \(x_j\), averaged across all boosted trees, as presented in Eq. (5) [38]:

$$\hat{J}_j^2 = \sum\limits_{\mathrm{splits}\,\mathrm{on}\,x_j} I_t^2$$

(5)

The variable influence helps understand which variables contribute more when predicting pipe failures. For GBT models, this is the summation of predictor influence accumulated over all the classifiers. Figure 3 shows the results, which are broadly similar to findings in the existing literature. The most important variables are the number of previous failures and pipe length, both a proxy for pipe performance and deterioration. It is worth reiterating that both variables represent the grouped pipe and do not consider individual pipe history. Soil Moisture Deficit (SMD) is the most important weather variable, being linked with shrinkage of clay soils and subsequent ground movement in AC pipe failures. Conversely, clay soils and soil shrink–swell potential, both representing ground movement, show lower influence.
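The accumulation in Eq. (5) can be sketched as follows. The representation of each tree as a list of (variable, squared improvement) pairs, the variable names, the numbers, and the rescaling so that the influences sum to 100 are all assumptions made for illustration and are not the study's data or code.

```python
from collections import defaultdict


def relative_influence(trees):
    """Sum the improvements over splits on each variable (Eq. 5), average
    across trees, and rescale so the influences sum to 100."""
    totals = defaultdict(float)
    for splits in trees:
        for variable, improvement in splits:
            totals[variable] += improvement
    n_trees = len(trees)
    averaged = {name: total / n_trees for name, total in totals.items()}
    scale = sum(averaged.values())
    return {name: 100 * value / scale for name, value in averaged.items()}


# Hypothetical boosted trees: each split records the variable used and its
# squared improvement I_t^2.
trees = [
    [("previous_failures", 4.0), ("pipe_length", 2.0), ("previous_failures", 1.0)],
    [("pipe_length", 2.0), ("smd", 1.0)],
]
influence = relative_influence(trees)
for name, value in sorted(influence.items(), key=lambda item: -item[1]):
    print(name, value)
```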

Pipe diameter and material are less important factors in this network than reported in comparable studies [11,20,21,39]. The relative variable influence of days of air frost and temperature is not as high as expected, given their correlation with high pipe failure frequency in iron pipes and the large percentage of iron pipes in the WDN. It is likely to be a result of over-summarising the data to facilitate the annual prediction interval. A shorter prediction interval (week or month) for network-wide groups of pipes is necessary to capture intra-annual variation accurately, but short prediction intervals in the authors’ experience can result in low predictive accuracy. The overall relative variable influence of soil (shrink–swell, soil corrosivity, Hydrology of Soil Type) is low. From literature and an engineering perspective, soil corrosion is strongly related to the deterioration of metal pipes and their ability to withstand internal and external forces [3]. It is possible that many pipes in this network may have been rehabilitated and protected against corrosion; however, this information was unavailable at the time of this study. Water source is the only operational variable and shows low influence compared to many other variables. The most important water source is surface water, which has lower temperatures during the winter due to its exposure to weather. This causes higher failure rates in metal pipes, yet compared to other variables, the influence is low. Other variables could be considered, such as installation details like bedding and backfill material, surrounding environments providing evidence on loading such as traffic loading and construction works, operational data such as pipe pressure and transients, water quality, and spatial failure characteristics. These are not investigated here but would likely result in performance gains.

### Risk mapping

For the mapping to be effective from an asset management standpoint, the results of the weighted risk analysis should be able to separate pipes into low, medium, and high risk of failure. The number of high-risk pipes is expected to be small for two reasons: (1) pipes rarely fail more than once and (2) utilities are only able to allocate investments to those at the greatest risk due to budget limitations and are therefore only interested in the top 1–2% of pipes. The outcome of the weighted risk analysis is presented in Fig. 4, representing a small section of the WDN for clarity. Natural Jenks arranges the risk level into three categories: low [0; ≤0.02], medium [>0.02; ≤0.06] and high [>0.06; ≤0.92]. In this scenario, the length of pipe in the high-risk category is 13.9 km of the 300.7 km, or 4.6% of the pipe network presented in Fig. 4, a useful percentage of the network to target for management decisions. The choropleth risk map approach is an important means of visualising individual pipes or clusters of pipes with the highest risk in the WDN, as evidenced in Fig. 4. Figure 4 also highlights how many pipes in this section of the network have a low risk, which is to be expected since many pipes have a low probability of failure and have small diameters, potentially causing less damage if they fail.
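The sketch below applies the three risk categories using the break values quoted above (0.02 and 0.06) and computes the share of pipe length falling in a category. It does not reproduce the Natural Jenks optimisation that produced those breaks; the function names and the example pipes are hypothetical.

```python
def risk_category(score, breaks=(0.02, 0.06)):
    """Assign a risk score to low, medium or high using upper-inclusive breaks."""
    low_upper, medium_upper = breaks
    if score <= low_upper:
        return "low"
    if score <= medium_upper:
        return "medium"
    return "high"


def length_share(pipes, category):
    """Percentage of total pipe length whose risk score falls in the category."""
    total = sum(length for _, length in pipes)
    selected = sum(length for score, length in pipes
                   if risk_category(score) == category)
    return 100 * selected / total


# Hypothetical (risk score, length in km) pairs, not study data.
pipes = [(0.01, 120.0), (0.015, 90.0), (0.03, 60.0),
         (0.05, 20.0), (0.08, 7.0), (0.40, 3.0)]
for name in ("low", "medium", "high"):
    print(name, round(length_share(pipes, name), 1))

# The share reported in the text: 13.9 km of 300.7 km in the high-risk category.
print(round(100 * 13.9 / 300.7, 1))
```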

### Practical considerations

Creating groups of pipes was an important step given the low frequency of failures in the UK WDN dataset. Grouping pipes assumes that all pipes in a group share similar failure rates, which is not the case when grouping across a whole network; grouping at the 1 km spatial scale adopted here presents a suitable solution to this limitation. Grouping pipes on a lower spatial scale can capture localised influences on pipe performance that can often be obfuscated when generalising over the whole network. However, the approach used may not be as useful for rural areas where fewer pipes are present, where smaller scales may be more appropriate (e.g., 1:100,000 is a smaller scale than 1:100). Further investigation into grouping scales is merited. Optimising the threshold is challenging and inevitably leads to inappropriately classified failures on either side of the threshold. Optimising is even more difficult with imbalanced data sets since conventional classification methods are built to assume that all classes are equal. An alternative approach was applied in this study, which used MCC accuracy and FP to set a threshold, reducing the potential for wasting budgets replacing pipes that will not fail. In the process, the number of TPs was reduced to 32.80% of the observed pipe failures, whilst the number of FNs was 67.20% of the observed pipe failures, which may not present a good argument to professionals. Despite this, the results can be used directly in strategic planning, which sets long-term key decisions regarding maintenance and potential replacement of pipes. Predicting the probability of failure is an essential response since it enables the identification and prioritisation of risk across the network. This methodology could also be used to provide longer-term predictions to support the development of Asset Management Plans, which cover a five-year period of regulated investment.

Categorising the pipes based on a weighted risk analysis and visually presenting them using Natural Jenks offers a useful method for prioritising pipes based on the consequence of their failure and is an easily assessed cartographic presentation. It extends the probability of failure into a more useful measure of risk, providing more information for decision makers. The use of distance to property in this study is a simple approach to determine flooding. To provide a realistic determination of flooding, an understanding of key geographical features for overland flow routing is required [40]. The list of consequences was limited in this study and could be extended when such data is available. There are potentially numerous consequences of failure inherent to each network, yet common consequences include loss of water, potential disruption, reduction in water quality, reliability, direct costs (damage to property and infrastructure, and pipe repair and replacement) and indirect costs (environmental and social) [8]. In this study, the risk estimates were determined by expert knowledge, and any contextual mismatch between weightings could potentially skew the outcomes. Therefore, the weightings should be considered carefully by network professionals. At an engineering level, the risk mapping can be further used to determine areas of the network leading to a high probability of failure, which can be used to take constructive pre-emptive actions towards extending the life of future pipe construction [41].

The economic benefits of this model will manifest when performing proactive maintenance, potentially averting associated risks that may arise from damaging properties and infrastructure. It is anticipated that the modelling approach proposed will enhance decision-making at a local level, facilitated through numerical outputs which report on the serviceability of the WDN and help meet regulatory performance targets, avoiding heavy fines. Operationally, the approach will help with highlighting short pipe segments for repair and replacement through graphical outputs; these are practical lengths of pipe for operational teams, who typically do not replace kilometres of pipe at any given time [42]. This approach shows similar performance to comparable GBT studies [11,20], but is beneficial since the method provides reliable predictions on a shorter annual time frame. The method here is also computationally easier to develop than other more complex machine learning methods such as neural networks and Bayesian Neural Networks.

The predictions rely on the quality of the data, and several challenges were presented during the cleaning and processing, most notably the location of the pipe failures, many of which were geographically displaced, some by a considerable distance, yet it was necessary to retain all the failures in the dataset. These were snapped to the nearest pipe with similar characteristics, yet it is conceivable that some were incorrectly placed despite the protocols established for the snapping process. Further limitations to the study include limited data, where pressure data or other operational data may have proved useful, the advantage of which may consist of increased model accuracy and interpretability. Over-summarised local conditions can also affect the model accuracy, and in this study, the local soil conditions were presented from a soil map at 1:250,000 scale. Likewise, the weather variables were highly summarised to an annual scale from a 40 × 40 km grid source. Inevitably these limitations will affect the model, which can potentially hinder effective decision-making. There are several challenges faced when modelling pipe failures, from uncertainties in data collection and management to specific data processing solutions. There is a need to understand these holistically and from the view of current practice, to gain a more in-depth perspective of the challenges that may hinder useful data gathering. In addition, future research aimed at understanding how practitioners understand pipe failure models, their limitations, and opportunities would be beneficial, since there is often a discord between the capabilities of modelling and user expectations. This further research may help to improve pipe failure models by encouraging enhancements in the pipe failure modelling process that promote quality data capture.

### Concluding remarks

This study considered the prediction of pipe failures using a GBT model and established the risk based on weighted risk analysis to prioritise pipes for proactive management. A 1 km spatial scale was included in this model when grouping the pipes, which aimed to capture localised conditions and remove the failure rate disparities shared when grouping pipes across a network. This spatial scale, together with a short prediction interval, the absence of some essential variables, and additional inherent problems with pipe failure data sets, has ultimately resulted in acceptable accuracy. However, in practical terms, when used in conjunction with expert knowledge, the results provide a useful approximation of potential failures and a better understanding of the current WDN to help plan rehabilitation and replacement efforts. Improving model accuracy may be achieved by increasing the prediction interval to match the five-year asset management plan, potentially accumulating more failures per pipe group from which to predict. Yet this may not be as useful to water companies, where management decisions are typically annual. Furthermore, understanding the issues faced with data collection and quality from current practice may help to encourage improvements in data quantity and quality, and could potentially provide marked improvements in the final predictions.

Further suggested research includes exploring different pipe grouping variations, collecting more data on the consequences of failure to enhance the weighted risk analysis and, expanding on this idea, understanding the data quantity and quality issues from current practice, and exploring feature engineering techniques to derive more valuable data sets that may improve model accuracy.

Source: Resources - nature.com