2.3.2 Evaluation of model predictors
The input variables fed into the XGBoost algorithm are listed in Table A1. The input features encompass 9 meteorological parameters as simulated by the GEOS-CF model (surface northward and eastward wind components, surface temperature and skin temperature, surface relative humidity, total cloud coverage, total precipitation, surface pressure, and planetary boundary layer height), modeled surface concentrations of 51 chemical species (O3, NOx, carbon monoxide, volatile organic compounds (VOCs), and aerosols), and 21 modeled emissions at the given location. We also provide the hour of the day, day of the week, and month of the year as input features; these allow the machine learning model to identify systematic observation-model mismatches related to the diurnal, weekly, and seasonal cycles of the pollutants. Finally, for sites with observations available for the full two years, we provide the number of calendar days since 1 January 2018 as an additional input feature to correct for inter-annual trends in air pollution, e.g., due to a steady decrease in emissions not captured by the model. This approach is similar to that of Ivatt and Evans (2020) and Petetin et al. (2020).
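As an illustration of the calendar-based predictors described above, the following minimal sketch derives them from a single observation time. The function name, the dictionary keys, and the choice of whole days for the trend feature are assumptions made for this example and are not taken from the study's code; only the reference date of 1 January 2018 comes from the text.

```python
from datetime import datetime

# Reference date for the trend feature, as stated in the text.
TREND_ORIGIN = datetime(2018, 1, 1)


def time_features(timestamp, include_trend=True):
    """Return calendar-based input features for one observation time.

    Hypothetical helper: names and encodings are illustrative assumptions.
    """
    features = {
        "hour": timestamp.hour,          # 0-23, captures the diurnal cycle
        "weekday": timestamp.weekday(),  # 0 = Monday ... 6 = Sunday, weekly cycle
        "month": timestamp.month,        # 1-12, captures the seasonal cycle
    }
    if include_trend:
        # Whole calendar days elapsed since 1 January 2018.
        features["trendday"] = (timestamp.date() - TREND_ORIGIN.date()).days
    return features


# Example: 15 March 2019, 14:00.
example = time_features(datetime(2019, 3, 15, 14, 0))
# Sites without the full two years of data would omit the trend feature.
short_record = time_features(datetime(2019, 3, 15, 14, 0), include_trend=False)
```

Integer encodings such as these are generally sufficient for tree-based models, which split on thresholds and do not require the cyclic (sine/cosine) encodings that other model types may benefit from.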
Gradient-boosted tree models consist of a tree-like decision structure, which can be analyzed to understand how the model uses the input features to make a prediction. Particularly useful in this context is the SHapley Additive exPlanations (SHAP) approach, which is based on game-theoretic Shapley values and provides a measure of each feature’s responsibility for a change in the model prediction (Lundberg et al., 2017). SHAP values are computed separately for each individual model prediction, offering detailed insight into the importance of each input feature to that prediction while also accounting for feature interactions (Lundberg et al., 2020). In addition, combining the local SHAP values offers a representation of the global structure of the machine learning model.
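To make the game-theoretic idea concrete, the sketch below computes exact Shapley values by brute-force enumeration of feature coalitions for a small toy function. This is not the tree-specific algorithm used for gradient-boosted models, which is far more efficient; the toy model, its coefficients, the feature names, and the all-zero baseline are assumptions chosen purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_bias_model(z):
    # Hypothetical bias model: two additive terms plus one interaction.
    no2, hour, zpbl = z
    return 0.5 * no2 + 0.1 * hour + 2.0 * no2 * zpbl


x = [10.0, 8.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_bias_model, x, baseline)
total_change = toy_bias_model(x) - toy_bias_model(baseline)
```

The per-feature values sum to the difference between the prediction and the baseline prediction, and the contribution of the interaction term is split equally between the two features involved. These are the properties that make the local values additive and allow them to be combined into a global picture of the model.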
Figure A4 shows the distribution of the SHAP values for all NO2 predictors separated by polluted sites (left panel) and non-polluted sites (right panel), with polluted sites defined as locations with an annual average NO2 concentration of more than 15 ppbv. Generally, the model-predicted (uncorrected) NO2 concentration is the most important predictor for the model bias, followed by the hour of the day, the day since 1 January 2018 (“trendday”), and a suite of meteorological variables including the wind components (u10m, v10m), planetary boundary layer height (zpbl), and specific humidity (q10m). All of these factors are expected to strongly affect NO2 concentrations, and it is thus not surprising that the model biases are most sensitive to them. While there is considerable spread in the feature importance across the individual sites, there is little overall difference in the feature ranking between polluted and non-polluted sites.
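The grouping and ranking described above can be sketched as follows. The site names and SHAP values are invented placeholders, and ranking by mean absolute SHAP value is one common summary that is assumed here for illustration; it is not necessarily the statistic underlying Figure A4. Only the 15 ppbv threshold is taken from the text.

```python
# Annual-mean NO2 threshold separating polluted from non-polluted sites.
POLLUTED_THRESHOLD_PPBV = 15.0


def is_polluted(annual_mean_no2_ppbv):
    """A site counts as polluted if its annual mean exceeds the threshold."""
    return annual_mean_no2_ppbv > POLLUTED_THRESHOLD_PPBV


def rank_features(shap_rows):
    """Rank features by mean absolute SHAP value over all predictions.

    Assumes every row contains a value for every feature.
    """
    totals = {}
    for row in shap_rows:
        for name, value in row.items():
            totals[name] = totals.get(name, 0.0) + abs(value)
    means = {name: total / len(shap_rows) for name, total in totals.items()}
    return sorted(means, key=means.get, reverse=True)


# Hypothetical per-prediction SHAP values for two made-up sites.
sites = {
    "site_a": {
        "annual_mean_no2": 22.0,
        "shap": [
            {"no2": 3.0, "hour": -1.0, "zpbl": 0.5},
            {"no2": -2.0, "hour": 1.5, "zpbl": -0.2},
        ],
    },
    "site_b": {
        "annual_mean_no2": 6.0,
        "shap": [
            {"no2": 0.8, "hour": -0.6, "zpbl": 0.1},
            {"no2": -1.0, "hour": 0.4, "zpbl": -0.3},
        ],
    },
}

# Pool the local SHAP values of all sites in each group, then rank.
groups = {"polluted": [], "non_polluted": []}
for site in sites.values():
    key = "polluted" if is_polluted(site["annual_mean_no2"]) else "non_polluted"
    groups[key].extend(site["shap"])

rankings = {key: rank_features(rows) for key, rows in groups.items()}
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so a feature that pushes the bias correction strongly in both directions still ranks as important.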
Figure A5 shows the SHAP value distribution for all O3 predictors, again separated into polluted and non-polluted sites (using the same definition as for the NO2 sites). Unlike for NO2, the bias-correction models for polluted sites exhibit different feature sensitivities than those for non-polluted sites. At polluted locations, the availability of reactive nitrogen (NO2, NOy, PAN) is the dominant factor in explaining the model O3 bias, reflecting the tight chemical coupling between NOx and O3 (Seinfeld and Pandis, 2016). This is followed by the month of the year, total precipitation (tprec), and the O3 concentration, again variables that are expected to be correlated with O3. At non-polluted sites, the uncorrected O3 concentration is on average the most relevant input feature for the bias correctors, followed by the month of the year and the odd oxygen concentration (Ox=NO2+O3). The non-polluted sites are generally more sensitive to wind speed, reflecting the fact that O3 production and loss at these locations are less dominated by local processes than at the polluted sites.