Building an interpretable SDM from scratch
using Julia 1.9
Timothée Poisot
Université de Montréal
Overview
Build a simple classifier to predict the distribution of a species
No, I will not tell you which species, it’s a large North American mammal
Use this as an opportunity to talk about interpretable ML
Discuss which biases are appropriate in a predictive model
We care a lot about the process, and only a little about the product
Why…
- … think of SDMs as a ML problem? Because they are
- … think of explainable ML for SDM? Because model uptake requires transparency
- … not tell us which species this is about? Because this is the point (you’ll see)
Do try this at home!
💻 + 📔 + 🗺️ at https://github.com/tpoisot/InterpretableSDMWithJulia/
include(joinpath("code", "pkg.jl")); # Dependencies
include(joinpath("code", "nbc.jl")); # Naive Bayes Classifier
include(joinpath("code", "bioclim.jl")); # BioClim model
include(joinpath("code", "confusion.jl")); # Confusion matrix utilities
include(joinpath("code", "splitters.jl")); # Cross-validation (part one)
include(joinpath("code", "crossvalidate.jl")); # Cross-validation (part deux)
include(joinpath("code", "variableselection.jl")); # Variable selection
include(joinpath("code", "shapley.jl")); # Shapley values
include(joinpath("code", "palettes.jl")); # Accessible color palettes
Species occurrences
sightings = CSV.File("occurrences.csv")
occ = [
    (record.longitude, record.latitude)
    for record in sightings
    if record.classification == "Class A"
]
filter!(r -> -90 <= r[2] <= 90, occ)   # keep valid latitudes
filter!(r -> -180 <= r[1] <= 180, occ) # keep valid longitudes
boundingbox = (
    left = minimum(first.(occ)),
    right = maximum(first.(occ)),
    bottom = minimum(last.(occ)),
    top = maximum(last.(occ)),
)
Bioclimatic data
We collect BioClim data from WorldClim v2, using SpeciesDistributionToolkit
provider = RasterData(WorldClim2, BioClim)
opts = (; resolution=2.5)
temperature = SimpleSDMPredictor(provider, layer=1; opts..., boundingbox...)
Bioclimatic data
We set the pixels with only open water to nothing
water = SimpleSDMPredictor(RasterData(EarthEnv, LandCover), layer=12; boundingbox...)
land = similar(temperature, Bool)
replace!(land, false => true)
Threads.@threads for k in keys(land)
    if !isnothing(water[k]) && water[k] == 100
        land[k] = false
    end
end
temperature = mask(land, temperature)
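The next slides index into a predictors vector, with one layer per BioClim variable; it is not shown here, but a plausible way to assemble it (a sketch only, assuming every layer gets the same land mask as temperature, and mirroring the future-climate slide later on) is:
predictors = [
    mask(land, SimpleSDMPredictor(provider; layer=l, opts..., boundingbox...))
    for l in layers(provider)
]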
Where are we so far?
Spatial thinning
We limit the occurrences to one per grid cell, assigned to the center of the grid cell
presence_layer = similar(temperature, Bool)
for i in axes(occ, 1)
    if !isnothing(presence_layer[occ[i]...])
        presence_layer[occ[i]...] = true
    end
end
Background points generation
We generate background points proportionally to the distance away from observations, but penalize the cells that only appear larger because of the projection
possible_background = pseudoabsencemask(DistanceToEvent, presence_layer) * cellsize(temperature)
And then we sample three pseudo-absences for each occurrence:
absence_layer = backgroundpoints(
    (x -> x^1.01).(possible_background),
    3sum(presence_layer);
    replace=false
)
Background points cleaning
We can remove all of the information that is neither a presence nor a pseudo-absence
replace!(absence_layer, false => nothing)
replace!(presence_layer, false => nothing)
Data overview
Preparing the responses and variables
Xpresence = hcat([bioclim_var[keys(presence_layer)] for bioclim_var in predictors]...)
ypresence = fill(true, length(presence_layer))
Xabsence = hcat([bioclim_var[keys(absence_layer)] for bioclim_var in predictors]...)
yabsence = fill(false, length(absence_layer))
X = vcat(Xpresence, Xabsence)
y = vcat(ypresence, yabsence)
The model – Naive Bayes Classifier
Prediction:
\[
P(+|x) = \frac{P(+)}{P(x)}P(x|+)
\]
Decision rule:
\[
\hat y = \text{argmax}_j \, P(\mathbf{c}_j)\prod_i P(\mathbf{x}_i|\mathbf{c}_j)
\]
The model – Naive Bayes Classifier
Assumption of Gaussian distributions:
\[
P(x|+) = \text{pdf}(x, \mathcal{N}(\mu_+, \sigma_+))
\]
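As a minimal sketch of what nbc.jl might implement (an assumption; the repository code may differ), the Gaussian NBC only needs the class-wise mean and standard deviation of each variable:
using Statistics, Distributions
# Sketch: P(+|x) from class-wise Gaussian likelihoods and a prior P(+)
function gaussian_nbc_score(Xplus, Xminus, x; prior=0.5)
    pxplus = prod(pdf(Normal(mean(Xplus[:, i]), std(Xplus[:, i])), x[i]) for i in eachindex(x))
    pxminus = prod(pdf(Normal(mean(Xminus[:, i]), std(Xminus[:, i])), x[i]) for i in eachindex(x))
    evidence = prior * pxplus + (1.0 - prior) * pxminus   # P(x)
    return prior * pxplus / evidence                       # P(+|x)
end
The decision rule then reduces to checking whether this score exceeds 0.5 (or, later, a tuned threshold).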
Cross-validation
We keep an unseen testing set – this will be used at the very end to report expected model performance
idx, tidx = holdout(y, X; permute=true)
For validation, we will run k-folds
ty, tX = y[idx], X[idx,:]
folds = kfold(ty, tX; k=15, permute=true)
k = length(folds)
A note on cross-validation
- All models share the same folds: we can compare the validation performance across experiments to select the best model
- Model performance can be compared: we average the relevant summary statistics over each validation set
- Testing set is only for future evaluation: we can only use it once and report the expected performance of the best model
Measures on the confusion matrix
FPR | 0.3697 | 0.1406
FNR | 0.0144 | 0.1734
TPR | 0.9856 | 0.8266
TNR | 0.6303 | 0.8594
TSS | 0.6159 | 0.686
MCC | 0.5339 | 0.6408
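For reference, the two aggregate scores used to compare models are computed from the confusion matrix entries (TP, TN, FP, FN) as:
\[
\text{TSS} = \text{TPR} + \text{TNR} - 1, \qquad
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\]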
Variable selection
We add variables one at a time, until the Matthews correlation coefficient (MCC) stops increasing:
available_variables = forwardselection(ty, tX, folds, naivebayes, mcc)
This method identifies 5 variables, some of which are:
Mean Temp. of Coldest Quarter
Mean Diurnal Range
Annual Precip.
Variable selection?
Constrained variable selection
VIF threshold (computed over the extent, or over the occurrences?)
PCA for dimensionality reduction v. whitening for collinearity removal
Potential for data leakage: data transformations don’t exist, they are just models we can train
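As an illustration of the first option, the VIF of each variable can be computed with nothing more than least squares (a sketch, not part of the talk's code; X is a matrix with one column per candidate variable):
using Statistics
# VIF of column j: 1 / (1 - R²) of the regression of X[:, j] on all other columns
function vif(X::AbstractMatrix)
    n, k = size(X)
    return map(1:k) do j
        y = X[:, j]
        Z = hcat(ones(n), X[:, setdiff(1:k, j)])
        β = Z \ y
        r2 = 1 - sum(abs2, y .- Z * β) / sum(abs2, y .- mean(y))
        1 / (1 - r2)
    end
end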
Model with variable selection
N_v1 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds)
B_v1 = crossvalidate(bioclim, ty, tX[:,available_variables], folds, eps())
Measures on the confusion matrix
FPR | 0.3697 | 0.1406 | 0.6343 | 0.0727
FNR | 0.0144 | 0.1734 | 0.0056 | 0.1594
TPR | 0.9856 | 0.8266 | 0.9944 | 0.8406
TNR | 0.6303 | 0.8594 | 0.3657 | 0.9273
TSS | 0.6159 | 0.686 | 0.3601 | 0.7679
MCC | 0.5339 | 0.6408 | 0.3488 | 0.7534
How do we make the model better?
The NBC is a probabilistic classifier returning \(P(+|\mathbf{x})\)
The decision rule is to assign a presence when \(P(\cdot) > 0.5\)
But \(P(\cdot) > \tau\) is a far more general approach, and we can use learning curves to identify \(\tau\)
Thresholding the model
thr = LinRange(0.0, 1.0, 500)
T = hcat([crossvalidate(naivebayes, ty, tX[:,available_variables], folds, t) for t in thr]...)
But how do we pick the threshold?
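One simple answer, sketched below (it assumes crossvalidate returns one confusion matrix per fold, so that T has one column per candidate threshold): keep the threshold that maximizes the average MCC across validation folds; this is where the index m used on the next slide would come from.
avg_mcc = vec(mean(mcc.(T); dims=1))   # one value per candidate threshold
best_mcc, m = findmax(avg_mcc)          # m indexes into thr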
Tuned model with selected variables
N_v2 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds, thr[m])
Measures on the confusion matrix
FPR | 0.3697 | 0.1406 | 0.6343 | 0.0727 | 0.0737
FNR | 0.0144 | 0.1734 | 0.0056 | 0.1594 | 0.1521
TPR | 0.9856 | 0.8266 | 0.9944 | 0.8406 | 0.8479
TNR | 0.6303 | 0.8594 | 0.3657 | 0.9273 | 0.9263
TSS | 0.6159 | 0.686 | 0.3601 | 0.7679 | 0.7742
MCC | 0.5339 | 0.6408 | 0.3488 | 0.7534 | 0.7572
How do we make the model better?
The NBC is a Bayesian classifier returning \(P(+|\mathbf{x})\)
The actual probability depends on \(P(+)\)
There is no reason not to also tune \(P(+)\) (jointly with other hyper-parameters)!
Joint tuning of hyper-parameters
thr = LinRange(0.0, 1.0, 55)
pplus = LinRange(0.0, 1.0, 45)
T = [crossvalidate(naivebayes, ty, tX[:,available_variables], folds, t; presence=prior) for t in thr, prior in pplus]
best_mcc, params = findmax(map(v -> mean(mcc.(v)), T))
τ = thr[params.I[1]]
ppres = pplus[params.I[2]]
Tuned (again) model with selected variables
N_v3 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds, τ; presence=ppres)
Measures on the confusion matrix
FPR | 0.3697 | 0.1406 | 0.0727 | 0.0737 | 0.0735
FNR | 0.0144 | 0.1734 | 0.1594 | 0.1521 | 0.1521
TPR | 0.9856 | 0.8266 | 0.8406 | 0.8479 | 0.8479
TNR | 0.6303 | 0.8594 | 0.9273 | 0.9263 | 0.9265
TSS | 0.6159 | 0.686 | 0.7679 | 0.7742 | 0.7744
MCC | 0.5339 | 0.6408 | 0.7534 | 0.7572 | 0.7575
Acceptable bias
false positives: our knowledge of the distribution is incomplete, and predicting suitable but unobserved locations is why we train a model in the first place
false negatives: erroneous observations (recorded as presences in the data) may be flagged by the model (predicted as absences)
Prediction for each pixel
prediction = similar(temperature, Float64)
variability = similar(temperature, Float64)
uncertainty = similar(temperature, Float64)
Threads.@threads for k in keys(prediction)
    pred_k = [p[k] for p in predictors[available_variables]]
    bootstraps = [samplemodel(pred_k) for samplemodel in samplemodels]
    prediction[k] = finalmodel(pred_k)
    variability[k] = iqr(bootstraps)
    uncertainty[k] = entropy(prediction[k])
end
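The entropy call above is assumed to be the Shannon entropy of the Bernoulli outcome “present”, which is highest where the model is the least decided (\(P(+|\mathbf{x}) = 0.5\)); a one-line sketch:
H(p) = iszero(p) || isone(p) ? 0.0 : -(p * log2(p) + (1 - p) * log2(1 - p))   # in bits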
Tuned model - prediction
Tuned model - variability in output
Tuned model - entropy in probability
Tuned model - range
Predicting the predictions?
Shapley values (Monte-Carlo approximation): if we mix the variables across two observations, how important is the \(i\)-th variable?
Expresses “importance” as an additive factor on top of the average prediction (here: average prob. of occurrence)
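A sketch of the Monte-Carlo approximation (assumed; shapley.jl may differ in the details): for each of M replicates, draw a reference observation and a random ordering of the variables, then measure how much swapping in the value of the \(i\)-th variable changes the prediction.
using Random
# Monte-Carlo Shapley value of variable i at point x, against a background matrix X
function mc_shapley(model, X::AbstractMatrix, x::AbstractVector, i::Int; M::Int=50)
    ϕ = 0.0
    for _ in 1:M
        z = X[rand(1:size(X, 1)), :]        # a random reference observation
        order = shuffle(1:length(x))         # a random ordering of the variables
        pos = findfirst(==(i), order)
        with_i, without_i = copy(z), copy(z)
        with_i[order[1:pos]] .= x[order[1:pos]]                   # i and its predecessors from x
        without_i[order[1:(pos - 1)]] .= x[order[1:(pos - 1)]]    # only the predecessors from x
        ϕ += model(with_i) - model(without_i)
    end
    return ϕ / M
end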
Calculation of the Shapley values
shapval = [similar(first(predictors), Float64) for i in eachindex(available_variables)]
Threads.@threads for k in keys(shapval[1])
    x = [p[k] for p in predictors[available_variables]]
    for i in axes(shapval, 1)
        shapval[i][k] = shapleyvalues(finalmodel, tX[:,available_variables], x, i; M=50)
        if isnan(shapval[i][k])
            shapval[i][k] = 0.0
        end
    end
end
Importance of variables
varimp = sum.(map(abs, shapval))
varimp ./= sum(varimp)
shapmax = mosaic(argmax, map(abs, shapval[sortperm(varimp; rev=true)]))
for v in sortperm(varimp, rev=true)
    vname = variables[available_variables[v]][2]
    vctr = round(Int, varimp[v]*100)
    println("$(vname) - $(vctr)%")
end
Mean Temp. of Coldest Quarter - 38%
Annual Precip. - 20%
Precip. of Coldest Quarter - 18%
Mean Diurnal Range - 12%
Precip. Seasonality - 11%
There is a difference between contributing to model performance and contributing to model explainability
Top three variables
Most determinant predictor
Future predictions
We assume that:
- the relevant variables will remain the same
- \(P(+)\) will remain the same
- the threshold for presences will remain the same
Future climate data (ca. 2070)
future = Projection(SSP370, CanESM5)
future_predictors = [
    mask(land,
        SimpleSDMPredictor(
            provider, future;
            layer = l, opts..., boundingbox...,
            timespan=Year(2061) => Year(2080))
    )
    for l in layers(provider)
]
Future climate prediction
future_prediction = similar(temperature, Float64)
Threads.@threads for k in keys(future_prediction)
    pred_k = [p[k] for p in future_predictors[available_variables]]
    if any(isnothing, pred_k)
        continue
    end
    future_prediction[k] = finalmodel(pred_k)
end
Tuned model - future prediction
Loss and gain in distribution
Expansion | 1.732
No change | 4.659
Loss | 0.093
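A sketch (not shown in the talk; it assumes the tuned threshold τ and the same cell-by-cell style as the earlier slides) of how these three numbers can be tallied:
# Tally the area in each range-change class, assuming both layers share their valid cells
area = cellsize(temperature)
gained(k) = (future_prediction[k] >= τ) && (prediction[k] < τ)
kept(k) = (future_prediction[k] >= τ) && (prediction[k] >= τ)
lost(k) = (future_prediction[k] < τ) && (prediction[k] >= τ)
expansion = sum(area[k] for k in keys(prediction) if gained(k))
nochange = sum(area[k] for k in keys(prediction) if kept(k))
loss = sum(area[k] for k in keys(prediction) if lost(k))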
Tuned model - future range change
But wait…
What do you think the species was?
Human in the loop v. Algorithm in the loop
Take-home
building a model is incremental
each step adds arbitrary decisions we can control for, justify, or live with
we can provide explanations for every single prediction
free online textbook (in development) at https://tpoisot.github.io/DataSciForBiodivSci/