Modelling COVID-19 infection across England
In this tutorial we’ll cover some work on fitting a purely spatial Bayesian model to predict the COVID-19 infection rate across England.
Study aim and data description
This study describes how to fit a purely spatial Bayesian hierarchical model (BHM) based on Markov Chain Monte Carlo (MCMC) simulation method to estimate the spatial pattern of COVID-19 infection rate in England. The first thing is to load all the packages used in the COVID-19 case study.
Install CARBayes
This tutorial requires the CARBayes package, please install it before continuing through the tutorial.
Retrieving data
The study region is mainland England, which is partitioned into 6789
neighbourhoods at the Middle Layer Super Output Area (MSOA) scale. The
infections data are the total reported number of COVID-19 cases in each
MSOA from Jan 8, 2022 to March 26, 2022. The shapefile of the study
region is a SpatialPolygonsDataFrame
, which is used to map
the data. It stores the location, shape and attributes of geographic
features for the neighbourhoods. We’ll use
retrieve_tutorial_data
to first retrieve the dataset for
this tutorial.
fdmr::retrieve_tutorial_data(dataset = "covid_mcmc")
##
## Tutorial data extracted to /home/runner/fdmr/tutorial_data/covid_mcmc
The COVID-19 data and the related covariate information are included
in our tutorial data package. We’ll load in the data using the
load_tutorial_data
function.
s_covid <- fdmr::load_tutorial_data(dataset = "covid_mcmc", filename = "s_covid.rds")
Next we’ll use the load_tutorial_data
function to load
in the spatial data we want.
sp_data <- fdmr::load_tutorial_data(dataset = "covid_mcmc", filename = "spatial_data.rds")
In this study, we use the areal unit modelling approach to fit the BHM and then make model inference using MCMC method. To do this, we need to construct a non-negative symmetric neighbourhood or adjacency matrix that accounts for the spatio-temporal autocorrelation structure, where is the number of areal units. The neighbourhood matrix specifies the spatial closeness between pairs of areal units. The elements in can be either continuous or binary, and a larger value of represents that MSOAs are spatially closer to each other. Here we use the border sharing specification, so that if MSOAs share a common geographical border, and otherwise.
Model specification
We use a Bayesian hierarchical model to predict the spatial COVID-19 infection rate at the MSOA level in England. Let denote the total number of reported COVID cases for neighbourhood during the study period, and denote the (official) estimated population living in MSOA . is assumed to have a Poisson distribution with parameters (, ), where is the true unobserved COVID-19 infection rate in MSAO . We follow a standard path in modelling with a log link to the Poisson and start with a model where the linear predictor decomposes additively into a set of covariates and a Gaussian Markov Random Field process, which characterises the infection of the disease after the covariate effects have been accounted for. A general Bayesian hierarchical model commonly specified is given by
The spatial random effects are included in the model to account for any residual spatio-temporal autocorrelation after adjusting for covariates . Here we utilise the spatial modelling structure “BYM”, proposed by Besag, York, and Mollié (1991), to model . It is given by
where now consists of two components. is assigned the intrinsic CAR prior (Besag et al., 1991), and is a set of independent and identically normally distributed random effects, with mean zero and common variance .
Define the model formula
In order to fit the spatial model, a model formula needs to be defined, by including the response in the left-hand side and the fixed and random effects in the right-hand side. First, we consider the scenario of including no covariates.
Fit the model
Finally, we fit the spatial model using the function
S.CARbym()
of the package CARBayes
developed
by Lee (2013).
MCMC_model <- CARBayes::S.CARbym(
formula = form,
data = s_covid,
family = "poisson",
W = w,
burnin = 10000,
n.sample = 30000,
thin = 10,
verbose = F
)
## Error in loadNamespace(x): there is no package called 'CARBayes'
For comparison purpose, we fit a separate BHM to the same dataset using the INLA approach. Likewise, it uses the BYM to model the spatial random effects.
s_covid$ID <- seq(1, nrow(s_covid))
formula <- total.cases ~ 1 + f(ID,
model = "bym",
graph = w
)
INLA_model <- INLA::inla(formula,
data = s_covid,
family = "poisson",
E = s_covid$population,
control.compute = list(
dic = TRUE,
waic = TRUE,
cpo = TRUE
),
verbose = FALSE
)
Finally, we fit a BHM to the same dataset using the INLA-SPDE approach.
initial_range <- diff(range(sp_data@data[, "LONG"])) / 5
max_edge <- initial_range / 8
mesh <- fmesher::fm_mesh_2d_inla(
loc = sp_data@data[, c("LONG", "LAT")],
max.edge = c(1, 2) * max_edge,
offset = c(initial_range / 4, initial_range),
cutoff = max_edge / 7
)
prior_range <- initial_range
spde <- inla.spde2.pcmatern(
mesh = mesh,
prior.range = c(prior_range, 0.5),
prior.sigma = c(1, 0.01)
)
s_covid_cp <- s_covid
sp::coordinates(s_covid_cp) <- c("LONG", "LAT")
cmp <- total.cases ~ 0 + Intercept(1) + f(main = coordinates, model = spde)
inlabru_model <- inlabru::bru(cmp,
data = s_covid_cp,
family = "poisson",
E = s_covid_cp$population,
control.family = list(link = "log"),
options = list(
verbose = FALSE
)
)
Model comparison
In terms of model selection criteria, we show the different values for each model based on DIC and WAIC. The model fitted using the INLA-SPDE approach performs better than the other two models in terms of the lowest DIC and WAIC values.
modfit <- data.frame(
DIC = c(MCMC_model$modelfit[1], INLA_model$dic$dic, inlabru_model$dic$dic),
WAIC = c(MCMC_model$modelfit[3], INLA_model$waic$waic, inlabru_model$waic$waic)
)
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
## Error: object 'modfit' not found
modfit
## Error in eval(expr, envir, enclos): object 'modfit' not found
We compare the posterior COVID-19 infection rate estimates between the models. In general, all models provide similar posterior COVID-19 infection rate estimates.
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'mean': object 'MCMC_model' not found
mcmc_lc <- exp(quantile(MCMC_model$samples$beta, 0.025) +
apply(MCMC_model$samples$psi, 2, quantile, 0.025))
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
mcmc_uc <- exp(quantile(MCMC_model$samples$beta, 0.975) +
apply(MCMC_model$samples$psi, 2, quantile, 0.975))
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
comb <- data.frame(
"INLA_BYM" = INLA_model$summary.fitted.values$mean,
"INLA_lc" = INLA_model$summary.fitted.values$`0.025quant`,
"INLA_uc" = INLA_model$summary.fitted.values$`0.975quant`,
"INLA_SPDE" = inlabru_model$summary.fitted.values$mean[1:nrow(s_covid)],
"INLA_SPDE_lc" = inlabru_model$summary.fitted.values$`0.025quant`[1:nrow(s_covid)],
"INLA_SPDE_uc" = inlabru_model$summary.fitted.values$`0.975quant`[1:nrow(s_covid)],
"MCMC" = mcmc_fitted_prev,
"MCMC_lc" = mcmc_lc,
"MCMC_uc" = mcmc_uc
)
## Error in eval(expr, envir, enclos): object 'mcmc_fitted_prev' not found
pairs(comb[, c("MCMC", "INLA_SPDE", "INLA_BYM")],
pch = 19, cex = 0.3, col = "orange", lower.panel = panel.smooth
)
## Error in eval(expr, envir, enclos): object 'comb' not found
## Error in eval(expr, envir, enclos): object 'comb' not found
The spatial patterns of the infection rate estimates for each model are displayed below.
sp_data@data$est.rate.mcmc <- mcmc_fitted_prev
## Error in eval(expr, envir, enclos): object 'mcmc_fitted_prev' not found
domain <- sp_data@data$est.rate.mcmc
legend_values <- sp_data@data$est.rate.mcmc
fdmr::plot_map_leaflet(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
)
## Error: 'plot_map_leaflet' is not an exported object from 'namespace:fdmr'
sp_data@data$est.rate.inlabym <- INLA_model$summary.fitted.values$mean
domain <- sp_data@data$est.rate.inlabym
legend_values <- sp_data@data$est.rate.inlabym
fdmr::plot_map_leaflet(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
)
## Error: 'plot_map_leaflet' is not an exported object from 'namespace:fdmr'
sp_data@data$est.rate.inlaspde <- inlabru_model$summary.fitted.values$mean[1:nrow(s_covid)]
domain <- sp_data@data$est.rate.inlaspde
legend_values <- sp_data@data$est.rate.inlaspde
fdmr::plot_map_leaflet(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
)
## Error: 'plot_map_leaflet' is not an exported object from 'namespace:fdmr'
Now we consider the scenario of fitting the models with covariates of interest. We select several risk factors used in the COVID-19 tutorial. The models described above are fitted and compared again after incorporating covariate information. First we fit the BHM using MCMC method.
form <- total.cases ~ 1 + offset(log(population)) +
IMD + perc.wb + perc.ba + age1 + pm25
MCMC_model2 <- CARBayes::S.CARbym(
formula = form,
data = s_covid,
family = "poisson",
W = w,
burnin = 10000,
n.sample = 30000,
thin = 10,
verbose = F
)
## Error in loadNamespace(x): there is no package called 'CARBayes'
Next we fit the BHM using the INLA-BYM approach.
formula <- total.cases ~ 1 + f(ID,
model = "bym",
graph = w
) + IMD + perc.wb + perc.ba + age1 + pm25
INLA_model2 <- INLA::inla(formula,
data = s_covid,
family = "poisson",
E = s_covid$population,
control.compute = list(
dic = TRUE,
waic = TRUE,
cpo = TRUE
),
verbose = FALSE
)
Next we fit the BHM using the INLA-SPDE approach.
cmp <- total.cases ~ 0 + Intercept(1) + f(main = coordinates, model = spde) +
IMD + perc.wb + perc.ba + age1 + pm25
inlabru_model2 <- inlabru::bru(cmp,
data = s_covid_cp,
family = "poisson",
E = s_covid_cp$population,
control.family = list(link = "log"),
options = list(
verbose = FALSE
)
)
The regression coefficients of the selected covariates for all models are compared. In general, the models have similar regression coefficients estimates.
regr_est <- cbind.data.frame(
"MCMC" = MCMC_model2$summary.results[1:(nrow(MCMC_model2$summary.results) - 2), 1],
"INLA_BYM" = INLA_model2$summary.fixed[, 1],
"INLA_SPDE" = inlabru_model2$summary.fixed[, 1]
)
## Error in eval(expr, envir, enclos): object 'MCMC_model2' not found
regr_est
## Error in eval(expr, envir, enclos): object 'regr_est' not found
The modelling performance and results of the MCMC-BYM, INLA-BYM and INLA-SPDE approaches can be affected by the choice of the spatial model, the inference method and the data characteristics. MCMC-BYM and INLA-BYM use different inference methods, with the latter being faster. In the MCMC-BYM approach, spatial correlation structure is captured by a adjacency matrix. Specifically, if two locations are adjacent in geography, they are spatially correlated, leading to data smoothing between them. However, the INLA-SPDE approach uses the Matern model, where spatial correlation depends on the distance between two locations, rather than their geographic adjacency. Thus, even if two locations are not geographically adjacent, they can still exhibit correlation.