Skip to contents

Modelling COVID-19 infection across England

In this tutorial we’ll cover some work on fitting a purely spatial Bayesian model to predict the COVID-19 infection rate across England.

Study aim and data description

This study describes how to fit a purely spatial Bayesian hierarchical model (BHM) based on Markov Chain Monte Carlo (MCMC) simulation method to estimate the spatial pattern of COVID-19 infection rate in England. The first thing is to load all the packages used in the COVID-19 case study.

Install CARBayes

This tutorial requires the CARBayes package, please install it before continuing through the tutorial.

Retrieving data

The study region is mainland England, which is partitioned into 6789 neighbourhoods at the Middle Layer Super Output Area (MSOA) scale. The infections data are the total reported number of COVID-19 cases in each MSOA from Jan 8, 2022 to March 26, 2022. The shapefile of the study region is a SpatialPolygonsDataFrame, which is used to map the data. It stores the location, shape and attributes of geographic features for the neighbourhoods. We’ll use retrieve_tutorial_data to first retrieve the dataset for this tutorial.

fdmr::retrieve_tutorial_data(dataset = "covid_mcmc")
## 
## Tutorial data extracted to  /home/runner/fdmr/tutorial_data/covid_mcmc

The COVID-19 data and the related covariate information are included in our tutorial data package. We’ll load in the data using the load_tutorial_data function.

s_covid <- fdmr::load_tutorial_data(dataset = "covid_mcmc", filename = "s_covid.rds")

Next we’ll use the load_tutorial_data function to load in the spatial data we want.

sp_data <- fdmr::load_tutorial_data(dataset = "covid_mcmc", filename = "spatial_data.rds")

In this study, we use the areal unit modelling approach to fit the BHM and then make model inference using MCMC method. To do this, we need to construct a non-negative symmetric \(n \times n\) neighbourhood or adjacency matrix \(\boldsymbol{W}\) that accounts for the spatio-temporal autocorrelation structure, where \(n=6789\) is the number of areal units. The neighbourhood matrix specifies the spatial closeness between pairs of areal units. The elements \(\{w_{ij}\}\) in \(\boldsymbol{W}\) can be either continuous or binary, and a larger value of \(w_{ij}\) represents that MSOAs \((i,j)\) are spatially closer to each other. Here we use the border sharing specification, so that \(w_{ij}=1\) if MSOAs \((i,j)\) share a common geographical border, and \(w_{ij}=0\) otherwise.

W_nb <- spdep::poly2nb(sp_data, row.names = rownames(sp_data@data))
w <- spdep::nb2mat(W_nb, style = "B")

Model specification

We use a Bayesian hierarchical model to predict the spatial COVID-19 infection rate at the MSOA level in England. Let \(Y_{i}\) denote the total number of reported COVID cases for neighbourhood \(i=1,\ldots, n(=6789)\) during the study period, and \(N_{i}\) denote the (official) estimated population living in MSOA \(i\). \(Y_{i}\) is assumed to have a Poisson distribution with parameters (\(N_{i}\), \(\theta_{i}\)), where \(\theta_{i}\) is the true unobserved COVID-19 infection rate in MSAO \(i\). We follow a standard path in modelling \(\theta_{i}\) with a log link to the Poisson and start with a model where the linear predictor decomposes additively into a set of covariates and a Gaussian Markov Random Field process, which characterises the infection of the disease after the covariate effects have been accounted for. A general Bayesian hierarchical model commonly specified is given by

\[\begin{align} \nonumber Y_{i}\vert N_{i}, \theta_{i} &\sim \text{Poisson}(N_{i}\theta_{i}),\ \ i=1,\ldots,n,\\ log(\theta_{i} )&=\boldsymbol{x_{i}^{\top}}\boldsymbol{\beta}+\phi_{i}. \end{align}\]

The spatial random effects \(\{\phi_i\}\) are included in the model to account for any residual spatio-temporal autocorrelation after adjusting for covariates \(\boldsymbol{x_{i}}\). Here we utilise the spatial modelling structure “BYM”, proposed by Besag, York, and Mollié (1991), to model \(\{\phi_i\}\). It is given by

\[\begin{align} \nonumber \phi_i &=\phi_i^{(1)}+\phi_i^{(2)}\\ \phi_i^{(1)}\vert\boldsymbol\phi_{-i}^{(1)}&\sim \text{N}\left( \frac{\sum_{j=1}^{n}w_{ij}\phi_j^{(1)}}{\sum_{j=1}^{n}w_{ij}}, \frac{\tau_1^2}{\sum_{j=1}^{n}w_{ij}}\right)\\ \nonumber \phi_i^{(2)}&\sim \text{N}(0, \tau_2^2), \end{align}\]

where \(\phi_i\) now consists of two components. \(\phi_i^{(1)}\) is assigned the intrinsic CAR prior (Besag et al., 1991), and \(\phi_i^{(2)}\) is a set of independent and identically normally distributed random effects, with mean zero and common variance \(\tau_2^2\).

Define the model formula

In order to fit the spatial model, a model formula needs to be defined, by including the response in the left-hand side and the fixed and random effects in the right-hand side. First, we consider the scenario of including no covariates.

form <- total.cases ~ 1 + offset(log(population))

Fit the model

Finally, we fit the spatial model using the function S.CARbym() of the package CARBayes developed by Lee (2013).

MCMC_model <- CARBayes::S.CARbym(
  formula = form,
  data = s_covid,
  family = "poisson",
  W = w,
  burnin = 10000,
  n.sample = 30000,
  thin = 10,
  verbose = F
)
## Error in loadNamespace(x): there is no package called 'CARBayes'

For comparison purpose, we fit a separate BHM to the same dataset using the INLA approach. Likewise, it uses the BYM to model the spatial random effects.

s_covid$ID <- seq(1, nrow(s_covid))
formula <- total.cases ~ 1 + f(ID,
  model = "bym",
  graph = w
)

INLA_model <- INLA::inla(formula,
  data = s_covid,
  family = "poisson",
  E = s_covid$population,
  control.compute = list(
    dic = TRUE,
    waic = TRUE,
    cpo = TRUE
  ),
  verbose = FALSE
)
## Warning in inla.model.properties.generic(inla.trim.family(model), mm[names(mm) == : Model 'scopy' in section 'latent' is marked as 'experimental'; changes may appear at any time.
##   Use this model with extra care!!! Further warnings are disabled.

Finally, we fit a BHM to the same dataset using the INLA-SPDE approach.

initial_range <- diff(range(sp_data@data[, "LONG"])) / 5
max_edge <- initial_range / 8

mesh <- fmesher::fm_mesh_2d_inla(
  loc = sp_data@data[, c("LONG", "LAT")],
  max.edge = c(1, 2) * max_edge,
  offset = c(initial_range / 4, initial_range),
  cutoff = max_edge / 7
)

prior_range <- initial_range
spde <- inla.spde2.pcmatern(
  mesh = mesh,
  prior.range = c(prior_range, 0.5),
  prior.sigma = c(1, 0.01)
)

s_covid_cp <- s_covid
sp::coordinates(s_covid_cp) <- c("LONG", "LAT")
cmp <- total.cases ~ 0 + Intercept(1) + f(main = coordinates, model = spde)

inlabru_model <- inlabru::bru(cmp,
  data = s_covid_cp,
  family = "poisson",
  E = s_covid_cp$population,
  control.family = list(link = "log"),
  options = list(
    verbose = FALSE
  )
)

Model comparison

In terms of model selection criteria, we show the different values for each model based on DIC and WAIC. The model fitted using the INLA-SPDE approach performs better than the other two models in terms of the lowest DIC and WAIC values.

modfit <- data.frame(
  DIC = c(MCMC_model$modelfit[1], INLA_model$dic$dic, inlabru_model$dic$dic),
  WAIC = c(MCMC_model$modelfit[3], INLA_model$waic$waic, inlabru_model$waic$waic)
)
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
rownames(modfit) <- c("MCMC", "INLA_BYM", "INLA_SPDE")
## Error: object 'modfit' not found
modfit
## Error in eval(expr, envir, enclos): object 'modfit' not found

We compare the posterior COVID-19 infection rate estimates between the models. In general, all models provide similar posterior COVID-19 infection rate estimates.

mcmc_fitted_prev <- exp(mean(MCMC_model$samples$beta) + apply(MCMC_model$samples$psi, 2, mean))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'mean': object 'MCMC_model' not found
mcmc_lc <- exp(quantile(MCMC_model$samples$beta, 0.025) +
  apply(MCMC_model$samples$psi, 2, quantile, 0.025))
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
mcmc_uc <- exp(quantile(MCMC_model$samples$beta, 0.975) +
  apply(MCMC_model$samples$psi, 2, quantile, 0.975))
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
comb <- data.frame(
  "INLA_BYM" = INLA_model$summary.fitted.values$mean,
  "INLA_lc" = INLA_model$summary.fitted.values$`0.025quant`,
  "INLA_uc" = INLA_model$summary.fitted.values$`0.975quant`,
  "INLA_SPDE" = inlabru_model$summary.fitted.values$mean[1:nrow(s_covid)],
  "INLA_SPDE_lc" = inlabru_model$summary.fitted.values$`0.025quant`[1:nrow(s_covid)],
  "INLA_SPDE_uc" = inlabru_model$summary.fitted.values$`0.975quant`[1:nrow(s_covid)],
  "MCMC" = mcmc_fitted_prev,
  "MCMC_lc" = mcmc_lc,
  "MCMC_uc" = mcmc_uc
)
## Error in eval(expr, envir, enclos): object 'mcmc_fitted_prev' not found
pairs(comb[, c("MCMC", "INLA_SPDE", "INLA_BYM")],
  pch = 19, cex = 0.3, col = "orange", lower.panel = panel.smooth
)
## Error in eval(expr, envir, enclos): object 'comb' not found
boxplot(comb[, c("MCMC", "INLA_SPDE", "INLA_BYM")])
## Error in eval(expr, envir, enclos): object 'comb' not found

The spatial patterns of the infection rate estimates for each model are displayed below.

sp_data@data$est.rate.mcmc <- mcmc_fitted_prev
## Error in eval(expr, envir, enclos): object 'mcmc_fitted_prev' not found
domain <- sp_data@data$est.rate.mcmc
legend_values <- sp_data@data$est.rate.mcmc

fdmr::plot_map_leaflet(
  polygon_data = sp_data,
  domain = domain,
  palette = "Reds",
  legend_title = "Rate",
  add_scale_bar = TRUE,
  polygon_fill_opacity = 0.8,
)
## Error: 'plot_map_leaflet' is not an exported object from 'namespace:fdmr'
sp_data@data$est.rate.inlabym <- INLA_model$summary.fitted.values$mean

domain <- sp_data@data$est.rate.inlabym
legend_values <- sp_data@data$est.rate.inlabym

fdmr::plot_map_leaflet(
  polygon_data = sp_data,
  domain = domain,
  palette = "Reds",
  legend_title = "Rate",
  add_scale_bar = TRUE,
  polygon_fill_opacity = 0.8,
)
## Error: 'plot_map_leaflet' is not an exported object from 'namespace:fdmr'
sp_data@data$est.rate.inlaspde <- inlabru_model$summary.fitted.values$mean[1:nrow(s_covid)]

domain <- sp_data@data$est.rate.inlaspde
legend_values <- sp_data@data$est.rate.inlaspde

fdmr::plot_map_leaflet(
  polygon_data = sp_data,
  domain = domain,
  palette = "Reds",
  legend_title = "Rate",
  add_scale_bar = TRUE,
  polygon_fill_opacity = 0.8,
)
## Error: 'plot_map_leaflet' is not an exported object from 'namespace:fdmr'

Now we consider the scenario of fitting the models with covariates of interest. We select several risk factors used in the COVID-19 tutorial. The models described above are fitted and compared again after incorporating covariate information. First we fit the BHM using MCMC method.

form <- total.cases ~ 1 + offset(log(population)) +
  IMD + perc.wb + perc.ba + age1 + pm25

MCMC_model2 <- CARBayes::S.CARbym(
  formula = form,
  data = s_covid,
  family = "poisson",
  W = w,
  burnin = 10000,
  n.sample = 30000,
  thin = 10,
  verbose = F
)
## Error in loadNamespace(x): there is no package called 'CARBayes'

Next we fit the BHM using the INLA-BYM approach.

formula <- total.cases ~ 1 + f(ID,
  model = "bym",
  graph = w
) + IMD + perc.wb + perc.ba + age1 + pm25

INLA_model2 <- INLA::inla(formula,
  data = s_covid,
  family = "poisson",
  E = s_covid$population,
  control.compute = list(
    dic = TRUE,
    waic = TRUE,
    cpo = TRUE
  ),
  verbose = FALSE
)

Next we fit the BHM using the INLA-SPDE approach.

cmp <- total.cases ~ 0 + Intercept(1) + f(main = coordinates, model = spde) +
  IMD + perc.wb + perc.ba + age1 + pm25

inlabru_model2 <- inlabru::bru(cmp,
  data = s_covid_cp,
  family = "poisson",
  E = s_covid_cp$population,
  control.family = list(link = "log"),
  options = list(
    verbose = FALSE
  )
)

The regression coefficients of the selected covariates for all models are compared. In general, the models have similar regression coefficients estimates.

regr_est <- cbind.data.frame(
  "MCMC" = MCMC_model2$summary.results[1:(nrow(MCMC_model2$summary.results) - 2), 1],
  "INLA_BYM" = INLA_model2$summary.fixed[, 1],
  "INLA_SPDE" = inlabru_model2$summary.fixed[, 1]
)
## Error in eval(expr, envir, enclos): object 'MCMC_model2' not found
regr_est
## Error in eval(expr, envir, enclos): object 'regr_est' not found

The modelling performance and results of the MCMC-BYM, INLA-BYM and INLA-SPDE approaches can be affected by the choice of the spatial model, the inference method and the data characteristics. MCMC-BYM and INLA-BYM use different inference methods, with the latter being faster. In the MCMC-BYM approach, spatial correlation structure is captured by a \(n\times n\) adjacency matrix. Specifically, if two locations are adjacent in geography, they are spatially correlated, leading to data smoothing between them. However, the INLA-SPDE approach uses the Matern model, where spatial correlation depends on the distance between two locations, rather than their geographic adjacency. Thus, even if two locations are not geographically adjacent, they can still exhibit correlation.

Besag, Julian, Jeremy York, and Annie Mollié. 1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43: 1–20.
Lee, Duncan. 2013. “CARBayes: An r Package for Bayesian Spatial Modeling with Conditional Autoregressive Priors.” Journal of Statistical Software 55 (13): 1–24.