Skip to contents

Modelling COVID-19 infection across England

In this tutorial we’ll cover some work on fitting a spatio-temporal Bayesian hierarchical model (BHM) to predict the COVID-19 infection rate across England.

Study aim and data description

This study describes how to fit a spatio-temporal BHM to areal data using Markov Chain Monte Carlo (MCMC) simulation method. The first thing is to load all the packages used in the COVID-19 case study.

The study region is mainland England, which is partitioned into 6789 neighbourhoods at the Middle Layer Super Output Area (MSOA) scale. The infections data are the total reported number of COVID-19 cases in each MSOA from Jan 8, 2022 to March 26, 2022. The shapefile of the study region is a SpatialPolygonsDataFrame, which is used to map the data. It stores the location, shape and attributes of geographic features for the neighbourhoods.

Installing CARBayesST

This tutorial requires the CARBayesST package which you may need to install before working through this tutorial.

Retrieving and loading data

We first need to retrieve the infections data from the fdmr example data store and unpack it and we’ll use retrieve_tutorial_data to do this.

fdmr::retrieve_tutorial_data(dataset = "covidST_mcmc")
## 
## Tutorial data extracted to  /home/runner/fdmr/tutorial_data/covidst_mcmc

The COVID-19 data and the related covariate information are included in our tutorial data package. We’ll load in the data using the load_tutorial_data function.

st_covid <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "st_covid.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?

Next we’ll use the load_tutorial_data function to load in the spatial data we want.

sp_data <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "spatial_data.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?

In this study, we use the areal unit modelling approach to fit the BHM and then make model inference using MCMC method. To do this, we need to construct a non-negative symmetric n×nn \times n neighbourhood or adjacency matrix 𝐖\boldsymbol{W} that accounts for the spatio-temporal autocorrelation structure, where n=6789n=6789 is the number of areal units. The neighbourhood matrix specifies the spatial closeness between pairs of areal units. The elements {wij}\{w_{ij}\} in 𝐖\boldsymbol{W} can be either continuous or binary, and a larger value of wijw_{ij} represents that MSOAs (i,j)(i,j) are spatially closer to each other. Here we use the border sharing specification, so that wij=1w_{ij}=1 if MSOAs (i,j)(i,j) share a common geographical border, and wij=0w_{ij}=0 otherwise.

W_nb <- spdep::poly2nb(sp_data, row.names = rownames(sp_data@data))
## Error in eval(expr, envir, enclos): object 'sp_data' not found
w <- spdep::nb2mat(W_nb, style = "B")
## Error in eval(expr, envir, enclos): object 'W_nb' not found

Model specification

We use a Bayesian hierarchical model to predict the spatio-temporal COVID-19 infection rate at the neighbourhood level in England. Let YitY_{it} denotes the weekly number of reported COVID cases for neighbourhood i=1,,n(=6789)i=1,\ldots, n(=6789) and week t=1,,T(=108)t=1,\ldots, T(=108) and NitN_{it} denotes the (official) estimated population living in neighbourhood ii and week tt. YitY_{it} is assumed to have a Poisson distribution with parameters (NitN_{it}, θit\theta_{it}), where θit\theta_{it} is the true unobserved COVID-19 infection rate in MSOA ii and week tt. We follow a standard path in modelling θit\theta_{it} with a log link to the Poisson and start with a model where the linear predictor decomposes additively into a set of covariates and a Gaussian Markov Random Field process, which characterises the infection of the disease after the covariate effects have been accounted for. A general Bayesian hierarchical model commonly specified is given by

Yit|Nit,θitPoisson(Nitθit),i=1,,n,t=1,,T,log(θit)=𝐱𝐢𝐭𝛃+ϕit.\begin{align} \nonumber Y_{it}\vert N_{it}, \theta_{it} &\sim \text{Poisson}(N_{it}\theta_{it}),\ \ i=1,\ldots,n, t=1,\ldots,T,\\ log(\theta_{it} )&=\boldsymbol{x_{it}^{\top}}\boldsymbol{\beta}+\phi_{it}. \end{align}

The spatial random effects {ϕit}\{\phi_{it}\} are included in the model to account for any residual spatio-temporal autocorrelation after adjusting for covariates 𝐱𝐢𝐭\boldsymbol{x_{it}}. Here we utilise the spatio-temporal modelling structure proposed by Rushworth, Lee, and Mitchell (2014) to model {ϕit}\{\phi_{it}\}. It is given by

𝛟𝟏N(𝟎,τ2𝐐(𝐖)1),𝛟𝐭|𝛟𝐭𝟏N(α𝛟𝐭𝟏,τ2𝐐(𝐖,ρ)1),t=2,,T,\begin{align} \nonumber \boldsymbol{\phi_1}&\sim \text{N}\left(\boldsymbol{0}, \tau^2\boldsymbol{ Q}(\boldsymbol{W})^{-1}\right),\\ \boldsymbol{\phi_t}\vert\boldsymbol{\phi_{t-1}}&\sim \text{N}\left(\alpha\boldsymbol{\phi_{t-1}},\tau^2\boldsymbol{Q}(\boldsymbol{W},\rho)^{-1}\right), \ \ t=2,\ldots, T,\\ \end{align} where the precision matrix 𝐐(𝐖,ρ)\boldsymbol{Q}(\boldsymbol{W},\rho) is proposed by Leroux et al.(2000). The algebraic form of this matrix is given by 𝐐(𝐖,ρ)=ρ[diag(𝐖𝟏)𝐖]+(1ρ)𝐈, \boldsymbol{Q}(\boldsymbol{W},\rho)=\rho[diag(\boldsymbol{W1})-\boldsymbol{W}]+(1-\rho)\boldsymbol{I}, where 𝟏\boldsymbol{1} is the n×1n\times 1 vector of ones, and is the n×nn\times n identity matrix. ρ\rho and α\alpha are the spatial and temporal dependence parameters, respectively, while τ2\tau^2 is the variance parameter.

Define the model formula

In order to fit the model, a model formula needs to be defined, by including the response in the left-hand side and the fixed and random effects in the right-hand side. We select a few risk factors used in our COVID-19 tutorial.

form <- cases ~ 1 + offset(log(Population)) + IMD + perc.wb + perc.ba + age1 + pm25

Fit the model

Finally, we fit the spatio-temporal model using the function ST.CARar() of the package CARBayesST developed by Lee (2018) Lee, Rushworth, and Napier (2018). We first need to organize the COVID-19 infection and covariate data into a specific format expected by the ST.CARar() function. More details can be found in the help file of ‘ST.CARar()’. Any MSOAs without reported cases will be stored as missing (NA) values in the data frame.

time.points <- length(unique(st_covid$date))
n <- nrow(sp_data@data)
dat <- data.frame(
  MSOA11CD = rep(sp_data$MSOA11CD, time.points),
  date = rep(sort(unique(st_covid$date)), each = n),
  time = rep(1:time.points, each = n)
)

dat$rowid <- 1:nrow(dat)
out <- merge(dat, st_covid[, c(
  "MSOA11CD",
  "date",
  "MSOA11NM",
  "cases",
  "Population"
)],
all.x = TRUE,
by = c("MSOA11CD", "date")
)

dat <- out[order(out$rowid), ]

covars <- unique(st_covid[, c(
  "MSOA11CD",
  "IMD",
  "age1",
  "perc.chinese",
  "perc.indian",
  "perc.wb",
  "perc.bc",
  "perc.ba",
  "pm25",
  "no2"
)])

out <- merge(dat, covars,
  all.x = TRUE,
  by = c("MSOA11CD")
)
dat <- out[order(out$rowid), ]

dat$pre <- dat$cases / dat$Population
dat$logpre <- log(dat$pre)
nbhoods <- unique(st_covid[, c("MSOA11CD", "Population")])

for (i in 1:nrow(dat)) { #  this will take a few minutes
  if (is.na(dat$Population[i])) {
    dat$Population[i] <- nbhoods[
      match(
        dat[i, "MSOA11CD"],
        nbhoods$MSOA11CD
      ),
      "Population"
    ]
  }
}

st_covid <- dat
rm(dat, nbhoods, out, covars) # remove non-necessary objects

:warning: Memory requirements: Running the model requires a large amount of memory and may fail if run on a normal laptop / desktop.

Now the data frame ‘st_covid’ has the expected format. Then run the model.

MCMC_model <- CARBayesST::ST.CARar(
  formula = form,
  data = st_covid,
  family = "poisson",
  W = w,
  burnin = 10000,
  n.sample = 30000,
  thin = 10, AR = 1
)

Now we summarise the modelling results. “fitted_vals” stores the predicted COVID-19 infection rate at each MSOA and time point. “modfits” stores the DIC and WAIC values, which measure the goodness of model fit. “mod_sum” provides the values for parameters τ2\tau^2, ρ\rho and α\alpha.

fitted_vals <- exp(sum(apply(MCMC_model$samples$beta, 2, mean)) +
  apply(MCMC_model$samples$phi, 2, mean))
fitted_vals <- cbind.data.frame(st_covid[, c("MSOA11CD", "date")], fitted_vals)
modfits <- MCMC_model$modelfit
mod_sum <- MCMC_model$summary.results

The above modelling results are provided in the tutorial data package so we’ll load them now.

NOTE: If you’ve run the full model above you don’t need to load in the files below.

fitted_vals <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "fitted_vals.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
modfits <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "modfits.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
mod_sum <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "mod_sum.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?

For comparison purpose, we also fit a separate BHM to the same dataset using the INLA-SPDE approach. The infection rates predicted by the INLA-SPDE approach is saved in the date frame named “inla_preds” and is provided in the tutorial data package. We’ll load that in now.

inla_preds <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "inla_preds.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?

Then the predictions from the two models are merged into one data frame named “mergedat”.

mergedat <- merge(inla_preds,
  fitted_vals,
  by = c("MSOA11CD", "date")
)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': object 'inla_preds' not found

Model comparison

We show the DIC and WAIC values for each model. The model using the MCMC approach performs better than the model using the INLA-SPDE approach in terms of the lower DIC and WAIC values.

inla_sum <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "INLAmodsum.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
modfit <- data.frame(
  DIC = c(modfits[1], inla_sum[1]),
  WAIC = c(modfits[2], inla_sum[2])
)
## Error in eval(expr, envir, enclos): object 'modfits' not found
rownames(modfit) <- c("MCMC", "INLA_SPDE")
## Error: object 'modfit' not found
modfit
## Error in eval(expr, envir, enclos): object 'modfit' not found

Now we compare the posterior COVID-19 infection rate estimates between the two models. In general, the two models provide similar posterior COVID-19 infection rate estimates.

plot(mergedat$mcmc.fitted.prev, mergedat$inla.fitted.prev, xlab = "MCMC", ylab = "INLA_SPDE", xlim = c(0, 0.06), ylim = c(0, 0.06), cex = 0.01)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'mergedat' not found
boxplot(mergedat$mcmc.fitted.prev, mergedat$inla.fitted.prev,
  names = c("MCMC", "INLA_SPDE")
)
## Error in eval(expr, envir, enclos): object 'mergedat' not found

The regression coefficients estimates of the selected covariates for both models are compared.

summary_fixed <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "INLAfixed_sum.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
regr_est <- cbind.data.frame(
  "MCMC" = mod_sum[1:6, 1],
  "INLA_SPDE" = summary_fixed$mean
)
## Error in eval(expr, envir, enclos): object 'mod_sum' not found
regr_est
## Error in eval(expr, envir, enclos): object 'regr_est' not found

Finally, the spatial patterns of the average infection rate estimates over time for each model are displayed below.

mcmc_mean_rate <- dplyr::group_by(mergedat, MSOA11CD) %>% dplyr::summarize(
  mean.rate = mean(mcmc.fitted.prev)
)
## Error in eval(expr, envir, enclos): object 'mergedat' not found
sp_data@data$mcmc_mean_rate <- mcmc_mean_rate$mean.rate
## Error in eval(expr, envir, enclos): object 'mcmc_mean_rate' not found
domain <- sp_data@data$mcmc_mean_rate
## Error in eval(expr, envir, enclos): object 'sp_data' not found
fdmr::plot_map(
  polygon_data = sp_data,
  domain = domain,
  palette = "Reds",
  legend_title = "Rate",
  add_scale_bar = TRUE,
  polygon_fill_opacity = 0.8,
  polygon_line_colour = "transparent"
)
## Error in eval(expr, envir, enclos): object 'sp_data' not found
INLA_mean_rate <- dplyr::group_by(mergedat, MSOA11CD) %>% dplyr::summarize(
  mean.rate = mean(inla.fitted.prev)
)
## Error in eval(expr, envir, enclos): object 'mergedat' not found
sp_data@data$INLA_mean_rate <- INLA_mean_rate$mean.rate
## Error in eval(expr, envir, enclos): object 'INLA_mean_rate' not found
domain <- sp_data@data$INLA_mean_rate
## Error in eval(expr, envir, enclos): object 'sp_data' not found
fdmr::plot_map(
  polygon_data = sp_data,
  domain = domain,
  palette = "Reds",
  legend_title = "Rate",
  add_scale_bar = TRUE,
  polygon_fill_opacity = 0.8,
  polygon_line_colour = "transparent"
)
## Error in eval(expr, envir, enclos): object 'sp_data' not found
Lee, Duncan, Alastair Rushworth, and Gary Napier. 2018. “Spatio-Temporal Areal Unit Modeling in r with Conditional Autoregressive Priors Using the CARBayesST Package.” Journal of Statistical Software 84: 1–39.