Modelling COVID-19 infection across England
In this tutorial we’ll cover some work on fitting a spatio-temporal Bayesian hierarchical model (BHM) to predict the COVID-19 infection rate across England.
Study aim and data description
This study describes how to fit a spatio-temporal BHM to areal data using Markov Chain Monte Carlo (MCMC) simulation method. The first thing is to load all the packages used in the COVID-19 case study.
The study region is mainland England, which is partitioned into 6789
neighbourhoods at the Middle Layer Super Output Area (MSOA) scale. The
infections data are the total reported number of COVID-19 cases in each
MSOA from Jan 8, 2022 to March 26, 2022. The shapefile of the study
region is a SpatialPolygonsDataFrame
, which is used to map
the data. It stores the location, shape and attributes of geographic
features for the neighbourhoods.
Installing CARBayesST
This tutorial requires the CARBayesST
package which you may need to install before working through this
tutorial.
Retrieving and loading data
We first need to retrieve the infections data from the fdmr example data store and unpack it and we’ll use retrieve_tutorial_data to do this.
fdmr::retrieve_tutorial_data(dataset = "covidST_mcmc")
##
## Tutorial data extracted to /home/runner/fdmr/tutorial_data/covidst_mcmc
The COVID-19 data and the related covariate information are included
in our tutorial data package. We’ll load in the data using the
load_tutorial_data
function.
st_covid <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "st_covid.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
Next we’ll use the load_tutorial_data
function to load
in the spatial data we want.
sp_data <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "spatial_data.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
In this study, we use the areal unit modelling approach to fit the BHM and then make model inference using MCMC method. To do this, we need to construct a non-negative symmetric neighbourhood or adjacency matrix that accounts for the spatio-temporal autocorrelation structure, where is the number of areal units. The neighbourhood matrix specifies the spatial closeness between pairs of areal units. The elements in can be either continuous or binary, and a larger value of represents that MSOAs are spatially closer to each other. Here we use the border sharing specification, so that if MSOAs share a common geographical border, and otherwise.
## Error in eval(expr, envir, enclos): object 'sp_data' not found
w <- spdep::nb2mat(W_nb, style = "B")
## Error in eval(expr, envir, enclos): object 'W_nb' not found
Model specification
We use a Bayesian hierarchical model to predict the spatio-temporal COVID-19 infection rate at the neighbourhood level in England. Let denotes the weekly number of reported COVID cases for neighbourhood and week and denotes the (official) estimated population living in neighbourhood and week . is assumed to have a Poisson distribution with parameters (, ), where is the true unobserved COVID-19 infection rate in MSOA and week . We follow a standard path in modelling with a log link to the Poisson and start with a model where the linear predictor decomposes additively into a set of covariates and a Gaussian Markov Random Field process, which characterises the infection of the disease after the covariate effects have been accounted for. A general Bayesian hierarchical model commonly specified is given by
The spatial random effects are included in the model to account for any residual spatio-temporal autocorrelation after adjusting for covariates . Here we utilise the spatio-temporal modelling structure proposed by Rushworth, Lee, and Mitchell (2014) to model . It is given by
where the precision matrix is proposed by Leroux et al.(2000). The algebraic form of this matrix is given by where is the vector of ones, and is the identity matrix. and are the spatial and temporal dependence parameters, respectively, while is the variance parameter.
Define the model formula
In order to fit the model, a model formula needs to be defined, by including the response in the left-hand side and the fixed and random effects in the right-hand side. We select a few risk factors used in our COVID-19 tutorial.
Fit the model
Finally, we fit the spatio-temporal model using the function
ST.CARar()
of the package CARBayesST
developed
by Lee (2018) Lee, Rushworth, and Napier (2018). We first need to organize
the COVID-19 infection and covariate data into a specific format
expected by the ST.CARar()
function. More details can be
found in the help file of ‘ST.CARar()’. Any MSOAs without reported cases
will be stored as missing (NA) values in the data frame.
time.points <- length(unique(st_covid$date))
n <- nrow(sp_data@data)
dat <- data.frame(
MSOA11CD = rep(sp_data$MSOA11CD, time.points),
date = rep(sort(unique(st_covid$date)), each = n),
time = rep(1:time.points, each = n)
)
dat$rowid <- 1:nrow(dat)
out <- merge(dat, st_covid[, c(
"MSOA11CD",
"date",
"MSOA11NM",
"cases",
"Population"
)],
all.x = TRUE,
by = c("MSOA11CD", "date")
)
dat <- out[order(out$rowid), ]
covars <- unique(st_covid[, c(
"MSOA11CD",
"IMD",
"age1",
"perc.chinese",
"perc.indian",
"perc.wb",
"perc.bc",
"perc.ba",
"pm25",
"no2"
)])
out <- merge(dat, covars,
all.x = TRUE,
by = c("MSOA11CD")
)
dat <- out[order(out$rowid), ]
dat$pre <- dat$cases / dat$Population
dat$logpre <- log(dat$pre)
nbhoods <- unique(st_covid[, c("MSOA11CD", "Population")])
for (i in 1:nrow(dat)) { # this will take a few minutes
if (is.na(dat$Population[i])) {
dat$Population[i] <- nbhoods[
match(
dat[i, "MSOA11CD"],
nbhoods$MSOA11CD
),
"Population"
]
}
}
st_covid <- dat
rm(dat, nbhoods, out, covars) # remove non-necessary objects
:warning: Memory requirements: Running the model requires a large amount of memory and may fail if run on a normal laptop / desktop.
Now the data frame ‘st_covid’ has the expected format. Then run the model.
MCMC_model <- CARBayesST::ST.CARar(
formula = form,
data = st_covid,
family = "poisson",
W = w,
burnin = 10000,
n.sample = 30000,
thin = 10, AR = 1
)
Now we summarise the modelling results. “fitted_vals” stores the predicted COVID-19 infection rate at each MSOA and time point. “modfits” stores the DIC and WAIC values, which measure the goodness of model fit. “mod_sum” provides the values for parameters , and .
fitted_vals <- exp(sum(apply(MCMC_model$samples$beta, 2, mean)) +
apply(MCMC_model$samples$phi, 2, mean))
fitted_vals <- cbind.data.frame(st_covid[, c("MSOA11CD", "date")], fitted_vals)
modfits <- MCMC_model$modelfit
mod_sum <- MCMC_model$summary.results
The above modelling results are provided in the tutorial data package so we’ll load them now.
NOTE: If you’ve run the full model above you don’t need to load in the files below.
fitted_vals <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "fitted_vals.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
modfits <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "modfits.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
mod_sum <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "mod_sum.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
For comparison purpose, we also fit a separate BHM to the same dataset using the INLA-SPDE approach. The infection rates predicted by the INLA-SPDE approach is saved in the date frame named “inla_preds” and is provided in the tutorial data package. We’ll load that in now.
inla_preds <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "inla_preds.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
Then the predictions from the two models are merged into one data frame named “mergedat”.
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': object 'inla_preds' not found
Model comparison
We show the DIC and WAIC values for each model. The model using the MCMC approach performs better than the model using the INLA-SPDE approach in terms of the lower DIC and WAIC values.
inla_sum <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "INLAmodsum.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
modfit <- data.frame(
DIC = c(modfits[1], inla_sum[1]),
WAIC = c(modfits[2], inla_sum[2])
)
## Error in eval(expr, envir, enclos): object 'modfits' not found
## Error: object 'modfit' not found
modfit
## Error in eval(expr, envir, enclos): object 'modfit' not found
Now we compare the posterior COVID-19 infection rate estimates between the two models. In general, the two models provide similar posterior COVID-19 infection rate estimates.
plot(mergedat$mcmc.fitted.prev, mergedat$inla.fitted.prev, xlab = "MCMC", ylab = "INLA_SPDE", xlim = c(0, 0.06), ylim = c(0, 0.06), cex = 0.01)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'mergedat' not found
## Error in eval(expr, envir, enclos): object 'mergedat' not found
The regression coefficients estimates of the selected covariates for both models are compared.
summary_fixed <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "INLAfixed_sum.rds")
## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?
regr_est <- cbind.data.frame(
"MCMC" = mod_sum[1:6, 1],
"INLA_SPDE" = summary_fixed$mean
)
## Error in eval(expr, envir, enclos): object 'mod_sum' not found
regr_est
## Error in eval(expr, envir, enclos): object 'regr_est' not found
Finally, the spatial patterns of the average infection rate estimates over time for each model are displayed below.
mcmc_mean_rate <- dplyr::group_by(mergedat, MSOA11CD) %>% dplyr::summarize(
mean.rate = mean(mcmc.fitted.prev)
)
## Error in eval(expr, envir, enclos): object 'mergedat' not found
sp_data@data$mcmc_mean_rate <- mcmc_mean_rate$mean.rate
## Error in eval(expr, envir, enclos): object 'mcmc_mean_rate' not found
domain <- sp_data@data$mcmc_mean_rate
## Error in eval(expr, envir, enclos): object 'sp_data' not found
fdmr::plot_map(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
polygon_line_colour = "transparent"
)
## Error in eval(expr, envir, enclos): object 'sp_data' not found
INLA_mean_rate <- dplyr::group_by(mergedat, MSOA11CD) %>% dplyr::summarize(
mean.rate = mean(inla.fitted.prev)
)
## Error in eval(expr, envir, enclos): object 'mergedat' not found
sp_data@data$INLA_mean_rate <- INLA_mean_rate$mean.rate
## Error in eval(expr, envir, enclos): object 'INLA_mean_rate' not found
domain <- sp_data@data$INLA_mean_rate
## Error in eval(expr, envir, enclos): object 'sp_data' not found
fdmr::plot_map(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
polygon_line_colour = "transparent"
)
## Error in eval(expr, envir, enclos): object 'sp_data' not found