## Modelling COVID-19 infection across England

In this tutorial we’ll cover some work on fitting a spatio-temporal Bayesian hierarchical model (BHM) to predict the COVID-19 infection rate across England.

## Study aim and data description

This study describes how to fit a spatio-temporal BHM to areal data using Markov Chain Monte Carlo (MCMC) simulation method. The first thing is to load all the packages used in the COVID-19 case study.

The study region is mainland England, which is partitioned into 6789
neighbourhoods at the Middle Layer Super Output Area (MSOA) scale. The
infections data are the total reported number of COVID-19 cases in each
MSOA from Jan 8, 2022 to March 26, 2022. The shapefile of the study
region is a `SpatialPolygonsDataFrame`

, which is used to map
the data. It stores the location, shape and attributes of geographic
features for the neighbourhoods.

### Installing CARBayesST

This tutorial requires the `CARBayesST`

package which you may need to install before working through this
tutorial.

### Retrieving and loading data

We first need to retrieve the infections data from the fdmr example data store and unpack it and we’ll use retrieve_tutorial_data to do this.

`fdmr::retrieve_tutorial_data(dataset = "covidST_mcmc")`

```
##
## Tutorial data extracted to /home/runner/fdmr/tutorial_data/covidst_mcmc
```

The COVID-19 data and the related covariate information are included
in our tutorial data package. We’ll load in the data using the
`load_tutorial_data`

function.

`st_covid <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "st_covid.rds")`

`## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?`

Next we’ll use the `load_tutorial_data`

function to load
in the spatial data we want.

`sp_data <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "spatial_data.rds")`

`## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?`

In this study, we use the areal unit modelling approach to fit the BHM and then make model inference using MCMC method. To do this, we need to construct a non-negative symmetric $n \times n$ neighbourhood or adjacency matrix $\boldsymbol{W}$ that accounts for the spatio-temporal autocorrelation structure, where $n=6789$ is the number of areal units. The neighbourhood matrix specifies the spatial closeness between pairs of areal units. The elements $\{w_{ij}\}$ in $\boldsymbol{W}$ can be either continuous or binary, and a larger value of $w_{ij}$ represents that MSOAs $(i,j)$ are spatially closer to each other. Here we use the border sharing specification, so that $w_{ij}=1$ if MSOAs $(i,j)$ share a common geographical border, and $w_{ij}=0$ otherwise.

`## Error in eval(expr, envir, enclos): object 'sp_data' not found`

`w <- spdep::nb2mat(W_nb, style = "B")`

`## Error in eval(expr, envir, enclos): object 'W_nb' not found`

## Model specification

We use a Bayesian hierarchical model to predict the spatio-temporal COVID-19 infection rate at the neighbourhood level in England. Let $Y_{it}$ denotes the weekly number of reported COVID cases for neighbourhood $i=1,\ldots, n(=6789)$ and week $t=1,\ldots, T(=108)$ and $N_{it}$ denotes the (official) estimated population living in neighbourhood $i$ and week $t$. $Y_{it}$ is assumed to have a Poisson distribution with parameters ($N_{it}$, $\theta_{it}$), where $\theta_{it}$ is the true unobserved COVID-19 infection rate in MSOA $i$ and week $t$. We follow a standard path in modelling $\theta_{it}$ with a log link to the Poisson and start with a model where the linear predictor decomposes additively into a set of covariates and a Gaussian Markov Random Field process, which characterises the infection of the disease after the covariate effects have been accounted for. A general Bayesian hierarchical model commonly specified is given by

$\begin{align} \nonumber Y_{it}\vert N_{it}, \theta_{it} &\sim \text{Poisson}(N_{it}\theta_{it}),\ \ i=1,\ldots,n, t=1,\ldots,T,\\ log(\theta_{it} )&=\boldsymbol{x_{it}^{\top}}\boldsymbol{\beta}+\phi_{it}. \end{align}$

The spatial random effects $\{\phi_{it}\}$ are included in the model to account for any residual spatio-temporal autocorrelation after adjusting for covariates $\boldsymbol{x_{it}}$. Here we utilise the spatio-temporal modelling structure proposed by Rushworth, Lee, and Mitchell (2014) to model $\{\phi_{it}\}$. It is given by

$\begin{align} \nonumber \boldsymbol{\phi_1}&\sim \text{N}\left(\boldsymbol{0}, \tau^2\boldsymbol{ Q}(\boldsymbol{W})^{-1}\right),\\ \boldsymbol{\phi_t}\vert\boldsymbol{\phi_{t-1}}&\sim \text{N}\left(\alpha\boldsymbol{\phi_{t-1}},\tau^2\boldsymbol{Q}(\boldsymbol{W},\rho)^{-1}\right), \ \ t=2,\ldots, T,\\ \end{align}$ where the precision matrix $\boldsymbol{Q}(\boldsymbol{W},\rho)$ is proposed by Leroux et al.(2000). The algebraic form of this matrix is given by $\boldsymbol{Q}(\boldsymbol{W},\rho)=\rho[diag(\boldsymbol{W1})-\boldsymbol{W}]+(1-\rho)\boldsymbol{I},$ where $\boldsymbol{1}$ is the $n\times 1$ vector of ones, and is the $n\times n$ identity matrix. $\rho$ and $\alpha$ are the spatial and temporal dependence parameters, respectively, while $\tau^2$ is the variance parameter.

## Define the model formula

In order to fit the model, a model formula needs to be defined, by including the response in the left-hand side and the fixed and random effects in the right-hand side. We select a few risk factors used in our COVID-19 tutorial.

## Fit the model

Finally, we fit the spatio-temporal model using the function
`ST.CARar()`

of the package `CARBayesST`

developed
by Lee (2018) Lee, Rushworth, and Napier (2018). We first need to organize
the COVID-19 infection and covariate data into a specific format
expected by the `ST.CARar()`

function. More details can be
found in the help file of ‘ST.CARar()’. Any MSOAs without reported cases
will be stored as missing (NA) values in the data frame.

```
time.points <- length(unique(st_covid$date))
n <- nrow(sp_data@data)
dat <- data.frame(
MSOA11CD = rep(sp_data$MSOA11CD, time.points),
date = rep(sort(unique(st_covid$date)), each = n),
time = rep(1:time.points, each = n)
)
dat$rowid <- 1:nrow(dat)
out <- merge(dat, st_covid[, c(
"MSOA11CD",
"date",
"MSOA11NM",
"cases",
"Population"
)],
all.x = TRUE,
by = c("MSOA11CD", "date")
)
dat <- out[order(out$rowid), ]
covars <- unique(st_covid[, c(
"MSOA11CD",
"IMD",
"age1",
"perc.chinese",
"perc.indian",
"perc.wb",
"perc.bc",
"perc.ba",
"pm25",
"no2"
)])
out <- merge(dat, covars,
all.x = TRUE,
by = c("MSOA11CD")
)
dat <- out[order(out$rowid), ]
dat$pre <- dat$cases / dat$Population
dat$logpre <- log(dat$pre)
nbhoods <- unique(st_covid[, c("MSOA11CD", "Population")])
for (i in 1:nrow(dat)) { # this will take a few minutes
if (is.na(dat$Population[i])) {
dat$Population[i] <- nbhoods[
match(
dat[i, "MSOA11CD"],
nbhoods$MSOA11CD
),
"Population"
]
}
}
st_covid <- dat
rm(dat, nbhoods, out, covars) # remove non-necessary objects
```

:warning:

Memory requirements: Running the model requires a large amount of memory and may fail if run on a normal laptop / desktop.

Now the data frame ‘st_covid’ has the expected format. Then run the model.

```
MCMC_model <- CARBayesST::ST.CARar(
formula = form,
data = st_covid,
family = "poisson",
W = w,
burnin = 10000,
n.sample = 30000,
thin = 10, AR = 1
)
```

Now we summarise the modelling results. “fitted_vals” stores the predicted COVID-19 infection rate at each MSOA and time point. “modfits” stores the DIC and WAIC values, which measure the goodness of model fit. “mod_sum” provides the values for parameters $\tau^2$, $\rho$ and $\alpha$.

```
fitted_vals <- exp(sum(apply(MCMC_model$samples$beta, 2, mean)) +
apply(MCMC_model$samples$phi, 2, mean))
fitted_vals <- cbind.data.frame(st_covid[, c("MSOA11CD", "date")], fitted_vals)
modfits <- MCMC_model$modelfit
mod_sum <- MCMC_model$summary.results
```

The above modelling results are provided in the tutorial data package so we’ll load them now.

If you’ve run the full model above you don’t need to load in the files below.NOTE:

`fitted_vals <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "fitted_vals.rds")`

`## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?`

`modfits <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "modfits.rds")`

`## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?`

`mod_sum <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "mod_sum.rds")`

`## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?`

For comparison purpose, we also fit a separate BHM to the same dataset using the INLA-SPDE approach. The infection rates predicted by the INLA-SPDE approach is saved in the date frame named “inla_preds” and is provided in the tutorial data package. We’ll load that in now.

`inla_preds <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "inla_preds.rds")`

`## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?`

Then the predictions from the two models are merged into one data frame named “mergedat”.

`## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': object 'inla_preds' not found`

## Model comparison

We show the DIC and WAIC values for each model. The model using the MCMC approach performs better than the model using the INLA-SPDE approach in terms of the lower DIC and WAIC values.

`inla_sum <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "INLAmodsum.rds")`

`## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?`

```
modfit <- data.frame(
DIC = c(modfits[1], inla_sum[1]),
WAIC = c(modfits[2], inla_sum[2])
)
```

`## Error in eval(expr, envir, enclos): object 'modfits' not found`

`## Error: object 'modfit' not found`

`modfit`

`## Error in eval(expr, envir, enclos): object 'modfit' not found`

Now we compare the posterior COVID-19 infection rate estimates between the two models. In general, the two models provide similar posterior COVID-19 infection rate estimates.

`plot(mergedat$mcmc.fitted.prev, mergedat$inla.fitted.prev, xlab = "MCMC", ylab = "INLA_SPDE", xlim = c(0, 0.06), ylim = c(0, 0.06), cex = 0.01)`

`## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'mergedat' not found`

`## Error in eval(expr, envir, enclos): object 'mergedat' not found`

The regression coefficients estimates of the selected covariates for both models are compared.

`summary_fixed <- fdmr::load_tutorial_data(dataset = "covidST_mcmc", filename = "INLAfixed_sum.rds")`

`## Error in get_tutorial_datapath(dataset = dataset, filename = filename): Unable to load data, the folder /home/runner/fdmr/tutorial_data/covidST_mcmc does not exist. Have you run retrieve_tutorial_data?`

```
regr_est <- cbind.data.frame(
"MCMC" = mod_sum[1:6, 1],
"INLA_SPDE" = summary_fixed$mean
)
```

`## Error in eval(expr, envir, enclos): object 'mod_sum' not found`

`regr_est`

`## Error in eval(expr, envir, enclos): object 'regr_est' not found`

Finally, the spatial patterns of the average infection rate estimates over time for each model are displayed below.

```
mcmc_mean_rate <- dplyr::group_by(mergedat, MSOA11CD) %>% dplyr::summarize(
mean.rate = mean(mcmc.fitted.prev)
)
```

`## Error in eval(expr, envir, enclos): object 'mergedat' not found`

`sp_data@data$mcmc_mean_rate <- mcmc_mean_rate$mean.rate`

`## Error in eval(expr, envir, enclos): object 'mcmc_mean_rate' not found`

`domain <- sp_data@data$mcmc_mean_rate`

`## Error in eval(expr, envir, enclos): object 'sp_data' not found`

```
fdmr::plot_map(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
polygon_line_colour = "transparent"
)
```

`## Error in eval(expr, envir, enclos): object 'sp_data' not found`

```
INLA_mean_rate <- dplyr::group_by(mergedat, MSOA11CD) %>% dplyr::summarize(
mean.rate = mean(inla.fitted.prev)
)
```

`## Error in eval(expr, envir, enclos): object 'mergedat' not found`

`sp_data@data$INLA_mean_rate <- INLA_mean_rate$mean.rate`

`## Error in eval(expr, envir, enclos): object 'INLA_mean_rate' not found`

`domain <- sp_data@data$INLA_mean_rate`

`## Error in eval(expr, envir, enclos): object 'sp_data' not found`

```
fdmr::plot_map(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
polygon_line_colour = "transparent"
)
```

`## Error in eval(expr, envir, enclos): object 'sp_data' not found`

*Journal of Statistical Software*84: 1–39.