Aim and data description
This tutorial gives a short introduction to data pre-processing, which involves transforming raw data into the format and object types required for running the Bayesian Hierarchical Model (BHM) in the fdmr package. To illustrate the process, we use COVID-19 infection data as a practical example.
In the COVID-19 tutorial, we aim to fit a Bayesian spatio-temporal model to predict the COVID-19 infection rates across mainland England over space and time, and investigate the impacts of socioeconomic, demographic and environmental factors on COVID-19 infection. The study region is mainland England, which is partitioned into 6789 Middle Layer Super Output Areas (MSOAs). The raw shapefile of the study region is obtained from the ONS Open Geography Portal, which stores the location, shape and attributes of geographic features for the MSOAs.
First, we retrieve the tutorial dataset named "preprocessing".
fdmr::retrieve_tutorial_data(dataset = "preprocessing")
##
## Tutorial data extracted to /home/runner/fdmr/tutorial_data/preprocessing
Then we load the shapefile into R using sf::read_sf and store it in an object named sp_data.
shapefilepath <- fdmr::get_tutorial_datapath(dataset = "preprocessing", filename = "MSOA_(Dec_2011)_Boundaries_Super_Generalised_Clipped_(BSC)_EW_V3.shp")
sp_data <- sf::read_sf(dsn = shapefilepath)
The class of the object sp_data is sf, which behind the scenes is a tibble (tbl_df).
class(sp_data)
## [1] "sf" "tbl_df" "tbl" "data.frame"
Then we retrieve the projection attributes of the shapefile, i.e., sp_data, and transform it from its original coordinate reference system (CRS) to a new CRS, which is the World Geodetic System 1984 (WGS84). If a SpatialPolygonsDataFrame is needed, the transformed object can afterwards be converted using sf::as_Spatial; the code that follows performs only the CRS transformation.
sp_data <- sf::st_transform(sp_data, sf::st_crs("+proj=longlat +datum=WGS84"))
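The transformation step can be checked on a small example. The following is a minimal sketch, not part of the tutorial data: it builds a toy square whose coordinates are made up and assumed to be in the British National Grid (EPSG:27700), transforms it to WGS84 in the same way as above, and confirms that the result uses longitude/latitude coordinates. The names square, toy_sf, toy_wgs84 and toy_sp are hypothetical. The optional conversion with sf::as_Spatial only runs if the sp package is installed.

```r
library(sf)

# Toy square in British National Grid (EPSG:27700); coordinates are illustrative only
square <- sf::st_polygon(list(rbind(
  c(430000, 430000), c(431000, 430000), c(431000, 431000),
  c(430000, 431000), c(430000, 430000)
)))
toy_sf <- sf::st_sf(id = 1, geometry = sf::st_sfc(square, crs = 27700))

# Inspect the original CRS, then transform to WGS84 longitude/latitude
sf::st_crs(toy_sf)
toy_wgs84 <- sf::st_transform(toy_sf, sf::st_crs("+proj=longlat +datum=WGS84"))

# TRUE when the coordinates are geographic (longitude/latitude)
sf::st_is_longlat(toy_wgs84)

# Optional: convert to a SpatialPolygonsDataFrame (needs the sp package)
if (requireNamespace("sp", quietly = TRUE)) {
  toy_sp <- sf::as_Spatial(toy_wgs84)
  class(toy_sp)
}
```

The same sf::st_is_longlat check can be applied to sp_data after the transformation to confirm that it succeeded.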
In the COVID-19 tutorial, the raw COVID-19 infections data and the
related covariate data were obtained from the official UK Government
COVID-19 dashboard and the Office for National Statistics (ONS). The
data were initially downloaded, organised and saved in a CSV file
format. The CSV file can be imported into R using the utils::read.csv() function.
covid19_data_filepath <- fdmr::get_tutorial_datapath(dataset = "preprocessing", filename = "covid19_data.csv")
covid19_data <- utils::read.csv(file = covid19_data_filepath)
The class of the object covid19_data is data.frame.
class(covid19_data)
## [1] "data.frame"
The first 6 rows of the data set can be viewed using the following code:
utils::head(covid19_data)
## MSOA11CD date week LONG LAT cases Population IMD
## 1 E02002415 1/1/2022 1 -1.54813 53.77558 57 7698 68.39
## 2 E02002391 1/1/2022 1 -1.66579 53.80466 203 7380 20.85
## 3 E02002377 1/1/2022 1 -1.51938 53.81505 78 7955 52.34
## 4 E02002431 1/1/2022 1 -1.58525 53.74190 204 8366 27.40
## 5 E02002998 1/1/2022 1 -2.37911 51.37245 77 6023 12.12
## 6 E02002999 1/1/2022 1 -2.39800 51.37034 83 5597 19.13
## carebeds.ratio AandETRUE perc.chinese perc.indian perc.bangladeshi
## 1 0.000000000 0 0.7112918 1.041534 12.66353360
## 2 0.014004309 0 0.4419192 1.022727 0.01262626
## 3 0.001402525 0 0.6257489 3.608042 20.48994808
## 4 0.003969340 0 1.0094136 2.302370 0.01134173
## 5 0.013117143 0 1.3052209 1.857430 0.31793842
## 6 0.000000000 0 1.1537096 1.100461 0.65672701
## perc.pakistani perc.ba perc.bc perc.wb age1 age2 age3
## 1 22.8883526 18.4554808 1.4098819 20.157500 21.17433 23.13588 19.32970
## 2 0.5808081 1.1237374 0.1262626 91.136364 15.01355 21.31436 26.76152
## 3 27.7592864 9.8655306 4.2604181 9.253095 21.87304 22.95412 16.07794
## 4 0.7258705 1.4630827 0.3515935 85.210389 17.77432 21.92207 24.98207
## 5 0.1171352 0.6860776 0.4685408 80.722892 38.53561 14.36161 18.61199
## 6 0.1597444 0.3727370 0.3727370 84.700035 18.31338 18.59925 23.36966
## age4 pm25 no2
## 1 5.520915 8.909063 18.108010
## 2 16.463415 7.430964 15.525150
## 3 6.712759 7.913701 18.095600
## 4 13.387521 7.540963 16.344490
## 5 13.132990 7.082593 7.628312
## 6 19.403252 6.926548 7.073514
The data frame contains 23 columns. MSOA11CD is the spatial identifier for each data observation. Variable cases is the response variable, which is the weekly reported number of COVID-19 cases in each of the 6789 MSOAs in mainland England over the period from 2022-01-01 to 2022-03-26. Variable date indicates the start date of the observation week in which the COVID-19 infections data for each MSOA were reported. Variable week is the index number of the week in which each observation was collected. Columns LONG and LAT indicate the longitude and latitude for each MSOA. Variable Population indicates the population size for each MSOA. The remaining columns store the data for each covariate in each MSOA and week.
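The relationship between date and week can be illustrated with a small base R example. This is a minimal sketch under two assumptions that the sample rows do not settle: that the dates are written in day/month/year order, and that weeks are counted from the earliest date in the data. The vector toy_dates and the other names are made up for illustration and are not taken from covid19_data.

```r
# Hypothetical date strings in the same style as the date column
toy_dates <- c("1/1/2022", "8/1/2022", "26/3/2022")

# Assumption: day/month/year order
parsed <- as.Date(toy_dates, format = "%d/%m/%Y")

# Days elapsed since the earliest date, converted to a 1-based week index
days_since_start <- as.numeric(parsed - min(parsed))
week_index <- as.integer(days_since_start %/% 7) + 1L

data.frame(date = parsed, week = week_index)
```

Under these assumptions the last date, 2022-03-26, falls in week 13, which matches a study period of 13 weekly observations starting on 2022-01-01.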
Therefore, the expected observation and measurement data format for a spatio-temporal Bayesian hierarchical model, as in the COVID-19 tutorial, is a data frame that includes one column for the response variable (e.g., cases), two columns for the spatial location of each observation (e.g., LONG and LAT), and one column containing time point indices indicating when each observation was collected (e.g., week = 1, 2, …). If the model incorporates covariates, then the covariate data should also be included in the same data frame, with each covariate stored in its own column. Users can use any variable names for the columns, as long as they are consistent with those used when defining the model formula and fitting the model. The following table provides a summary of the expected data format for running the BHM in the fdmr package:
ID | LONG | LAT | Time | Response Variable | Covariate 1 | Covariate 2 | Covariate… |
---|---|---|---|---|---|---|---|
1 | … | … | … | … | … | … | … |
2 | … | … | … | … | … | … | … |
… | … | … | … | … | … | … | … |
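The format above can be checked before model fitting. The following is a minimal base R sketch: it builds a toy data frame with made-up values, lists the variables used in a hypothetical formula, and verifies that every required column is present. The names toy_data, model_formula, required and missing_cols are assumptions for illustration, and the formula is used only to extract variable names; it is not the model formula from the COVID-19 tutorial.

```r
# Toy data in the expected layout; all values are made up for illustration
toy_data <- data.frame(
  ID = 1:4,
  LONG = c(-1.55, -1.67, -1.55, -1.67),
  LAT = c(53.78, 53.80, 53.78, 53.80),
  week = c(1, 1, 2, 2),
  cases = c(57, 203, 61, 180),
  IMD = c(68.39, 20.85, 68.39, 20.85)
)

# Hypothetical formula: response on the left, covariates on the right
model_formula <- cases ~ IMD

# Location and time columns plus every variable named in the formula
required <- c("LONG", "LAT", "week", all.vars(model_formula))

# Columns that the formula or the spatial/temporal structure needs but the data lacks
missing_cols <- setdiff(required, names(toy_data))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

A check of this kind catches mismatches between the column names in the data and the names used in the model formula before any fitting is attempted.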
With sp_data and covid19_data in the expected data object and format, we now possess all the essential information required for fitting the BHM and visualising the results. More details regarding the model fitting process can be found in the COVID-19 tutorial.