Data pre-processing • fdmr

Aim and data description

This tutorial gives a short introduction to data pre-processing, which involves transforming raw data into the format and object types required to run the Bayesian Hierarchical Model (BHM) in the fdmr package. To illustrate the process, we use COVID-19 infection data as a practical example.

In the COVID-19 tutorial, we aim to fit a Bayesian spatio-temporal model to predict the COVID-19 infection rates across mainland England over space and time, and investigate the impacts of socioeconomic, demographic and environmental factors on COVID-19 infection. The study region is mainland England, which is partitioned into 6789 Middle Layer Super Output Areas (MSOAs). The raw shapefile of the study region is obtained from the ONS Open Geography Portal, which stores the location, shape and attributes of geographic features for the MSOAs.

First, we retrieve the tutorial dataset named preprocessing.

fdmr::retrieve_tutorial_data(dataset = "preprocessing")

## 
## Tutorial data extracted to  /tmp/RtmpV80V5y/fdmr/tutorial_data/preprocessing

Then we load the shapefile into R using sf::read_sf and store it in an object named sp_data.

shapefilepath <- fdmr::get_tutorial_datapath(dataset = "preprocessing", filename = "MSOA_(Dec_2011)_Boundaries_Super_Generalised_Clipped_(BSC)_EW_V3.shp")
sp_data <- sf::read_sf(dsn = shapefilepath)

The object sp_data has class sf, which is built on top of a tibble (tbl_df).

class(sp_data)

## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

Then we transform the shapefile, i.e., sp_data, from its original coordinate reference system (CRS) to a new CRS, the World Geodetic System 1984 (WGS84), using sf::st_transform. The original CRS can be inspected beforehand with sf::st_crs(sp_data). If a SpatialPolygonsDataFrame is needed later on, the transformed object can be converted with sf::as_Spatial.

sp_data <- sf::st_transform(sp_data, sf::st_crs("+proj=longlat +datum=WGS84"))
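As a quick sanity check on the reprojection step, the sketch below builds a single point in a projected CRS and transforms it to WGS84 in the same way as above. The starting CRS (EPSG:27700, British National Grid) and the point coordinates are assumptions chosen for illustration; the CRS of the tutorial shapefile should be inspected with sf::st_crs(sp_data) rather than assumed.

```r
library(sf)

# Hypothetical point in British National Grid (EPSG:27700); illustrative only.
pt_bng <- st_sfc(st_point(c(430000, 433000)), crs = 27700)

# Inspect the CRS before transforming.
st_crs(pt_bng)$epsg

# Reproject to WGS84 longitude/latitude, as done for sp_data above.
pt_wgs84 <- st_transform(pt_bng, st_crs("+proj=longlat +datum=WGS84"))

# TRUE when the coordinates are geographic (longitude/latitude).
st_is_longlat(pt_wgs84)
```

The same two calls, sf::st_crs and sf::st_is_longlat, can be applied to sp_data before and after the transformation to confirm that the reprojection took effect.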

In the COVID-19 tutorial, the raw COVID-19 infections data and the related covariate data were obtained from the official UK Government COVID-19 dashboard and the Office for National Statistics (ONS). The data were initially downloaded, organised and saved in a CSV file format. The CSV file can be imported into R using the utils::read.csv() function.

covid19_data_filepath <- fdmr::get_tutorial_datapath(dataset = "preprocessing", filename = "covid19_data.csv")
covid19_data <- utils::read.csv(file = covid19_data_filepath)

The type of the object covid19_data is a data.frame.

class(covid19_data)

## [1] "data.frame"

The first 6 rows of the data set can be viewed using the following code:

utils::head(covid19_data)

##    MSOA11CD     date week     LONG      LAT cases Population   IMD
## 1 E02002415 1/1/2022    1 -1.54813 53.77558    57       7698 68.39
## 2 E02002391 1/1/2022    1 -1.66579 53.80466   203       7380 20.85
## 3 E02002377 1/1/2022    1 -1.51938 53.81505    78       7955 52.34
## 4 E02002431 1/1/2022    1 -1.58525 53.74190   204       8366 27.40
## 5 E02002998 1/1/2022    1 -2.37911 51.37245    77       6023 12.12
## 6 E02002999 1/1/2022    1 -2.39800 51.37034    83       5597 19.13
##   carebeds.ratio AandETRUE perc.chinese perc.indian perc.bangladeshi
## 1    0.000000000         0    0.7112918    1.041534      12.66353360
## 2    0.014004309         0    0.4419192    1.022727       0.01262626
## 3    0.001402525         0    0.6257489    3.608042      20.48994808
## 4    0.003969340         0    1.0094136    2.302370       0.01134173
## 5    0.013117143         0    1.3052209    1.857430       0.31793842
## 6    0.000000000         0    1.1537096    1.100461       0.65672701
##   perc.pakistani    perc.ba   perc.bc   perc.wb     age1     age2     age3
## 1     22.8883526 18.4554808 1.4098819 20.157500 21.17433 23.13588 19.32970
## 2      0.5808081  1.1237374 0.1262626 91.136364 15.01355 21.31436 26.76152
## 3     27.7592864  9.8655306 4.2604181  9.253095 21.87304 22.95412 16.07794
## 4      0.7258705  1.4630827 0.3515935 85.210389 17.77432 21.92207 24.98207
## 5      0.1171352  0.6860776 0.4685408 80.722892 38.53561 14.36161 18.61199
## 6      0.1597444  0.3727370 0.3727370 84.700035 18.31338 18.59925 23.36966
##        age4     pm25       no2
## 1  5.520915 8.909063 18.108010
## 2 16.463415 7.430964 15.525150
## 3  6.712759 7.913701 18.095600
## 4 13.387521 7.540963 16.344490
## 5 13.132990 7.082593  7.628312
## 6 19.403252 6.926548  7.073514

The data frame contains 23 columns. MSOA11CD is the spatial identifier for each observation. Variable cases is the response variable, which is the weekly reported number of COVID-19 cases in each of the 6789 MSOAs in mainland England over the period from 2022-01-01 to 2022-03-26. Variable date gives the start date of the week in which the COVID-19 infections data for each MSOA were reported. Variable week gives the index of the week in which each observation was collected. Columns LONG and LAT give the longitude and latitude of each MSOA. Variable Population gives the population size of each MSOA. The remaining columns store the covariate data for each MSOA and week.
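The structure described above can be checked programmatically before modelling. The following is a minimal sketch on a toy data frame; the second and third week values are made up for illustration and are not taken from covid19_data. It also assumes, hypothetically, that dates are stored as day/month/year strings, which should be confirmed against the real file.

```r
# Toy stand-in for covid19_data: 2 areas x 3 weeks (values are illustrative).
toy <- data.frame(
  MSOA11CD = rep(c("E02002415", "E02002391"), times = 3),
  date     = rep(c("1/1/2022", "8/1/2022", "15/1/2022"), each = 2),
  week     = rep(1:3, each = 2),
  cases    = c(57, 203, 40, 150, 35, 120)
)

# Assumption: day/month/year strings; adjust `format` if the file differs.
toy$date <- as.Date(toy$date, format = "%d/%m/%Y")

# Every area should appear exactly once in every week (a balanced panel).
counts <- table(toy$MSOA11CD, toy$week)
all(counts == 1)

# The week index should increase with the week start date.
week_starts <- tapply(as.numeric(toy$date), toy$week, min)
!is.unsorted(week_starts)

# No date should have failed to parse.
!anyNA(toy$date)
```

If any of these checks returns FALSE on the real data, the offending rows are worth inspecting before the data frame is passed to the model.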

In general, the observation data for a spatio-temporal Bayesian hierarchical model, as in the COVID-19 tutorial, should be a data frame that includes one column for the response variable (e.g., cases), two columns for the spatial location of each observation (e.g., LONG and LAT), and one column of time point indices indicating when each observation was collected (e.g., week = 1, 2, …). If the model incorporates covariates, the covariate data should be included in the same data frame, with one column per covariate. Users can choose any column names, as long as they match the names used when defining the model formula and fitting the model. The following table summarises the expected data format for running the BHM in the fdmr package:

ID	LONG	LAT	Time	Response Variable	Covariate 1	Covariate 2	Covariate…
1	…	…	…	…	…	…	…
2	…	…	…	…	…	…	…
…	…	…	…	…	…	…	…
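To make the naming requirement concrete, here is a minimal sketch in base R that checks whether every variable in a model formula is present as a column of the data frame. The toy data and the formula are hypothetical and are not the model used in the COVID-19 tutorial.

```r
# Toy data frame in the expected layout (values are illustrative).
toy <- data.frame(
  ID    = 1:4,
  LONG  = c(-1.55, -1.67, -1.52, -1.59),
  LAT   = c(53.78, 53.80, 53.82, 53.74),
  week  = c(1, 1, 2, 2),
  cases = c(57, 203, 78, 204),
  IMD   = c(68.39, 20.85, 52.34, 27.40)
)

# Hypothetical formula; names must match the column names exactly.
model_formula <- cases ~ IMD + week

# all.vars() extracts the variable names used in the formula.
missing_vars <- setdiff(all.vars(model_formula), names(toy))

if (length(missing_vars) > 0) {
  stop("Columns missing from data: ", paste(missing_vars, collapse = ", "))
}
```

Running a check like this before fitting tends to surface spelling or capitalisation mismatches earlier than the model-fitting call would.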

With sp_data and covid19_data in the expected object types and format, we now have all the essential information required for fitting the BHM and visualising the results. More details on the model fitting process can be found in the COVID-19 tutorial.
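Before moving on, it can be worth confirming that the spatial identifiers in the two objects line up. The sketch below uses two toy data frames in place of sp_data and covid19_data; it assumes that the shapefile attribute table carries an MSOA11CD column, which is not confirmed here and should be checked with names(sp_data).

```r
# Toy attribute table standing in for sp_data (assumed to carry MSOA11CD).
toy_areas <- data.frame(MSOA11CD = c("E02002415", "E02002391", "E02002377"))

# Toy observations standing in for covid19_data.
toy_obs <- data.frame(
  MSOA11CD = c("E02002415", "E02002391", "E02002415", "E02002391"),
  week     = c(1, 1, 2, 2)
)

# Observation IDs with no matching polygon (should be empty).
unmatched_obs <- setdiff(toy_obs$MSOA11CD, toy_areas$MSOA11CD)

# Polygons with no observations (may legitimately be non-empty).
unobserved_areas <- setdiff(toy_areas$MSOA11CD, toy_obs$MSOA11CD)
```

A non-empty unmatched_obs would mean some observations cannot be placed on the map, whereas a non-empty unobserved_areas only means that some polygons have no data.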
