In this tutorial, we will show the expected data structure for
running the Bayesian Hierarchical Model (BHM) in the fdmr
package. First, let’s have a look at the data structures used in our COVID-19
tutorial, Hydrology
tutorial and Priors
exploration tutorial.
In the COVID-19
tutorial, we aim to fit a Bayesian spatio-temporal model to predict
the COVID-19 infection rates across mainland England over space and
time, and investigate the impacts of socioeconomic, demographic and
environmental factors on COVID-19 infection. We load the dataset
covid19_data
and the type of this object is a
data.frame
.
covid19_data <- fdmr::load_tutorial_data(dataset = "covid", filename = "covid19_data.rds")
class(covid19_data)
## [1] "data.frame"
The first 6 rows of the data set can be viewed using the
utils::head()
function.
utils::head(covid19_data)
## MSOA11CD date week LONG LAT cases Population IMD
## 6521 E02006663 2020-03-07 1 -1.77072 51.17063 3 12605 12.94
## 2101 E02002160 2020-03-14 2 -2.07682 52.59541 4 7481 24.79
## 3321 E02003411 2020-03-14 2 -0.57737 51.52327 8 11443 25.13
## 4008 E02004108 2020-03-14 2 -1.43761 53.29906 4 7514 17.33
## 3334 E02003424 2020-03-14 2 -0.71324 51.51317 4 8947 4.58
## 2094 E02002153 2020-03-14 2 -2.05805 52.61565 3 7217 31.88
## carebeds.ratio AandETRUE perc.chinese perc.indian perc.bangladeshi
## 6521 0.01630534 0 0.2153846 0.5384615 0.02307692
## 2101 0.01440074 0 0.4223307 18.2922001 0.11878052
## 3321 0.01265956 1 0.2399232 16.6826615 0.67178503
## 4008 0.02191863 0 0.1455411 0.2910823 0.10584811
## 3334 0.02194874 0 1.3427866 10.4952197 0.21484585
## 2094 0.00000000 0 0.2005884 2.6343942 0.05349024
## perc.pakistani perc.ba perc.bc perc.wb age1 age2 age3
## 6521 0.0000000 1.2307692 0.35384615 89.54615 11.685839 21.01547 27.10829
## 2101 0.5807048 1.7949056 1.82130131 63.15164 13.848416 19.51611 25.77196
## 3321 30.8941139 4.2226488 1.44753679 17.99424 14.812549 23.80495 21.34056
## 4008 0.2778513 0.1984652 0.07938608 94.98545 11.738089 16.34283 28.81288
## 3334 4.0068751 1.0849715 0.40820711 61.38146 9.701576 20.73321 27.51760
## 2094 0.1337256 1.2035304 1.39074619 87.79085 13.759180 16.95996 27.24124
## age4 pm25 no2
## 6521 16.92979 6.608217 5.957610
## 2101 17.99225 8.066084 14.479405
## 3321 11.13344 8.308508 14.164620
## 4008 24.12829 6.813326 8.356319
## 3334 22.68917 7.822982 11.810846
## 2094 21.53249 7.715645 12.522200
You can find that the dataset contains 23 columns. Column
cases
is our outcome variable, which is the weekly reported
number of COVID-19 cases in each neighbourhood. Column date
indicates the start date of each observation week when the COVID-19
infections data for each neighbourhood were reported, while column
week
indicates the week number that the data were collected
from. Column MSOA11CD
represents the neighbourhood code.
Columns LONG
and LAT
indicate the longitude
and latitude for each neighbourhood. Column Population
indicates the population size. The remaining columns store the data for
our covariates, which are related to the socioeconomic, demographic and
environmental factors. In summary, in this dataset, we have a column for
the outcome variable (cases
), a column for the time index
(week
, an integer starting at 1), two columns for the
geographical coordinates (LONG
, LAT
), and a
few columns for the covariates of interest.
Now let’s look at the data structure in the Hydrology
tutorial. The dataset used here is called inla_data
,
which is obtained by using the following codes.
library(magrittr)
fdmr::retrieve_tutorial_data(dataset = "hydro")
##
## Tutorial data extracted to /home/runner/fdmr/tutorial_data/hydro
norway_polygon_path <- fdmr::get_tutorial_datapath(dataset = "hydro", filename = "Kvilldal_Catch_Boundary.geojson")
norway_polygon <- sf::read_sf(norway_polygon_path) %>% sf::st_zm()
sfc <- sf::st_transform(norway_polygon, crs = "+proj=longlat +datum=WGS84")
suldalsvatnet_dam <- list(longitude = 6.517174, latitude = 59.490720, label = "Suldalsvatnet Dam")
stream_gauge_13 <- list(longitude = 6.5395789, latitude = 59.5815849, label = "Stream Gauge 13")
stream_gauge_14 <- list(longitude = 6.7897968, latitude = 59.7531662, label = "Stream Gauge 14")
streamdata_13 <- fdmr::get_tutorial_datapath(dataset = "hydro", filename = "NVEobservations_s36_13.csv")
streamdata_14 <- fdmr::get_tutorial_datapath(dataset = "hydro", filename = "NVEobservations_s36_14.csv")
data_13 <- read.csv(streamdata_13)
data_13$date <- as.Date(data_13$time)
data_13 <- subset(data_13, date >= "2021-10-01" & date <= "2021-10-31")
row.names(data_13) <- NULL
data_13$time_index <- seq(1, 31, 1)
data_14 <- read.csv(streamdata_14)
data_14$date <- as.Date(data_14$time)
data_14 <- subset(data_14, date >= "2021-10-01" & date <= "2021-10-31")
row.names(data_14) <- NULL
data_14$time_index <- seq(1, 31, 1)
data_13$value <- scale(data_13$value)
data_14$value <- scale(data_14$value)
era5_location <- fdmr::get_tutorial_datapath(dataset = "hydro", filename = "era5_land_daily.nc")
era5_precip <- raster::stack(era5_location)
## Loading required namespace: ncdf4
sr <- "+proj=utm +zone=32"
projected_raster <- raster::projectRaster(era5_precip, crs = sr)
era5_precip_cropped <- raster::mask(projected_raster, norway_polygon)
era5_precip_cropped <- terra::crop(era5_precip_cropped, raster::extent(norway_polygon), snap = "near")
raster_df <- data.frame(raster::rasterToPoints(era5_precip_cropped))
raster_df <- data.table::melt(data.table::setDT(raster_df), id.vars = c("x", "y"), variable.name = "time")
raster_df$time <- lubridate::day(fdmr::to_dates(fdmr::numbers_only(raster_df$time)))
raster_df <- setNames(raster_df, c("x", "y", "time", "precip"))
pixel_coords <- unique(sp::coordinates(era5_precip_cropped))
s13_utm <- fdmr::latlong_to_utm(
lat = stream_gauge_13$latitude,
lon = stream_gauge_13$longitude
)
s14_utm <- fdmr::latlong_to_utm(
lat = stream_gauge_14$latitude,
lon = stream_gauge_14$longitude
)
s13 <- data.frame(x = s13_utm[[1]], y = s13_utm[[2]])
s14 <- data.frame(x = s14_utm[[1]], y = s14_utm[[2]])
pixel_dist_gauge <- data.frame(
x = pixel_coords[, 1],
y = pixel_coords[, 2],
dist_s1 = rep(NA, nrow(pixel_coords)),
dist_s2 = rep(NA, nrow(pixel_coords)),
min_s = rep(NA, nrow(pixel_coords)),
streamflow = rep(NA, nrow(pixel_coords))
)
for (i in seq_len(nrow(pixel_coords))) {
pixel_dist_gauge$dist_s1[i] <- sqrt((pixel_dist_gauge$x[i] - s13$x)^2 + (pixel_dist_gauge$y[i] - s13$y)^2)
pixel_dist_gauge$dist_s2[i] <- sqrt((pixel_dist_gauge$x[i] - s14$x)^2 + (pixel_dist_gauge$y[i] - s14$y)^2)
pixel_dist_gauge$min_s[i] <- which.min(c(pixel_dist_gauge$dist_s1[i], pixel_dist_gauge$dist_s2[i]))
}
n_time <- raster::nlayers(era5_precip)
pixel_dist_gauge <- do.call(rbind, replicate(n_time, pixel_dist_gauge, simplify = FALSE))
pixel_dist_gauge$time <- rep(1:n_time, each = nrow(pixel_coords))
streamdata <- list(data_13, data_14)
get_stream_data <- function(row) {
which_stream_data <- streamdata[[row["min_s"]]]
data_at_row_time_index <- which_stream_data[which_stream_data[, c("time_index")] == row["time"], ]
return(data_at_row_time_index$value)
}
pixel_dist_gauge$streamflow <- apply(pixel_dist_gauge, 1, get_stream_data)
inla_data <- merge(pixel_dist_gauge, raster_df, by = c("time", "x", "y"))
inla_data <- subset(inla_data, select = c("x", "y", "streamflow", "precip", "time"))
The type of the object inla_data
is a
data.frame
.
class(inla_data)
## [1] "data.frame"
The first 6 rows of the data set are
utils::head(inla_data)
## x y streamflow precip time
## 1 331417.8 6583370 -0.7379358 0.007305657 1
## 2 342677.8 6594470 -0.7379358 0.006965242 1
## 3 348307.8 6594470 -0.7379358 0.006590370 1
## 4 353937.8 6594470 -0.7379358 0.006557765 1
## 5 353937.8 6605570 -0.7379358 0.007222239 1
## 6 359567.8 6594470 -0.7379358 0.006858168 1
The dataset inla_data
contains 5 columns. Column
streamflow
is the outcome variable, column
time
is the time point index for each data observation.
Columns x
and y
represent the geographical
coordinates for each observation. Column precip
is the
covariate of interest, which is the precipitation data at each location
and time point. Therefore, similar to the covid19_data
dataset, inla_data
has a column for the outcome variable
(streamflow
), a column for the time index
(time
, an integer starting at 1), two columns for the
geographical coordinates (x
, y
), and a column
for the covariate (precip
).
Finally, let’s look at the structure of the dataset used in the Priors
exploration tutorial. In this tutorial, we use the COVID-19 data in
Bristol, UK. We load the dataset covid19_data_bristol
and
the object type of it is a data.frame
.
fdmr::retrieve_tutorial_data(dataset = "priors")
##
## Tutorial data extracted to /home/runner/fdmr/tutorial_data/priors
covid19_data_bristol <- fdmr::load_tutorial_data(dataset = "priors", filename = "covid19_dataBris.rds")
class(covid19_data_bristol)
## [1] "data.frame"
The first 6 rows of the data set are
utils::head(covid19_data_bristol)
## UtlaName date week MSOA11CD cases Population prevalence
## 3 Bristol, City of 2021-12-25 1 E02003014 89 8292 0.010733237
## 4 Bristol, City of 2021-12-25 1 E02003015 99 6692 0.014793784
## 5 Bristol, City of 2021-12-25 1 E02003016 99 8169 0.012118986
## 6 Bristol, City of 2021-12-25 1 E02003017 74 6118 0.012095456
## 7 Bristol, City of 2021-12-25 1 E02003018 69 5870 0.011754685
## 9 Bristol, City of 2021-12-25 1 E02003020 55 6416 0.008572319
## LONG LAT
## 3 -2.66729 51.51507
## 4 -2.59093 51.49491
## 5 -2.57357 51.49471
## 6 -2.61209 51.48767
## 7 -2.65070 51.48954
## 9 -2.62725 51.49219
The dataset covid19_data_bristol
contains 9 columns.
Column cases
is the outcome variable, which is the weekly
reported number of COVID-19 cases in each neighbourhood in Bristol.
Column date
indicates the start date of each observation
week, while column week
indicates the week number that the
data were collected from. Column MSOA11CD
represents the
neighbourhood code. Columns LONG
and LAT
indicate the longitude and latitude for each neighbourhood. Column
Population
indicates the population size. Column
prevalence
indicates the infection rate in each
neighbourhood, computed as the ratio of cases
to
Population
. Likewise, covid19_data_bristol
has
a column for the outcome variable (cases
), a column for the
time index (week
, an integer starting at 1), two columns
for the geographical coordinates (LONG
, LAT
).
Note that there are no covariate columns because we don’t use any
covariates in this tutorial.
Now we can identify the similarities across the structure of the
datasets covid19_data
, inla_data
and
covid19_data_bristol
. Their object type is a
data.frame
, with each row representing a data observation.
The data frame should include a column for the outcome variable, a
column for the time point index, indicating when each observation is
collected, and two columns for the geographical coordinates of each
observation. If the model incorporates covariates, then the covariate
data should also be included in the same data frame, and each covariate
is stored in one column. Users can use any variable names for the
columns, as long as they ensure consistency with those used when
defining the model formula and fitting the BHM. In addition, depending
on the type of outcome data, e.g., Gaussian or Poisson, other variables
may be included in the data frame. For example, in the COVID-19
tutorials where the outcome data follow a Poisson distribution, a
separate column named Population
is added to account for
the exposure parameter in the Poisson model. However, in the Hydrology
tutorial where the outcome data are assumed to follow a Gaussian
distribution, there is no need for a column representing the exposure
parameter in the dataset. The following table provides a summary of the
expected data structure for running the BHM in the fdmr
package:
ID | LONG | LAT | Time | Response Variable | Covariate 1 | Covariate 2 | Covariate… |
---|---|---|---|---|---|---|---|
1 | … | … | … | … | … | … | … |
2 | … | … | … | … | … | … | … |
… | … | … | … | … | … | … | … |
With the expected data object and format, we now possess all the essential information required for the fitting the BHM. More details regarding the model fitting process can be found at our tutorials.