Expected data structure • fdmr

In this tutorial, we will show the expected data structure for running the Bayesian Hierarchical Model (BHM) in the fdmr package. First, let’s have a look at the data structures used in our COVID-19 tutorial, Hydrology tutorial and Priors exploration tutorial.

In the COVID-19 tutorial, we aim to fit a Bayesian spatio-temporal model to predict the COVID-19 infection rates across mainland England over space and time, and investigate the impacts of socioeconomic, demographic and environmental factors on COVID-19 infection. We load the dataset covid19_data and the type of this object is a data.frame.

fdmr::retrieve_tutorial_data(dataset = "covid")

## 
## Tutorial data extracted to  /tmp/RtmpRTvfDc/fdmr/tutorial_data/covid

covid19_data <- fdmr::load_tutorial_data(dataset = "covid", filename = "covid19_data.rds")
class(covid19_data)

## [1] "data.frame"

The first 6 rows of the data set can be viewed using the utils::head() function.

utils::head(covid19_data)

##       MSOA11CD       date week     LONG      LAT cases Population   IMD
## 6521 E02006663 2020-03-07    1 -1.77072 51.17063     3      12605 12.94
## 2101 E02002160 2020-03-14    2 -2.07682 52.59541     4       7481 24.79
## 3321 E02003411 2020-03-14    2 -0.57737 51.52327     8      11443 25.13
## 4008 E02004108 2020-03-14    2 -1.43761 53.29906     4       7514 17.33
## 3334 E02003424 2020-03-14    2 -0.71324 51.51317     4       8947  4.58
## 2094 E02002153 2020-03-14    2 -2.05805 52.61565     3       7217 31.88
##      carebeds.ratio AandETRUE perc.chinese perc.indian perc.bangladeshi
## 6521     0.01630534         0    0.2153846   0.5384615       0.02307692
## 2101     0.01440074         0    0.4223307  18.2922001       0.11878052
## 3321     0.01265956         1    0.2399232  16.6826615       0.67178503
## 4008     0.02191863         0    0.1455411   0.2910823       0.10584811
## 3334     0.02194874         0    1.3427866  10.4952197       0.21484585
## 2094     0.00000000         0    0.2005884   2.6343942       0.05349024
##      perc.pakistani   perc.ba    perc.bc  perc.wb      age1     age2     age3
## 6521      0.0000000 1.2307692 0.35384615 89.54615 11.685839 21.01547 27.10829
## 2101      0.5807048 1.7949056 1.82130131 63.15164 13.848416 19.51611 25.77196
## 3321     30.8941139 4.2226488 1.44753679 17.99424 14.812549 23.80495 21.34056
## 4008      0.2778513 0.1984652 0.07938608 94.98545 11.738089 16.34283 28.81288
## 3334      4.0068751 1.0849715 0.40820711 61.38146  9.701576 20.73321 27.51760
## 2094      0.1337256 1.2035304 1.39074619 87.79085 13.759180 16.95996 27.24124
##          age4     pm25       no2
## 6521 16.92979 6.608217  5.957610
## 2101 17.99225 8.066084 14.479405
## 3321 11.13344 8.308508 14.164620
## 4008 24.12829 6.813326  8.356319
## 3334 22.68917 7.822982 11.810846
## 2094 21.53249 7.715645 12.522200

You can find that the dataset contains 23 columns. Column cases is our outcome variable, which is the weekly reported number of COVID-19 cases in each neighbourhood. Column date indicates the start date of each observation week when the COVID-19 infections data for each neighbourhood were reported, while column week indicates the week number that the data were collected from. Column MSOA11CD represents the neighbourhood code. Columns LONG and LAT indicate the longitude and latitude for each neighbourhood. Column Population indicates the population size. The remaining columns store the data for our covariates, which are related to the socioeconomic, demographic and environmental factors. In summary, in this dataset, we have a column for the outcome variable (cases), a column for the time index (week, an integer starting at 1), two columns for the geographical coordinates (LONG, LAT), and a few columns for the covariates of interest.

Now let’s look at the data structure in the Hydrology tutorial. The dataset used here is called inla_data, which is obtained by using the following codes.

library(magrittr)

fdmr::retrieve_tutorial_data(dataset = "hydro")

## 
## Tutorial data extracted to  /tmp/RtmpRTvfDc/fdmr/tutorial_data/hydro

norway_polygon_path <- fdmr::get_tutorial_datapath(dataset = "hydro", filename = "Kvilldal_Catch_Boundary.geojson")
norway_polygon <- sf::read_sf(norway_polygon_path) %>% sf::st_zm()
sfc <- sf::st_transform(norway_polygon, crs = "+proj=longlat +datum=WGS84")

suldalsvatnet_dam <- list(longitude = 6.517174, latitude = 59.490720, label = "Suldalsvatnet Dam")
stream_gauge_13 <- list(longitude = 6.5395789, latitude = 59.5815849, label = "Stream Gauge 13")
stream_gauge_14 <- list(longitude = 6.7897968, latitude = 59.7531662, label = "Stream Gauge 14")

streamdata_13 <- fdmr::get_tutorial_datapath(dataset = "hydro", filename = "NVEobservations_s36_13.csv")
streamdata_14 <- fdmr::get_tutorial_datapath(dataset = "hydro", filename = "NVEobservations_s36_14.csv")

data_13 <- read.csv(streamdata_13)
data_13$date <- as.Date(data_13$time)
data_13 <- subset(data_13, date >= "2021-10-01" & date <= "2021-10-31")

row.names(data_13) <- NULL
data_13$time_index <- seq(1, 31, 1)

data_14 <- read.csv(streamdata_14)
data_14$date <- as.Date(data_14$time)
data_14 <- subset(data_14, date >= "2021-10-01" & date <= "2021-10-31")
row.names(data_14) <- NULL
data_14$time_index <- seq(1, 31, 1)

data_13$value <- scale(data_13$value)
data_14$value <- scale(data_14$value)

era5_location <- fdmr::get_tutorial_datapath(dataset = "hydro", filename = "era5_land_daily.nc")
era5_precip <- raster::stack(era5_location)

## Loading required namespace: ncdf4

sr <- "+proj=utm +zone=32"
projected_raster <- raster::projectRaster(era5_precip, crs = sr)

era5_precip_cropped <- raster::mask(projected_raster, norway_polygon)
era5_precip_cropped <- terra::crop(era5_precip_cropped, raster::extent(norway_polygon), snap = "near")

raster_df <- data.frame(raster::rasterToPoints(era5_precip_cropped))

raster_df <- data.table::melt(data.table::setDT(raster_df), id.vars = c("x", "y"), variable.name = "time")

raster_df$time <- lubridate::day(fdmr::to_dates(fdmr::numbers_only(raster_df$time)))

raster_df <- setNames(raster_df, c("x", "y", "time", "precip"))

pixel_coords <- unique(sp::coordinates(era5_precip_cropped))
s13_utm <- fdmr::latlong_to_utm(
  lat = stream_gauge_13$latitude,
  lon = stream_gauge_13$longitude
)
s14_utm <- fdmr::latlong_to_utm(
  lat = stream_gauge_14$latitude,
  lon = stream_gauge_14$longitude
)

s13 <- data.frame(x = s13_utm[[1]], y = s13_utm[[2]])
s14 <- data.frame(x = s14_utm[[1]], y = s14_utm[[2]])

pixel_dist_gauge <- data.frame(
  x = pixel_coords[, 1],
  y = pixel_coords[, 2],
  dist_s1 = rep(NA, nrow(pixel_coords)),
  dist_s2 = rep(NA, nrow(pixel_coords)),
  min_s = rep(NA, nrow(pixel_coords)),
  streamflow = rep(NA, nrow(pixel_coords))
)


for (i in seq_len(nrow(pixel_coords))) {
  pixel_dist_gauge$dist_s1[i] <- sqrt((pixel_dist_gauge$x[i] - s13$x)^2 + (pixel_dist_gauge$y[i] - s13$y)^2)

  pixel_dist_gauge$dist_s2[i] <- sqrt((pixel_dist_gauge$x[i] - s14$x)^2 + (pixel_dist_gauge$y[i] - s14$y)^2)

  pixel_dist_gauge$min_s[i] <- which.min(c(pixel_dist_gauge$dist_s1[i], pixel_dist_gauge$dist_s2[i]))
}

n_time <- raster::nlayers(era5_precip)

pixel_dist_gauge <- do.call(rbind, replicate(n_time, pixel_dist_gauge, simplify = FALSE))
pixel_dist_gauge$time <- rep(1:n_time, each = nrow(pixel_coords))

streamdata <- list(data_13, data_14)

get_stream_data <- function(row) {
  which_stream_data <- streamdata[[row["min_s"]]]
  data_at_row_time_index <- which_stream_data[which_stream_data[, c("time_index")] == row["time"], ]
  return(data_at_row_time_index$value)
}

pixel_dist_gauge$streamflow <- apply(pixel_dist_gauge, 1, get_stream_data)

inla_data <- merge(pixel_dist_gauge, raster_df, by = c("time", "x", "y"))
inla_data <- subset(inla_data, select = c("x", "y", "streamflow", "precip", "time"))

The type of the object inla_data is a data.frame.

class(inla_data)

## [1] "data.frame"

The first 6 rows of the data set are

utils::head(inla_data)

##          x       y streamflow      precip time
## 1 331417.8 6583370 -0.7379358 0.007305657    1
## 2 342677.8 6594470 -0.7379358 0.006965242    1
## 3 348307.8 6594470 -0.7379358 0.006590370    1
## 4 353937.8 6594470 -0.7379358 0.006557765    1
## 5 353937.8 6605570 -0.7379358 0.007222239    1
## 6 359567.8 6594470 -0.7379358 0.006858168    1

The dataset inla_data contains 5 columns. Column streamflow is the outcome variable, column time is the time point index for each data observation. Columns x and y represent the geographical coordinates for each observation. Column precip is the covariate of interest, which is the precipitation data at each location and time point. Therefore, similar to the covid19_data dataset, inla_data has a column for the outcome variable (streamflow), a column for the time index (time, an integer starting at 1), two columns for the geographical coordinates (x, y), and a column for the covariate (precip).

Finally, let’s look at the structure of the dataset used in the Priors exploration tutorial. In this tutorial, we use the COVID-19 data in Bristol, UK. We load the dataset covid19_data_bristol and the object type of it is a data.frame.

fdmr::retrieve_tutorial_data(dataset = "priors")

## 
## Tutorial data extracted to  /tmp/RtmpRTvfDc/fdmr/tutorial_data/priors

covid19_data_bristol <- fdmr::load_tutorial_data(dataset = "priors", filename = "covid19_dataBris.rds")
class(covid19_data_bristol)

## [1] "data.frame"

The first 6 rows of the data set are

utils::head(covid19_data_bristol)

##           UtlaName       date week  MSOA11CD cases Population  prevalence
## 3 Bristol, City of 2021-12-25    1 E02003014    89       8292 0.010733237
## 4 Bristol, City of 2021-12-25    1 E02003015    99       6692 0.014793784
## 5 Bristol, City of 2021-12-25    1 E02003016    99       8169 0.012118986
## 6 Bristol, City of 2021-12-25    1 E02003017    74       6118 0.012095456
## 7 Bristol, City of 2021-12-25    1 E02003018    69       5870 0.011754685
## 9 Bristol, City of 2021-12-25    1 E02003020    55       6416 0.008572319
##       LONG      LAT
## 3 -2.66729 51.51507
## 4 -2.59093 51.49491
## 5 -2.57357 51.49471
## 6 -2.61209 51.48767
## 7 -2.65070 51.48954
## 9 -2.62725 51.49219

The dataset covid19_data_bristol contains 9 columns. Column cases is the outcome variable, which is the weekly reported number of COVID-19 cases in each neighbourhood in Bristol. Column date indicates the start date of each observation week, while column week indicates the week number that the data were collected from. Column MSOA11CD represents the neighbourhood code. Columns LONG and LAT indicate the longitude and latitude for each neighbourhood. Column Population indicates the population size. Column prevalence indicates the infection rate in each neighbourhood, computed as the ratio of cases to Population. Likewise, covid19_data_bristol has a column for the outcome variable (cases), a column for the time index (week, an integer starting at 1), two columns for the geographical coordinates (LONG, LAT). Note that there are no covariate columns because we don’t use any covariates in this tutorial.

Now we can identify the similarities across the structure of the datasets covid19_data, inla_data and covid19_data_bristol. Their object type is a data.frame, with each row representing a data observation. The data frame should include a column for the outcome variable, a column for the time point index, indicating when each observation is collected, and two columns for the geographical coordinates of each observation. If the model incorporates covariates, then the covariate data should also be included in the same data frame, and each covariate is stored in one column. Users can use any variable names for the columns, as long as they ensure consistency with those used when defining the model formula and fitting the BHM. In addition, depending on the type of outcome data, e.g., Gaussian or Poisson, other variables may be included in the data frame. For example, in the COVID-19 tutorials where the outcome data follow a Poisson distribution, a separate column named Population is added to account for the exposure parameter in the Poisson model. However, in the Hydrology tutorial where the outcome data are assumed to follow a Gaussian distribution, there is no need for a column representing the exposure parameter in the dataset. The following table provides a summary of the expected data structure for running the BHM in the fdmr package:

ID	LONG	LAT	Time	Response Variable	Covariate 1	Covariate 2	Covariate…
1	…	…	…	…	…	…	…
2	…	…	…	…	…	…	…
…	…	…	…	…	…	…	…

With the expected data object and format, we now possess all the essential information required for the fitting the BHM. More details regarding the model fitting process can be found at our tutorials.

ID	LONG	LAT	Time	Response Variable	Covariate 1	Covariate 2	Covariate…
1	…	…	…	…	…	…	…
2	…	…	…	…	…	…	…
…	…	…	…	…	…	…	…

ID	LONG	LAT	Time	Response Variable	Covariate 1	Covariate 2	Covariate…
1	…	…	…	…	…	…	…
2	…	…	…	…	…	…	…
…	…	…	…	…	…	…	…

ID	LONG	LAT	Time	Response Variable	Covariate 1	Covariate 2	Covariate…
1	…	…	…	…	…	…	…
2	…	…	…	…	…	…	…
…	…	…	…	…	…	…	…