---
title: "Introduction to predictsr"
output: html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to predictsr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
cache: true
---

The [`predictsr`](https://github.com/Biodiversity-Futures-Lab/predictsr) package accesses the PREDICTS database (Hudson et al, 2013) from within R, conveniently as a dataframe. It uses the [Natural History Museum Data Portal](https://data.nhm.ac.uk/) to download the latest versions of the PREDICTS database and some related metadata. 

The [PREDICTS](https://www.nhm.ac.uk/our-science/research/projects/predicts/science.html) database comprises over 4 million measurements of species at sites across the world. All major terrestrial plant, animal, and fungal groups are represented within the data. The most well-represented group in the data are arthropods (mainly insects), covering about 2% of the species known by science

There have been 2 public releases of PREDICTS data. The first was in [2016](https://data.nhm.ac.uk/dataset/the-2016-release-of-the-predicts-database-v1-1) and the second was in [2022](https://data.nhm.ac.uk/dataset/release-of-data-added-to-the-predicts-database-november-2022); each consisted of about 3 million and 1 million records, respectively. We include both, as a single dataset, in this package.

*However*, the PREDICTS database is under constant development and is continuously growing, with later releases planned.

## Loading the data

To get started, let's load in the database into R and poke around. To do so you will use the `LoadPredictsData` function, which pulls in the data from the data portal and saves it locally for you. It reads in both releases into a single dataframe (by default), or you can specify which years to save:

```{r setup}
file_predicts <- file.path(tempdir(), "predicts.rds")
predicts <- predictsr::LoadPredictsData(file_predicts, extract = c(2016, 2022))
str(predicts)
```

Be patient as this may take a few seconds to run, as it downloads the data from the NHM Data Portal.

The first time you run this, it will save the data into `file_predicts`, and store some metadata in a `*.aux.json` file. Subsequent calls will then check the metadata (via a SHA-based invalidation), and if this looks OK, just load the data in `file_predicts`. This saves you having to repeatedly download the data.

We can see that there are over 4 million records in the combined PREDICTS extracts. Let's look at a set of summary statistics for the database:

```{r}
if (nrow(predicts) > 0) {
  taxa <- predicts[
    !duplicated(predicts[, c("Source_ID", "Study_name", "Taxon_name_entered")]),
  ]
  species_counts <- length(
    unique(taxa[taxa$Rank %in% c("Species", "Infraspecies"), "Taxon"])
  ) +  nrow(taxa[!taxa$Rank %in% c("Species", "Infraspecies"), ])
  
  print(glue::glue(
    "This database has {length(unique(predicts$SS))} studies across ",
    "{length(unique(predicts$SSBS))} sites, in ",
    "{length(unique(predicts$Country))} countries, and with ",
    "{species_counts} species."
  ))
} else {
  print("No data available - check the download!")
}
```

There are over 30,000 sites, 101 countries, and 54,863 species in this dataframe, with a couple of important columns that should be noted:

- The `SS` column is what we use to identify studies when dealing with PREDICTS data. This is the concantenation of the `Source_ID`, and the `Study_number` columns.
- The `SBBS` column is the concatenation of `Source_ID`, `Study_number`, `Block` and `Site_number`, which is what we use to identify single sites in the database.

Let's also check the ranges of sample collection in the database:

```{r}
if (nrow(predicts) > 0) {
  print(glue::glue(
    "Earliest sample collection (midpoint): {min(predicts$Sample_midpoint)}, ",
    "latest sample collection (midpoint): {max(predicts$Sample_midpoint)}"
  ))
} else {
  print("No data available in the PREDICTS database - check the download!")
}
```

## Loading the data without caching

If you don't want to cache the data, you can also just use `GetPredictsData`:

```{r}
predicts <- predictsr::GetPredictsData(extract = c(2016, 2022))
```

This will pull in the exact same data as `LoadPredictsData`, but won't save it to disk for you. If you are already using some sort of data pipeline tool (e.g. [`targets`](https://cran.r-project.org/package=targets)) then this may be your preferred option.

```{r, echo=FALSE}
# Delete the database locally
unlink(c(file_predicts, paste0(file_predicts, ".aux.json")))
```

## Accessing site-level summaries

We also include access to the site-level summaries from the full release; to get *these data* you will need to use the `GetSitelevelSummaries` function. The function call is very similar to pull in the summaries for the same data as above:

```{r}
summaries <- predictsr::GetSitelevelSummaries(extract = c(2016, 2022))
str(summaries)
```

Investigating the summary data closer we see that there are a number of missing columns between the two dataframes:

```{r}
if (nrow(predicts) > 0 && nrow(summaries) > 0) {
  print(names(predicts)[!(names(predicts) %in% names(summaries))])
}
```

This is because all of the measurement-level data has been dropped from the dataframe. Now indeed we could try and replicate the creation of the `summaries` dataframe through some `dplyr` operations. These would be the following (roughly):

**Note:** there seems to be some differences as there are a couple sites that don't match up.

```{r}
if (nrow(summaries) > 0) {
  summaries_rep <- predicts |>
    dplyr::mutate(
      Higher_taxa = paste(sort(unique(Higher_taxon)), collapse = ","),
      N_samples = length(Measurement),
      Rescaled_sampling_effort = mean(Rescaled_sampling_effort),
      .by = SSBS
    ) |>
    dplyr::select(
      dplyr::all_of(names(summaries))
    ) |>
    dplyr::distinct() |>
    dplyr::arrange(SSBS)
  
  summaries_copy <- summaries |>
    subset(SSBS %in% summaries_rep$SSBS) |>
    dplyr::arrange(SSBS)
  
  all.equal(summaries_copy, summaries_rep)
}
```

## Accessing descriptions of the columns in PREDICTS

As we've seen already, there are 67 (!) columns in the PREDICTS database, and within the NHM Data portal releases, we have included a description of the data that is used in each of these columns. You can access this via the `GetColumnDescriptions` function:

```{r}
descriptions <- predictsr::GetColumnDescriptions()
str(descriptions)
```

So this includes the `Column` name, the resolution of the `Column` (`Applies_to`), whether it is in the PREDICTS extract (`Diversity_extract`), whether it is in the site-level summaries (`Site_extract`), the datatype of the `Column`*, whether it is guaranteed to be nonempty, any additional `Notes`, and any information on the range of values that the `Column` may be expected to take (`Validation`).

## Glossary

To clarify some of the PREDICTS jargon, we include the following table of definitions, from the dataframes we have worked with thus far:

```{r, echo = FALSE, results = 'asis'}
if (nrow(descriptions) > 0) {
  descriptions_sub <-  subset(
    descriptions,
    Column %in% c(
      "Source_ID", "SS", "Block", "SSBS", "Diversity_metric_type", "Measurement"
    )
  )
  
  for (i in seq_along(descriptions_sub$Column)) {
    cat(
      paste0(
        "* `", 
        descriptions_sub$Column[i], 
        "` (`", 
        descriptions_sub$Type[i], 
        "`): "
      )
    )
    notes <- descriptions_sub$Notes[i] |>
      (\(s) gsub("\\*", "  -", s))() |>
      (\(s) gsub("\n\n", "  \n", s))() |>
      (\(s) gsub("\n", "  \n", s))()
  
    if (notes == " ") {
      notes <- "As title."
    }
  
    cat(notes)
    cat("  \n")
  }
}
```

**Notes:** When we refer to a "study" we typically refer to it from the `SS` that identifies it. Referring to an "extract" simply refers to a PREDICTS database release, either from 2016 or 2022 (or both). For further complete documentation see the SI in Hudson et. al. (2017).

## Notes

The [NHM Data Portal API](https://data.nhm.ac.uk/about/download) has no rate limits so be considerate with your requests. Make sure you save the data somewhere, or use a tool like [`targets`](https://docs.ropensci.org/targets/index.html) to save you from re-running workflows.

## References

Hudson, Lawrence N., et al. "The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts." Ecology and evolution 4.24 (2014): 4701-4735. <doi.org/10.1002/ece3.1303>.

Hudson, Lawrence N., et al. "The database of the PREDICTS (projecting responses of ecological diversity in changing terrestrial systems) project." Ecology and evolution 7.1 (2017): 145-188 <doi.org/10.1002/ece3.2579>