---
title: "Getting started"
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

This package serves two overarching purposes:

1.  To provide an open-source, code-based algorithm to classify type 1
    and type 2 diabetes using Danish registers as data sources.
2.  To inspire discussions within the Danish register-based research
    space on the openness and ease of use of the existing tooling and
    registers, and on the need for an official process for updating or
    contributing to existing data sources.

To read up on the overall design of this package and on the algorithm,
see `vignette("design")`. For the motivations, rationale, and needs
behind this algorithm and package, see `vignette("rationale")`. For the
specific data needed for this package and algorithm, see
`vignette("data-sources")`.

## Usage

First, let's load the package. We also use
[duckplyr](https://duckplyr.tidyverse.org/index.html), called below with
the `duckplyr::` prefix, since we require the data to be in the
[DuckDB](https://duckdb.org/) format. See `vignette("design")` for some
reasons why.

```{r setup}
#| message: false
library(osdc)
```

The core of this package depends on the list of variables within
different registers that are needed to classify the diabetes status of
an individual. This list is returned by `registers()`:

```{r}
# Only showing first 2
registers() |> 
  head(2)
```

We can see the list of registers we need with:

```{r}
registers() |> 
  names()
```

Let's create a fake dataset to show how to use the classification. We
have a helper function `simulate_registers()` that takes a vector of
register names and outputs a list of registers with simulated data.
Because of the way that DuckDB connections work, we have to either load
the data directly from a file as a DuckDB table, or convert a tibble
into a DuckDB table. So we'll do that right after simulating the data.

```{r}
register_data <- registers() |> 
  names() |> 
  simulate_registers() |> 
  purrr::map(duckplyr::as_duckdb_tibble) |> 
  # Convert to a lazy table on a DBI-DuckDB connection, as duckplyr
  # is still in early development, while the DBI-DuckDB connection
  # is more stable.
  purrr::map(duckplyr::as_tbl) 

# Show only the first two items.
register_data |> 
  head(2)
```

Now we can run `classify_diabetes()` on the simulated data. Because we
use DuckDB, the result is not read into R automatically. To
"materialize" the data into R, you need to use `dplyr::collect()`.

```{r}
classified_diabetes <- classify_diabetes(
  kontakter = register_data$kontakter,
  diagnoser = register_data$diagnoser,
  lpr_diag = register_data$lpr_diag,
  lpr_adm = register_data$lpr_adm,
  sysi = register_data$sysi,
  sssy = register_data$sssy,
  lab_forsker = register_data$lab_forsker,
  bef = register_data$bef,
  lmdb = register_data$lmdb
) |> 
  dplyr::collect()

classified_diabetes
```

By chance, `r nrow(classified_diabetes)` simulated individuals get a
diabetes classification. That any are classified at all is mainly
because we've created the simulated data to over-represent the values,
in the variables included in the algorithm, that lead to a diabetes
classification.
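
To illustrate what `dplyr::collect()` does with a DuckDB-backed table,
here is a minimal sketch that is independent of this package. The toy
table and its column names are made up for illustration, and the sketch
assumes the `DBI`, `duckdb`, `dplyr`, and `dbplyr` packages are
installed.

``` r
library(dplyr)

# A throwaway in-memory DuckDB database with a made-up toy table.
con <- DBI::dbConnect(duckdb::duckdb())
toy <- copy_to(
  con,
  data.frame(id = c("a", "b", "c"), value = c(1, 2, 3)),
  name = "toy"
)

# Still lazy: this only builds a query, no rows are pulled into R yet.
lazy_result <- toy |>
  filter(value > 1)
inherits(lazy_result, "tbl_lazy")

# collect() runs the query in DuckDB and returns an ordinary tibble.
in_memory <- collect(lazy_result)
inherits(in_memory, "tbl_lazy")

DBI::dbDisconnect(con, shutdown = TRUE)
```

Until `collect()` is called, the work stays inside DuckDB, which is what
makes this approach workable for data that does not fit in memory.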

In a real scenario, the register data is probably too big to read into
memory before being converted into a `duckdb_tibble`. Therefore, we
recommend that users first convert the individual register files into
Parquet format on disk, with each register source in its own folder
(e.g. all files from `kontakter` in one folder, `diagnoser` in another,
`lpr_diag` in a third, etc.). With the `arrow` package, each register
data source can then be read in as a single DuckDB table by pointing the
following code snippet to each of the Parquet folders. For example, to
load `diagnoser`:

``` r
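# `diagnoser_parquet_folder` is the path to the `diagnoser` Parquet folder.
diagnoser <- diagnoser_parquet_folder |>
  # Open all files in the folder lazily as a single dataset.
  arrow::open_dataset(unify_schemas = TRUE) |>
  # Register the dataset as a DuckDB table without loading it into memory.
  arrow::to_duckdb()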
```
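
The conversion to Parquet recommended above could be sketched as
follows. This is only an illustration under stated assumptions: it
assumes the raw register files are CSV files (your files may well be in
another format), and the folder paths and the tiny example file are made
up so that the snippet runs on its own.

``` r
library(arrow)

# Hypothetical paths: a folder of raw CSV files for one register and
# the folder where its Parquet files should be written.
raw_folder <- file.path(tempdir(), "raw", "diagnoser")
parquet_folder <- file.path(tempdir(), "parquet", "diagnoser")
dir.create(raw_folder, recursive = TRUE, showWarnings = FALSE)

# A tiny made-up file standing in for a real register file.
write.csv(
  data.frame(id = 1:3, value = c("a", "b", "c")),
  file.path(raw_folder, "diagnoser_2020.csv"),
  row.names = FALSE
)

# Open the CSV files lazily and write them out as Parquet, so the full
# data never has to be held in memory at once.
raw_folder |>
  open_dataset(format = "csv") |>
  write_dataset(path = parquet_folder, format = "parquet")

list.files(parquet_folder)
```

Repeating this once per register gives the one-folder-per-register
layout that the loading snippet expects.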

And that's all there is to this package! You can now save
`classified_diabetes` as a Parquet file so that you or your
collaborators on your DST (Statistics Denmark) project can use these
classifications.

``` r
classified_diabetes |>
  duckplyr::as_duckdb_tibble() |>
  duckplyr::compute_parquet(
    "classified_diabetes.parquet"
  )
```
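
A collaborator could then load the saved file with, for example,
`arrow::read_parquet()`. The sketch below writes a small stand-in file
first so that it runs on its own. The columns in the stand-in data are
hypothetical and are not the actual output of `classify_diabetes()`.

``` r
library(arrow)

# Made-up stand-in for the classified data; the column names here are
# hypothetical and not the actual output of `classify_diabetes()`.
toy_classified <- data.frame(
  id = c("001", "002"),
  diabetes_type = c("T1D", "T2D")
)
path <- file.path(tempdir(), "classified_diabetes.parquet")
write_parquet(toy_classified, path)

# What a collaborator would run to load the saved classifications.
read_back <- read_parquet(path)
read_back
```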
