---
title: "nncc: nearest-neighbors matching for case-control data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{nncc: nearest-neighbors matching for case-control data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=6, 
  fig.height=4
)
```

```{r, warning=FALSE, message=FALSE}
library(nncc)
library(survival)
library(dplyr)
library(ggplot2)
data(anifood)
```

## 1. Prepare the data

Case-control studies often collect many exposures but have relatively small 
sample sizes in comparison to the number of exposures. Conventional analysis
techniques like multivariate logistic regression can only accommodate a limited 
number of exposures. The `nncc` package was designed to help address this 
limitation. `nncc` matches a case to its nearest available neighbors based 
on a user-defined measure of distance (by default Gower distance) calculated 
using all collected confounders for a given exposure. In the end, `nncc` creates 
a matched data set for each exposure of interest. 

Ideally, one prefers to follow the analysis plan pre-defined at study design.
Decisions to change your analysis plan to this approach should be considered
carefully in light of how they may affect bias and uncertainty in model estimates. 
In addition, constructing variables are a routine and sometimes necessary part of 
multifactor studies. However, constructing new outcome variables from primary 
variables will increase dependencies and the difficulty of interpretation, 
especially where there is missingness of data. 

For illustration, we created a toy example case-control data set called `anifood` which only contains 
`r table(anifood$case)["1"]` cases, `r table(anifood$case)["0"]` controls, and `r NCOL(anifood)` variables including the 
one indicating case status, `case`, which equals 0 for controls, 1 for cases.

```{r}
dim(anifood)
```

In addition to the data set, `nncc` needs to know the exposures of interest, 
the variables used for matching, and a data set that contains 
variables to be excluded from matching for each exposures (e.g., possible 
intermediates in the pathway between an exposure and case status). 

```{r}
# exposures of interest. In a real study, this list can be much longer
exp_interest <- c("exp01","exp09", "exp27")

# exposures to be controlled for any exposure of interest
exp_match <- setdiff(names(anifood), "case")

# variables to be excluded from matching for each exposure
# both exp_var and rm_vars are character variables
excl_vars %>% head
```

## 2. Establish a threshold for distance

The first step is to establish a proper distance threshold for matching. 
If too loose, there will be poor confounding control; if too tight, it will
be difficult to find enough controls for the cases due to overmatching.

The `get_threshold` function uses the maximum bipartite graph algorithm to find
each case's closest control while ensuring each control is used no more than 
once. Then, for each case, a control is selected randomly. 

A logistic regression on whether the control is the closest or the randomly
selected one based on the distance is created. By default, the threshold used
is the distance at which the probability is at least 50% that the control is
the closest vs. a randomly selected one (see the plot below generated by the 
function `threshold_model_plot`). You can choose a different probability with 
the `p_threshold` argument. 

```{r}
threshold_results <- get_threshold(anifood, exp_match, p_threshold = 0.50)
```

The `distance_density_plot` function plots the density distribution of Gower distances 
for the maximum bipartite graph algorithm matches (solid line), the density
distribution of distances for the random matches (dashed line), and the threshold (red dashed line).
As expected, a maximum bipartite graph algorithm matching results in closer 
matching than a random matching. 

```{r, warning=FALSE}
distance_density_plot(threshold_results) + ggtitle("Example of distance_density_plot")
```

The function `threshold_model_plot` plots the probability of being the maximum 
bipartite graph algorithm match vs. a random one by distance. It also plots the 
threshold for distance corresponding to a probability specified through 
the `p_threshold` argument.

```{r}
threshold_model_plot(threshold_results, p_threshold = 0.50) + ggtitle("Example of threshold_model_plot")
```

This package can also be used to analyze data from a case-control study that
was originally matched. The `original_compare_plot` function makes a density plot of 
the distance for originally matched pairs. It also gives the proportion 
of originally matched pairs that have a distance greater than the threshold.

```{r}
# create a variable (i.e., pair) indicating originally matched pairs
anifood_matched <- anifood %>% group_by(case) %>% mutate(pair = seq_along(case)) %>% ungroup
```

In this example, `r p <- original_compare_plot(anifood_matched, case, pair, threshold_results); sprintf("%.0f%%", p$prop_distance_gt_threshold[2] * 100)` of original matched case-control pairs have a distance 
greater than the threshold.

```{r}
p <- original_compare_plot(anifood_matched, case, pair, threshold_results)


# the density plot of distance between originally matched cases and controls
p$plot_density + ggtitle("Example of original_compare_plot")

# proportion of originally matched cases and controls with a distance greater than the threshold
p$prop_distance_gt_threshold
```

## 3. Create strata: candidate matches for final analysis

The `make_knn_strata` function calculates Gower distances, by default, for pairs 
formed by every case and every control for a given exposure and creates strata
by selecting the closest 250 controls for each case (by default, can be modified 
through the `ncntls` argument). Thus, for that exposure, the number of rows in 
data frame equals the number of cases * 251. The data frames for all exposures 
are stored in a list with a length equal to `length(exp_interest)`. 

This step can be computationally intensive depending on the amount of data. 
`furrr::future_map()` can be used so that the computation can be easily adjusted
to use multiple cores on your local computer or a high-performance computing 
(HPC). More information is provided [below][9. Run the analysis on your local computer or HPC].

```{r, warning=FALSE, message=FALSE}
library(furrr)
strata1 <- future_map(exp_interest, make_knn_strata,  rmvars = excl_vars, matchvars = exp_match, df = anifood) %>%
  setNames(exp_interest)
```

```{r, warning=FALSE}
length(strata1) == length(exp_interest)

# rows in a matched data set
all.equal(anifood %>% filter(case == 1) %>% NROW %>% `*`(250 + 1), NROW(strata1[[1]]))
```

In the `make_knn_strata` function, a data frame should be supplied to the `rmvars` 
argument (see `excl_vars` above). That data frame should include two variables:
*exp_var: a character variable (only a single exposure in a row), and
*rm_vars: a character variable that contains the names of the variables that are 
not adjusted for. Names are separated by a space.


## 4. Prepare final matched sets for analysis

For a given exposure, the `make_analysis_set` function selects the closest 
20 controls for each case; subset these 20 to those that fall within the threshold; 
collapse strata that share the same controls; and remove strata without any control.

```{r, warning = FALSE}
strata2 <- future_map(exp_interest, make_analysis_set, stratified_data = strata1, data = anifood, maxdist = threshold_results$threshold) %>% setNames(exp_interest)
```

The `finalize_data` function ensures that a control retained in a data frame is 
used once; removes strata without any case or any control. In this process, 
priority is given to the smallest strata then smallest distance if a control 
is matched to multiple cases (i.e., that control exists in multiple strata). 
```{r, message=FALSE}
strata3 <- finalize_data(strata2)
```

The last step is to exclude exposures, if any, with odds ratios that are not 
estimable (e.g., none of cases and controls was exposed). 
For example, `exp09` in this example.

```{r}
# exposures to which neither cases nor controls were exposed
strata3 %>%
  lapply(function(dfm) dfm %>%
                            mutate(case = as.character(case), exp = as.character(exp)) %>%
                            filter(case != "" & exp != "") %>% 
                            with(table(case, exp))) %>%
  lapply(function(x) x==0) %>%
  lapply(function(x) {sum = sum(x); length = length(x); cbind(sum, length)}) %>%
  do.call(rbind,.) %>%
  as.data.frame %>%
  mutate(var = names(strata3)) %>%
  filter(sum >= 2 | length < 4) %>% select(var) %>%
  unclass %>%
  unlist -> expvars_invalid

expvars_invalid

# none exposed and odds ratio cannot be estimated
strata3[["exp09"]] %>% with(table(case, exp))

data_final <- strata3[setdiff(names(strata3), expvars_invalid)]
```


## 5. Calculate odds ratios

### 5.1 Mantel–Haenszel test and Fisher's exact test
The `test_mh` function performs the Mantel–Haenszel test for data frame with 
more than one stratum and the Fisher's exact test for a data frame with only 
one stratum.

```{r}
or_mh <- future_map(data_final, function(dfm) with(dfm, test_mh(case = case, exp = exp, strata = strata)))

or_mh[["exp01"]]
```

However, when the number of exposed is small, this method may yield an
infinite result.

```{r}
data_final[["exp27"]] %>% select(case, exp) %>% table

or_mh[["exp27"]]
```

### 5.2 Conditional logistic regression 
Conditional logistic regression gives results similar to those from 
Mantel–Haenszel test, but may also give less than ideal
results when few participants are exposed.
```{r, warning=FALSE}
results_clogit <- future_map(data_final, function(dfm){clogit(case ~ exp + strata(strata) , data = dfm)}) 

results_clogit[["exp27"]] %>% summary() %>% `$`(conf.int)
```

When combined with multiple imputation, we recommend a Bayesian approach.

```{r, warning=FALSE, eval=FALSE}
library(rstanarm)

results_stan_clogit <- future_map(data_final, function(dfm) {
  dfm$strata <- factor(dfm$strata)
  stan_clogit(case ~ exp, 
              data = dfm, 
              strata = strata,
              # rstanarm suggests us specifying priors even if they are the 
              # defaults because they might change in the future
              prior = normal(0, 4),
              #default: prior = normal(autoscale = TRUE),
              prior_covariance = decov(),
              iter = 100,
              chains = 2,
              prior_PD = FALSE)}) # for speed only
```

### 5.3 Penalized logistic regression
When the number of participants being exposed is small, [Firth's bias-Reduced penalized-likelihood logistic regression](https://CRAN.R-project.org/package=logistf/index.html) can also be used (1). 

```{r, warning=FALSE}
library(logistf)
# regression coefficients
coef_logistf <- future_map(data_final, function(dfm){

  if(length(unique(dfm[["strata"]])) == 1) {
        x <- model.matrix(case ~ exp, data = dfm)
        plconf <- grep("exp", colnames(x))
        o <- logistf(case ~ exp, data = dfm, plconf = plconf) %>%
                 `[`(c("terms", "coefficients", "ci.lower", "ci.upper", "prob", "call", "loglik", "model"))
    } else {
        x <- model.matrix(case ~ exp + strata, data = dfm)
        plconf <- grep("exp", colnames(x))
        o <- logistf(case ~ exp + strata, data = dfm, plconf = plconf) %>%
                 `[`(c("terms", "coefficients", "ci.lower", "ci.upper", "prob", "call", "loglik", "model"))
    }
},
.progress = TRUE)


# odds ratios
or_logistf <- lapply(names(coef_logistf), function(var_name){
  coef_logistf[[var_name]] %>%
    `[`(c("terms", "coefficients", "ci.lower", "ci.upper", "prob")) %>%
    bind_cols %>%
    filter(grepl("exp", terms)) %>%
    mutate(variable = var_name, or = exp(coefficients), ci.lower = exp(ci.lower), ci.upper = exp(ci.upper)) %>%
  select(variable, terms, or, ci.lower, ci.upper, prob)}) %>%
  setNames(names(coef_logistf))

# odds ratio for exp27
or_logistf[["exp27"]] 
```


## 6. Calculate population attributable fraction (PAF)
PAF can be estimated using the method described by Bruzzi and colleagues (3). 
We implemented this method in `get_paf`.
```{r}

# prepare a data frame for calculating PAF
df_or <- bind_rows(or_logistf)

df_or
```


```{r, message=FALSE}
# point estimate of PAF
paf <- get_paf(df_or = df_or, which_or = or, exp_var = variable, exp_level = terms, df_matched = data_final)

paf

# lower confidence limit of PAF
paf_ci.lower <- get_paf(df_or = df_or, which_or = ci.lower, exp_var = variable, exp_level = terms, df_matched = data_final)
```


## 7. Multiple imputation of missingness
When the case-control data contains a large proportion of missingness, multiple 
imputation may be useful. 

* For regression coefficients generated by Mantel–Haenszel test or conditional 
logistic regression, Rubin's rule may be proper for  combining the results (4).

* Originally in our paper (2), we used Firth's bias-Reduced penalized-likelihood logistic regression, but 
the regression coefficients are not normally distributed. The results from 
imputed data sets should therefore be combined using penalized likelihood profiles (5). 
Penalized likelihood profiles are implemented in `logistf::CLIP.confint()`, but
the function was developed to combine results from imputed data sets that 
have the same structure. However, if we impute the original case-control data, 
and then apply nearest-neighbors matching to the imputed data, the final 
analytic data sets for a given exposure could have inconsistent structures 
because they could have different numbers of strata. 

Thus, we modified the `logistf::CLIP.confint()` to accommodate our method. 
The modified function is included in version 1.0.1 of the  `nncc` package along with a compatible, bundled version of the 
`logistf` package version 1.24, and the function is called 
`CLIP.confint.difflevel()`. Please cite the original paper (4) and/or the [logistf package](https://CRAN.R-project.org/package=logistf/index.html) if you use 
this modified function for publication.

* Now we recommend a Bayesian approach, based on drawing samples from the stacked posteriors of the model run on each imputed data set.


## 8. Helper function `cacheit()`
When the number of exposures of interest or the sample size of the original 
case-control study is large, it may be time-consuming for some analyses 
(e.g., strata creation). You may not want to have the same code evaluated again 
when you run your analysis later. The helper function `cacheit()` caches the results for you. Specifically, before your run `cacheit()`, create a folder named "cache" in the working directory of your project. When you run `cacheit("abc", code)` for the first time, it saves the results of your code as abc.rds in the `cache` folder; next time, 
when you run the code, it directly read the saved results in without 
evaluating your code.

```{r, eval = FALSE}
strata1 <- cacheit("abc",
                   future_map(exp_interest, make_knn_strata,  rmvars = excl_vars, matchvars = exp_match, df = anifood) %>%
  setNames(exp_interest),
  clearcache = FALSE)
```


## 9. Run the analysis on your local computer or HPC

Considering the intensive computation for this approach, 
especially when multiple imputation is conducted, a [future](https://CRAN.R-project.org/package=future/vignettes/future-1-overview.html) 
can be used to accelerate the analysis. Below we briefly introduce the use of futures on your local multi-core computer and HPC.

### 9.1 local computer
If you have a multi-core computer, the following code can assign a specified number of cores for the analysis.

```{r, eval = FALSE}
library(nncc)
library(dplyr)
library(furrr)
library(future.batchtools)
# the workers argument is used to define the number of cores for the analysis. By default, all cores will be used.
plan(multisession, workers = 3)

strata1 <- future_map(exp_interest, make_knn_strata,  rmvars = excl_vars, matchvars = exp_match, df = anifood) %>%
  setNames(exp_interest)

```


### 9.2 HPC

To run futures on HPC, you may need a template file, a job script, and a R script. The template file and 
the job script are created using a Linux application such as nano and saved as .sh files.

There are some exemplary template files [here](https://github.com/mllg/batchtools/tree/master/inst/templates). 
The template file should be saved as `batchtools.sge.tmpl` in your current 
working directory or as `.batchtools.sge.tmpl` in your home directory on the HPC. 
Below is an example of a template file for Sun Grid Engine.

```
#!/bin/bash

## name of the job
#$ -N <%= job.name %>

## tell the queue system to use the current directory as the working directory
## Or else the script may fail as it will execute in your top level home directory /home/username
#$ -cwd

## Use environment variables
#$ -V

## specify a queue
#$ -q <%= resources$queue %>

## free to add other modules if necessary
module load R/3.6.2

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0
```

Below is an example of a job script to be submitted to the scheduler through a shell:
```{r, eval=FALSE, comment = ""}
#!/bin/bash -l

# name of the job is helloR
#$ -N helloR

#$ -cwd

#$ -V

#$ -pe smp 2-12

#$ -q all.q

module load R/3.6.2

# to execute fugure-map.R
Rscript future-map.R

exit 0
```

The R code looks like:

```{r, eval=FALSE}
library(furrr)
library(future.batchtools)

plan(batchtools_sge)

future_map(<a_list_or_vector>, function(x){...}, .progress = TRUE)
```


### References:
1. Firth D. Bias reduction of maximum likelihood estimates. Biometrika 1993;80:27-38.
2. Bruzzi P, Green SB, Byar DP, et al. Estimating the population attributable risk for multiple risk factors using case-control data. Am J Epidemiol 1985;122(5):904-14.
3. Cui Z, Marder EP, Click ES, Hoekstra RM, Bruce BB. Nearest-neighbors matching for case-control study analyses: better risk factor identification from a study of sporadic campylobacteriosis in the United States. Epidemiology 2022;33(5):633-41.  
4. Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley and Sons; 1987.
5. Heinze G, Ploner M, Beyea J. Confidence intervals after multiple imputation: combining profile likelihood information from logistic regressions. Stat Med 2013;32(29):5062-76.