---
title: "Generating missing values"
output: rmarkdown::html_vignette
bibliography: bibliography.bibtex
vignette: >
  %\VignetteIndexEntry{Generating-missing-values}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7, 
  fig.height = 4, 
  fig.align = "center"
)
```

To generate missing values in a dataset with missMethods you can use one of the `delete_` functions. The names of these functions always starts with `delete_` and the next part of the name shows the used missing data mechanism. There are three basic types of missing data mechanisms: missing completely at random (`MCAR`), missing at random (`MAR`) and missing not at random (`MNAR`). A list of all available functions for the different mechanisms is given below:

**MCAR**

* `delete_MCAR()`

**MAR**

* `delete_MAR_1_to_x()`
* `delete_MAR_censoring()`
* `delete_MAR_one_group()`
* `delete_MAR_rank()`

**MNAR**

* `delete_MNAR_1_to_x()`
* `delete_MNAR_censoring()`
* `delete_MNAR_one_group()`
* `delete_MNAR_rank()`

All these functions share a common interface. The first argument `ds` takes the dataset in which missing values should be generated. The next argument `p` specifies the proportion of missing values to include in every column with missing value. These columns are specified with the third argument `cols_mis`. The further arguments depend on the chosen function and are documented for every function separately. In most cases, reasonable defaults are set for these further arguments. Only the `MAR` functions need one additional argument with no default: `cols_ctrl`. The argument `cols_ctrl` specifies the columns that control the generation of missing data in a MAR settings. 

One further remark: All `MAR` functions have a `MNAR` twin. These twins behave exactly the same way. The only difference is the columns that controls the generation of missing values. In the `MAR` functions separate  `cols_ctrl` columns controls the generation of missing values in the `cols_mis` columns. In contrast, in the `MNAR` functions the generation of missing values in the `cols_mis` columns is controlled by the `cols_mis` columns themselves.

In the following, different examples for the generation of missing values will be presented. Furthermore, connections between the functions from missMethods and the paper by @Santos.2019 will be shown.

## Examples for the generation of missing values

The examples below show the use of some `delete_` functions in a 2-dimensional dataset. Missing values are always generated in the variable "X" and 30 % of the values are deleted. At first, a basic set-up:

```{r setup}
library(missMethods)
library(ggplot2)

set.seed(123)

make_simple_MDplot <- function(ds_comp, ds_mis) {
  ds_comp$missX <- is.na(ds_mis$X)
  ggplot(ds_comp, aes(x = X, y = Y, col = missX)) +
    geom_point()
}

# generate complete data frame
ds_comp <- data.frame(X = rnorm(100), Y = rnorm(100))
```

### MCAR

Generate MCAR values:

```{r MCAR}
ds_mcar <- delete_MCAR(ds_comp, 0.3, "X")
make_simple_MDplot(ds_comp, ds_mcar)
```


### MAR

Generate MAR values using a censoring mechanism. This leads to a missing value in "X", if the y-value is below the 30 % quantile of "Y":

```{r MAR censoring}
ds_mar <- delete_MAR_censoring(ds_comp, 0.3, "X", cols_ctrl = "Y")
make_simple_MDplot(ds_comp, ds_mar)
```

The censoring mechanism is a rather strong form of MAR. A function that allows to control the strength of the MAR mechanism is `delete_MAR_1_to_x`. The strength is controlled through the argument `x`: the bigger `x`, the stronger the simulated `MAR` mechanism:

```{r MAR_1_to_2}
# x = 2
ds_mar <- delete_MAR_1_to_x(ds_comp, 0.3, "X", cols_ctrl = "Y", x = 2)
make_simple_MDplot(ds_comp, ds_mar)
```

```{r MAR_1_to_10}
# x = 10
ds_mar <- delete_MAR_1_to_x(ds_comp, 0.3, "X", cols_ctrl = "Y", x = 10)
make_simple_MDplot(ds_comp, ds_mar)
```

### MNAR

Generate MAR values using a censoring mechanism. This leads to a missing value in "X", if the x-value is below the 30 % quantile of "X":

```{r MNAR censoring}
ds_mnar <- delete_MNAR_censoring(ds_comp, 0.3, "X")
make_simple_MDplot(ds_comp, ds_mnar)
```

## Connections between missMethods and @Santos.2019

The following table shows the connections between the algorithm names of the missing data creation methods in @Santos.2019 and the functions of missMethods:

@Santos.2019   |  Function               |  Arguments
---------------|-------------------------|----------------------
MCAR1univa     |`delete_MCAR`            | n_mis_stochastic = FALSE
MCAR2univa     |`delete_MCAR`            | all default
MCAR3univa     |`delete_MCAR`            | all default
MAR1univa      |`delete_MAR_censoring`   | sorting = FALSE
MAR2univa      |`delete_MAR_rank`        | all default
MAR3univa      |`delete_MAR_1_to_x`      | x = 1/9
MAR4univa      |`delete_MAR_censoring`   | where = "upper"
MAR5univa      |`delete_MAR_censoring`   | where = "both"
MNAR1univa     |`delete_MNAR_censoring`  | sorting = FALSE
MNAR2univa     |`delete_MNAR_censoring`  | where = "upper"
MCAR1unifo     |`delete_MCAR`            | n_mis_stochastic = FALSE
MCAR2unifo     |`delete_MCAR`            | p_overall = TRUE 
MAR1unifo      |`delete_MAR_censoring`   | all default
MAR2unifo      |`delete_MAR_censoring`   | sorting = FALSE
MAR3unifo      |`delete_MAR_one_group`   | all default
MAR4unifo      |`delete_MAR_one_group`   | all default
MNAR1unifo     |`delete_MNAR_censoring`  | all default
MNAR2unifo     |`delete_MNAR_censoring`  | sorting = FALSE
MNAR3unifo     |`delete_MAR_one_group`   | all default
MNAR4unifo     |`delete_MAR_one_group`   | all default

Only the argument(s), which default values must be altered, are shown in the table. Notice that most functions in missMethods are more general than the described algorithms in @Santos.2019. Therefore, some functions of missMethods are able to replace different algorithms from @Santos.2019. In contrast to @Santos.2019, the user must always specify the missing column(s). However, this may change in the future. 

## References
