---
date: "`r Sys.Date()`"
title: "Exploring Missingness with mde"
output: html_document
vignette: >
  %\VignetteIndexEntry{mde-missingness}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
resource_files:
  - man/figures/mde_icon_2.png
 
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


The goal of `mde` is to ease exploration of missingness.  


**Loading the package**

```{r}

library(mde)

```


To get a simple missingness report, use `na_summary`:


```{r}

na_summary(airquality)

```


To sort this summary by a given column :

```{r}

na_summary(airquality,sort_by = "percent_complete")

```


If one would like to reset (drop) row names, then one can set `row_names` to 
`TRUE` This may especially be useful in cases where `rownames` are simply 
numeric and do not have much additional use.


```{r reset_rownames}

na_summary(airquality,sort_by = "percent_complete", reset_rownames = TRUE)

```


To sort by `percent_missing` instead:

```{r}
na_summary(airquality, sort_by = "percent_missing")

```

To sort the above in descending order:

```{r}
na_summary(airquality, sort_by="percent_missing", descending = TRUE)

```
To exclude certain columns from the analysis:

```{r}

na_summary(airquality, exclude_cols = c("Day", "Wind"))

```

To include or exclude via regex match:

```{r}
na_summary(airquality, regex_kind = "inclusion",pattern_type = "starts_with", pattern = "O|S")
```

```{r}
na_summary(airquality, regex_kind = "exclusion",pattern_type = "regex", pattern = "^[O|S]")
```
To get this summary by group:

```{r}

test2 <- data.frame(ID= c("A","A","B","A","B"), Vals = c(rep(NA,4),"No"),ID2 = c("E","E","D","E","D"))

na_summary(test2,grouping_cols = c("ID","ID2"))

```

```{r}

na_summary(test2, grouping_cols="ID")


```

* `get_na_counts`

This provides a convenient way to show the number of missing values column-wise. It is relatively fast(tests done on about 400,000 rows, took a few microseconds.)

To get the number of missing values in each column of `airquality`, we can use the function as follows:

```{r}

get_na_counts(airquality)


```

The above might be less useful if one would like to get the results by group. In that case, one can provide a grouping vector of names in `grouping_cols`. 


```{r}

test <- structure(list(Subject = structure(c(1L, 1L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), res = c(NA, 1, 2, 3), ID = structure(c(1L, 
1L, 2L, 2L), .Label = c("1", "2"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

get_na_counts(test, grouping_cols = "ID")


```

* `percent_missing`

This is a very simple to use but quick way to take a look at the percentage of data that is missing column-wise.


```{r}


percent_missing(airquality)


```

We can get the results by group by providing an optional `grouping_cols` character vector. 

```{r}

percent_missing(test, grouping_cols = "Subject")


```


To exclude some columns from the above exploration, one can provide an optional character vector in `exclude_cols`.


```{r}

percent_missing(airquality,exclude_cols = c("Day","Temp"))


```

* `sort_by_missingness`

This provides a very simple but relatively fast way to sort variables by missingness. Unless otherwise stated, this does not currently support arranging grouped percents.

Usage:

```{r}


sort_by_missingness(airquality, sort_by = "counts")


```

To sort in descending order:

```{r}

sort_by_missingness(airquality, sort_by = "counts", descend = TRUE)

```

To use percentages instead:

```{r}

sort_by_missingness(airquality, sort_by = "percents")

```


Please note that the `mde` project is released with a
[Contributor Code of Conduct](https://github.com/Nelson-Gon/mde/blob/master/.github/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.


For further exploration, please `browseVignettes("mde")`. 


To raise an issue, please do so [here](https://github.com/Nelson-Gon/mde/issues)

Thank you, feedback is always welcome :)