---
title: "Simple data exploration"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simple data exploration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

The `summarize` function in `dplyr`, especially when combined with `group_by` and `across`, provides powerful tools for exploring data using summary statistics.
The `psyntur` package provides some wrappers to these tools to allow data exploration, albeit of a limited kind, to be done quickly and easily.
We explore some of these functions in this vignette.

Load the `psyntur` functions and data sets with the usual `library` command.
```{r setup}
library(psyntur)
```

# Summary statistics with `describe`

We can use the `describe` function in `psyntur`.
The first argument to `describe` should be the data frame.
Subsequent arguments should be named arguments of summary statistics functions, like `mean`, `median`, etc., applied to any variables in the data frame.
For example, using the `faithfulfaces` data frame, we can obtain the arithmetic mean and standard deviation of the `faithful` variable as follows.
```{r}
describe(data = faithfulfaces, avg = mean(faithful), stdev = sd(faithful))
```

We can apply the same or different functions to the same or different variables.
```{r}
describe(data = faithfulfaces,
         avg_faith = mean(faithful), 
         avg_trust = mean(trustworthy),
         sd_trust = sd(trustworthy))
```

We can obtain the summary statistics for the chosen variables for each group of a third variable using a `by` variable.
```{r}
describe(data = faithfulfaces, by = face_sex, 
         avg = mean(faithful), stdev = sd(faithful))
```

The `by` argument may be a vector of variables.
In this case, the chosen variables are grouped by the combination of the `by` variables.
For example, in the following we group the `time` variable in `vizverb` by both `task` and `response`.
```{r}
describe(vizverb, by = c(task, response),
         avg = mean(time),
         median = median(time),
         iqr = IQR(time),
         stdev = sd(time)
)
```


# Multiple summary functions to multiple variables

It would be tedious and repetitive to use `describe` as above if wanted to apply the same set of summary statistic functions to a set of variables.
Instead, we can use `describe_across`.
For example, to calculate the mean, median, standard deviation to two variables, `trustworthy` and `faithful`, in the `faithfulfaces` data set, we can do the following.
```{r}
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd)
)
```

Note that the data frame that is returned is in a wide format.
We can pivot this to a longer format by saying `pivot = TRUE`.
```{r}
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                pivot = TRUE
)
```

We can use the `by` variable to calculate the summary statistics for each subgroup corresponding to each value of the `by` variable, as in the following example.
```{r}
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                by = face_sex,
                pivot = TRUE
)
```

As in the case of `describe`, the `by` argument can be a vector of variables.

# Dealing with missing values with `_xna`

When variable have `NA` values, most summary statistics function will, by default, return `NA`.
To illustrate this, we can modify `faithfulfaces` to contain `NA`'s for the `faithful` variable.
```{r}
faithfulfaces_na <- faithfulfaces %>%
  dplyr::mutate(faithful = ifelse(faithful > 6, NA, faithful))
```

Now, if we try one of the above `describe` or `describe_aross` functions with the `faithful` variable, we will obtain corresponding `NA` values.
```{r}
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                by = face_sex,
                pivot = TRUE
)
```

Of course, if we set `na.rm = TRUE` in any or all of the summary functions, we will remove the `NA` values before the statistics are calculated.
This is relatively easy to do with `describe`, as in the following example.
```{r}
describe(data = faithfulfaces, by = face_sex, 
         avg = mean(faithful, na.rm = T), stdev = sd(faithful, na.rm = T))
```
However, for `describe` across, we pass in a list of functions, and so to set `na.rm = T`, we can to create `purrr` style anonymous functions calling the summary statistic function with `na.rm = T`, as in the following example.
```{r}
library(purrr)
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = ~mean(., na.rm = T), 
                                 median = ~median(., na.rm = T), 
                                 stdev = ~sd(., na.rm = T)),
                by = face_sex,
                pivot = TRUE
)
```
Anonymous function like this are not very transparent for those new to R, and the resulting function looks quite complex.

In order to avoid using code like `~mean(., na.rm = T)`, for a number of commonly used summary statistic functions (`sum`, `mean`, `median`, `var`, `sd`, `IQR`), we have made counterparts where `na.rm` is set to `TRUE` by default.
These functions have the same name as the original with the suffix `_xna` (but `IQR` is `iqr_xna`, not `IQR_xna`).
As such, we can do the following.
```{r}
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean_xna, median = median_xna, stdev = sd_xna),
                by = face_sex,
                pivot = TRUE
)
```


