---
title: "randomForestVIP Vignette"
author: "Kelvyn Bladen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{randomForestVIP Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

The goal of `randomForestVIP` is to tune and select a Random Forest model with high
accuracy and interpretability. This is done by tuning the Random Forest 
based on the accuracy and variable importance metrics associated with each 
model. To accomplish this, functions are available to tabulate and plot 
results designed to help the user select an optimal model. 

The function `mtry_compare` may be used to tune the hyper-parameter mtry
based on model performance and variable importance metrics. This grid-search 
technique provides tables and plots showing the effect of mtry on each of the 
assessment metrics. It also returns each of the evaluated models to the user.

This package also contains functions for assessing relationships among the 
predictor variables and between the predictors and response. These 
are relevant for any predictive model, not just Random Forests. Metrics such 
as partial correlations and variance inflation factors are available for a variety of modeling techniques (not just linear regressions). These are tabulated and 
plotted for the analyst using the functions `partial_cor` and `robust_vifs`. 

The package also provides superior `ggplot2` variable importance plots for 
individual models using the function `ggvip`. This function is a highly 
aesthetic and editable improvement upon the function `randomForest::varImpPlot`
and other basic importance graphics.

All of the plots generated by these functions are developed with `ggplot2`
techniques so that the user has the ability to edit and improve further upon 
the plots.

For methodology see "Contributions to Random Forest Variable Importance with 
Applications in R" <https://digitalcommons.usu.edu/etd/8587/>.

## Example: Boston

```{r, warning=FALSE, message=FALSE}
library(randomForestVIP)
library(MASS)
library(EZtune)
```

To introduce the functionality of `randomForestVIP`, we look at modeling the Boston 
housing data (found in the MASS package). We want to 
build a Random Forest model with a view towards both accuracy and 
interpretability. We begin by running some preliminary diagnostics on our data.

```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
set.seed(1234)

pcs <- partial_cor(medv ~ ., data = Boston, model = lm)
pcs$plot_y_part_cors

rv <- robust_vifs(medv ~ ., data = Boston, model = lm)
rv$plot_lin_vifs
```

These functions assess concerns with collinearity. Notice that the VIFs from 
`robust_vifs` are all less than 10. The partial correlations with the response 
from `partial_cor` are a type of pseudo-importance assessing the importance 
each variable does not share with the others. Now we tune our model by 
assessing four different mtry values in the `mtry_compare` function.

```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
set.seed(1)
m <- mtry_compare(medv ~ .,
  data = Boston, sqrt = TRUE,
  mvec = c(1, 4, 9, 13), num_var = 7
)
m$gg_model_errors
m$model_errors
```

According to the accuracy plot and table above, our best choice is when mtry is 
4. However, the accuracy for the best model is notably only slightly better than 
the models with mtry set to 9 and 13. We now look at the variable importance 
metrics across the different models.

```{r, warning=FALSE, message=FALSE, fig.width=6, fig.height=5, fig.align='center'}
m$gg_var_imp_error
```

The top two variables are consistently identified as more important than the 
other variables and their order remains unchanged across mtry. However, the 
variables 'nox' and 'dis' switch order as mtry increases. Pollution (nox) has a 
strong negative correlation with distance to employment centers (dis). This 
makes sense if the employment centers are responsible for much of the pollution.
If many home buyers consider distance to 
work more important than pollution when selecting a house, 'dis' is more likely
to be a causal driver of price than 'nox'. By this reasoning, the model where 
mtry is 9 appears to be superior to the model where mtry is 4, despite mtry of 4 
yielding slightly more accurate results.

We now take our selected model and build individual importance plots for it 
using `ggvip`.

```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
g <- ggvip(m$rf9)$both_vips
```

The plot above resembles a standard variable importance plot, but possesses 
superior tick labels and editing capabilities for the analyst. 

We have used the `randomForestVIP` package to tune a strong model for prediction and with 
reasonably useful importance values. This was accomplished by assessing 
variable importance and accuracy metrics across the hyper-parameter mtry.

## Example: Lichens in Pacific Northwest

```{r, warning=FALSE, message=FALSE}
library(randomForestVIP)
```

To further demonstrate the functionality of `randomForestVIP`, we provide another 
example. This time using classification data. We look at modeling the Lichen 
data (found in the EZtune package) with a view towards both accuracy and 
interpretability. The response is presence or absence (coded 0 or 1) of a 
lichen species, Lobaria oregana. We begin by running preliminary diagnostics on 
our data using `partial_cor` and `robust_vifs`.

```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
set.seed(1234)

lichen <- EZtune::lichen[, -c(1, 3:8)]

pairs(lichen[, c(16, 20, 26)])
cor(lichen[, c(16, 20, 26)])

pcs <- partial_cor(factor(LobaOreg) ~ .,
  data = lichen, model = lm,
  num_var = 15
)
pcs$plot_y_part_cors

rv <- robust_vifs(factor(LobaOreg) ~ .,
  data = lichen, model = lm,
  num_var = 15
)
rv$plot_nonlin_vifs
```

These variables exhibit high collinearity. To illustrate this observation, consider the pairs plots above for 'MinTempAve', 'Elevation', and 
'AmbVapPressAve'. Most of the VIFs from `robust_vifs` exceed the standard 
threshold. The partial correlations with the response from `partial_cor` are a 
type of pseudo-importance assessing the importance each variable does not share 
with the others. Now we tune our Random Forest model across four mtry values.

```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
set.seed(100)
m <- mtry_compare(factor(LobaOreg) ~ .,
  data = lichen, sqrt = TRUE,
  mvec = c(1, 5, 19, 33), num_var = 7
)
m$gg_model_errors
m$model_errors
```

According to the accuracy plot and table above, our best choice is when mtry is 
19. However, the accuracy for the best model is only slightly better than 
the models with mtry set to 5 and 33. We now look at the variable importance 
metrics across the different models.

```{r, warning=FALSE, message=FALSE, fig.width=6, fig.height=5, fig.align='center'}
m$gg_var_imp_error
```

There are 3 variables to focus on. 'MinTempAve', 'Elevation', and 
'AmbVapPressAve' were all shown to be highly correlated above. These variables 
appear to be the most importance variable when mtry is small. However, as mtry 
increases, the importance of 'Elevation' drops off a bit, and the importance of 
'AmbVapPressAve' drops even more. After seeing these changes, a researcher 
might consider how these variables actually affect lichen presence. They would find that 
'MinTempAve' informs freezing which directly contributes to lichen presence. 
They would also realize that 'Elevation' indirectly causes lichen presence 
since 'Elevation' drives 'MinTempAve'. 'AmbVapPressAve' can be assumed to be a 
byproduct of 'Elevation' and is not a feature that should have much of a causal 
impact on lichen presences. While it is highly predictive, it is not something 
a scientist would prescribe for inducing the response. In this example, as mtry 
increases, casual variables rise while collinear byproducts fall.

No solution is perfect, but mtry of 33 yields results that match the intuition 
about the effect our predictors have on the response.

We now take our selected model and build individual importance plots for it 
using `ggvip`.

```{r, warning=FALSE, message=FALSE, fig.width=6, fig.height=5, fig.align='center'}
g <- ggvip(m$rf33, num_var = 12)$both_vips
```

We have used the `randomForestVIP` package to tune a model for prediction and with 
superior importance values. This was accomplished by assessing variable 
importance and accuracy metrics across the hyper-parameter mtry.