---
title: "A quick tour of mclustAddons"
author: "Luca Scrucca"
date: "`r format(Sys.time(), '%d %b %Y')`"
output: 
  rmarkdown::html_vignette:
    toc: true
    number_sections: false
    css: "vignette.css"
vignette: >
  %\VignetteIndexEntry{A quick tour of mclustAddons}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(fig.align = "center", 
               out.width = "80%",
               fig.width = 7, 
               fig.height = 6,
               dev.args = list(pointsize=12),
               par = TRUE, # needed for setting hook 
               collapse = TRUE, # collapse input & output code in chunks
               warning = FALSE)

knit_hooks$set(par = function(before, options, envir)
  { if(before && options$fig.show != "none") 
       par(family = "sans", mar=c(4.1,4.1,1.1,1.1), mgp=c(3,1,0), tcl=-0.5)
})
set.seed(1) # for exact reproducibility
```
       
# Introduction

**mclustAddons** is a contributed R package that extends the functionality available in the **mclust** package (Scrucca et al. 2016, Scrucca et al. 2023).  
In particular, the following methods are included:

- density estimation for data with bounded support (Scrucca, 2019);

- modal clustering using modal EM algorithm for Gaussian mixtures (Scrucca, 2021);

- entropy estimation via Gaussian mixture modeling (Robin & Scrucca, 2023).

This document gives a quick tour of **mclustAddons** (version `r packageVersion("mclustAddons")`). It was written in [R Markdown](https://rmarkdown.rstudio.com), using the [knitr](https://cran.r-project.org/package=knitr) package for production. 

References on the methodologies implemented are provided at the end of this document.

```{r}
library(mclustAddons)
```

# Density estimation for data with bounded support

## Univariate case with lower bound

```{r}
x <- rchisq(200, 3)
xgrid <- seq(-2, max(x), length=1000)
f <- dchisq(xgrid, 3)  # true density
dens <- densityMclustBounded(x, lbound = 0)
summary(dens, parameters = TRUE)
plot(dens, what = "density")
lines(xgrid, f, lty = 2)
plot(dens, what = "density", data = x, breaks = 15)
```

## Univariate case with lower & upper bounds

```{r}
x <- rbeta(200, 5, 1.5)
xgrid <- seq(-0.1, 1.1, length=1000)
f <- dbeta(xgrid, 5, 1.5)  # true density
dens <- densityMclustBounded(x, lbound = 0, ubound = 1)
summary(dens, parameters = TRUE)
plot(dens, what = "density")
plot(dens, what = "density", data = x, breaks = 11)
```

## Bivariate case with lower bounds

```{r}
x1 <- rchisq(200, 3)
x2 <- 0.5*x1 + sqrt(1-0.5^2)*rchisq(200, 5)
x <- cbind(x1, x2)
dens <- densityMclustBounded(x, lbound = c(0,0))
summary(dens, parameters = TRUE)
plot(dens, what = "BIC")
plot(dens, what = "density")
points(x, cex = 0.3)
abline(h = 0, v = 0, lty = 3)
plot(dens, what = "density", type = "hdr")
abline(h = 0, v = 0, lty = 3)
plot(dens, what = "density", type = "persp")
```

## Suicide data

The data consist in the lengths of 86 spells of psychiatric treatment undergone by control patients in a suicide study (Silverman, 1986).

```{r}
data("suicide")
dens <- densityMclustBounded(suicide, lbound = 0)
summary(dens, parameters = TRUE)
plot(dens, what = "density", 
     lwd = 2, col = "dodgerblue2",
     data = suicide, breaks = 15, 
     xlab = "Length of psychiatric treatment")
rug(suicide)
```

## Racial data

This dataset provides the proportion of white student enrollment in 56 school districts in Nassau County (Long Island, New York), for the 1992-1993 school year (Simonoff 1996, Sec. 3.2).

```{r}
data("racial")
x <- racial$PropWhite
dens <- densityMclustBounded(x, lbound = 0, ubound = 1)
summary(dens, parameters = TRUE)
plot(dens, what = "density", 
     lwd = 2, col = "dodgerblue2",
     data = x, breaks = 15, 
     xlab = "Proportion of white student enrolled in schools")
rug(x)
```

<br><br><br>

# Modal clustering using MEM algorithm for Gaussian mixtures

## Simulated datasets

```{r}
data(Baudry_etal_2010_JCGS_examples, package = "mclust")
GMM <- Mclust(ex4.1)
plot(GMM, what = "classification")
MEM <- MclustMEM(GMM)
summary(MEM)
plot(MEM)
plot(MEM, addPoints = FALSE)
```

```{r}
GMM <- Mclust(ex4.4.2)
plot(GMM, what = "classification")
MEM <- MclustMEM(GMM)
summary(MEM)
plot(MEM)
plot(MEM, addDensity = FALSE)
```

<br><br><br>

# Entropy estimation

## Simulated data

<br>
**Univariate Gaussian**
```{r, out.width="50%"}
EntropyGauss(1)       # population entropy
x = rnorm(1000)       # generate sample
EntropyGauss(var(x))  # sample entropy assuming Gaussian distribution
mod = densityMclust(x, plot = FALSE)
EntropyGMM(mod)       # GMM-based entropy estimate
plot(mod, what = "density", data = x, breaks = 31); rug(x)
```

<br>

**Univariate Mixed-Gaussian**\
Consider the mixed-Gaussian distribution $f(x) = 0.5 \times N(-2,1) + 0.5 \times N(2,1)$, whose entropy is 2.051939 in the population.
```{r, out.width="50%"}
cl = rbinom(1000, size = 1, prob = 0.5)
x = ifelse(cl == 1, rnorm(1000, 2, 1), rnorm(1000, -2, 1))   # generate sample
mod = densityMclust(x, plot = FALSE)
EntropyGMM(mod)       # GMM-based entropy estimate
plot(mod, what = "density", data = x, breaks = 31); rug(x)
```

<br>

**Multivariate Chi-squared**\
Consider a 10-dimensional independent $\chi^2$ distribution, whose entropy is 24.23095 in the population.
```{r}
x = matrix(rchisq(1000*10, df = 5), nrow = 1000, ncol = 10)
mod1 = densityMclust(x, plot = FALSE)
EntropyGMM(mod1)      # GMM-based entropy estimate, not too bad but...
mod2 = densityMclustBounded(x, lbound = rep(0,10))
EntropyGMM(mod2)      # much more accurate
```

## Faithful data

```{r}
data(faithful)
mod = densityMclust(faithful, plot = FALSE)
EntropyGMM(mod)       # GMM-based entropy estimate
# or provide the data and fit GMM implicitly
EntropyGMM(faithful)
```

## Iris data

```{r}
data(iris)
mod = densityMclust(iris[,1:4], plot = FALSE)
EntropyGMM(mod)       # GMM-based entropy estimate
```

# Volatility analysis of financial log-returns

## Gold price 2023 data

```{r}
data(gold)
head(gold)

# GMM modeling 
mod = GMMlogreturn(gold$log.returns)
summary(mod)
plot(mod, what = "BIC")
plot(mod, what = "density", data = gold$log.returns)
plot(mod, what = "diagnostic")

# compare to single Gaussian model
mod1 = GMMlogreturn(gold$log.returns, G = 1)
y0 = extendrange(mod$data, f = 0.1)
y0 = seq(min(y0), max(y0), length = 1000)
plot(mod, what = "density", data = gold$log.returns, col = "steelblue",
     xlab = "Gold price log-returns", ylab = "Density")
lines(y0, predict(mod1, what = "dens", newdata = y0), col = "red3")
legend("topright", legend = c("Gaussian", "GMM"), lty = c(1,1),
       col = c("red3", "steelblue"), inset = 0.02)
```

<br>

# References

Scrucca L., Fraley C., Murphy T. B. and Raftery A. E. (2023) *Model-Based Clustering, Classification, and Density Estimation Using mclust in R*. Chapman & Hall/CRC, ISBN: 978-1032234953, https://mclust-org.github.io/book/

Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. *The R Journal* 8/1, pp. 289-317.
https://doi.org/10.32614/RJ-2016-021
  
Scrucca L. (2019) A transformation-based approach to Gaussian mixture density estimation for bounded data, *Biometrical Journal*, 61:4, 873–888. https://doi.org/10.1002/bimj.201800174

Scrucca L. (2021) A fast and efficient Modal EM algorithm for Gaussian mixtures. *Statistical Analysis and Data Mining*, 14:4, 305–314. https://doi.org/10.1002/sam.11527

Robin S. and Scrucca L. (2023) Mixture-based estimation of entropy. *Computational Statistics & Data Analysis*, 177, 107582. https://doi.org/10.1016/j.csda.2022.107582

----

```{r}
sessionInfo()
```
