---
title: "Choosing link, variance and covariance functions"
author: "Prof. Wagner Hugo Bonat"
date: "`r paste('mcglm', packageVersion('mcglm'), Sys.Date())`"
vignette: >
  %\VignetteIndexEntry{Choosing link, variance and covariance functions}        
  %\VignetteEngine{knitr::rmarkdown} 
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
##----------------------------------------------------------------------

library(knitr)

```

****

To install the stable version of [`mcglm`][], use
`devtools::install_git()`. For more information, visit [mcglm/README].

```{r, eval=FALSE}
library(devtools)
install_git("bonatwagner/mcglm")
```

```{r, eval=FALSE, error=FALSE, message=FALSE, warning=FALSE}
library(mcglm)
packageVersion("mcglm")
```

##### Abstract
The `mcglm` package implements the multivariate covariance generalized
linear models (McGLMs) proposed by Bonat and J$\o$rgensen (2016).
The core fit function `mcglm` is employed for fitting a set of models. 
In this introductory vignette we restrict ourselves to model 
independent data, although a simple model for longitudinal data analysis
in the Gaussian case is also presented. We present models to deal with
continuous, binomial/bounded and count univariate response variables.
We explore the specification of different link, variance and covariance
functions.

****
## Regression models for continuous data

Consider a simple regression model, for univariate and independent 
Gaussian data:
$$Y \sim N(X \beta, \tau_0 Z_0).$$

```{r, warning = FALSE, message = FALSE}
# Loading extra packages
require(mcglm)
require(Matrix)
require(mvtnorm)
require(tweedie)

# Setting the seed
set.seed(2503)

# Fixed component
x1 <- seq(-1,1, l = 100)
X <- model.matrix(~ x1)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, 
                        link = "identity")
# Random component
y1 <- rnorm(100, mu1$mu, sd = 0.5)

# Data structure
data <- data.frame("y1" = y1, "x1" = x1)

# Matrix linear predictor
Z0 <- mc_id(data)

# Fit
fit1.id <- mcglm(linear_pred = c(y1 ~ x1), 
                 matrix_pred = list(Z0),
                 data = data)

```

The `mcglm` package offers the following set of `S3-methods` for
model summarize.

```{r}
print(methods(class = "mcglm"))
```

The traditional `summary` for a fitted model can be obtained by

```{r}
summary(fit1.id)
```

The function `summary.mcglm` was designed to be similar to the native
`summary` functions for the classes `lm` and `glm`. Extra features were
included to describe the dispersion structure. The mean formula call,
along with the the `link`, `variance` and `covariance` functions are
presented. The parameter estimates are presented in two blocks,
the first presents the regression estimates while the second presents
the dispersion estimates. In both cases the parameter estimates are
summarized by point estimates, standard errors, Z-values and p-values associated
with the Wald test whose null hypothesis is defined as $\beta = 0$ and
$\tau = 0$ for the mean and dispersion structures, respectively.
Finally, the selected algorithm, if the correction term is employed or 
not and the number of iterations is printed.

The same linear regresion model can be fitted by using a 
different covariance link function, for example the inverse covariance 
link function.

```{r}
# Fit using inverse covariance link function 
fit1.inv <- mcglm(linear_pred = c(y1 ~ x1), 
                  matrix_pred = list(Z0),
                  covariance = "inverse", data = data)

```
Furthermore, we can use a less conventional covariance link function,
as the exponential-matrix. 

```{r, message=FALSE, warning=FALSE}
# Fit using expm covariance link function
fit1.expm <- mcglm(linear_pred = c(y1 ~ x1), 
                   matrix_pred = list(Z0),
                   covariance = "expm", data = data)
```

The function `mcglm` returns an object of mcglm class, for which we can
use the method `coef` to extract the parameters estimates.

```{r}
# Comparing estimates using different covariance link functions
cbind(coef(fit1.id)$Estimates,
      coef(fit1.inv)$Estimates,
      coef(fit1.expm)$Estimates)

# Applying the inverse transformation
c(coef(fit1.id)$Estimates[3],
  1/coef(fit1.inv)$Estimates[3],
  exp(coef(fit1.expm)$Estimates[3]))
```

Consider an extension of the linear regression models to deal with 
heteroscedasticity:

$$ Y \sim N(X \beta, \tau_0 Z_0 + \tau_1 Z_1),$$
where $Z_0$ is a identity matrix and $Z_1$ is a diagonal matrix whose
elements are given by the values of a known covariate. Such a model,
can be fitted easily using the `mcglm` package.

```{r}
# Mean model
set.seed(1811)
x1 <- seq(-1,1, l = 100)
X <- model.matrix(~ x1)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, 
                        link = "identity")
z1 <- rnorm(100, mean = 0, sd = 0.25)
data <- data.frame("id" = 1, "x1" = x1, "z1" = z1)

# Matrix linear predictor
Z <- mc_dglm(~ z1, id = 'id', data = data)

# Covariance model
Sigma <- mcglm::mc_matrix_linear_predictor(tau = c(0.2, 0.15), Z = Z)

# Simulating the response variable
y1 <- rnorm(100, mu1$mu, sd = sqrt(diag(Sigma)))
data$y <- y1

# Fitting
fit2.id <- mcglm(linear_pred = c(y1 ~ x1), matrix_pred = list(Z), data = data)
```

We can also extend the linear regression model to deal with longitudinal
data analysis. The code below presents an example of such a model.

```{r}
# Mean model
x1 <- seq(-1,1, l = 100)
X <- model.matrix(~ x1)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, 
                        link = "identity")

# Data structure
data <- data.frame("id" = as.factor(rep(1:10, each = 10)), "x1" = x1)

# Covariance model
Z0 <- mc_id(data)
Z1 <- mc_mixed(~ 0 + id, data = data)
Sigma <- mcglm::mc_matrix_linear_predictor(tau = c(0.2, 0.15), Z = c(Z0,Z1))

# Simulating the Response variable
y1 <- as.numeric(rmvnorm(1, mean = mu1$mu, sigma = as.matrix(Sigma)))
data <- data.frame("y1" = y1, "x1" = x1)

# Fit
fit3.id <- mcglm(linear_pred = c(y1 ~ x1), 
                 matrix_pred = list("resp1" = c(Z0,Z1)), data = data)
```

The model summary

```{r}
summary(fit3.id)
```

Note that, the dispersion structure now has two parameters. 
In that case, the parameter $\tau_1$ represents the longitudinal 
structure for which we are assuming a compound symmetry model. This
model is an equivalent to a random intercept model in the context of
Linear Mixed Models (LMMs).

## Regression models for binomial and bounded data

The `mcglm` package offers a rich set of models to deal with binomial
and bounded response variables. The `logit`, `probit`, `cauchit`,
`cloglog`, and `loglog` link functions along with the extended binomial
variance function combined with the linear covariance structure, 
provide a flexible class of models for handling binomial and 
bounded response variables. The extended binomial variance function is
given by $\mu^p (1- \mu)^q$ where the two extra power parameters 
offer more flexibility to model the relationship between mean and 
variance. Consider the following simulated dataset.

```{r}
# Mean model
x1 <- seq(-1,1, l = 500)
X <- model.matrix(~ x1)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, link = "logit")

# Data structure
data <- data.frame("x1" = x1)

# Covariance model
Z0 <- mc_id(data)

# Simulating the response variable
set.seed(123)
data$y <- rbinom(500, prob = mu1$mu, size = 10)/10
```

The most traditional regression model to deal with binomial data is the
logistic regression model that can be fitted using the `mcglm` package
using the following code:

```{r}
# Fit
fit4.logit <- mcglm(linear_pred = c(y ~ x1), 
                    matrix_pred = list(Z0),
                    link = "logit", variance = "binomialP",
                    power_fixed = TRUE,
                    Ntrial = list(rep(10,500)), data = data)
```

It is important to highlight that the response variable collumn should
be between $0$ and $1$ and in the case of more than one trial the 
argument `Ntrial` should be used for fitting the model. 
The argument `link` specifies the link function whereas the argument 
`variance` specifies the variance function in that case `binomialP`. 
The variance function `binomialP` represents a simplification of the 
extended binomial variance function given by $\mu^p (1- \mu)^p$. 
Note that, in this example the argument 'power_fixed = TRUE' specifies
that the power parameter $p$ will not be estimated, but fixed at the 
initial value $p = 1$ corresponding to the orthodox binomial variance
function. 
We can easily fit the model using a different link function, for example
the `cauchit`.

```{r}
fit4.cauchit <- mcglm(linear_pred = c(y ~ x1), 
                      matrix_pred = list(Z0),
                      link = "cauchit", variance = "binomialP", 
                      Ntrial = list(rep(10,250)), data = data)
```

We can also estimate the extra power parameter $p$.

```{r}
fit4.logitP <- mcglm(linear_pred = c(y ~ x1), 
                      matrix_pred = list(Z0),
                      link = "logit", variance = "binomialP",
                      power_fixed = FALSE,
                      Ntrial = list(rep(10,500)), data = data)
```

Furthermore, we can estimate the two extra power parameters involved in
the extended binomial variance function.

```{r}
fit4.logitPQ <- mcglm(linear_pred = c(y ~ x1), 
                      matrix_pred = list(Z0),
                      link = "logit", variance = "binomialPQ",
                      power_fixed = FALSE,
                      Ntrial = list(rep(10,500)), 
                      control_algorithm = list(tuning = 0.5, max_iter = 100),
                      data = data)
```

The estimation of the extra power parameters involved in the
extended binomial variance function is challenging mainly for small data 
sets. Note that, in this simulated example, we have to control the 
step-length of the `chaser` algorithm to avoid unrealistic values for
the parameters involved in the dispersion structure. To do that, we used
the extra argument `control_algorithm` that should be a named list. 
For a detailed description of the arguments that can be passed to the
`control_algorithm` function see `?fit_mcglm`. 

## Regression models for count data

The analysis of count data in the `mcglm` package relies on the 
structure of the Poisson-Tweedie distribution. Such a distribution is
characterized by the following dispersion function:

$$ \nu(\mu, p) = \mu + \tau_0 \mu^p. $$
The power parameter is an index that identify different distributions,
examples include the Hermite ($p = 0$), Neyman-Type A ($p = 1$) and the
negative binomial ($p = 2$). 

The orthodox Poisson model can be fitted using the `mcglm` package using
the `Tweedie` variance function $\nu(\mu, p ) = \mu^p$ where the power
parameter $p$ is fixed at $1$. For example,

```{r}
# Mean model

x1 <- seq(-2,2, l = 200)
X <- model.matrix(~ x1)
mu <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, link = "log")

# Data structure
data <- data.frame("x1" = x1)

# Covariance model
Z0 <- mc_id(data)

# Data structure
data$y <- rpois(200, lambda = mu$mu)

# Fit
fit.poisson <- mcglm(linear_pred = c(y ~ x1), 
                    matrix_pred = list(Z0),
                    link = "log", variance = "tweedie",
                    power_fixed = TRUE, data = data)
```

Another very useful model for count data is the negative binomial. 

```{r}
# Simulating negative binomial models
set.seed(1811)
x <- rtweedie(200, mu = mu$mu, power = 2, phi = 0.5)
y <- rpois(200, lambda = x)
data <- data.frame("y1" = y, "x1" = x1)

fit.pt <- mcglm(linear_pred = c(y ~ x1), matrix_pred = list(Z0), 
                link = "log", variance = "poisson_tweedie", 
                power_fixed = FALSE, data = data)
summary(fit.pt)
```

Note that, we estimate the power parameter rather than fix it at $p = 2$.


<!---------------------------------------------------------------------- -->

[`mcglm`]: https://github.com/bonatwagner/mcglm
[mcglm/README]: https://github.com/bonatwagner/mcglm/blob/main/README.md