---
title: "Introduction to mashr"
author: "Matthew Stephens"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{mashr intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,comment = "#",fig.width = 5,
                      fig.height = 4,fig.align = "center")
```

# Goal

This vignette introduces the `mashr` software for someone who is
familiar with the basic idea in [Urbut et al.][urbut-biorxiv], which
we strongly recommend reading before proceeding.

In this first vignette we
use only the "canonical" covariance matries to simplify presentation.
However, an important aspect of `mashr` is the use of data-driven
covariance matrices that are learned from the data, so when you are
done here you should look at [the data-driven
vignette](intro_mash_dd.html) and the [non-canonical matrix
vignette](simulate_noncanon.html). If you think you may have correlations
in your data measurements then also look at how to incorporate [correlations](intro_correlations.html).

# Outline

This package implements the "multivariate adaptive shrinkage" (mash) 
method from [Urbut et al.][urbut-biorxiv].

Mash is a method for dealing
with large numbers of tests in many (e.g. dozens) different conditions.
Here "conditions" can mean many different things: it could be different tissues
(as in Urbut et al), or different time points,
or different treatments, or even different phenotypes (think of testing multiple phenotypes at many genetic variants in a GWAS).

There are essentially four steps to a mash analysis

- Read in the data
- Set up the covariance matrices to be used
- Fit the model
- Extract posterior summaries and other quantities of interest

Here we go through an example that illustrates each of these steps in turn. 
In this example we use the same (simulated) data at each step. In complex practical applications -- especially very large applications -- one might want to use different subsets of the data at different steps. For example, one might only want to extract posterior summaries (step 4) for a subset of the tests.
We will deal with such complications in later vignettes. For now,
we note the one crucial rule: 

*The Crucial Rule:*  Step 3 (fitting the `mash` model) must
be performed with either *all* the tests you performed, or -- if that
causes computational challenges -- a *large random subset* of tests. 


In particular, you must not use only a set of selected "significant" or "strong"
tests in step 3. This is because `mash` uses step 3 to learn
about the null signal in the data, as well as the non-null signal. In particular, in settings where most tests are null or nearly-null,
this is the point where `mash` learns this, and consequently
"shrinks" ("corrects") the posterior estimates towards 0. (Effectively this
step is analogous to a "multiple testing correction" step.)


# Simulate some data

First we simulate
some data for illustration.

```{r}
library(ashr)
library(mashr)
set.seed(1)
simdata = simple_sims(500,5,1)
```

This simulation routine creates a dataset with 5 conditions, and four
different types of effect: null, independent among conditions,
condition-specific in condition 1, and shared (equal effects in all
conditions). It creates 500 effects of each type for a total of 2000
effects.

# Step 1: Read in the data

To run `mash` you need data consisting of a matrix of effects (`Bhat`)
and a matrix of standard errors (`Shat`), for $J$ effects (rows) in
$R$ conditions (columns).

[If you have only access to $Z$ scores, you can set `Bhat` to the Z
scores, and set `Shat` to be the matrix with all 1s].

The simulation above created both these matrices for us (in
`simdata$Bhat` and `simdata$Shat`). To get these ready for applying
`mash` you must first use `mash_set_data` to create a data object
with those two pieces of information:

```{r}
data = mash_set_data(simdata$Bhat, simdata$Shat)
```

# Step 2: Set up the covariance matrices 

There are two types of covariance matrix you can use in `mash`:
"canonical" and "data-driven". The canonical ones are very easy to set
up and so we use those here for illustration. However, in applications
you will likely also want to use data-driven matrices, and this is an
important feature of `mash`. See [the data-driven
vignette](intro_mash_dd.html) for more details on how to do this.

The function to set up canonical covariance matries is
`cov_canonical`. The following sets up canonical covariances in `U.c`
(we used `.c` to indicate canonical), which is a named list of
matrices.

```{r}
U.c = cov_canonical(data)  
print(names(U.c))
```

# Step 3: fit the model

Having set up the data and covariance matrices you are ready to fit
the model using the `mash` function:

```{r}
m.c = mash(data, U.c)
```

This can take a little time. What this does is to fit a mixture model
to the data, estimating the mixture proportions. Specifically the
model is that the true effects follow a mixture of multivariate normal
distributions: $B \sim \sum_k \sum_l \pi_{kl} N(0, \omega_l U_k)$
where the $\omega_l$ are scaling factors set by the "grid" parameter
in `mash` and the $U_k$ are the covariance matrices (here specified by
`U.c`).

Remember the Crucial Rule! This step must be peformed
using all the tests (or a large random subset), because this is where
`mash` learns that many tests are null and corrects for it.

# Step 4: Extract Posterior Summaries

You can extract estimates (posterior means and posterior standard
deviations) and measures of significance (local false sign rates)
using functions like `get_pm` (posterior mean), `get_psd` (posteriore
standard deviation) and `get_lfsr` (local false sign rate):

```{r}
head(get_lfsr(m.c))
head(get_pm(m.c))
head(get_psd(m.c))
```

Each of these are $J \times R$ matrices.

Use `get_significant_results` to find the indices of effects that are
"significant", which here means they have lfsr less than t in at least
one condition, where t is a threshold you specify (default 0.05). The
output is ordered from most significant to least significant.

```{r}
head(get_significant_results(m.c))
print(length(get_significant_results(m.c)))
```

You can also get the significant results in just a subset of conditions.
For example
```{r}
print(head(get_significant_results(m.c, conditions=1)))
```

## Sharing

Use `get_pairwise_sharing` to assess sharing of significant signals
among each pair of conditions. Here the default definition of shared
is "the same sign and within a factor 0.5 of each other".

```{r}
print(get_pairwise_sharing(m.c)) 
```

You can change the factor if you like. For example, here by setting
the factor to be 0 you assess only if they are the same sign:

```{r}
print(get_pairwise_sharing(m.c, factor=0))
```

## Measure of fit (log-likelihood)

Use `get_loglik` to find the log-likelihood of the fit (this will only
be useful when you have other fits to compare it with!)

```{r}
print(get_loglik(m.c))
```

## Estimated mixture proportions

Use `get_estimated_pi` to extract the estimates of the mixture
proportions for different types of covariance matrix:

```{r}
print(get_estimated_pi(m.c))
barplot(get_estimated_pi(m.c),las = 2)
```

Here we can see most of the mass is on the null, identity,
`singletons_1` (which corresponds to effects that are specific to
condition 1) and `equal_effects`. This reassuringly matches the way
that these data were generated.

## Metaplot

The following produces a meta-plot based on the posterior means and
posterior variances of an effect. Here we look at the most significant
result.

```{r}
mash_plot_meta(m.c,get_significant_results(m.c)[1])
```

# Session information.

```{r info}
print(sessionInfo())
```

[urbut-biorxiv]: https://www.biorxiv.org/content/10.1101/096552v4
