---
title: "Haplotype Discrete Survival Models"
author: Klaus Holst & Thomas Scheike
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
    fig_width: 7.15
    fig_height: 5.5        
vignette: >
  %\VignetteIndexEntry{Haplotype Discrete Survival Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(mets)
```


Haplotype Analysis for discrete TTP
===================================

Cycle-specific logistic regression of haplotype effects with
known haplotype probabilities. Given observed genotype $G$ and
unobserved haplotypes $H$, we marginalise over the possible
haplotypes using $P(H|G)$ supplied as input.

\begin{align*}
   S(t|x,G) & = E( S(t|x,H) | G)  = \sum_{h \in G} P(h|G) S(t|z,h)
\end{align*}
so survival can be computed by marginalising over possible $h$ given $g$.

Survival is based on logistic regression for the discrete hazard
function of the form
\begin{align*}
      \mbox{logit}(P(T=t \mid T \geq t, x, h)) & = \alpha_t + x(h)^T \beta
\end{align*}
where $x(h)$ is a regression design of $x$ and haplotypes $h=(h_1,h_2)$.

Simple binomial data can also be fitted using this function.

Standard errors are computed assuming that the haplotype probabilities are known.

	
We are particularly interested in the types haplotypes: 
```{r}
types <- c("DCGCGCTCACG","DTCCGCTGACG","ITCAGTTGACG","ITCCGCTGAGG")

## some haplotypes frequencies for simulations 
data(haplo)
hapfreqs <- haplo$hapfreqs 
print(hapfreqs)
```

Among the haplotypes of interest we look up the frequencies and choose a baseline:

```{r}
www <-which(hapfreqs$haplotype %in% types)
hapfreqs$freq[www]

baseline=hapfreqs$haplotype[9]
baseline
```


We have cycle-specific data with $id$ and outcome $y$:

```{r}
haploX  <- haplo$haploX
dlist(haploX,.~id|id %in% c(1,4,7))
```
and a list of possible haplotypes for each id and their probabilities $p$
(the probabilities within each id sum to 1):

```{r}
ghaplos <- haplo$ghaplos
head(ghaplos)
```

The first id=1 has the haplotype fully observed, but id=2 has two possible haplotypes consistent with the
observed genotype for this id, the probabilities are 7\% and 93\%, respectively. 

With the baseline given above we can specify a regression design that gives an effect if a haplotype of the specified type
is present (`sm=0`), or an additive effect of haplotypes (`sm=1`):

```{r}
designftypes <- function(x,sm=0) {
hap1=x[1]
hap2=x[2]
if (sm==0) y <- 1*( (hap1==types) | (hap2==types))
if (sm==1) y <- 1*(hap1==types) + 1*(hap2==types)
return(y)
}

```

To fit the model we first construct a time design matrix (named `X`)
and supply the haplotype distributions for each id:

```{r}
haploX$time <- haploX$times
Xdes <- model.matrix(~factor(time),haploX)
colnames(Xdes) <- paste("X",1:ncol(Xdes),sep="")
X <- dkeep(haploX,~id+y+time)
X <- cbind(X,Xdes)
Haplos <- dkeep(ghaplos,~id+"haplo*"+p)
desnames=paste("X",1:6,sep="")   # six X's related to 6 cycles 
head(X)

```

We can now fit the model with the design given by the design function:
```{r}
out <- haplo_surv_discrete(X=X,y="y",time.name="time",
      Haplos=Haplos,desnames=desnames,designfunc=designftypes) 
names(out$coef) <- c(desnames,types)
out$coef
summary(out)
```

Haplotypes `"DCGCGCTCACG"` and `"DTCCGCTGACG"` give an increased hazard of pregnancy.

The data were generated with these true coefficients:

```{r}
tcoef=c(-1.93110204,-0.47531630,-0.04118204,-1.57872602,-0.22176426,-0.13836416,
0.88830288,0.60756224,0.39802821,0.32706859)

cbind(out$coef,tcoef)
```

The fitted design matrix can be found in the output:
```{r}
head(out$X,10)
```


SessionInfo
============

```{r}
sessionInfo()
```

