---
title: "Peptide Annotation and Protein Inference"
author: "Witold Wolski"
date: "March 16, 2017"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Peptide Annotation and Protein Inference}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```

# Introduction

This Vignette describes how the `prozor::greedy` function can be used to infer proteins form peptide identifications using the Occam's razor principle. The method tries to find a minimal set of porteins which can explain all the peptides identified.

```{r loadData}
library(prozor)

rm(list = ls())

file = system.file("extdata/IDResults.txt.gz" , package = "prozor")
specMeta <- readr::read_tsv(file)
head(specMeta)

```


Annotate peptide sequences with protein sequences from two fasta files on with reviewed entries only (sp) and the other with reviewed and Trembl entries (sp/tr).

```{r fig.cap = "Number of proteins in the All and Canonical database."}
length(unique(specMeta$peptideSeq))
upeptide <- unique(specMeta$peptideSeq)

resAll <-
    prozor::readPeptideFasta(
        system.file("p1000_db1_example/Annotation_allSeq.fasta.gz" , package = "prozor"))

resCan <-
    prozor::readPeptideFasta(
        system.file("p1000_db1_example/Annotation_canSeq.fasta.gz" , package = "prozor"))

annotAll <- prozor::annotatePeptides(upeptide, resAll)
head(subset(annotAll,select =  -proteinSequence))
annotCan <- prozor::annotatePeptides(upeptide, resCan)
barplot(c(All = length(resAll),Canonical = length(resCan)))
```



```{r fig.cap="Number of unique peptide protein pairs  for the All and Canonical database."}
barplot(c(All = nrow(annotAll), Canonical = nrow(annotCan)), ylab = "# peptide protein matches.")
```

We can see that using the larger fasta database reduces the proportion of proteotypic peptides.

```{r}
PCProteotypic_all <-
    sum(table(annotAll$peptideSeq) == 1) / length(table(annotAll$peptideSeq)) * 100
PCProteotypic_canonical <-
    sum(table(annotCan$peptideSeq) == 1) / length(table(annotCan$peptideSeq)) * 100

barplot(
    c(All = PCProteotypic_all, Canonical =  PCProteotypic_canonical),
    las = 2,
    ylab = "% proteotypic"
)

```


# Protein Inference 


We can now identify a minmal set of proteins explaining all the peptides observed for both databases

```{r}
library(Matrix)
precursors <-
    unique(subset(specMeta, select = c(
        peptideModSeq, precursorCharge, peptideSeq
    )))

```

## Protein Inference for database with Trembl identifiers

```{r greedyTrembl, fig.cap="Peptide protein machtes for the All database. Rows - peptides, Columns - proteins, black - peptide protein match."}
library(Matrix)
annotatedPrecursors <- merge(precursors ,
                             subset(annotAll, select = c(peptideSeq, proteinID)),
                             by.x = "peptideSeq",
                             by.y = "peptideSeq")


xx <-
    prepareMatrix(annotatedPrecursors,
                  proteinID = "proteinID",
                  peptideID = "peptideSeq")

image(xx)
xxAll <- greedy(xx)

```

## Protein Inference for Reviewed/Canonical database

```{r greedyCanonical, fig.cap="Peptide protein machtes for the Canonical database. Rows - peptides, Columns - proteins, black - peptide protein match."}
annotatedPrecursors <-
    merge(precursors ,
          subset(annotCan, select = c(peptideSeq, proteinID)),
          by.x = "peptideSeq",
          by.y = "peptideSeq")


xx <-
    prepareMatrix(annotatedPrecursors ,
                  proteinID = "proteinID",
                  peptideID = "peptideSeq")
image(xx)
xxCAN <- greedy(xx)

```

# Conclusion

We see that the number of proteins needed to explain all the peptides is practically identical for both databases. Also in practice using a database with more entries does not lead to more identified proteins. On the contrary, it might even reduce the number of porteins identified.

```{r fig.cap="Number of proteins before and after protein inference."}
barplot(c(All_before = length(unique(annotAll$proteinID)), All_after = length(unique(unlist(
    xxAll
))) , Canonical_before =  length(unique(annotCan$proteinID)), Canonical_after = length(unique(unlist(
    xxCAN
)))))

```


# TODO

[Protein Grouping and Clustering in Scaffold](https://www.dropbox.com/s/a4kqyc4gxln8sfj/scaffold_protein_grouping_clustering.pdf?dl=1)


# Session Info

```{r}
sessionInfo()
```