---
title: "Creating Target Decoy Databases"
author: "Witek Wolski"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Creating Target Decoy Databases}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette shows how to use the package prozor to create standardized fasta databases. The package can be used to:

* merge several fasta files into a single fasta file
* add reverse sequences to the fasta file
* add contaminants to the fasta file

# Contaminants

This package provides two sets of typical contaminant proteins and peptides, one with and one without contaminants of human origin, which can be accessed by the functions `loadContaminantsFasta` and `loadContaminantsNoHumanFasta`. At the [FGCZ](https://fgcz.ch/) we always add one of those two contaminant files to the database. To databases that already contain human proteins, we add the `ContaminantsNoHumanFasta`. The contaminants are easy to distinguish from other entries thanks to the `zz|FGCZCont` prefix.

```{r}
library(prozor)
head(names(loadContaminantsFasta2021()))
length(loadContaminantsFasta2021())
length(loadContaminantsFasta2021(noHuman = TRUE))
```

# Creating a fasta protein amino acid sequence database for searching

To merge several _fasta_ databases into a single file, place them into a single folder and give the folder the name of the database. At the [FGCZ](https://fgcz.ch/) the database name consists of the project number, e.g. `p1000`, a consecutive number, e.g. `db1`, and a descriptive name, e.g. `example`, joined by underscores, i.e. `p1000_db1_example`.

Also add an `annotation.txt` file to the folder. The annotation file should contain a single line formatted like a _fasta_ entry header with the following content: `aa|p_| `.

Example: `AA|p1000_db1_example|20180119_Example https://github.com/protViz/prozor`

The package provides an example of such a folder with the fasta files. Based on this folder a database can be created.

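
The folder layout described above can be sketched in base R. This is only an illustration under stated assumptions, not part of prozor: the fasta file name `example.fasta`, its single made-up entry, and the use of a temporary directory are hypothetical. The annotation line is the example given in the text.

```r
# Build the database folder in a temporary directory (assumption: a real
# database folder would live in a project directory instead).
db_dir <- file.path(tempdir(), "p1000_db1_example")
dir.create(db_dir, showWarnings = FALSE, recursive = TRUE)

# A tiny made-up fasta file; the header and sequence are hypothetical.
writeLines(
  c(">sp|P00001|EXAMPLE_PROT made-up entry", "MKTAYIAKQR"),
  file.path(db_dir, "example.fasta")
)

# annotation.txt holds exactly one line, formatted like a fasta entry header.
writeLines(
  "AA|p1000_db1_example|20180119_Example https://github.com/protViz/prozor",
  file.path(db_dir, "annotation.txt")
)

# The folder name doubles as the database name.
db_name <- basename(db_dir)
annotation_line <- readLines(file.path(db_dir, "annotation.txt"))
dir(db_dir)
```

The remainder of this vignette uses the example folder shipped with the package rather than this temporary one.
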
```{r}
# Locate the example database folder shipped with the package
databasedirectory <- system.file("p1000_db1_example", package = "prozor")
#databasedirectory <- file.path(find.package("prozor"), "p1000_db1_example")
dbname <- basename(databasedirectory)

# Collect the fasta files in the folder
fasta <- grep("fasta", dir(databasedirectory), value = TRUE)
files1 <- file.path(databasedirectory, fasta)

# Read the single-line annotation
annot <- grep("annotation", dir(databasedirectory), value = TRUE)
annotation <- readLines(file.path(databasedirectory, annot))
annotation
```

## Create non decoy database

```{r}
resDB <- createDecoyDB(files1,
                       useContaminants = loadContaminantsFasta2021(),
                       annot = annotation,
                       revLab = NULL)
length(resDB)
```

Based on the directory name, we build the name of the fasta file by appending the current date.

```{r}
dirname(databasedirectory)
xx <- file.path(dirname(databasedirectory),
                paste(dbname, "_", format(Sys.time(), "%Y%m%d"), ".fasta", sep = ""))
print(xx)
```

```{r eval=FALSE}
writeFasta(resDB, file = xx)
```

## Create decoy database

To add decoy entries based on reversed sequences, specify the `revLab` parameter in the `createDecoyDB` function. The resulting database will contain twice as many entries as the non-decoy database.

```{r}
resDBDecoy <- createDecoyDB(files1,
                            useContaminants = loadContaminantsFasta2021(),
                            annot = annotation,
                            revLab = "REV_")
resDBDecoy[[length(resDBDecoy) - 1]]
length(resDBDecoy)

# Count duplicated entry names and duplicated sequences
sum(duplicated(names(resDBDecoy)))
sum(duplicated(resDBDecoy))

# Insert a "d" after the project number to mark the decoy database
dbname_decoy <- unlist(strsplit(dbname, "_"))
dbname_decoy <- paste(c(dbname_decoy[1], "d", dbname_decoy[2:length(dbname_decoy)]),
                      collapse = "_")
dbname_decoy

xx <- file.path(dirname(databasedirectory),
                paste(dbname_decoy, "_", format(Sys.time(), "%Y%m%d"), ".fasta", sep = ""))
print(xx)
```

```{r eval=FALSE}
writeFasta(resDBDecoy, file = xx)
```

# Session Info

```{r}
sessionInfo()
```