\documentclass[a4paper]{article}
%\VignetteIndexEntry{Reading Files}
%\VignettePackage{pegas}
\usepackage{ape}

\newcommand{\loci}{\code{"loci"}}
\newcommand{\genind}{\code{"genind"}}
\newcommand{\NA}{\code{NA}}
\newcommand{\pegas}{\pkg{pegas}}
\newcommand{\adegenet}{\pkg{adegenet}}
\newcommand{\gdata}{\pkg{gdata}}
\newcommand{\poppr}{\pkg{poppr}}

\newcommand{\genetix}{\textsc{Genetix}}
\newcommand{\fstat}{\textsc{Fstat}}
\newcommand{\genepop}{\textsc{Genepop}}
\newcommand{\struc}{\textsc{Structure}}

\author{Emmanuel Paradis}
\title{Reading Genetic Data Files Into R with \adegenet\ and \pegas}

\begin{document}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{formatcom=\color{darkblue}}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{formatcom=\color{black}\vspace{-1.5em}}

\maketitle\vspace*{1pc}\hrule

\vspace{1cm}

<<echo=false,quiet=true>>=
options(width = 80, prompt = "> ")
@

\vspace{1cm}

\adegenet\ \cite{Jombart2008} and \pegas\ \cite{Paradis2010} are two
packages that complement each other for population genetic analyses in
R. Since various computer programs have been used by population
geneticists, most of them with their own data file formats, it is
crucial that R can read them to ease users to switch to R. The present
document explains how to read several file formats commonly used in
population genetics. The following formats are considered:

\begin{itemize}
\item Text (ASCII) in tabular form,
\item VCF files,
\item \fstat\ data format,
\item \genetix\ data format,
\item \genepop\ data format,
\item \struc\ data format,
\item Excel.
\end{itemize}

Except the last one, these files are stored in text (usually standard
ASCII) format the differences being in the layout of the data.

\section{Data Structures in \adegenet\ and \pegas}

First, let's have a brief look on the way allelic data are stored by
our two packages. \adegenet\ has the class \code{"genind"} where
individuals are rows and alleles are columns in a matrix named
\code{tab}. This is an S4 class, so the elements are accessed with the
\code{@} operator (e.g., \code{x@tab}). Additional information are
stored in other slots (\code{@ind.names}, \code{@pop}, \dots) The
details can be found with \code{class?genind}.

\pegas\ has the class \code{"loci"} which is a simple data frame with
a mandatory attribute named \code{"locicol"} which identifies the
columns that are loci; the other columns are additional (individual)
variables that may be of any kind. The loci are coded with factors
where the different levels are the observed genotypes with the alleles
separated with a forward slash, for instance, `A/A' for a classical
genotype, or `132/148' for a microsatellite locus. This is an S3
class.\footnote{For details:
  emmanuelparadis.github.io/pegas/DefinitionDataClassesPegas.pdf.}
Some examples are given in the next subsection.

With version 0.6, \pegas\ supports phased and unphased genotypes (A|A
and A/A, respectively). In unphased genotypes, the alleles are sorted
with uppercase first, and then in alphabetical order, so that a/A is
changed into A/a (even if the latter was not observed in the original
data file).

There is no need to choose between these two data structures: they
are used by each package and they can be converted between each other
with the functions \code{genind2loci} (or the generic form
\code{as.loci}) and \code{loci2genind}. Therefore, it is
straightforward to run analyses with both packages on the same data.

\section{Reading Genetic Data Files}

\subsection{Reading Text Tabular Files}\label{sec:tab}

It is intuitive to organise allelic data in a tabular form where each
row is an individual and the columns are loci and possibly other
variables such as population, spatial locations, and so on. A simple
example of such a file would be (file `toto'):\footnote{File contents
  are printed in blue to distinguish them from R input/output.}

\color{blue}
\begin{verbatim}
a/a
\end{verbatim}
\color{black}
This file can be read with the \pegas\ function \code{read.loci}:

<<>>=
library(pegas)
x <- read.loci("toto", header = FALSE)
x
print(x, details = TRUE)
class(x)
@
Since the file has no label for the column, we used \code{header =
  FALSE}. Note that printing a \code{"loci"} object displays a very
brief account of the data; the option \code{details = TRUE} allows to
print them like a standard data frame.\footnote{The function
  \code{View} can also be used: this will use the same spreadsheet
  interface than \code{fix} but the data cannot be edited (see
  below).} If the same data were formatted with a different allele
separator (file `titi'):

\color{blue}
\begin{verbatim}
a-a
\end{verbatim}
\color{black}
Then this file would be read with:

<<>>=
y <- read.loci("titi", header = FALSE, allele.sep = "-")
print(y, details = TRUE)
identical(x, y)
@

Let us have a look on the different options of \code{read.loci}:

<<>>=
args(read.loci)
@
We already know \code{file}, \code{header}, and \code{allele.sep}.
Note the default value for this last option which specifies that the
allele separator can be either a slash or a vertical bar.
\code{loci.sep} is the separator of the columns
(not only the loci) which is one or several spaces by default (use
\code{sep = "\textbackslash t"} if a tabulation). \code{col.pop} must
be an
integer giving the index of the column that will be labelled
``population'' in the returned data; by default there is
none. \code{col.loci} does the same for the loci; by default all
columns are treated as loci except if \code{col.pop} is used. Finally
`\code{\dots}' may be any further (named) arguments passed to
\code{read.table} (e.g., \code{skip} in case there are comments at the
top of the file).

Any level of ploidy is accepted and \pegas\ checks the order of the
alleles (see above). For instance the file `tutu' is:

\color{blue}
\begin{verbatim}
a/a/A
A/a/a
\end{verbatim}
\color{black}

<<>>=
print(read.loci("tutu", FALSE), TRUE)
@

Phased and unphased genotypes can be mixed in a file. For instance the
file `tyty' is:

\color{blue}
\begin{verbatim}
Loc1
A/a
a/A
A|a
a|A
\end{verbatim}
\color{black}

<<>>=
X <- read.loci("tyty")
print(X, TRUE)
summary(X)
@

A more realistic example with four columns---an allozyme locus, a
microsat locus, a population assignment, and a phenotypic
variable---might be (file `tata'):

\color{blue}
\begin{verbatim}
        Adh2    SSR1    pop     size
IndA1   A/A     100/200 A       2.3
IndA2   A/a     100/120 A       2.5
IndB1   A/A     100/100 B       2.1
IndB2   a/a     120/120 B       2.8
\end{verbatim}
\color{black}
which will be read with:

<<>>=
z <- read.loci("tata", loci.sep = "\t", col.loci = 2:3, col.pop = 4, row.names = 1)
z
print(z, details = TRUE)
@
Note \code{row.names} which is passed with the `\code{\dots}' argument.
To make sure that only the first and the second columns are treated as
loci, let us extract the alleles from this data set:

<<>>=
getAlleles(z)
@
We may check that the attribute \code{"locicol"} has been set
correctly, but usually the user does not need:

<<>>=
attr(z, "locicol")
@
Finally we display the internal structure of the data to see that the
additional variables are treated as they should be:

<<>>=
str(z)
@

\subsection{Reading VCF Files}

Starting with version 0.6, \pegas\ can read VCF files with the
function \code{read.vcf}. Version 0.8 has a completely rewritten code:

<<>>=
args(read.vcf)
@
By default, the first 10,000 loci are read. The option
\code{which.loci} is an alternative way to specify which loci to read
in the file. For instance, the following is the same than the default
(the arguments \code{from} and \code{to} are ignored here):

\begin{verbatim}
read.vcf(file, which.loci = 1:1e4)
\end{verbatim}
In practice, the numbers passed to this option will be obtained from
additional functions which query information from VCF files (see
\code{?VCFloci} for more information). \code{read.vcf} returns an
object of class \code{"loci"}.

\subsection{Importing \fstat, \genetix, \genepop, and \struc\ Data Files}

These four programs have their own data format. Roughly, these formats
have the same idea: they store the genotypes of individuals from
different populations. So, they store genotypes at several loci and an
individual categorical variable. Additionally, the \genetix\ and
\struc\ formats allow for individual labels.

\adegenet\ includes four data files in each of these formats of the
same microsatellite data set. These files can be displayed in the R
console with:

\begin{Schunk}
\begin{Sinput}
> file.show(system.file("files/nancycats.dat", package="adegenet"))
> file.show(system.file("files/nancycats.gtx", package="adegenet"))
> file.show(system.file("files/nancycats.gen", package="adegenet"))
> file.show(system.file("files/nancycats.str", package="adegenet"))
\end{Sinput}
\end{Schunk}
If you want to copy these files into the working directory to further
display or edit them with your favourite editor, use these commands:

\begin{Schunk}
\begin{Sinput}
> file.copy(system.file("files/nancycats.dat", package = "adegenet"), getwd())
\end{Sinput}
\begin{Soutput}
[1] TRUE
\end{Soutput}
\begin{Sinput}
> file.copy(system.file("files/nancycats.gtx", package = "adegenet"), getwd())
\end{Sinput}
\begin{Soutput}
[1] TRUE
\end{Soutput}
\begin{Sinput}
> file.copy(system.file("files/nancycats.gen", package = "adegenet"), getwd())
\end{Sinput}
\begin{Soutput}
[1] TRUE
\end{Soutput}
\begin{Sinput}
> file.copy(system.file("files/nancycats.str", package = "adegenet"), getwd())
\end{Sinput}
\begin{Soutput}
[1] TRUE
\end{Soutput}
\end{Schunk}

\adegenet\ provides four functions to read these formats. Reading the
first three formats is straightforward:

\begin{Schunk}
\begin{Sinput}
> A <- read.fstat("nancycats.dat", quiet = TRUE)
> B <- read.genetix("nancycats.gtx", quiet = TRUE)
> C <- read.genepop("nancycats.gen", quiet = TRUE)
\end{Sinput}
\end{Schunk}

Reading a \struc\ file is slightly more complicated because the
function needs some exta information. This can be done interactively
(the default), or by specifying the appropriate options in which case
we will use \code{ask = FALSE}:

\begin{Schunk}
\begin{Sinput}
> D <- read.structure("nancycats.str", onerowperind=FALSE, n.ind=237, n.loc=9, col.lab=1, col.pop=2, ask=FALSE, quiet=TRUE)
\end{Sinput}
\end{Schunk}

All four data sets are identical (we only compare the \code{tab} slots):

\begin{Schunk}
\begin{Sinput}
> identical(A@tab, C@tab)
\end{Sinput}
\begin{Soutput}
[1] TRUE
\end{Soutput}
\begin{Sinput}
> identical(B@tab, D@tab)
\end{Sinput}
\begin{Soutput}
[1] TRUE
\end{Soutput}
\end{Schunk}

Once the data have been read into R, they can be analysed with
\adegenet\ or with \pegas\ after eventually converting them with
\code{as.loci}. We now delete the data files:

\begin{Schunk}
\begin{Sinput}
> unlink(c("nancycats.dat", "nancycats.gtx", "nancycats.gen", "nancycats.str"))
\end{Sinput}
\end{Schunk}

Finally, \pegas\ has the function \code{read.gtx} to read a \genetix\
data file and return an object of class \loci. This function has no option.

\subsection{Importing Excel Files}

Excel is widely used for trivial data management, but clearly these
data must be
exported to other programs for most analyses. This also applies to the
free spreadsheet
editors such as OpenOffice's Calc or Gnumeric. Several solutions to
get such data into R are given below. I assume that the allelic data
in the spreadsheet are in a tabular form similar to what we have seen
in Section \ref{sec:tab}, so the objective is to have them in R as a
\code{"loci"} object.

\begin{enumerate}
\item The simplest solution is to save the spreadsheet as a text file
  using either the tab-delimited or comma-separated-variable (csv)
  format. This can be done with any spreadsheet editor since Calc or
  Gnumeric can import Excel files. Once the text file is created,
  \code{read.loci} can be used with the option \code{loci.sep =
    "\textbackslash t"} or \code{loci.sep = ","}, as well as any other
  that may be needed.
\item If the ``Save as\dots'' solution does not work, it is possible
  to save a sheet, or part of it, in a text file by following these steps:
  \begin{enumerate}
  \item Open the file, again this may be done with any program.
  \item Select the cells you want to export; this can be done by
    clicking once on the top-left cell, and then clicking a second
    time on the bottom-right cell after pressing the Shift key (this
    could avoid you a tunnel syndrome and is much easier if many cells
    must be selected).
  \item Copy the selected cells in the clipboard (usually Ctrl-C).
  \item Open a text editor (do not use a word processor), paste the
    content of the clipboard (usually Ctrl-V), and
    save the file.
  \end{enumerate}
  The text file can now be read with \code{read.loci(..., loci.sep =
    "\textbackslash t")}.
\item If Perl is installed on your computer (this is true for almost all
  Linux distributions), you can use the function \code{read.xls} from the package
  \pkg{gdata} (available on CRAN) to read directly an Excel file into R
  (the Perl program actually does the same job than the user does
  manually in the ``Save as\dots'' solution above). By default the
  first sheet is used, but this can be changed with the \code{sheet}
  option. The returned object is a data frame and can be converted as
  a \code{"loci"} object with \code{as.loci}. In that case, the same
  options that in \code{read.loci} can be used (see \code{?as.loci}.).

  In my experience, \code{read.xls} works well with small to moderate
  size Excel files but can be very slow with bigger files ($>$ 10 MB).
\end{enumerate}

We also note the function \code{read.genealex} in the package \poppr\
\cite{Kamvar2014} which reads a Genalex file that has been exported
into csv format and returns an object of class \code{"genind"}.

\section{An Example From Dryad}

This section shows how the \code{jaguar} data set was prepared. This
data set, deliver with \pegas, was published by Haag et al.\
\cite{Haag2010}, and was deposited on
Dryad.\footnote{http://datadryad.org/resource/doi:10.5061/dryad.1884}
The main data file is in Excel format and can be accessed directly at
the locations indicated below so that it can be read remotely with
\gdata:

\begin{Schunk}
\begin{Sinput}
> f <- "http://datadryad.org/bitstream/handle/10255/dryad.1885/\
MicrosatelliteData.xls?sequence=1"
> library(gdata)
> x <- read.xls(f, row.names = 1)
\end{Sinput}
\begin{Soutput}
essai de l'URL 'http://datadryad.org/bitstream/handle/10255/\
dryad.1885/MicrosatelliteData.xls?sequence=1'
Content type 'application/vnd.ms-excel;charset=ISO-8859-1'\
length 29184 bytes (28 KB)
==================================================
downloaded 28 KB
\end{Soutput}
\end{Schunk}

The object \code{x} is a data frame with the row names set correctly
with the line identifiers of the original file since we have used the
option \code{row.names = 1}. In this data frame each column is an
allele so that two columns are used to represent the genotype at a
given locus. This is clearly not the format used by the class
\code{"loci"}. Furthermore, some rows indicating the populations have
been inserted with missing values (\NA) for all columns. Fortunately,
the rows with genotypes have no \NA, so it is easy to find the rows
with the population names, and drop them before transforming the data
frame with the function \code{alleles2loci}. This function has been
specially designed to transform such data sets. The commands are
relatively straightforward:

\begin{Schunk}
\begin{Sinput}
> s <- apply(x, 1, anyNA)
> y <- alleles2loci(x[!s, ])
\end{Sinput}
\end{Schunk}

We can now extract the population names and assign them to each
observation; this is slightly more complicated, but the logical is
based on the fact that the rows below a population name should be
assigned to it:\footnote{In practice, a \code{for} loop can be used:
  it would be less efficient but more intuitive and easier to read.}

\begin{Schunk}
\begin{Sinput}
> w <- which(s)
> n <- diff(c(w, nrow(x) + 1)) - 1
> pop <- factor(rep(1:4, n), labels = names(w))
> y$population <- pop
\end{Sinput}
\end{Schunk}

The data are now ready to be analysed in R. We can check that the row-
and colnames are correctly set with the labels from original file:

\begin{Schunk}
\begin{Sinput}
> y
\end{Sinput}
\begin{Soutput}
Allelic data frame: 59 individuals
                    13 loci
                    1 additional variable
\end{Soutput}
\begin{Sinput}
> dimnames(y)
\end{Sinput}
\begin{Soutput}
[[1]]
 [1] "bPon01"  "bPon02"  "bPon133" "bPon134" "bPon135"
 [6] "bPon140" "bPon137" "bPon139" "bPon138" "bPon136"
[11] "bPon141" "bPon143" "bPon142" "bPon124" "bPon366"
[16] "bPon12"  "bPon91"  "bPon04"  "bPon25"  "bPon48"
[21] "bPon49"  "bPon50"  "bPon51"  "bPon52"  "bPon53"
[26] "bPon54"  "bPon35"  "bPon46"  "bPon40"  "bPon41"
[31] "bPon47"  "bPon78"  "bPon36"  "bPon359" "bPon44"
[36] "bPon80"  "bPon03"  "bPon11"  "bPon15"  "bPon16"
[41] "bPon17"  "bPon18"  "bPon19"  "bPon20"  "bPon21"
[46] "bPon22"  "bPon23"  "bPon27"  "bPon29"  "bPon30"
[51] "bPon31"  "bPon32"  "bPon38"  "bPon45"  "bPon130"
[56] "bPon131" "bPon132" "bPon58"  "bPon24"

[[2]]
 [1] "FCA742"     "FCA723"     "FCA740"     "FCA441"
 [5] "FCA391"     "F98"        "F53"        "F124"
 [9] "F146"       "F85"        "F42"        "FCA453"
[13] "FCA741"     "population"
\end{Soutput}
\end{Schunk}

\section{Editing and Writing Genetic Data Files}

After the data have been read into R, they can be manipulated in the
standard way. This is straightforward for the class \code{"loci"}
since it is a direct extension of data frames. \pegas\ has a few
method functions adapted to \loci: \code{rbind}, \code{cbind}, and
the indexing operator \code{[}. Some other functions, such as
\code{subset}, \code{split}, \code{rownames}, \code{colnames}, or the
\code{\$} operator, can be used without problem since they respect
additional attributes. Others, such as \code{transform}, drop the
attributes and so will return a simple data frame.

\adegenet\ allows to edit an object of class \genind, but since
this is an
S4 class, the elements are accessed with the \code{@} operator. The
help page \code{?genind} describes them in
details. A few functions are provided to ease the manipulation of
\genind\ objects because setting all elements by hand may be
tedious: \code{seploc} splits the data with respect to each locus,
\code{seppop} does the same with respect to each population, and
\code{repool} allows to do the opposite operation.

It is also possible to select a part of a \genind\ object with the
usual indexing operator \code{[}, the indices will apply to the rows
and/or columns of the \code{@tab} slot, and the other slots will be
modified accordingly. Some care should be paid when using numerical
indexing on columns because something like \code{x[, 1]} will only select
one allele column, eventually excluding other alleles of the same
locus. It is safer to use the option \code{nloc} which specifies the
loci to be selected, and so will select the appropriate allele columns.
Further details are available in \code{?pop} as well as information on
some other functions.

In addition to standard data editing, \pegas\ allows to edit a \loci\
object with \code{edit} or \code{fix}. The command \code{edit(x)} opens the data
in R's spreadsheet-like data editor. Here are a few points about this procedure:

\begin{itemize}
\item It is possible to change the row and column labels
  (\code{rownames} and \code{colnames}).
\item It is possible to add new rows (individuals): if some columns
  are not filled they will be given \code{NA}.
\item You can add new genotypes and/or alleles to a locus column: the
  levels of the corresponding factor will be adjusted and a warning
  message will inform you of that.
\item New columns may be added, but they can only be numerical or
  character vectors.
\item Like most R functions, \code{edit} returns its results in the
  console if no assignment has been done, so you may prefer to call
  the editor with \code{x <- edit(x)} or \code{fix(x)}.

  If you forgot the assignment and don't want to lose all the
  changes you did, after you have closed the editor you can save the
  modified data with (only if you don't do any other operation after
  \code{edit}):
  \begin{verbatim}
  x <- .Last.value
  \end{verbatim}
\end{itemize}

\bibliographystyle{plain}
\bibliography{pegas}

\end{document}
