---
title: Pedigree_constructor_details
author: Terry Therneau, Elizabeth Atkinson
date: '`r format(Sys.time(),"%d %B, %Y")`'
output:
  rmarkdown::html_vignette:
    toc: yes
    toc_depth: 2
header-includes: \usepackage{tabularx}
vignette: |
  %\VignetteIndexEntry{Pedigree_constructor_details}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

Introduction
===============

The pedigree routines came out of a simple need -- to quickly draw a
pedigree structure on the screen, within R, that was ``good enough'' to
help with debugging the actual routines of interest, which were those for
fitting mixed effecs Cox models to large family data.  As such the routine
had compactness and automation as primary goals; complete annotation
(monozygous twins, multiple types of affected status) and most certainly
elegance were not on the list.  Other software could do that much
better.
        
It therefore came as a major surprise when these routines proved useful
to others.  Through their constant feedback, application to more
complex pedigrees, and ongoing requests for one more feature, the routine has 
become what it is today.  This routine is still not 
suitable for really large pedigrees, nor for heavily inbred ones such as in
animal studies, and will likely not evolve in that way.  The authors fondest
hope is that others will pick up the project.
        
Pedigree Constructor
========================
The pedigree function is the first step, creating an object of class pedigree.  
It accepts the following input
\begin{description}
\item[id] A numeric or character vector of subject identifiers.
\item[dadid] The identifier of the father.
\item[momid] The identifier of the mother.
\item[sex] The gender of the individual.  This can be a numeric variable
with codes of 1=male, 2=female, 3=unknown, 4=terminated, or NA=unknown.
A character or factor variable can also be supplied containing
the above; the string may be truncated and of arbitrary case.  A sex
value of 0=male 1=female is also accepted.
\item[status] Optional, a numeric variable with 0 = censored and 1 = dead.
\item[relationship] Optional, a matrix or data frame with three columns.
The first two contain the identifier values of the subject pairs, and
the third the code for their relationship: 1 = Monozygotic twin, 2=Dizygotic twin,
3= Twin of unknown zygosity, 4 = Spouse.  
\item[famid] Optional, a numeric or character vector of family identifiers.
\end{description}
        
The [[famid]] variable is placed last as it was a later addition to the
code; thus prior invocations of the function that use positional 
arguments will not be affected.                                      
If present, this allows a set of pedigrees to be generated, one per
family.  The resultant structure will be an object of class
[[pedigreeList]].
        
Note that a factor variable is not listed as one of the choices for the
subject identifier. This is on purpose.  Factors
were designed to accomodate character strings whose values came from a limited
class -- things like race or gender, and are not appropriate for a subject
identifier.  All of their special properties as compared to a character
variable turn out to be backwards for this case, in particular a memory
of the original level set when subscripting is done.

However, due to the awful decision early on in S to automatically turn every
character into a factor --- unless you stood at the door with a club to
head the package off --- most users have become ingrained to the idea of
using them for every character variable. 
(I encourage you to set the global option stringsAsFactors=FALSE to turn
            off autoconversion -- it will measurably improve your R experience).
Therefore, to avoid unnecessary hassle for our users 
the code will accept a factor as input for the id variables, but
the final structure does not retain it.  
Gender and relation do become factors.  Status follows the pattern of the 
survival routines and remains an integer.

We will describe the code in a set of blocks.

## Data Checks and Errors

### Errors1
The code starts out with some checks on the input data.  
Is it all the same length, are the codes legal, etc. Checks for ids being non-missing,
and for sex to be as expected of the codes 1-4 for female/male/unknown/terminated.
          
### Errors2
Create the variables descibing a missing father and/or mother,
which is what we expect both for people at the top of the
pedigree and for marry-ins, \emph{before} adding in the family
id information.  It is easier to do it first.
If there are multiple families in the pedigree, make a working set of
identifiers that are of the form `family/subject'.
Family identifiers can be factor, character, or numeric.

### Errors3-Parents

Next check that any mother or father identifiers are found in the identifier
list, and are of the right sex.
Subjects who don't have a mother or father are founders.  For those people 
both of the parents should be missing.

## Creation of Pedigrees
Now, paste the parts together into a basic pedigree.
The fields for father and mother are not the identifiers of
the parents, but their row number in the structure.

## Finish Object                  

The final structure will be in the order of the original data, and all the
components except [[relation]] will have the same number of rows as the original data.
                
                
Subscipting
==================
                
Subscripting of a pedigree list extracts one or more families from the
list.  We treat character subscripts in the same way that dimnames on
a matrix are used.  Factors are a problem though: assume that we
have a vector x with names ``joe'', ``charlie'', ``fred'', then
[[x['joe']]] is the first element of the vector, but
[[temp <- factor('joe', 'charlie', 'fred'); z <- temp[1]; x[z] ]] will
be the second element! 
R is implicitly using as.numeric on factors when they are a subscript;
this caught an early version of the code when an element of a data
frame was used to index the pedigree: characters are turned into factors
when bundled into a data frame.
                
              Note:
                  \begin{enumerate}
                \item What should we do if the family id is a numeric: when the user
                says [4] do they mean the fourth family in the list or family '4'?
                  The user is responsible to say ['4'] in this case.
                \item  In a normal vector invalid subscripts give an NA, e.g. (1:3)[6], but
                since there is no such object as an ``NA pedigree'', we emit an error
                for this.
                \item The [[drop]] argument has no meaning for pedigrees, but must to be
                a defined argument of any subscript method; we simply ignore it.
                \item Updating the father/mother is a minor nuisance;
                since they must are integer indices to rows they must be
                recreated after selection.  Ditto for the relationship matrix.  
                \end{enumerate}
                  For a pedigree, the subscript operator extracts a subset of individuals.
                We disallow selections that retain only 1 of a subject's parents, since    %'
                they cause plotting trouble later.
                Relations are worth keeping only if both parties in the relation were
                selected.

As Data.Frame for Pedigree
=============================
                
                Convert the pedigree to a data.frame so it is easy to view when removing or
                trimming individuals with their various indicators.  
                The relation and hints elements of the pedigree object are not easy to
                put in a data.frame with one entry per subject. These items are one entry 
                per subject, so are put in the returned data.frame:  id, findex, mindex, 
                sex, affected, status.  The findex and mindex are converted to the actual id
                of the parents rather than the index.
                
                Can be used with as.data.frame(ped) or data.frame(ped). Specify in Namespace
                file that it is an S3 method.
                
                
                This function is useful for checking the pedigree object with the
                $findex$ and $mindex$ vector instead of them replaced with the ids of 
                the parents.  This is not currently included in the package.

Printing Pedigree
===================
                It usually doesn't make sense to print a pedigree, since the id is just   %'
                a repeat of the input data and the family connections are pointers.
                Thus we create a simple summary.