---
title: "Translate variables"
date: "`r Sys.Date()`"
author: "Adrian Maldet"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Translate variables}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

In the following, we explain how to use a lama-dictionary
(see [Creating lama-dictionaries]) to translate data frame variables,
atomic vectors or factors.
The main functions are:

* `lama_translate()` and `lama_translate_()`: Assign new labels to the values
  of a variable and, if `to_factor = TRUE`, turn the result into a factor
  with ordered levels.
* `lama_translate_all()`: Apply `lama_translate()` on all possible columns
  of a data frame, if there are corresponding translations.
* `lama_to_factor()` and `lama_to_factor_()`: Similar to
  `lama_translate()` and `lama_translate_()`, but for variables that already
  hold the right labels (character or factor) and only need to be turned into
  factor variables with the factor levels given in the corresponding
  translations.
* `lama_to_factor_all()`: Apply `lama_to_factor()` on all possible columns
  of a data frame, if there are corresponding translations.

## The example data frame and lama-dictionary

Let `df` be a data frame with the following structure:
```{r}
df <- data.frame(
  pupil_id = rep(1:4, each = 3),
  subject = rep(c("eng", "mat", "gym"), 4),
  level = factor(
    c("a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a"),
    levels = c("a", "b")
  ),
  result = c(1, 2, 2, NA, 2, NA, 1, 0, 1, 2, 3, NA),
  stringsAsFactors = FALSE
)
df
```

The column `subject` (**character**) contains the subject codes and the column
`level` (**factor**) holds the level of the courses (`b` for basic and `a` for
advanced) the pupils were tested in. The column `result` (**numeric**) contains
the test results (`1` and `2` are positive, `3` and `4` are negative, `NA` means
that the pupil missed the test and `0` means that something else went wrong).

We want to use the following lama-dictionary in order to translate the data
frame variables:
```{r}
library(labelmachine)
dict <- new_lama_dictionary(
  sub = c(eng = "English", mat = "Mathematics", gym = "Gymnastics"),
  lev = c(b = "Basic", a = "Advanced"),
  result = c(
    "1" = "Good",
    "2" = "Passed",
    "3" = "Not passed",
    "4" = "Not passed",
    NA_ = "Missed",
    "0" = NA
  )
)
dict
```
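
The two special entries in the `result` translation can be illustrated with a
minimal base R sketch of the lookup. This is not the package's implementation,
and the names `result_map_demo`, `codes_demo`, `keys_demo` and `labels_demo`
are made up for this illustration. The sketch assumes that the key `NA_` stands
for a missing original value and that the entry `"0" = NA` turns the value `0`
into a missing label:

```{r}
# Named character vector mirroring the `result` translation (illustrative copy)
result_map_demo <- c(
  "1" = "Good", "2" = "Passed", "3" = "Not passed", "4" = "Not passed",
  NA_ = "Missed", "0" = NA
)
codes_demo <- c(1, 2, NA, 0, 3)
# Missing values are looked up under the key "NA_", all others under their text
keys_demo <- ifelse(is.na(codes_demo), "NA_", as.character(codes_demo))
labels_demo <- unname(result_map_demo[keys_demo])
labels_demo
```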

## Translate using non-standard evaluation

The function `lama_translate()` uses non-standard evaluation: we pass in
unquoted expressions, so the column and translation names do not need to be
wrapped in quotes:

```{r}
df_new <- lama_translate(
  .data = df,
  dictionary = dict,
  subject_new = sub(subject),
  level = lev(level),
  result = result(result),
  keep_order = c(FALSE, TRUE, FALSE),
  to_factor = c(TRUE, TRUE, FALSE)
)
str(df_new)
```

The arguments `.data` and `dictionary` define which data frame should be
translated and which lama-dictionary should be used for the translation.
The argument `keep_order` defines for each given translation whether the
original ordering of the variable should be kept (ordering of the variable
in the data frame `df`) or whether the ordering given in the translation should
be used. The argument `to_factor` defines for each translation whether the
resulting labeled variable should be a factor variable (`to_factor = TRUE`) or a
plain character variable (`to_factor = FALSE`).
Besides the arguments `.data`, `dictionary`, `keep_order` and `to_factor`, all
other arguments are label assignments. The name of each argument
(left hand side of the assignment) defines the column name under which the
labeled variable should be stored. The right hand side
of the assignment defines the column which should be labeled
(name inside the brackets) and the translation which should be used
(function name to the left of the brackets).
Hence, the statement above does the following things:

* `subject_new = sub(subject)`: The column `subject` in the data frame `df` is
  translated using the translation `sub` and the resulting factor is stored
  under the column name `subject_new`. Since the first entry in `keep_order` is
  `FALSE`, the ordering given in the translation `sub` is used for the labels.
  Since the first entry in `to_factor` is `TRUE`, the resulting variable is a
  factor variable.
* `level = lev(level)`: The column `level` in the data frame `df` is translated
  using the translation `lev` and then overwritten by the resulting factor.
  Since the second entry in `keep_order` is `TRUE`, the labeled variable
  has the same ordering as the original column. Since the second entry in
  `to_factor` is `TRUE`, the resulting variable is a factor variable.
* `result = result(result)`: The column `result` in the data frame `df` is
  translated using the translation `result` and then overwritten by the
  resulting labeled variable. Since the third entry in `keep_order` is
  `FALSE`, the ordering given in the translation is used for the labels.
  Since the third entry in `to_factor` is `FALSE`, the resulting variable is a
  plain character variable.
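
The effect of `keep_order` on a factor column can be sketched in base R.
The following is only an illustration of the two level orderings described
above, not the code used by the package; all names ending in `_demo` are
hypothetical:

```{r}
lev_map_demo <- c(b = "Basic", a = "Advanced")
level_demo <- factor(c("a", "a", "b"), levels = c("a", "b"))
# Look up the label of every value
level_labels_demo <- unname(lev_map_demo[as.character(level_demo)])
# keep_order = FALSE: the levels follow the order of the translation
by_translation_demo <- factor(level_labels_demo, levels = unname(lev_map_demo))
# keep_order = TRUE: the levels follow the order of the original factor levels
by_original_demo <- factor(
  level_labels_demo,
  levels = unname(lev_map_demo[levels(level_demo)])
)
levels(by_translation_demo)
levels(by_original_demo)
```

Both factors hold the same labels; only the order of their levels differs.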

There are several abbreviations that save some typing:

* If the translation has the same name as the original column name, then
  it is sufficient to just write the translation name on the right hand side.
  E.g. `result_new = result` is the same as `result_new = result(result)`.
* If the column name under which the labeled variable should be stored is the
  same as the original column name, then
  the left hand side of the assignment can be omitted.
  E.g. `lev(level)` is the same as `level = lev(level)`.
* If the names of the translation, of the original column and of the new column
  are equal, then only the name of the translation is needed.
  E.g. `result` is the same as `result = result(result)`.
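
As a quick check of these abbreviations, the following sketch translates a small
data frame once with the full notation and once with the short notation and
compares the results. The objects `df_abbr_demo`, `dict_abbr_demo`, `full_demo`
and `short_demo` are made up for this illustration; the default values of
`keep_order` and `to_factor` are used in both calls:

```{r}
library(labelmachine)
df_abbr_demo <- data.frame(
  level = factor(c("a", "b", "b"), levels = c("a", "b")),
  result = c(1, 2, NA),
  stringsAsFactors = FALSE
)
dict_abbr_demo <- new_lama_dictionary(
  lev = c(b = "Basic", a = "Advanced"),
  result = c("1" = "Good", "2" = "Passed", NA_ = "Missed")
)
# Full notation: new column name, translation name and original column name
full_demo <- lama_translate(
  df_abbr_demo, dict_abbr_demo,
  level = lev(level),
  result = result(result)
)
# Abbreviated notation for the same two assignments
short_demo <- lama_translate(df_abbr_demo, dict_abbr_demo, lev(level), result)
identical(full_demo, short_demo)
```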

## Translate using standard evaluation

The function `lama_translate_()` is the standard evaluation variant of 
`lama_translate()`, which means that instead of expressions, we pass in 
character strings holding the names of the translations and columns we want
to use:

```{r}
df_new <- lama_translate_(
  .data = df,
  dictionary = dict,
  translation = c("sub", "lev", "result"),
  col = c("subject", "level", "result"),
  col_new = c("subject_new", "level", "result"),
  keep_order = c(FALSE, TRUE, FALSE),
  to_factor = c(TRUE, TRUE, FALSE)
)
str(df_new)
```

The arguments `.data`, `dictionary`, `keep_order` and `to_factor` have the same
meaning as in `lama_translate()`. The character vectors `translation`, `col`
and `col_new` are matched by position: the translation named in the i-th entry
of `translation` is applied to the column named in the i-th entry of `col` and
the labeled variable is stored under the name given in the i-th entry of
`col_new`. The result is the same as before, when we used `lama_translate()`.

## Translate all possible variables

The function `lama_translate_all()` is an extension of `lama_translate()`,
which tries to automatically translate as many columns in the 
data frame `.data` as possible.
Therefore, the names of the columns which should be translated must match
the names of the translations which should be used:

```{r}
df_new <- lama_translate_all(
  .data = df,
  dictionary = dict,
  prefix = "new_",
  fn_colname = toupper,
  suffix = "_labeled",
  keep_order = TRUE
)
str(df_new)
```

In the above example, only the column name `result` matches a translation
name. This column is therefore translated and stored under the column name
`new_RESULT_labeled`. The name of each new column is derived from
the old column name (e.g. `result`) by prepending the string given in the
argument `prefix` and appending the string given in the argument `suffix`.
Before this string concatenation, the name of the original
column can be transformed into another string by the string
transformation function `fn_colname`. In our case, `fn_colname` is
the function `toupper`, which transforms all letters of the column name `result`
to upper case (`RESULT`).
Contrary to `lama_translate()`, the argument `keep_order` is just a single
boolean flag. It defines for all translated columns whether the original
ordering of the variable should be kept (`keep_order = TRUE`)
or whether the ordering given in the translation should be used.
As with `lama_translate()`, it is possible to pass the
argument `to_factor = FALSE` to
`lama_translate_all()` in order to store all resulting labeled variables
as plain character vectors.
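
The naming rule described above can be written down as a short base R sketch.
The helper `build_colname_demo()` and its default values are hypothetical and
only serve to illustrate the rule; they are not part of the package:

```{r}
# Hypothetical helper: transform the column name, then add prefix and suffix
build_colname_demo <- function(col, prefix = "", suffix = "",
                               fn_colname = identity) {
  paste0(prefix, fn_colname(col), suffix)
}
new_name_demo <- build_colname_demo(
  "result",
  prefix = "new_",
  suffix = "_labeled",
  fn_colname = toupper
)
new_name_demo
```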

## Translate vectors

So far, we only translated variables in data frames, but it is also possible
to use `lama_translate()` and `lama_translate_()` in 
order to translate atomic vectors (character, logical, numeric) and factors.

Using `lama_translate()`:
```{r}
vec <- c("eng", "eng", "gym", "mat")
vec_labeled <- lama_translate(vec, dict, sub)
```

Using `lama_translate_()`:
```{r}
vec_labeled <- lama_translate_(vec, dict, "sub")
```
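
Both variants should yield the same labeled vector. The following sketch checks
this on a small example; the objects `dict_vec_demo`, `codes_vec_demo`,
`nse_demo` and `se_demo` are made up for this illustration and the default
values of `keep_order` and `to_factor` are used:

```{r}
library(labelmachine)
dict_vec_demo <- new_lama_dictionary(
  sub = c(eng = "English", mat = "Mathematics", gym = "Gymnastics")
)
codes_vec_demo <- c("eng", "eng", "gym", "mat")
# Non-standard evaluation: unquoted translation name
nse_demo <- lama_translate(codes_vec_demo, dict_vec_demo, sub)
# Standard evaluation: translation name as character string
se_demo <- lama_translate_(codes_vec_demo, dict_vec_demo, "sub")
identical(nse_demo, se_demo)
as.character(nse_demo)
```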

## Turn labeled variables into factors

Sometimes, you already have labeled variables (character or factor variables,
maybe produced by `lama_translate()` with argument `to_factor = FALSE`)
and you want to turn them into factor variables with a desired ordering.
In this case, the functions `lama_to_factor()`, `lama_to_factor_()` and
`lama_to_factor_all()` are the right choices.

Let `df_non_factor` be a data frame holding the right labels, but no factor
variables (created with `lama_translate_all()` using `to_factor = FALSE`):
```{r}
dict_new <- lama_rename(dict, subject = sub, level = lev)
df_non_factor <- lama_translate_all(df, dict_new, to_factor = FALSE)
str(df_non_factor)
```

Turning variables into factors with `lama_to_factor()`:
```{r}
df_factor <- lama_to_factor(
  .data = df_non_factor,
  dictionary = dict,
  subject_new = sub(subject),
  level = lev(level),
  result = result(result)
)
str(df_factor)
```
The function `lama_to_factor()` allows the same abbreviations as `lama_translate()`.
It can also be used on factor variables and there is also a `keep_order`
argument like in the case of `lama_translate()`. Furthermore,
the functions `lama_to_factor()` and `lama_to_factor_()` can both be applied
to atomic vectors or plain factors like in the case of `lama_translate()`.
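
Conceptually, turning a labeled character vector into a factor means deriving
the factor levels from the labels of the translation. A minimal base R sketch
of this idea, assuming that duplicated labels (here `"Not passed"`) collapse
into a single level and that missing labels do not become a level, could look
as follows. It is not the package's implementation and all names ending in
`_demo` are hypothetical:

```{r}
result_map_tf_demo <- c(
  "1" = "Good", "2" = "Passed", "3" = "Not passed", "4" = "Not passed",
  NA_ = "Missed", "0" = NA
)
labels_tf_demo <- c("Passed", "Good", "Missed", NA, "Not passed")
# Unique non-missing labels in the order in which the translation lists them
level_order_demo <- unique(
  unname(result_map_tf_demo[!is.na(result_map_tf_demo)])
)
result_factor_demo <- factor(labels_tf_demo, levels = level_order_demo)
levels(result_factor_demo)
```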

Turning variables in a data frame into factors with `lama_to_factor_()`:
```{r}
df_factor <- lama_to_factor_(
  .data = df_non_factor,
  dictionary = dict,
  translation = c("sub", "lev", "result"),
  col = c("subject", "level", "result")
)
str(df_factor)
```
Since the argument `col_new` was omitted, the columns `subject`,
`level` and `result` were overwritten.

Turning all possible variables in a data frame into factors
with `lama_to_factor_all()`:
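```{r}
df_factor <- lama_to_factor_all(
  .data = df_non_factor,
  dictionary = dict_new
)
str(df_factor)
```
Here we use `dict_new`, whose translation names match the column names. With
`dict`, only the column `result` would have a matching translation.
Since the arguments `prefix`, `suffix` and `fn_colname` were omitted, the
columns `subject`, `level` and `result` were overwritten.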

## Further reading

* [Creating lama-dictionaries]
* [Altering lama-dictionaries]
* [Get started]

[Get started]: https://a-maldet.github.io/labelmachine/articles/labelmachine.html
[Creating lama-dictionaries]: https://a-maldet.github.io/labelmachine/articles/create_dictionaries.html
[Altering lama-dictionaries]: https://a-maldet.github.io/labelmachine/articles/alter_dictionaries.html
[Translating variables]: https://a-maldet.github.io/labelmachine/articles/translate.html

