---
title: "Creating lama-dictionaries"
date: "`r Sys.Date()`"
author: "Adrian Maldet"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Creating lama-dictionaries}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>"
)
knitr::opts_hooks$set(out.maxwidth = function(options) {
  if (!knitr::is_html_output())
    return(options)
  options$out.extra <- sprintf(
    'style="max-width: %s; width: %s;"',
    options$out.maxwidth,
    options$out.width
  )
  options
})
```

As described in [Get started], we need **translations**, which tell
us which label should be assigned to each value of a variable.
These translations are collected in a **lama-dictionary** object.
With the `labelmachine` package, translations can be written as
named character vectors:
```{r}
subject_translation <- c(eng = "English", mat = "Mathematics", gym = "Gymnastics")
```
The names of the character vector items correspond to the original values
(here `"eng"`, `"mat"` and `"gym"`) and the item values correspond to the
new labels which should be used (here `"English"`, `"Mathematics"` and `"Gymnastics"`).

Lama-dictionaries always assume that the original variables are of type
**character**. If they are not, this makes no difference, because
`lama_translate()` automatically transforms original variables of other types
(**logical**, **numeric** or **factor**) into **character** variables before
the labels are looked up. Hence, `labelmachine` can handle variables of all
of these types.
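
The following base R chunk is a minimal sketch of this idea and not the
package's actual implementation: values of different types are coerced with
`as.character()`, and the resulting strings serve as lookup keys for a named
character vector. The object names `raw_numeric`, `raw_logical`, `raw_factor`
and `subject_labels` are made up for illustration.

```{r}
# Hypothetical raw values of different types
raw_numeric <- c(1, 2, 4)
raw_logical <- c(TRUE, FALSE)
raw_factor <- factor(c("mat", "eng"))

# as.character() turns every type into the strings used as lookup keys
as.character(raw_numeric)
as.character(raw_logical)
keys <- as.character(raw_factor)

# A named character vector then acts as a lookup table
subject_translation <- c(eng = "English", mat = "Mathematics", gym = "Gymnastics")
subject_labels <- unname(subject_translation[keys])
subject_labels
```

This is why the names of a translation are always written as strings, even
when the original variable is numeric or logical.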

## From scratch

With the command `new_lama_dictionary()` a lama-dictionary can be created
on the fly:

```{r}
library(labelmachine)
dict <- new_lama_dictionary(
  sub = subject_translation,
  res = c(
    "1" = "Good",
    "2" = "Passed",
    "3" = "Not passed",
    "4" = "Not passed",
    NA_ = "Missed",
    "0" = NA
  )
)
dict
```

Each named argument in `new_lama_dictionary()` defines a translation, and the
argument names are used as translation names
(here `"sub"` and `"res"`).
The first translation `sub` is a named character vector holding the full
length labels for the encoded subject values of a data set
(`"eng"` for `"English"`, `"mat"` for `"Mathematics"` and `"gym"`
for `"Gymnastics"`).
The second translation `res` holds the full length labels for the encoded test results.
In translation `res`, multiple values are mapped onto a single label string
(`3` and `4` are both mapped onto `"Not passed"`).
The expression `NA_` in translation `res` is used to escape the
missing value symbol `NA` (this is only necessary in the name fields of the
translation vectors).
Hence, the assignment `NA_ = "Missed"` defines that missing values
should be labeled with the string `"Missed"`.
The last expression `"0" = NA` in translation `res` specifies that
all zero values `0` should be mapped onto the missing value symbol `NA`.

```{r, out.maxwidth = "450px", out.width = "100%"}
knitr::include_graphics("createdictionary.png")
```
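
To see these rules in action, the following sketch applies the two
translations to a small data set with `lama_translate()`, using the
`new_column = translation(old_column)` syntax described in [Get started].
The data frame `df_exams` and the names `dict_exams`, `subject_lab` and
`result_lab` are hypothetical and only serve as an illustration.

```{r}
library(labelmachine)

# Same translations as above, rebuilt so that this chunk is self-contained
dict_exams <- new_lama_dictionary(
  sub = c(eng = "English", mat = "Mathematics", gym = "Gymnastics"),
  res = c(
    "1" = "Good", "2" = "Passed", "3" = "Not passed", "4" = "Not passed",
    NA_ = "Missed", "0" = NA
  )
)

# Hypothetical data set with encoded subjects and numeric results
df_exams <- data.frame(
  subject = c("eng", "mat", "gym", "mat", "eng"),
  result = c(1, 3, 4, NA, 0),
  stringsAsFactors = FALSE
)

# New column = translation(original column)
df_labeled <- lama_translate(
  df_exams,
  dict_exams,
  subject_lab = sub(subject),
  result_lab = res(result)
)
df_labeled
```

The rows with results `3` and `4` should share one label, while the rows with
`NA` and `0` show the effect of `NA_ = "Missed"` and `"0" = NA`.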


## From a list object holding translations

If we have a named list object holding translations (named character vectors),
then we can turn it into a lama-dictionary object with the command `new_lama_dictionary()`:

```{r}
obj <- list(
  sub = c(eng = "English", mat = "Mathematics", gym = "Gymnastics"),
  res = c(
    "1" = "Good",
    "2" = "Passed",
    "3" = "Not passed",
    "4" = "Not passed",
    NA_ = "Missed",
    "0" = NA
  )
)
dict <- new_lama_dictionary(obj)
dict
```
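
Such a list is often assembled programmatically rather than typed by hand.
As a rough sketch, assuming the codes and labels are available as two
parallel vectors (the names `codes`, `full_names` and `obj_built` are
invented for this example), a translation can be built with base R's
`setNames()`:

```{r}
# Hypothetical parallel vectors: original values and their labels
codes <- c("eng", "mat", "gym")
full_names <- c("English", "Mathematics", "Gymnastics")

# setNames() attaches the original values as names of the label vector
sub_translation <- setNames(full_names, codes)

# Collect one or more translations in a named list
obj_built <- list(sub = sub_translation)
str(obj_built)
```

The resulting list has the same structure as `obj` above and can be passed to
`new_lama_dictionary()` in the same way.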


## From an existing **yaml** file

As described in [Get started], lama-dictionaries can be saved to **yaml** files.
These files are plain text files with a specific structure.
For example, the file `dictionary_exams.yaml` shipped with the package looks as follows:
```{yaml, code = readLines(system.file("extdata", "dictionary_exams.yaml", package = "labelmachine"))}
```
On the top level (no indentation) there are the translation names, each
followed by a colon (e.g. `sub:`, `res:` and `lev:`).
The lines following a translation name are indented by two spaces and
describe the label assignment rules of that translation.
Each line contains a single label assignment: the original value comes
first (use quotes in case of a numeric or logical value, e.g. `'1': "Very good"`),
followed by a colon and then the label which should be assigned to the
value.
The file shown above contains the following translations:

* translation `sub`, which assigns
  * the label `"English"` to the value `eng`
  * the label `"Mathematics"` to the value `mat`
  * the label `"Gymnastics"` to the value `gym`
* translation `res`, which assigns
  * the label `"Good"` to the value `1`
  * the label `"Passed"` to the value `2`
  * the label `"Not passed"` to the values `3` and `4`
  * the label `"Missed"` to the missing value `NA`
  * the missing value `NA` to the value `0`
* translation `lev`, which assigns
  * the label `"Basic"` to the value `b`
  * the label `"Advanced"` to the value `a`

With the command `lama_read()` a lama-dictionary can be loaded:
```{r, eval = FALSE}
path_to_file <- system.file("extdata", "dictionary_exams.yaml", package = "labelmachine")
dict <- lama_read(path_to_file)
```

With the command `lama_write()` the lama-dictionary `dict` can be saved as a
**yaml** file:
```{r, eval = FALSE}
path_to_file <- file.path(tempdir(), "my_dictionary.yaml")
lama_write(dict, path_to_file)
```
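
Both commands can be combined into a round trip. The chunk below is a small
sketch that writes a dictionary to a temporary file, prints the file and
reads it back. The names `dict_small`, `tmp_yaml` and `dict_reloaded` are
made up for this example, and `is.lama_dictionary()` is assumed to be
available as the package's class check.

```{r}
library(labelmachine)

# A small dictionary with a single translation
dict_small <- new_lama_dictionary(
  lev = c(b = "Basic", a = "Advanced")
)

# Write to a temporary file so that nothing is left behind in the project
tmp_yaml <- tempfile(fileext = ".yaml")
lama_write(dict_small, tmp_yaml)

# The file is plain text and can be inspected directly
cat(readLines(tmp_yaml), sep = "\n")

# Reading the file back should yield a lama-dictionary again
dict_reloaded <- lama_read(tmp_yaml)
is.lama_dictionary(dict_reloaded)
```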

## From already existing label assignment rules

Sometimes, there are already **csv** or **excel** files or **data frames**
holding label assignment rules. Normally, these assignment rules are stored
as column pairs, where one column holds the original values and
the other column holds the corresponding label strings.
If the assignments are stored in a **csv** file, they can be loaded into a
data frame with the commands `read.csv()` or `read.table()`. An **excel**
file can either be exported to **csv** first or be read with a dedicated
package such as `readxl`. Hence, it is sufficient to cover the case
of label assignment rules given as data frames.
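
As a brief sketch of the loading step, the chunk below parses csv content
with `read.csv()`. To keep the example self-contained, the csv content is
passed as a string via the `text` argument instead of a file path. The
column names `sub_old` and `sub_new` and the object names are assumptions
made for this illustration.

```{r}
# Hypothetical csv content holding one column pair (original value, label)
csv_text <- "sub_old,sub_new
eng,English
mat,Mathematics
gym,Gymnastics"

# colClasses = "character" keeps codes such as "01" from being parsed as numbers
df_from_csv <- read.csv(
  text = csv_text,
  colClasses = "character",
  stringsAsFactors = FALSE
)
df_from_csv
```

For a real file, the `text` argument is replaced by the path to the csv file.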

Let `df_trans` be a data frame holding the label assignment rules:
```{r, include = FALSE}
df_trans <- data.frame(
  sub_old = c("eng", "mat", "gym", NA, NA, NA),
  sub_new = c("English", "Mathematics", "Gymnastics", NA, NA, NA),
  res_old = c(0, 1, 2, 3, 4, NA),
  res_new = c(NA, "Good", "Passed", "Not passed", "Not passed", "Missed")
)
```
```{r}
df_trans
```

The command `as.lama_dictionary()` reads in the label assignment rules given in
`df_trans` and returns a lama-dictionary object holding the given translations:
```{r}
dict <- as.lama_dictionary(
  .data = df_trans,
  translation = c("sub", "res"),
  col_old = c("sub_old", "res_old"),
  col_new = c("sub_new", "res_new")
)
dict
```
The resulting lama-dictionary `dict` contains two translations `sub` and `res`.

The command `as.lama_dictionary()` also accepts an argument `ordering`, which
specifies how the ordering of the translations should
be determined. For further details, see the help page `?as.lama_dictionary`.

## Further reading

* [Translating variables]
* [Altering lama-dictionaries]
* [Get started]

[Get started]: https://a-maldet.github.io/labelmachine/articles/labelmachine.html
[Creating lama-dictionaries]: https://a-maldet.github.io/labelmachine/articles/create_dictionaries.html
[Altering lama-dictionaries]: https://a-maldet.github.io/labelmachine/articles/alter_dictionaries.html
[Translating variables]: https://a-maldet.github.io/labelmachine/articles/translate.html

