---
title: "Use case: Iterative Policy Evaluation (Reinforcement Learning)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Use case: Iterative Policy Evaluation (Reinforcement Learning)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In this vignette, we present a real-life use case that shows how the `matricks` package makes working with matrices easier.

Let's implement an algorithm from the field of **Reinforcement Learning** called *iterative policy evaluation*. The environment we will work with is a simple **Grid World** game. This vignette was inspired by a Reinforcement Learning course by [Lazy Programmer](https://github.com/lazyprogrammer), originally written in Python.

```{r setup}
library(matricks)
```

## Initializing environment

We initialize the environment using the following functions from the `matricks` package (a short demo of the core syntax follows the list):

 * `m()`
 * `matrix_idx()`
 * `plot.matrix()`
 * `seq_matrix()`
 * `sv()`
 * `with_same_dims()`

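If you have not used `matricks` before, here is a minimal, throwaway illustration of the two constructors we rely on most: `m()`, where `|` separates rows, and `sv()`, which sets values at given coordinates via formulas. The matrix below is just an example and is not part of the environment.

```{r matricks.demo}
# Build a 2x3 matrix row by row; `|` marks the end of a row
demo <- m(1, 2, 3 |
          4, 5, 6)
demo
# Set single values by coordinates: c(row, col) ~ value
sv(demo, c(1, 1) ~ 100, c(2, 3) ~ -100)
```
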
We construct two matrices describing the game environment:

* **actions** - all the possible moves (`FALSE` marks an unavailable field)
* **rewards** - the reward received on each field

```{r init.grid.world}
# Defining possible actions
# FALSE = unavailable field
actions <- m(T, T, T, F, T|
             T, T, F, T, T|
             F, T, T, T, T|
             T, T, T, T, T|
             F, F, T, T, T|
             F, T, T, T, F)
plot(actions)
# Defining rewards matrix with two terminal states
rewards <- with_same_dims(actions, 0)
rewards <- sv(rewards,
              c(4, 4) ~ 1., # Win
              c(5, 3) ~ -1) # Lose
# We add a small penalty to all remaining fields,
# which encourages the agent to finish the game in fewer steps
rewards[rewards == 0] <- -0.1
plot(rewards)
# List of coordinates of all available (TRUE) fields
states <- matrix_idx(actions, actions)
```

## Setting fixed policy

A **policy** defines which move to perform in each state in order to achieve the best result in the game.
We start with a fixed policy.

```{r fixed.policy}
# Symbols for moves
U <- "U" # Up
D <- "D" # Down
L <- "L" # Left
R <- "R" # Right

fixed.policy <- m(R , D , L , NA, D |
                  R , D , NA, D , L |
                  NA, R , D , D , D |
                  R , R , D , NA, D |
                  NA, NA, NA, U , L |
                  NA, R , U , L , NA)
```

It is easier to define the policy using character symbols, as above.
However, we have to transform it into a form that is more convenient for computations:
a matrix in which each field holds the coordinates of the target field.

```{r as.idx}
as_idx <- function(x){
  n.row <- nrow(x)
  n.col <- ncol(x)

  # A matrix of lists: each field will hold the target coordinates
  result <- matrix(list(), nrow = n.row, ncol = n.col)

  for (i in 1:n.row){
    for (j in 1:n.col){
      coords <- c(i, j)

      # Skip fields with no move defined
      if (is.na(at(x, coords)))
        next

      x.val <- x[i, j]

      # Translate a move symbol into the coordinates of the target field
      move.idx <- switch(x.val,
        U = c(i - 1, j),
        D = c(i + 1, j),
        L = c(i, j - 1),
        R = c(i, j + 1)
      )

      # Skip moves that would lead outside the matrix
      if (!is_idx_possible(x, move.idx))
        next

      result[i, j] <- list(move.idx)
    }
  }
  result
}

fixed.policy.idx <- as_idx(fixed.policy)
fixed.policy.idx
```

## Policy evaluation function
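
For a deterministic policy, the value update performed below is

$$V(s) \leftarrow r(s') + \gamma \, V(s')$$

where $s'$ is the field the policy moves to from state $s$, $r(s')$ is the reward for entering it, and $\gamma$ is the discount factor. We sweep over all states repeatedly, until the biggest change of $V$ within a sweep drops below `epsilon`.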
```{r policy.evaluation}
evaluate_policy <- function(actions, rewards, policy, 
                            epsilon = 1e-3, gamma = 0.9){
  
  # Initialize the value function with zeros
  V <- with_same_dims(policy, 0)
  policy.idx <- as_idx(policy)

  while (TRUE) {
    biggest.change <- 0
  
    for (move in seq_matrix(policy.idx)) {
      s <- move[[1]] # Current state (field coordinates)
      a <- move[[2]] # Target field the policy moves to (or NULL)

      old.v <- at(V, s)

      # Skip unavailable fields
      if (!at(actions, s))
        next

      # Skip fields with no move defined
      if (is.null(a))
        next

      # Bellman update for a deterministic policy
      r <- at(rewards, a)
      at(V, s) <- r + gamma * at(V, a)
      biggest.change <- max(biggest.change, abs(old.v - at(V, s)))
    }

    # Track convergence: print the largest value change in this sweep
    print(biggest.change)
    
    if (biggest.change < epsilon){
      break
    }
  }
  return(V)
}  
```
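
Note that `V` is updated in place, so later updates within a sweep already use the refreshed values of earlier states (Gauss-Seidel-style sweeps). This still converges, and typically does so in fewer sweeps than keeping a separate copy of the old value function.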

## Experiment
```{r experiment}
V1 <- evaluate_policy(actions = actions, 
                      rewards = rewards,
                      policy  = fixed.policy)
V1
plot(V1)
```
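
The discount factor `gamma` controls how far the influence of the terminal rewards propagates through the grid. As a quick, optional sanity check, we can re-run the evaluation with a smaller discount factor and compare the value matrices:

```{r experiment.gamma}
# With a lower gamma, values decay faster with distance
# from the terminal states
V2 <- evaluate_policy(actions = actions,
                      rewards = rewards,
                      policy  = fixed.policy,
                      gamma   = 0.5)
plot(V2)
```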
