---
title: "Session reconstruction in R"
author: "Oliver Keyes"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{Session reconstruction in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

## Session reconstruction in R

With trace data - from web logs, behavioural logs, really anything to do with user actions - reconstructing sessions (or 'sessionising') is essential. It lets an analyst divide up user actions into actual periods of sustained interaction, and from there compute a whole host of useful metrics,
from session length to bounce rate.

`reconstructr` is a library for just that - sessionisation and metric computation - in a way that keeps all the metadata about the events you're sessionising.

### Dividing events into sessions

The nature of a "session" has provided fodder for researchers for years. Most people take an approach based on inactivity thresholds; if someone does not take an action in N seconds, their session has ended and a new one begins on their next logged action.

Using reconstructr, we can conveniently divide events into sessions with the `sessionise` function. This takes a data.frame of events (along with specifiers of which column contains the user ID, and which column contains the timestamp) and a threshold value (in seconds). When the time between events by a user crosses that threshold, the session ends and a new one begins.
We can demonstrate this using reconstructr's inbuilt session dataset:

```{r, eval=FALSE}
library(reconstructr)
data("session_dataset")

str(session_dataset)
'data.frame':	63524 obs. of  3 variables:
 $ uuid     : chr  "47dc43895814861e21a2edf93348c826" "a736822df1890011694e7049cb3abef3" "674d2d00e096a3319874a4347caa1f4a" "f62d315398e6d04a3f2fa02e8ae42d49" ...
 $ timestamp: POSIXlt, format: "2014-01-07 00:00:15" "2014-01-07 00:01:11" "2014-01-07 00:01:54" ...
 $ url      : chr  "https://www.nasa.gov/history/mercury/mercury.html" "https://www.nasa.gov/images/ksclogosmall.gif" "https://www.nasa.gov/elv/hot.gif" "https://www.nasa.gov/facts/faq04.html" ...

sessionised_data <- sessionise(x = session_dataset, timestamp = timestamp, user_id = uuid,
                               threshold = 1800)

str(sessionised_data)
'data.frame':	63524 obs. of  5 variables:
 $ uuid      : chr  "0005839b3e8483d50870f61f50307fa7" "000b047bad36484451f12c114ab5eb28" "000b047bad36484451f12c114ab5eb28" "000b047bad36484451f12c114ab5eb28" ...
 $ timestamp : POSIXlt, format: "2014-01-14 12:47:59" "2014-01-07 14:25:11" "2014-01-09 12:47:17" ...
 $ url       : chr  "https://www.nasa.gov/history/apollo/images/footprint-logo.gif" "https://www.nasa.gov/ksc.html" "https://www.nasa.gov/biomed/threat/gif/beachmousefinsmall.gif" "https://www.nasa.gov/shuttle/resources/orbiters/atlantis.html" ...
 $ session_id: chr  "09cd65049020ed55472a2d8b1f47787e" "9dcb2f610297b3fe2c810907fa90fb8e" "70bcde51eff332d4ac820a90930f0f6e" "70bcde51eff332d4ac820a90930f0f6e" ...
 $ time_delta: int  NA NA NA 45 4 75 274 47 NA 28 ...
```

Sessionisation adds two new columns; 'session\_id', a unique per-session ID, and 'time\_delta' - the time between an event and the previous event in the session. If the event was the first (or only) one in a session, that value will be `NA`.

Crucially, existing metadata (like URLs, or activity type) is carried along with the session information and not dropped.

From the sessionised data we can compute a whole host of useful metrics, many of which have convenience functions in this package.

### Session metrics
An important metric in session data is the bounce rate: the proportion of sessions that included only a single event. This represents (absent data quality issues) the number of sessions where a user took only one action and then simply left.

It can be computed with `bounce_rate`, which takes a sessionised dataset and produces the percentage of sessions resulting in bounces. Optionally (if you provide an argument for the `user_id` parameter) it produces the bounce rate on a per-user basis, rather than for the dataset overall:

```{r, eval=FALSE}

str(bounce_rate(sessionised_data))

num 20.7
 
str(bounce_rate(sessionised_data, user_id = uuid))

'data.frame':	10000 obs. of  2 variables:
 $ user_id    : chr  "0005839b3e8483d50870f61f50307fa7" "000b047bad36484451f12c114ab5eb28" "000b2bc1a5438d8d54d4fbec139a2fd5" "001b6e80a14ba8d809c4ff18cdbade40" ...
 $ bounce_rate: num  100 14.3 0 100 100 ...
```

`time_on_page` is very similar, calculating either the mean (or median) time between events - either for the dataset as a whole or, if the `by_session` parameter is TRUE, for each session:

```{r, eval=FALSE}

str(time_on_page(sessionised_data))

num 146

str(time_on_page(sessionised_data, by_session = TRUE))

'data.frame':	22226 obs. of  2 variables:
 $ session_id  : chr  "00011b1e098848edee7e50a2174fe6ef" "0001f6457a4d09a8c2092278fec89a89" "000451f0869b7eab3582c093ace0253d" "0004c56ace95f92ee12bf9552401f923" ...
 $ time_on_page: num  NaN NaN NaN NaN NaN ...
```

(It's not broken, it just so happened the first few sessions contained no non-NA time deltas).

Finally, `session_count` and `session_length` provide easy ways of identifying how many sessions are in a sessionised dataset (overall, or on a per-user basis) and how long those sessions are, respectively.



### Other session functionality
If you have ideas for other functionality that would make processing sessionseasier, the best approach
is to either [request it](https://github.com/Ironholds/reconstructr/issues) or [add it](https://github.com/Ironholds/reconstructr/pulls)!
