---
title: "Process a Relational Event History"
author: ""
package: remify
date: ""
output:
  rmarkdown::html_document:
    theme: spacelab
    highlight: pygments
    code_folding: show
    css: "remify-theme.css"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Process a Relational Event History}
  %\VignetteEncoding{UTF-8}
---

This vignette explains the aim, input, output, and methods of the function `remify::remify()`.

---

# {.tabset .tabset-fade .tabset-pills}

## Aim

The objective of `remify::remify()` is to process raw relational event sequences supplied by the user, along with other inputs that characterize the data (actors' names, event types, starting time point of the event sequence, manual risk set specification, etc.). The internal routines convert the input event sequence into a standardized format and provide the objects that are used by the other packages in 'remverse'.

As an example, we will use the data `randomREH` (documentation available via `?randomREH`).

```{r}
library(remify) # loading library
data(randomREH) # loading data
names(randomREH) # objects inside the list 'randomREH'
```

---

## Input

Input arguments that can be supplied to `remify()` are: `edgelist`, `directed`, `ordinal`, `model`, `thin`, `actors`, `riskset`, `manual.riskset`, `event_type`, `origin`, `time.units`, `attach_riskset`, `riskset_decode`, `riskset_max_decode`, `event_covariates`, and `ncores`.

------

### edgelist

The edgelist must be a `data.frame` with three mandatory columns: the time of the interaction in the first column, and the two actors forming the dyad in the second and third columns. The first three columns do not need specific names, but their order must be `[time, actor1, actor2]`. For directed networks, the second column is the sender and the third is the receiver. For undirected networks, the order of the second and third columns is ignored (dyads are sorted alphanumerically).
Optional columns are `weight` (event weights affecting endogenous statistics) and event type columns (see `event_type` below).

```{r}
head(randomREH$edgelist)
```

------

### directed

A logical `TRUE`/`FALSE` value indicating whether events are directed (`TRUE`) or undirected (`FALSE`). If `FALSE`, dyads are sorted according to their alphanumeric order (e.g. `[actor1, actor2] = ["Colton", "Alexander"]` becomes `["Alexander", "Colton"]`). Note that undirected networks are only supported for tie-oriented modeling.

------

### ordinal

A logical `TRUE`/`FALSE` value indicating whether only the order of events matters in the model (`TRUE`) or whether waiting times must also be considered (`FALSE`). Based on this argument, the processing of the time variable is carried out differently and `remstimate` will use either the _ordinal_ (if `ordinal = TRUE`) or the _interval_ (if `ordinal = FALSE`) time likelihood. When `ordinal = TRUE`, the `intereventTime` field of the returned object is `NULL`.

------

### model

Either `"tie"` (default) or `"actor"`. For `"tie"`, the risk set is at the dyad level. For `"actor"`, the model has two sub-processes: a sender rate model and a receiver choice model. Actor-oriented modeling requires `directed = TRUE`. When `model = "actor"`, the returned object additionally contains `sender_riskset`, `receiver_riskset`, and `activeN`.

------

### thin

An integer >= 1 controlling event-time thinning based on unique time points. Keeps every `thin`-th unique event time and maps each event time to the next kept time point. This reduces the number of unique time points and thus memory and computation in downstream steps. The default is `1` (no thinning).

------

### actors

An optional character vector of actor names. If `NULL` (default), actor names are taken from the input `edgelist`.
Specifying `actors` explicitly is useful when actors that could interact during the study did not appear in any observed event and should nonetheless be included in the risk set.

```{r}
randomREH$actors
```

------

### riskset

The `riskset` argument specifies the type of risk set. Four options are available:

- `"full"` (default): all possible dyadic events given the number of actors (and event types) are at risk throughout the entire event history.
- `"active"`: only the dyadic events observed at least once in the event history are at risk. This can substantially reduce computation time for sparse networks.
- `"active_saturated"`: extends the active risk set by adding the reverse direction for each observed dyad (if A→B is observed, B→A is also at risk) and includes all event types for each observed actor pair. This reflects the assumption that observing any interaction between two actors implies both directions and all types are possible.
- `"manual"`: a user-defined static risk set specified via `manual.riskset`. Observed dyads absent from `manual.riskset` are automatically added.

More details about risk set definitions are provided in `vignette(topic = "riskset", package = "remify")`.

------

### manual.riskset

Required when `riskset = "manual"`. A `data.frame` with columns `actor1` and `actor2` (and optionally `type`) specifying the complete set of dyads that are at risk throughout the event history. This defines a static risk set: unlike the deprecated `omit_dyad` argument, the risk set does not vary over time. Any observed dyads from the edgelist that are missing from `manual.riskset` are added automatically.
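
The automatic-addition rule described above can be pictured as a union of two dyad tables. The following is a minimal base-R sketch on hypothetical toy data; the objects `observed_dyads`, `manual_dyads`, and `effective_riskset` are illustrative names, not part of remify, and the package's internal routine may differ (for instance in how it handles event types and integer IDs).

```{r}
# Hypothetical toy data; this is NOT remify's internal code.
# Dyads that appear in the (toy) observed edgelist.
observed_dyads <- data.frame(
  actor1 = c("Alexander", "Colton", "Colton"),
  actor2 = c("Kayla", "Lexy", "Kayla"),
  stringsAsFactors = FALSE
)

# Dyads the user declared to be at risk.
manual_dyads <- data.frame(
  actor1 = c("Alexander", "Lexy"),
  actor2 = c("Kayla", "Alexander"),
  stringsAsFactors = FALSE
)

# Stack both tables and drop duplicated rows: every observed dyad
# ends up in the effective risk set even if the user left it out.
effective_riskset <- unique(rbind(manual_dyads, observed_dyads))
rownames(effective_riskset) <- NULL

effective_riskset
```

In this toy case the two observed dyads missing from `manual_dyads` are appended, while the dyad present in both tables is kept only once.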
```{r, eval = FALSE}
# Example: restrict the risk set to a specific set of dyads
my_riskset <- data.frame(
  actor1 = c("Alexander", "Colton", "Lexy"),
  actor2 = c("Kayla", "Lexy", "Alexander")
)

reh_manual <- remify(
  edgelist = randomREH$edgelist,
  directed = TRUE,
  model = "tie",
  riskset = "manual",
  manual.riskset = my_riskset
)
```

------

### event_type

An optional character string giving the name of the column in `edgelist` that contains event types (marks). If `NULL` (default), `remify()` uses `edgelist$type` if it exists; otherwise events are treated as untyped. If a column name is supplied, that column is used as the event-type mark. When event types are present, the dyadic risk set is extended over types: each dyad is duplicated for each event type.

------

### origin

The initial time ($t_0$) of the observation period. If known, it can be specified here; it must have the same class as the `time` column in the input `edgelist`. If left unspecified (`NULL`), it is set by default to one average waiting time before the first observed event. In the `randomREH` data a $t_0$ is provided:

```{r}
randomREH$origin
```

------

### time.units

A character string specifying the time unit for converting time values when `edgelist$time` is of class `Date` or `POSIXct`. Ignored for numeric or integer time. Default is `"auto"`, which selects seconds.

------

### attach_riskset

Logical (default `TRUE`). When `TRUE`, a list `riskset_info` is attached to the returned `remify` object. This list contains the effective risk set representation (e.g., `riskset_idx`, decoded dyad tables, and basic risk set metadata) and is intended to make the object self-describing and easier to inspect.

------

### riskset_decode

Controls how the included risk set dyads are decoded and attached in `riskset_info$included`:

- `"labels"` (default): attach a decoded dyad table with actor (and type) name labels.
- `"ids"`: attach a decoded dyad table with integer IDs only.
- `"none"`: do not attach a decoded dyad table.

------

### riskset_max_decode

Integer (default `200000L`). Maximum number of included dyads for which `riskset_decode = "labels"` is performed. If the risk set exceeds this threshold, decoding falls back to `"ids"` with a warning, to avoid excessive memory usage.

------

### event_covariates

An optional character vector of column names in `edgelist` to retain as additional event-level variables in the returned `remify` object. These are stored in `reh$event_covariates` together with `time`, `actor1`, `actor2`, and an internal `.event_id`. This is useful when downstream functions need access to event-level marks that are not part of the core `reh$edgelist`.

------

### ncores

An optional integer specifying the number of cores used in the parallelization of internal processing functions (default is `1`).

------

### Running the example

```{r}
edgelist_reh <- remify(
  edgelist = randomREH$edgelist,
  directed = TRUE,  # events are directed
  ordinal = FALSE,  # model with waiting times
  model = "tie",    # tie-oriented modeling
  actors = randomREH$actors,
  riskset = "full",
  origin = randomREH$origin
)
```

## Output

The output of `remify()` is an S3 object of class `remify`. Its top-level elements are:

```{r}
names(edgelist_reh)
```

------

### M

`M` is the number of observed time points. If two or more events occur at the same time point, `M` counts unique time points and the total number of events is returned by `E` (see below). If all events occur at different time points, `M` equals the number of events.

```{r}
edgelist_reh$M
```

------

### E

`E` is the total number of observed events, returned only when simultaneous events exist (i.e., when `M < E`).

------

### N

`N` is the total number of actors that could interact in the network.

```{r}
edgelist_reh$N
```

------

### C

`C` is the number of event types. If no event types are present, `C` is `1`.
```{r}
edgelist_reh$C
```

------

### D and activeD

`D` is the number of dyads in the full risk set, i.e., the largest possible risk set size:

- directed: $D = N \times (N-1) \times C$
- undirected: $D = \frac{N \times (N-1)}{2} \times C$

When `riskset` is `"active"` or `"manual"`, `activeD` gives the number of dyads in the reduced risk set.

```{r}
edgelist_reh$D
```

------

### intereventTime

A numeric vector of waiting times between subsequent events:

$$\begin{bmatrix} t_1 - t_0 \\ t_2 - t_1 \\ \vdots \\ t_M - t_{M-1} \end{bmatrix}$$

```{r}
head(edgelist_reh$intereventTime)
```

`intereventTime` is `NULL` when `ordinal = TRUE`.

------

### edgelist

The processed input edgelist as a `data.frame` with columns `[time, actor1, actor2]` (plus `type` and/or `weight` if supplied). Events are re-ordered by time if necessary.

```{r}
head(edgelist_reh$edgelist)
```

------

### edgelist_id

A per-event integer ID summary (internal use by downstream packages).

------

### meta

A list of metadata describing the processed event history. This replaces the old `attr()`-based interface. Key fields:

```{r}
names(edgelist_reh$meta)
```

- `meta$model`: tie-oriented or actor-oriented modeling
- `meta$directed`: whether events are directed
- `meta$ordinal`: whether ordinal likelihood is used
- `meta$riskset`: the type of risk set
- `meta$dictionary`: a list of two `data.frame`s, `actors` (columns `actorName`, `actorID`) and `types` (columns `typeName`, `typeID`), both sorted alphanumerically
- `meta$origin`: the starting time $t_0$
- `meta$ncores`: number of cores used

```{r}
edgelist_reh$meta$directed
edgelist_reh$meta$model
edgelist_reh$meta$dictionary
```

------

### ids

A list of per-event integer IDs for `actor1`, `actor2`, `dyad`, and `type`. These replace the old `attr(reh, "actor1ID")` etc. interface.

```{r}
# dyad ID of the first event
edgelist_reh$ids$dyad[1]
# sender ID of the first event
edgelist_reh$ids$actor1[1]
```

------

### index

A list of decoded risk set tables.
For tie-oriented modeling, it contains `dyad_map` (full risk set) or `dyad_map_active` (active/manual risk set). For actor-oriented modeling, it contains `sender_map`.

------

### omit_dyad

For tie-oriented modeling, a dynamic risk set modification list (empty list when `riskset` is not `"manual"`). For actor-oriented modeling, always an empty list.

------

### riskset_info

When `attach_riskset = TRUE` (the default), this list contains the effective risk set representation used for estimation. The field `riskset_info$included` contains the decoded risk set dyad table (format controlled by `riskset_decode`):

```{r}
head(edgelist_reh$riskset_info$included)
```

------

### Actor-oriented model output

When `model = "actor"`, the following additional elements are returned:

- `sender_riskset`: integer vector of actor IDs allowed to send.
- `receiver_riskset`: named list of integer vectors of allowed receiver IDs per sender.
- `activeN`: number of active senders.
- `index$sender_map`: a `data.frame` with columns `senderID` and `actorName` for active senders.

```{r}
reh_actor <- remify(
  edgelist = randomREH$edgelist,
  directed = TRUE,
  ordinal = FALSE,
  model = "actor",
  actors = randomREH$actors,
  riskset = "full",
  origin = randomREH$origin
)
reh_actor$activeN
head(reh_actor$index$sender_map)
```

---

## Methods

The available methods for a `remify` object are: `print`, `summary`, `dim`, and `plot`.

------

### print() and summary()

Both `print()` and `summary()` print a brief summary of the relational network data.

```{r}
summary(edgelist_reh)
```

------

### dim()

Returns key dimensions of the network: number of events, number of actors, number of event types (if more than one), number of possible dyads ($D$), and number of active dyads (`activeD`, shown only if `riskset = "active"` or `"manual"`).

```{r}
dim(edgelist_reh)
```

------

### plot()

`plot()` returns a set of descriptive plots (selected via the `which` argument):

1. Histogram of the inter-event times.
2. Activity tile plot: dyad frequency heatmap with in-degree and out-degree (or total-degree for undirected networks) bar plots on the margins.
3. Normalized in-/out-/total-degree of actors over `n_intervals` evenly-spaced time intervals (values in $[0,1]$; opacity and size of points are proportional to the normalized measure).
4. Per time interval: number of events, proportion of observed dyads, proportion of active senders and active receivers (directed networks only).
5. Network visualization via `igraph`: undirected and directed edge graphs (edges' opacity is proportional to event counts; vertices' opacity is proportional to degree).

```{r, out.width="50%", fig.align = "center", dev=c("jpeg"), fig.alt = "summary plots", dev.args = list(bg = "white")}
op <- par(no.readonly = TRUE)
par(mai=rep(0.8,4), cex.main=0.9, cex.axis=0.75)
plot(x = edgelist_reh, which = 1, n_intervals = 13)
plot(x = edgelist_reh, which = 2, n_intervals = 13)
plot(x = edgelist_reh, which = 3, n_intervals = 13)
plot(x = edgelist_reh, which = 4, n_intervals = 13)
plot(x = edgelist_reh, which = 5, n_intervals = 13, igraph.edge.color = "#cfcece", igraph.vertex.color = "#7bbfef")
par(op)
```

The plots vary for undirected networks:

```{r}
edgelist_reh_undir <- remify(
  edgelist = randomREH$edgelist,
  directed = FALSE,
  model = "tie"
)
```
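
As noted under `directed` and `D`, undirected dyads are sorted alphanumerically and the full undirected risk set holds $\frac{N \times (N-1)}{2} \times C$ dyads. The following base-R sketch illustrates both points on a hypothetical toy edgelist; the objects `toy`, `sorted_dyads`, `n_unique_dyads`, and `D_undirected` are illustrative names, and this is not the routine that `remify()` runs internally.

```{r}
# Hypothetical toy edgelist; a base-R sketch, not remify's internal routine.
toy <- data.frame(
  actor1 = c("Colton", "Alexander", "Lexy"),
  actor2 = c("Alexander", "Colton", "Colton"),
  stringsAsFactors = FALSE
)

# Put the alphanumerically smaller name first, so that both orderings
# of the same pair collapse into a single undirected dyad.
sorted_dyads <- data.frame(
  actor1 = pmin(toy$actor1, toy$actor2),
  actor2 = pmax(toy$actor1, toy$actor2),
  stringsAsFactors = FALSE
)

# Number of distinct undirected dyads observed in the toy data.
n_unique_dyads <- nrow(unique(sorted_dyads))

# Size of the full undirected risk set for N actors and C event types.
N <- 3
C <- 1
D_undirected <- N * (N - 1) / 2 * C

n_unique_dyads
D_undirected
```

Here `["Colton", "Alexander"]` and `["Alexander", "Colton"]` count as the same dyad, so the toy data contain two distinct undirected dyads out of the three that are possible among three actors.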