---
title: "Package Statistics"
author: 
  - "Mark Padgham"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{Package Statistics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set (
    collapse = TRUE,
    warning = TRUE,
    message = TRUE,
    width = 120,
    comment = "#>",
    fig.retina = 2,
    fig.path = "README-"
)
options (repos = c (
    ropenscireviewtools = "https://ropensci-review-tools.r-universe.dev",
    CRAN = "https://cloud.r-project.org"
))
library (pkgstats)
```

This vignette describes the statistics collated by `pkgstats`. The first
section provides full descriptions of all data returned by [the main
`pkgstats()`
function](https://docs.ropensci.org/pkgstats/reference/pkgstats.html), and the
second section describes the output of [the `pkgstats_summary()`
function](https://docs.ropensci.org/pkgstats/reference/pkgstats_summary.html)
which converts statistics to a single-row summary. Single-row summaries from
several packages can be combined into a single `data.frame` object representing
their statistical properties. [The `pkgstats()`
function](https://docs.ropensci.org/pkgstats/reference/pkgstats.html) is
applied to all CRAN packages on a regular basis, with results accessible via
[the `dl_pkgstats_data()`
function](https://docs.ropensci.org/pkgstats/reference/dl_pkgstats_data.html).
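
As a rough sketch of how single-row summaries might be combined, the following
applies both functions to several local source directories and row-binds the
results. The paths are hypothetical placeholders, and the sketch assumes that
every summary has the same columns.

```{r combine-summaries, eval = FALSE}
library (pkgstats)

# Hypothetical paths to local package source directories:
pkg_paths <- c ("/path/to/pkgA", "/path/to/pkgB")
pkg_paths <- pkg_paths [dir.exists (pkg_paths)] # skip any that do not exist

# One single-row summary per package:
summaries <- lapply (pkg_paths, function (p) pkgstats_summary (pkgstats (p)))

# Combine all rows into one `data.frame` (assumes identical columns):
all_stats <- do.call (rbind, summaries)
```

Each row of `all_stats` would then describe one package, with `NULL` returned
if none of the paths exist.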

## Overview of Package Statistics

The [main `pkgstats()`
function](https://docs.ropensci.org/pkgstats/reference/pkgstats.html) returns a
`list` of eight main components:

1. `"loc"` summarising "Lines of Code" in package sub-directories and languages;
2. `"vignettes"` containing counts of numbers of vignettes and demos;
3. `"data_stats"` summarising data files;
4. `"desc"` summarising the contents of the package "DESCRIPTION" file;
5. `"translations"` summarising translations into other (human) languages;
6. `"objects"`: a table of all "objects" in all languages;
7. `"network"`: a table of relationships between objects, such as function calls; and
8. `"external_calls"`: a detailed table of all calls made to all R functions.

The following sub-sections provide further detail on these components (except
the simpler components of `"vignettes"`, `"data_stats"`, and `"translations"`).
The results use the output of applying the function to the source code of this package:

```{r main-pkgstats-call, eval = FALSE}
s <- pkgstats () # run in root directory of `pkgstats` source
```
```{r pkgstats-fakey, echo = FALSE}
# These data all have to be faked because they can't be generated on CRAN
# windows machines.
loc <- tibble::tibble (
    language = c ("C++", "R", "R", "Rmd"),
    dir = c ("src", "R", "tests", "vignettes"),
    nfiles = c (3L, 24L, 7L, 2L),
    nlines = c (364L, 4727L, 300L, 347L),
    ncode = c (276L, 345L, 234L, 278L),
    nempty = c (67L, 682L, 61L, 61L),
    nspaces = c (932L, 333L, 511L, 1483L),
    nchars = c (6983L, 114334L, 5543L, 11290L),
    nexpr = c (1L, 1L, 1L, 1L),
    ntabs = c (0L, 0L, 0L, 0L),
    indentation = c (4L, 4L, 4L, 4L)
)
vignettes <- c (vignettes = 2L, demos = 0L)
data_stats <- c (n = 0L, total_size = 0L, median_size = 0L)
desc <- data.frame (
    package = "pkgstats",
    version = "0.1.1",
    date = date (),
    license = "GPL-3",
    urls = paste0 (c (
        "https://docs.ropensci.org/pkgstats/",
        "https://github.com/ropensci-review-tools/pkgstats"
    ), collapse = ",\n"),
    bugs = "https://github.com/ropensci-review-tools/pkgstats/issues",
    aut = 1L,
    ctb = 0L,
    fnd = 0L,
    rev = 0L,
    ths = 0L,
    trl = 0L,
    depends = NA,
    imports = paste0 (c (
        "brio",
        "checkmate",
        "dplyr",
        "fs",
        "igraph",
        "methods",
        "readr",
        "sys",
        "withr"
    ), collapse = ", "),
    suggests = paste0 (c (
        "curl",
        "hms",
        "jsonlite",
        "knitr",
        "parallel",
        "pkgbuild",
        "Rcpp",
        "rmarkdown",
        "roxygen2",
        "testthat",
        "visNetwork"
    ), collapse = ", "),
    enhances = NA_character_,
    linking_to = "cpp11"
)
translations <- NA_character_

s <- list (
    loc = loc,
    vignettes = vignettes,
    data_stats = data_stats,
    desc = desc,
    translations = translations
)
```


The result is a list of various data extracted from the code. All except for
`objects`, `network`, and `external_calls` represent summary data:

```{r}
s [!names (s) %in% c ("objects", "network", "external_calls")]
```

These results demonstrate that many fields use `NA` to denote values of zero.
The following sub-sections explore these various components generated by the
`pkgstats()` function in more detail.

## Lines of Code

The first item in the above list is "loc" for Lines-of-Code, which are counted
using an internal routine specifically developed for R packages, and which
provides more accurate and R-specific information than most open source code
counting libraries. For example, the counts in `pkgstats` are able to
distinguish and separately count code chunks and text lines in `.Rmd` files. 

```{r loc}
s$loc
```

That output includes the following components, grouped by both computer
language and package directory:

1. `nfiles` = Numbers of files in each directory and language.
2. `nlines` = Total numbers of lines in all files.
3. `ncode` = Total numbers of lines of code.
4. `ndoc` = Total numbers of documentation or comment lines.
5. `nempty` = Total numbers of empty or blank lines.
6. `nspaces` = Total numbers of white spaces in all code lines, excluding
   leading indentation spaces.
7. `nchars` = Total numbers of non-white-space characters in all code lines.
8. `nexpr` = Median numbers of nested expressions in all lines which have any
   expressions (see below).
9. `ntabs` = Number of lines of code with initial tab indentation.
10. `indentation` = Number of spaces by which code is indented (with `-1`
   denoting tab-indentation).
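
Because these values are grouped by language and directory, they can be
aggregated with base R. The following is a minimal sketch using a toy stand-in
for `s$loc` with illustrative values; the name `loc_toy` is hypothetical and
only a subset of columns is included.

```{r loc-by-language, eval = FALSE}
# Toy stand-in for `s$loc`, with a subset of the columns described above:
loc_toy <- data.frame (
    language = c ("C++", "R", "R", "Rmd"),
    dir = c ("src", "R", "tests", "vignettes"),
    nlines = c (364L, 4727L, 300L, 347L),
    ncode = c (276L, 345L, 234L, 278L),
    stringsAsFactors = FALSE
)

# Total lines of code per language, summed over directories:
code_by_lang <- tapply (loc_toy$ncode, loc_toy$language, sum)

# Proportion of all lines in each directory which are code lines:
loc_toy$prop_code <- round (loc_toy$ncode / loc_toy$nlines, 3)
```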

Numbers of nested expressions are counted as numbers of brackets or braces of
any type nested on a single line. The following line has one nested bracket:

```{r nested1, eval = FALSE}
x <- myfn ()
```

while the following has four:

```{r nested4, eval = FALSE}
x <- function () { return (myfn ()) }
```
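
The counts for these two lines can be reproduced with a simple sketch which
counts opening brackets and braces on a single line. This is not the internal
routine used by `pkgstats`, and the function name `count_open_brackets` is
hypothetical; it ignores complications such as brackets inside strings or
comments.

```{r count-brackets, eval = FALSE}
# Count opening brackets or braces of any type in a single line of code:
count_open_brackets <- function (line) {
    chars <- strsplit (line, "") [[1]]
    sum (chars %in% c ("(", "[", "{"))
}

n1 <- count_open_brackets ("x <- myfn ()") # 1
n4 <- count_open_brackets ("x <- function () { return (myfn ()) }") # 4
```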

Code with fewer nested expressions per line is generally easier to read, and
this metric is provided as one indication of the general readability of
code. A second relative indication may be extracted by converting numbers of
spaces and characters to a measure of relative numbers of white spaces, noting
that the `nchars` value quantifies total characters including white spaces.

```{r rel-space}
index <- which (s$loc$dir %in% c ("R", "src")) # consider source code only
sum (s$loc$nspaces [index]) / sum (s$loc$nchars [index])
```

Finally, the `ntabs` statistic can be used to identify whether code uses tab
characters as indentation; otherwise, the `indentation` statistic indicates the
median number of white spaces by which code is indented. The `objects`,
`network`, and `external_calls` items returned by the [`pkgstats()`
function](https://docs.ropensci.org/pkgstats/reference/pkgstats.html) are
described further below.

## `"desc"`: The package "DESCRIPTION" file

The `desc` item looks like this:

```{r desc}
s$desc
```

This item includes the following components:

- Package name, version, date, and license
- Package URL(s) (`urls`)
- URL for BugReports (`bugs`)
- Numbers of contributors with the roles of *author* (`aut`), *contributor*
  (`ctb`), *funder* (`fnd`), *reviewer* (`rev`), *thesis advisor* (`ths`), and
  *translator* (`trl`, relating to translation between computer rather than
  spoken languages).
- Comma-separated character entries for all `depends`, `imports`, `suggests`,
  `enhances`, and `linking_to` packages.

The `date` value is taken from the "Date/Publication" field automatically
inserted by CRAN on package publication or, for non-CRAN packages, from the
"mtime" (modification time) value of the DESCRIPTION file. Note that `date`
values extracted by `pkgstats` do not use "Date" values from DESCRIPTION
files, as these are manually entered and potentially unreliable.
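
The comma-separated dependency fields described above can be split into
character vectors of package names. The following is a minimal sketch using a
toy stand-in for `s$desc`; the names `desc_toy` and `split_field` are
hypothetical and not part of `pkgstats`.

```{r desc-split, eval = FALSE}
# Toy stand-in for two of the comma-separated fields of `s$desc`:
desc_toy <- data.frame (
    imports = "brio, checkmate, dplyr, fs",
    suggests = "knitr, rmarkdown, testthat",
    stringsAsFactors = FALSE
)

# Split one comma-separated field into a character vector of package names:
split_field <- function (x) {
    if (is.na (x)) {
        return (character (0)) # fields are `NA` when no packages are listed
    }
    trimws (strsplit (x, ",") [[1]])
}

imports <- split_field (desc_toy$imports)
n_imports <- length (imports) # number of imported packages
```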

## `"objects"`: Objects in all languages

The `objects` item contains all code objects identified by
the code-tagging library [`ctags`](https://ctags.io). For R, those are
primarily functions, but for other languages may be a variety of entities such
as class or structure definitions, or sub-members thereof. Object tables look
like this:

```{r objects-fakey, echo = FALSE}
s$objects <- data.frame (
    file_name = c (
        rep ("R/archive-trawl.R", 4L),
        "R/cpp11.R",
        "R/ctags-install.R"
    ),
    fn_name = c (
        "pkgstats_from_archive",
        "list_archive_files",
        "rm_prev_files",
        "pkgstats_fns_from_archive",
        "cpp_loc",
        "clone_ctag"
    ),
    kind = rep ("function", 6L),
    language = rep ("R", 6L),
    loc = c (89, 17, 24, 82, 3, 17),
    npars = c (9, 2, 2, 7, 4, 1),
    has_dots = rep (FALSE, 6L),
    exported = rep (c (TRUE, FALSE, FALSE), 2L),
    param_nchards_md = c (
        133,
        rep (NA_integer_, 2L),
        163,
        rep (NA_integer_, 2L)
    ),
    param_nchards_mn = c (
        159.7778,
        rep (NA_integer_, 2L),
        174.5714,
        rep (NA_integer_, 2L)
    ),
    num_doclines = c (
        77,
        rep (NA_integer_, 2L),
        50,
        rep (NA_integer_, 2L)
    )
)
```


```{r objects}
head (s$objects)
```

Objects are primarily sorted by language, with R-language objects given first.
These are mostly functions, and include statistics on:

- lines of code used to define each function (`loc`);
- numbers of parameters (`npars`);
- whether or not the function includes a "three dots" parameter (that is,
  `...`; identified by `has_dots`);
- whether or not a function is exported (`exported`);
- mean and median numbers of characters used to document each parameter
  (`param_nchards_mn` and `param_nchards_md`, respectively); and
- total number of lines of documentation for that object or function
  (`num_doclines`).
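
These per-function statistics can be compared between exported and
non-exported functions. The following is a minimal sketch using a toy stand-in
for `s$objects` with hypothetical function names and illustrative values.

```{r objects-exported, eval = FALSE}
# Toy stand-in for `s$objects`, with a subset of the columns described above:
objects_toy <- data.frame (
    fn_name = c ("fn_a", "fn_b", "fn_c", "fn_d"), # hypothetical names
    loc = c (89, 17, 24, 82),
    npars = c (9, 2, 2, 7),
    exported = c (TRUE, FALSE, FALSE, TRUE),
    stringsAsFactors = FALSE
)

# Mean lines of code for non-exported ("FALSE") and exported ("TRUE") functions:
mean_loc <- tapply (objects_toy$loc, objects_toy$exported, mean)

# Median numbers of parameters for the same two groups:
median_npars <- tapply (objects_toy$npars, objects_toy$exported, median)
```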


## `"network"`: Relationships between objects

The `network` item details all relationships between objects, which generally
reflects one object calling or otherwise depending on another object. Each row
thus represents one edge of a "function call" network, with each entry in the
`from` and `to` columns representing the network vertices or nodes.

```{r network-fakey, echo = FALSE}
network <- data.frame (
    file = c (
        rep ("R/external_calls.R", 4L),
        rep ("R/pkgstats-summary.R", 2L)
    ),
    line1 = c (11L, 26L, 38L, 326L, 39L, 50L),
    from = c (
        rep ("external_call_network", 3L),
        "add_other_pkgs_to_calls",
        rep ("pkgstats_summary", 2L)
    ),
    to = c (
        "extract_call_content",
        "add_base_recommended_pkgs",
        "add_other_pkgs_to_calls",
        "control_parse",
        "null_stats",
        "loc_summary"
    ),
    language = rep ("R", 6L),
    cluster_dir = rep (1L, 6L),
    centrality_dir = c (9L, 9L, 9L, 1L, 11L, 11L),
    cluster_undir = rep (1L, 6L),
    centrality_undir = c (rep (230.8333, 3L), 6, rep (874, 2L))
)

nrows <- 142 # full number in result
# expand to nrows:
n <- ceiling (nrows / nrow (network))
network <- with (network, data.frame (
    file = rep (file, n),
    line1 = rep (line1, n),
    from = rep (from, n),
    to = rep (to, n),
    language = rep (language, n),
    cluster_dir = rep (cluster_dir, n),
    centrality_dir = rep (centrality_dir, n),
    cluster_undir = rep (cluster_undir, n),
    centrality_undir = rep (centrality_undir, n)
))
s$network <- network [seq (nrows), ]
```


```{r}
head (s$network)
nrow (s$network)
```

The network table includes additional statistics on the centrality of each
edge, measured as betweenness centrality assuming edges to be both directed
(`centrality_dir`) and undirected (`centrality_undir`). More central edges
reflect connections between objects that are more central to package
functionality, and vice versa. The distinct components of the network are also
represented by discrete cluster numbers, calculated both for directed and
undirected versions of the network. Each distinct cluster number represents
a distinct group of objects, internally related to other members of the same
cluster, yet independent of all objects with different cluster numbers.
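
The general idea of edge centrality and cluster membership can be sketched
with the `igraph` package, using a small edge list of `from` and `to` values.
This is an illustration under stated assumptions only: the internal
calculations in `pkgstats` may differ in detail, and values from this toy
network will not match those of the full network.

```{r network-igraph, eval = FALSE}
# Toy edge list of function calls, in the same form as `from` and `to` above:
edges <- data.frame (
    from = c (
        rep ("external_call_network", 3L),
        "add_other_pkgs_to_calls",
        rep ("pkgstats_summary", 2L)
    ),
    to = c (
        "extract_call_content",
        "add_base_recommended_pkgs",
        "add_other_pkgs_to_calls",
        "control_parse",
        "null_stats",
        "loc_summary"
    ),
    stringsAsFactors = FALSE
)
g <- igraph::graph_from_data_frame (edges, directed = TRUE)

# Betweenness centrality of each edge, treating edges as directed or not:
edges$centrality_dir <- igraph::edge_betweenness (g, directed = TRUE)
edges$centrality_undir <- igraph::edge_betweenness (g, directed = FALSE)

# Connected components, here ignoring edge direction:
comps <- igraph::components (g, mode = "weak")
n_clusters <- comps$no # number of distinct clusters
```

This toy network has two distinct clusters, because none of the functions
called by `pkgstats_summary` are connected to those called by
`external_call_network`.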

The network can be viewed as an interactive [`vis.js`](https://visjs.org/)
network by passing the result of `pkgstats()` (the variable `s` in the code
above) to the [`plot_network()`
function](https://docs.ropensci.org/pkgstats/reference/plot_network.html).

## `"external_calls"`: All calls made to all R functions

The `external_calls` item is structured similarly to the `network` object, but
identifies all calls to functions from external packages. However, unlike the
`network` and `objects` data, which provide information on objects and
relationships in all computer languages used within a package, the
`external_calls` object maps calls within R code only, in order to provide
insight into the use within a package of functions from other packages,
including R's base and recommended packages. The object looks like this:

```{r external_calls-fakey, echo = FALSE}
s$external_calls <- data.frame (
    tags_line = 1:6,
    call = c (
        "c",
        rep ("character", 2L),
        "logical",
        "integer",
        "left_join"
    ),
    tag = c (
        "GTAGSLABEL",
        "file_name",
        "fn_name",
        "has_dots",
        "loc",
        "name"
    ),
    file = c (
        "R/ctags-test.R",
        rep ("R/pkgstats.R", 4L),
        "R/plot.R"
    ),
    kind = rep ("nameattr", 6L),
    start = c (109L, 185L, 186L, 189L, 187L, 89L),
    end = c (109L, 185L, 186L, 189L, 187L, 89L),
    package = c (rep ("base", 5L), "dplyr")
)
```


```{r ext-call-head}
head (s$external_calls)
```
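
Calls in a table of this form can be tabulated per package with base R. The
following is a minimal sketch using a toy stand-in for `s$external_calls`
containing only the `call` and `package` columns; the names `ext_toy` and
`calls_per_pkg` are hypothetical.

```{r ext-call-tabulate, eval = FALSE}
# Toy stand-in for `s$external_calls`, with `call` and `package` columns only:
ext_toy <- data.frame (
    call = c ("c", "character", "character", "logical", "integer", "left_join"),
    package = c (rep ("base", 5L), "dplyr"),
    stringsAsFactors = FALSE
)

# Total numbers of calls to each package:
n_total <- table (ext_toy$package)

# Numbers of unique functions called from each package:
n_unique <- tapply (
    ext_toy$call,
    ext_toy$package,
    function (i) length (unique (i))
)

calls_per_pkg <- data.frame (
    pkg = names (n_total),
    n_total = as.integer (n_total),
    n_unique = as.integer (n_unique [names (n_total)]),
    stringsAsFactors = FALSE
)
```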

These data are converted to a summary form by the [`pkgstats_summary()`
function](https://docs.ropensci.org/pkgstats/reference/pkgstats_summary.html),
which tabulates numbers of external calls and unique functions from each
package. These data are presented as a single character string which looks like
this:

```{r ext-call-summary, eval = FALSE}
s_summ <- pkgstats_summary (s)
print (s_summ$external_calls)
```
```{r ext-call-summary-fakey, echo = FALSE}
s_summ <- list (external_calls = paste0 (c (
    "base:581:84",
    "brio:11:2",
    "curl:4:3",
    "dplyr:7:4",
    "fs:4:2",
    "graphics:10:2",
    "hms:2:1",
    "igraph:3:3",
    "parallel:2:1",
    "pkgstats:126:73",
    "readr:8:5",
    "stats:19:3",
    "sys:14:1",
    "tools:3:2",
    "utils:22:7",
    "visNetwork:3:2",
    "withr:6:2"
), collapse = ","))
```

These data can be easily converted to the corresponding numeric values using
code like the following:

```{r ext-call-details, eval = TRUE}
x <- strsplit (s_summ$external_calls, ",") [[1]]
x <- do.call (rbind, strsplit (x, ":"))
x <- data.frame (
    pkg = x [, 1],
    n_total = as.integer (x [, 2]),
    n_unique = as.integer (x [, 3])
)
x$n_total_rel <- round (x$n_total / sum (x$n_total), 3)
x$n_unique_rel <- round (x$n_unique / sum (x$n_unique), 3)
print (x)
```

Those data reveal, for example, that this package makes 
`r x$n_total [x$pkg == "base"]` individual calls to
`r x$n_unique [x$pkg == "base"]` unique functions from the "base" package.
