---
title: "mixgb: Multiple Imputation Through XGBoost"
author: "Yongshi Deng"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{mixgb: Multiple Imputation Through XGBoost}
  %\VignetteEngine{knitr::rmarkdown}
---
## Introduction

The **mixgb** package provides a scalable approach to imputation for large data using XGBoost, subsampling, and predictive mean matching. It leverages XGBoost—an efficient implementation of gradient-boosted trees—to automatically capture complex interactions and non-linear relationships. Subsampling and predictive mean matching are incorporated to reduce bias and to preserve realistic imputation variability. The package accommodates a wide range of variable types and offers flexible control over subsampling and predictive matching settings.

We also recommend our package **vismi** ([Visualisation Tools for Multiple Imputation][vismi-url]), which offers a comprehensive set of diagnostics for assessing the quality of multiply imputed data.

[vismi-url]: https://agnesdeng.github.io/vismi/

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Impute missing values with `mixgb`

We first load the `mixgb` package and the `newborn` dataset, which contains 16 variables of various types
(integer/numeric/factor/ordinal factor). There are 9 variables with missing values.

```{r}
library(mixgb)
str(newborn)
colSums(is.na(newborn))
```

To impute this dataset, we use the default settings. By default, the number of imputed datasets is set to `m = 5`. The data do not need to be converted to a `dgCMatrix` or one-hot encoded format, as these transformations are handled automatically by the package. Supported variable types include numeric, integer, factor, and ordinal factor.

```{r, eval = FALSE}
# use mixgb with default settings
imp_list <- mixgb(data = newborn, m = 5)
```
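A quick structural check of the result can be sketched as follows. It assumes that `mixgb()` with `save.models = FALSE` returns a plain list of `m` imputed datasets, and that every incomplete variable has been imputed; both are assumptions worth verifying against the function's documentation.

```{r, eval = FALSE}
# Sketch only: assumes imp_list is a list of m imputed datasets
imp_list <- mixgb(data = newborn, m = 5)

length(imp_list)           # number of imputed datasets
class(imp_list[[1]])       # class of a single imputed dataset
dim(imp_list[[1]])         # should match dim(newborn)
sum(is.na(imp_list[[1]]))  # remaining missing values in the first dataset
```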

### Customise imputation settings

We can also customise imputation settings:

-   The number of imputed datasets: `m`.

-   The number of imputation iterations: `maxit`.

-   XGBoost hyperparameters and verbosity settings: `xgb.params`, `nrounds`, `early_stopping_rounds`, `print_every_n` and `verbose`.

-   The subsampling ratio: by default, `subsample = 0.7`. Users can change this value through the `xgb.params` argument.

-   Predictive mean matching settings: `pmm.type`, `pmm.k` and `pmm.link`.

-   Whether ordinal factors should be converted to integers (which may speed up imputation): `ordinalAsInteger`.

-   Initial imputation methods for different types of variables: `initial.num`, `initial.int` and `initial.fac`.

-   Whether to save models for imputing new data: `save.models` and `save.vars`.

```{r, eval = FALSE}
set.seed(2026)
# Use mixgb with chosen settings
params <- list(
  max_depth = 5,
  subsample = 0.9,
  nthread = 2,
  tree_method = "hist"
)

imp_list <- mixgb(
  data = newborn, m = 10, maxit = 2,
  ordinalAsInteger = FALSE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 200, early_stopping_rounds = 10, print_every_n = 10L, verbose = 0
)
```
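To give some intuition for the `pmm.k` setting, the general idea of predictive mean matching can be sketched in base R: for each missing case, find the `k` observed cases (donors) whose predicted values are closest to its own prediction, then borrow the observed value of one randomly chosen donor. This is a minimal illustration of the generic technique, not the internal implementation used by **mixgb**; the helper `pmm_draw()` and the toy vectors are hypothetical.

```{r, eval = FALSE}
# Hypothetical helper for illustration; not part of mixgb.
pmm_draw <- function(pred_missing, pred_observed, y_observed, k = 5) {
  k <- min(k, length(y_observed))
  vapply(pred_missing, function(p) {
    # indices of the k donors whose predictions are closest to p
    donors <- order(abs(pred_observed - p))[seq_len(k)]
    # draw one donor at random and borrow its observed value
    y_observed[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
}

set.seed(2026)
y_obs    <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.0)  # observed values (toy data)
pred_obs <- c(10.0, 11.4, 10.1, 12.0, 11.1, 10.8) # predictions for observed cases
pred_mis <- c(10.5, 11.9)                         # predictions for missing cases

pmm_draw(pred_mis, pred_obs, y_obs, k = 3)
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of the observed data. A larger `k` widens the donor pool and increases variability between imputations.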

### Tune hyperparameters

Imputation performance can be influenced by the choice of hyperparameters. While tuning a large number of hyperparameters may seem daunting, the search space can often be substantially reduced because many of them are correlated. In **mixgb**, the function `mixgb_cv()` tunes the number of boosting rounds (`nrounds`). XGBoost does not define a default value for `nrounds`, so users must specify it explicitly. The default in `mixgb()` is `nrounds = 100`; however, we recommend running `mixgb_cv()` first to obtain an appropriate value.


```{r}
params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
cv.results$response
cv.results$best.nrounds
```


By default, `mixgb_cv()` randomly selects an incomplete variable as the response and fits an XGBoost model using the remaining variables as predictors, based on the complete cases of the dataset. As a result, repeated runs of `mixgb_cv()` may yield different results. Users may instead explicitly specify the response variable and the set of covariates via the `response` and `select_features` arguments, respectively.


```{r}
cv.results <- mixgb_cv(
  data = newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
  response = "head_circumference_cm",
  select_features = c(
    "age_months", "sex", "race_ethinicity", "recumbent_length_cm",
    "first_subscapular_skinfold_mm", "second_subscapular_skinfold_mm",
    "first_triceps_skinfold_mm", "second_triceps_skinfold_mm", "weight_kg"
  ),
  xgb.params = params, verbose = FALSE
)

cv.results$best.nrounds
```
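Fixing the response also makes it easier to compare settings. The sketch below runs `mixgb_cv()` for a few candidate values of `max_depth` and collects `best.nrounds` for each. The candidate depths are illustrative rather than recommended, and the sketch assumes that calling `set.seed()` before each run is enough to keep the cross-validation folds comparable.

```{r, eval = FALSE}
depths <- c(3, 5, 7) # illustrative candidate values, not recommendations

best_rounds <- sapply(depths, function(d) {
  set.seed(2026) # assumption: controls the fold assignment across runs
  cv <- mixgb_cv(
    data = newborn, nrounds = 100,
    response = "head_circumference_cm",
    xgb.params = list(max_depth = d, subsample = 0.7, nthread = 2),
    verbose = FALSE
  )
  cv$best.nrounds
})

data.frame(max_depth = depths, best.nrounds = best_rounds)
```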

We can then set `nrounds = cv.results$best.nrounds` in `mixgb()` to generate five imputed datasets.

```{r, eval = FALSE}
imp_list <- mixgb(data = newborn, m = 5, nrounds = cv.results$best.nrounds)
```


## Inspect multiply imputed values

Older versions of the **mixgb** package included a few visual diagnostic functions. These have since been removed from **mixgb**.

We recommend our standalone package **vismi** ([Visualisation Tools for Multiple Imputation][vismi-url]), which provides a comprehensive set of visual diagnostics for evaluating multiply imputed data.

For more details, please visit:

-   [https://agnesdeng.github.io/vismi/](https://agnesdeng.github.io/vismi/)

-   [https://github.com/agnesdeng/vismi](https://github.com/agnesdeng/vismi)
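
As a quick numeric complement to visual diagnostics, observed and imputed values of a single variable can be compared in base R. This is a rough sketch, not a replacement for proper diagnostics. It assumes that `mixgb()` returns a list of imputed datasets with rows in the same order as the original data; the name `target` is introduced only for illustration.

```{r, eval = FALSE}
imp_list <- mixgb(data = newborn, m = 5)

target <- "head_circumference_cm"
was_missing <- is.na(newborn[[target]]) # rows that were originally missing

# Summary of the observed values
summary(newborn[[target]][!was_missing])

# Summary of the imputed values, one column per imputed dataset
sapply(imp_list, function(d) summary(d[[target]][was_missing]))
```

Large discrepancies between the two summaries are not necessarily a problem, since the missing values may differ systematically from the observed ones, but they indicate where a closer look is worthwhile.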
