\documentclass[nojss]{jss}

%% -- LaTeX packages and custom commands ---------------------------------------

%% recommended packages
\usepackage{orcidlink,thumbpdf,lmodern}

%% additional packages
%\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath, amsfonts, amssymb, bbm, bm, mathabx}
\usepackage{booktabs} 	
\usepackage{longtable} 
\usepackage{tabularx}
\usepackage{xltabular}
\usepackage{lscape}
\usepackage{fontawesome5}
\usepackage{tikz}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{makecell}
\renewcommand{\cellalign}{tl}
\renewcommand{\theadalign}{tl}

%% new custom commands
\newcommand{\class}[1]{`\code{#1}'}
\newcommand{\fct}[1]{\code{#1()}}

\newcommand{\dif}{\mathop{}\!\mathrm{d}}

%\VignetteIndexEntry{Getting Started with DataSimilarity}
%\VignetteDepends{rpart.plot, nnet, knitr, synthpop, cluster}

%% For Sweave-based articles about R packages:
%% need no \usepackage{Sweave}
\SweaveOpts{engine=R, eps=FALSE, keep.source = TRUE}
<<preliminaries, echo=FALSE, results=hide>>=
op <- par(no.readonly = TRUE)
old <- options()
options(prompt = "R> ", continue = "+  ", width = 70, useFancyQuotes = FALSE)
library("knitr")
@


%% -- Article metainformation (author, title, ...) -----------------------------
\author{Marieke Stolte~\orcidlink{0009-0002-0711-6789}\\TU Dortmund University and LMU Munich
\And Luca Sauer\\TU Dortmund University \AND J\"org Rahnenf\"uhrer~\orcidlink{0000-0002-8947-440X} \\ TU Dortmund University \And Andrea Bommert~\orcidlink{0000-0002-1005-9351} \\ TU Dortmund University}
\Plainauthor{Marieke Stolte, Luca Sauer, J\"org Rahnenf\"uhrer, Andrea Bommert}


\title{Getting Started With \pkg{DataSimilarity}: Quantifying Similarity of Datasets and Multivariate Two- and $k$-Sample Testing}
\Plaintitle{Getting Started With DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- and k-Sample Testing}
\Shorttitle{Getting Started}

%% - \Abstract{} almost as usual
\Abstract{
Quantifying the similarity of two or more datasets is a common task in various applications of statistics and machine learning, including two- or $k$-sample testing and meta- or transfer learning.
The \pkg{DataSimilarity} package contains a variety of methods for quantifying the similarity of datasets.
The package includes 40 methods of which 17 are implemented for the first time in \proglang{R}. 
The remaining are wrapper functions for methods with already existing implementations that unify and simplify the various input and output formats of different methods and bundle the methods of many existing \proglang{R} packages in a single package.
In this vignette, we show the basic workflow for using the package.  
}

\Keywords{dataset similarity, two-sample testing, multi-sample testing}
\Plainkeywords{dataset similarity, two-sample testing, multi-sample testing}

\Address{
Marieke Stolte, Luca Sauer, J\"org Rahnenf\"uhrer, Andrea Bommert\\
Department of Statistics\\
TU Dortmund University\\
Vogelpothsweg 87\\
44227 Dortmund, Germany\\
\emph{and}\\
Marieke Stolte\\
Institute for Medical Information Processing, Biometry, and Epidemiology (IBE)\\
LMU Medizin\\
Ludwig-Maximilians-Universit\"at M\"unchen\\
Marchioninistrasse 15\\
81377 Munich, Germany\\
E-mail: \email{marieke.stolte@ibe.med.uni-muenchen.de}\\
}

\begin{document}
\SweaveOpts{concordance=FALSE}

\section{Introduction} \label{sec:intro}
The challenge of quantifying how similar two or more datasets are arises in various contexts where two or more datasets should be compared. 
This could be in the context of transferring results of a prediction model from one dataset to another, as well as for assessing how close simulated data is to a real-world dataset. 
The most common usage is for two- or $k$-sample testing. 
Formally, the two-sample problem is defined as the testing problem  
\begin{equation}
H_0: F_1 = F_2 \text{ vs. } H_1: F_1\ne F_2.\label{two.sample.problem}
\end{equation}
A two-sample test, therefore, can be used to check whether the underlying distributions of two datasets coincide. 
Analogously, the $k$-sample problem is defined as
\[
H_0: F_1 = F_2 = \dots = F_k \text{ vs. } H_1: \exists i\ne j\in\{1,\dots,k\}: F_i\ne F_j,
\]
for $k$ distributions $F_1,\dots, F_k$. 

Many different methods are proposed in the literature for quantifying the similarity of two or more datasets, and most of these define a two- or $k$-sample test. 
In this package, a subset of these methods are implemented, which were selected as relevant from a literature review \citep{stolte_methods_2024}. 
For more details on the methods and their selection, see the `Details' vignette. 
In the following, the basic steps for using the \pkg{DataSimilarity} package are explained using real-world example datasets with different characteristics with regard to the scale level, number of datasets, and presence of a target variable in each dataset. 
Moreover, it is demonstrated how a method from the package can be chosen. 
For that, both theoretical properties of the method \citep{stolte_methods_2024} and their empirical performance in simulation studies \citep{stolte_empirical_cat_2026, stolte_empirical_num_2026} are considered.


\section{Workflow} \label{sec:illustration}
In the following, the typical workflow for working with the package is explained. 


There are two different use cases with different workflows. 
\begin{itemize}
\item[a)] We already know which method to apply to our dataset comparison at hand.
\item[b)] We have two (or more) datasets that we want to compare, but we do not have a specific method in mind. 
\end{itemize}

In case a), the workflow for using the package would be to find the corresponding function for the method and apply it to the data. 
The full list of methods can also be found in the `Details' vignette as well as in the \code{method.table} dataset.
Some examples for working with the package are given in Section~\ref{sec:examples}.
There, we differentiate six cases with regard to the applicability of the selected methods. 
These are summarized in Table~\ref{tab:cases}.
In Section~\ref{sec:examples}, one data example per case is shown and one corresponding applicable method is explained and demonstrated.

\begin{table}[!htb]
\centering
\begin{tabular}{llll}
\toprule
Scenario no. & No.datasets & Scale level & Target variable \\ 
\midrule
1 & $k = 2$ & Numeric & No \\  
2 & $k \ge 2$ & Numeric & No \\ 
3 & $k = 2$ & Numeric & Yes \\ 
\midrule
4 & $k = 2$ & Categorical & No \\  
5 & $k \ge 2$ & Categorical & No \\ 
6 & $k = 2$ & Categorical & Yes \\ 
\bottomrule
\end{tabular}
\caption{Overview of considered cases for applicability of the dataset similarity methods. If present, the target variable included in each dataset has to be a categorical variable.}
\label{tab:cases}
\end{table}

In case b), the package can also be used as a tool for finding an appropriate method. 
This depends on the dataset characteristics. 
For a quick overview of which methods are applicable in which cases, see also the help page \code{?DataSimilarity}.
Here, we distinguish between numeric and categorical data, the number of datasets (two or more than two), and whether or not the datasets include a target variable. 
We demonstrate how to find and apply a method for different types of datasets in the following. 
The general workflow for case b) is summarized in Table~\ref{tab:steps.meth.choice}. 
The proposed COMPARE procedure (\textbf{C}haracteristics of datasets, \textbf{O}bjective of comparison, \textbf{M}ethod requirements, \textbf{P}roduction of method list, \textbf{A}ssessment of computational complexity, \textbf{R}anking of methods, \textbf{E}stimation and testing) for finding the best-suited method is described in more detail in the following.
It can be seen as a step-by-step guide for first specifying all requirements that can then be passed to \fct{findSimilarityMethod}. 
This returns a list of eligible methods. 
The remaining steps then consist of finding the best method from this list and applying it to the datasets.
In Section~\ref{sec:demonstration_compare}, each step is described and the full procedure is demonstrated for one application example of comparing strategies for creating synthetic datasets.

\begin{table}[!tb]
\centering
\begin{tabular}{lp{5cm}p{9cm}}
			\toprule
			& Step & Description \\
			\midrule
			1. & Characteristics of datasets & Evaluate number of datasets, scale level of variables, presence of target variable, dimensions of datasets\\
			2. & Objective of comparison & Is the focus mainly on a test decision, or is the (dis)similarity value itself of interest and should be interpretable?\\
			3. & Method requirements & Are additional properties of the method regarding invariances, metric properties, or consistency of a test of relevance?\\
			4. & Production of method list & Call \fct{findSimilarityMethod} with the criteria identified in steps 1.--3.\\
			5. & Assessment of computational complexity & Are the identified methods feasible with regard to their runtime and memory consumption?\\
			
			6. & Ranking of methods & Compare the remaining methods regarding their performance in simulation studies and select the best\\
			7. & Estimation and testing & Apply the chosen method(s)\\
			\bottomrule
		\end{tabular}
		\caption{Step-by-step procedure for choosing the best method for a given application.}\label{tab:steps.meth.choice}
\end{table}


\section{Examples for case a): We already know which method to apply}\label{sec:examples}
In the following, we present examples how to apply specific methods for six special data situations with different characteristics.
For method descriptions, see the Details vignette.
The methods for the examples here are chosen for demonstration mainly based on their interpretability and simplicity. 
For guidance on choosing methods for a given application, refer to the following section.
For using any of the methods, we first need to attach the package.
%
<<loadpackage, echo = TRUE>>=
library("DataSimilarity")
@
 
\subsection{Cross-Match test for exactly two numeric datasets without target variables}
The dataset \code{dhfr} \citep{sutherland_three-dimensional_2004} from the \pkg{caret} package \citep{caret} is a binary classification dataset (regarding Dihydrofolate Reductase inhibition) consisting of 325 compounds of which 203 are labeled as `active' and 122 as `inactive'. 
The variables are 228 molecular descriptors.
As the active and inactive compounds should differ in their descriptors, we divide the dataset according to the first variable that indicates the activity status. 
<<preparedhfr, echo = TRUE, cache = TRUE>>=
data(dhfr, package = "caret")
act <- dhfr[dhfr$Y == "active", -1]
inact <- dhfr[dhfr$Y == "inactive", -1]
@
For finding an appropriate method, we can use the function \fct{findSimilarityMethod}.
We specify that we have two numeric datasets.
As two datasets is already the default, we only need to specify \code{Numeric = TRUE}: 
<<findMethod2Num, echo = TRUE, cache = TRUE>>=
findSimilarityMethod(Numeric = TRUE)
@
We can also get more information if we set \code{only.names = FALSE}.
We could use this additional information for choosing a method but here we assume that we already have a method in mind. 
We apply the Rosenbaum cross-match test here to check whether the active and inactive compounds differ.
One example method for this case is the \citet{rosenbaum_exact_2005} cross-match test. 
It is a graph-based method. 
Most graph-based methods work by constructing a similarity graph on the pooled sample and counting the edges that connect points from different samples. 
Here, the optimal non-bipartite matching is used, i.e., a graph where pairs of two observations in the pooled sample are connected such that the sum over the edge lengths (= Euclidean distances of connected observations) is minimized. 
The optimal non-bipartite matching for two example data situations is shown in Figure~\ref{fig:ex.CM}.
In case of an odd number of observations, a ghost observation is introduced that has the highest distance to all other observations. 
The observation that is matched with that ghost observation is discarded from further analysis. 

The test statistic of the cross-match test is given by the standardized cross-match count
\[
\frac{\text{CMC} - \E_{H_0}(\text{CMC})}{\sqrt{\VAR_{H_0}(\text{CMC})}},
\]
where CMC denotes the cross-match count and $\E_{H_0}$ and $\VAR_{H_0}$ its expectation and variance, respectively, under $H_0: F_1 = F_2$. 
The cross-match count is the number of edges connecting points stemming from different datasets.
The exact distribution of the test statistic under $H_0$ is known. For small samples, it can be used for computing an exact $p$~value. 
For large samples, the asymptotic standard normal distribution of the test statistic can be used. 
The idea of the test is that for similar datasets, the number of edges connecting points from different samples is expected to be higher than in datasets that differ. 
This is illustrated in Figure~\ref{fig:CM.sim} compared to Figure~\ref{fig:CM.dis}. 
In case of data drawn from different datasets, fewer edges connect points from different datasets, indicated by the lower number of red edges in Figure~\ref{fig:CM.dis}. 

\begin{figure}[t!]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization_CM, echo=FALSE, fig=TRUE, height=4, width=7>>=
library(nbpMatching)
set.seed(5202)
data.graph.sim <- data.frame(x = runif(16), y = runif(16), 
                             dataset = rep(c(1, 2), each = 8))
set.seed(5202)
data.graph.dis <- data.frame(x = c(rbeta(8, 2, 5), runif(8)), 
                             y = c(rbeta(8, 2, 5), runif(8)), 
                             dataset = rep(c(1, 2), each = 8))

w.sim <- dist(data.graph.sim[, 1:2], upper = TRUE)
w.dis <- dist(data.graph.dis[, 1:2], upper = TRUE)

nbp.sim <- nonbimatch(distancematrix(as.matrix(w.sim)))
nbp.sim <- cbind(nbp.sim$halves[, 2], nbp.sim$halves[, 4])
nbp.dis <- nonbimatch(distancematrix(as.matrix(w.dis)))
nbp.dis <- cbind(nbp.dis$halves[, 2], nbp.dis$halves[, 4])

plotGraph <- function(data, E, directed = FALSE, col = hcl.colors(2, "Spectral")[1]) {
  x <- data$x
  y <- data$y
  ind <- data$dataset
  plot(x, y, pch = c(21, 19)[ind], cex = 2, xlim = 0:1, ylim = 0:1, xlab = "", 
       ylab = "", las = 1, cex.axis = 2)
  if(directed) {
    for(i in 1:nrow(E)) {
      e <- E[i, ]
      d <- dist(matrix(c(x[e[1]], y[e[1]], x[e[2]], y[e[2]]), byrow = TRUE, ncol = 2))
      f <- 0.014 / d
      arrows(x0 = x[e[1]], 
             y0 = y[e[1]], 
             x1 = x[e[1]] + (1 - f) * (x[e[2]] - x[e[1]]), 
             y1 = y[e[1]] + (1 - f) * (y[e[2]] - y[e[1]]), 
             length = 0.07, lwd = 2, 
             col = c("darkgrey", col)[(ind[e[1]] != ind[e[2]]) + 1], 
             lty = c(1, 2)[(ind[E[, 1]] != ind[E[, 2]]) + 1])
    }
  } else {
    segments(x0 = x[E[, 1]], y0 = y[E[, 1]], 
             x1 = x[E[, 2]], y1 = y[E[, 2]], 
             lwd = 3, 
             col = c("darkgrey", col)[(ind[E[, 1]] != ind[E[, 2]]) + 1], 
             lty = c(1, 2)[(ind[E[, 1]] != ind[E[, 2]]) + 1])
  }
  points(x, y, pch = 19, col = c("white", "black")[ind], cex = 2)
  points(x, y, pch = c(21, 19)[ind], cex = 2)
}
par(mar = c(2.1, 3.1, 1.1, 2.1))
plotGraph(data.graph.sim, nbp.sim, FALSE)
par(op)
@
\caption{Datasets drawn from the same distribution.}\label{fig:CM.sim}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization_CM_dis, echo=FALSE, fig=TRUE, height=4, width=7>>=
par(mar = c(2.1, 3.1, 1.1, 2.1))
plotGraph(data.graph.dis, nbp.dis, FALSE)
par(op)
@
\caption{Datasets drawn from different distributions.}\label{fig:CM.dis}
\end{subfigure}
\caption{Optimal non-bipartite matching for example datasets. Dataset~1 is indicated by white points and Dataset~2 by black points. Edges connecting points from different datasets are indicated by red and dashed lines. Edges connecting points from the same sample are indicated by grey and solid lines.}
\label{fig:ex.CM}
\end{figure}

Since the combined sample size here is smaller than 340, we can apply the exact test. 
We can either use the \fct{DataSimilarity} function and specify the \code{method} argument accordingly:
%
<<exRosenbaumDS, echo = TRUE, cache = TRUE, tidy = TRUE>>=
DataSimilarity(act, inact, method = "Rosenbaum", exact = TRUE)
@
Alternatively, we can use the \fct{Rosenbaum} function directly:
%
<<exRosenbaum, echo = FALSE, cache = TRUE, tidy = TRUE>>=
res.Rosenbaum <- Rosenbaum(act, inact, exact = TRUE)
@
<<exRosenbaum1, echo = TRUE, eval = FALSE, cache = TRUE, tidy = TRUE>>=
Rosenbaum(act, inact, exact = TRUE)
@
<<exRosenbaum2, echo = FALSE, cache = TRUE, tidy = TRUE>>=
print(res.Rosenbaum)
@
%
The output of the Rosenbaum test is an object of class \class{htest}. 
The output of the other methods is also in this format. 
The statistic value can be accessed by saving the result and accessing the \code{statistic} element of the saved result: 
<<exRosenbaum3, echo = TRUE, cache = TRUE, tidy = TRUE>>=
res.Rosenbaum <- Rosenbaum(act, inact, exact = TRUE)
res.Rosenbaum$statistic
@
The $p$~value can be accessed analogously as follows: 
<<exRosenbaum4, echo = TRUE, cache = TRUE, tidy = TRUE>>=
res.Rosenbaum$p.value
@
This holds for almost all other functions in this package. 
Additionally, the output might include more information specific to the method, which is then described on the respective help page. 
For the Rosenbaum test, for example, the unstandardized cross-match count is also returned and can be accessed via 
<<exRosenbaum5, echo = TRUE, cache = TRUE, tidy = TRUE>>=
res.Rosenbaum$estimate
@
The cross-match count is equal to \Sexpr{res.Rosenbaum$estimate}. 
At most, there could be $122$ cross-matches if each observation from the `inactive' dataset was connected to an observation in the `active' dataset. 
Therefore, the cross-match count of $20$ can be considered a rather small value. 
This is also reflected by the $z$ score of \Sexpr{round(res.Rosenbaum$statistic, 2)}. %\Sexpr{round(res.Rosenbaum$statistic, 2)}
Consequently, we see that the hypothesis of equal distributions can be rejected with a $p$~value smaller than $2.2\cdot 10^{-16}$. 

We obtain a warning that informs us that a ghost value was introduced when calculating the optimal non-bipartite matching, due to the odd pooled sample size.
This means that an artificial point was added to the sample that has the highest distance to all other points in the sample, such that the optimal non-bipartite matching, which needs an even sample size, could be calculated. 
The ghost value and the point with which it was matched are then discarded from the subsequent calculations.

\subsection{MMCM for more than two numeric datasets without target variables}
The well-known \code{iris} dataset \citep{fisher_use_1936} included in the \pkg{datasets} package that comes with base \proglang{R} \citep{R_4_1_2} includes measurements of sepal and petals of 50 flowers each of three iris species.
We compare the datasets for the three species Iris setosa, versicolor, and virginica, which are known to differ in their sepal and petal measurements.
%
<<prepareiris, echo = TRUE, cache = TRUE>>=
data("iris")
setosa <- iris[iris$Species == "setosa", -5]
versicolor <- iris[iris$Species == "versicolor", -5]
virginica <- iris[iris$Species == "virginica", -5]
@
%
For finding an appropriate method, we can use the function \fct{findSimilarityMethod} again and specify that we have more than two numeric datasets using the \code{Numeric} and in addition the \code{Multiple.samples} options: 
<<findMethod3Num, echo = TRUE, cache = TRUE>>=
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE)
@
For comparing the three datasets, we could, for example, use the \citet{mukherjee_distribution-free_2022} Mahalanobis multisample cross-match (MMCM) test, which is a generalization of the cross-match test for multiple samples.
The method of \citet{mukherjee_distribution-free_2022} is an extension of the \citet{rosenbaum_exact_2005} cross-match test for multiple samples. 
The cross-match counts $A = (a_{12}, a_{13},\dots,a_{ik}, a_{23},\dots,a_{2k},\dots,a_{k-1,k})^\top$ for all pairs of datasets are calculated using the optimal non-bipartite matching on the pooled sample. 
The test statistic then is the Mahalanobis distance of the observed cross-counts under the null hypothesis $H_0: F_1 = F_2 = \dots = F_k$
\[
\text{MMCM} = (A - \E_{H_0}(A))^\top \COV_{H_0}^{-1}(A) (A - \E_{H_0}(A)).
\]
The expectation and covariance matrix of the cross-count vector $A$ under $H_0$ can be calculated analytically and depend only on the sample sizes $n_i, i = 1,\dots,k$. 
Small values of the multi-sample Mahalanobis cross-match (MMCM) statistic indicate similarity. 
However, as there is no known upperbound, it is hard to interpret the MMCM value. 
The MMCM statistic follows a $\chi^2_{\binom{k}{2}}$ distribution asymptotically under the null, which can be used for testing. 

For applying the test, we can again either use the \fct{DataSimilarity} function or the \fct{MMCM} function directly
%
<<exMMCM, echo = TRUE, cache = TRUE>>=
DataSimilarity(setosa, versicolor, virginica, method = "MMCM")
MMCM(setosa, versicolor, virginica)
@
%
The MMCM statistic value on its own is hard to interpret. 
However, the test rejects the null hypothesis of equal distributions with $p < 2.2\cdot 10^{-16}$.
Therefore, we can conclude that the observed MMCM value presents an extreme value when assuming the null. 
Thus, the datasets are dissimilar. 

\subsection{NKT for exactly two numeric datasets with target variables}
The \code{segmentationData} dataset \citep{hill_impact_2007} in the \pkg{caret} package \citep{caret} includes cell body segmentation data. 
The dataset contains 119 imaging measurements of 2019 cells to predict the segmentation that is divided into the two classes \code{PS} for `poorly segmented' and \code{WS} for `well segmented'. 
Moreover, there is a division into 1009 observations used for training and 1010 observations used as a test set. 
We compare this training and test set. 
Ideally, the distributions of the training and test set should be equal in this predictive modelling setting. 
%
<<preparesegmentationData, echo = TRUE, cache = TRUE, tidy = TRUE>>=
data(segmentationData, package = "caret")
test <- segmentationData[segmentationData$Case == "Test", -(1:2)]
train <- segmentationData[segmentationData$Case == "Train", -(1:2)]
@
%
The following methods would be appropriate to use: 
%
<<findMethodNumY, echo = TRUE, cache = TRUE>>=
findSimilarityMethod(Numeric = TRUE, Target.Inclusion = TRUE)
@
% 
Setting \code{Target.Inclusion = TRUE} selects only the methods that can handle datasets that include a target variable.
For demonstration, we choose the method of \citet{ntoutsi_general_2008} and use all three proposed similarity measures NTO1, NTO2, and NTO3.
\citet{ntoutsi_general_2008} propose measuring dataset similarity based on probability density estimates derived from decision trees.
For this, it is assumed that in addition to both covariate datasets $X^{(1)}$ and $X^{(2)}$, categorical target variables $Y^{(1)}$ and $Y^{(2)}$ are given.
On each dataset $X^{(j)}$, a classification tree is constructed with $Y^{(j)}$ as the target variable, $j = 1, 2$. 
The splits defined by the decision trees induce a partition of the feature space $\mathcal{X}$ such that each leaf node corresponds to one segment in the partition.
Figure~\ref{fig:ex.NKT} demonstrates the procedure for two example datasets. 
First, trees are fit to each dataset (Figure~\ref{fig:tree1} and \ref{fig:tree2}). 
Then, the sample space is divided into segments based on the splits performed in each tree (Figure~\ref{fig:parti1} and \ref{fig:parti2}). 
These partitions are intersected (Figure~\ref{fig:sec.parti}) and based on the joint partition, the probability densities $P_D(\mathcal{X})$ and $P_D(Y^{(j)},\mathcal{X})$ are estimated for $D \in \{X^{(1)}, X^{(2)}, Z\}$.

Let $n_r$ denote the number of segments in the joint partition and $n_c$ the number of classes in $X^{(1)}$ and $X^{(2)}$.
$\hat{P}_D(\mathcal{X}) \in \mathbb{R}^{n_r}$ uses the proportion of observations in $D$ that fall into each segment of the joint partition. 
This means that for each of the $n_r$ segments of the partition, the number of observations from dataset $D$ that fall into that segment is counted and divided by the total number of observations in $D$.
For the estimation of the joint density $P_D(Y,\mathcal{X})$, the proportion of observations that fall into each segment of the joint partition and belong to each class is determined, $\hat{P}_D(Y, \mathcal{X}) \in \mathbb{R}^{n_r \times n_c}$.
Here, for each of the $n_r$ segments of the partition and each of the $n_c$ classes, the number of observations in $D$ where the corresponding target variable has the respective class value and that fall into the respective segment is counted and divided by the total number of observations in $D$.
The conditional density $P_D(Y|\mathcal{X})$ is estimated by calculating the proportion of observations belonging to each class separately for each segment, 
$\hat{P}_D(Y|\mathcal{X}) \in \mathbb{R}^{n_r \times n_c}$.
Here, for each of the $n_r$ segments of the partition and each of the $n_c$ classes, the number of observations in $D$ where the corresponding target variable has the respective class value and that fall into the respective segment is counted and divided by the total number of observations in $D$ that fall into the respective segment.

\begin{figure}[t!]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization_tree1, echo=FALSE, fig=TRUE, height=5, width=7>>=
library(DataSimilarity)
set.seed(5202)
X1 <- data.frame(X1 = runif(100), X2 = runif(100))
X1$y <- as.factor(ifelse(X1$X1 < 0.5 & X1$X2 > 0.3, 0, 1))

X2 <- data.frame(X1 = runif(100), X2 = runif(100))
X2$y <- as.factor(ifelse((X2$X1 < 0.5 & X2$X2 > 0.3) | (X2$X2 < 0.3 & X2$X1 > 0.2 ), 0, 1))

library(rpart)
library(rpart.plot)
tree1 <- rpart(y ~ ., data = X1)
tree2 <- rpart(y ~ ., data = X2)
parti1 <- DataSimilarity:::findPartition(tree1, X1, X2)
parti2 <- DataSimilarity:::findPartition(tree2, X1, X2)
intersec.parti <- DataSimilarity:::intersectPartitions(parti1, parti2)
par(xpd = TRUE)
prp(tree1, digits = 2, type = 5, tweak = 1.5)
par(op)
@
\caption{Fitted Tree for Dataset~1.}
\label{fig:tree1}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization_tree2, echo=FALSE, fig=TRUE, height=5, width=7>>=
par(xpd = TRUE)
prp(tree2, digits = 2, type = 5, tweak = 1.5)
par(op)
@
\caption{Fitted Tree for Dataset~2.}
\label{fig:tree2}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization_parti1, echo=FALSE, fig=TRUE, height=5, width=7>>=
plotParti <- function(parti) {
  plot(NA, xlim = 0:1, ylim = 0:1, xlab = "X1", ylab = "X2", main = "", las = 1, 
       cex.axis = 1.5, cex.lab = 1.5)
  for(i in seq_along(parti)){
    segments(x0 = round(parti[[i]][1, 2], 2), x1 = round(parti[[i]][1, 3], 2), 
             y0 = round(parti[[i]][2, 2], 2), y1 = round(parti[[i]][2, 2], 2))
    segments(x0 = round(parti[[i]][1, 2], 2), x1 = round(parti[[i]][1, 3], 2), 
             y0 = round(parti[[i]][2, 3], 2), y1 = round(parti[[i]][2, 3], 2))
    segments(x0 = round(parti[[i]][1, 3], 2), x1 = round(parti[[i]][1, 3], 2), 
             y0 = round(parti[[i]][2, 2], 2), y1 = round(parti[[i]][2, 3], 2))
    segments(x0 = round(parti[[i]][1, 2], 2), x1 = round(parti[[i]][1, 2], 2), 
             y0 = round(parti[[i]][2, 2], 2), y1 = round(parti[[i]][2, 3], 2))
  }  
}

par(mar = c(4.1, 4.1, 1.1, 2.1))
plotParti(parti1)
par(op)
@
\caption{Partition of sample space derived from fitted tree for Dataset~1.}
\label{fig:parti1}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization_NKT_parti2, echo=FALSE, fig=TRUE, height=5, width=7>>=
par(mar = c(4.1, 4.1, 1.1, 2.1))
plotParti(parti2)
par(op)
@
\caption{Partition of sample space derived from fitted tree for Dataset~2.}
\label{fig:parti2}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization_intersect, echo=FALSE, fig=TRUE, height=5, width=7>>=
par(mar = c(4.1, 4.1, 1.1, 2.1))
plotParti(intersec.parti$parti)
par(op)
@
\caption{Intersected partition (greatest common refinement, GCR) from fitted trees for Datasets~1 and~2. Each dataset includes two covariates and a binary target variable.}
\label{fig:sec.parti}
\end{subfigure}
\caption{Partitioning of sample space by fitting trees to two example datasets.}
\label{fig:ex.NKT}
\end{figure}

Then, \citet{ntoutsi_general_2008} consider the similarity index 
\[s(p, q) = \sum_{i} \sqrt{p_i \cdot q_i}\]
for vectors $p$ and $q$, where $(n_r \times n_c)$-matrices are interpreted as $(n_r \cdot n_c)$-dimensional vectors.
For the conditional distribution, the similarity vector $S(Y|\mathcal{X}) \in \mathbb{R}^{n_r}$ is computed with $S(Y|\mathcal{X})_i = s(\hat{P}_{X^{(1)}}(Y|\mathcal{X})_{i \bullet}, \hat{P}_{X^{(2)}}(Y|\mathcal{X})_{i \bullet})$ and index $i \bullet$ denoting the $i$-th row.
Based on this, three similarity measures for datasets are proposed:

\begin{enumerate}
\item NTO1 = $s(\hat{P}_{X^{(1)}}(\mathcal{X}), \hat{P}_{X^{(2)}}(\mathcal{X}))$
\item NTO2 = $s(\hat{P}_{X^{(1)}}(Y, \mathcal{X}), \hat{P}_{X^{(2)}}(Y, \mathcal{X}))$
\item NTO3 = $S(Y|\mathcal{X})^{\top} \hat{P}_{Z}(\mathcal{X})$.
\end{enumerate}

All three measures have values in the interval $[0, 1]$, where high values correspond to high similarity.

For applying the method to our data, the \code{target1} and \code{target2} arguments have to be specified as the column names of the target variable in the first and second supplied datasets, respectively.
Here, the target variable is named \code{"Class"} in both cases.
Again, we can use either the \fct{DataSimilarity} function or \fct{NKT}.
%
<<exNKT, echo = TRUE, cache = TRUE>>=
DataSimilarity(train, test, method = "NKT", target1 = "Class", 
               target2 = "Class", tune = FALSE)
NKT(train, test, target1 = "Class", target2 = "Class", tune = FALSE)
DataSimilarity(train, test, method = "NKT", target1 = "Class",
               target2 = "Class", tune = FALSE, version = 2)
NKT(train, test, target1 = "Class", target2 = "Class", tune = FALSE, 
    version = 2)
DataSimilarity(train, test, method = "NKT", target1 = "Class", 
               target2 = "Class", tune = FALSE, version = 3)
NKT(train, test, target1 = "Class", target2 = "Class", tune = FALSE,
    version = 3)
@
%
We observe high similarity between the training and test datasets with all three methods, reflected by the similarity values \code{s} that are all close to the maximal value $1$. 
For the method of \citet{ntoutsi_general_2008}, no test is proposed and therefore, no $p$~value is calculated. 

\subsection{HMN for exactly two categorical datasets without target variables} 
The \code{banque} dataset from the \pkg{ade4} package \citep{ade4} consists of bank survey data of 810 customers.
All variables are categorical and contain socio-economic information of the customers. 
We divide the data into bank card owners and non-bank card owners and compare these two groups. 
In total, 243 out of the 810 customers own a bank card.
Bank card owners and non-bank card owners might differ in their socio-economic characteristics.
%
<<preparebanque1, echo = TRUE, cache = TRUE>>=
data(banque , package = "ade4")
card <- banque[banque$cableue == "oui", -7]
no.card <- banque[banque$cableue == "non", -7]
@
%
We again apply the \fct{findSimilarityMethod} function to find appropriate methods for comparing two categorical datasets.
Again, two samples are the default. 
Therefore, we only have to specify \code{Categorical = TRUE}.
<<findMethod2Cat, echo = TRUE, cache = TRUE>>=
findSimilarityMethod(Categorical = TRUE)
@
For demonstration, we use the random forest test of \citet{hediger_use_2021} to compare these two groups. 
\citet{hediger_use_2021} provide a two-sample test based on random forests that is applicable for both numeric and categorical data. 
For this test, a pooled dataset is created where each observation is labeled according to its original dataset membership, and a random forest is trained to distinguish between the dataset labels.
The idea is that if the datasets are generated from the same distribution, the classification error of the random forest should be close to the chance level, otherwise, the classifier should be able to distinguish between the two distributions and hence the classification error should be lower than the chance level. 
One advantage of using random forests as the classifier is that it requires almost no tuning.
An asymptotic test is proposed.
For this, the pooled dataset has to be split into a training set on which the random forest is trained and a test set on which its classification error is evaluated.
In the implementation, both datasets are split in half to create a training and a test dataset. 
Alternatively, an out-of-bag (OOB) based permutation test can be performed that does not require data splitting. 
OOB statistics can be used to increase the sample efficiency compared to the test based on a holdout sample. 
Both the OOB-based test and the asymptotic version of the test are implemented.
The test statistic is either the mean of the per-class OOB or test classification errors or the overall OOB or test classification error, respectively.
In the asymptotic case, a binomial test is performed in case of the overall classification error, or a $Z$ test is performed in case of the mean per-class classification error. 
Otherwise, a permutation test is performed.
The variable importance measures of the random forest can provide additional insights into sources of distributional differences. 

For easier interpretation, we here look at the overall out-of-bag (OOB) prediction error instead of the per-class OOB prediction error and perform a permutation test with 1000 permutations.
For reproducibility, we set a seed before applying the method. 
Alternatively, we could supply the seed via the \code{seed} argument for setting the seed within the function.
%
<<exHMN, echo = FALSE, cache = TRUE>>=
# HMN.res <- HMN(card, no.card, n.perm = 1000, statistic = "OverallOOB") 
# save(HMN.res, file = "tmpResHMNVignette.RData")
load("tmpResHMNVignette.RData")
@
<<exHMNDS, echo = TRUE, eval = FALSE, cache = TRUE>>=
set.seed(1234)
DataSimilarity(card, no.card, method = "HMN", n.perm = 1000,
               statistic = "OverallOOB") 
@
<<exHMNDS2, echo = FALSE, cache = TRUE>>=
print(HMN.res)
@
<<exHMN1, echo = TRUE, eval = FALSE, cache = TRUE>>=
set.seed(1234)
HMN(card, no.card, n.perm = 1000, statistic = "OverallOOB") 
@
<<exHMN2, echo = FALSE, cache = TRUE>>=
print(HMN.res)
@
%
The overall OOB prediction error is \Sexpr{round(HMN.res$statistic, 3)}, which is considerably smaller than the naive prediction error of $243/810 = 0.3$. 
Therefore, the random forest can distinguish between the datasets, so we can conclude that the datasets differ.
This is also reflected by the $p$~value of \Sexpr{sprintf("%.3e", HMN.res$p.value)}. 

\subsection{C2ST for more than two categorical datasets without target variables} 
We consider the \code{banque} dataset from the \pkg{ade4} package \citep{ade4} again. 
This time, we split it by the nine socio-professional categories given by `csp', which are again expected to differ with regard to the other socio-economic characteristics.
%
<<preparebanque2, echo = TRUE, cache = TRUE>>=
data(banque, package = "ade4")
agric <- banque[banque$csp == "agric", -1]
artis <- banque[banque$csp == "artis", -1]
cadsu <- banque[banque$csp == "cadsu", -1]
inter <- banque[banque$csp == "inter", -1]
emplo <- banque[banque$csp == "emplo", -1]
ouvri <- banque[banque$csp == "ouvri", -1]
retra <- banque[banque$csp == "retra", -1]
inact <- banque[banque$csp == "inact", -1]
etudi <- banque[banque$csp == "etudi", -1]
@
%
To compare these datasets, we now need a method that can handle multiple datasets at once: 
<<findMethod7Cat, echo = TRUE, cache = TRUE>>=
findSimilarityMethod(Categorical = TRUE, Multiple.Samples = TRUE)
@
We apply the classifier two-sample test (C2ST). 
The general idea of this method by \citet{lopez-paz_revisiting_2017} is to use a classifier to determine which of two or more datasets a sample belongs to. 
The \textit{classifier two-sample test (C2ST)} uses the classification accuracy of this classifier as its test statistic. \\
The C2ST consists of five steps: 
\begin{enumerate}
\item Construct the dataset consisting of the samples from all datasets, labeled with their membership to each of the datasets.
\item Assign the observations of the dataset constructed in 1.\ randomly to a training and test set.
\item Train a classifier that predicts for an observation to which dataset $X^{(j)}, j = 1,\dots,k$ it belongs.
\item Calculate the C2ST statistic, which is the accuracy on the test set. The accuracy should be close to the chance level for $F_1 = \dots = F_k$, and it should be greater than the chance level for $\exists i\ne j\in\{1,\dots,k\}: F_i \ne F_j$ since in the latter case the classifier should identify distributional differences between the samples.
\item Calculate a $p$~value using a binomial test for comparing the accuracy to the chance level.
\end{enumerate}
Maximizing the power of a C2ST is a trade-off between using a large training set to optimize the classifier and a large test set to better evaluate the performance of the classifier. \\
The test statistic is interpretable as the percentage of samples that are correctly classified on the unseen test data. 
The above-mentioned test of \citet{hediger_use_2021} can be seen as a special case of the general framework proposed by \citet{lopez-paz_revisiting_2017}. 
One difference in the implementation of the tests is that for the C2ST, categorical data is dummy-encoded, while for the test of \citet{hediger_use_2021} the categorical variables are passed to \fct{ranger::ranger} directly. 
Moreover, the use of OOB predictions and feature importance is specific to the random forest-based test and cannot be used for all of the available classifiers for the C2ST.
Further, the C2ST uses the accuracy as its test statistic while the test of \citet{hediger_use_2021} uses the classification error, i.e., $1 - \text{accuracy}$. 

First, we use the default $K$-NN classifier. 
Categorical variables are dummy-coded. 
Again, we can use either \fct{DataSimilarity} or \fct{C2ST}:
%
<<exC2STKNN, echo = FALSE, cache = TRUE>>=
C2ST.res <- C2ST(agric, artis, cadsu, inter, emplo, ouvri, retra, inact, etudi)
@
<<exC2STKNNDS1, echo = TRUE, eval = FALSE, cache = TRUE>>=
DataSimilarity(agric, artis, cadsu, inter, emplo, ouvri, retra, inact,
               etudi, method = "C2ST")
@
<<exC2STKNNDS2, echo = FALSE, cache = TRUE>>=
print(C2ST.res)
@
<<exC2STKNN1, echo = TRUE, eval = FALSE, cache = TRUE>>=
C2ST(agric, artis, cadsu, inter, emplo, ouvri, retra, inact, etudi)
@
<<exC2STKNN2, echo = FALSE, cache = TRUE>>=
print(C2ST.res)
@
%
The accuracy of the $K$-NN classifier is \Sexpr{round(C2ST.res$statistic, 3)}. 
It is larger than the naive accuracy for always predicting the largest class, which is given by \code{prob = \Sexpr{round(C2ST.res$parameter[2], 3)}} in the output. 
The classifier seems to be able to distinguish between the datasets, and we can therefore regard them as dissimilar. 
Moreover, the null hypothesis of equal distributions can be rejected with a $p$~value of \Sexpr{sprintf("%.3e", C2ST.res$p.value)}. 

For demonstration, we additionally perform the C2ST with a neural net classifier. 
%
<<exC2STMLP, echo = TRUE, cache = TRUE>>=
DataSimilarity(agric, artis, cadsu, inter, emplo, ouvri, retra, inact,
               etudi, method = "C2ST", classifier = "nnet",
               train.args = list(trace = FALSE))
C2ST(agric, artis, cadsu, inter, emplo, ouvri, retra, inact, etudi, 
     classifier = "nnet", train.args = list(trace = FALSE))
@
%
The results are very similar to using $K$-NN. 

\subsection{OTDD for exactly two categorical datasets with target variables} % OTDD
We consider the \code{banque} dataset from the \pkg{ade4} package \citep{ade4} again. 
In this case, we interpret the savings bank amount (\code{eparliv}) variable as the target variable, which is again supplied via the \code{target1} and \code{target2} arguments. 
It is divided into the three categories `$> 20000$', `$> 0 $ and $ <20000$', and `nulle'.
We divide the data into the socio-professional categories as before, and now need a method for two categorical datasets that include a target variable.
<<findMethodCatY, echo = TRUE, cache = TRUE>>=
findSimilarityMethod(Categorical = TRUE, Target.Inclusion = TRUE)
@
We use the optimal transport dataset distance (OTDD) to compare the resulting datasets for craftsmen, shopkeepers, company directors (`artis'), to that of higher intellectual professions (`cadsu'), and to that of manual workers (`ouvri'). 
\citet{alvarez-melis_geometric_2020} define this distance based on optimal transport between datasets that include a target (class) variable $Y$. 
The \emph{optimal transport dataset distance} (OTDD) is defined as
\[
d_{\text{OT}}(X^{(1)}, X^{(2)}) = \min_{\pi\in\Pi(F_1, F_2)} \int_{\mathcal{Z}\times\mathcal{Z}} d_{\mathcal{Z}}(z, z^\prime)^q \dif \pi(z, z^\prime)
\]
where $X^{(1)}, X^{(2)}$ denote the two datasets, \[
\Pi(F_1, F_2) := \{\pi_{1,2}\in\mathcal{P}(\mathcal{Z}\times\mathcal{Z})\arrowvert \pi_1 = F_2, \pi_2 = F_2\}
\]
is the set of joint distributions over the product space $\mathcal{Z}\times\mathcal{Z}$ over the sample space of the pooled sample with marginal distributions $F_2$ and $F_2$, and
\[
d_{\mathcal{Z}}(z, z^\prime) := (d_{\mathcal{X}}(x, x^{\prime})^q + W_{q^\prime}^{q^\prime}(\alpha_y, \alpha_{y^{\prime}}))^{1/q}.
\]
defines a distance of two points $z^{\top} = (x^{\top}, y)$, and ${z^\prime}^{\top} = ({x^\prime}^{\top}, y^\prime)$ in the pooled sample.
$d_{\mathcal{X}}$ defines a distance on the covariate space, e.g., the Euclidean distance, and  $W_{q^\prime}(\alpha_y, \alpha_{y^{\prime}})$ is the $q^\prime$-Wasserstein distance of the distribution of the subset of covariate data with corresponding response value $y$ and the distribution of the subset of covariate data with corresponding response value $y^\prime$. 
The powers $q$ and $q^\prime$ have to be chosen in advance to calculate the OTDD.
The optimal transport problem can intuitively be motivated by imagining each probability density as a pile of dirt. 
Then, the cost function corresponds to the cost for transporting the dirt from one point to another, which is proportional to the distance between the two points. 
The optimal transport then corresponds to the lowest cost required for moving one pile of dirt fully to the shape and location of the other. 
Therefore, distributions can be regarded as more similar if the optimal transport between them is lower. 
For an intuitive explanation and visualization of the OTDD, also refer to 
\citet{hagen_measuring_2020}.
As all variables are categorical, we use the Hamming distance instead of the default Euclidean distance.
We can either use \fct{DataSimilarity} or \fct{OTDD}.
%
<<exOTDD, echo = FALSE, cache = TRUE>>=
# res.OTDD1 <- OTDD(artis, cadsu, target1 = "eparliv", target2 = "eparliv", 
#                   feature.cost = hammingDist) 
# res.OTDD2 <- OTDD(artis, ouvri, target1 = "eparliv", target2 = "eparliv", 
#                   feature.cost = hammingDist) 
# save(res.OTDD1, res.OTDD2, file = "tmpResOTDDVignette.RData")
load("tmpResOTDDVignette.RData")
@
<<exOTDDDS1, echo = TRUE, eval = FALSE, cache = TRUE>>=
DataSimilarity(artis, cadsu, method = "OTDD", target1 = "eparliv", 
               target2 = "eparliv", feature.cost = hammingDist) 
@
<<exOTDDDS2, echo = FALSE, cache = TRUE>>=
print(res.OTDD1)
@
<<exOTDD1, echo = TRUE, eval = FALSE, cache = TRUE>>=
OTDD(artis, cadsu, target1 = "eparliv", target2 = "eparliv", 
     feature.cost = hammingDist) 
@
<<exOTDD2, echo = FALSE, cache = TRUE>>=
print(res.OTDD1)
@
%
We obtain a dataset distance of \Sexpr{round(res.OTDD1$statistic, 3)} between craftsmen/shopkeepers/company directors and executives/higher intellectual professions.
For the OTDD, low values correspond to high similarity, and the minimum value is 0. 
The observed value is clearly larger than zero, so the datasets are not exactly similar. 
How dissimilar they are is however hard to interpret from the observed OTDD value on its own.
For the OTDD, no test is proposed and therefore, no $p$~value is calculated. 
<<exOTDDDS3, echo = TRUE, eval = FALSE, cache = TRUE>>=
DataSimilarity(artis, ouvri, method = "OTDD", target1 = "eparliv", 
               target2 = "eparliv", feature.cost = hammingDist) 
@
<<exOTDDDS4, echo = FALSE, cache = TRUE>>=
print(res.OTDD2)
@

<<exOTDD3, echo = TRUE, eval = FALSE, cache = TRUE>>=
OTDD(artis, ouvri, target1 = "eparliv", target2 = "eparliv", 
     feature.cost = hammingDist) 
@
<<exOTDD4, echo = FALSE, cache = TRUE>>=
print(res.OTDD2)
@
We obtain a dataset distance of \Sexpr{round(res.OTDD2$statistic, 3)} between craftsmen/shopkeepers/company directors and manual workers.
Again, this value on its own is hard to interpret.
However, we can compare the values and conclude that the data of craftsmen/shopkeepers/company directors is more similar to that of executives/higher intellectual professions than to that of manual workers. 

\section{Examples for case b): We are looking for a method to apply}\label{sec:demonstration_compare}
In the following, the individual steps of the COMPARE procedure for finding a suitable dataset similarity method are first described and than applied for the application of comparing two strategies of creating a synthetic dataset. 

\subsection{COMPARE steps}
\paragraph{Characteristics of datasets}
The following characteristics of the datasets for which similarity is assessed should be considered:
\begin{itemize}
  \item How many datasets are there to compare simultaneously?
  \item What are the scale levels of the variables in the datasets?
  \item Do the datasets include target variables that we wish to treat differently from the remaining covariates?
  \item Do the sample sizes of the datasets differ? 
  \item Does any of the datasets include fewer observations than variables?
\end{itemize}
If we have answers to all of these questions, we can proceed with the next step. 
The help pages of the methods also give a first overview for which datasets each method is applicable.

\paragraph{Objective of comparison}
We should define the objective of comparing the datasets. 
The main decision here is whether we want to perform a test or are only interested in measuring (dis)similarity. 
In the case where only or mainly the (dis)similarity is of interest, the interpretability of the statistic values might be desirable, especially boundedness.

\paragraph{Method requirements}
Next, we should should consider whether any theoretical properties (invariances to shift/scale/rotation, metric properties, consistency of test) are relevant in the given application. 
For example, invariances might be desirable if data will be transformed, metric properties are useful in applications relying on distances between datasets, like clustering the datasets, and consistency of the test might be desirable if the test result is of interest.

\paragraph{Production of method list}
After assessing the requirements, we can call the function \fct{findSimilarityMethod}. 
The arguments of that function correspond to the theoretical criteria of \citet{stolte_methods_2024}. 
For a full list, refer to the help pages in this package or the cited article. 
The most relevant criteria for the method choice are already covered by the first three steps in this procedure, so we can now call \fct{findSimilarityMethod} using the answers from above, which returns a list of methods that fulfill our requirements.

\paragraph{Assessment of computational complexity}
Another aspect to consider is the computational complexity of the methods. 
Different methods require different computational resources, both with regard to runtime and to memory consumption. 
These requirements also depend on the dataset characteristics, especially on the dimensions and number of datasets, but also, for example, in the case of categorical data, for some methods, on the number of categories.
Limits for the computational resources, especially for runtime, will also depend on the application. 
In applications where two or more concrete datasets are compared once, higher runtimes will likely be considered more tolerable than for running a simulation study where each method is applied at least hundreds of times.
Theoretical complexity is given in the corresponding column of \code{method.table} and will also be returned by \fct{findSimilarityMethod}, if \code{only.names} is set to \code{FALSE}.
In \citet{stolte_empirical_cat_2026} and \citet{stolte_empirical_num_2026}, the methods were additionally benchmarked regarding empirical runtime and memory requirements. 
The results of those benchmarks can be summarized as follows. 
For categorical data, the CM distance shows the lowest resource consumption for low numbers of variables and classes, but can no longer be computed if the number of classes or variables becomes too high, since an enumeration of the whole sample space is required for calculation. 
The edge count test (FR, CF, CCS, ZC) and the other graph-based methods, MMCM and Petrie, show the next lowest resource consumption. 
The classifier-based methods C2ST and HMN need considerably more runtime and memory. 
The two methods, GGRL and OTDD, that can also handle a target variable appropriately, show by far the highest resource consumption. 
For numeric data, the methods with the lowest resource consumption are the Engineer metric, Wasserstein distance, interpoint distance-based methods like Energy, DISCO, Bahr, and BF, as well as graph-based (SC, KMD) and kernel-based methods (MMD, GPK) consume the fewest resources. 
The highest resource consumption is observed for BMG, which requires computing the shortest Hamilton path. Again, the methods that can handle a target variable (NKT, GGRL, OTDD) or use a classifier (C2ST, YMRZL), but also Jeffreys divergence, DS, SH, and BG2.
Note that BG becomes unfeasible for high numbers of variables.
For more details and the complete ranking, see Sections 3.2.1 and 3.3.1 of \citet{stolte_empirical_num_2026}. 
It should be noted that the benchmark was performed with zero permutations for all methods. 
The runtimes of the permutation and Bootstrap tests will increase considerably with increasing the number of permutations. 
If the sample size is large and the runtime is limited, methods that offer asymptotic tests might be preferred. 
The information on which tests are available can be found in the method descriptions in the Details vignette as well as on the corresponding help pages. 
Runtime and memory requirements can, of course, additionally be evaluated in practice by simply applying the methods to the given data or by performing a small benchmark tailored to the data characteristics at hand. 

\paragraph{Ranking of methods}
Now that we reduced the set of methods to those that satisfy all practical requirements, we can choose from the remaining methods by considering their performance in simulation studies like those provided in \citet{stolte_empirical_cat_2026} and \citet{stolte_empirical_num_2026}. 
There, overall method rankings are provided, but also more detailed results and even decision rules taking into account specific dataset characteristics.
The overall results for categorical datasets can be summarized as follows. 
The edge count tests (FR, CCS, CF, ZC) often showed high performance for all considered alternatives, especially the FR using the 5-MST as the similarity graph and the union of graphs. 
The CM distance showed even better performance in some scenarios, but was infeasible for higher numbers of categories or variables.
The HMN performed okay for equal sample sizes but worse for unequal sample sizes. 
The C2ST performance depended on the scenario. 
Out of the methods that treat the target variable separately, the OTDD outperformed the GGRL, but both performed poorly compared to all other methods.

For two numeric datasets, the BF method performed best overall. 
The combination of BF (using FracB as the kernel function), ZC (using the 5-MST as the similarity graph and $\kappa=1$), DS, and GPK was able to cover 90\% of the considered scenarios with good performance in the two-sample case.
For the multi-sample case, the DISCO $F$-test (with $\alpha = 0.5$) performed best overall. 
Combined with SC (using the 5-MST and the statistic $S$), it was again able to cover 90\% of the considered scenarios with good performance for the multi-sample test. 

For more details, we refer to the original studies in \citet{stolte_empirical_cat_2026} and \citet{stolte_empirical_num_2026}, as a comprehensive discussion of the results for all methods is out of scope for this vignette.

\paragraph{Estimation and testing}
Lastly, we can apply the chosen method(s) to our datasets to estimate their (dis)similarity and potentially test whether the underlying distributions of the datasets coincide using the \fct{DataSimilarity} function and specifying the chosen method via the \code{method} argument. 
The interpretation of the results depend on the method. 
More details on this can be found on the corresponding help pages and in the literature that is referenced there.

\subsection{Illustration}
In the following, we illustrate the steps for choosing and applying a method as described in the previous subsection to a real-world example dataset in the setting of creating a realistic synthetic dataset. 
First, we start by attaching the \pkg{DataSimilarity} package. 
%
<<loadpackage, echo = TRUE>>=
library("DataSimilarity")
@

The \code{banque} dataset from the \pkg{ade4} package \citep{ade4} consists of bank survey data of 810 customers.
All variables are categorical and contain socio-economic information of the customers. 
Details on the variables can be found on the help page of the dataset \code{?banque}.
%
<<preparebanque1, echo = TRUE, cache = TRUE>>=
data("banque", package = "ade4")
head(banque)
summary(banque)
@
%
To ensure that the ordinal variables are handled correctly later on, we first transform them to \class{ordered} variables.
The nominal variables are kept as \class{factor}s.
%
<<preparebanque2, echo = TRUE, cache = TRUE>>=
ord.vars <- c("duree", "age", "soldevu", "eparlog", "eparliv", 
              "credcon", "retresp", "remiche", "preltre", "prelfin", 
              "viredeb", "virecre", "porttit")
banque[, ord.vars] <- lapply(banque[, ord.vars], ordered)
str(banque)
@
%
Assume that we want to create a synthetic dataset that is similar to this real dataset. 
We might consider different strategies and evaluate these with regard to the similarity of the synthetic dataset to the original one. 
First, we create the synthetic datasets. 
For this, we use two different approaches. 
The first one is to naively sample data for each variable using the observed frequencies as the class distributions. 
%
<<synth1, echo = TRUE, cache = TRUE>>=
naiveSynth <- function(v) {
  tab <- prop.table(table(v))
  n <- length(v)
  new.v <- sample(x = names(tab), size = n, prob = tab, replace = TRUE)
  if(is.ordered(v)) {
    new.v <- ordered(new.v, levels = levels(v))
  } else {
    new.v <- factor(new.v, levels = levels(v))
  }
  return(new.v)
}

set.seed(2026)
synth.n <- as.data.frame(lapply(banque, naiveSynth))
head(synth.n)
@
%
As a second approach, we use the \pkg{synthpop} package \citep{synthpop} to synthesize our data. 
This uses more sophisticated methods for synthesizing the data that for example also consider correlations between the variables. 
%
<<synth2, echo = TRUE, cache = TRUE>>=
library("synthpop")
synth.p <- syn(banque, seed = 2026, print.flag = FALSE)
head(synth.p$syn)
@
%
Now, we have two synthetic datasets and want to evaluate how closely they capture the real data. 
For this, we would like to use a dataset similarity method. 
To find a method for our application, we follow the steps described in the previous section. 

\paragraph{Characteristics of datasets}
For the pairwise comparison of our two synthetic datasets to the original one, our two datasets each consist of 21 categorical variables. 
The datasets do not contain any target variables.
The sample sizes are equal as we always created our synthetic data to match the original sample size of 810.
The sample size is also larger than the of variables. 

\paragraph{Objective of comparison}
We are interested in comparing the similarity of the synthetic datasets with the original dataset. 
Interpretability of the resulting similarity values would therefore be very helpful in this application.
Upper or lower bounds of the values would provide additional interpretability of the values themselves but are not essential for comparing the synthesizing strategies. 

\paragraph{Method requirements}
The invariances to transformations do not apply to categorical data and are therefore not relevant in this context.
The metric properties or consistency of corresponding tests are also not essential in this case. 

\paragraph{Production of method list}

Now that we know all of our requirements, we can call the \fct{findSimilarityMethod} function to find appropriate methods.
We specify all the criteria that we identified above, which are the applicability to categorical variables and interpretability of the resulting values.
<<findMethod2Cat, echo = TRUE, cache = TRUE>>=
findSimilarityMethod(Categorical = TRUE, Interpretable.units = TRUE)
@
This gives us a list of five methods that would fulfill our requirements.

\paragraph{Assessment of computational complexity} 
If we only want to make the two comparisons of the generated datasets from above, runtime and memory are no limiting factors. 
However, if we wanted to compare the two strategies for data synthesis more generally, we would want to generate more datasets with each strategy and compare each against the original \code{banque} dataset. 
Then, runtime might become a relevant factor. 
According to the benchmarks performed in \citet{stolte_empirical_cat_2026}, the graph-based methods \fct{FR\_cat} or \fct{Petrie} might be preferable to the other classifier-based methods \fct{C2ST}, \fct{HMN}, and \fct{YMRZL} in that case due to lower runtimes. 
Depending on the runtime limits and the desired number of dataset comparisons, these might even become the only feasible options. 
For this illustration we will, however, not exclude any of the methods due to their runtime. 

\paragraph{Ranking of methods} 
In the performance comparisons of the simulation study in \citet{stolte_empirical_cat_2026}, \fct{FR\_cat} outperformed the remaining methods in many of the considered scenarios.
Therefore, we choose to use it here. 

The Friedman-Rafsky (FR) test \citep{friedman_multivariate_1979} (also called original edge-count test) is a graph-based test. 
A similarity graph (in the original proposed form a minimum spanning tree, MST) is calculated on the pooled sample created by pooling the two datasets using an appropriate distance measure. 
Then, the edges connecting points from different datasets are counted. 
This between-sample edge-count $R_{12}$ can also be standardized
\[
T_{\text{FR}} = \frac{R_{12} - \E_{H_0}(R_{12})}{\sqrt{\VAR_{H_0}(R_{12})}}
\]
for testing. 
The expectation and variance under the null can be calculated analytically.
For small samples, a permutation test is proposed.
For large samples, the asymptotic standard normal distribution of the test statistic can be used. 
The idea of the test is that for similar datasets, the number of edges connecting points from different samples is expected to be higher than in datasets that differ. 
In categorical data, ties in the distance matrix are expected. 
In that case, the MST is no longer uniquely defined. 
This can be solved by calculating all optimal MST solutions and either calculating the test statistic as defined above on the union of these graphs, or on each graph individually and averaging the test statistic values over all optimal MST solutions \citep{zhang_graph-based_2019}. 

\paragraph{Estimation and testing}
We use the FR test with the recommended parameter settings from the simulations, which is using the union and using the 5-MST instead of the regular MST. 
The $K$-MST is defined as follows. 
First, the regular MST is calculated. 
Then, the edges used by that MST are removed from the edge set of the complete graph and a new MST is calculated using only the remaining edges.
For the third MST, the edges used by the first and second MST are removed and it is calculated again only using the remaining edges.
This is repeated until the $K$th MST was calculated.
Then, the $K$-MST is the union of all the resulting MSTs. 
This results in denser graphs which have shown better performance in practice \citep{zhu_limiting_2024}.

As a distance function, we use the Gower distance \fct{daisy} from the \pkg{cluster} package to combine appropriate distances for the nominal and ordinal variables present in the dataset.
Like all other methods, the test can be applied using the unified interface of the \fct{DataSimilarity} function.
The chosen graph and aggregation type are already the default settings in the implementations. 
The aggregation type could be changed using the \code{agg.type} argument and the graph type can be specified using the \code{graph.type} and \code{K} arguments.
We only have to specify the distance function using the \code{dist.fun} argument.
We first compare the original dataset and the naively generated one.
%
<<exFR.naive, echo = TRUE>>=
DataSimilarity(banque, synth.n, method = "FR_cat",
               dist.fun = function(x) cluster::daisy(x, metric = "gower"))
@  
%
The standardized test statistic value is very small considering the asymptotic normal distribution of the test statistic. 
This means that the observed edge count is considerably lower than what we would expect for data from the same distribution, which indicates notable differences between the datasets.
Consequently, the test rejects the null hypothesis that the datasets follow the same distribution. 
Therefore, we can conclude that our naive approach generated a dataset that is dissimilar from the original one. 
Next, we compare the original dataset and the one generated using \pkg{synthpop} to see whether this more sophisticated approach lead to a synthetic dataset closer to the original one. 
%
<<exFR.synthpop, echo = TRUE>>=
DataSimilarity(banque, synth.p$syn, method = "FR_cat",
               dist.fun = function(x) cluster::daisy(x, metric = "gower"))
@  
%
The standardized test statistic value is positive and close to zero. 
Therefore, the edge count is close to what would be expected when assuming that the datasets are generated from the same distribution. 
This is also reflected by the test not rejecting the null of equal distributions. 
So, as expected, the synthetic dataset generated by \pkg{synthpop} is clearly more similar to the original dataset than the one generated with the naive approach. 
In a real application, we would ideally synthesize the dataset repeatedly with both approaches and compare the resulting similarity values of all repetitions but this would go beyond the scope of this illustration.
Moreover, we would inspect the generated datasets in more detail, checking for example their marginal distributions, or correlation structures, looking for implausible observations, or depending on the application also taking into account other aspects like data privacy concerns.

\section*{Acknowledgments}
This work has been supported (in part) by the Research Training Group ``Biostatistical Methods for High-Dimensional Data in Toxicology'' (RTG 2624, Project P1) funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation - Project Number 427806116).\\
We would like to thank Nabarun Deb and Bodhisattva Sen for allowing us to use their \proglang{R} implementation of their test for the package. 
Moreover, we would like to thank David Alvarez-Melis, whose \proglang{Python} implementation of the OTDD was the basis for our \proglang{R} implementation.

<<reset, echo=FALSE, results=hide>>=
options(old)
par(op)
@

\bibliography{refs}

\end{document}
