\documentclass[nojss]{jss}

%% -- LaTeX packages and custom commands ---------------------------------------

%% recommended packages
\usepackage{orcidlink,thumbpdf,lmodern}

%% additional packages
%\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath, amsfonts, amssymb, bbm, bm, mathabx}
\usepackage{booktabs} 	
\usepackage{longtable} 
\usepackage{tabularx}
\usepackage{xltabular}
\usepackage{lscape}
\usepackage{fontawesome5}
\usepackage{tikz}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{makecell}
\renewcommand{\cellalign}{tl}
\renewcommand{\theadalign}{tl}

%% new custom commands
\newcommand{\class}[1]{`\code{#1}'}
\newcommand{\fct}[1]{\code{#1()}}

\newcommand{\dif}{\mathop{}\!\mathrm{d}}

%\VignetteIndexEntry{Details on methods and implementations}
%\VignetteDepends{rpart.plot, nnet, knitr}

%% For Sweave-based articles about R packages:
%% need no \usepackage{Sweave}
\SweaveOpts{engine=R, eps=FALSE, keep.source = TRUE}
<<preliminaries, echo=FALSE, results=hide>>=
op <- par(no.readonly = TRUE)
old <- options()
options(prompt = "R> ", continue = "+  ", width = 70, useFancyQuotes = FALSE)
library("knitr")
@


%% -- Article metainformation (author, title, ...) -----------------------------
\author{Marieke Stolte~\orcidlink{0009-0002-0711-6789}\\TU Dortmund University and LMU Munich
\And Luca Sauer\\TU Dortmund University \AND J\"org Rahnenf\"uhrer~\orcidlink{0000-0002-8947-440X} \\ TU Dortmund University \And Andrea Bommert~\orcidlink{0000-0002-1005-9351} \\ TU Dortmund University}
\Plainauthor{Marieke Stolte, Luca Sauer, J\"org Rahnenf\"uhrer, Andrea Bommert}


\title{\pkg{DataSimilarity}: Quantifying Similarity of Datasets and Multivariate Two- and $k$-Sample Testing}
\Plaintitle{DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- and k-Sample Testing}
\Shorttitle{\pkg{DataSimilarity} Package}

%% - \Abstract{} almost as usual
\Abstract{
Quantifying the similarity of two or more datasets is a common task in many applications in statistics and machine learning, such as two- or $k$-sample testing and meta- or transfer learning.
The \pkg{DataSimilarity} package contains a variety of methods for quantifying the similarity of datasets.
The package includes 40 methods of which 17 are implemented for the first time in \proglang{R}. 
The remaining are wrapper functions for methods with already existing implementations that unify and simplify the various input and output formats of different methods and bundle the methods of many existing \proglang{R} packages in a single package.
Here, we give details on the methods and implementations and show some application examples.
}

\Keywords{dataset similarity, two-sample testing, multi-sample testing}
\Plainkeywords{dataset similarity, two-sample testing, multi-sample testing}

\Address{
Marieke Stolte, Luca Sauer, J\"org Rahnenf\"uhrer, Andrea Bommert\\
Department of Statistics\\
TU Dortmund University\\
Vogelpothsweg 87\\
44227 Dortmund, Germany\\
\emph{and}\\
Marieke Stolte\\
Institute for Medical Information Processing, Biometry, and Epidemiology (IBE)\\
LMU Medizin\\
Ludwig-Maximilians-Universit\"at M\"unchen\\
Marchioninistrasse 15\\
81377 Munich, Germany\\
E-mail: \email{marieke.stolte@ibe.med.uni-muenchen.de}\\
}

\begin{document}
\SweaveOpts{concordance=FALSE}


\section{Methods}\label{sec:meth}

In the following, we describe the general setup in the two- or $k$-sample problem that most of the implemented methods have in common. Moreover, we discuss the selection of the implemented methods and present one example method for each application domain in more detail. 

\subsection[The two- and k-sample problem]{The two- and $k$-sample problem}
Most methods for quantifying the similarity of datasets are proposed in the literature as test statistics for two- or $k$-sample testing. 
For this, a dataset is seen as a sample from a set of random variables that follow some true underlying distribution. 
Often, the similarity or distance of these underlying distributions is estimated. 

In the following, we assume that at least two different datasets $X^{(1)}$ and $X^{(2)}$ are given consisting of $n_1$ and $n_2$ samples $X_{1}^{(1)},\dots, X_{n_1}^{(1)} \sim F_1$ and $X^{(2)}_1,\dots, X^{(2)}_{n_2} \sim F_2$, respectively.
We assume $X^{(1)}_i, X^{(2)}_j: \mathcal{X}\to \mathbb{R}^p \,\forall i\in\{1,\dots,n_1\}, j\in \{1,\dots,n_2\}$ and call the $p$ components of each sample features or variables.
The two-sample problem is defined as the testing problem  
\begin{equation}
H_0: F_1 = F_2 \text{ vs. } H_1: F_1\ne F_2.\label{two.sample.problem}
\end{equation}
This testing problem is sometimes also called testing for homogeneity of the two distributions.

In some cases, one of the variables in each dataset has the role of a target variable, which we will then denote as $Y$. 
While data in many domains (e.g., clinical applications) typically has a clearly defined endpoint, the literature on quantifying dataset similarity has mostly neglected the existence of target variables. 
Most dataset similarity methods (at least implicitly) assume that all variables can be treated the same way. 
However, there are some methods that can handle the target variable differently and take the relationship of the remaining variables with the target variable into account. 
Therefore, whenever necessary in the following, we distinguish between methods that can handle a target variable appropriately and methods that treat all variables in the same way.

Analogously to the two-sample problem, the $k$-sample or multi-sample problem is defined for $k\ge 2, k \in\mathbb{N}$, datasets $X^{(1)},\dots,X^{(k)}$ with sample sizes $n_i, i= 1, \dots, k$, as
\[
H_0: F_1 = F_2 = \dots = F_k \text{ vs. } H_1: \exists i\ne j\in\{1,\dots,k\}: F_i\ne F_j,
\]
where $F_i$ denotes the distribution from which each observation in the $i$th dataset is drawn. 

Each of the considered methods can be seen as a measure of similarity or distance between the $F_i$, $i = 1,\dots,k$.
Not all of these methods include a hypothesis test. 

We use the hat symbol to denote estimators.
We denote the pooled sample as $\{Z_1,\dots, Z_N\} = \{X_1^{(1)},\dots, X_{n_1}^{(1)}, \dots, X^{(k)}_1,\dots, X^{(k)}_{n_k}\}$, where $N = \sum_{i = 1}^k n_i$ is the total sample size.
Additionally, we assume that all $Z_i$ are distributed independently. 


\subsection{Selection of methods}
Previously, in a comprehensive literature review \citep{stolte_methods_2024}, 118 methods were described and divided into the ten classes:
\begin{enumerate}
\item Comparison of cumulative distribution functions, density functions, or characteristic functions,
\item Methods based on multivariate ranks,
\item Discrepancy measures for distributions,
\item Graph-based methods,
\item Methods based on inter-point distances,
\item Kernel-based methods,
\item Methods based on binary classification,
\item Distance and similarity measures for datasets,
\item Comparison based on summary statistics, and
\item Testing approaches.
\end{enumerate}
Moreover, the methods were compared with respect to 22 criteria judging their applicability, interpretability, and theoretical properties. 
The criteria are listed in Table~\ref{tab:theo.crit}.
Here, methods are selected from the literature review using the theoretical criteria to perform a pre-selection of promising methods.
The \pkg{DataSimilarity} package comprises 36 methods that fulfill at least one of the following properties:
\begin{enumerate}
\item The method is implemented in \proglang{R}.
\item The method is one of the top methods ordered by the highest number of fulfilled criteria and fulfills at least 11 criteria of the theoretical criteria, excluding the two consistency criteria.
\item The method is the best in its subclass in the theoretical comparison, and no other method from this subclass was chosen based on the first two criteria.
\end{enumerate}
To avoid preferring methods that define a test over methods that do not, and therefore can by definition not fulfill the consistency criteria, consistency is not counted for determining the top methods. 
We chose 11 as the cutoff for the number of fulfilled criteria, as this is the median number of fulfilled criteria (excluding consistency) for the already implemented methods, and it ensures that at least more than half of the criteria are fulfilled. 
All methods from the theoretical comparison fulfilling the above-mentioned properties are included except for the method of \citet{weiss_two-sample_1960}, for which no concrete test is defined but only the general idea. 
Moreover, that test lacks symmetry, i.e., the test result depends on the order in which the datasets are supplied, which is highly undesirable in practice.

\begin{table}[!tb]
  \centering
  \begin{tabular}{rp{16cm}}
  \toprule
  & Criterion\\
  \midrule
  \multicolumn{2}{c}{Applicability}\\
  \midrule
  1. & Is the method able to handle a target variable present in the datasets appropriately?\\
  2. & Is the method applicable to numeric datasets? \\
  3. & Is the method applicable to categorical datasets? \\
  4. & Is the method applicable to datasets with unequal sample sizes?\\
  5. & Is the method applicable to datasets with more variables than observations?\\
  6. & Is the method applicable to $k>2$ datasets at the same time?\\ 
  7. & Can the method be applied without the need of a training dataset?\\
  8. & Is the method defined without making further assumptions about the dataset characteristics or the underlying distributions?\\
  9. & Can the method be applied without choosing or tuning hyperparameters?\\
  10. & Is the method implemented?\\
  11. & What is the computational complexity of the method?\\
  \midrule
  \multicolumn{2}{c}{Interpretability}\\
  \midrule
  12. & Can a one unit increase in the statistic values that the method returns be interpreted?\\
  13. & Is there a known lower bound of the statistic values?\\
  14. & Is there a known upper bound of the statistic values?\\
  \midrule
  \multicolumn{2}{c}{Theoretical Properties}\\
  \midrule
  15. & Is the method invariant to rotating all datasets in the same way?\\
  16. & Is the method invariant to location change applied to all datasets in the same way?\\
  17. & Is the method invariant to change of scale applied to all datasets in the same way?\\
  18. & Is the result of the method $\ge0$ and zero iff the distributions of the datasets coincide (positive definite)?\\
  19. & Is the method symmetric (regarding the order of two datasets)?\\
  20. & Does distance given by the method fulfill the triangle inequality?\\
  21. & Is the corresponding two- or $k$-sample test consistent (for $N\to\infty$)?\\
  22. & Is the corresponding two- or $k$-sample test HDLSS consistent (for $p\to\infty$,)?\\
  \bottomrule
  \end{tabular}
  \caption{Criteria for theoretical method comparison \citep{stolte_methods_2024}.}\label{tab:theo.crit}
\end{table}

There are additional methods (DMMD, DFDA by \cite{kirchler_two-sample_2020}, and ME, SCF by \cite{chwialkowski_fast_2015, jitkrittum_interpretable_2016}) implemented in \proglang{Python}, but these are not compatible with current versions of \proglang{Python} and the used packages and can therefore no longer be run directly. 
As these methods also performed poorly in the theoretical comparison and other methods based on similar ideas (MMD) are implemented in this package, we did not take the effort to re-implement the methods in \proglang{R} from scratch.
The same holds for the block MMD \citep{zaremba_b_2022} for which a \proglang{MATLAB} implementation exists. 
Since it is just a block-wise estimation of the already implemented MMD, it is not included here.

Note that only nonparametric methods that are applicable to multivariate data, can be used as a measure of dataset (dis)similarity, and do not focus on comparing specific characteristics (like location or scale) were included here. 
There exist many additional methods for the univariate case, for specific distribution families or more specific test problems. 
A very general testing framework that includes many of these as special cases is, e.g., provided by the \pkg{coin} package \citep{coin}. 
Here, we kept to these restrictions since univariate data is in many applications an exception, and distributional assumptions are often hard to check in practice.


\clearpage
\section{Implementation overview}\label{sec:implementation}
In the following, we give an overview of the implemented methods and the general structure of the implementations.
Moreover, we discuss potential challenges in the application of the methods in practice.

\subsection{Implemented methods}
For the \pkg{DataSimilarity} package, existing implementations are used where possible.
If methods already have a name in the article where they were proposed or in the secondary literature, the corresponding functions are named after that, e.g., \fct{Wasserstein} for the Wasserstein distance \citep{vaserstein_markov_1969}, \fct{MMD} for the maximum mean discrepancy (MMD) \citep{gretton_kernel_2006}, or \fct{CMDistance} for the constrained minimum (CM) distance \citep{tatti_distances_2007}. 
Otherwise, the function names are composed of the first letters of the surnames of all authors of the article where the respective method was originally proposed, e.g., \fct{FR} for the Friedman-Rafsky test proposed by \citet{friedman_multivariate_1979}, or the full surname in case of a single author, e.g., \fct{Bahr} for the test proposed by \citet{bahr_ein_1996}. 
The input and output of the methods from different existing packages and of the newly implemented methods are unified.  
To achieve this, for some existing methods, it was sufficient to implement a wrapper calling the original function. 

In other cases, we re-implemented the method from scratch if the \proglang{R} package was archived and additional issues with the original implementation occurred. 
This was the case for the Di\-Pro\-Perm test \citep{wei_direction-projection-permutation_2016} for which the original implementation in the \pkg{diproperm} package \citep{diproperm} yields non-reproducible results.
Moreover, the implementations of the multi-sample cross-match test of \citet{petrie_graph-theoretic_2016} and the multi-sample Mahalanobis cross-match test (MMCM) of \citet{mukherjee_distribution-free_2022} in the \pkg{multicross} package \citep{multicross} could not be used due to the output format that made it impossible to access the test statistic and $p$~value.
More details on the new implementations compared to the aforementioned versions can be found in Section~\ref{app:technical}.

Table~\ref{tab:wrapper} gives an overview of all wrapper functions included in the package. 
For each method, the original implementation, the new function name, and the applicability to data with a target variable, numerical data, categorical data, and multiple samples are given. 
Note that the applicability statements refer to the specific implementation of the method.
Some of the methods are, in theory, applicable to a broader range of data types than those implemented.
More details on the methods and their implementation can be found in Section~\ref{app:technical}.

Table \ref{tab:new.funs} gives an overview of the newly implemented methods and their applicability. 
A few of these methods were already implemented in another programming language, as described in the implementation details in Sectioin~\ref{app:technical}. 

%\todo[inline]{table format}
%\begin{landscape}
%\clearpage
\begin{footnotesize}
\LTcapwidth=\textwidth
\begin{longtable}{p{3.4cm}p{4.2cm}p{2.4cm}p{0.4cm}p{0.4cm}p{0.6cm}p{0.6cm}p{0.6cm}}
% \toprule
% Method & Original function & New function & $y$ & Num & Cat & $k$>2\\
% \midrule
% \endhead
\toprule
Method & Original function & New function & $y$ & Num & Cat & $k$>2 & Sec.\\
\midrule
\endhead
\bottomrule
\endfoot
\bottomrule\\
\caption{Implemented wrapper functions. $y$: Can the method deal with a target variable in the dataset? Num: Is the method as implemented applicable to numeric data? Cat: Is the method as implemented applicable to categorical data? $k > 2$: Is the method as implemented applicable to more than two datasets at a time? Sec.: Reference to the corresponding section. \faTimes$^*$: Method is, in theory, applicable, but implementation does not work in this case. \faCheck$^*$: Implementation can be used, although this case is not described in the literature.}\label{tab:wrapper}\\
\endlastfoot
\citet{bahr_ein_1996} & \fct{cramer::cramer.test} \citep{cramer} & \fct{Bahr} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:energy} \\
Ball divergence \citep{pan_ball_2018} & \fct{Ball::bd.test} \citep{Ball} & \fct{BallDivergence} & \faTimes & \faCheck & \faTimes & \faCheck & \ref{app:technical:balldivergence}\\
\citet{baringhaus_rigid_2010} & \fct{cramer::cramer.test} \citep{cramer} & \fct{BF} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:energy}\\
Classifier Two-Sample Test \citep{lopez-paz_revisiting_2017} & \fct{Ecume::classifier\_test} \citep{Ecume} & \fct{C2ST} & \faTimes & \faCheck & \faCheck & \faCheck & \ref{app:technical:c2st}\\
\citet{chen_weighted_2018} & \makecell{\fct{gTests::g.tests}\\ \citep{gTests}} & \fct{CCS} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:edgecount}\\
\citet{chen_new_2017} & \makecell{\fct{gTests::g.tests}\\ \citep{gTests}} & \fct{CF} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:edgecount}\\
Cramér test \citep{baringhaus_new_2004} & \fct{cramer::cramer.test} \citep{cramer} & \fct{Cramer} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:energy}\\
DISCO \citep{rizzo_disco_2010} & \fct{energy::eqdist.test} \citep{energy} & \fct{DISCOB}, \fct{DISCOF} & \faTimes & \faCheck & \faTimes & \faCheck & \ref{app:technical:disco}\\
\makecell{Energy statistic\\ \citep{szekely_energy_2017}} & \fct{energy::eqdist.test} \citep{energy} & \fct{Energy} & \faTimes & \faCheck & \faTimes & \faCheck & \ref{app:technical:energy}\\
\citet{friedman_multivariate_1979} & \makecell{\fct{gTests::g.tests}\\ \citep{gTests}} & \fct{FR} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:edgecount}\\
\citet{chen_ensemble_2013} & \makecell{\fct{gTests::g.tests\_cat}\\ \citep{gTests}} & \fct{FR\_cat}, \fct{CF\_cat}, \fct{CCS\_cat}, \fct{ZC\_cat} & \faTimes & \faTimes & \faCheck & \faTimes & \ref{app:technical:edgecount_cat}\\
FS test \citep{paul_clustering-based_2022} & \makecell{\fct{HDLSSkST::FStest}\\ \citep{HDLSSkST}} & \fct{FStest} & \faTimes & \faCheck & \faTimes & \faCheck & \ref{app:technical:paul}\\
\citet{song_generalized_2021} & \makecell{\fct{kerTests::kertests}\\ \citep{kerTests}} & \fct{GPK} & \faTimes & \faCheck & \faTimes$^*$ & \faTimes & \ref{app:technical:gpk}\\
\citet{hediger_use_2021} & \fct{hypoRF::hypoRF} \citep{hypoRF} & \fct{HMN} & \faTimes & \faCheck & \faCheck & \faTimes & \ref{app:technical:hedinger} \\
\makecell{KMD \\ \citep{huang_kernel_2022}} & \makecell{\fct{KMD::KMD},\\ \fct{KMD::KMD\_test}\\ \citep{KMD}} & \fct{KMD} & \faTimes & \faCheck & \faTimes$^*$ & \faCheck & \ref{app:technical:kmd}\\
Maximum Mean Discrepancy (MMD) \citep{gretton_kernel_2006} & \fct{kernlab::kmmd} \citep{kernlab} & \fct{MMD} & \faTimes & \faCheck & \faTimes$^*$ & \faTimes & \ref{app:technical:mmd}\\
\citet{mukhopadhyay_nonparametric_2020} & \fct{LPKsample::GLP} \citep{LPKsample} & \fct{MW} & \faTimes & \faCheck & \faTimes$^*$ & \faCheck & \ref{app:technical:lp} \\
\makecell{RISE\\ \citep{zhou_new_2023}} & \fct{GraphRankTest::RISE} \citep{GraphRankTest} & \fct{RISE} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:rise}\\
RI test \citep{paul_clustering-based_2022} & \makecell{\fct{HDLSSkST::RItest}\\ \citep{HDLSSkST}} & \fct{RItest} & \faTimes & \faCheck & \faTimes & \faCheck & \ref{app:technical:paul}\\
\makecell{Cross-match test\\ \citep{rosenbaum_exact_2005}} & \fct{crossmatch::crossmatch} \citep{crossmatch} & \fct{Rosenbaum} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:rosenbaum}\\
\citet{song_new_2022} & \fct{gTestsMulti::gtestsmulti} \citep{gTestsMulti} & \fct{SC} & \faTimes & \faCheck & \faTimes & \faCheck & \ref{app:technical:SC}\\
Wasserstein distance & \fct{Ecume::wasserstein\_permut} \citep{Ecume} & \fct{Wasserstein} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:wasserstein}\\
\citet{zhang_graph-based_2019} & \makecell{\fct{gTests::g.tests}\\ \citep{gTests}} & \fct{ZC} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:edgecount}\\
\end{longtable}
%\end{landscape}
\end{footnotesize}

%\begin{landscape}
\begin{table}[!h]
\centering
\begin{tabular}{p{6cm}p{3.2cm}p{0.4cm}p{0.8cm}p{0.7cm}p{1cm}p{1cm}} 
\toprule
Method &  New function & $y$ & Num & Cat & $k$>2 & Sec.\\
\midrule
\citet{biau_asymptotic_2005} & \fct{BG} & \faTimes & \faCheck & \faTimes & \faCheck & \ref{app:technical:biau}\\
\citet{biswas_nonparametric_2014} & \fct{BG2} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:bg} \\
\citet{biswas_distribution-free_2014} & \fct{BMG} & \faTimes & \faCheck & \faTimes & \faCheck & \ref{app:technical:bmg}\\
\citet{barakat_multivariate_1996} & \fct{BQS} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:barakat} \\
Constrained Minimum Distance \citep{tatti_distances_2007} & \fct{CMDistance} & \faTimes & \faTimes & \faCheck & \faTimes & \ref{app:technical:cmd} \\
DiProPerm test \citep{wei_direction-projection-permutation_2016} & \fct{DiProPerm} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:diproperm}\\
\citet{deb_multivariate_2021} & \fct{DS} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:rank_energy} \\
Engineer metric & \fct{engineerMetric}& \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:engineer}\\
\citet{ganti_framework_1999} & \fct{GGRL} & \faCheck & \faCheck & \faTimes$^*$ & \faTimes & \ref{app:technical:decision_tree}\\
Jeffreys divergence & \fct{Jeffreys} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:jeffreys}\\
\citet{li_measuring_2022} & \fct{LHZ} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:lhz}\\
\citet{mukherjee_distribution-free_2022} & \fct{MMCM} & \faTimes & \faCheck & \faCheck$^*$ & \faCheck & \ref{app:technical:crossmatch} \\
\citet{ntoutsi_general_2008} & \fct{NKT} & \faCheck & \faCheck & \faTimes & \faTimes & \ref{app:technical:decision_tree}\\
\citet{alvarez-melis_geometric_2020} & \fct{OTDD} & \faCheck & \faCheck & \faCheck & \faTimes & \ref{app:technical:otdd}\\
\citet{petrie_graph-theoretic_2016} & \fct{Petrie} & \faTimes & \faCheck & \faCheck$^*$ & \faCheck & \ref{app:technical:crossmatch}\\
\citet{schilling_multivariate_1986} and \citet{henze_multivariate_1988} & \fct{SH} & \faTimes & \faCheck & \faTimes & \faTimes & \ref{app:technical:sh}\\
\citet{yu_two-sample_2007} & \fct{YMRZL} & \faTimes & \faCheck & \faCheck & \faTimes & \ref{app:technical:yu} \\
%Block MMD \citep{zaremba_b-test_2013} & \fct{blockMMD} & \faTimes & \faCheck & \faTimes$^*$ & \faTimes \\
\bottomrule
\end{tabular}
\caption{Newly implemented functions. $y$: Can the method deal with a target variable in the dataset? Num: Is the method as implemented applicable to numeric data? Cat: Is the method as implemented applicable to categorical data? $k > 2$: Is the method as implemented applicable to more than two datasets at a time? Sec.: Reference to the corresponding section. \faTimes$^*$:  Method is, in theory, applicable, but implementation does not work in this case. \faCheck$^*$: Implementation can be used, although this case is not described in the literature.}
\label{tab:new.funs}
\end{table}



\subsection{Specific data types} %todo: find better name
Note that most implementations listed in Table~\ref{tab:wrapper} and Table~\ref{tab:new.funs} are only applicable to either numerical or categorical data. 
Exceptions to this are the classifier-based methods \fct{HMN} and \fct{C2ST}, which can handle both data types simultaneously as long as the selected classifier can do so. 
The \fct{MMD} implementation can also handle both data types, but a matching kernel function has to be implemented.
For the graph-based tests (FR, CF, CCF, ZC, MMCM, Petrie) as well as for OTDD, an appropriate distance function has to be supplied.
For FR, CF, CCF, and ZC, depending on whether there are ties or not in the distances, either the implementation for numeric data (in case of no ties) or for categorical data (in case of ties) should be selected.
Some of the other methods might also be applicable if no ties are present in the distance matrix of the datasets. 
A detailed list of which methods are applicable in which cases is also available on the help page \code{?DataSimilarity}.

The restriction to either numeric or categorical data can be a limitation in real-world applications, where data is often mixed-scaled.
It is, however, inherited from the underlying literature, where the mixed-scale case is rarely discussed, and data is typically (at least implicitly) assumed to be either numeric or categorical. 
In some cases, the choice of an appropriate distance or kernel function is left up to the user, which allows the use of one that is appropriate for mixed-scale datasets.
However, to our knowledge, there are no reliable empirical or theoretical results on the performance of the methods for that case yet.

Similarly, the literature mostly only distinguishes between numeric (continuous) and categorical data without a further distinction between nominal and ordinal data in the categorical case. 
A distinction between ordinal and nominal data might, however, also be critical for the choice of appropriate distance or kernel functions or other parameters of the method, as for example demonstrated by a simulation study using categorical data \citep{stolte_empirical_cat_2026}. 
There, for one scenario, data was generated with five classes and the original implementations for \fct{MMCM} and \fct{Petrie} were used, which only allowed the Euclidean distance as the distance function. 
It could be shown that with this choice, the methods were considerably more sensitive in detecting alternatives, where the probabilities increased with increasing class labels, than for increasing the probability for one class and decreasing that of the neighboring class accordingly. 
This might be desirable if data is assumed to be ordinal, as then a change that only affects neighboring classes can be seen as minor compared to the overall change.
But if we assume the classes to be unordered, we should select a distance function that treats changes in the probability of the classes independently of the assigned class labels. 

Another issue that is common in practice but not well discussed in the related literature is missing values. 
To our knowledge, there are no recommendations specific to any of the implemented methods on how to handle missing values.
The implementations of the \pkg{DataSimilarity} package omit missing values but warn the user in that case.
Depending on the application, users might consider alternative strategies for handling missing values like supplying adapted distance or graph functions, or using imputation before comparing the datasets.

\subsection{In- and output formats}
In the implementation within the \pkg{DataSimilarity} package, each method gets two (or more) datasets as its first input parameters. 
After that, arguments specific to the method follow. 
For example, many methods perform a permutation test for which the number of permutations (\code{n.perm}) has to be specified.
The defaults for those parameters of the methods are chosen based on the simulation results of \citet{stolte_empirical_cat_2026} and \citet{stolte_empirical_num_2026}. 
Some of the existing implementations already include setting a random seed, and some do not.
Therefore, for uniformity, the new methods all include a random seed argument. 
The default for this seed is always set to \code{NULL}, which corresponds to not setting a new seed. 
If instead a numeric value is supplied, the random seed is set to the supplied value for reproducibility. 

The output of each function is of class \class{htest} and includes 
\begin{itemize}
\item \code{statistic}: The test statistic
\item \code{parameter} (optional): A parameter specifying the null distribution (e.g., degrees of freedom for a $\chi^2$ distribution).
\item \code{p.value}: The $p$~value (if an asymptotic or permutation / bootstrap test is performed).
\item \code{estimate}: The sample estimate(s) (if available, e.g., the unstandardized edge count for edge-count tests, \code{NULL} for many methods).
\item \code{alternative}: The alternative hypothesis. For two datasets, this is $F_1 \ne F_2$, for $k$ datasets it is $\exists i\ne j\in\{1,\dots,k\}: F_i\ne F_j$.
\item \code{data.name}: Names of the supplied datasets.
\item Further elements specific to the method (optional), e.g., the variable importances for the test of \citet{hediger_use_2021}.
\end{itemize}
We use the \class{htest} class as it is widely adopted for storing results of hypothesis tests in \proglang{R}, and most of the implemented methods are two- or $k$-sample tests. 
Objects of class \class{htest} will be automatically printed in an appealing format using the \fct{print.htest} function from the \pkg{stats} package.
For methods for which no test is performed, the \code{p.value} is set to \code{NULL}. 
This allows pretty printing of the results and a unified output format for the corresponding functions.
For many of the newly implemented permutation tests, we use the \fct{boot} function from the \pkg{boot} package that is included in \proglang{R} for implementing the permutation.

There is one exception from the unified output format for the \fct{DISCOF} function.
In that case, we decided to keep the original output format of class \class{disco} that is structured similarly to the remaining \class{htest} objects. 
The reason for this was that a specific printing method exists for \class{disco} objects that displays the results in an ANOVA table style, which we considered more appealing than the normal \class{htest} formatting in this specific case. 
The test statistic and $p$~value can be accessed as for the \class{htest} objects, such that we do not see any disadvantages for the uniform usability.
Moreover, the three convenience functions \fct{gTests}, \fct{gTestsMulti}, \fct{kerTests}, and the multiscale versions of \fct{FStest} and \fct{RItest} return multiple test statistics and therefore the output is a list of the same structure but not of class \class{htest}.

In typical applications, users should choose a test a priori and not based on test results. 
Therefore, the new functions perform exactly one test and return only the results corresponding to that single test. 
Some of the existing implementations used to perform multiple tests based on the same metrics or always returned the asymptotic $p$~value in addition to a permutation $p$~value.
This could lead to unscientific practices like choosing the test based on the desired result. 
As an exception, for implementations that output multiple related tests, we offer wrapper functions that also perform these multiple tests (see the above-listed convenience functions). 
Often, conducting them at once is computationally faster than performing each test individually when large parts of the calculation are the same. 
This option might be useful in certain situations where multiple tests need to be applied to the same data, e.g., when performing method comparison studies.
The wrapper functions return lists structured in the same way as the \class{htest} objects that, instead of the single test statistic and $p$~value include vectors of test statistics and $p$~values.
We advise against applying multiple related tests for the same hypothesis and deciding on one test afterwards when conducting inference for a specific real-life application.
The next sections will provide some helpful considerations for the method choice in practice.

\subsection[The DataSimilarity() and findSimilarityMethod() functions]{The \fct{DataSimilarity} and \fct{findSimilarityMethod} functions}
To facilitate the choice and usage of an appropriate method, two convenience functions, \fct{findSimilarityMethod} and \fct{DataSimilarity}, are supplied. 
\fct{findSimilarityMethod} can be used to find the set of methods that satisfy all given requirements on the applicability of the methods. 
The \fct{DataSimilarity} function can be used to call each method by supplying the method name via the \code{method} argument. 
Like the functions for the individual methods, it takes two or more datasets as its first input. 
Moreover, the \code{method} argument has to be supplied to specify the method, and arguments specific to the method can be passed. 

The arguments of \fct{findSimilarityMethod} specify which of the 22 theoretical criteria (see Section~\ref{sec:meth}, \cite{stolte_methods_2024}) must be fulfilled.
These can be passed to the function by setting the corresponding arguments to \code{TRUE}. 
\fct{findSimilarityMethod} then returns by default the function names for all implemented and suitable methods. 
By setting \code{only.names = FALSE}, the full information on which criteria the method fulfills can be retrieved. 

 
These convenience functions are useful in situations where users have datasets at hand that they want to compare, but do not know which method to use, or if users want to conveniently compare multiple methods. 
In the case where users want to apply a specific method, they can also use the corresponding function directly and find additional information on that method in the last section of this vignette.
The typical workflow for using the \fct{findSimilarityMethod} and \fct{DataSimilarity} functions is demonstrated in the next section and also described by a `Getting Started' vignette within the package.


\section{Method choice} \label{sec:meth.choice}
There are different aspects to take into account when choosing a suitable method. 
In the following, we give a roadmap for the method choice using the \pkg{DataSimilarity} package, and specifically the \fct{findSimilarityMethod} and \fct{DataSimilarity} functions.
The description here is kept general and illustrated in the following section for a concrete example.
Table~\ref{tab:steps.meth.choice} gives a compact step-by-step overview of the proposed COMPARE procedure (\textbf{C}haracteristics of datasets, \textbf{O}bjective of comparison, \textbf{M}ethod requirements, \textbf{P}roduction of method list, \textbf{A}ssessment of computational complexity, \textbf{R}anking of methods, \textbf{E}stimation and testing) for finding the best-suited method.
Each step is described in more detail in the following.
The first three steps are guiding through all considerations necessary for specifying the arguments of the \fct{findSimilarityMethod} function which is applied in step 5., and step 6.\ and 7.\ specify how to get from the output of \fct{findSimilarityMethod} to the final result.
An illustration and further details on the individual steps can be found in the Getting Started vignette.

\begin{table}[!tb]
\centering
\begin{tabular}{lp{5cm}p{9cm}}
			\toprule
			& Step & Description \\
			\midrule
			1. & Characteristics of datasets & Evaluate number of datasets, scale level of variables, presence of target variable, dimensions of datasets\\
			2. & Objective of comparison & Is the focus mainly on a test decision, or is the (dis)similarity value itself of interest and should be interpretable?\\
			3. & Method requirements & Are additional properties of the method regarding invariances, metric properties, or consistency of a test of relevance?\\
			4. & Production of method list & Call \fct{findSimilarityMethod} with the criteria identified in steps 1.--3.\\
			5. & Assessment of computational complexity & Are the identified methods feasible with regard to their runtime and memory consumption?\\
			6. & Ranking of methods & Compare the remaining methods regarding their performance in simulation studies and select the best\\
			7. & Estimation and testing & Apply the chosen method(s)\\
			\bottomrule
		\end{tabular}
		\caption{Step-by-step procedure for choosing the best method for a given application.}\label{tab:steps.meth.choice}
\end{table}


\clearpage
\section{Implementation details} \label{app:technical}
\subsection[Bahr(), BF(), Energy()]{\fct{Bahr}, \fct{BF}, \fct{Energy}} \label{app:technical:energy}
The energy statistic is a popular two- and $k$-sample statistic based on interpoint distances. 
The $k$-sample statistic is defined as 
\begin{align*}
T_\text{Energy} = &\sum_{1\le i<j\le k} \frac{n_i n_j}{n_i + n_j} \left(\frac{2}{n_i n_j}\sum_{u = 1}^{n_i}\sum_{v = 1}^{n_j} \|X^{(i)}_u - X^{(j)}_v\|_2 \right.\\
& \left.- \frac{1}{n_i^2} \sum_{u = 1}^{n_i}\sum_{v = 1}^{n_i} \|X^{(i)}_u - X^{(i)}_v\|_2 - \frac{1}{n_j^2} \sum_{u = 1}^{n_j}\sum_{v = 1}^{n_j} \|X^{(j)}_u - X^{(j)}_v\|_2\right).
\end{align*}
For a comprehensive review of the literature on the energy statistic and its applications, please refer to \citet{szekely_energy_2017}.
A permutation test can be performed based on the energy statistic. 
In the two-sample case, the energy statistic is equal to two times the Cramér test statistic of \citet{baringhaus_new_2004}, and therefore, the tests are equivalent.
However, a bootstrap instead of a permutation test is proposed for the Cramér test.
\citet{baringhaus_rigid_2010} propose a test statistic that generalizes the Cramér test statistic by using a continuous function $\phi$ such that $\phi(\|x-y\|^2)$ is a negative definite kernel instead of the Euclidean distances.
Different examples for $\phi$ are given, including as special cases the Cramér test, the test by \citet{bahr_ein_1996}, and the test by \citet{szabo_variable_2002}. 
Overall, $\phi(z) = \log(1 + z)$ is recommended for general alternatives based on a simulation study, and the Cramér test is recommended for location alternatives.
The tests of \citet{baringhaus_rigid_2010} are implemented in the \pkg{cramer} package \citep{cramer}. 
The new implementation is a simple wrapper to unify input and output naming and types. 
The energy statistic is implemented in the \pkg{energy} package \citep{energy}. 
For the corresponding wrapper, the input type was changed to a greater extent since the original implementation had the pooled sample and the sample sizes as the input. 
The \pkg{energy} implementation outsourced the calculation of the energy statistic to \proglang{C}, which gives it a notable advantage with regard to computing time over the \pkg{cramer} implementation.

\subsection[BallDivergence()]{\fct{BallDivergence}} \label{app:technical:balldivergence}
The Ball divergence \citep{pan_ball_2018} measures the difference between two probability measures.
It is defined as the square of the measure difference over a given closed ball collection. It can be estimated as
\[
\widehat{\text{BD}} = A + C,
\]
where
\begin{align*}
A &= \frac{1}{n_1^2} \sum_{i,j = 1}^{n_1} \left(A_{ij}^{(1)} - A_{ij}^{(2)}\right)^2, \\
C &= \frac{1}{n_2^2} \sum_{l,m = 1}^{n_2} \left(C_{lm}^{(1)} - C_{lm}^{(2)}\right)^2,
\end{align*}
and
\begin{align*}
A_{ij}^{(1)} &= \frac{1}{n_1}\sum_{u= 1}^{n_1} \mathbbm{1}(X_u^{(1)} \in \bar{B}(X_i^{(1)}, d(X_i^{(1)}, X_j^{(1)}))), \\
A_{ij}^{(2)} &= \frac{1}{n_2}\sum_{v= 1}^{n_2} \mathbbm{1}(X_v^{(2)} \in \bar{B}(X_i^{(1)}, d(X_i^{(1)}, X_j^{(1)}))),\\
C_{lm}^{(1)} &= \frac{1}{n_1}\sum_{u= 1}^{n_1} \mathbbm{1}(X_u^{(1)} \in \bar{B}(X_l^{(2)}, d(X_l^{(2)}, X_m^{(2)}))), \\
C_{lm}^{(2)} &= \frac{1}{n_2}\sum_{v= 1}^{n_2} \mathbbm{1}(X_v^{(2)} \in \bar{B}(X_l^{(2)}, d(X_l^{(2)}, X_m^{(2)}))),\\
\end{align*}
with $\bar{B}(X_i^{(l)}, d(X_i^{(l)}, X_j^{(l)}))$ denoting the closed Ball around $X_i^{(l)}$ with radius equal to the distance $d$ of the points $X_i^{(l)}$ and $X_j^{(l)}, l\in\{1, 2\}$.
Therefore, the first part of the Ball divergence, $A$, consists of squared distances of proportions of data points from the first sample lying within closed balls around data points from the first sample and of data points from the second sample lying within closed balls around data points from the first sample.
The second part, $C$, consists of squared distances of proportions of data points from the first sample lying within closed balls around data points from the second sample and of data points from the second sample lying within closed balls around data points from the second sample.
For both parts, the mean over all such Balls with radii equal to the distances of the center point of the ball to all other points from the same sample is taken. 
For multiple samples, the pairwise test statistics can be summarized by summing up the pairwise divergences, or by taking the maximum of sums of the Ball divergences from each sample to all other samples, or by summing up the largest $k-1$ pairwise Ball divergences.\\
The implementation here is a wrapper around the \fct{bd.test} function from the \pkg{Ball} package \citep{Ball}.
In contrast to the original implementation, the new wrapper returns an object of class \class{htest} in the multi-sample case, although, in that case, no test is conducted.
Moreover, only the summarized statistic according to the specified \code{kbd.type}, which determines how the pairwise Ball divergences are summarized, is returned.

\subsection[C2ST()]{\fct{C2ST}} \label{app:technical:c2st}
The general idea of \citet{lopez-paz_revisiting_2017} is to use a classifier to determine which of two or more datasets an observation belongs to. 
The \textit{classifier two-sample test (C2ST)} uses the classification accuracy of this classifier as its test statistic. \\
The C2ST consists of five steps: 
\begin{enumerate}
\item Construct the dataset consisting of the samples from all datasets labeled with their membership to each dataset.
\item Assign the observations of the dataset constructed in 1.\ randomly to a training and test set.
\item Train a classifier that predicts to which of the datasets $X^{(j)}, j = 1,\dots,k$, an observation belongs.
\item Calculate the C2ST statistic, which is the accuracy on the test set. The accuracy should be close to the chance level for $F_1 = \dots = F_k$, and it should be greater than the chance level if $\exists i\ne j\in\{1,\dots,k\}: F_i \ne F_j$ since in the latter case the classifier should identify distributional differences between the samples.
\item Calculate a $p$~value using a binomial test for comparing the accuracy to the chance level.
\end{enumerate}
Maximizing the power of a C2ST is a trade-off between using a large training set to optimize the classifier and a large test set to better evaluate the performance of the classifier. \\
The test statistic is interpretable as the percentage of samples that are correctly classified on the unseen test data. 
The above-mentioned test of \citet{hediger_use_2021} can be seen as a special case of the general framework proposed by \citet{lopez-paz_revisiting_2017}. 
One difference in the implementation of the tests is that for the C2ST, categorical data is dummy coded, while for the test of \citet{hediger_use_2021} the categorical variables are passed to \fct{ranger::ranger} directly. 
Moreover, the use of OOB predictions and feature importance is specific to the random forest-based test and cannot be used for all of the available classifiers for the C2ST.
Further, the C2ST uses the accuracy as its test statistic, while the test of \citet{hediger_use_2021} uses the classification error, i.e., $1 - \text{accuracy}$. 

For the C2ST, the classifier can be specified by the user and defaults to $K$-nearest neighbors.
Possible options are all models accepted by \fct{caret::train}.
For a list of these classification models, call e.g. %\\
%   \texttt{names(getModelInfo())[sapply(getModelInfo(),\\ function(x) "Classification" \%in\% x\$type)]}.
<<methods, echo=TRUE, results=hide>>=
names(caret::getModelInfo())[sapply(caret::getModelInfo(), function(x) {
  "Classification" %in% x$type
})]
@

\subsection[CCS(), CF(), FR(), ZC()]{\fct{CCS}, \fct{CF}, \fct{FR}, \fct{ZC}} \label{app:technical:edgecount}
The tests by \citet{friedman_multivariate_1979}, \citet{chen_new_2017}, \citet{chen_weighted_2018}, and \citet{zhang_graph-based_2019} are graph-based two-sample tests that use the edge counts in a similarity graph like the ($K$-)MST on the pooled sample. 
They make use of the number of edges that connect points within the first sample, $R_1$, the number of edges that connect points within the second sample, $R_2$, and the number of edges that connect points from different samples $R_{12}$.
The original edge-count test by \citet{friedman_multivariate_1979} takes the standardized between-sample edge-count
\[
T_{\text{FR}} = \frac{R_{12} - \E_{H_0}(R_{12})}{\sqrt{\VAR_{H_0}(R_{12})}}
\]
as its test statistic. The expectation and variance under the null can be calculated analytically. 
\citet{chen_new_2017} noted that this has low power against scale alternatives and proposed the \emph{generalized edge-count test} using
\[
T_{\text{CF}} = \left(R_1 - \E_{H_0}(R_1), R_2 - \E_{H_0}(R_2)\right)\COV^{-1}_{H_0}\left(\begin{pmatrix} R_1\\ R_2\end{pmatrix}\right) \begin{pmatrix} R_1 - \E_{H_0}(R_1)\\ R_2 - \E_{H_0}(R_2)\end{pmatrix}.
\]
\citet{chen_weighted_2018} found some problems with the original edge-count test for unequal sample sizes of the two datasets, based on which they proposed the \emph{weighted edge-count test} using the weighted edge-counts
\[
R_w = \frac{n_1}{N} R_1 + \frac{n_2}{N} R_2,
\]
where $n_1$ denotes the sample size of the first dataset, $n_2$ the sample size of the second dataset, and $N = n_1 + n_2$ the total sample size in the pooled sample. 
The \emph{weighted edge-count test} statistic is then defined as the standardized weighted edge count
\[
T_{\text{CCS}} = \frac{R_{w} - \E_{H_0}(R_{w})}{\sqrt{\VAR_{H_0}(R_{w})}}.
\]
Lastly, the \emph{max-type edge count} \citep{zhang_graph-based_2019} test (based on the method of \citet{chu_asymptotic_2019}) additionally uses the difference of the edge counts in the samples, i.e.,
\[
R_d = R_1 - R_2.
\]
Its test statistic is defined as
\[
T_{\text{ZC}} = \max\left(\kappa \frac{R_{w} - \E_{H_0}(R_{w})}{\sqrt{\VAR_{H_0}(R_{w})}}, \left|\frac{R_{d} - \E_{H_0}(R_{d})}{\sqrt{\VAR_{H_0}(R_{d})}}\right|\right),
\]
where $\kappa$ is a constant that has to be chosen before performing the test.
$\kappa\in\{1, 1.14, 1.31\}$ is recommended based on a small power simulation for normal data with shift or scale alternatives.\\
Wrapper functions around \fct{g.tests} from the \pkg{gTests} package \citep{gTests} are implemented.
These do not need a pre-calculated graph as input but allow specifying a distance function (\code{dist.fun}) and a function for calculating a similarity graph (\code{graph.fun}) and then calculating the similarity graph internally.
The new input also includes both datasets.
We find this more intuitive and less error-prone than supplying an edge matrix and two vectors of indices specifying the dataset membership as for the original \fct{g.tests} function.
The new implementation forces the user to choose one of the tests first and then perform it, instead of performing all tests at once.
Moreover, the users have to decide whether they want to perform the permutation test or the approximate test.

For the Friedman-Rafsky test, there is an additional implementation in the \pkg{GSAR} package \citep{GSAR}, but there, the test statistic is standardized by the empirical mean and standard deviation rather than the theoretical mean and standard deviation of the test statistic under the null hypothesis as proposed in the original article.
Therefore, we use the \pkg{gTests} implementation here. 

\subsection[CCScat(), CFcat(), FRcat(), ZCcat]{\fct{CCS\_cat}, \fct{CF\_cat}, \fct{FR\_cat}, \fct{ZC\_cat}} \label{app:technical:edgecount_cat}
These methods are adaptations of the previously mentioned edge-count tests for categorical data. 
With categorical data, the problem of ties in the distance matrix arises. 
Ties lead to non-unique solutions for the similarity graph construction and, therefore, also to non-unique values of the proposed test statistics. 
This can be solved by either taking the union of all optimal graphs and calculating the respective statistic on this union graph, or by averaging the test statistic values over all optimal graphs.
The new implementation of the categorical graph-based tests is again a wrapper function that includes the calculation of the edge matrix.
For this, the function \fct{getGraph} from the \pkg{gTests} package is used.
Therefore, the choice of the similarity graph is restricted to the $K$-nearest neighbors and the $K$-MST.
Still, a distance function can be supplied.
By default, this is the sum of unequal classes.
The calculation of the frequency table of all observations and the similarity graph on this are performed internally; thus, again, only the datasets have to be supplied by the user.
Moreover, the method for aggregating the graphs has to be supplied.
Possible options are averaging (\code{``a''}) and union (\code{``u''}) over graphs.

\subsection[DISCOB(), DISCOF()]{\fct{DISCOB}, \fct{DISCOF}} \label{app:technical:disco}
\citet{rizzo_disco_2010} show that the energy test can be seen as the treatment sum of squares in an ANOVA interpretation of the $k$-sample problem. 
As the measure of dispersion for univariate or multivariate responses based on all pairwise distances between sample elements for ANOVA
\[
d_{\alpha}(X^{(i)}, X^{(j)}) = \frac{n_i n_j}{n_i + n_j} [2g_{\alpha}(X^{(i)}, X^{(j)}) - g_{\alpha}(X^{(i)}, X^{(i)}) - g_{\alpha}(X^{(j)}, X^{(j)})]
\] 
is proposed with 
\[
g_{\alpha}(X^{(i)}, X^{(j)}) = \frac{1}{n_i n_j}\sum_{u = 1}^{n_i}\sum_{v=1}^{n_j} \|X^{(i)}_u - X^{(j)}_v\|_2^{\alpha}, i,j\in\{1, \dots, k\}.
\]
With this, \citet{rizzo_disco_2010} derive their so-called \textit{distance components (DISCO) decomposition} for $\alpha\in (0,2]$.
It partitions the total dispersion in the samples 
\[
T_{\alpha} = \frac{N}{2} \sum_{i,j = 1}^{N} g_{\alpha}(X^{(i)}, X^{(j)}), 
\]
into components
\[
T_{\alpha} = S_{\alpha} + W_{\alpha}
\]
analogous to the variance components in ANOVA.
%Here, $Z$ denotes the pooled sample. 
The between-sample measure of dispersion $S_{\alpha}$ and the within-sample measure of dispersion $W_{\alpha}$, respectively, are defined as
\begin{align*}
S_{\alpha} &= \sum_{1\le i < j \le k} \frac{n_i + n_j}{2N} d_{\alpha}(X^{(i)}, X^{(j)}), \\
W_{\alpha} &= \sum_{i = 1}^k \frac{n_i}{2} g_{\alpha}(X^{(i)}, X^{(i)}).
\end{align*}
The between-sample measure of dispersion $S_{\alpha}$ can be used directly in a $k$-sample permutation test (\fct{DISCOB}). 
Alternatively, the statistic
\[
F_{\alpha} = \frac{S_{\alpha}/(k-1)}{W_{\alpha}/(N-k)}
\]
can be used in a $k$-sample permutation test (\fct{DISCOF}).
For each index $\alpha\in(0,2)$, this determines a nonparametric test for the multi-sample problem that is statistically consistent against general alternatives.
For $\alpha = 2$, it equals the usual ANOVA $F$-test. 
The choice of the index $\alpha$ is difficult. 
In general, the computational costs for calculating Gini means $g_{\alpha}$, in terms of which the test statistic can be formulated, are $\mathcal{O}(N^2)$.
For $\alpha= 1$, it can be linearized, and computation time reduces to $\mathcal{O}(N\log N)$. 
The simplest and most natural choice for $\alpha$ is one.
For heavy-tailed distributions, a small $\alpha$ is recommended.\\
The test is implemented by permutation bootstrap in the \proglang{R} package \texttt{energy} \citep{energy}.
The new implementations of the between-sample and of the DISCO $F$-test are wrappers, which mainly unify the inputs and outputs that differed between the two tests in the original implementation. 
Moreover, the input format is again changed from the pooled sample and the dataset labels to the individual datasets. 

\subsection[FStest(), RItest()]{\fct{FStest}, \fct{RItest}} \label{app:technical:paul}
\citet{paul_clustering-based_2022} propose distribution-free $k$-sample tests intended for the high dimension low sample size (HDLSS) setting.
The tests are based on clustering the pooled sample and comparing the resulting clustering to the true dataset membership via a contingency table.
If the datasets come from the same distribution, the cluster and dataset membership are independent, while if the datasets come from different distributions, the clustering depends on the true dataset membership. 
As a clustering algorithm, \citet{paul_clustering-based_2022} suggest using $K$-means based on the generalized version of the \textit{mean absolute difference of distances (MADD)}
\[
\rho_{h,\varphi}(z_i, z_j) = \frac{1}{N-2} \sum_{m\in \{1,\dots, N\}\setminus\{i,j\}} \left| \varphi_{h,\psi}(z_i, z_m) - \varphi_{h,\psi}(z_j, z_m)\right|,
\]
as proposed by \citet{sarkar_perfect_2020} for the HDLSS setting. 
Here, $z_i, i = 1,\dots,N$, denote realizations from the pooled sample and \[\varphi_{h,\psi}(z_i, z_j) = h\left(\frac{1}{p}\sum_{l=1}^p\psi|z_{il} - z_{jl}|\right),\] where $h:\mathbb{R}^{+} \to\mathbb{R}^{+}$ and $\psi:\mathbb{R}^{+} \to\mathbb{R}^{+}$ are continuous and strictly increasing functions. 
$\psi(t) = t^2$, $\psi(t) = 1 - \exp(-t)$, $\psi(t) = 1 - \exp(-t^2)$, $\psi(t) = \log(1 + t)$, and $\psi(t) = t$ are considered in combination with $h(t) = \sqrt{t}$ and $h(t) = t$. 
The number of clusters has to be chosen in advance. 
A natural choice is to set the number of clusters to the number of datasets $k$.
For the RI test, the Rand index of the clustering is used as a test statistic. 
It is zero when the clustering is perfect, i.e., when the cluster membership is a permutation of the true dataset membership. 
The test rejects for low values since the Rand index should take higher values when all clusters have similar distributions of class labels.
The critical value can be calculated using a generalized hypergeometric distribution.
Due to the discreteness of the Rand index, \citet{paul_clustering-based_2022} propose to use a randomized test. 
For the FS test, the generalized Fisher's test statistic for testing for independence in a $k\times\ell$ contingency table is used. 
Again, a randomized test using the generalized hypergeometric distribution to find the critical values is proposed.\\
\citet{paul_clustering-based_2022} additionally propose modified versions of the tests (MRI, MFS test).
For these, the number of clusters is estimated from the data using the Dunn index, since setting the number of clusters to $k$ might fail in case of multimodal distributions where a larger number of clusters might be required, where then multiple clusters can correspond to one dataset. \\
Moreover, multiscale versions of the tests are presented (MSRI, MSFS test) for the case where the number of clusters is unclear.
The respective tests are then performed for different numbers of clusters, and the results are aggregated using a Bonferroni adjustment for the individual tests. 
Still, an upper limit for the number of clusters to be considered must be chosen. 
The implementation also includes aggregated tests (AFS / ARI test) that perform all pairwise FS / MFS or RI / MRI tests, respectively, on the samples and aggregate the results by taking the minimum test statistic value and applying a multiple testing procedure. \\
The tests are implemented in the \proglang{R} package \pkg{HDLSSkST} \citep{HDLSSkST}.
The main difference between the new wrapper functions and the original implementation is that the modified and multiscale versions of the RI and FS tests can be performed with the same function as the original tests. 
The test can be chosen via the newly introduced \code{version} argument of the \fct{FStest} and \fct{RItest} functions. 
One advantage of this is that the input and output formats are unified between the versions of the test. 
In the original implementation of the test, the elements of the output list differ both content-wise and also in their names between the tests. 
Moreover, the input of the tests differs slightly between the original functions for the different tests. 
The input is also unified to match the input of the other functions in the \pkg{DataSimilarity} package and, therefore, consists simply of the datasets instead of a pooled data matrix, a vector with the dataset affiliation of each observation, and a vector of the sample sizes. 
We think this is easier to understand and less error-prone from a user perspective. 

\subsection[GPK()]{\fct{GPK}} \label{app:technical:gpk}
\citet{song_generalized_2021} propose another kernel-based test for which they decompose the squared MMD estimator as
\[
\widehat{\text{MMD}}^2 = \alpha + \beta - 2\gamma,
\]
where
\begin{align*}
\alpha &= \frac{1}{n_1(n_1 - 1)}\sum_{i=1}^{n_1}\sum_{\substack{j = 1\\ j\ne i}}^{n_1} K(X_i^{(1)}, X_j^{(1)}),\\
\beta &= \frac{1}{n_2(n_2 - 1)}\sum_{i=1}^{n_2}\sum_{\substack{j = 1\\ j\ne i}}^{n_2} K(X_i^{(2)}, X_j^{(2)}),\\
\gamma &= \frac{1}{n_1n_2}\sum_{i=1}^{n_1}\sum_{j = 1}^{n_2} K(X_i^{(1)}, X_j^{(2)}).
\end{align*}
As a new statistic, they propose to use
\[
\text{GPK} = (\alpha - \E_{H_0}(\alpha), \beta - \E_{H_0}(\beta)) \COV^{-1}_{H_0}\left(\begin{pmatrix} \alpha\\ \beta \end{pmatrix}\right) \begin{pmatrix} \alpha - \E_{H_0}(\alpha)\\ \beta - \E_{H_0}(\beta)\end{pmatrix}.
\]
The GPK can be decomposed into $\text{GPK} = Z_W^2 + Z_D^2$, where $Z_W$ and $Z_D$ are the standardized versions (with expectation and variance under $H_0$) of
\begin{align*}
W &= \frac{n_1}{N}\alpha + \frac{n_2}{N}\beta,\\
D &= n_1 (n_1 -1) \alpha - n_2(n_2 - 1)\beta.
\end{align*}
Based on this observation, they further generalize $W$ to
\[
W_r = r\frac{n_1}{N}\alpha + \frac{n_2}{N}\beta
\]
and $Z_W$ to $Z_{W,r}$. 
Fast tests based on $Z_{W,r}$ are proposed as the asymptotic distribution of $Z_W = Z_{W, 1}$ is complicated, but that of $Z_{W,r}, r\ne 1$, is a standard normal under mild assumptions.
One fast test fGPK uses the Bonferroni adjusted test result of the tests based on $Z_D$, $Z_{W,1.2} =: ZW_1$ and $Z_{W,0.8} =: ZW_2$, the other fast test fGPK$_M$ uses the Bonferroni adjusted test result of the tests based on $Z_{W,1.2}$ and $Z_{W,0.8}$. For GPK (as well as for fGPK and fGPK$_M$), a permutation test can be performed.\\
The new implementation \fct{GPK} based on the \fct{kertests} function from the \pkg{kerTests} package \citep{kerTests} performs by default the fast test version instead of a permutation test, and the bandwidth parameter $\sigma$ of the RBF kernel that is used as the kernel $K$ is chosen via the median heuristic using the function \fct{med\_sigma} of the \pkg{kerTests} package.
The median heuristic sets the bandwidth of the kernel to the median value of all pairwise distances in the pooled sample \citep{sriperumbudur_kernel_2009}. 
When the fast test is performed, all three test statistics, $ZW_1$, $ZW_2$, and $Z_D$, are returned together with the asymptotic $p$~value if \code{n.perm = 0} or the permutation $p$~value if \code{n.perm > 0}, respectively.
For the GPK statistic, only the permutation test is available as its null distribution cannot be accessed.
Therefore, if the number of permutations is set to zero, the fast test is always performed.
This holds even if \code{fast} is set to \code{FALSE} (with a warning).

\subsection[HMN()]{\fct{HMN}} \label{app:technical:hedinger}
\citet{hediger_use_2021} provide a two-sample test based on random forests.
It is applicable for categorical data and can also be used with numeric variables or a mix of categorical and numeric variables. 
For this, a pooled dataset is created where each observation is labeled according to its original dataset membership, and a random forest is trained to distinguish between the dataset labels.
The idea is that if the datasets are generated from the same distribution, the classification error of the random forest should be close to the chance level; otherwise, the classifier should be able to distinguish between the two distributions, and hence the classification error should be lower than the chance level. 
One advantage of using random forests as the classifier is that it requires almost no tuning.
An asymptotic test is proposed.
For this, the pooled dataset has to be split into a training set on which the random forest is trained and a test set on which its classification error is evaluated to ensure that the test is performed on data that is independent of the data on which the classifier was trained.
In the implementation, both datasets are split in half to create a training and a test dataset. 
Alternatively, an out-of-bag (OOB) based permutation test can be performed that does not require data splitting. 
OOB statistics can be used to increase the sample efficiency compared to the test based on a holdout sample. 
Both the OOB-based permutation test and the asymptotic version of the test using data splitting are implemented.
The test statistic is either the mean of the per-class OOB or test classification errors, or the overall OOB or test classification error over both classes, respectively.
In the asymptotic case, a binomial test is performed in case of the overall classification error, or a $Z$ test is performed in case of the mean per-class classification error. 
Otherwise, a permutation test is performed.
The variable importance measures of the random forest can provide additional insights into sources of distributional differences. 

The function here is a wrapper around the \fct{hypoRF} function from the \pkg{hypoRF} package \citep{hypoRF} that only renames arguments for consistency with the other methods.
Note that the implemented per-class OOB statistics differ for the permutation test and the approximate test: for the permutation test, the sum of the per-class OOB errors is returned; for the asymptotic version, the standardized sum is returned.

\subsection[KMD()]{\fct{KMD}} \label{app:technical:kmd}
The \emph{kernel measure of multi-sample dissimilarity} (KMD) introduced by \citet{huang_kernel_2022} is a kernel-based test using the association between the variables and the sample membership to quantify the dissimilarity of multiple samples. 
Denote the dataset membership of each point in the pooled sample $\{Z_1,\dots, Z_N\}$ by $\{\Delta_1,\dots, \Delta_N\}$. $\{(\Delta_i, Z_i)\}_{i=1}^N$ can be seen as an i.i.d.\ sample from $(\tilde{\Delta}, \tilde{Z})$ with distribution $\mu$ defined by $\text{P}(\tilde{\Delta} = i) = \pi_i$ and $\tilde{Z}|\tilde{\Delta} = i \sim F_i$, $i = 1,\dots, k$. 
Let $(\tilde{Z}_1, \tilde{\Delta}_1), (\tilde{Z}_2, \tilde{\Delta}_2)$ be i.i.d.\ samples from $\mu$ and $(\tilde{Z}, \tilde{\Delta}), (\tilde{Z}, \tilde{\Delta}^{\prime}) \sim \mu$ with $\tilde{\Delta}, \tilde{\Delta}^{\prime}$ conditionally independent given $\tilde{Z}$.
Denote by $K$ a kernel function over $\{1,\dots,k\}$, e.g., the discrete kernel $K(x, y) := \mathbbm{1}(x = y)$. 
Then, the KMD is defined as
\[
\eta(P_1,\dots,P_k) := \frac{\E\left[K(\tilde{\Delta}, \tilde{\Delta}^{\prime})\right] - \E\left[K(\tilde{\Delta}_1, \tilde{\Delta}_2)\right]}{\E\left[K(\tilde{\Delta}, \tilde{\Delta})\right] - \E\left[K(\tilde{\Delta}_1, \tilde{\Delta}_2)\right]}.
\]
It can be estimated using a similarity graph $\mathcal{G}$, e.g., the $K$-nearest neighbor (NN) graph or the minimum spanning tree (MST) on the pooled sample.
Denote by $(Z_i,Z_j)\in\mathcal{E}(\mathcal{G})$ that there is an edge in $\mathcal{G}$ connecting $Z_i$ and $Z_j$.
Moreover, let $o_i$ be the out-degree of $Z_i$ in $\mathcal{G}$. Then, an estimator for $\eta$ is defined as
\[
\hat{\eta} := \frac{\frac{1}{N} \sum_{i=1}^N \frac{1}{o_i} \sum_{j:(Z_i,Z_j)\in\mathcal{E}(\mathcal{G})} K(\Delta_i, \Delta_j) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}{\frac{1}{N}\sum_{i=1}^N K(\Delta_i, \Delta_i) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}.
\]
An asymptotic and a permutation $k$-sample test are proposed based on the KMD.\\
The implementation of the new function \fct{KMD} combines the calculation of KMD and the corresponding $p$~value using the functions \fct{KMD} and \fct{KMD\_test}, respectively, from the \pkg{KMD} package \citep{KMD}. 
Moreover, the inputs of the new function are simply the individual datasets instead of the pooled data matrix and sample IDs. 
By default, the asymptotic test is performed (\code{n.perm = 0}) using a discrete kernel and the $K$-NN graph with $K = \lceil N / 10 \rceil$, where $N$ denotes the total sample size of the pooled sample.
The options for the graph are restricted to \code{``knn''} and \code{``mst''} by the implementations from the \pkg{KMD} package. 
A user-specified kernel can be used only when a kernel matrix is supplied instead of the keyword \code{``discrete''} for the \code{kernel} argument of the new function.

\subsection[MMD()]{\fct{MMD}} \label{app:technical:mmd}
The \emph{maximum mean discrepancy} (MMD) uses a kernel mean embedding to define a metric for probability distributions.
Kernel mean embeddings extend feature maps $\phi$ to the space of probability distributions by representing each distribution $F$ as a mean function
\begin{equation*}
\phi(F)(\cdot) = \mu_{F}(\cdot) := \int_\mathcal{X} K(x, \cdot) \dif F(x) = \E_{F}(K(X, \cdot)), \label{def.emb}
\end{equation*}
where $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a symmetric and positive definite kernel function.
A reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ of functions on the domain $\mathcal{X}$ with kernel~$K$ is a Hilbert space of functions $f: \mathcal{X}\to\mathbb{R}$ with dot product $\langle\cdot,\cdot\rangle$ that satisfies the reproducing property
\begin{align*}
\langle f(\cdot), K(x,\cdot)\rangle &= f(x)
\;\Rightarrow\; \langle K(x,\cdot), K(x^{\prime},\cdot)\rangle = K(x,x^{\prime}),
\end{align*}
such that the linear map from a function to its value at $x$ can be seen as an inner product. 
Then the kernel mean embedding as given above is a transformation of the distribution $F$ to an element in the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ corresponding to the kernel~$K$ \citep{muandet_kernel_2017}.
For characteristic kernels, the kernel mean representation captures all information about the distribution $F$, which implies $\|\mu_{F_1} - \mu_{F_2}\|_{\mathcal{H}} = 0 \Leftrightarrow F_1 = F_2$ \citep{fukumizu_dimensionality_2004, sriperumbudur_injective_2008, sriperumbudur_hilbert_2010}. 
Therefore, the MMD measures the difference between two distributions as
\[
\text{MMD}(\mathcal{H}, F_1, F_2) = \|\mu_{F_1} - \mu_{F_2}\|_{\mathcal{H}}.
\]
Here, the implementation \fct{kmmd} from the \pkg{kernlab} package \citep{kernlab} is used.
The alternative implementation from the \pkg{Ecume} package \citep{Ecume} does not include an automatic choice of the kernel parameter.
The new implementation adds a permutation test to the \pkg{kernlab} implementation.

\subsection[MW()]{\fct{MW}} \label{app:technical:lp}
For the test of \citet{mukhopadhyay_nonparametric_2020}, a nonparametrically designed set of orthogonal functions (LP polynomials) is obtained by orthonormalizing a set of functions constructed as orthonormal polynomials of mid-distribution transforms. 
These are used for the construction of a polynomial kernel of degree 2 that encodes the similarity between two data points in the LP-transformed domain.
The values of the kernel Gram matrix are then used as weights on a graph with the pooled sample as vertices. 
The idea is to cluster points for the graph into $k$ groups that have higher connectivity and compare how closely this clustering is related to the true memberships of the $k$ distributions. 
Then, the problem reduces to testing independence, which can be accomplished by determining whether all of the LP comeans are zero.\\
The test is implemented in the \pkg{LPKsample} package \citep{LPKsample}.
The new implementation offers the additional option to sum over all components instead of summing over the significant components only. 
This might be of interest when using the statistic as a data similarity measure without testing.
By default, this is disabled (\code{sum.all = FALSE}).
When only summing over the significant components, the returned test statistic is always equal to zero when no component is significant.

\subsection[RISE()]{\fct{RISE}} \label{app:technical:rise}
\citet{zhou_new_2023} define a rank-based two-sample test based on similarity graphs, which can be applied to high-dimensional and non-Euclidean data. 
They define a sequence of simple similarity graphs $\{G_l\}_{l=0}^K$ on the pooled sample via 
\[
G_{l+1} = G_l \cup G_{l+1}^\ast
\]
with
\[
G_{l+1}^\ast = \arg\max_{G^\prime\in\mathcal{G}_{l+1}}\sum_{(i,j)\in G^\prime} S(Z_i, Z_j),
\]
where $G_0$ has no edges, $\mathcal{G}_{l+1} = \{G^\prime\in\mathcal{G}:G^\prime\cap G_l = \emptyset \}$ with $\mathcal{G}$ the set of graphs that fulfill specific user-defined constraints, and $S(\cdot,\cdot)$ a similarity measure, e.g.\ the negative Euclidean distance for Euclidean data, and $Z_1,\dots,Z_N$ denoting the pooled sample. 
This includes as special cases the $K$-nearest neighbor graph, the $K$-minimum spanning tree, the $K$-minimum distance non-bipartite pairing, and the $K$-shortest Hamiltonian path. 
Then, \citet{zhou_new_2023} define the following two graph-based rank matrices $R = (R_{ij})_{i,j=1}^N$ using this sequence of graphs. The \emph{graph-induced ranks} are defined as 
\[
R_{ij} = \sum_{l=1}^K \mathbbm{1}\left(\left(i,j\right)\in G_l\right).
\]
They can be interpreted as the number of graphs that contain the edge $(i,j)$ in the sequence of graphs. 
The \emph{overall ranks} are defined as
\[
R_{ij} = \text{rank}\left(S\left(Z_i, Z_j\right), G_K\right),
\]
where $\text{rank}\left(S\left(Z_i, Z_j\right), G_K\right)$ denotes the rank of $S\left(Z_i, Z_j\right)$ among the values $\{S\left(Z_u, Z_v\right)\}_{(u,v)\in G_K}$ if $(i,j)\in G_k$ and zero otherwise. 
The overall rank can be interpreted as the rank of the similarity of edges in the graph $G_K$. 
Both rank definitions depend on the choice of $K$. 
For the test, the symmetrized rank matrix $1/2(R+R^T)$ is used, which is also denoted by $R$ for convenience.

For the test statistic, the within-sample rank sums of the first and second sample are defined as 
\[
U_x = \sum_{i,j=1}^{n_1} R_{ij}, U_y = \sum_{i,j=n_1 + 1}^{N} R_{ij}.
\]
Using these, the \emph{rank in similarity graph edge-count two-sample test (RISE)} statistic is defined as 
\[
T_R = (U_x - \mu_x, U_y - \mu_y)\Sigma^{-1}(U_x - \mu_x, U_y - \mu_y)^T,
\]
where $\mu_x = \E(U_x)$, $\mu_y = \E(U_y)$, and $\Sigma = \mathbb{C}\text{ov}((U_x, U_y)^T)$. 
The quantities $\mu_x$, $\mu_y$, and $\Sigma$ can be calculated explicitly under the permutation null hypothesis. 
\citet{zhou_new_2023} give sufficient conditions under which $\Sigma$ is invertible and therefore $T_R$ is well-defined. 
For small samples, the exact permutation null distribution can be used for testing. 
For large samples and under several assumptions on the similarity graphs, the asymptotic $\chi^2_2$-distribution of $T_R$ can be used for testing. 

The implementation here is a wrapper around the \fct{RISE} function from the \pkg{GraphRankTest} package \citep{GraphRankTest}.

\subsection[Rosenbaum()]{\fct{Rosenbaum}} \label{app:technical:rosenbaum}
The \citet{rosenbaum_exact_2005} cross-match test uses a similar approach as the Friedman-Rafsky test, but is based on the optimal non-bipartite matching instead of the MST as a similarity graph, i.e., a graph where pairs of two observations in the pooled sample are connected such that the sum over the edge lengths (= Euclidean distances of connected observations) is minimized. 
For a visualization, see the ``Getting Started'' vignette.
In the case of an odd pooled sample size, a ghost observation is added that has the maximal distance to all real observations. 
This ghost observation and the observation that was matched with it are then discarded in the calculation of the test statistic. 

The test statistic of the cross-match test is given by the standardized cross-match count
\[
\frac{\text{CMC} - \E_{H_0}(\text{CMC})}{\sqrt{\VAR_{H_0}(\text{CMC})}},
\]
where CMC denotes the cross-match count and $\E_{H_0}$ and $\VAR_{H_0}$ its expectation and variance, respectively, under $H_0: F_1 = F_2$. 
The cross-match count is the number of edges connecting points from different datasets.
The exact distribution of the test statistic under $H_0$ is known. For small samples, it can be used for computing an exact $p$~value. 

The new function \fct{Rosenbaum} is a wrapper around the \fct{crossmatchtest} function from the \pkg{crossmatch} package \citep{crossmatch}.
Again, a distance function can be supplied.
By default, this is \fct{stats::dist}, i.e., the Euclidean distance.
The new function then calculates the distance matrix internally.
Again, we find this more straightforward from a user perspective than supplying a distance matrix on the pooled sample and a vector specifying the dataset membership of each observation.
The output of the function includes the raw edge count, its standard error, and expectation under the null, like for the \pkg{crossmatch} implementation.
In contrast, only either the exact or the approximate $p$~value is returned.
By default (\code{exact = TRUE}), the exact $p$~value is returned.
This is appropriate for samples that are not too large.
Note that with a pooled sample size of $340$ or more, it is numerically impossible to derive the exact distribution due to the factorials involved in the calculation, and \fct{crossmatchtest} will return a missing value for the exact $p$~value.

\subsection[SC()]{\fct{SC}} \label{app:technical:SC}
\citet{song_new_2022} propose three new tests for the $k$-sample problem that use the between-sample edges and the within-sample edges of a similarity graph on the pooled sample. 
Let $R^W$ denote the vector containing the numbers of within-sample edges for each of the $k$ samples and $R^B$ denote the vector containing the numbers of between-sample edges for all $k(k-1)$ pairs of different samples. 
Then, the first test statistic is given by
\begin{align*}
S &= S^W + S^B,\text{ where}\\
S^W &= \left(R^W - \E_{H_0}(R^W)\right)^{\top} \COV_{H_0}^{-1}\left(R^W\right) \left(R^W - \E_{H_0}(R^W)\right),\\
S^B &= \left(R^B - \E_{H_0}(R^B)\right)^{\top} \COV_{H_0}^{-1}\left(R^B\right) \left(R^B - \E_{H_0}(R^B)\right).
\end{align*}
The second test statistic is based on the vector $R^A$ of all linearly independent numbers of edges between and within samples, i.e., all numbers of edges between all pairs of samples, including the pairs of a sample with itself, except for the pair of sample~$(k-1)$ and sample~$k$. 
The test statistic is then defined as
\[
S^A = \left(R^A - \E_{H_0}(R^A)\right)^{\top} \COV_{H_0}^{-1}\left(R^A\right) \left(R^A - \E_{H_0}(R^A)\right).
\]
All expectations and covariances under the null can be calculated analytically again.
While $\COV_{H_0}\left(R^W\right)$ is shown to be always invertible, no such proof exists for $\COV_{H_0}\left(R^B\right)$ and $\COV_{H_0}\left(R^A\right)$. 
Therefore, \citet{song_new_2022} suggest checking the invertibility numerically before applying the test and using a generalized inverse if necessary. This is already done within their implementation. 
Based on $S^A$, an asymptotic test can easily be performed. 
The asymptotic distribution of $S$ is more complicated and hard to compute in practice; therefore, a fast test is suggested instead.
It combines the tests using $S^W$ and $S^B$ and takes the Bonferroni-adjusted $p$~value of both these tests. 
Alternatively, a permutation test can be performed for either $S^A$ or $S$.\\
The implementation here for the test of \citet{song_new_2022} is a wrapper around the \fct{gtestsmulti} function from \pkg{gTestsMulti} \citep{gTestsMulti}.
The input is simplified as for the wrapper around \fct{g.tests}.
The user has to choose whether the original ($S$) or the fast ($S^A$) version of the test should be performed.
If the number of permutations for the permutation test (\code{n.perm}) is set to 0, the approximate test is performed; otherwise, the permutation $p$~value is reported

\subsection[Wasserstein()]{\fct{Wasserstein}} \label{app:technical:wasserstein}
The $q$-Wasserstein distance \citep{vaserstein_markov_1969} of two distributions $F_1$ and $F_2$ on $\mathcal{X}$ is defined as 
\[
\text{W}(F_1, F_2) := \left(\min_{\pi\in\Pi(F_1, F_2)} \int_{\mathcal{X}\times\mathcal{X}} d_{\mathcal{X}}(x, y)^q \dif \pi(x, y)\right)^{1/q}, 
\]
where $d_{\mathcal{X}}$ is the metric that $\mathcal{X}$ is provided with, and 
\[
\Pi(F_1, F_2) := \{\pi_{1,2}\in\mathcal{P}(\mathcal{X}\times\mathcal{X})\arrowvert \pi_1 = F_1, \pi_2 = F_2\}
\]
is the set of joint distributions over the product space $\mathcal{X}\times\mathcal{X}$ with marginal distributions $F_1$ and $F_2$.\\
In the \pkg{Ecume} package \citep{Ecume}, a permutation test based on the Wasserstein distance is implemented.

\subsection[BG()]{\fct{BG}} \label{app:technical:biau}
\citet{biau_asymptotic_2005} test for homogeneity of two (multivariate) datasets by calculating the $L_1$-distance between the two empirical distributions restricted to a finite partition.
For this, a finite partition of the subspace spanned by the two datasets has to be defined.
By default, we define a rectangular partition under the assumption of approximately equal cell probabilities.
The number of elements of the partition $m_n$ is chosen according to the convergence criteria in \citet{biau_asymptotic_2005} as $n^{0.8}$, where the exponent can be varied as an argument (\code{exponent}).
For each dimension, $m_n^{1/p} + 1$ equidistant cut-points are created along the range of both datasets to define the partition.
It must be ensured that there are at least three cut-points per dimension (min, max, and one point splitting the data into two bins).
The argument \code{eps} ensures that the partition covers all data points by adding some small value to the data range.
Alternative partition functions can be provided via the \code{partition} argument.
After calculating the partition, all data points are assigned to an element of the partition along the defined cut-points.
Lastly, the $L_1$ distance between the empirical distribution functions restricted to the elements of the partition is calculated.

\subsection[BG2()]{\fct{BG2}} \label{app:technical:bg}
The statistic of \citet{biswas_nonparametric_2014} uses inter-point distances and is defined as
\begin{align*}
T &= \|\hat{\mu}_{D_{F_1}} - \hat{\mu}_{D_{F_2}}\|^2_2, \text{ where}\\
\hat{\mu}_{D_{F_1}} &= \left[\hat{\mu}_{{F_1}{F_1}} = \frac{2}{n_1(n_1 - 1)}\sum_{i=1}^{n_1}\sum_{j=i+1}^{n_1}\|X_{i}^{(1)} - X_{j}^{(1)}\|, \hat{\mu}_{{F_1}{F_2}} = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\|X_{i}^{(1)} - X_{j}^{(2)}\|\right]^\top,\\
\hat{\mu}_{D_{F_2}} &= \left[\hat{\mu}_{{F_1}{F_2}} = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\|X_{i}^{(1)} - X_{j}^{(2)}\|, \hat{\mu}_{{F_2}{F_2}} = \frac{2}{n_2(n_2 - 1)}\sum_{i=1}^{n_2}\sum_{j=i+1}^{n_2}\|X_{i}^{(2)} - X_{j}^{(2)}\|\right]^\top.
\end{align*}
For testing, the scaled statistic
\begin{align*}
T^* &= \frac{N\hat{\lambda}(1 - \hat{\lambda})}{2\hat{\sigma}_0^2} T \text{ with}\\
\hat{\lambda} &= \frac{n_1}{N},\\
\hat{\sigma}_0^2 &= \frac{n_1S_1 + n_2S_2}{N}, \text{ where}\\
S_1 &= \frac{1}{\binom{n_1}{3}} \sum_{1\le i < j < l \le n_1} \|X_{i}^{(1)} - X_{j}^{(1)}\|\cdot \|X_{i}^{(1)} - X_{l}^{(1)}\| - \hat{\mu}_{{F_1}{F_1}}^2 \text{ and}\\
S_2 &= \frac{1}{\binom{n_2}{3}} \sum_{1\le i < j < l \le n_2} \|X_{i}^{(2)} - X_{j}^{(2)}\|\cdot \|X_{i}^{(2)} - X_{l}^{(2)}\| - \hat{\mu}_{{F_2}{F_2}}^2
\end{align*}
is used as it is asymptotically $\chi^2_1$-distributed.
The new function \fct{BG2} implements the \citet{biswas_nonparametric_2014} test from scratch.
\fct{stats::dist} is used to calculate the Euclidean distance matrix on the pooled sample.
The statistic $T$ and the scaled test statistic $T^*$ are implemented according to the formulas above.
A permutation test is implemented by permuting the distance matrix, recalculating the test statistic $T$ for the permuted distances, and calculating the $p$~value as the proportion of permuted test statistics larger than the observed test statistic.
An asymptotic test is implemented using the asymptotic result from Theorem 4.1 of \citet{biswas_nonparametric_2014}, i.e., calculating the $p$~value as \code{stats::pchisq(}$T^*$\code{, lower.tail = FALSE)}.

\subsection[BMG()]{\fct{BMG}} \label{app:technical:bmg}
\citet{biswas_distribution-free_2014} suggest a graph-based test similar to those of \citet{friedman_multivariate_1979} and \citet{rosenbaum_exact_2005}, but using the shortest Hamiltonian path as the similarity graph.
Since calculating the Hamiltonian path is an NP hard problem, the implementation of \fct{BMG} is based on Kruskal's algorithm, which is a heuristic approach to find the shortest Hamilton Path within the pooled dataset as suggested in \citet{biswas_distribution-free_2014}.
Here, it is implemented as follows:
\begin{enumerate}
\item Create an edge list of the fully connected graph on the pooled sample, sorted by increasing Euclidean distance of the corresponding vertices.
\item For each edge, check if (i) an addition of this edge leads to a cyclic graph (using \fct{IsAcyclic} from the \pkg{rlemon} package \citep{rlemon}) and (ii) an addition of this edge leads to a degree larger than two in any (used) vertex. If both criteria are not met, keep the corresponding edge.
\item Return the reduced edge list, containing only edges needed to construct the Hamilton path.
\end{enumerate}
For pooled sample sizes $N<1030$, an exact test can be performed.
For $N \ge 1030$, calculation of the exact runs statistic cannot be performed due to terms involved in the calculation becoming too large for representing them as floating point numbers in \proglang{R}.
In the exact case, the $p$~values using the null distribution of the univariate runs statistic \citep{biswas_distribution-free_2014} are calculated.
If an asymptotic test is performed, the asymptotic null distribution is used instead.

\subsection[BQS()]{\fct{BQS}} \label{app:technical:barakat}
\citet{barakat_multivariate_1996} generalize the Schilling-Henze nearest neighbor test (see~\ref{app:technical:sh}) to circumvent choosing the number of nearest neighbors. 
Their test statistic is the sum of edge counts for all values of $K$ for the $K$-nearest neighbor graph. 
The resulting test is equivalent to a sum of Wilcoxon rank sums.
It requires samples in the Euclidean space $\mathbb{R}^p$, and it is assumed that there are no ties in ranking with respect to nearness.\\
Within our implementation, we do not explicitly calculate the $K$-nearest neighbor graph for all possible values of $K$ as this would be highly inefficient.
Instead, the distance matrix on the pooled sample is calculated with a user-specified distance function (Euclidean distance calculated via \fct{stats::dist} by default), and the column-wise orderings of the distances, excluding the diagonal elements, are calculated. 
Then, the cumulative numbers of the elements smaller than $n_1$ are calculated for the first $n_1$ columns of the orderings, corresponding to the numbers of within-sample edges in the first sample in the $K$-nearest neighbor graph for $K = 1,\dots,N-1$. 
Analogously, the cumulative numbers of the elements greater than $n_1$ are calculated for the remaining $n_2$ columns of the orderings, corresponding to the numbers of within-sample edges in the second sample in the $K$-nearest neighbor graph for $K = 1,\dots,N-1$.
Lastly, all these cumulative numbers are summed up, which corresponds to the \citet{barakat_multivariate_1996} test statistic. 
A permutation test is implemented using the \fct{boot::boot} function. 
For that, the distances are permuted directly, and the calculation is repeated for the permuted distance matrix, which circumvents the costly recalculation of the distances for each permutation. 

\subsection[CMDidstance()]{\fct{CMDistance}} \label{app:technical:cmd}
The \emph{constrained minimum (CM) distance} \citep{tatti_distances_2007} uses a \emph{feature function} $S:\mathcal{X}\to\mathbb{R}^m$ that maps points from the sample space $\mathcal{X}$ to a real vector. The \emph{frequency} $\theta\in\mathbb{R}^m$ of $S$ with respect to dataset $X^{(j)}$ is the average of the values of $S$
\[
\theta_j= \frac{1}{N}\sum_{i = 1}^{n_j} S(X_{i}^{(j)}), j = 1, 2.
\]
The CM distance is then defined as
\[
D_{\text{CM}}(X^{(1)}, X^{(2)}|S)^2 = (\theta_1 - \theta_2)^{\top}\COV^{-1}(S)(\theta_1 - \theta_2)
\]
with
\[
\COV(S) = \frac{1}{|\mathcal{X}|}\sum_{\omega\in\mathcal{X}} S(\omega)S(\omega)^{\top} - \left(\frac{1}{|\mathcal{X}|}\sum_{\omega\in\mathcal{X}}S(\omega)\right)\left(\frac{1}{|\mathcal{X}|}\sum_{\omega\in\mathcal{X}}S(\omega)\right)^{\top}.
\]
It has to be assumed that the feature space $\mathcal{X}$ is finite and can be enumerated.
For binary data and $S$ chosen as the conjunction function, i.e., $S$ is one if all components of an observation are one, and zero otherwise, or as the parity function, i.e., $S$ is one if an odd number of components of an observation are one, and zero otherwise, the CM distance reduces to
\[
D_{\text{CM}}(X^{(1)}, X^{(2)}|S) = 2\|\theta_1 - \theta_2\|_2.
\]
This special case for binary data is implemented first.
It includes the option to use either the means as features (example 3 in \citet{tatti_distances_2007}) or the means and covariances (example 4 in \citet{tatti_distances_2007}).
Note that there is an error in the calculation of the covariance matrix in A.4 Proof of Lemma 8 in \citet{tatti_distances_2007}.
The correct covariance matrix has the form $\COV[T_{\mathcal{F}}] = 0.25I$ since $\VAR[T_A] = \E[T_A^2] - \E[T_A]^2 = 0.5 - 0.5^2 = 0.25$ following from the correct statement that $\E[T_A^2] = \E[T_A] = 0.5$.
Therefore, formula (4) changes to $d_{CM}(D_1, D_2 | S_{\mathcal{F}}) = 2 \|\theta_1 - \theta_2\|_2$ and the formula in example 3 changes to $d_{CM}(D_1, D_2 | S_{1}) = 2 \|\theta_1 - \theta_2\|_2$.
Our implementation is based on these corrected formulas.
If the original formula was used, the results on the same data calculated with the formula for the binary special case and the results calculated with the general formula differ by a factor of $\sqrt{2}$.
For the general case for categorical data, the user has to specify a feature function $S$ mapping a point in the sample space to a real vector.
Additionally, either the covariance matrix $\COV[S]$, if known, or the sample space has to be given.
If both are given, the supplied covariance matrix is used and not recalculated.
The constrained minimum distance is calculated using Theorem 1 in \citet{tatti_distances_2007}, i.e., the formulas given above.
Therefore, the supplied or calculated $\COV[S]$, respectively, has to be invertible.

\subsection[DiProPerm()]{\fct{DiProPerm}} \label{app:technical:diproperm}
\citet{wei_direction-projection-permutation_2016} propose their \emph{direction-projection-permutation (DiProPerm)} test for which a univariate two-sample statistic is applied to the projection of the datasets onto the normal vector of a separating hyperplane.
For this, a linear classification method like a support vector machine (SVM) or distance weighted discrimination (DWD) is used to calculate such a separating hyperplane.
A permutation test is then performed for the univariate statistic applied to the projection onto the normal vector.
Possible options for the univariate statistic would be the mean difference, the two-sample $t$-statistic, or the area under the curve (AUC). 
There is an implementation in the \pkg{diproperm} package \citep{diproperm}, which is currently archived. 
Our implementation is independent of that implementation. 
It has the following advantages.
\begin{itemize}
\item All suggested univariate two-sample statistics from the paper, i.e., mean difference, $t$~test statistic, and AUC, are implemented. An additional two-sample statistic can be used if a suitable function is supplied via the \code{stat.fun} argument.
\item Additional binary linear classifiers other than the DWD and SVM suggested in the original paper can easily be used by supplying a suitable function via the \code{dipro.fun} argument.
\item The results of the new function are reproducible by setting a random seed.
\item The new implementation does not rely on global variables.
\item The $p$~value is returned as numeric instead of character.
\item The output is an object of class \class{htest} for pretty displaying of the results.
\end{itemize}
One restriction of the new function is that it no longer supports balanced permutation. 
That was necessary to ensure the reproducibility, which we consider to be a trade-off worth making since the use of balanced permutation is controversial anyway, see \citet{southworth_properties_2009}, and reproducibility is essential for permutation tests.

\subsection[DS()]{\fct{DS}} \label{app:technical:rank_energy}
The test of \citet{deb_multivariate_2021} is a rank version of the Energy statistic. 
The multivariate ranks are assigned using optimal transport.
The implementation is based on \proglang{R} code accompanying the original article (\url{https://github.com/NabarunD/MultiDistFree}).
It wraps up tidied-up versions of the \fct{computestatistic} and \fct{gensamdist} given there.
The implementation uses the \pkg{randtoolbox} package \citep{randtoolbox} for random number generation, the \pkg{clue} package \citep{cluePaper, clue} to solve the assignment problem for ranking, and the \pkg{energy} package \citep{energy} for implementation of the Energy statistic.

\subsection[engineerMetric()]{\fct{engineerMetric}} \label{app:technical:engineer}
The $L_q$-\textit{engineer metric} is defined as
\[
\text{EN}(X, Y; q) = \left[ \sum_{i = 1}^{p} \left\arrowvert \E\left(X_i\right) - \E\left(Y_i\right)\right\arrowvert^q\right]^{\min(q, 1/q)} \text{ with } q> 0,
\]
where $X_i, Y_i$ denote the $i$th component of the $p$-dimensional random vectors $X\sim F_1$ and $Y\sim F_2$.
A new function \fct{engineerMetric} is implemented.
Since the Engineer metric is simply the $L_q$-distance of the expectations of two random vectors, it is estimated as the $L_q$-distance of the column means of the datasets.
For the distance calculation, the \pkg{base} function \fct{norm} is used, and different options for the $L_q$ norm are available via the \code{type} argument.

\subsection[GGRL(), GGRLcat(), NKT()]{\fct{GGRL}, \fct{GGRL\_cat}, \fct{NKT}} \label{app:technical:decision_tree}
The methods of \citet{ganti_framework_1999} and \citet{ntoutsi_general_2008} work by determining the partition induced by a decision tree fit to each dataset and then intersecting these partitions and calculating certain probability estimates on the resulting intersection. 
\subsubsection{3. Methods applicable to exactly two numeric datasets with target variables}
\citet{ntoutsi_general_2008} propose measuring dataset similarity based on probability density estimates derived from decision trees.
For this, it is assumed that in addition to both covariate datasets $X^{(1)}$ and $X^{(2)}$, categorical target variables $Y^{(1)}$ and $Y^{(2)}$ with the same levels are given.
On each dataset $X^{(j)}$, a classification tree is constructed with $Y^{(j)}$ as the target variable, $j = 1, 2$. 
The splits defined by the decision trees induce a partition of the feature space $\mathcal{X}$ such that each leaf node corresponds to one segment in the partition.
Figure~\ref{fig:ex.NKT} demonstrates the procedure for two example datasets. 
First, trees are fit to each dataset (Figure~\ref{fig:tree1} and \ref{fig:tree2}). 
Then, the sample space is divided into segments based on the splits performed in each tree (Figure~\ref{fig:parti1} and \ref{fig:parti2}). 
These partitions are intersected (Figure~\ref{fig:sec.parti}) and based on the joint partition, the probability densities $P_D(\mathcal{X})$ and $P_D(Y^{(j)},\mathcal{X})$ are estimated for $D \in \{X^{(1)}, X^{(2)}, Z\}$.

\begin{figure}[!t]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization-tree1, echo=FALSE, fig=TRUE, height=5, width=7>>=
library("DataSimilarity")
set.seed(5202)
X1 <- data.frame(X1 = runif(100), X2 = runif(100))
X1$y <- as.factor(ifelse(X1$X1 < 0.5 & X1$X2 > 0.3, 0, 1))

X2 <- data.frame(X1 = runif(100), X2 = runif(100))
X2$y <- as.factor(ifelse((X2$X1 < 0.5 & X2$X2 > 0.3) | (X2$X2 < 0.3 & X2$X1 > 0.2 ), 0, 1))

library("rpart")
library("rpart.plot")
tree1 <- rpart(y ~ ., data = X1)
tree2 <- rpart(y ~ ., data = X2)
parti1 <- DataSimilarity:::findPartition(tree1, X1, X2)
parti2 <- DataSimilarity:::findPartition(tree2, X1, X2)
intersec.parti <- DataSimilarity:::intersectPartitions(parti1, parti2)
par(xpd = TRUE)
prp(tree1, digits = 2, type = 5, tweak = 1.5)
par(op)
@
\caption{Fitted Tree for Dataset~1.}
\label{fig:tree1}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization-tree2, echo=FALSE, fig=TRUE, height=5, width=7>>=
par(xpd = TRUE)
prp(tree2, digits = 2, type = 5, tweak = 1.5)
par(op)
@
\caption{Fitted Tree for Dataset~2.}
\label{fig:tree2}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization-parti1, echo=FALSE, fig=TRUE, height=5, width=7>>=
plotParti <- function(parti) {
  plot(NA, xlim = 0:1, ylim = 0:1, xlab = "X1", ylab = "X2", main = "", las = 1, 
       cex.axis = 1.5, cex.lab = 1.5)
  for(i in seq_along(parti)){
    segments(x0 = round(parti[[i]][1, 2], 2), x1 = round(parti[[i]][1, 3], 2), 
             y0 = round(parti[[i]][2, 2], 2), y1 = round(parti[[i]][2, 2], 2))
    segments(x0 = round(parti[[i]][1, 2], 2), x1 = round(parti[[i]][1, 3], 2), 
             y0 = round(parti[[i]][2, 3], 2), y1 = round(parti[[i]][2, 3], 2))
    segments(x0 = round(parti[[i]][1, 3], 2), x1 = round(parti[[i]][1, 3], 2), 
             y0 = round(parti[[i]][2, 2], 2), y1 = round(parti[[i]][2, 3], 2))
    segments(x0 = round(parti[[i]][1, 2], 2), x1 = round(parti[[i]][1, 2], 2), 
             y0 = round(parti[[i]][2, 2], 2), y1 = round(parti[[i]][2, 3], 2))
  }  
}

par(mar = c(4.1, 4.1, 1.1, 2.1))
plotParti(parti1)
par(op)
@
\caption{Partition of sample space derived from fitted tree for Dataset~1.}
\label{fig:parti1}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization-NKT-parti2, echo=FALSE, fig=TRUE, height=5, width=7>>=
par(mar = c(4.1, 4.1, 1.1, 2.1))
plotParti(parti2)
par(op)
@
\caption{Partition of sample space derived from fitted tree for Dataset~2.}
\label{fig:parti2}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
<<visualization-intersect, echo=FALSE, fig=TRUE, height=5, width=7>>=
par(mar = c(4.1, 4.1, 1.1, 2.1))
plotParti(intersec.parti$parti)
par(op)
@
\caption{Intersected partition (greatest common refinement, GCR) from fitted trees for Datasets~1 and~2. Each dataset includes two covariates and a binary target variable.}
\label{fig:sec.parti}
\end{subfigure}
\caption{Partitioning of sample space by fitting trees to two example datasets. Names in the tree nodes refer to variables in the respective datasets. $0$ and $1$ in the tree leaves refer to the levels of the target variables in the respective datasets.}
\label{fig:ex.NKT}
\end{figure}

Let $n_r$ denote the number of segments in the joint partition and $n_c$ the number of classes of $Y^{(j)}, j = 1, 2$.
$\hat{P}_D(\mathcal{X}) \in \mathbb{R}^{n_r}$ uses the proportion of observations in $D$ that fall into each segment of the joint partition. 
This means that for each of the $n_r$ segments of the partition, the number of observations from dataset $D$ that fall into that segment is counted and divided by the total number of observations in $D$.
For the estimation of the joint density $P_D(Y,\mathcal{X})$, the proportion of observations that fall into each segment of the joint partition and belong to each class is determined, $\hat{P}_D(Y, \mathcal{X}) \in \mathbb{R}^{n_r \times n_c}$.
Here, for each of the $n_r$ segments of the partition and each of the $n_c$ classes, the number of observations in $D$ where the corresponding target variable has the respective class value and that fall into the respective segment is counted and divided by the total number of observations in $D$.
The conditional density $P_D(Y|\mathcal{X})$ is estimated by calculating the proportion of observations belonging to each class separately for each segment, 
$\hat{P}_D(Y|\mathcal{X}) \in \mathbb{R}^{n_r \times n_c}$.
Here, for each of the $n_r$ segments of the partition and each of the $n_c$ classes, the number of observations in $D$ where the corresponding target variable has the respective class value and that fall into the respective segment is counted and divided by the total number of observations in $D$ that fall into the respective segment.

Then, \citet{ntoutsi_general_2008} consider the similarity index 
\[s(p, q) = \sum_{i} \sqrt{p_i \cdot q_i}\]
for vectors $p$ and $q$, where $(n_r \times n_c)$-matrices are interpreted as $(n_r \cdot n_c)$-dimensional vectors.
For the conditional distribution, the similarity vector $S(Y|\mathcal{X}) \in \mathbb{R}^{n_r}$ is computed with $S(Y|\mathcal{X})_i = s(\hat{P}_{X^{(1)}}(Y|\mathcal{X})_{i \bullet}, \hat{P}_{X^{(2)}}(Y|\mathcal{X})_{i \bullet})$ and index $i \bullet$ denoting the $i$-th row.
Based on this, three similarity measures for datasets are proposed:

\begin{enumerate}
\item NTO1 = $s(\hat{P}_{X^{(1)}}(\mathcal{X}), \hat{P}_{X^{(2)}}(\mathcal{X}))$
\item NTO2 = $s(\hat{P}_{X^{(1)}}(Y, \mathcal{X}), \hat{P}_{X^{(2)}}(Y, \mathcal{X}))$
\item NTO3 = $S(Y|\mathcal{X})^{\top} \hat{P}_{Z}(\mathcal{X})$.
\end{enumerate}

All three measures have values in the interval $[0, 1]$, where high values correspond to high similarity.

\citet{ganti_framework_1999} calculate a decision tree model for each of the two datasets and calculate the greatest common refinement (GCR) induced by these trees. 
That is the intersection of the partitions of the sample space induced by each tree. 
A visualization of the computation of the GCR is given in Figure \ref{fig:ex.NKT}.
\citet{ganti_framework_1999} then compare the distribution of both datasets over this GCR. 
Let $n_r$ denote the number of segments of the GCR, $p_i$ the proportion of observations of $X^{(1)}$ that map to the $i$-th segment, and $q_i$ the respective proportion of observations of $X^{(2)}$ mapping to the $i$-th segment. 
Then \citet{ganti_framework_1999} compare the vectors $p$ and $q$ by a difference function $f: \mathbb{R}^{2n_r} \to \mathbb{R}^{n_r}$ and aggregate the results from that by an aggregate function $g: \mathbb{R}^{n_r} \to \mathbb{R}$ to obtain a measure of distance between the two datasets
\[
\text{GAN} = g(f(p,q)).
\]
Large values then indicate differences between the datasets. 
They propose the absolute difference function
\begin{align*}
f_a(p, q)_i = |p_i - q_i|,
\end{align*}
and the scaled difference function 
\begin{align*}
f_s(p, q)_i = \begin{cases}
\frac{|p_i - q_i|}{(p_i + q_i)/2}, & \text{if } (p_i + q_i) > 0, i = 1, \dots, n_r\\
0,& \text{otherwise}
\end{cases}.
\end{align*}
For the aggregate function, they propose the sum or maximum of the values from the difference function. 
For using the sum as the aggregate function together with either $f_a$ or $f_s$, it can be shown that the GCR is optimal in the sense that it gives the lowest value over all common refinements. 
For using the maximum, this property is not fulfilled. 
%As in general, for different combinations of difference and aggregate functions, there might not be an upper bound for the difference given by the proposed measure GAN1, 
\citet{ganti_framework_1999} propose using a bootstrap test procedure for assessing whether or not the two datasets are generated by the same data-generating process. \\
%The lower bound for GAN for all proposed difference and aggregate functions is 0. 
We use \pkg{rpart} package \citep{rpart} for tree estimation.
In the frame of a tree object fit with \fct{rpart}, the nodes are numbered starting with 1 at the root, following the rule that the left child node gets the ID of the parent times 2, and the right child node gets the ID of the parent times 2 plus 1.
This allows us to easily trace back the decision rules from a leaf node to the root using integer division by 2.
Moreover, the split rules can be easily accessed using the \fct{labels} function on the tree object.
We iterate over leaves and collect all split rules on each path from the leaf to the root.
Suppose no upper or lower limit is specified by any split rule for a certain variable in this way.
In that case, we set this limit to the minimum or maximum, respectively, of this variable over both datasets.
This ensures that each observation in either of the two datasets falls into some part of the intersected partition later on.
The resulting set of ranges for all variables for each leaf node gives us the partition induced by the tree.
The resulting partitions are intersected as described in \citet{ganti_framework_1999} and \citet{ntoutsi_general_2008}.
For \citet{ntoutsi_general_2008}, all three methods presented in the original article (see above) are implemented.
No test is performed.
For \citet{ganti_framework_1999}, the difference and aggregation functions can be supplied by the users.
The suggested choices $f_a$ and $f_s$, i.e., taking the absolute differences between the joint probabilities calculated on the GCR or normalizing this difference with the sum of both probabilities, are readily implemented.
The default difference function is set to $f_a$, and the default aggregation function is set to the sum.
A permutation test can be performed.\\
Neither \citet{ntoutsi_general_2008} nor \citet{ganti_framework_1999} discuss the hyperparameter choice for the decision trees. 
Here, we offer the options to use the default parameter settings of \fct{rpart} or to tune the hyperparameters.
For tuning the hyperparameters, we use the \fct{best.rpart} function of the \pkg{e1071} package \citep{e1071}.
The parameters \code{minsplit}, \code{minbucket}, and \code{cp} of the tree can be tuned.
The ranges that are used here for tuning are chosen based on the recommendations by \citet{bischl_hyperparameter_2021}.
Tuning is enabled by default but can be disabled by setting \code{tune} to \code{FALSE}.
Cross-validation is used for tuning.
The number of evaluations (\code{n.eval}) is set to 100 as a default, and the number of folds (\code{k}) is set to 5.
Both values can be customized by the user.
The remaining calculation works the same for a tuned or untuned tree model.
By default, the number of permutations is set to 0, corresponding to not performing any test.

An implementation for categorical data for the method of \citet{ganti_framework_1999} is also supplied.
This comes with the following difficulties.
If a category is only observed in one dataset and not in the other, or even if just not all combinations of categories are observed, it might happen that at a certain split, not all levels of the respective variable are observed in the remaining data at that split.
Then, it is unclear to which child node the missing level is assigned.
In the \fct{rpart::rpart} implementation that we use here, the label is not assigned at all.
If now in the other dataset, the combination with this label is present, the respective data points do not fit anywhere in the intersected partition.
Therefore, the calculated probabilities in the joint distribution do not sum to one anymore.
In these cases, a warning is printed.
The function might still return a useful measure of dataset distance, but the interpretation and theoretical results might not hold anymore.
Also note that for deep trees, the intersection in practice often reduces to all combinations of categories of the variables.
Therefore, the measure reduces to the differences in frequency of all category combinations in these cases, but is far more complicated and time-consuming to calculate.

\subsection[Jeffreys()]{\fct{Jeffreys}} \label{app:technical:jeffreys}
\emph{Jeffreys divergence} \citep{jeffreys_invariant_1997} is the symmetrized version
\[
J(F_1, F_2) = \text{KL}(F_1, F_2) + \text{KL}(F_2, F_1)
\]
of the Kullback Leibler (KL) divergence \citep{kullback_information_1951}
\[
\text{KL}(F_1, F_2) := \int \log\left(\frac{f_1(x)}{f_2(x)}\right) f_1(x) \dif x.
\]
Within the \fct{Jeffreys} function, Jeffreys divergence is calculated as the sum of the two KL-divergences \citep{kullback_information_1951} where each dataset is used as the first once.
The KL-divergences are calculated using density ratio estimation as recommended in \citet{sugiyama_direct_2013}.
For this, the \fct{densratio} function from the \pkg{densratio} package \citep{densratio} is used.
By default, the method \code{KLIEP} is chosen as suggested by \citet{sugiyama_direct_2013}.
At the time of implementation, the \pkg{densityratio} package was not yet available on CRAN, and the resulting KL divergences for some simple examples of data sampled from two univariate normal distributions were worse compared to the true values, which are known for that simple case, than the resulting KL divergences using the alternative package.
By now, the two packages perform similarly but the \pkg{densityratio} does not show advantages that would make a change of the implementation necessary.

\subsection[LHZ()]{\fct{LHZ}} \label{app:technical:lhz}
The \emph{characteristic distance} \citep{li_measuring_2022} is defined as
\begin{align*}
\text{CD} (X, Y) &= \E\left[\| \E\left(\exp\left(i\langle X^{\prime\prime}, X - X^{\prime}\rangle\right)\arrowvert X - X^{\prime}\right)\right.\\
&\qquad ~~- \left.\E\left(\exp\left(i\langle Y, X - X^{\prime}\rangle\right)\arrowvert X - X^{\prime}\right)\|^2\right]\\
&\quad + \E\left[\| \E\left(\exp\left(i\langle X, Y - Y^{\prime}\rangle\right)\arrowvert Y-Y^{\prime}\right)\right. \\
&\qquad \quad ~-\left. \E\left(\exp\left(i\langle Y^{\prime\prime}, Y - Y^{\prime}\rangle\right)\arrowvert Y - Y^{\prime}\right)\|^2\right],
\end{align*}
where $X^{\prime}, X^{\prime\prime}$ and $Y^{\prime}, Y^{\prime\prime}$ denote independent copies of $X\sim F_1$ and $Y\sim F_2$, respectively.
An empirical version is obtained by replacing the conditional expectations with empirical means.
The implementation calculates the empirical characteristic distance between two datasets.
For both summands, Euler's formula is used for every entry of the inner product defined in \citet{li_measuring_2022}. 
Both mean values are calculated, and the squared complex modulus of the difference between the two means is calculated.
Since the inner product leads to a symmetric matrix, only an upper triangular matrix is calculated, and the final sum is multiplied by two.
For reproducibility, a permutation test with \code{n.perm} permutations and random seed \code{seed} is performed.

\subsection[MMCM(), Petrie()]{\fct{MMCM}, \fct{Petrie}} \label{app:technical:crossmatch}
The tests of \citet{mukherjee_distribution-free_2022} and \citet{petrie_graph-theoretic_2016} generalize the Rosenbaum cross-match test to multiple samples by calculating the cross-counts for all pairs of samples based on the optimal non-bipartite matching on the pooled sample and taking the Mahalanobis distance or simply the sum of the cross-counts, respectively, as the test statistics.
The cross-match counts $A = (a_{12}, a_{13},\dots,a_{1k}, a_{23},\dots,a_{2k},\dots,a_{k-1,k})^\top$ for all pairs of datasets are calculated using the optimal non-bipartite matching on the pooled sample. 
The test statistic for the MMCM test then is the Mahalanobis distance of the observed cross-counts under the null hypothesis $H_0: F_1 = F_2 = \dots = F_k$
\[
\text{MMCM} = (A - \E_{H_0}(A))^\top \COV_{H_0}^{-1}(A) (A - \E_{H_0}(A)).
\]
The expectation and covariance matrix of the cross-count vector $A$ under $H_0$ can be calculated analytically and depend only on the sample sizes $n_i, i = 1,\dots,k$. 
Small values of the multi-sample Mahalanobis cross-match (MMCM) statistic indicate similarity. 
However, as there is no known upper bound, it is hard to interpret the MMCM value. 
The MMCM statistic follows a $\chi^2_{\binom{k}{2}}$ distribution asymptotically under the null, which can be used for testing. 
The statistic for the test or Petrie is simply the sum of the cross-matches.
When standardized by its expectation and variance under the null, the test statistic is asymptotically normally distributed. 
Small values of the test statistic correspond to dissimilarity.
New functions \fct{MMCM} and \fct{Petrie} were implemented. 
There exist implementations of these methods in the \proglang{R} package \pkg{multicross} \citep{multicross}, but the package is archived on CRAN, and the implementation makes it impossible to access the test statistic and $p$~value as numeric values. 
Therefore, here, the functions were re-implemented from scratch. 
To ensure that the new functions are not derivations of the \pkg{multicross} versions, they were implemented by an author who had not looked at the \pkg{multicross} implementations before. 
The functions implement the formulas from Section~2 of \citet{mukherjee_distribution-free_2022}.  
The new output is again of class \class{htest} and contains the test statistic value and the $p$~value as a numeric value.
The \pkg{nbpMatching} package \citep{nbpMatching} is used for calculating the optimal non-bipartite matching. 
Note that in case of ties in the distance matrix, the optimal non-bipartite matching might not be defined uniquely. 
In the current implementation, the observations in the pooled sample are ordered by default as supplied by the user. 
When searching for a match, the \pkg{nbpMatching} implementation of the optimal non-bipartite matching algorithm starts at the end of the pooled sample. 
Therefore, with many ties (e.g., for categorical data), observations from the first dataset are often matched with ones from the last dataset, and so on.
This might affect the validity of the test negatively since, even under the null, more cross counts than expected are observed.
As a workaround, the current implementation allows for a random ordering of the pooled sample in case of ties.
A small simulation study was conducted to examine the impact of the random ordering for the \fct{Petrie} function. 
In this study, the number of cross-counts under the null hypothesis was computed both with and without random ordering of the pooled sample across different settings of sample size, data dimension and the data-generating distribution. 
For low-dimensional data (i.e. $p=2$) the random ordering appears to resolve this problem for binary and categorical data. 
Figures \ref{fig:petrie:bin} and \ref{fig:petrie:mult} show the simulation results for low-dimensional data generated from binomial and discrete uniform distributions, respectively.
Simultaneously, for high-dimensional data (i.e. $p=10$) without ties or with only very few ties, both versions of the implementation yield consistent results. 
Figures \ref{fig:petrie:bin10} and \ref{fig:petrie:mult10} show the results for high-dimensional data generated from binomial and discrete uniform distributions, respectively.

\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{res.Petrie.bin.pdf}
\caption{Simulation results for bivariate data generated by a binomial distribution to examine the effect of random ordering within the \fct{Petrie} function in case of ties.}
\label{fig:petrie:bin}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{res.Petrie.mult.pdf}
\caption{Simulation results for bivariate data generated by a discrete uniform distribution to examine the effect of random ordering within the \fct{Petrie} function in case of ties.}
\label{fig:petrie:mult}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{res.Petrie.bin.10.pdf}
\caption{Simulation results for multivariate data ($p=10$) generated by a binomial distribution to examine the effect of random ordering within the \fct{Petrie} function in case of ties.}
\label{fig:petrie:bin10}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{res.Petrie.mult.10.pdf}
\caption{Simulation results for multivariate data ($p=10$) generated by a discrete uniform distribution to examine the effect of random ordering within the \fct{Petrie} function in case of ties.}
\label{fig:petrie:mult10}
\end{figure}




\subsection[OTDD()]{\fct{OTDD}} \label{app:technical:otdd}
\citet{alvarez-melis_geometric_2020} define a distance based on optimal transport between datasets that include a target (class) variable $Y$. 
The \emph{optimal transport dataset distance} (OTDD) is defined as
\[
d_{\text{OT}}(X^{(1)}, X^{(2)}) = \min_{\pi\in\Pi(F_1, F_2)} \int_{\mathcal{Z}\times\mathcal{Z}} d_{\mathcal{Z}}(z, z^\prime)^q \dif \pi(z, z^\prime)
\]
where $X^{(1)}, X^{(2)}$ denote the two datasets, \[
\Pi(F_1, F_2) := \{\pi_{1,2}\in\mathcal{P}(\mathcal{Z}\times\mathcal{Z})\arrowvert \pi_1 = F_1, \pi_2 = F_2\}
\]
is the set of joint distributions over the product space $\mathcal{Z}\times\mathcal{Z}$ over the sample space of the pooled sample with marginal distributions $F_1$ and $F_2$, and
\[
d_{\mathcal{Z}}(z, z^\prime) := (d_{\mathcal{X}}(x, x^{\prime})^{q^\prime} + W_{q^\prime}(\alpha_y, \alpha_{y^{\prime}})^{q^\prime})^{1/{q^\prime}}.
\]
defines a distance of two points $z^{\top} = (x^{\top}, y)$, and ${z^\prime}^{\top} = ({x^\prime}^{\top}, y^\prime)$ in the pooled sample.
$d_{\mathcal{X}}$ defines a distance on the covariate space, e.g., the Euclidean distance, and  $W_{q^\prime}(\alpha_y, \alpha_{y^{\prime}})$ is the $q^\prime$-Wasserstein distance of the distribution of the subset of covariate data with corresponding response value $y$ and the distribution of the subset of covariate data with corresponding response value $y^\prime$. 
The powers $q$ and $q^\prime$ have to be chosen in advance to calculate the OTDD.
The optimal transport problem can intuitively be motivated by imagining each probability density as a pile of dirt. 
Then, the cost function corresponds to the cost of transporting the dirt from one point to another, which is proportional to the distance between the two points. 
The optimal transport then corresponds to the lowest cost required for moving one pile of dirt fully to the shape and location of the other. 
Therefore, distributions can be regarded as more similar if the optimal transport between them is lower. 
For an intuitive explanation and visualization of the OTDD, also refer to 
\citet{hagen_measuring_2020}.

There is a \proglang{Python} implementation of the method (\url{https://github.com/microsoft/otdd}) that was used as a rough orientation here.
Compared to that, the JDOT option is deprecated.
The new implementation uses the Wasserstein distance implementation from the \pkg{approxOT} package \citep{approxOT} and the matrix square root from the \pkg{expm} package \citep{expm}.
Note that the solution of the optimal transport between two distributions is given by their $q$-Wasserstein distance to the power of $q$.
There are different options for the method to calculate the optimal transport based on the dataset distance.
First case: chosen method is \code{"augmentation"}.
In this case, the variable means and the covariance matrix of each dataset, reduced to each target observation value in that dataset, are calculated.
The mean vector and the vectorized covariance matrix (column-wise) corresponding to the target value are appended to each observation in each dataset.
Then, the $q$-Wasserstein distance to the power of $q$ of these augmented datasets is calculated.
Note that this calculation assumes commuting covariance matrices of all label distributions (rarely fulfilled in practice) and that the feature space metric coincides with the ground cost of the optimal transport problem on the labels \citep{alvarez-melis_geometric_2020}.
Second case: chosen method is \code{"precomputed.labeldist"}.
In this case, both the distance matrix for the label distributions and the distance matrix for the features are calculated, and the corresponding distances are added with weights \code{lambda.x} and \code{lambda.y}, respectively, to calculate a cost matrix of all observations.
In case of \code{sinkhorn = FALSE}, i.e., for the exact calculation, only the costs from each observation from the first dataset to each observation from the second dataset are needed.
When using the debiased Sinkhorn approximation, additionally, the costs within each dataset are needed.
For calculating the distance matrices of the label distributions, there are again different options:
\begin{enumerate}
\item \code{inner.ot.method = "exact"}. The Wasserstein distance for each label pair is calculated between the datasets reduced to the observations where the target value equals the corresponding label.
There are options for using the (debiased) Sinkhorn approximation and changing the parameters of the Wasserstein distance and the ground cost metric.
\item \code{inner.ot.method = "gaussian.approx"}. The label distributions are approximated by Gaussians, which leads to a simple closed-form solution of the optimal transport problem that uses only the means and covariances.
The calculation includes calculating multiple matrix square roots of covariance matrices, which might get costly if the number of variables is high.
Moreover, this calculation fails if the estimated covariance matrix is not numerically positive semi-definite.
This might happen especially for $N < p$ settings.
\item \code{inner.ot.method = "only.means"}. The former is further simplified by using only the means (i.e., assuming equal covariance matrices in all label distributions).
\item \code{inner.ot.method = "naive.upperbound"}. A distribution-agnostic upper bound for the optimal transport between the label distributions is calculated that again relies only on the means and covariance matrices of these distributions.
\end{enumerate}

\subsection[SH()]{\fct{SH}} \label{app:technical:sh}
The Schilling-Henze \citep{schilling_multivariate_1986, henze_multivariate_1988} test uses the mean within-sample edge-count, i.e.,
\[
\text{SH} := L := \frac{1}{KN} (R_1 + R_2)
\]
in a $K$-nearest neighbor graph as the test statistic.
It is implemented from scratch as follows.
\begin{enumerate}
\item Calculate $K$-nearest neighbor (NN) edge matrix on the pooled sample (distance function returning a distance matrix and $K$ are inputs of the function), i.e., create a matrix where the first column is each observation number repeated $K$ times, and the second column is the corresponding $K$ nearest neighbors of that observation. For the calculation of the $K$-NN graph, a function can be supplied by the user. Pre-implemented options include a brute-force search, a wrapper for the \fct{kNN} function from the \pkg{dbscan} package \citep{dbscan}, and the fast (approximative) $K$-NN algorithm implemented in the \fct{get.knn} function from the \pkg{FNN} package \citep{FNN}.
\item Count the number $L$ of rows where both observations come from the same sample (i.e., either both have observation number $\le n_1$ or both have observation number $> n_1$).
\item Calculate the quantities $\E_{H_0}(L)$ and $\VAR_{H_0}(L)$ from proposition 2.1 in \citet{henze_multivariate_1988}.
\item Calculate the standardized test statistic $L^* = (L - \E_{H_0}(L)) / \sqrt{\VAR_{H_0}(L)}$.
\item When performing a permutation test, permute the distance matrix on the pooled sample, recalculate $L$, and calculate the proportion of permuted test statistics that are larger than the observed value of $L$.
\item When performing an asymptotic test, use the asymptotic normal distribution of $Z$ as proposed in Remark 5.1 of \citet{henze_multivariate_1988}.
\item The observed value of $L^*$ is returned in the result as the \code{statistic}, the observed $L$ is returned as the \code{estimate}.
\end{enumerate}
The default for $K$ is set to one.
This is rather arbitrary based on computational speed as there is no good rule for choosing $K$ so far proposed in the literature \citep{aslan_new_2005}.

\subsection[YMRZL()]{\fct{YMRZL}} \label{app:technical:yu}
\citet{yu_two-sample_2007} propose a permutation test that uses the classification error of a classification tree that distinguishes between the two datasets.
The implementation of the test is based on the \fct{C2ST} function, as the methods work very similarly.
Here, we set the classifier to \code{"rpart"}, i.e., a CART.
Instead of the classification accuracy, as for the C2ST, the classification error, i.e., $1 - $ accuracy, is returned.
A permutation test is implemented using the \fct{boot::boot} framework, and the permutation $p$~value is calculated as the proportion of the number + 1 of permuted test statistics smaller than or equal to the observed value divided by the number of permutations.
\citet{yu_two-sample_2007} do not propose any asymptotic test, but since their test fits into the framework of \citet{lopez-paz_revisiting_2017}, the binomial test proposed there and implemented in the \fct{Ecume::classifier\_test} function utilized by \fct{C2ST} is still valid and therefore kept in the implementation.

\section*{Acknowledgments}
This work has been supported (in part) by the Research Training Group ``Biostatistical Methods for High-Dimensional Data in Toxicology'' (RTG 2624, Project P1) funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation - Project Number 427806116).\\
We would like to thank Nabarun Deb and Bodhisattva Sen for allowing us to use their \proglang{R} implementation of their test for the package. 
Moreover, we would like to thank David Alvarez-Melis, whose \proglang{Python} implementation of the OTDD was the basis for our \proglang{R} implementation.

<<reset, echo=FALSE, results=hide>>=
options(old)
par(op)
@

\bibliography{refs}

\end{document}
