---
title: "Single Cell RNA-Seq Example"
output:
  html_document:
    toc: true
    toc_depth: 2
    number_sections: true
    theme: flatly
params:
  de_results: NULL
  enrichment_results: NULL
  grn_object: NULL
  report_date: !r Sys.Date()
vignette: >
  %\VignetteIndexEntry{Single Cell Example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


## Single-Cell RNA-Seq Analysis with XYomics
Authors: Sophie Le Bars, Mohamed Soudy, and Enrico Glaab

# Introduction

This vignette provides a guide for using the **XYomics** package to analyze sex-related patterns in single-cell RNA-sequencing (scRNA-seq) data. We will walk through a complete workflow using simulated data demonstrating how to identify and interpret sex-specific gene expression changes at the cell-type level. To ensure fast execution, all heavy computations are **precomputed** and loaded from the package.

The tutorial covers:
1.  **load a simulated scRNA-seq dataset**.
2.  **Standard preprocessing** using the Seurat package.
3.  **Performing differential expression analysis** using different strategies: sex-stratified analysis and interaction analysis.
4.  **Detecting and categorizing sex-specific DEGs** for each cell type.
5.  **Visualizing gene expression** using dot plots.
6.  **Performing pathway enrichment analysis** to uncover biological functions.
7.  **Constructing and visualizing protein-protein networks**.

# Data Simulation and Preprocessing

## Simulating Single-Cell Data
We start by load a simulated scRNA-seq dataset containing 10 simulated samples. The code for generating this simulated dataset can be found in the Gitlab in the **data_vignette** folder.

```{r simulate-data, message=FALSE, warning=FALSE}
library(Seurat)
library(XYomics)
library(dplyr)

sim.seurat <- readRDS(
  system.file("extdata", "sc_toy_seurat.rds", package = "XYomics")
)
```

## Preprocessing and Clustering
We performed standard scRNA-seq preprocessing, including normalization, feature selection, and scaling. We then apply dimensionality reduction (PCA and UMAP) and clustering to identify cell populations.

```{r preprocessing, message=FALSE, warning=FALSE, results='hide'}
# Precomputed during data generation

# sim.seurat <- NormalizeData(sim.seurat)
# sim.seurat <- FindVariableFeatures(sim.seurat)
# sim.seurat <- ScaleData(sim.seurat)
# sim.seurat <- RunPCA(sim.seurat)
# sim.seurat <- FindNeighbors(sim.seurat)
# sim.seurat <- FindClusters(sim.seurat)
# sim.seurat <- RunUMAP(sim.seurat)
DimPlot(sim.seurat)
```

## Annotating Cell Types and Conditions
For the analysis, the Seurat object must contain metadata columns for cell type (`cell_type`), sex (`sex`), and condition (`status`). Here, we added mock annotations to our simulated data.

```{r annotate-data, warning=FALSE}
table(sim.seurat$cell_type)
table(sim.seurat$sex)
table(sim.seurat$status)

```

# Differential Expression Analysis Strategies

When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.

**Sex-stratified analyses** use standard statistical tests for differential molecular abundance analysis to test for disease-associated changes in each sex separately. However, a pure sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and absence of an effect. Stochastic variation in significance scores around a chosen threshold may lead to the erroneous detection of significance specific to only one sex, especially if the p-value in the other sex marginally exceeds the chosen threshold. In addition, such an analysis may miss sex-modulated changes, where significant changes in both sexes share the same direction but differ significantly in magnitude; these changes require cross-sex comparisons for accurate detection.

**Interaction analysis** formally tests whether the relationship between disease and molecular changes differs significantly between males and females. Not only can such interaction terms reveal complexities in disease mechanisms that might otherwise be obscured in analyses that do not consider SABV, but they also have the potential to detect changes that are limited to the magnitude of an effect, an aspect that sex-stratified analyses do not capture. Nevertheless, robust estimation of interaction effects requires large sample sizes, which are often not available due to the costs associated with advanced molecular profiling techniques such as single-cell RNA sequencing.

The **XYomics** package provides functions for both types of analyses.

## Method 1: Sex-Stratified Analysis

We use `sex_stratified_analysis_sc()` to identify DEGs between conditions separately for males and females within each cell type. We use `sex_stratified_pipeline_sc()` to obtain the categorized DEGs for each cell type in our seurat object.

```{r sex-stratified-analysis, eval=FALSE}
# Run for all cell types
# if you want to have the female and male degs 

results_degs <- sex_stratified_analysis_sc(sim.seurat) 
# The differential expression method can be specified using the `test.use` parameter.
# Available options include "MAST", "bimod", "poisson", "LR", and any other
# method supported by Seurat::FindMarkers(). 
#
# The `min_samples` parameter defines the minimum number of cells required
# to perform the analysis. The default value is 3.

#
# Internal QC results generated by XYomics can be accessed via:
# results_degs$validation
#
# This QC output includes assessment of sex and phenotype balance, per-group cell
# counts, detection of under-represented groups below the minimum cell threshold,
# and overall design imbalance warnings, as well as dataset sparsity metrics.

# if you want the categorized DEGs by cell type more lenient than the default thesholds as it is simulated data 

results <- sex_stratified_pipeline_sc(sim.seurat, min_abs_logfc = 0.1,  alpha = 0.1) 
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also run on one or multiple cell types, for example:
# target_cell_type_flag = "cell type 1", "cell type 2"

```




```{r load-stratified-results}
results_degs <- readRDS(
  system.file("extdata", "sc_results_degs_small.rds", package = "XYomics")
)

results <- readRDS(
  system.file("extdata", "sc_results_small.rds", package = "XYomics")
)

head(results$`cell type 1`)
```

## Method 2: Sex-Phenotype Interaction Analysis

We use `sex_interaction_pipeline_sc()` to perform a formal interaction analysis, directly comparing the condition effect between sexes for all cell types.

```{r sex-interaction-analysis, eval=FALSE}
# Example for one cell type (e.g., "cell type 1")

interaction_results <- sex_interaction_pipeline_sc(sim.seurat)
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")
#
# The `min_samples` parameter defines the minimum number of cells required
# to perform the interaction analysis. The default value is 20.
```

```{r load-interaction-results}
interaction_results <- readRDS(
  system.file("extdata", "sc_interaction_results_small.rds", package = "XYomics")
)

head(interaction_results$`cell type 1`$summary_stats)# signficant sex-modulated DEGs: interaction_results$`cell type 1`$sig_results 
```


# Visualization of Gene Expression
Dot plots are an effective way to visualize gene expression in scRNA-seq data. Here, we show the expression of the top male-specific, female-specific, sex-neutral and sex-dimorphic genes across all cell types, split by sex and condition.

```{r dot-plots, message=FALSE, warning=FALSE, fig.width=10, fig.height=8}
# Get top gene from each category for 'cell type 1'
top_male_gene <- results$`cell type 1` %>% filter(DEG_Type == "male-specific") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_female_gene <- results$`cell type 1`  %>% filter(DEG_Type == "female-specific") %>% top_n(1, -Female_FDR) %>% pull(Gene_Symbols)
top_dimorphic_gene <- results$`cell type 1`  %>% filter(DEG_Type == "sex-dimorphic") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_neutral_gene <- results$`cell type 1`  %>% filter(DEG_Type == "sex-neutral") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)

top_genes <- c(top_male_gene, top_female_gene, top_dimorphic_gene, top_neutral_gene)

generate_dotplot_sc(sim.seurat, top_genes, target_cell_type_flag = "cell type 1")
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")
```

# Pathway Enrichment Analysis
We perform pathway enrichment analysis to identify biological pathways over-represented in our DEG categories. The `categorized_enrich()` function automates this for all categories.

```{r pathway-analysis, eval=FALSE}
# Run for all cell types
pathway_category <- lapply(
  results,
  function(cell) categorized_enrich(cell, enrichment_db = "GO")
)
# The `enrichment_db` parameter can be set to "GO", "KEGG", "REACTOME".
# If the user wants to use a custom database, a TERM2GENE data frame must be
# provided via the `custom_db` parameter.
#
# Gene identifiers can be either any Gene identifier type supported by OrgDb
#' (e.g., "SYMBOL", "ENTREZID", "ENSEMBL", "UNIPROT")
# using the `gene_type` parameter.
#
# The `return_df` parameter controls the output format:
# if set to TRUE, results are returned as data frames;
# by default (FALSE), enrichResult objects are returned.
# Run for one specific cell type
# pathway_category_one <- categorized_enrich(results$`cell type 1`, return_df =T, enrichment_db = "GO")
```


```{r load-pathway-results}
pathway_category <- readRDS(
  system.file("extdata", "sc_pathway_single_small.rds", package = "XYomics")
)

pathway_category_one <- readRDS(
  system.file("extdata", "sc_pathway_single_small_df.rds", package = "XYomics")
)

head(pathway_category_one$`female-specific`)
```


You can then visualize them using `plot_enrichment_dotplots()` function. 

```{r celltype1_plot ,  results='asis', fig.keep='all', fig.width=15, fig.height=7, message=FALSE}
# visualize via dotplot
plot <- plot_enrichment_dotplots(pathway_category)
print(plot)
```

# Protein-protein interaction Network Analysis
Finally, we construct a protein-protein interaction network to explore the interactions between our DEGs.

## Building the Network
We first fetch a protein-protein interaction network from the STRING database or directly without the package (hormonal_network_metacore.rds). Then, we use the PCSF algorithm via `ppi_pipeline()` to extract a relevant subnetwork based on "prizes" derived from our DEG p-values for all cell types.


```{r build-network, message=FALSE, warning=FALSE}
# Fetch STRING network (can be replaced with a custom network)
# g <- get_string_network(organism = "9606", score_threshold = 900)

# Load a pre-existing network from a file
g <- readRDS(system.file("extdata", "string_example_network.rds", package = "XYomics")) # You can also download the hormonal network "hormonal_network_metacore.rds" instead of "string_example_network.rds".

```

```{r network-analysis, eval=FALSE}

### Run for all cell type
network_results <- ppi_pipeline(results, g)

# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")

# Example for one specific cell type


network_results_one <- ppi_pipeline(results,g , target_cell_type_flag = "cell type 2")
```

```{r load-network-results}
network_results <- readRDS(
  system.file("extdata", "sc_network_results_small.rds", package = "XYomics")
)
```




## Visualizing the Network
We use plot_network() to visualize the network, highlighting nodes based on their degree (i.e., significance). Each node also includes a barplot showing logFC values, blue for males and pink for females.

```{r visualize-network, warning=FALSE, message=FALSE}
female_network <- network_results$`cell type 2`$female_network

#Generate network visualization
plot_network(female_network, "female-specific", results$`cell type 2`, "Cell type 2", show_barplot = T) 

#Generate all_plots for all categories and all cell types

all_plots <- plot_network_pipeline(network_results, results)

```

# Generating a Report
The `generate_cat_report` function can be used to compile all results into a single HTML report.

```{r report, eval=FALSE}
# This command generates a comprehensive HTML report for single-cell dataset
generate_cat_report(results, pathway_category, network_results)


```


# Notes

- All heavy computations are precomputed to ensure fast vignette build time.
- Users can reproduce the full workflow using the functions provided in **XYomics**.
