---
title: "Getting Started with mLLMCelltype"
author: "Chen Yang"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with mLLMCelltype}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  eval = FALSE
)
```

# Getting Started with mLLMCelltype

This guide provides a quick introduction to using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We'll cover the basic workflow, input data requirements, and a simple example to get you started.

## Basic Workflow

The mLLMCelltype workflow consists of these main steps:

1. **Prepare marker gene data** for each cluster
2. **Run annotation** using one or multiple LLMs
3. **Create consensus** from multiple model predictions (optional)
4. **Integrate results** with your Seurat or Scanpy object
5. **Visualize results** with uncertainty metrics

## Loading the Package and Setting Up API Keys

First, load the mLLMCelltype package:

```{r}
library(mLLMCelltype)
```

### Setting Up API Keys

Before using mLLMCelltype, you need to set up API keys for the LLM providers you plan to use:

```{r}
# Set API keys as environment variables
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key")  # For Claude models
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key")        # For GPT models
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key")        # For Gemini models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # For OpenRouter models
```

You can obtain API keys from:

- Anthropic: https://console.anthropic.com/
- OpenAI: https://platform.openai.com/
- Google (Gemini): https://ai.google.dev/
- OpenRouter: https://openrouter.ai/keys
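
To confirm which keys the current R session can actually see, a quick check like the following may help. It is a minimal sketch in base R that only inspects the environment variable names used above; it does not call mLLMCelltype or validate the keys with any provider.

```{r}
# Environment variables used in this guide
key_vars <- c(
  "ANTHROPIC_API_KEY",
  "OPENAI_API_KEY",
  "GEMINI_API_KEY",
  "OPENROUTER_API_KEY"
)

# TRUE where a non-empty value is set, FALSE otherwise
key_is_set <- nzchar(Sys.getenv(key_vars, unset = ""))
names(key_is_set) <- key_vars
print(key_is_set)
```

Note that a placeholder such as `"your-anthropic-api-key"` also counts as set, so this only catches missing variables, not invalid keys.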

Alternatively, you can provide API keys directly in function calls:

```{r}
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-sonnet-4-6",
  api_key = "your-anthropic-api-key",  # Direct API key
  top_gene_count = 10
)
```

## Input Data Requirements

mLLMCelltype accepts marker gene data in several formats:

### 1. Data Frame Format

A data frame with the following columns:

- `cluster`: Cluster ID (preserved as-is from your data)
- `gene`: Gene name/symbol
- `avg_log2FC` or similar metric: Log fold change
- `p_val_adj` or similar metric: Adjusted p-value

Example:

```{r}
# Example marker data frame
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)
```

### 2. Seurat FindAllMarkers Output

You can directly use the output from Seurat's `FindAllMarkers()` function:

```{r}
# Assuming you have a Seurat object named 'seurat_obj'
library(Seurat)
all_markers <- FindAllMarkers(seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
```

### 3. CSV File Path

A path to a CSV file containing marker gene data:

```{r}
# Path to your CSV file
markers_file <- "path/to/markers.csv"
```

### 4. List Format

A list where each element contains marker genes for a cluster:

```{r}
# Example marker list
markers_list <- list(
  "0" = c("CD3D", "CD3E", "CD2", "IL7R", "LTB"),
  "1" = c("CD14", "LYZ", "CST3", "MS4A7", "FCGR3A")
)
```
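
If you already have a marker data frame, the list form can be derived from it with base R. This is a sketch that assumes the rows are already ordered by importance within each cluster; the small `markers_df` is re-created here so the chunk runs on its own.

```{r}
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  stringsAsFactors = FALSE
)

# split() groups the gene column by cluster and names each element by cluster ID
markers_list <- split(markers_df$gene, markers_df$cluster)
str(markers_list)
```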

## Function Parameters

The `annotate_cell_types` function has the following parameters:

| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `input` | Marker gene data (data frame, list, or file path) | (required) |
| `tissue_name` | Tissue name (e.g., "human PBMC", "mouse brain") | `NULL` |
| `model` | LLM model to use | `"gpt-5.5"` |
| `api_key` | API key (if not set in environment) | `NA` |
| `top_gene_count` | Number of top genes per cluster to use | `10` |
| `debug` | Whether to print debugging information | `FALSE` |

Note: If `api_key` is set to `NA`, the function will return the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API.
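
To get a feel for what `top_gene_count` controls, you can preview the top genes per cluster yourself. The sketch below ranks genes by `avg_log2FC` in base R. Whether the package ranks genes in exactly this way is an assumption, and `top_genes_per_cluster()` is a hypothetical helper, not part of mLLMCelltype.

```{r}
# Hypothetical helper: keep the n genes with the largest avg_log2FC in each cluster
top_genes_per_cluster <- function(markers, n = 10) {
  by_cluster <- split(markers, markers$cluster)
  lapply(by_cluster, function(df) {
    df <- df[order(df$avg_log2FC, decreasing = TRUE), ]
    head(df$gene, n)
  })
}

# Toy data, deliberately not sorted by fold change
toy_markers <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD2", "CD3D", "CD3E", "LYZ", "CST3", "CD14"),
  avg_log2FC = c(2.1, 2.5, 2.3, 2.8, 2.5, 3.1),
  stringsAsFactors = FALSE
)

top_genes <- top_genes_per_cluster(toy_markers, n = 2)
print(top_genes)
```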

## Basic Usage Example

Here's a simple example using a single LLM model for annotation:

```{r}
# Example marker data
markers <- data.frame(
  cluster = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "IL7R", "LTB", "CD14", "LYZ", "CST3", "MS4A7", "FCGR3A"),
  avg_log2FC = c(2.5, 2.3, 2.1, 1.8, 1.7, 3.1, 2.8, 2.5, 2.2, 2.0),
  p_val_adj = c(0.001, 0.001, 0.002, 0.003, 0.005, 0.0001, 0.0002, 0.0005, 0.001, 0.002)
)

# Run annotation with a single model
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-sonnet-4-6",
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  top_gene_count = 10,
  debug = FALSE  # Set to TRUE for more detailed output
)

# Print results
print(results)
```

### Example Output

When using a single model like Claude, the output will be a character vector with one annotation per cluster:

```r
> print(results)
[1] "0: T cells"   "1: Monocytes"
```
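
If the result has the `"cluster: label"` form shown above, it can be turned into a named vector for lookups by cluster ID. This sketch assumes exactly that format; real model responses may differ, so inspect the output before relying on it.

```{r}
# Example result in the format shown above (not a live API response)
example_results <- c("0: T cells", "1: Monocytes")

# Text before the first colon is the cluster ID, the rest is the label
cluster_ids <- sub(":.*$", "", example_results)
labels <- trimws(sub("^[^:]*:", "", example_results))
annotations <- setNames(labels, cluster_ids)

print(annotations)
```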

## Multi-Model Consensus Example

For more reliable annotations, you can use multiple models and create a consensus:

```{r}
# Define models to use
models <- c(
  "claude-sonnet-4-6",       # Anthropic
  "gpt-5.5",                 # OpenAI
  "gemini-3.1-pro-preview"   # Google
)

# API keys for different providers
api_keys <- list(
  anthropic = Sys.getenv("ANTHROPIC_API_KEY"),
  openai = Sys.getenv("OPENAI_API_KEY"),
  gemini = Sys.getenv("GEMINI_API_KEY")
)

# Run annotation with multiple models
results <- list()
for (model in models) {
  provider <- get_provider(model)
  api_key <- api_keys[[provider]]

  results[[model]] <- annotate_cell_types(
    input = markers,
    tissue_name = "human PBMC",
    model = model,
    api_key = api_key,
    top_gene_count = 10
  )
}

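# Create consensus (takes the marker data and model list directly;
# the per-model `results` list above is not passed in)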
consensus_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human PBMC",
  models = models,  # Use all the models defined above
  api_keys = api_keys,
  controversy_threshold = 0.7,
  entropy_threshold = 1.0,
  consensus_check_model = "claude-sonnet-4-6"
)
```

### Consensus Output Example

The function automatically prints a summary upon completion:

```r
>
Consensus Summary:
-----------------
Total clusters: 2
Controversial clusters: 0
Consensus achieved for all clusters

Cluster 0:
  Final annotation: T cells
  Consensus proportion: 1.0
  Entropy: 0.0
  Model predictions:
    - claude-sonnet-4-6: T cells
    - gpt-5.5: T cells
    - gemini-3.1-pro-preview: T cells

Cluster 1:
  Final annotation: Monocytes
  Consensus proportion: 1.0
  Entropy: 0.0
  Model predictions:
    - claude-sonnet-4-6: Monocytes
    - gpt-5.5: Monocytes
    - gemini-3.1-pro-preview: Monocytes
```

## Integrating with Seurat

To add the annotations to your Seurat object:

```{r}
# Assuming you have a Seurat object named 'seurat_obj' and consensus results
library(Seurat)

# Add consensus annotations to Seurat object
seurat_obj$cell_type_consensus <- plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = names(consensus_results$final_annotations),
  to = consensus_results$final_annotations
)

# Extract consensus metrics from the consensus results
# Note: These metrics are available in the consensus_results$initial_results$consensus_results
consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) {
  metrics <- consensus_results$initial_results$consensus_results[[cluster_id]]
  return(list(
    cluster = cluster_id,
    consensus_proportion = metrics$consensus_proportion,
    entropy = metrics$entropy
  ))
})

# Convert to data frame for easier handling
metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame))

# Add consensus proportion to Seurat object
# (mapvalues() returns character when x is character, so convert back to numeric)
seurat_obj$consensus_proportion <- as.numeric(plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$consensus_proportion
))

# Add entropy to Seurat object
seurat_obj$entropy <- as.numeric(plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$entropy
))
```
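
The same cluster-to-label mapping can also be done with a named-vector lookup in base R, which avoids the plyr dependency and keeps numeric metrics numeric. The sketch below uses toy stand-ins for `Idents(seurat_obj)` and the consensus results, so all object names are illustrative only.

```{r}
# Toy stand-ins: per-cell cluster IDs and per-cluster results
cell_clusters <- c("0", "1", "1", "0", "1")
final_annotations <- c("0" = "T cells", "1" = "Monocytes")
consensus_proportion <- c("0" = 1.0, "1" = 0.67)

# Indexing a named vector by cluster ID returns one value per cell
cell_type <- unname(final_annotations[cell_clusters])
cell_cp <- unname(consensus_proportion[cell_clusters])

per_cell <- data.frame(
  cluster = cell_clusters,
  cell_type = cell_type,
  consensus_proportion = cell_cp,
  stringsAsFactors = FALSE
)
print(per_cell)
```

If your annotations are stored as a list rather than a character vector, calling `unlist()` on them first should give the named vector this lookup expects.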

## Basic Visualization

Here's a simple visualization of the results using Seurat:

```{r}
library(ggplot2)  # provides ggtitle(), theme(), and element_text()

# Plot UMAP with cell type annotations
DimPlot(seurat_obj, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) +
  ggtitle("Cell Type Annotations") +
  theme(plot.title = element_text(hjust = 0.5))
```

## Understanding the Output

The output of `annotate_cell_types()` is a vector of cell type annotations, where each element corresponds to a cluster.

The output of `interactive_consensus_annotation()` is a list containing:

- `final_annotations`: Final consensus cell type annotations
- `initial_results`: Initial predictions from each model
- `controversial_clusters`: List of clusters that required discussion
- `discussion_logs`: Detailed logs of the discussion process
- `session_id`: Unique identifier for the annotation session

### Understanding Uncertainty Metrics

When using consensus annotation, two key metrics help evaluate the reliability of annotations:

- **Consensus Proportion**: Ranges from 0 to 1, indicating the proportion of models that agree on the final annotation. Higher values indicate stronger agreement.
- **Entropy**: Measures the uncertainty in model predictions. Lower values indicate more certainty. An entropy of 0 means all models agree perfectly.

Clusters with low consensus proportion or high entropy may require manual review.
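
As a worked illustration, both metrics can be computed from a vector of per-model predictions. The sketch assumes Shannon entropy in bits (log base 2) over the label frequencies and a simple exact-match vote. The package may normalize labels or use a different log base, so treat the numbers as illustrative; `consensus_metrics_sketch()` is a hypothetical helper.

```{r}
# Hypothetical helper: agreement metrics for one cluster's predictions
consensus_metrics_sketch <- function(predictions) {
  freq <- table(predictions) / length(predictions)
  p <- as.numeric(freq)
  list(
    top_label = names(freq)[which.max(p)],
    consensus_proportion = max(p),
    entropy = -sum(p * log2(p))  # Shannon entropy in bits
  )
}

# All three models agree
unanimous <- consensus_metrics_sketch(c("T cells", "T cells", "T cells"))

# Two of three models agree
split_vote <- consensus_metrics_sketch(c("T cells", "T cells", "NK cells"))

str(unanimous)
str(split_vote)
```

Under these assumptions, the unanimous case gives a proportion of 1 and an entropy of 0, while the two-to-one split gives a proportion of 2/3 and an entropy of roughly 0.92 bits.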

## Using OpenRouter Free Models

If you don't have access to paid API keys, you can use OpenRouter's free models:

```{r}
# Set OpenRouter API key
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")

# Use a free model
free_results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "meta-llama/llama-4-maverick:free",  # Note the :free suffix
  api_key = Sys.getenv("OPENROUTER_API_KEY"),
  top_gene_count = 10
)

# Print results
print(free_results)
```

Available free models (Updated Oct 2025):

- `meta-llama/llama-4-maverick:free` - Meta Llama 4 Maverick (256K context, best performance)
- `deepseek/deepseek-v4-pro:free` - DeepSeek V4 Pro
- `meta-llama/llama-3.3-70b-instruct:free` - Meta Llama 3.3 70B (reliable)
- `venice/uncensored:free` - Venice Uncensored (new model)
- `z-ai/glm-4.5-air:free` - GLM 4.5 Air (lightweight)

**Important**: OpenRouter reduced free tier limits in 2025:

- **Free accounts**: 50 requests/day (down from 200), 20 requests/minute
- **Accounts with $10+ credits**: 1000 requests/day
- **Some models removed**: NVIDIA Nemotron and others have exited the free tier
- **For production use**: Consider using paid models for better reliability

## Troubleshooting

### Common Issues

1. **API Key Not Found**:

   ```r
   Error: No auth credentials found
   ```

   **Solution**: Ensure you've set the correct API key environment variable or provided it directly in the function call.

2. **Rate Limiting**:

   ```r
   Error: Rate limit exceeded
   ```

   **Solution**: Wait a few minutes before trying again, or reduce the number of API calls by processing fewer clusters at once.

3. **Invalid Model Name**:

   ```r
   Error: Unsupported model: [model_name]
   ```

   **Solution**: Check that you're using a supported model name and that it's spelled correctly.

4. **Network Issues**:

   ```r
   Error: Could not connect to API
   ```

   **Solution**: Check your internet connection and try again. If the problem persists, the API service might be down.
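
For transient failures such as rate limits or network errors, a small retry wrapper with exponential backoff can save manual reruns. This is a generic base-R sketch: `with_retry()` is a hypothetical helper, not part of mLLMCelltype, and it retries on any error, which may be broader than you want.

```{r}
# Hypothetical helper: call fn() up to max_attempts times, doubling the wait each time
with_retry <- function(fn, max_attempts = 3, base_delay = 1) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_attempts) {
      stop(result)  # give up and re-signal the last error
    }
    # Wait 1x, 2x, 4x, ... the base delay before the next attempt
    delay <- base_delay * 2^(attempt - 1)
    message("Attempt ", attempt, " failed: ", conditionMessage(result),
            "; retrying in ", delay, "s")
    Sys.sleep(delay)
  }
}

# Demonstration with a function that fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Rate limit exceeded")
  "ok"
}
demo_result <- with_retry(flaky, base_delay = 0.01)
print(demo_result)
```

In practice, `fn` would wrap your own call, for example `function() annotate_cell_types(...)` with the arguments you normally pass.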

## Next Steps

Now that you understand the basics of mLLMCelltype, you can explore:

- [Usage Tutorial](https://cafferyang.com/mLLMCelltype/articles/usage-tutorial.html): More detailed usage examples
- [Consensus Annotation Principles](https://cafferyang.com/mLLMCelltype/articles/consensus-principles.html): Learn about the consensus mechanism
- [Visualization Guide](https://cafferyang.com/mLLMCelltype/articles/visualization-guide.html): Create publication-ready visualizations
- [Advanced Features](https://cafferyang.com/mLLMCelltype/articles/advanced-features.html): Explore hierarchical annotation and other advanced features
- [FAQ](https://cafferyang.com/mLLMCelltype/articles/faq.html): Answers to common questions

If you encounter any issues, check the [FAQ](https://cafferyang.com/mLLMCelltype/articles/faq.html) or [open an issue](https://github.com/cafferychen777/mLLMCelltype/issues) on our GitHub repository.