spqrp

License: GPL-3

Sample Provenance Quality Resolver in Proteomics — native R port of the Python spqrp package.

Recent advances in mass spectrometry (MS) technology and lab methods have opened the door to large-scale proteomics, but they have also raised concern about sample mix-ups. spqrp helps you evaluate whether sample data are safe for further analysis by clustering samples and flagging probable mix-ups, uncertain assignments, and outliers.

Install

# install.packages("remotes")
remotes::install_github("fhradilak/spqrp_r")

No Python install needed — this is a native R port.

Input data format

A long-format data frame with these columns:

Column      Description
Sample_ID   Unique sample identifier
Patient_ID  Patient identifier
Protein     Protein name/identifier
Intensity   Numeric intensity value

Optionally, supply a protein ranking: a data frame with Protein and Importance columns. If you don’t supply one, the package uses a precomputed ranking from a plasma cohort (spqrp_example_data("ranking_cohort_a")).
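To make the expected shape concrete, here is a minimal base-R sketch that builds a toy long-format data frame and checks it for the four required columns. The values and the helper missing_required_columns() are illustrative assumptions, not part of spqrp; the package's own validator is check_input_data_format().

```r
# Toy long-format input: one row per (sample, protein) measurement.
# All values below are made up for illustration.
toy_df <- data.frame(
  Sample_ID  = c("S1", "S1", "S2", "S2"),
  Patient_ID = c("P1", "P1", "P1", "P1"),
  Protein    = c("ALB", "APOA1", "ALB", "APOA1"),
  Intensity  = c(1520.5, 310.2, 1498.0, 325.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not an spqrp function): returns the names of any
# required columns that are missing from `df`.
missing_required_columns <- function(df) {
  required <- c("Sample_ID", "Patient_ID", "Protein", "Intensity")
  setdiff(required, names(df))
}

missing_cols <- missing_required_columns(toy_df)
intensity_is_numeric <- is.numeric(toy_df$Intensity)
```

An empty missing_cols together with a numeric Intensity column is the minimum this sketch checks; the package may validate more than that.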

Quick start

library(spqrp)

df      <- spqrp_example_data("input_cohort_df")
ranking <- spqrp_example_data("protein_ranking")

# Clustering: build kNN graph, split big components, visualise
res <- run_clustering(
  df = df, ranking = ranking,
  n_neighbors = 1L,
  max_component_size = 2L,
  metric = "manhattan",
  method = "UMAP"   # or "PCA" / "MDS"
)

res$cluster_assignments       # sample -> cluster ID
res$uncertain_samples         # likely missing connections
res$error_candidate_samples   # likely sample mix-ups
res$plot                      # ggplot object
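For intuition about what a 1-nearest-neighbour graph and its components look like, the following base-R sketch links each toy sample to its closest sample under the Manhattan metric and then groups linked samples. It is a conceptual illustration under assumed toy data, not the package's implementation, and it does not reproduce how run_clustering() splits large components or computes the 2D embedding.

```r
# Toy sample-by-protein matrix (rows = samples); values are made up.
x <- rbind(
  A1 = c(1.0, 2.0, 3.0),
  A2 = c(1.1, 2.1, 2.9),
  B1 = c(5.0, 5.0, 5.0),
  B2 = c(5.2, 4.9, 5.1),
  C1 = c(9.0, 1.0, 4.0)
)

# Manhattan distances between all samples; mask the diagonal so a sample
# is never its own neighbour.
d <- as.matrix(dist(x, method = "manhattan"))
diag(d) <- Inf

# Row index of each sample's single nearest neighbour (one neighbour each).
nn <- apply(d, 1, which.min)

# Connected components of the undirected 1-NN graph by label propagation:
# repeatedly give both ends of every edge the smaller of their two labels
# until nothing changes.
labels <- seq_len(nrow(x))
repeat {
  old <- labels
  for (i in seq_along(nn)) {
    m <- min(labels[i], labels[nn[i]])
    labels[i] <- m
    labels[nn[i]] <- m
  }
  if (identical(labels, old)) break
}
components <- split(rownames(x), labels)
```

Here A1 and A2 pair up, while C1 attaches to the B1/B2 pair and produces a component of three samples. A component larger than max_component_size is the kind of structure the pipeline goes on to split.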

Verbose output

All spqrp functions are silent by default: they print no progress messages and no per-call summaries. To see progress and diagnostics (which sample IDs were flagged, which cutoff was picked, how many proteins overlapped between the ranking and the data, and so on), pass quiet = FALSE to any function that emits status output:

remove_outlier_samples(df, quiet = FALSE)         # prints flagged Sample_IDs
run_clustering(df, ranking, n_neighbors = 1,
                max_component_size = 3, quiet = FALSE)  # prints save-path hint,
                                                          # cluster listing,
                                                          # transitive metrics
perform_distance_evaluation_on_ranked_proteins(
  df, top_importance_df = ranking, quiet = FALSE
)                                                 # prints real-protein count

Warnings about genuine data issues — e.g. samples dropped because they lack measurements for any of the top-ranked proteins — fire regardless of quiet, because they signal a real problem you need to see.
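The split between optional status output and unconditional warnings can be sketched in base R as follows. The function drop_incomplete_samples() and its behaviour are hypothetical and only illustrate the pattern; they are not taken from the package source.

```r
# Hypothetical helper (not an spqrp function) showing the pattern:
# status output goes through message() and is gated by `quiet`;
# genuine data problems go through warning() and are never gated.
drop_incomplete_samples <- function(df, quiet = TRUE) {
  incomplete <- unique(df$Sample_ID[is.na(df$Intensity)])
  if (!quiet) {
    message("Checked ", length(unique(df$Sample_ID)), " samples")
  }
  if (length(incomplete) > 0) {
    warning("Dropping samples with missing intensities: ",
            paste(incomplete, collapse = ", "))
  }
  df[!df$Sample_ID %in% incomplete, , drop = FALSE]
}

toy <- data.frame(
  Sample_ID = c("S1", "S2", "S3"),
  Intensity = c(10.2, NA, 11.7),
  stringsAsFactors = FALSE
)

# suppressWarnings() is used here only to keep the example output clean;
# without it the warning would surface even though quiet = TRUE.
kept <- suppressWarnings(drop_incomplete_samples(toy, quiet = TRUE))
```

Because messages and warnings are separate condition classes in R, a caller can still silence either one explicitly with suppressMessages() or suppressWarnings().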

Three random-forest backends for protein ranking

If you don’t have a precomputed ranking, train one. The Python package uses imblearn.BalancedRandomForestClassifier; this R port exposes three substitute backends so you can pick the tradeoff that fits:

results <- train_with_normalise(
  df,
  classifier_backend = "randomForest"  # default — closest to imblearn's BalancedRF
  # classifier_backend = "ranger"        # faster, class.weights on impurity
  # classifier_backend = "themis_smote"  # SMOTE rebalance + ranger
)

new_ranking <- retrieve_ranking(results)

See articles/numerical-divergence.md for when to pick each.
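As background on what "balanced" means in a balanced random forest, the sketch below draws one class-balanced bootstrap sample in base R: each tree would see the same number of rows from every class, so the majority class cannot dominate. The labels are toy data and balanced_bootstrap() is a hypothetical helper; how each spqrp backend actually rebalances is not shown here, and whether rows are drawn with or without replacement is a tunable detail (this sketch draws with replacement).

```r
set.seed(1)

# Toy imbalanced labels: 90 majority-class rows vs 10 minority-class rows.
y <- factor(c(rep("majority", 90), rep("minority", 10)))

# Hypothetical helper: draw one balanced bootstrap sample by sampling,
# within each class, as many rows as the smallest class contains.
balanced_bootstrap <- function(y) {
  n_min <- min(table(y))
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), n_min, replace = TRUE)]
  }), use.names = FALSE)
}

idx <- balanced_bootstrap(y)
class_counts <- table(y[idx])
```

Both classes contribute ten rows to the sample, regardless of the 9:1 imbalance in the full data.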

Threshold-based evaluation

result <- perform_distance_evaluation_on_ranked_proteins(
  df = df,
  top_importance_df = ranking,
  metric = "manhattan",
  p = 0.989,
  n = 20L
)
result$cutoff
result$eval_metrics[c("TP", "FP", "FN", "TN", "Precision", "Sensitivity", "F1")]

optimize_parameters() sweeps n and the percentile cutoff to find optimal values for your dataset.
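The idea behind threshold-based evaluation can be illustrated in base R: compute pairwise distances, call a pair "same patient" when its distance falls at or below a cutoff, and compare those calls with the Patient_ID labels. Everything below is an assumption for illustration: the data are made up, the cutoff is fixed by hand rather than derived from p, and the decision rule is a simplification rather than the package's exact procedure.

```r
# Toy data: 5 samples, 3 proteins (made-up values). S5 is labelled P1 but
# its profile resembles the P2 samples, mimicking a sample mix-up.
x <- rbind(
  S1 = c(1.0, 2.0, 3.0),
  S2 = c(1.1, 2.1, 2.9),
  S3 = c(5.0, 5.0, 5.0),
  S4 = c(5.2, 4.9, 5.1),
  S5 = c(5.1, 5.0, 4.9)
)
patient <- c(S1 = "P1", S2 = "P1", S3 = "P2", S4 = "P2", S5 = "P1")

# All unordered sample pairs with their Manhattan distance.
pairs <- t(combn(rownames(x), 2))
d <- as.matrix(dist(x, method = "manhattan"))
pair_dist <- d[pairs]
same_patient <- unname(patient[pairs[, 1]] == patient[pairs[, 2]])

# Assumed rule for this sketch: pairs at or below the cutoff are predicted
# to come from the same patient. The cutoff value is arbitrary here.
cutoff <- 1.0
predicted_same <- pair_dist <= cutoff

TP <- sum(predicted_same & same_patient)
FP <- sum(predicted_same & !same_patient)
FN <- sum(!predicted_same & same_patient)
TN <- sum(!predicted_same & !same_patient)

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
```

The mislabelled sample S5 shows up on both sides of the confusion table: its close pairs with S3 and S4 count as false positives, and its distant pairs with S1 and S2 count as false negatives.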

Preprocessing

Optional helpers that mirror the Python pipeline:

df_pp <- df |>
  log_transform() |>
  filter_by_occurrence(cutoff = 0.7)

norm <- normalize_medianintensity(df_pp, plot = FALSE)
df_pp <- norm$data

# If your data has a `plate` column:
df_pp <- plate_correct_residuals_by_protein(df_pp)
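For readers who want to see what these steps do to the numbers, here is a base-R sketch of one plausible reading of them. The details are assumptions and may differ from the package: log base 2, an occurrence filter that keeps proteins measured in at least 70% of samples, and a median normalisation that shifts each sample to a common median.

```r
# Toy long-format data (made-up values); NA marks a missing measurement.
toy <- data.frame(
  Sample_ID = rep(c("S1", "S2", "S3", "S4"), each = 3),
  Protein   = rep(c("ALB", "APOA1", "CRP"), times = 4),
  Intensity = c(1024,  256,  64,
                2048,  512,  NA,
                1024,  256,  NA,
                4096, 1024, 128),
  stringsAsFactors = FALSE
)

# 1. Log-transform (base 2 is an assumption of this sketch).
toy$Intensity <- log2(toy$Intensity)

# 2. Occurrence filter: keep proteins observed in at least 70% of samples.
occurrence <- sapply(split(!is.na(toy$Intensity), toy$Protein), mean)
keep <- names(occurrence)[occurrence >= 0.7]
toy <- toy[toy$Protein %in% keep, ]

# 3. Median normalisation: shift every sample so that its median matches
#    the median of all per-sample medians.
sample_median <- sapply(split(toy$Intensity, toy$Sample_ID), median, na.rm = TRUE)
shift <- median(sample_median) - sample_median
toy$Intensity <- toy$Intensity + unname(shift[toy$Sample_ID])

new_medians <- sapply(split(toy$Intensity, toy$Sample_ID), median)
```

In this toy case CRP is measured in only half of the samples and is dropped, and after the shift all four samples share the same median.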

Function reference

Function                                          Purpose
run_clustering()                                  End-to-end clustering pipeline
cluster_samples_iteratively()                     Build kNN graph + 2D embedding
plot_distances_neighbours_with_coloring_hue()     Heavy clustering visualization
perform_distance_evaluation_on_ranked_proteins()  Threshold-based pairwise classification
optimize_parameters()                             Grid-search optimal n and percentile
calculate_pairwise_distances()                    Distance matrix on top-n proteins
train_with_normalise()                            Full ranking pipeline (filter → normalize → RF)
retrieve_ranking()                                Extract ranked proteins from a trained model
train_pairwise_balanced_rand_forest()             Pairwise RF (3 backends)
get_threshold()                                   ROC/F1/Youden/MinFP threshold selection
get_distances(), get_nearest_neighbours()         Distance + kNN helpers
get_sample_relations_by_cutoff(),                 Cutoff → metrics
  get_evaluation_metrics()
percentile_cutoff()                               numpy.percentile-equivalent
filter_by_occurrence(), log_transform(),          Preprocessing
  revert_log_transform(),
  normalize_medianintensity(),
  plate_correct_residuals_by_protein()
by_isolation_forest(),                            Outlier detection (Isolation Forest). contamination = 0.1 for sklearn-like behaviour.
  by_isolation_forest_plot(),
  remove_outlier_samples()
spqrp_example_data()                              Access bundled example CSVs
check_input_data_format()                         Validate required columns
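As a note on the numpy.percentile equivalence mentioned for percentile_cutoff(): numpy's default linear interpolation corresponds to type 7 in base R's quantile(), with the percentile rescaled from 0-100 to 0-1. The sketch below demonstrates that correspondence with a hypothetical wrapper, percentile_like_numpy(); it is not the package's implementation, and the argument conventions of percentile_cutoff() itself are not assumed here.

```r
# numpy.percentile takes q on a 0-100 scale and, by default, interpolates
# linearly between order statistics. stats::quantile() with type = 7 (its
# default) applies the same rule with probabilities on a 0-1 scale.
# `percentile_like_numpy` is a hypothetical name, not an spqrp function.
percentile_like_numpy <- function(x, q) {
  unname(quantile(x, probs = q / 100, type = 7))
}

distances <- c(1, 2, 3, 4, 10)
p50 <- percentile_like_numpy(distances, 50)  # median: 3
p90 <- percentile_like_numpy(distances, 90)  # 4 + 0.6 * (10 - 4) = 7.6
```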

Migrating from the Python version

The R API mirrors the Python one: function names are identical snake_case. Outputs are R named lists, whose elements are accessed by name much like Python dict entries, e.g. res$cluster_assignments.

Because the underlying numerical libraries differ (uwot vs umap-learn, ranger vs imblearn, solitude (wrapping ranger) vs sklearn’s IsolationForest), exact numbers can differ between the R and Python versions even with matched seeds. See articles/numerical-divergence.md for which outputs are bit-exact, which match up to rotation/reflection, and which are only equivalent in expectation.
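When comparing outputs that agree only up to rotation or reflection, a coordinate-by-coordinate check will fail even though the geometry is the same. One way to compare such embeddings, sketched here in base R on made-up coordinates, is to compare their pairwise distances, which rotations and reflections leave unchanged. This is a generic illustration, not a procedure taken from the package or its articles.

```r
# A toy 2-D embedding (made-up coordinates), one row per sample.
emb_a <- rbind(c(0, 0), c(1, 0), c(0, 2), c(3, 1))

# Build a second embedding by rotating the first by 60 degrees and then
# flipping the second axis.
theta <- pi / 3
rotation   <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
reflection <- diag(c(1, -1))
emb_b <- emb_a %*% rotation %*% reflection

# Raw coordinates disagree, but pairwise distances are preserved.
same_coordinates <- isTRUE(all.equal(emb_a, emb_b))
same_geometry <- isTRUE(all.equal(as.vector(dist(emb_a)),
                                  as.vector(dist(emb_b))))
```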

License

GPL-3