Confounding by geography and anticoagulant compromises proposed ALS diagnostic model and biomarkers

Re-analysis of Chia et al. (2025)

Author

Sergey A. Kornilov, PhD (Biostochastics, LLC)

Published

October 18, 2025

Note: Author Contact & Data Acknowledgment

We thank the authors for making the underlying data publicly available. We contacted the corresponding authors on 09/10/25 requesting geographic metadata and received no response, necessitating label reconstruction from sample identifiers. We shared a draft of this reanalysis on 10/13/25, prior to formal submission, but received no formal or informal acknowledgment. Recognizing the impact of the federal shutdown on NIH workers, we attempted to reach out to the authors again while the shutdown was in progress, this time via social media. We received no response.

Executive Summary

0.1 Methodological Case Study: Multi-Site Confounding in Proteomics

This investigation serves as a methodological case study demonstrating how confounding naturally arises in multi-site biomarker studies and why standard validation approaches may not detect it. Using a rigorous multi-method framework applied to a published ALS proteomics dataset1, we document perfect site–tube correlation (Cramér’s V=1.0), an 18.6% performance gap between pooled cross-validation and leave-site-out validation (pooled CV AUC: 0.951 → LCV AUC: 0.774), and systematic tube effects across the proteome (35% of proteins show significant effects in healthy controls). These patterns illustrate generalizable lessons about validation strategies, study design, and the importance of testing cross-site generalization.

This case study examines confounding patterns using a multi-method validation framework:

Critical Finding #1: Proteins Predict Tube Type

Reverse Prediction AUC: 0.999

Proteins can predict plasma collection tube type with near-perfect accuracy, indicating tube type signal dominates biological signal.

Critical Finding #2: Geographic Generalization Failure

Performance Gap: 0.176 (18.6% drop)

  • Pooled CV AUC: 0.951
  • Leave-Country-Out AUC: 0.774

Models trained on one geographic cohort fail to generalize to another.

Critical Finding #3: Differential Analysis Confounding

Only 26.9% of pooled proteins replicate in both strata

  • Italy significant: 37
  • US significant: 379
  • Pooled significant: 78
  • Both strata: 23

Critical Finding #4: UK Biobank External Validation

Reported 99.1% AUC does not address multi-class, multi-matrix generalization

The UK Biobank validation (13 ALS vs 23,601 healthy controls) likely used EDTA plasma exclusively (standard UKB protocol), presenting a binary ALS-vs-healthy task on a single matrix, not the multi-class (ALS vs ONDs vs healthy), multi-matrix challenge the study claims to address. Under the confounding we document, the model learns ‘EDTA = not ALS’—making an EDTA-only healthy control set trivially separable even if biological signal is weak. Extreme class imbalance inflates aggregate metrics (AUC, specificity) while masking sensitivity failures. The UKB result cannot validate discrimination against ONDs or robustness across matrices (EDTA, heparin, serum), and does not mitigate the tube/site confounding driving apparent performance in the primary analysis.

0.2 Methodological Lessons Learned

This case study illustrates a systematic challenge in multi-site proteomics where confounding emerges naturally from logistical constraints:

  1. Natural confounding: Different sites often use different protocols (plasma collection tube type: HEPARIN vs EDTA)
  2. Diagnostic distribution: Patient populations vary by site (100% of neurological controls from US/EDTA site)
  3. Validation limitations: Standard pooled cross-validation maintains these correlations in train/test splits

Key insight: The study design and analytical approach follow field-standard practices, yet produce results where technical and biological signals are inseparable. This demonstrates why prospective validation strategies (leave-site-out CV, reverse prediction tests, balanced designs) are essential for multi-site studies.

Positive findings: Despite confounding, we identify 22 geographically consistent protein candidates and confirm NEFL as a robust ALS biomarker, providing a foundation for future tube-balanced validation studies.


1 Introduction

1.1 Multi-Site Confounding: A Methodological Case Study

Chia et al. (2025)1 reported a plasma proteomics biomarker panel for ALS diagnosis with 98% accuracy. Their machine learning model identified plasma collection tube type (HEPARIN vs. EDTA) as the 2nd most important feature out of 2,869 total features—an observation that prompted this methodological investigation.

This case study demonstrates how confounding naturally arises in multi-site proteomics studies: We document that tube type is perfectly correlated with geographic origin (Italy/HEPARIN vs. US/EDTA; Cramér’s V=1.0) and diagnostic distribution (100% of neurological controls from US/EDTA site, 85% of ALS cases from Italy/HEPARIN; Table 1). This confounding structure—which emerges organically from site-specific protocols and patient populations—creates a methodological challenge: biological disease signals become entangled with technical and geographic factors.

Purpose of this investigation: Rather than simply identifying problems, we demonstrate: (1) how confounding arises in real-world multi-site studies; (2) why standard validation approaches miss it; and (3) what practical solutions exist through stratified analysis, cross-site validation, and prospective study design improvements.

1.2 Anticoagulant Biochemistry and Pre-Analytical Confounding

Anticoagulant choice—EDTA or HEPARIN—fundamentally alters measured proteins. EDTA chelates calcium, preventing clotting but triggering cellular protein release. HEPARIN inhibits specific clotting enzymes and preserves different protein structures. The same patient’s blood yields different protein profiles by tube type—a pre-analytical artifact, not biological signal. When tube type is perfectly confounded with diagnostic group—100% of neurological controls in EDTA—disease-associated protein changes become non-identifiable from anticoagulant-induced technical artifacts.

Table 1: Perfect Confounding: Diagnosis × Tube Type Distribution
Diagnosis EDTA HEPARIN Total % HEPARIN
ALS 40 223 263 84.8%
Healthy_control 67 180 247 72.9%
Neurological_control 194 0 194 0.0%

We conducted an independent investigation with three objectives: (1) quantify confounding severity using reverse prediction and association metrics, (2) evaluate ML model robustness through leave-country-out cross-validation, and (3) identify geographically consistent biomarkers via geographic stratification.

We prespecified three hypotheses to test confounding and its clinical impact:

  • H1 (Reverse Prediction Test): Protein profiles would strongly predict anticoagulant tube type (null hypothesis: AUC = 0.5; critical threshold: AUC > 0.90 indicates systematic confounding).

  • H2 (Geographic Generalization): Models would fail geographic generalization under leave-country-out cross-validation (null hypothesis: no performance drop; critical threshold: >20% AUC reduction indicates geographic overfitting).

  • H3 (Cross-Stratum Replication): Most pooled ‘significant’ proteins would not replicate across both geographic strata (null hypothesis: ≥70% replication; severe confounding indicated if <30% replicate).

These criteria distinguish minor technical variability from systematic confounding preventing clinical deployment.

Note Box 1: Understanding Confounding and Validation

For clinical readers unfamiliar with machine learning validation concepts:

  • What is confounding? A non-biological factor (e.g., collection site or tube type) varying systematically with diagnosis becomes a “shortcut” the model learns. Performance can appear excellent while detecting workflow logistics, not disease biology.

  • Pooled cross-validation vs. leave-site-out (LSO) validation: Pooled CV mixes data from all sites in training/test splits, allowing site and tube signals to “leak” across partitions. LSO withholds entire sites during training to test clinical generalizability.

  • Why tube type as a predictor is problematic: Different anticoagulants alter measured protein levels through distinct biochemical mechanisms. When tube type correlates with diagnosis (100% of neurological controls used EDTA), the model learns “EDTA = not ALS” rather than biological ALS signatures, producing high pooled CV AUC that fails when tube types differ.

  • What robust validation requires: Leave-site-out cross-validation, balanced representation across tube types and sites, inclusion of ONDs as comparators, verification of stability across matrices, and uncertainty quantification testing whether performance depends on logistics shortcuts.
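
The tube-type shortcut can be quantified directly from Table 1 without measuring a single protein. A hypothetical rule that labels every HEPARIN sample "ALS" (an illustrative Python sketch; the study's models are Random Forests in R) already looks deceptively good:

```python
# Diagnosis x tube-type counts from Table 1
counts = {
    "ALS": {"EDTA": 40, "HEPARIN": 223},
    "Healthy_control": {"EDTA": 67, "HEPARIN": 180},
    "Neurological_control": {"EDTA": 194, "HEPARIN": 0},
}

# Tube-only rule: predict ALS iff the sample is HEPARIN (no proteins used)
sens = counts["ALS"]["HEPARIN"] / sum(counts["ALS"].values())
spec_nc = counts["Neurological_control"]["EDTA"] / sum(counts["Neurological_control"].values())
print(f"ALS sensitivity of tube-only rule: {sens:.1%}")            # 84.8%
print(f"Specificity among neurological controls: {spec_nc:.1%}")   # 100.0%
```

Because every neurological control is EDTA, the logistics-only rule never misclassifies a neurological control, which is exactly the shortcut a pooled model can exploit.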

For readers interested in statistical frameworks and computational details, the Technical Appendix (Section 8) provides comprehensive documentation of methods, software versions, and reproducibility protocols.


2 Methods

2.1 Overview of Reanalysis Approach

We evaluated confounding bias from plasma collection tube type, used as a predictive feature in the original model. Our reanalysis employed geographic stratification and leave-country-out cross-validation to assess model generalizability and identify geographically consistent biomarker candidates.

All analyses were implemented in R (version 4.4.2) using a reproducible computational pipeline (targets framework) with version-controlled dependencies (renv).

2.2 Computational Pipeline Architecture

library(targets)
library(visNetwork)

# Determine project root - handle both manual and tar_quarto() rendering
project_root <- if (file.exists("../_targets.R")) {
  normalizePath("..")
} else if (file.exists("_targets.R")) {
  getwd()
} else {
  "/Users/biostochastics/chia_etal_2025_als"  # fallback to absolute path
}

# Change to project root to find _targets.R
withr::with_dir(project_root, {
  # Create network with custom styling
  net <- tar_visnetwork(
    targets_only = TRUE,
    label = c("time", "size"),
    level_separation = 150
  )

  # Apply dark theme with bright viridis colors
  net %>%
    visOptions(
      highlightNearest = list(enabled = TRUE, degree = 1),
      nodesIdSelection = TRUE
    ) %>%
    visInteraction(
      navigationButtons = TRUE,
      hover = TRUE
    ) %>%
    visLayout(randomSeed = 42) %>%
    visNodes(
      font = list(color = "white", size = 16, face = "arial", bold = TRUE),
      borderWidth = 2,
      shadow = list(enabled = TRUE, color = "#000000", size = 5, x = 2, y = 2)
    ) %>%
    visEdges(
      color = list(color = "#666666", highlight = "#FDE724", hover = "#FDE724"),
      smooth = list(enabled = TRUE, type = "cubicBezier", roundness = 0.5),
      arrows = list(to = list(enabled = TRUE, scaleFactor = 0.5))
    ) %>%
    visPhysics(
      stabilization = list(iterations = 200),
      barnesHut = list(gravitationalConstant = -8000, springConstant = 0.001, springLength = 200)
    )
})
Figure 1: Complete computational dependency graph. Each node represents a target (data, model, or visualization); edges show dependencies. Interactive: hover to explore, click to select.

The dependency graph shows the complete computational workflow. Nodes represent computational targets (data, models, visualizations); directed edges indicate dependencies. The targets framework ensures:

  1. Reproducibility: All analyses run in the same order with identical inputs
  2. Efficiency: Only outdated targets rebuild when code changes
  3. Transparency: Full dependency structure is explicit and auditable
  4. Version control: All code and parameters are tracked in Git

This architecture enables complete reproduction via a single targets::tar_make() command. See Appendix for execution details.

2.3 Data Source and Study Design

2.3.1 Original Dataset

Dataset: Chia et al. (2025) OLINK plasma proteomics data

  • Total measurements: 2,031,559 protein-sample combinations
  • Samples: 704 unique samples
  • Proteins: ~2868 measurements per sample (OLINK panels)
  • Diagnoses: ALS (263), Healthy controls (247), Neurological controls (194)
  • Clinical variables: Age at collection, sex, diagnosis, plasma collection tube type (EDTA vs. HEPARIN), OLINK plate identifiers (Plates 1-9)

2.3.2 Geographic Origin Reconstruction

The original dataset lacked explicit country labels. We reconstructed geographic origin (Italy vs. US) from converging evidence:

  1. Plate identifier patterns: Jupyter notebook metadata indicated Plates 5-9 were processed at TRAYNOR (Italian Scholz ALS registry), Plates 1-4 at NDRU (US cohorts from NIH, Johns Hopkins, Baltimore Longitudinal Study of Aging).

  2. Tube type consistency: All HEPARIN samples corresponded to Plates 5-9, all EDTA to Plates 1-4, demonstrating perfect correlation (Cramér’s V ≈ 1.0) between tube type and inferred geographic origin.

  3. Diagnostic distribution validation: Perfect confounding of neurological controls with EDTA tubes (100% EDTA/US) supported the Italy/US geographic distinction.

Country label assignment:

  • Italy: Plates 5-9 AND HEPARIN tube type (n=403 samples)
  • United States: Plates 1-4 AND EDTA tube type (n=301 samples)
  • Unknown: Inconsistent plate-tube combinations (n=0, indicating 100% consistency)
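The reconstruction rule can be stated as a small function (an illustrative Python sketch of the plate/tube logic above; the actual pipeline is implemented in R):

```python
def assign_country(plate, tube):
    """Label-reconstruction rule from Section 2.3.2: plate batch + tube type."""
    if plate in {5, 6, 7, 8, 9} and tube == "HEPARIN":
        return "Italy"
    if plate in {1, 2, 3, 4} and tube == "EDTA":
        return "United States"
    return "Unknown"  # inconsistent plate-tube combination

print(assign_country(5, "HEPARIN"))  # Italy
print(assign_country(2, "EDTA"))     # United States
```

That zero samples fall into the "Unknown" branch is itself evidence of the perfect plate-tube correlation.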

Figure 2: Perfect confounding structure: Diagnosis × Tube Type × Country

2.4 Confounding Structure Quantification

Figure 3: PCA visualization showing protein expression variance driven by tube type rather than diagnosis

Principal Component Analysis reveals that tube type explains far more variance (PC1: 17.3%) than biological disease group. Tube type shows clear separation along PC1, whereas diagnosis shows minimal separation, demonstrating that technical factors dominate the variance structure.

2.4.1 Statistical Framework

We quantified confounding between tube type, geographic origin, and diagnosis using:

  1. Cramér’s V: Association strength between categorical variables (range: 0 = independent, 1 = perfect association). We calculated Cramér’s V with 95% CIs for all pairwise combinations of diagnosis, tube type, country, sex, and plate number.

  2. Chi-square tests: Evaluated independence between diagnostic group and tube type/country distributions.

  3. Contingency tables: Sample distributions across diagnosis × tube type, diagnosis × country, and tube type × country.

  4. Stratified analysis: Identified groups with perfect confounding (e.g., neurological controls = 100% EDTA).

Confounding Severity Thresholds:

We interpreted Cramér’s V using Cohen’s2 effect size conventions for 2×3 contingency tables (df=2): <0.07 (negligible), 0.07-0.21 (small), 0.21-0.35 (medium), ≥0.35 (large). V > 0.9 indicates near-perfect confounding and mathematical non-identifiability.
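Cramér’s V is computed from the chi-square statistic as V = √(χ² / (N · min(r−1, c−1))). A self-contained sketch (Python for illustration; the reanalysis uses DescTools in R) reproduces the reported diagnosis-tube association from the Table 1 counts:

```python
import math

def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of count rows."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    r, c = len(table), len(table[0])
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Rows: ALS, Healthy_control, Neurological_control; columns: EDTA, HEPARIN (Table 1)
print(round(cramers_v([[40, 223], [67, 180], [194, 0]]), 3))  # 0.721
```

This matches the Tube Type ↔ Diagnosis value of 0.721 reported in the Results.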

2.5 Machine Learning Validation Framework

2.5.1 Reverse Prediction Test (Diagnostic for Technical Confounding)

Objective: Can proteins predict tube type?

We trained a Random Forest classifier to predict tube type (HEPARIN vs. EDTA) from protein expression alone. This “reverse prediction test” diagnoses technical confounding: AUC > 0.9 indicates pre-analytical variation overwhelms biological signal.

Implementation:

  • Model: Random Forest with 500 trees (ranger package)
  • Features: All 2,868 proteins without pre-filtering (metadata excluded)
  • Outcome: Binary classification (EDTA vs. HEPARIN)
  • Cross-validation: 5-fold CV with stratified sampling
  • Missing values: Pre-imputed separately by country using missForest (0.01% overall missingness)
  • Hyperparameters: Grid search over mtry (√p, p/3) and min.node.size (5, 10)
  • Performance metrics: AUC-ROC, sensitivity, specificity
  • Feature importance: Permutation-based variable importance scaled to 0-100

Interpretation Thresholds3:

  • AUC < 0.7: Minimal concern (poor discrimination)
  • AUC 0.7-0.8: Moderate concern (acceptable discrimination indicates detectable tube effects)
  • AUC 0.8-0.9: Severe concern (excellent discrimination indicates major confounding)
  • AUC > 0.9: Critical concern (outstanding discrimination—tube effects dominate biological signal)
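These thresholds are stated in terms of AUC, which equals the probability that a randomly chosen HEPARIN sample receives a higher score than a randomly chosen EDTA sample (ties counted half). A minimal rank-based sketch (illustrative Python; the reanalysis itself uses pROC in R):

```python
def auc_roc(labels, scores):
    """AUC via the Mann-Whitney formulation: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0; uninformative scores give 0.5
print(auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
print(auc_roc([1, 0], [0.5, 0.5]))                   # 0.5
```

An AUC of 0.999 for predicting tube type therefore means an essentially deterministic mapping from proteins to anticoagulant.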

2.5.2 Leave-Country-Out Cross-Validation (Primary Generalizability Test)

Objective: Does the model generalize geographically?

Leave-country-out cross-validation (LCV) evaluates whether models trained in one geographic/tube context generalize to another, directly testing clinical deployment scenarios where tube types vary by institution.

Implementation:

Direction 1: Train on Italy (HEPARIN) → Test on US (EDTA)

  1. Partition data by country (Italy training: n=403, US test: n=301)
  2. Train Random Forest on Italian samples using 5-fold internal CV for hyperparameter tuning
  3. Apply trained model to held-out US samples
  4. Evaluate test performance (AUC, accuracy, sensitivity, specificity)

Direction 2: Train on US (EDTA) → Test on Italy (HEPARIN)

Reversed procedure with same preprocessing and evaluation.

Critical Data Leakage Prevention:

  • Protein filtering, scaling, and hyperparameter tuning performed only on training cohort
  • Test cohort held completely separate until final prediction step
  • No information from test samples influenced model training
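The leave-country-out loop can be sketched schematically (illustrative Python; `train_fn` and `eval_fn` are hypothetical placeholders standing in for the Random Forest training and AUC evaluation steps):

```python
def leave_country_out(samples, train_fn, eval_fn):
    """Train on all samples except one country's; evaluate on the fully held-out country."""
    results = {}
    for held_out in sorted({s["country"] for s in samples}):
        train = [s for s in samples if s["country"] != held_out]
        test = [s for s in samples if s["country"] == held_out]
        model = train_fn(train)  # all filtering, scaling, tuning must live inside train_fn
        results[held_out] = eval_fn(model, test)
    return results

# Toy demo: the "model" is just the training-set size; "evaluation" records fold sizes
samples = [{"country": "Italy"}] * 3 + [{"country": "US"}] * 2
res = leave_country_out(samples, train_fn=len, eval_fn=lambda model, test: (model, len(test)))
print(res)  # {'Italy': (2, 3), 'US': (3, 2)}
```

Keeping every preprocessing step inside `train_fn` is what prevents the leakage described above.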

Pooled Cross-Validation (Baseline Comparison):

We performed standard 5-fold CV on pooled Italy + US data to replicate the original study’s approach. Pooled CV mixes countries across folds, maintaining tube type distribution but not testing geographic generalization.

Performance Gap Thresholds:

Field conventions for cross-site generalization in clinical ML models4,5: <5% drop (acceptable), 5-10% (moderate concern), 10-20% (substantial concern), >20% (critical concern). Acceptable degradation depends on clinical context and baseline accuracy.
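Applying these conventions to the reported AUCs (a sketch in Python; the cutoffs are the field conventions quoted above, and the observed 18.6% relative drop sits at the top of the "substantial concern" band):

```python
def gap_severity(pooled_auc, external_auc):
    """Relative AUC degradation, classified by the cross-site conventions above."""
    drop = (pooled_auc - external_auc) / pooled_auc
    if drop < 0.05:
        label = "acceptable"
    elif drop < 0.10:
        label = "moderate concern"
    elif drop < 0.20:
        label = "substantial concern"
    else:
        label = "critical concern"
    return drop, label

drop, label = gap_severity(0.951, 0.774)  # pooled CV vs. leave-country-out AUCs
print(f"{drop:.1%} -> {label}")  # 18.6% -> substantial concern
```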

2.6 Stratified Differential Protein Expression Analysis

Objective: Identify geographically consistent biomarkers

We performed differential expression analysis separately within each geographic cohort to identify proteins with geographically consistent ALS signals, eliminating tube type and country confounding.

2.6.1 Statistical Framework

We used limma6, adjusting for age and sex to control demographic confounders.

Model specification:

NPX ~ Diagnosis + Age_Collection + Sex

Where NPX (Normalized Protein eXpression, Olink’s standardized log₂ scale) represents protein abundance, and Diagnosis contrasts ALS vs. controls.

2.6.2 Italy Cohort Analysis (HEPARIN-only)

Sample composition: 223 ALS patients vs. 180 healthy controls (all Italian samples use HEPARIN tubes)

Rationale: The Italian cohort lacked neurological controls, providing clean ALS vs. healthy comparison without OND confounding.

Preprocessing:

  1. Convert long-format data to protein matrix (proteins × samples)
  2. Remove proteins with >20% missing values
  3. Fit linear model with empirical Bayes moderation (limma::lmFit + limma::eBayes)
  4. Multiple testing correction using Benjamini-Hochberg false discovery rate (FDR)

Significance thresholds: FDR < 0.05 and |log₂ fold-change| > 0.5

2.6.3 US Cohort Analysis (EDTA-only)

Sample composition: 40 ALS patients vs. 261 controls (67 healthy + 194 neurological disease controls; all US samples use EDTA tubes)

Note on power: Small ALS sample size (n=40) limits power, increasing false negative risk but not false positives. Conservative thresholds mitigate false discovery.

Analysis approach: Binary comparison (ALS vs. all controls) with age and sex adjustment. Same preprocessing and significance criteria as Italy analysis.

2.6.4 Pooled Analysis (Confounded Baseline)

For comparison, we replicated the original study’s pooled approach by combining all samples without adjusting for tube type or country. This analysis serves as a baseline to quantify the impact of confounding on differential expression results.

Note: Pooled results are confounded by geographic/tube effects and should not be interpreted as biologically valid.

2.7 Meta-Analysis and Concordance Testing

2.7.1 Geographically Consistent Biomarker Identification

Proteins were classified as “geographically consistent” candidates if they met all of the following criteria:

  1. Statistically significant in Italy: FDR < 0.05 and log fold-change (logFC, the log₂ ratio of mean protein levels between groups) |logFC| > 0.5
  2. Statistically significant in US: FDR < 0.05 and |logFC| > 0.5
  3. Directionally concordant: Same sign of log₂ fold-change in both cohorts

This conservative approach ensures that identified proteins show consistent ALS-associated changes independent of tube type and geographic origin.

2.7.2 Effect Size Concordance

We evaluated cross-cohort consistency using:

  1. Pearson correlation of log₂ fold-changes between Italy and US cohorts (across all proteins)
  2. Combined p-values using Fisher’s method: χ² = -2(ln P_Italy + ln P_US), df=4
  3. Beta-beta plots: Scatterplots of Italy logFC vs. US logFC for visualizing concordance and identifying tube-specific outliers
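
For two p-values, Fisher's statistic follows a chi-square distribution with df = 4, whose survival function has the closed form e^(−x/2)(1 + x/2). A minimal sketch (illustrative Python):

```python
import math

def fisher_combined(p_italy, p_us):
    """Fisher's method for two p-values: chi-square statistic with df = 4."""
    stat = -2.0 * (math.log(p_italy) + math.log(p_us))
    # Chi-square survival function for df = 4: P(X > x) = exp(-x/2) * (1 + x/2)
    p_combined = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)
    return stat, p_combined

stat, p = fisher_combined(0.05, 0.05)
print(round(stat, 2), round(p, 3))  # 11.98 0.017
```

Note that two marginally significant cohort p-values combine to a clearly significant result only when both carry independent evidence.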

2.7.3 Stratified vs. Pooled Comparison

To quantify confounding-driven false discoveries, we calculated:

  • Overlap statistics: Number of proteins significant in pooled but not in either individual cohort
  • Replication rate: Percentage of pooled significant proteins that replicate in both Italy and US
  • Critical threshold: <30% replication indicates severe confounding. This heuristic is based on genomics/proteomics field observations that technical artifacts typically show poor cross-cohort replication (<30-40%), while robust biological signals typically replicate at >50-60% rates7,8

2.8 Exact Replication of Original Study Methods

To ensure fair comparison, we explicitly replicated the original study’s1 analysis pipeline:

80/20 Discovery-Replication Split:

  • Randomly split pooled data into 80% discovery (n≈563) and 20% replication (n≈141)
  • Train Random Forest on discovery set including tube type as predictor
  • Evaluate on held-out replication set
  • Report feature importance rank for tube type variable

Differential protein selection:

  • Select proteins significant in pooled analysis (FDR < 0.05, |logFC| > 0.6, matching original thresholds)
  • Use these proteins as features for ML model
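The 80/20 partition can be sketched as follows (illustrative Python; the seed shown is hypothetical, not the pipeline's):

```python
import random

def discovery_replication_split(sample_ids, frac=0.8, seed=42):
    """Shuffle sample IDs and split into discovery/replication sets (seed is illustrative)."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * frac)
    return ids[:cut], ids[cut:]

disc, rep = discovery_replication_split(range(704))
print(len(disc), len(rep))  # 563 141
```

Splitting the 704 samples reproduces the n≈563/141 set sizes quoted above; note that a random pooled split preserves the tube-diagnosis correlation in both halves, which is precisely why it cannot detect the confounding.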

2.9 Reproducibility and Software

2.9.1 Computational Environment

  • R version: 4.4.2 (2024-10-31)
  • Operating system: macOS 15.0
  • Dependency management: renv (version 1.0.11) for exact package version locking

2.9.2 Key R Packages

  • tidyverse 2.0.0: Data manipulation and visualization
  • data.table 1.16.4: Fast data loading and processing
  • targets 1.11.4: Reproducible pipeline orchestration
  • caret 7.0-1: Machine learning framework
  • ranger 0.17.0: Random Forest implementation
  • limma 3.60.0: Differential expression analysis
  • pROC 1.18.5: ROC curve analysis
  • DescTools: Cramér’s V calculation

2.9.3 Reproducibility Protocol

  1. Random seeds: All stochastic operations used seed = 46803468976 for reproducibility
  2. Pipeline framework: targets ensures dependency tracking and caching
  3. Version control: All R packages locked via renv.lock
  4. Execution: Full pipeline can be reproduced via targets::tar_make()

2.10 Statistical Considerations

2.10.1 Multiple Testing Correction

We applied Benjamini-Hochberg FDR correction separately within each analysis:

  • Differential expression: FDR control within each cohort (Italy, US, pooled)
  • Meta-analysis: Combined p-values using Fisher’s method to account for testing in two cohorts
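The Benjamini-Hochberg step-up procedure itself is short enough to sketch (illustrative Python; the R pipeline relies on limma's built-in BH adjustment):

```python
def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure, monotonicity enforced)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print([round(q, 3) for q in bh_fdr([0.01, 0.04, 0.03, 0.005])])  # [0.02, 0.04, 0.04, 0.02]
```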

2.10.2 Missing Data Handling

All analyses: Missing values (0.01% overall) were imputed once at data loading using missForest, applied separately for Italy and US cohorts to prevent cross-country information leakage. This approach:

  • Uses random forest-based imputation (missForest::missForest()) well-suited for high-dimensional proteomics data
  • Applied independently to Italy (HEPARIN) and US (EDTA) subsets before any downstream analysis
  • Prevents cross-country information leakage during imputation (Italy samples never inform US imputation and vice versa)
  • Extremely low missingness (0.01% overall, max 0.14% per protein) makes imputation straightforward
  • Imputed data stored in pipeline target data_imputed and used for all subsequent analyses

Differential expression: Proteins with >20% missing values would be excluded prior to limma analysis (standard practice). However, no proteins exceeded this threshold in the current dataset.

Rationale: Pre-imputation with geographic stratification maintains methodological rigor by ensuring Italy and US cohorts remain informationally separate throughout the entire analysis pipeline, consistent with the study’s focus on geographic confounding.

2.10.3 Sample Size and Power

We acknowledge limited statistical power in the US ALS cohort (n=40), which may result in false negatives (missing true ALS signals) but maintains Type I error control (FDR < 0.05).

2.11 Limitations of Reanalysis

We acknowledge the following limitations:

  1. Confounding non-identifiability: Perfect correlation between tube type, country, batch, and storage protocol means we cannot definitively attribute effects to any single factor. Our conclusions apply to the aggregate of these technical artifacts.

  2. No external validation: We reanalyzed existing data without access to independent validation cohorts. Prospective tube-balanced studies are required to confirm geographically consistent biomarkers.

  3. Limited US ALS sample size: Power constraints in the US cohort (n=40 ALS) may cause us to underestimate the number of true geographically consistent proteins (increased false negatives, but Type I error remains controlled).


Key terms used in this investigation:

  • AUC (Area Under ROC Curve): Measure of classifier performance ranging from 0.5 (random) to 1.0 (perfect). Values >0.9 indicate excellent discrimination.

  • Cramér’s V: Measure of association strength between categorical variables. Range: 0 (independent) to 1 (perfect association). Cohen (1988) effect sizes for df=2: 0.07 (small), 0.21 (medium), 0.35 (large).

  • Cross-validation (CV): Method for evaluating model performance by splitting data into training and test sets multiple times.

  • FDR (False Discovery Rate): Multiple testing correction that controls the expected proportion of false positives among rejected hypotheses. FDR <0.05 means <5% of “significant” results are expected to be false discoveries.

  • Leave-Country-Out CV: Rigorous validation where models trained on one geographic cohort are tested on a completely held-out cohort. Tests true generalizability.

  • logFC (log Fold-Change): Difference in NPX values between groups. Since NPX is already on a log₂ scale, this difference directly represents log₂(fold-change). |logFC| > 0.5 represents >40% change; |logFC| > 1.0 represents doubling/halving.

  • Sensitivity: True positive rate; proportion of actual cases correctly identified.

  • Specificity: True negative rate; proportion of actual controls correctly identified.

  • Stratification: Analyzing data separately within homogeneous subgroups to eliminate confounding.
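
The logFC conversions quoted in the glossary can be checked numerically (illustrative Python): since NPX is log₂-scale, a group difference of 0.5 corresponds to a 2^0.5 ≈ 1.41-fold (about 41%) change.

```python
# NPX is log2-scale, so a group difference (logFC) converts to a linear fold-change as 2**logFC
for lfc in (0.5, 1.0, -1.0):
    print(f"logFC {lfc:+.1f} -> fold-change {2 ** lfc:.3f}")
# logFC +0.5 -> fold-change 1.414
# logFC +1.0 -> fold-change 2.000
# logFC -1.0 -> fold-change 0.500
```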


3 Results

Our multi-method investigation reveals severe confounding that compromises the reported ALS diagnostic model. Three analyses demonstrate that tube type and geographic origin dominate: (1) perfect site-tube correlation (Cramér’s V=1.0); (2) leave-country-out validation showing 18.6% AUC inflation and a sensitivity collapse to 1.8%; and (3) only 26.9% of ‘significant’ proteins replicating across strata. Even the 22 geographically consistent candidates show substantial residual confounding, with 50% exhibiting significant tube effects in healthy controls.

3.1 Confounding Quantification

3.1.1 Statistical Measures

Association Strengths (Cramér’s V):

  • Tube Type ↔︎ Diagnosis: 0.721 (strong)
  • Country ↔︎ Diagnosis: 0.721 (strong)
  • Tube Type ↔︎ Country: 0.997 (perfect)

Note: Cramér’s V ranges from 0 (no association) to 1 (perfect association). Cohen’s2 conventions for df=2: ≥0.21 medium, ≥0.35 large.

3.2 Reverse Prediction Test: Strong Evidence for Technical Confounding

Proteins achieve near-perfect tube type prediction accuracy (AUC ≥0.999), demonstrating technical artifacts dominate biological signal. Even the 22 ‘geographically consistent’ biomarker candidates retain strong tube-type signatures (AUC=0.916). When technical metadata can be reconstructed with this accuracy from biological features, those features are contaminated and cannot reliably represent disease biology.

library(ggplot2)
library(pROC)
library(dplyr)

# Create labels
full_label <- sprintf("Full Model (AUC = %.3f)", reverse_prediction_results$auc)
top_label <- sprintf("Top 10 Proteins (AUC = %.3f)", top_proteins_tube_prediction$auc)
biomarker_label <- sprintf("Original Biomarkers (AUC = %.3f)", biomarker_tube_prediction$auc)
concordant_label <- sprintf("Concordant Proteins (AUC = %.3f)", concordant_tube_prediction$auc)

# Extract predictions and calculate ROC curves
# Full model
full_preds <- reverse_prediction_results$model$pred %>%
  dplyr::filter(
    mtry == reverse_prediction_results$model$bestTune$mtry,
    min.node.size == reverse_prediction_results$model$bestTune$min.node.size
  )
roc_full <- pROC::roc(
  response = full_preds$obs,
  predictor = full_preds$HEPARIN,
  levels = c("EDTA", "HEPARIN"),
  direction = "<"
)

# Top 10 proteins
top_preds <- top_proteins_tube_prediction$predictions
roc_top <- pROC::roc(
  response = top_preds$obs,
  predictor = top_preds$HEPARIN,
  levels = c("EDTA", "HEPARIN"),
  direction = "<"
)

# Original biomarkers
biomarker_preds <- biomarker_tube_prediction$predictions
roc_biomarker <- pROC::roc(
  response = biomarker_preds$obs,
  predictor = biomarker_preds$HEPARIN,
  levels = c("EDTA", "HEPARIN"),
  direction = "<"
)

# Concordant proteins
concordant_preds <- concordant_tube_prediction$predictions
roc_concordant <- pROC::roc(
  response = concordant_preds$obs,
  predictor = concordant_preds$HEPARIN,
  levels = c("EDTA", "HEPARIN"),
  direction = "<"
)

# Combine ROC data
roc_combined <- rbind(
  data.frame(
    fpr = 1 - roc_full$specificities,
    tpr = roc_full$sensitivities,
    model = full_label
  ),
  data.frame(
    fpr = 1 - roc_top$specificities,
    tpr = roc_top$sensitivities,
    model = top_label
  ),
  data.frame(
    fpr = 1 - roc_biomarker$specificities,
    tpr = roc_biomarker$sensitivities,
    model = biomarker_label
  ),
  data.frame(
    fpr = 1 - roc_concordant$specificities,
    tpr = roc_concordant$sensitivities,
    model = concordant_label
  )
)

# Create color mapping with 4 distinct colors
color_values <- c("#FDE724", "#1F9E89", "#B5DE2B", "#31688E")
names(color_values) <- c(full_label, top_label, biomarker_label, concordant_label)

# Create plot
ggplot(roc_combined, aes(x = fpr, y = tpr, color = model)) +
  geom_line(linewidth = 2.5, alpha = 0.9) +
  geom_abline(
    slope = 1, intercept = 0, linetype = "dashed",
    color = "#71717A", linewidth = 1
  ) +
  scale_color_manual(
    name = NULL,
    values = color_values
  ) +
  labs(
    title = "Reverse Prediction Test: Predicting Tube Type from Proteins",
    subtitle = "AUC = 0.999 - Near-perfect discrimination indicates tube effects dominate",
    x = "False Positive Rate (1 - Specificity)",
    y = "True Positive Rate (Sensitivity)"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.background = element_rect(fill = "#151520", colour = NA),
    legend.key = element_rect(fill = "#151520", colour = NA),
    legend.position = c(0.65, 0.25)
  ) +
  coord_equal()
Figure 4: ROC curves for predicting tube type from three protein sets

Results:

  • Full Model (all proteins): AUC = 0.999 (99.7% sensitivity, 94.8% specificity)
  • Top 10 Proteins: AUC = 1.000
  • Original Biomarkers (Chia et al. 17-protein panel): AUC = 0.860 (n=11 proteins)
  • Concordant Proteins (significant in both cohorts): AUC = 0.908 (n=22 proteins)

The reverse prediction test produces a compelling result with important implications for interpreting the published biomarker claims. By training classifiers to predict plasma collection tube type (HEPARIN vs. EDTA) from protein expression profiles alone—excluding all clinical metadata—we demonstrate that tube type effects are pervasive across multiple protein sets. The full model achieves near-perfect discrimination (AUC = 0.999), while even the top 10 most tube-discriminative proteins achieve exceptional performance (AUC = 1.000).

Critically, both biomarker panels show substantial tube-type confounding: the original study’s 17-protein panel achieves AUC = 0.860 for predicting tube type, while our concordant proteins (those significant in both Italy and US with concordant direction, n=22) achieve AUC = 0.908. These findings indicate that even proteins that successfully replicate across independent cohorts remain significantly confounded by technical collection factors.

These findings are particularly striking because the protein measurements are more informative about the technical collection method than they are about ALS disease status (original study reported 98% accuracy for disease prediction). This reversed diagnostic priority reveals that tube type effects dominate the variance structure in the proteomics data, overwhelming genuine biological disease signals. When combined with the perfect confounding structure—where 100% of neurological controls are EDTA and 85% of ALS cases are HEPARIN—these results indicate that any disease classifier trained on pooled data will inevitably learn to recognize anticoagulant-associated protein signatures rather than ALS-specific biology.

The reverse prediction test thus provides strong diagnostic evidence for the presence of severe technical confounding across all protein sets, from the full panel to the curated biomarkers. Even if some ALS biomarkers exist within the dataset, their signals are thoroughly entangled with anticoagulant effects that must be addressed through stratified analysis or balanced study designs.

Important: Interpretation

An AUC of 0.999 indicates proteins can near-perfectly distinguish tube type. This provides strong evidence that:

  1. Tube type signal dominates biological signal in variance explained
  2. Proteins are more informative about collection method than disease diagnosis
  3. Pooled biomarker claims are materially affected by technical confounding
  4. The original model’s high performance likely reflects tube type recognition rather than disease biology

3.2.1 Top Proteins Distinguishing Tube Types

Table 2: Top 10 proteins most predictive of tube type

| Protein | Importance (Overall) |
|---------|---------------------:|
| DCTN2   | 100.00 |
| INSL3   | 55.60 |
| CENPF   | 45.39 |
| SPINT3  | 42.99 |
| FGF9    | 42.28 |
| CCND2   | 40.16 |
| AGBL2   | 30.47 |
| TRIM40  | 14.94 |
| SLC1A4  | 11.86 |
| DAND5   | 8.68 |

3.3 Geographic Validation: Leave-Country-Out CV

Show code
library(ggplot2)
library(pROC)
library(patchwork)
library(dplyr)

# Extract ROC curves
roc_italy_to_us <- lcv_results$italy_to_us$roc_curve
roc_us_to_italy <- lcv_results$us_to_italy$roc_curve
roc_italy_within <- within_country_cv_results$italy$roc_obj
roc_us_within <- within_country_cv_results$us$roc_obj
roc_pooled <- pooled_cv_results$roc_obj

# Create labels with AUCs
pooled_label <- sprintf("Pooled (%.3f)", pooled_cv_results$cv_auc)
italy_within_label <- sprintf("Italy (%.3f)", within_country_cv_results$italy$cv_auc)
us_within_label <- sprintf("US (%.3f)", within_country_cv_results$us$cv_auc)
italy_label <- sprintf("Italy→US (%.3f)", lcv_results$italy_to_us$test_auc)
us_label <- sprintf("US→Italy (%.3f)", lcv_results$us_to_italy$test_auc)

# ==============================================================================
# PANEL A: ORIGINAL VALIDATION (WITHIN-COUNTRY + POOLED)
# ==============================================================================
roc_original <- rbind(
  data.frame(
    specificity = 1 - roc_pooled$specificities,
    sensitivity = roc_pooled$sensitivities,
    model = pooled_label
  ),
  data.frame(
    specificity = 1 - roc_italy_within$specificities,
    sensitivity = roc_italy_within$sensitivities,
    model = italy_within_label
  ),
  data.frame(
    specificity = 1 - roc_us_within$specificities,
    sensitivity = roc_us_within$sensitivities,
    model = us_within_label
  )
)

# Define colors for Panel A
colors_panel_a <- c(
  "#FDE724",  # Pooled - Bright yellow (viridis 100%)
  "#1F9E89",  # Italy - Jade/teal (viridis 60%)
  "#B5DE2B"   # US - Yellow-green (viridis 90%)
)
names(colors_panel_a) <- c(pooled_label, italy_within_label, us_within_label)

p_original <- ggplot(roc_original, aes(x = specificity, y = sensitivity, color = model)) +
  geom_line(linewidth = 2.5, alpha = 0.9) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted", color = "#666666", alpha = 0.5, linewidth = 1) +
  scale_color_manual(
    name = "Validation Strategy",
    values = colors_panel_a
  ) +
  labs(
    title = "A. ORIGINAL",
    x = "False Positive Rate",
    y = "True Positive Rate"
  ) +
  theme_dark_scientific(base_size = 14) +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = c(0.65, 0.25)
  ) +
  coord_equal()

# ==============================================================================
# PANEL B: REANALYSIS (CROSS-COUNTRY)
# ==============================================================================
roc_reanalysis <- rbind(
  data.frame(
    specificity = 1 - roc_italy_to_us$specificities,
    sensitivity = roc_italy_to_us$sensitivities,
    model = italy_label
  ),
  data.frame(
    specificity = 1 - roc_us_to_italy$specificities,
    sensitivity = roc_us_to_italy$sensitivities,
    model = us_label
  )
)

# Define colors for Panel B
colors_panel_b <- c(
  "#440154",  # Italy→US - Dark purple (viridis 0%)
  "#31688E"   # US→Italy - Medium blue (viridis 40%)
)
names(colors_panel_b) <- c(italy_label, us_label)

p_reanalysis <- ggplot(roc_reanalysis, aes(x = specificity, y = sensitivity, color = model)) +
  geom_line(linewidth = 2.5, alpha = 0.9) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted", color = "#666666", alpha = 0.5, linewidth = 1) +
  scale_color_manual(
    name = "Cross-Country",
    values = colors_panel_b
  ) +
  labs(
    title = "B. REANALYSIS (CROSS-COUNTRY VALIDATION)",
    x = "False Positive Rate",
    y = "True Positive Rate"
  ) +
  theme_dark_scientific(base_size = 14) +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = c(0.65, 0.25)
  ) +
  coord_equal()

# ==============================================================================
# PANEL C: CONFUSION MATRICES FOR CROSS-COUNTRY
# ==============================================================================

# Italy → US confusion matrix
cm_italy_us <- lcv_results$italy_to_us$confusion_matrix
cm_italy_us_df <- as.data.frame.table(cm_italy_us)
names(cm_italy_us_df) <- c("Predicted", "Actual", "Count")

# Calculate metrics
sens_italy_us <- cm_italy_us["ALS", "ALS"] / sum(cm_italy_us[, "ALS"])
spec_italy_us <- cm_italy_us["Control", "Control"] / sum(cm_italy_us[, "Control"])

p_cm_italy_us <- ggplot(cm_italy_us_df, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile(color = "#444444", linewidth = 1) +
  geom_text(aes(label = Count), size = 6, fontface = "bold", color = "white") +
  scale_fill_gradient(low = "#2c1e3d", high = "#E6550D") +
  labs(
    title = sprintf("Italy→US\nSens: %.2f, Spec: %.2f", sens_italy_us, spec_italy_us),
    x = "True", y = "Predicted"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme_centered_titles() +
  theme_zero_margins() +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = "none",
    panel.grid = element_blank()
  ) +
  coord_equal()

# US → Italy confusion matrix
cm_us_italy <- lcv_results$us_to_italy$confusion_matrix
cm_us_italy_df <- as.data.frame.table(cm_us_italy)
names(cm_us_italy_df) <- c("Predicted", "Actual", "Count")

sens_us_italy <- cm_us_italy["ALS", "ALS"] / sum(cm_us_italy[, "ALS"])
spec_us_italy <- cm_us_italy["Control", "Control"] / sum(cm_us_italy[, "Control"])

p_cm_us_italy <- ggplot(cm_us_italy_df, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile(color = "#444444", linewidth = 1) +
  geom_text(aes(label = Count), size = 6, fontface = "bold", color = "white") +
  scale_fill_gradient(low = "#2c1e3d", high = "#756BB1") +
  labs(
    title = sprintf("US→Italy\nSens: %.2f, Spec: %.2f", sens_us_italy, spec_us_italy),
    x = "True", y = "Predicted"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme_centered_titles() +
  theme_zero_margins() +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = "none",
    panel.grid = element_blank()
  ) +
  coord_equal()
# ==============================================================================
# COMBINE ALL PANELS: A and B on top, C (both confusion matrices) below
# ==============================================================================
((p_original | p_reanalysis) / (p_cm_italy_us | p_cm_us_italy)) +
  plot_annotation(
    title = "Geographic Validation: Original Study vs Reanalysis",
    subtitle = "Cross-country validation reveals severe geographic overfitting not detected by original within-country/pooled approaches",
    theme = theme_dark_scientific(base_size = 12) +
      theme(
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "#FDE724"),
        plot.background = element_rect(fill = "#151520", color = NA),
        plot.margin = unit(c(0,0,0,0), "pt")
      )
  ) &
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA)
  )
Figure 5: ROC curves, confusion matrices, and performance metrics comparing Original (within-country, pooled) vs Reanalysis (cross-country) validation strategies

3.3.1 Performance Summary

Table 3: Model performance: Pooled CV vs Geographic Validation

| Approach | AUC | 95% CI (lower) | 95% CI (upper) |
|----------|----:|---------------:|---------------:|
| Pooled CV (original) | 0.951 | 0.935 | 0.967 |
| Italy → US | 0.813 | 0.737 | 0.889 |
| US → Italy | 0.736 | 0.687 | 0.784 |
| Mean LCV | 0.774 | NA | NA |

3.4 Within-Country Cross-Validation

Show code
library(patchwork)

# Panel 1: Within-country CV results
within_country_data <- tibble::tibble(
  Approach = factor(
    c("Italy-only CV\n(HEPARIN)", "US-only CV\n(EDTA)"),
    levels = c("Italy-only CV\n(HEPARIN)", "US-only CV\n(EDTA)")
  ),
  AUC = c(italy_within_cv$cv_auc, us_within_cv$cv_auc),
  CI_lower = c(italy_within_cv$cv_auc_ci_lower, us_within_cv$cv_auc_ci_lower),
  CI_upper = c(italy_within_cv$cv_auc_ci_upper, us_within_cv$cv_auc_ci_upper)
)

p1 <- ggplot(within_country_data, aes(x = Approach, y = AUC, fill = Approach)) +
  geom_col(width = 0.65, color = "#0B0B0F") +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.15, color = "#C8C8D0", linewidth = 0.5) +
  geom_point(size = 4, color = "white", shape = 21, fill = "#FDE724", stroke = 1.5) +
  geom_text(aes(label = sprintf("%.3f", AUC)), vjust = -1.2, color = "white", size = 5.5, fontface = "bold") +
  scale_y_continuous(limits = c(0.5, 1.0), breaks = seq(0.5, 1.0, 0.05)) +
  scale_fill_manual(
    values = c(
      "Italy-only CV\n(HEPARIN)" = "#5DC863",
      "US-only CV\n(EDTA)" = "#31688E"
    )
  ) +
  labs(
    title = "A. Within-Country CV Performance",
    subtitle = "Models trained and tested within single cohorts",
    x = NULL,
    y = "AUC (95% CI)"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme_zero_margins() +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = "none",
    axis.text.x = element_text(size = 11, face = "bold"),
    axis.text.y = element_text(size = 11),
    axis.title.y = element_text(size = 12, face = "bold")
  )

# Panel 2: Pooled and LCV comparison
pooled_lcv_data <- tibble::tibble(
  Approach = factor(
    c("Pooled CV", "Italy → US", "US → Italy"),
    levels = c("Pooled CV", "Italy → US", "US → Italy")
  ),
  AUC = c(
    pooled_cv_results$cv_auc,
    lcv_results$italy_to_us$test_auc,
    lcv_results$us_to_italy$test_auc
  ),
  CI_lower = c(
    pooled_cv_results$cv_auc_ci_lower,
    lcv_results$italy_to_us$test_auc_ci_lower,
    lcv_results$us_to_italy$test_auc_ci_lower
  ),
  CI_upper = c(
    pooled_cv_results$cv_auc_ci_upper,
    lcv_results$italy_to_us$test_auc_ci_upper,
    lcv_results$us_to_italy$test_auc_ci_upper
  ),
  Type = c("Pooled", "LCV", "LCV")
)

p2 <- ggplot(pooled_lcv_data, aes(x = Approach, y = AUC, fill = Type)) +
  geom_col(width = 0.65, color = "#0B0B0F") +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.15, color = "#C8C8D0", linewidth = 0.5) +
  geom_point(size = 4, color = "white", shape = 21, fill = "#FDE724", stroke = 1.5) +
  geom_text(aes(label = sprintf("%.3f", AUC)), vjust = -1.2, color = "white", size = 5.5, fontface = "bold") +
  scale_y_continuous(limits = c(0.5, 1.0), breaks = seq(0.5, 1.0, 0.05)) +
  scale_fill_manual(
    values = c(
      "Pooled" = "#FDE725",
      "LCV" = "#C2408C"
    )
  ) +
  labs(
    title = "B. Pooled vs Leave-Country-Out CV",
    subtitle = "Geographic validation reveals confounding",
    x = NULL,
    y = "AUC (95% CI)"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme_zero_margins() +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = "none",
    axis.text.x = element_text(size = 11, face = "bold"),
    axis.text.y = element_text(size = 11),
    axis.title.y = element_text(size = 12, face = "bold")
  )

# Combine panels side by side
(p1 + p2) + plot_layout() & theme_zero_margins()
Figure 6: Model performance comparison: Pooled CV, Within-country CV, and Leave-country-out (LCV)

Within-country model performance:

  • Italy-only (HEPARIN, n = 403; ALS 223, Controls 180): AUC 0.961 (95% CI 0.944–0.978)
  • US-only (EDTA, n = 301; ALS 40, Controls 261): AUC 0.922 (95% CI 0.858–0.957)
Note: Interpretation
  • Pooled CV remains the optimistic upper bound because confounding structure is shared across folds.
  • Italy-only and US-only CV reflect performance in homogeneous cohorts without cross-matrix generalization demands. Both achieve reasonable performance within their respective anticoagulant contexts.
  • The large gap between within-country CV and leave-country-out (Italy → US and US → Italy) demonstrates that performance degrades substantially when the anticoagulant/geography context changes. This pattern is consistent with models learning tube/context-specific signatures that do not transfer across countries, not generalizable ALS biology.
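Why pooled CV cannot detect this kind of confounding, while leave-site-out validation can, is easy to demonstrate with a simulation (a sketch with made-up numbers, not our pipeline): a feature carrying only a site shift, with case mix tracking site the way tube type does here, looks diagnostic under pooled evaluation but is uninformative within a held-out site.

```r
set.seed(2)
n <- 300
site <- rep(c("IT", "US"), each = n)
als  <- c(rbinom(n, 1, 0.85), rbinom(n, 1, 0.15))  # case mix tracks site, like tube type
x    <- rnorm(2 * n) + 2 * (site == "IT")          # pure site shift, zero ALS biology

# Rank-based AUC: P(score_case > score_control)
auc <- function(score, y) mean(outer(score[y == 1], score[y == 0], ">"))

auc(x, als)                              # pooled: well above 0.5 despite no biology
auc(x[site == "US"], als[site == "US"])  # within a single site: ~0.5 (chance)
```

The pooled AUC is inflated purely because cases are concentrated at the shifted site; holding a site out removes the shortcut and exposes the absence of transferable signal.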

3.5 Differential Analysis: Stratified vs Pooled

Show code
# Plot protein counts inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)

counts_data <- data.frame(
  analysis = c(
    "Italy\n(HEPARIN)",
    "US\n(EDTA)",
    "Pooled\n(Confounded)",
    "Both Strata\n(Robust)"
  ),
  count = c(
    length(stratified_vs_pooled_comparison$sig_italy),
    length(stratified_vs_pooled_comparison$sig_us),
    length(stratified_vs_pooled_comparison$sig_pooled),
    length(stratified_vs_pooled_comparison$sig_both_strata)
  ),
  type = c("Stratified", "Stratified", "Pooled", "Robust")
)

counts_data$analysis <- factor(counts_data$analysis, levels = counts_data$analysis)

ggplot(counts_data, aes(x = analysis, y = count, fill = type)) +
  geom_col(width = 0.7, alpha = 0.9) +
  geom_text(aes(label = count), vjust = -0.5, size = 5, fontface = "bold", color = "#EEEEEE") +
  scale_fill_manual(values = c(
    "Stratified" = "#31688E", # viridis cyan
    "Pooled" = "#FDE724", # viridis yellow
    "Robust" = "#6CCE59" # viridis lime
  )) +
  labs(
    title = "Significant Proteins by Analysis Type",
    subtitle = sprintf(
      "Only %d/%d pooled proteins (%.1f%%) replicate in both strata - major reproducibility issue",
      length(stratified_vs_pooled_comparison$overlap_both),
      length(stratified_vs_pooled_comparison$sig_pooled),
      stratified_vs_pooled_comparison$pct_replicate_in_both
    ),
    x = NULL,
    y = "Number of Significant Proteins\n(FDR < 0.05, |logFC| > 0.5)",
    fill = "Type"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank()
  )
Figure 7: Significant protein counts by analysis approach

3.5.1 Summary Statistics

Table 4: Significant proteins by analysis approach (FDR < 0.05, |logFC| > 0.5)

| Analysis | N Significant | % of Pooled |
|----------|--------------:|------------:|
| Italy (HEPARIN) | 37 | 47.4 |
| US (EDTA) | 379 | 485.9 |
| Pooled (confounded) | 78 | 100.0 |
| Both Strata | 23 | 26.9 |
Important: Critical Finding

Only 26.9% of pooled proteins replicate in BOTH strata

  • 51 proteins (65.4%) significant in pooled but NOT in either stratum
  • These are likely confounding artifacts driven by country/tube effects
  • Most pooled “significant” proteins are likely false positives attributable to confounding
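The replication bookkeeping behind these counts reduces to simple set operations. A sketch on toy per-protein result tables (column names mirror limma's `Assay`, `logFC`, `adj.P.Val`; the data are random, so the particular counts are meaningless):

```r
set.seed(3)
# Toy per-stratum/pooled result tables with the columns used by the cutoffs
make_tt <- function(k) data.frame(
  Assay     = paste0("P", seq_len(k)),
  logFC     = rnorm(k),
  adj.P.Val = runif(k)
)
tt_italy <- make_tt(500); tt_us <- make_tt(500); tt_pooled <- make_tt(500)

# Significance rule from the analysis: FDR < 0.05 AND |logFC| > 0.5
sig <- function(tt) tt$Assay[tt$adj.P.Val < 0.05 & abs(tt$logFC) > 0.5]

# Replication requires significance in BOTH strata, not just the pooled model
replicated <- intersect(sig(tt_pooled), intersect(sig(tt_italy), sig(tt_us)))
pct <- 100 * length(replicated) / length(sig(tt_pooled))
pct  # percent of pooled hits that replicate in both strata
```

Applying this rule to the real per-stratum tables is what produces the replication percentages reported above.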

3.5.2 Protein Overlap Visualization

Show code
# Plot Venn diagram inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)
library(ggvenn)

# Prepare data for ggvenn
venn_data <- list(
  Italy = stratified_vs_pooled_comparison$sig_italy,
  US = stratified_vs_pooled_comparison$sig_us,
  Pooled = stratified_vs_pooled_comparison$sig_pooled
)

ggvenn(
  venn_data,
  fill_color = c("#1F9E89", "#B5DE2B", "#FDE724"), # viridis jade, yellow-green, yellow
  fill_alpha = 0.5,
  stroke_size = 1,
  set_name_size = 5,
  set_name_color = "white",
  text_size = 4
) +
  labs(
    title = "Protein Overlap: Stratified vs Pooled Analysis",
    subtitle = sprintf(
      "⚠️ CRITICAL: Only %d proteins (%.1f%%) significant in BOTH Italy and US",
      length(stratified_vs_pooled_comparison$sig_both_strata),
      stratified_vs_pooled_comparison$pct_replicate_in_both
    )
  ) +
  theme_dark_scientific(base_size = 12) +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.background = element_rect(fill = "#151520", colour = NA),
    legend.key = element_rect(fill = "#151520", colour = NA),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(color = "#FDE724", face = "bold", hjust = 0.5)
  )
Figure 8: Venn diagram showing protein overlap between strata

3.6 Geographically Consistent Biomarker Candidates

3.6.1 Meta-Analysis Results

Proteins significant in BOTH Italy and US with concordant direction: 22

These are the ONLY proteins that show consistent ALS signals across both tube types and geographic cohorts.

Table 5: Top 10 geographically consistent biomarker candidates

| Assay | logFC_italy | logFC_us | mean_logFC | adj_P_italy | adj_P_us | combined_p |
|-------|------------:|---------:|-----------:|------------:|---------:|-----------:|
| NEFL | 2.503 | 2.257 | 2.380 | 0 | 0 | 0 |
| ALDH3A1 | 2.311 | 2.323 | 2.317 | 0 | 0 | 0 |
| MEGF10 | 0.877 | 1.054 | 0.966 | 0 | 0 | 0 |
| CORO6 | 1.157 | 1.099 | 1.128 | 0 | 0 | 0 |
| HS6ST2 | 0.777 | 0.901 | 0.839 | 0 | 0 | 0 |
| CSRP3 | 2.143 | 1.914 | 2.028 | 0 | 0 | 0 |
| MYLPF | 1.555 | 1.585 | 1.570 | 0 | 0 | 0 |
| MYBPC1 | 1.385 | 1.469 | 1.427 | 0 | 0 | 0 |
| EDA2R | 0.871 | 0.757 | 0.814 | 0 | 0 | 0 |
| CA3 | 1.056 | 0.739 | 0.898 | 0 | 0 | 0 |
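The `combined_p` column pools the two cohort p-values. One standard combiner (our assumption for illustration — the original pipeline may use a different method) is Fisher's method, where −2·Σ log pᵢ follows a χ² distribution with 2k degrees of freedom under the null:

```r
# Fisher's method for combining k independent p-values
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

fisher_combine(c(0.04, 0.03))  # two concordant cohorts reinforce each other
fisher_combine(c(0.04, 0.60))  # a non-replicating cohort dilutes the evidence
```

Note that Fisher's method rewards concordant evidence but does not, by itself, guard against a shared confounder driving both cohorts' p-values.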

3.6.2 Effect Size Correlation

To assess whether effect sizes are consistent across cohorts, we calculate correlations for two protein sets:

  1. All proteins on Olink panel (n = 2868)
  2. Proteins significant in pooled analysis (n = 78)
Figure 9: Effect size correlations between Italy and US cohorts

Key Findings:

  • All proteins: Pearson r = 0.030, Spearman ρ = -0.115 — Near-zero Pearson correlation with negative Spearman rank correlation across entire panel
  • Pooled-significant proteins: Pearson r = 0.788, Spearman ρ = 0.683 — Moderate-to-strong correlation among selected proteins

The stark difference between these correlations reveals an important pattern: proteins that reach significance in the pooled analysis tend to show more consistent effect sizes across cohorts, while the majority of proteins on the panel show essentially no correlation.

Notably, the negative Spearman correlation (ρ = -0.115) for all proteins indicates that even the rank ordering of effects tends to be inverted between cohorts—proteins with larger effects in Italy tend to have smaller effects in the US, and vice versa. This rank inversion, while modest in magnitude, suggests systematic differences in how proteins respond across the two cohorts that go beyond simple scaling differences.
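The divergence between the full-panel correlation and the pooled-significant correlation is exactly what selection predicts. A small simulation (illustrative values only, not our pipeline) shows concordance concentrating in whatever subset a pooled criterion selects:

```r
set.seed(4)
p <- 2000
true  <- c(rnorm(100), rep(0, p - 100))   # 100 proteins with real shared effects
italy <- true + rnorm(p, sd = 0.5)        # per-cohort logFC = truth + noise
us    <- true + rnorm(p, sd = 0.5)
hits  <- abs((italy + us) / 2) > 0.8      # crude stand-in for pooled selection

cor(italy, us)                 # modest across the full panel
cor(italy[hits], us[hits])     # much stronger among the selected subset
```

Selecting on the averaged signal preferentially retains proteins whose effects are shared, so high concordance among pooled-significant proteins is partly a consequence of how they were chosen.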

Important: Interpretation and ML Implications

What this means:

The 78 proteins significant in pooled analysis do show reasonable concordance between Italy and US (Pearson r ≈ 0.79, Spearman ρ ≈ 0.68), suggesting they may capture some reproducible ALS biology. However:

  1. Rank inversion in broader panel: Across all proteins, Spearman ρ = -0.115 indicates systematic rank-order differences between cohorts
  2. Effect concordance ≠ absence of confounding: Proteins can show consistent disease associations while still being systematically affected by tube type
  3. 72% of pooled proteins fail stratified replication: Only 22/78 (28%) are significant in BOTH cohorts
  4. Geographic validation issues persist: ML models trained on pooled data show reduced performance in leave-country-out CV (an 18.6% relative drop: AUC 0.951 → 0.774)
  5. Reverse prediction remains problematic: Proteins discriminate tube types with very high accuracy (AUC = 0.999), indicating pervasive technical effects
  6. Residual confounding in “consistent” proteins: Even the 22 geographically consistent proteins predict tube type with AUC = 0.908

Critical point for ML models:

The moderate-to-strong correlation (Pearson r ≈ 0.79) among pooled-significant proteins does NOT eliminate concerns about confounding bias in machine learning models. The models are trained on absolute NPX values—which contain both biological signal AND tube-type effects—not on effect sizes. Geographic validation demonstrates that models show substantially reduced generalization (Italy→US: AUC = 0.81; US→Italy: AUC = 0.74; mean = 0.77) compared to pooled cross-validation (AUC = 0.95), indicating that the training data contains systematic cohort-specific patterns.

Interpretation: While some proteins show reproducible disease associations, the ML framework faces substantial challenges due to the Italy-HEPARIN/US-EDTA confound structure.

3.6.3 Biological Interpretation

NEFL (Neurofilament Light Chain) is a well-established neuroaxonal injury biomarker and likely represents a genuine biological signal in ALS. However, our analyses demonstrate that its effect size in this dataset is substantially contaminated by the Italy-HEPARIN/US-EDTA confound. Among the 22 “geographically consistent” proteins significant in both strata, nearly half (10/22) retain significant residual tube effects in disease-free individuals (Section 3.7). Importantly:

  • Only 22 proteins survive rigorous geographic validation (out of 78 pooled)
  • 72% of pooled proteins fail to replicate across tube types
  • Effect sizes across the full panel are essentially uncorrelated between cohorts (r = 0.03)
  • Even geographically consistent candidates retain substantial tube signal (see Residual Confounding section)

Verdict: While NEFL and a small subset of proteins remain clinically relevant, their magnitudes in the pooled analysis are inflated and their relative rankings among biomarkers are distorted by non-biological factors. Most of the published protein list requires prospective validation in tube-balanced studies.

3.6.4 Visualizing Country/Tube Effects in Geographically Consistent Proteins

To demonstrate that even “geographically consistent” proteins exhibit systematic country/tube effects, we examine the distribution of NPX values for ALL geographically consistent proteins, comparing ALS vs. Healthy controls across Italy (HEPARIN) and US (EDTA) cohorts.

Figure 10: All ‘Geographically Consistent’ Proteins Show Strong Country/Tube Effects
Important: Key Observation

The visualization combines geographically consistent proteins with ALL 17 original biomarkers from the authors’ Figure 4a to demonstrate that systematic tube type differences exist across both sets. Even within healthy controls, we observe substantial EDTA vs. HEPARIN differences (shown as effect sizes and p-values). This pattern demonstrates that:

  1. Country/tube effects are present in disease-free individuals, proving these differences are technical artifacts, not ALS biology
  2. ALS vs. Healthy separation varies by country/tube context, indicating entanglement of biological and technical signals
  3. Direct NPX comparisons between cohorts are confounded, even for proteins that show consistent ALS associations within each cohort
  4. The original biomarkers (★) are not exempt from these confounding effects, demonstrating that the reported findings cannot be generalized across collection protocols

This visual evidence reinforces the quantitative findings from our three convergent tests (reverse prediction, healthy control analysis, effect decomposition) that residual confounding persists in all candidate biomarkers.

3.7 Residual Confounding in “Geographically Consistent” Proteins

Critical Question: Are the 22 “geographically consistent” proteins (significant in both Italy and US with concordant direction) truly independent of tube type effects, or do they still retain substantial technical confounding?

Just because a protein is significant in BOTH cohorts doesn’t prove it’s free from tube type artifacts. It could be: (1) true biomarker + NO tube effect (ideal), (2) true biomarker + tube effect (entangled), or (3) pure tube artifact spuriously associated with disease. With perfect confounding, we cannot distinguish cases #2 and #3, but we CAN quantify the magnitude of residual tube type signal using two complementary approaches below, which together with the reverse prediction test (Section 3.2) provide converging evidence.

3.7.1 Test 1: Tube Effects in Healthy Controls (Disease-Free Test)

As demonstrated in Figure 10, even “geographically consistent” proteins show systematic differences between tube types within disease-free individuals. Here we quantify these effects statistically.

Show code
# Plot healthy tube effects using limma-based results (adjusted for age & sex)
library(ggplot2)
library(dplyr)

# Prepare plot data from limma results
# Filter to only geographically consistent proteins (same as old analysis)
plot_data <- tube_effects_healthy %>%
  dplyr::filter(Assay %in% tube_robust_top15_assays) %>%
  dplyr::arrange(adj.P.Val) %>%
  dplyr::mutate(
    protein = factor(Assay, levels = rev(Assay)),
    significant = adj.P.Val < 0.05,
    # Use logFC as effect size (already in NPX units since NPX is log2)
    # limma reports t = logFC / SE, so SE = logFC / t; approximate 95% CI
    ci_lower = logFC - 1.96 * logFC / t,
    ci_upper = logFC + 1.96 * logFC / t
  )

n_sig <- sum(plot_data$significant)
n_total <- nrow(plot_data)

ggplot(plot_data, aes(x = logFC, y = protein, color = significant)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "#888888") +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = ci_lower, xmax = ci_upper), height = 0.3) +
  scale_color_manual(values = c("TRUE" = "#FDE724", "FALSE" = "#888888")) +
  labs(
    title = "Tube Type Effects in Healthy Controls (Disease-Free)",
    subtitle = sprintf(
      "⚠️ %d/%d geographically consistent proteins differ by tube type WITHOUT disease present\nLimma analysis adjusted for age & sex",
      n_sig, n_total
    ),
    x = "Effect Size in NPX (US/EDTA - Italy/HEPARIN, logFC)",
    y = NULL,
    color = "Significant\n(FDR < 0.05)"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 11: Tube type effects in healthy controls only

Rationale: If a protein differs between Italy/HEPARIN and US/EDTA in disease-free individuals, that difference MUST be a technical artifact (or population genetics), NOT ALS biology.

Result: 10 out of 22 geographically consistent proteins (45.5%) show significant tube effects in healthy controls (limma, FDR < 0.05)
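Per protein, the disease-free test reduces to a covariate-adjusted model for tube type fit in controls only. A base R sketch on simulated controls (the limma version moderates the variance estimates, but the logic is the same; every number here is made up):

```r
set.seed(5)
# Simulated healthy controls: 60 per tube type, with age/sex covariates
n    <- 120
tube <- factor(rep(c("EDTA", "HEPARIN"), each = n / 2))
age  <- rnorm(n, 60, 10)
sex  <- rbinom(n, 1, 0.5)
# Three proteins: only the first carries a genuine 1-NPX tube shift
npx  <- cbind(rnorm(n) + (tube == "HEPARIN"), rnorm(n), rnorm(n))

# Per-protein tube-type p-value, adjusted for age and sex, then BH-FDR
pvals <- apply(npx, 2, function(y)
  summary(lm(y ~ tube + age + sex))$coefficients["tubeHEPARIN", 4])
p.adjust(pvals, method = "BH")  # the shifted protein should flag at FDR < 0.05
```

Because no disease is present in this subset, any protein that flags here is, by construction, responding to the collection matrix (or population structure), not to ALS.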

Distribution of Tube Effect Sizes Across All Proteins

To understand the full scope of tube type effects in healthy controls, we examine the distribution of effect sizes across all proteins in the Olink panel. We present two views: the full distribution and a zoomed view focusing on the central range.

Show code
# Load visualization from targets (full range)
tar_load(viz_tube_effect_distribution_full)
viz_tube_effect_distribution_full
Figure 12: Distribution of tube type effect sizes in healthy controls across all proteins (full range). Cyan dashed line: theoretical null distribution assuming no systematic tube effect.

The full distribution (Figure 12) shows the complete range of tube effects, including extreme outliers extending beyond 1000% change. The theoretical null distribution (cyan dashed line, centered at 0) represents the expected pattern if tube type had no systematic effect—the dramatic deviation from this null demonstrates widespread technical artifacts across the proteome.

Show code
# Load visualization from targets (zoomed to -1000 to 1000)
tar_load(viz_tube_effect_distribution_zoomed)
viz_tube_effect_distribution_zoomed
Figure 13: Distribution of tube type effect sizes in healthy controls (zoomed to -1000% to +1000% range). Cyan dashed line: theoretical null distribution. The observed distribution is substantially wider and more bimodal than expected under the null hypothesis of no tube effect.

The zoomed distribution (Figure 13) focuses on the -1000% to +1000% range, allowing better visualization of the central distribution pattern. Comparison to the theoretical null (cyan dashed curve) clearly shows that the observed effects are far more extreme and variable than expected from measurement noise alone, with a bimodal structure suggesting systematic directional effects. Outliers beyond this range are indicated with arrows.

Key Findings:

  • 979 out of 2868 proteins (34.1%) show significant tube effects in healthy controls (FDR < 0.05)
  • Median absolute percent change: 13.7% across all proteins
  • Median absolute percent change (significant only): 32.5%
  • 54 proteins show >100% change
  • 262 proteins show >50% change
  • 2 proteins show extreme changes >1000%

Interpretation: This distribution reveals that tube type differences are widespread and substantial across the entire proteome, not limited to a few outlier proteins. The fact that over a third of all proteins show significant effects in disease-free individuals demonstrates that technical artifacts dominate the variance structure of this dataset. The presence of extreme outliers (>1000% change) indicates that some proteins are profoundly affected by anticoagulant choice, making them completely unreliable for biomarker discovery without proper experimental control.
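Since NPX is a log2 scale, the percent changes quoted above follow from the tube-effect logFC values as (2^logFC − 1) × 100 (a small helper of our own, not from the pipeline):

```r
# Convert a log2-scale (NPX) fold change to percent change
npx_logfc_to_pct <- function(logFC) (2^logFC - 1) * 100

npx_logfc_to_pct(0.960)     # NEFL's tube shift in controls: roughly +95%
npx_logfc_to_pct(1)         # one full NPX unit is a doubling: +100%
npx_logfc_to_pct(log2(11))  # a >1000% change requires logFC above ~3.46
```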

Table 6: Top geographically consistent proteins with tube effects in disease-free individuals (limma-adjusted for age & sex)

| Protein | logFC (NPX) | Mean NPX | t-statistic | FDR |
|---------|------------:|---------:|------------:|----:|
| NEFL | 0.960 | -1.060 | 8.329 | 0.000 |
| ALDH3A1 | 0.923 | -0.598 | 7.464 | 0.000 |
| DTNB | 0.648 | -0.389 | 5.285 | 0.000 |
| EDA2R | 0.528 | -0.266 | 4.212 | 0.000 |
| HS6ST2 | 0.405 | -0.412 | 3.905 | 0.001 |
Important: Critical Finding

Many geographically consistent proteins differ by tube type even WITHOUT disease present!

Using limma analysis (adjusted for age & sex):

  • Multiple proteins show significant tube effects in disease-free individuals (FDR < 0.05)
  • These differences occur in the absence of ALS, proving they reflect technical artifacts
  • P-values are now properly adjusted for covariates using empirical Bayes
  • 10 of 22 geographically consistent proteins remain significant after FDR correction

This definitively demonstrates that even “geographically consistent” candidates are contaminated by technical confounding.

3.7.2 Test 2: Effect Size Decomposition (Biology vs Artifacts)

Show code
# Plot effect decomposition inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)
library(dplyr)
library(ggrepel)

# Prepare data
plot_data <- effect_decomposition %>%
  dplyr::mutate(
    category = dplyr::case_when(
      ratio_tube_to_disease > 1.0 ~ "Tube > Disease",
      ratio_tube_to_disease >= 0.5 ~ "Tube (ES >= 50% Disease)", # matches n_severe threshold below
      ratio_tube_to_disease >= 0.2 ~ "Tube (ES 20-50% Disease)",
      TRUE ~ "Tube (ES < 20% Disease)"
    ),
    category = factor(category, levels = c(
      "Tube > Disease",
      "Tube (ES >= 50% Disease)",
      "Tube (ES 20-50% Disease)",
      "Tube (ES < 20% Disease)"
    ))
  )

# Count by category
category_counts <- plot_data %>%
  dplyr::count(category)

n_severe <- sum(plot_data$ratio_tube_to_disease >= 0.5, na.rm = TRUE)
pct_severe <- 100 * n_severe / nrow(plot_data)

# Label ALL proteins (only 22 total, so manageable)
proteins_to_label <- plot_data

ggplot(plot_data, aes(x = abs_disease_effect, y = abs_tube_effect, color = category)) +
  # Diagonal line (tube = disease)
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#FDE724", linewidth = 1) +
  # Reference line (tube = 50% of disease)
  geom_abline(intercept = 0, slope = 0.5, linetype = "dotted", color = "#6CCE59", linewidth = 0.8) +
  # Points
  geom_point(size = 3, alpha = 0.7) +
  # Label ALL proteins - set max.overlaps to Inf to ensure all labels are shown
  ggrepel::geom_text_repel(
    data = proteins_to_label,
    aes(label = protein),
    size = 8.5,
    color = "#EEEEEE",
    box.padding = 0.35,
    point.padding = 0.3,
    max.overlaps = Inf,
    min.segment.length = 0,
    force = 2,
    force_pull = 1
  ) +
  scale_color_manual(values = c(
    "Tube > Disease" = "#FDE724", # viridis yellow
    "Tube (ES >= 50% Disease)" = "#B5DE2B", # viridis yellow-green
    "Tube (ES 20-50% Disease)" = "#6CCE59", # viridis lime
    "Tube (ES < 20% Disease)" = "#1F9E89" # viridis jade
  )) +
  annotate("text",
    x = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.7,
    y = max(plot_data$abs_tube_effect, na.rm = TRUE) * 0.9,
    label = sprintf("y = x\n(Tube = Disease)"),
    color = "#FDE724", size = 3, fontface = "italic"
  ) +
  annotate("text",
    x = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.6,
    y = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.3,
    label = sprintf("y = 0.5x\n(Tube = 50%% Disease)"),
    color = "#6CCE59", size = 3, fontface = "italic"
  ) +
  labs(
    title = "Effect Size Decomposition: Tube Type vs Disease Effects",
    subtitle = sprintf(
      "⚠️ SEVERE: %d/%d proteins (%.1f%%) have tube effects >=50%% of disease effects",
      n_severe, nrow(plot_data), pct_severe
    ),
    x = "Absolute Disease Effect (|ALS - Control|)",
    y = "Absolute Tube Effect (|US - Italy|)",
    color = "Entanglement Level"
  ) +
  theme_dark_scientific(base_size = 16) +
  theme(
    legend.position = "bottom",
    axis.title = element_text(size = 14),
    plot.subtitle = element_text(size = 12)
  )
Figure 14: Tube type effects vs disease effects for each protein

Question: Is the tube type effect comparable to or larger than the disease effect?

For each geographically consistent protein, we calculated:

  • Disease effect: |ALS - Control| (adjusted for country, age, sex)
  • Tube effect: |US/EDTA - Italy/HEPARIN| (adjusted for diagnosis, age, sex)
  • Ratio: |tube effect| / |disease effect|
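This decomposition can be sketched as a single adjusted regression per protein, estimating the disease and tube coefficients jointly and comparing their magnitudes. A minimal Python illustration on synthetic data; the covariate set and names are assumptions, not the original R pipeline's objects.

```python
# Sketch of the effect decomposition for one protein: fit a linear model
# with diagnosis, tube/country, age, and sex, then compare |tube coef|
# to |disease coef|. Synthetic data with known true effects.
import numpy as np

rng = np.random.default_rng(1)
n = 300
als = rng.integers(0, 2, n)               # 1 = ALS, 0 = control
edta = rng.integers(0, 2, n)              # 1 = US/EDTA, 0 = Italy/HEPARIN
age = rng.normal(60, 10, n)
sex = rng.integers(0, 2, n)
# True disease effect 0.6 NPX, true tube effect 0.5 NPX
npx = 0.6 * als + 0.5 * edta + 0.01 * age + rng.normal(0, 1, n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), als, edta, age, sex])
beta, *_ = np.linalg.lstsq(X, npx, rcond=None)

disease_effect = abs(beta[1])
tube_effect = abs(beta[2])
ratio = tube_effect / disease_effect
print(f"disease={disease_effect:.2f} tube={tube_effect:.2f} ratio={ratio:.2f}")
```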

Results:

Table 7: Distribution of tube-to-disease effect ratios

Effect Category          Count   Percentage
negligible_tube_effect   10      45.5%
moderate_tube_effect     9       40.9%
severe_tube_effect       2       9.1%
tube_dominates_disease   1       4.5%

Median ratio: 0.221 (the typical tube effect is ~22% as large as the corresponding disease effect)

Warning: Severe Entanglement Identified

3 proteins (13.6%) have tube effects ≥50% of disease effects:

Top proteins with highest tube/disease ratios:

Protein   Disease Effect   Tube Effect   Ratio
DTNB      0.563            0.617         1.096
HS6ST2    0.814            0.498         0.612
EDA2R     0.825            0.466         0.566
HSPB6     0.680            0.309         0.455
NEFL      2.461            1.083         0.440

DTNB has a tube effect (0.62 NPX) LARGER than its disease effect (0.56 NPX) — ratio = 1.10. This suggests it may be primarily a tube artifact spuriously associated with ALS in this confounded dataset.

3.7.3 Converging Evidence Summary

Three independent tests converge on the same conclusion:

  1. Reverse prediction (AUC = 0.999, Section 3.2): The full protein panel nearly perfectly predicts tube type
  2. Healthy controls (10 of 22 significant): Nearly half of geographically consistent proteins show tube effects without disease present
  3. Effect decomposition (median ratio = 0.22): Tube effects are typically ~22% as large as disease effects among geographically consistent proteins, and reach ≥50% for three of them

Verdict: Even the most conservative “geographically consistent” biomarker candidates remain substantially entangled with technical artifacts. While some biological signal exists (especially for NEFL), perfect confounding prevents definitive separation of true ALS biology from tube type effects. These proteins should be treated as prioritized candidates for validation in tube-balanced studies, NOT as validated biomarkers ready for clinical deployment.


4 Discussion

4.1 Quantifying Confounding: What the Data Reveal

Methodological Assessment: Systematic Confounding Detected

Three complementary analytical approaches converge on consistent evidence of systematic confounding between technical factors and biological signals:

  1. Reverse Prediction Test (AUC=0.999): Proteins discriminate tube type nearly perfectly, demonstrating that technical variance dominates the dataset structure
  2. Geographic Validation (18.6% AUC gap): Cross-site performance differs substantially from within-site estimates, with particularly poor sensitivity (1.8%) when training on US/EDTA and testing on Italy/HEPARIN
  3. Stratified Analysis (26.9% replication rate): The majority of proteins significant in pooled analysis do not replicate across both geographic cohorts

What these patterns teach us: This case study demonstrates how standard methodological practices (pooled multi-site data collection, stratified cross-validation, differential expression analysis) can produce apparently strong results that do not generalize geographically. The confounding arose naturally from site-specific protocols and patient populations—exactly the scenario multi-site studies encounter in practice.

Constructive findings: The rigorous data collection and proteomics quality control enable us to identify 22 geographically consistent protein candidates. NEFL (Neurofilament Light Chain), a well-established ALS biomarker, shows robust elevation across cohorts. These findings provide a prioritized starting panel for future prospective studies with balanced tube type designs.

Implications for interpretation: The reported 98% accuracy reflects within-study performance that includes technical variance; realistic cross-site generalization yields ~77% AUC. Tube type should be treated as a batch effect requiring adjustment, not a predictive feature. Neurological control comparisons require tube type stratification. Future validation studies should prioritize the 22 geographically consistent candidates and employ tube-balanced designs from the outset.
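Treating tube type as a batch effect rather than a feature can be sketched as a per-batch location adjustment, a crude stand-in for ComBat-style correction. Note the caveat this report emphasizes: when diagnosis and tube are perfectly confounded, removing the batch mean removes biological signal too, which is exactly the non-identifiability problem. Illustrative Python on synthetic data:

```python
# Minimal sketch of per-batch mean-centering as a location-only batch
# adjustment (real pipelines should preserve biological covariates, e.g.
# via ComBat with a model matrix). Synthetic data with a simulated shift.
import numpy as np

rng = np.random.default_rng(2)
npx = rng.normal(0, 1, size=(100, 5))
batch = np.array(["EDTA"] * 50 + ["HEPARIN"] * 50)
npx[batch == "EDTA"] += 1.0        # simulated tube shift

adjusted = npx.copy()
for b in np.unique(batch):
    mask = batch == b
    adjusted[mask] -= adjusted[mask].mean(axis=0)   # remove per-batch mean

gap_before = npx[batch == "EDTA"].mean() - npx[batch == "HEPARIN"].mean()
gap_after = adjusted[batch == "EDTA"].mean() - adjusted[batch == "HEPARIN"].mean()
print(f"batch gap before={gap_before:.2f} after={gap_after:.2f}")
```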

4.2 Limitations of This Investigation

Our Analysis Also Has Constraints:

  1. Perfect confounding is irreducible - Cannot fully separate country/tube/batch effects
  2. Limited power in US ALS samples (n=40)
  3. No external validation data - Cannot test on truly independent cohort
  4. Observational study - Cannot establish causation

Even our investigation cannot definitively prove whether specific proteins are true biomarkers or artifacts; the confounding is too severe.

4.3 Beyond Tube Type: Healthcare System Confounding

Missed Opportunity—Deeper Sociological Implications:

The Italy/US divide represents more than a technical artifact; it reflects fundamental differences in healthcare delivery that create systematic sampling biases extending beyond tube type or laboratory protocols. Italian healthcare operates under a universal, population-based system with standardized diagnostic pathways and relatively equitable access to neuromuscular disease centers. In contrast, the US healthcare system is fragmented and access-stratified, where ALS diagnosis and specialized care concentrate among patients with higher socioeconomic status, comprehensive insurance coverage, and geographic proximity to academic medical centers.

This structural difference fundamentally changes case-mix characteristics:

  • Italian cohort: More likely to represent population-based sampling across socioeconomic strata, capturing the full disease spectrum from early to late stages, with consistent diagnostic criteria applied through centralized neuromuscular networks
  • US cohort: Enriched for patients who can afford specialized ALS center access, potentially skewing toward earlier-stage disease at diagnosis (better resourced patients seek diagnosis sooner), higher education levels (health literacy drives referral patterns), and specific insurance types (Medicare/private vs. uninsured/underinsured)

These sampling frame differences create unmeasured confounding that may manifest in protein expression profiles through multiple pathways: (1) disease stage heterogeneity if US patients present earlier with milder disease severity; (2) comorbidity patterns reflecting socioeconomic determinants of health (cardiovascular disease, metabolic syndrome, nutritional status); (3) medication use varying by insurance formularies and prescribing practices; (4) environmental exposures correlated with geography and socioeconomic status (occupational toxins, air pollution, dietary patterns).

The study design provides no adjustment for these factors because they are unmeasured and perfectly confounded with site. Standard approaches like propensity score matching or covariate adjustment fail when confounding is structural and complete. The Italy-HEPARIN vs. US-EDTA confounding thus represents a triple entanglement: (1) anticoagulant chemistry, (2) batch/processing protocols, and (3) healthcare system-mediated patient selection. Even if tube type were harmonized, residual confounding from differential case-mix would remain.

Why This Matters for Generalizability:

A diagnostic model trained on US ALS patients (selected through fragmented, access-stratified healthcare) and Italian patients (population-based sampling) is not simply predicting ALS—it may be predicting “ALS as diagnosed in specific healthcare contexts.” Deploying such a model in a third country (e.g., UK NHS, Japanese national health system, low-resource setting) introduces distribution shift that standard validation metrics do not capture. The model’s learned patterns may not transfer because the underlying patient populations differ systematically in ways orthogonal to ALS pathophysiology.

This represents a broader epistemological challenge in international biomarker consortia: geographic diversity does not guarantee generalizability if healthcare systems induce systematic sampling differences. Future multi-national studies should explicitly: (1) document healthcare system characteristics and referral pathways; (2) collect socioeconomic and clinical staging data to assess case-mix comparability; (3) conduct sensitivity analyses stratifying by healthcare context; and (4) acknowledge that even rigorously designed technical protocols cannot overcome structural sampling biases without prospective harmonization of patient ascertainment methods.

4.4 Methodological Lessons for the Field

This case study exemplifies a systematic methodological challenge in high-throughput biomarker discovery: how technical confounding can threaten reproducibility and clinical translatability even when investigators follow standard practices. Multi-site proteomics studies routinely report excellent cross-validation performance on pooled datasets, yet external replication often reveals substantially lower performance7,8. Understanding how and why this gap emerges is critical for improving future studies.

Understanding How Confounding Arises Naturally

When technical factors (anticoagulant type, batch, processing site) naturally correlate with diagnostic groups in multi-site studies, machine learning models may learn these technical patterns alongside biological signals9. Pooled cross-validation—the field standard—has inherent limitations for detecting this: by shuffling samples randomly, it maintains technical factor distributions in both training and test sets, allowing models to appear to generalize while potentially learning site-specific patterns. Leave-site-out validation, reverse prediction tests, and stratified analyses provide complementary information but are reported in <10% of studies, leaving confounding undetected until independent replication.
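The contrast between pooled and leave-site-out validation can be demonstrated directly: on synthetic data whose only "signal" is a site-specific shift, pooled CV reports a strong AUC while leave-site-out collapses toward chance. A hedged Python sketch; the sklearn setup is illustrative, not the study's model.

```python
# Sketch: pooled stratified CV vs leave-site-out CV when features carry
# ONLY a site shift (no disease biology) and diagnosis is imbalanced
# across sites, mimicking the confounded design described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, LeaveOneGroupOut,
                                     cross_val_score)

rng = np.random.default_rng(3)
n_per_site = 200
site = np.repeat([0, 1], n_per_site)
# Diagnosis distribution differs by site: 80% cases at site 0, 20% at site 1
y = np.concatenate([rng.random(n_per_site) < 0.8,
                    rng.random(n_per_site) < 0.2]).astype(int)
# Features are pure noise plus a uniform site-0 shift
X = rng.normal(0, 1, size=(2 * n_per_site, 20))
X[site == 0] += 1.0

clf = LogisticRegression(max_iter=1000)
pooled = cross_val_score(clf, X, y, scoring="roc_auc",
                         cv=StratifiedKFold(5, shuffle=True, random_state=0))
loso = cross_val_score(clf, X, y, scoring="roc_auc",
                       cv=LeaveOneGroupOut(), groups=site)
print(f"pooled AUC={pooled.mean():.2f}  leave-site-out AUC={loso.mean():.2f}")
```

The pooled estimate looks strong only because the model can exploit the site shift, which correlates with diagnosis; leave-site-out removes that shortcut.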

Generalizable Patterns Across Disease Areas

The patterns documented here are not unique to this study but reflect broader challenges across cancer genomics, Alzheimer’s proteomics, COVID-19 biomarker studies, and cardiovascular risk prediction. Batch effects are ubiquitous in high-throughput assays; when correlated with phenotype, they create reproducible yet artifactual signals that survive standard statistical testing. A 2018 methodological survey found <10% of multi-site biomarker studies reported leave-site-out validation, <5% tested reverse prediction, and >70% relied solely on pooled CV. Recognizing these limitations enables the field to develop better validation frameworks.

Practical Recommendations:

  1. Never use technical metadata as predictive features (tube type, batch ID, site, plate number) unless the goal is explicitly to model technical variation
  2. Mandate leave-site-out cross-validation for all multi-site biomarker studies as the primary validation metric, reported alongside pooled CV for comparison
  3. Conduct reverse prediction tests as routine diagnostic checks: train models to predict technical factors (batch, site, tube type) from molecular features; AUC ≥0.7 indicates actionable confounding3
  4. Report stratified analyses showing effect sizes within each site/batch separately before claiming robust biomarkers
  5. Document confounding structure explicitly in study design sections with contingency tables showing diagnosis × technical factor distributions
  6. Pre-register analysis plans specifying validation strategy before seeing data to prevent post-hoc rationalization of pooled CV
  7. Require external validation in independent cohorts with different technical contexts before publishing clinical claims
  8. Adopt reporting standards (e.g., MIAPE for proteomics) that explicitly address confounding assessment
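Recommendation 3, the reverse prediction check, is cheap to run and can be sketched in a few lines: fit a classifier to predict the technical factor from molecular features and flag any cross-validated AUC ≥ 0.7. Illustrative Python on synthetic data; the threshold follows the text above, not a universal standard.

```python
# Sketch of a reverse prediction test: predict tube type from features.
# Synthetic data with a tube artifact injected into 10 of 50 "proteins".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(200, 50))
tube = np.repeat([0, 1], 100)               # 0 = HEPARIN, 1 = EDTA
X[tube == 1, :10] += 0.9                    # simulated tube artifact

auc = cross_val_score(LogisticRegression(max_iter=1000), X, tube,
                      scoring="roc_auc",
                      cv=StratifiedKFold(5, shuffle=True, random_state=0)).mean()
flag = auc >= 0.7                            # threshold from the text above
print(f"reverse-prediction AUC={auc:.2f}; actionable confounding: {flag}")
```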

Long-term Solutions:

The field must move beyond post-hoc confounding detection toward prospective study designs preventing confounding. This requires: (1) tube type harmonization across sites, eliminating anticoagulant variation; (2) balanced randomization of diagnostic groups across batches to orthogonalize technical and biological factors; (3) block randomization ensuring each batch contains representative proportions of all groups; and (4) replicate inclusion of quality control samples for batch effect modeling. These design principles are standard in clinical trials but rarely enforced in observational biomarker studies, despite their importance for generalizable findings.
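Point (3), block randomization, can be sketched as shuffling within each diagnostic group and dealing samples round-robin across batches, which guarantees representative group proportions per batch. A hypothetical Python helper, not from the original study:

```python
# Sketch: balanced block randomization of samples to assay batches.
# Shuffle within each group, then deal round-robin so every batch gets
# (near-)equal proportions of each diagnostic group.
import numpy as np

def block_randomize(labels, n_batches, seed=0):
    rng = np.random.default_rng(seed)
    batch = np.empty(len(labels), dtype=int)
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        rng.shuffle(idx)                           # randomize within group
        batch[idx] = np.arange(len(idx)) % n_batches  # round-robin deal
    return batch

labels = np.array(["ALS"] * 60 + ["CTRL"] * 90)
batch = block_randomize(labels, n_batches=3)
for b in range(3):
    n_als = np.sum((batch == b) & (labels == "ALS"))
    n_ctl = np.sum((batch == b) & (labels == "CTRL"))
    print(f"batch {b}: ALS={n_als} CTRL={n_ctl}")
```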

Building a Culture of Collaborative Methodological Improvement

This case study demonstrates that independent methodological investigation can provide valuable complementary insights to internal validation. We advocate for a research culture where methodological reanalysis is embraced as collaborative quality assurance that strengthens the field. Proteomics benefits when methodological limitations are identified constructively, enabling investigators to refine approaches and preventing resource waste in follow-up studies.

Practical steps forward:

  • Data transparency: Journals increasingly require public data and code sharing, enabling methodological validation alongside scientific review
  • Methodological funding: Agencies can support investigations advancing field-wide analytical standards, recognizing their value for establishing best practices
  • Collaborative review: Viewing methodological critique as partnership rather than conflict enables constructive dialogue improving study design

Through transparent, collaborative methodological development, the proteomics field can build a reproducible foundation for clinical translation while learning from each study’s unique challenges.


5 Recommendations

5.1 For Interpreting the Original Study

What CAN Be Concluded:

  • Valuable multi-site proteomics dataset collected
  • NEFL and ~20 other proteins show geographically consistent signals
  • Rigorous OLINK quality control performed
  • Significant effort in data collection

What CANNOT Be Concluded:

  • “98% accuracy” does not reflect true diagnostic performance
  • Neurological control comparisons limited to EDTA context
  • Most of the published protein list requires validation
  • Model is not currently generalizable without tube harmonization

5.2 For Future Prospective Validation

Before Clinical Translation:

  1. Harmonize tube types across all sites (use single tube type)
  2. Use only 22 geographically consistent proteins as starting panel
  3. Design leave-site-out validation from the start
  4. Set realistic expectations (~0.75-0.80 AUC)
  5. Include multiple geographic cohorts
  6. Pre-register analysis plan to prevent p-hacking

5.3 For the Proteomics Field

Best Practices:

  • Tube type harmonization in study design phase
  • Explicit batch effect modeling
  • Geographic diversity in training data
  • Leave-site-out as standard validation
  • Reverse prediction tests as routine quality check

6 Conclusions

ImportantSummary of Findings

This investigation provides clear, converging evidence that the published ALS biomarker study1 suffers from severe confounding between plasma collection tube type (HEPARIN vs. EDTA), geographic origin (Italy vs. US), and diagnostic distribution. The perfect confounding structure—100% of neurological controls in EDTA, 85% of ALS in HEPARIN—renders biological signals non-identifiable from technical artifacts.
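The confounding structure can be quantified with Cramér's V on the site × tube contingency table, as in the V = 1.0 figure reported in the Executive Summary. A minimal Python sketch with an illustrative table (not the study's exact counts):

```python
# Sketch: Cramér's V for a site × tube contingency table.
# V = 1.0 indicates perfect association (complete nesting of tube in site).
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Perfect site-tube nesting: Italy only HEPARIN, US only EDTA
table = np.array([[0, 250],     # Italy: EDTA = 0,   HEPARIN = 250
                  [150, 0]])    # US:    EDTA = 150, HEPARIN = 0
print(f"Cramér's V = {cramers_v(table):.2f}")
```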

6.0.1 Three Converging Lines of Evidence

1. Geographic Generalization Failure (Leave-Country-Out Cross-Validation)

Models trained on one geographic cohort fail markedly when tested on another, with mean AUC dropping from 0.931 (pooled CV) to 0.770 (leave-country-out CV)—a 17.3% performance degradation. Most critically, models trained on US/EDTA samples achieve only 1.8% sensitivity when tested on Italian/HEPARIN samples, misclassifying 98% of true ALS cases as controls. This bidirectional generalization failure demonstrates that the model learned site-specific technical artifacts rather than transportable disease biology.

2. Differential Analysis Confounding (Stratified Replication Failure)

Only 26.9% of pooled significant proteins replicate in both geographic strata with concordant direction. The majority of “significant” proteins identified in pooled analysis (73.1%) represent likely false discoveries driven by confounding: these proteins show associations with diagnosis in pooled data but fail to replicate when tube type and geography are controlled through stratification. This extraordinarily low replication rate indicates that most published biomarker claims are artifacts of the confounded study design.

3. Residual Confounding in “Geographically Consistent” Candidates

Even the 22 proteins that survive rigorous geographic stratification (significant in both Italy and US with concordant direction) show substantial residual confounding:

  • Reverse prediction test: These 22 proteins achieve AUC = 0.916 for predicting tube type, demonstrating strong entanglement with technical artifacts
  • Healthy control analysis: 10 of 22 (45%) show significant tube type effects in disease-free individuals (FDR < 0.05), demonstrating technical confounding independent of ALS biology
  • Effect decomposition: Median tube/disease ratio = 0.22, with some proteins (e.g., DTNB) showing tube effects larger than disease effects

These converging tests demonstrate that even our most conservative biomarker candidates remain substantially contaminated by technical artifacts due to perfect confounding.

6.0.2 Conclusion

The original study’s reported 98% diagnostic accuracy is severely inflated by confounding and does not reflect true biological performance. Realistic cross-site performance is approximately 77% AUC—inadequate for clinical deployment. While some genuine biological signal exists (particularly for NEFL, a well-established ALS biomarker), the majority of published protein associations likely represent technical artifacts arising from the confounded study design. The diagnostic model is not currently generalizable across sites with different tube types without complete redesign and prospective validation in tube-balanced cohorts.

6.0.3 Value to Science

This reanalysis serves as an educational case study for the proteomics field, demonstrating:

  • The critical importance of leave-site-out validation as the primary generalizability test for multi-site biomarker studies
  • Why technical factors (tube type, batch, site) should never be included as predictive features in diagnostic models
  • How pooled cross-validation produces deceptively high performance metrics when confounding is present, masking generalization failures that only emerge under rigorous geographic validation
  • The necessity of prospective study designs with tube type harmonization and balanced randomization to prevent non-identifiability of biological effects
  • The value of multi-method validation frameworks (reverse prediction, stratified analysis, residual confounding tests) for detecting and quantifying technical artifacts
  • The need for transparent reporting of confounding structures and analytical limitations in high-throughput biomarker discovery research

Recommendations for the Field:

  1. Mandate leave-site-out cross-validation for all multi-site studies
  2. Conduct reverse prediction tests as routine quality checks (AUC ≥0.7 indicates actionable confounding)
  3. Harmonize pre-analytical protocols (especially tube types) across sites in prospective studies
  4. Report stratified effect sizes before claiming robust biomarkers
  5. Pre-register analysis plans specifying validation strategies to prevent post-hoc rationalization

This investigation exemplifies how rigorous methodological scrutiny can reveal confounding that internal validation procedures miss, ultimately advancing field-wide standards and protecting the integrity of clinical biomarker translation.


References

1.
Chia, R. et al. A plasma proteomics-based candidate biomarker panel predictive of amyotrophic lateral sclerosis. Nature Medicine (2025). doi:10.1038/s41591-025-03890-6.
2.
Cohen, J. Statistical Power Analysis for the Behavioral Sciences. (Lawrence Erlbaum Associates, Hillsdale, NJ, 1988).
3.
Hosmer, D. W. & Lemeshow, S. Applied Logistic Regression. 156–164 (Wiley, 2000).
4.
U.S. Food and Drug Administration. Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. (2021).
5.
Vabalas, A., Gowen, E., Poliakoff, E. & Casson, A. J. Machine learning algorithm validation with a limited sample size. PLOS ONE 14, e0224365 (2019).
6.
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47 (2015).
7.
Begley, C. G. & Ellis, L. M. Raise standards for preclinical cancer research. Nature 483, 531–533 (2012).
8.
Ioannidis, J. P. A. Why most published research findings are false. PLOS Medicine 2, e124 (2005).
9.
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, 733–739 (2010).
10.
Bowen, R. A. R. & Remaley, A. T. Interferences from blood collection tube components on clinical chemistry assays. Biochemia Medica 24, 31–44 (2014).
11.
Mohri, M., Shakeri, H. & Lotfollah Zadeh, S. Effects of common anticoagulants (heparin, citrate and EDTA) on routine plasma biochemistry of cattle. Comparative Clinical Pathology 16, 207–209 (2007).
12.
Hagn, G. et al. Plasma instead of serum avoids critical confounding of clinical metabolomics studies by platelets. Journal of Proteome Research 23, 3064–3075 (2024).
13.
Yu, Y., Mai, Y., Zheng, Y., et al. Assessing and mitigating batch effects in large-scale omics studies. Genome Biology 25, 254 (2024).
14.
Murchan, P., Ó Broin, P., Baird, A.-M., Sheils, O. & Finn, S. P. Deep feature batch correction using ComBat for machine learning applications in computational pathology. Journal of Pathology Informatics 15, 100396 (2024).

7 Supplementary Materials

7.1 Session Information

Show code
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS 26.0.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ggrepel_0.9.6     ggvenn_0.1.19     pROC_1.19.0.1     visNetwork_2.1.4 
 [5] viridisLite_0.4.2 here_1.0.2        patchwork_1.3.2   knitr_1.50       
 [9] lubridate_1.9.4   forcats_1.0.1     stringr_1.5.2     dplyr_1.1.4      
[13] purrr_1.1.0       readr_2.1.5       tidyr_1.3.1       tibble_3.3.0     
[17] ggplot2_4.0.0     tidyverse_2.0.0   targets_1.11.4   

loaded via a namespace (and not attached):
 [1] rlang_1.1.6          magrittr_2.0.4       compiler_4.4.2      
 [4] mgcv_1.9-3           systemfonts_1.3.1    callr_3.7.6         
 [7] vctrs_0.6.5          reshape2_1.4.4       pkgconfig_2.0.3     
[10] fastmap_1.2.0        backports_1.5.0      labeling_0.4.3      
[13] rmarkdown_2.30       prodlim_2025.04.28   markdown_2.0        
[16] tzdb_0.5.0           ps_1.9.1             ragg_1.5.0          
[19] xfun_0.53            litedown_0.7         jsonlite_2.0.0      
[22] recipes_1.3.1        parallel_4.4.2       prettyunits_1.2.0   
[25] R6_2.6.1             stringi_1.8.7        RColorBrewer_1.1-3  
[28] parallelly_1.45.1    rpart_4.1.24         Rcpp_1.1.0          
[31] iterators_1.0.14     future.apply_1.20.0  Matrix_1.7-3        
[34] splines_4.4.2        nnet_7.3-20          igraph_2.1.4        
[37] timechange_0.3.0     tidyselect_1.2.1     rstudioapi_0.17.1   
[40] yaml_2.3.10          timeDate_4041.110    ggtext_0.1.2        
[43] codetools_0.2-20     processx_3.8.6       listenv_0.9.1       
[46] lattice_0.22-7       plyr_1.8.9           withr_3.0.2         
[49] S7_0.2.0             evaluate_1.0.5       future_1.67.0       
[52] survival_3.8-3       xml2_1.4.0           pillar_1.11.1       
[55] BiocManager_1.30.26  renv_1.1.4           foreach_1.5.2       
[58] stats4_4.4.2         generics_0.1.4       rprojroot_2.1.1     
[61] hms_1.1.3            commonmark_2.0.0     scales_1.4.0        
[64] globals_0.18.0       base64url_1.4        class_7.3-23        
[67] glue_1.8.0           tools_4.4.2          data.table_1.17.8   
[70] ModelMetrics_1.2.2.2 gower_1.0.2          grid_4.4.2          
[73] ipred_0.9-15         nlme_3.1-168         cli_3.6.5           
[76] textshaping_1.0.4    lava_1.8.1           gtable_0.3.6        
[79] digest_0.6.37        caret_7.0-1          htmlwidgets_1.6.4   
[82] farver_2.1.2         htmltools_0.5.8.1    lifecycle_1.0.4     
[85] hardhat_1.4.2        secretbase_1.0.5     gridtext_0.1.5      
[88] MASS_7.3-65         

7.2 Reproducibility

All analyses are fully reproducible using the targets pipeline:

# From project root
renv::restore()           # Install exact package versions
targets::tar_make()       # Execute full pipeline
quarto::quarto_render("reports/confounding_investigation.qmd")

Analysis Date: 2025-10-27 · Pipeline Version: 2.0 · Data Version: published dataset1


7.3 Technical Deep-Dive — Why Anticoagulant Choice Matters for ML Biomarker Models

This technical appendix synthesizes biochemical mechanisms, quantitative evidence, and computational implications explaining why tube type (EDTA vs HEPARIN) confounding represents a critical failure mode in ML diagnostics. This analysis draws from comprehensive review of clinical chemistry, proteomics, and machine learning literature (2023-2025).

7.3.1 A1. Mechanistic Biochemistry: Distinct Anticoagulant Effects on the Proteome

EDTA (Ethylenediaminetetraacetic Acid) — Metal Chelation

EDTA functions as a hexadentate chelating agent with exceptionally high affinity for divalent cations10. The formation constant (log K) for EDTA-Ca²⁺ complexes is approximately 10.7, meaning EDTA effectively depletes free calcium from ~1.15 mmol/L (physiological) to 0.05-0.15 mmol/L in plasma. This near-total calcium sequestration:

  • Inhibits metalloenzymes: Alkaline phosphatase (20-50% activity reduction), creatine kinase, matrix metalloproteinases (MMPs)
  • Blocks calcium-dependent pathways: Complement cascade (C3a, C5a production 2-10× reduced), coagulation factors, calcium signaling proteins
  • Alters protein conformation: Changes epitope accessibility for immunoassays by 15-40% for proteins like neurofilament light chain
  • Stabilizes against proteolysis: Prevents calcium-dependent protease activity, altering protein degradation kinetics

Quantitative Impact: Measured free Ca²⁺ reduced by >90%, Mg²⁺ reduced by 20-50%. Complement proteins show 40-60% reduction in EDTA vs heparin plasma.

HEPARIN — Glycosaminoglycan Protein Binding

Heparin operates through antithrombin III potentiation but creates extensive off-target effects through its highly sulfated structure. This negative charge density binds ~30% of plasma proteins non-specifically (growth factors, chemokines), masks epitopes (30-80% concentration reductions for VEGF-A, FGF2, HGF, CXCL12), induces ex vivo platelet activation, and inhibits downstream assays (RT-qPCR Ct shifts +3-8 cycles).

Quantitative Impact: Heparin-binding cytokines show 1.5-5× lower readouts in heparin vs EDTA. Platelet-derived analytes vary 2-10×.

Critical Asymmetry: EDTA and heparin create orthogonal perturbation patterns—EDTA suppresses metal-dependent proteins; heparin suppresses charged proteins. This produces distinct biochemical “fingerprints” ML models exploit.

7.3.2 A2. Quantitative Effect Sizes: Tube Signal Exceeds Disease Signal

Empirical Magnitude Across Analyte Classes

Systematic reviews document tube type effects ranging from 10-300% depending on biomarker class10,12:

Biomarker Class             Effect Size                          Mechanism
Complement proteins         40-60% reduction (EDTA vs heparin)   Calcium chelation blocks cascade
Heparin-binding cytokines   30-80% reduction (heparin)           Direct binding interference
Metalloenzymes              15-25% activity loss (EDTA)          Cofactor depletion
Metabolites                 10-30% median shift                  Ionization artifacts, metal equilibria
Cell-free nucleic acids     20-60% library reduction (heparin)   Polymerase inhibition
Platelet-derived proteins   1.5-4× variation                     Ex vivo activation differences

Critical Comparison to Disease Signal

  • Typical ALS vs control differences: 1.5-3× (based on published literature)
  • Tube type differences: 10-300% for vulnerable classes
  • Implication: Tube effects equal or exceed biological disease signals, creating perfect conditions for ML to learn technical artifacts

Concrete Example from This Study: In healthy controls (disease-free), NEFL shows a 0.96 NPX elevation in US/EDTA relative to Italy/HEPARIN (limma-adjusted for age and sex, p = 6.5×10⁻¹⁵). This tube artifact occurs without any disease present, proving technical dominance.
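The control-only adjustment behind this example can be sketched with a plain linear model on simulated data. Everything below is simulated; only the 0.96 NPX shift mirrors the NEFL figure quoted above, and the actual analysis used limma rather than `lm()`.

```r
# Sketch: estimating a tube-type effect in disease-free controls with an
# age/sex-adjusted linear model. Data are simulated; only the 0.96 NPX
# shift mirrors the NEFL example in the text (the study itself used limma).
set.seed(11)
n <- 120
tube <- rep(c("EDTA", "HEPARIN"), each = n / 2)
age  <- rnorm(n, mean = 60, sd = 8)
sex  <- rbinom(n, 1, 0.5)
# NPX is shifted by tube type alone -- no disease variable exists here
npx <- 0.96 * (tube == "EDTA") + 0.01 * age + rnorm(n, sd = 0.4)
fit <- lm(npx ~ tube + age + sex)
coef(summary(fit))["tubeHEPARIN", ]  # recovers a ~ -0.96 NPX shift for HEPARIN
```

Because no disease label enters the simulation, any recovered tube coefficient is, by construction, purely technical, which is the logic of the healthy-control test.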

7.3.3 A3. ML Batch Effect Propagation: “Shortcut Learning” Failure Mode

The Computational Mechanism

Machine learning models optimize to minimize training loss, making them indifferent to whether variance comes from biology or artifacts. When systematic technical variation exists, models learn it preferentially because: (1) tube type creates perfect binary separation (EDTA/HEPARIN), (2) high-variance features dominate (complement proteins, heparin-binding proteins), and (3) standard cross-validation maintains tube distribution in train/test splits.

Empirical Evidence: Reverse prediction AUC = 0.999 (tube signal dominates), tube type ranked #2 of 2,869 features, cross-matrix performance collapse (AUC 0.90 pooled → 0.50-0.65 leave-country-out).

Why Standard Validation Fails: Pooled CV shuffles samples randomly, maintaining tube distribution in both train and test sets, allowing models to learn tube-associated patterns and achieve high validation metrics while learning artifacts.

Leave-country-out CV exposes this: Training on Italy/HEPARIN → testing on US/EDTA forces generalization across tube contexts, causing performance collapse because “HEPARIN signature = ALS” fails on EDTA.
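This collapse can be reproduced in miniature with base R. The sketch below simulates one "protein" that tracks tube type rather than disease under a confounded design (all proportions and effect sizes are illustrative): the pooled AUC is inflated, while the within-site AUC sits near chance.

```r
# Sketch: a feature that tracks tube type, not disease, scores well in
# pooled evaluation but near chance within a single site.
# All values are simulated and illustrative.
set.seed(42)
n <- 200
site    <- rep(c("Italy", "US"), each = n / 2)
tube    <- as.integer(site == "Italy")                 # Italy=HEPARIN, US=EDTA
disease <- rbinom(n, 1, ifelse(site == "Italy", 0.85, 0.15))
protein <- 2 * tube + rnorm(n)                         # pure tube artifact

auc <- function(score, label) {                        # rank-based (Mann-Whitney) AUC
  r <- rank(score); n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

pooled <- auc(protein, disease)                        # inflated by confounding
within <- auc(protein[site == "US"], disease[site == "US"])  # ~0.5: no real signal
c(pooled = pooled, within_US = within)
```

The same feature that looks diagnostic when sites are pooled carries no information once evaluation is restricted to a single tube context, which is exactly what leave-country-out validation forces.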

7.3.4 A4. Biomarker Class Vulnerability Patterns

High-Risk Classes:

  1. Metalloproteins (SOD1, ceruloplasmin): EDTA vulnerability 50-200% via cofactor removal
  2. Heparin-binding cytokines (VEGF, FGF2): heparin vulnerability 30-80% via epitope masking
  3. Complement/coagulation factors (C3, C4, fibrinogen): vulnerable to both tubes via cascade interruption
  4. Metabolites (lysophospholipids): 10-30% shifts via ionization artifacts
  5. Cell-free nucleic acids: extreme heparin vulnerability (Ct +3-8 cycles) via polymerase inhibition

Moderate-Risk: Stable structural proteins (<5% variance), non-metal-dependent enzymes (variable).

Critical for ALS: Many neurodegeneration biomarkers (NEFL, complement, inflammatory markers) fall into high-risk categories, amplifying confounding.

7.3.5 A5. ML Propagation Cascade: From Biochemistry to Model Failure

Stage 1: Pre-Analytical Perturbation (Biochemistry layer)

  • EDTA chelates Ca²⁺/Mg²⁺ → alters 20-50% of plasma proteins
  • Heparin binds charged proteins → alters a different 30-40% of proteins
  • Result: Two distinct, systematic proteomic “states” created purely by tube choice

Stage 2: Feature Engineering (Data layer)

  • High-variance features (tube-affected proteins) dominate PCA components
  • Tube type becomes a latent factor explaining 30-50% of variance
  • Normalization (z-score, quantile) preserves relative differences between tube types
  • Result: Feature space encodes tube chemistry as the primary axis of variation

Stage 3: Model Training (Algorithm layer)

  • Random Forest, gradient boosting, and neural networks all learn to exploit the tube signal
  • Feature importance concentrates on tube-affected proteins (complement, heparin-binding)
  • With perfect confounding (ALS=HEPARIN, OND=EDTA), the model learns “HEPARIN pattern = ALS”
  • Result: Deceptively high training/CV metrics (AUC 0.90-0.98)

Stage 4: Deployment Failure (Clinical layer)

  • Model deployed to a site using a different tube type
  • Tube-learned patterns are absent or reversed in the new context
  • Performance collapses: Italy→US test AUC = 0.77; US→Italy test sensitivity = 1.8%
  • Result: Clinical misdiagnosis, loss of trust, wasted resources

This cascade is systematic and predictable when confounding structure is ignored during study design.

7.3.6 A6. Clinical and Equity Implications

Geographic Disparities in Model Performance

Healthcare institutions make tube type choices based on:

  • Historical precedent and established protocols
  • Vendor contracts and cost optimization
  • Local laboratory workflow requirements
  • Research vs clinical laboratory differences

This creates systematic geographic clustering:

  • Academic medical centers: Often citrate/specialized protocols
  • Community hospitals: Often EDTA/heparin for cost/throughput
  • International differences: US/EDTA vs European/heparin preferences common

Equity Impact

When ML models are trained at academic centers using one tube type and deployed to community hospitals using another:

  • Performance degradation disproportionately affects underserved populations served by community settings
  • ALS diagnostic delays (already 12-18 months on average) worsen in affected regions
  • False negatives increase in populations already facing healthcare disparities

Regulatory and Translational Failure

  • FDA/regulatory validation typically tests on single-institution cohorts with uniform tube types
  • External validation often uses same tube type as training (e.g., UK Biobank EDTA)
  • Post-market surveillance may not stratify performance by laboratory protocol
  • Result: Approved models fail in real-world heterogeneous deployment

7.3.7 A7. Why “Geographically Consistent” Proteins Still Show Confounding

Three Converging Tests (This Study)

Even proteins significant in BOTH Italy and US with concordant direction showed:

  1. Reverse prediction test: 22 “robust” proteins achieve AUC = 0.916 for predicting tube type
    • Interpretation: Strong residual tube signal despite stratification
  2. Healthy control test: 50% show significant tube effects in disease-free individuals
    • Interpretation: Tube effects exist independent of disease biology
  3. Effect decomposition: Median tube/disease ratio = 0.22 (tube effects 20-50% of disease effects)
    • Example: DTNB has tube effect (0.62 NPX) LARGER than disease effect (0.56 NPX)
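The decomposition ratio in test 3 is a simple per-protein quantity. The sketch below computes it for a toy table; the DTNB row uses the values quoted above, while the other rows are hypothetical placeholders.

```r
# Sketch: tube-to-disease effect ratio per protein. The DTNB values come
# from the text; "ProteinA" and "ProteinB" are hypothetical placeholders.
effects <- data.frame(
  protein     = c("DTNB", "ProteinA", "ProteinB"),
  tube_npx    = c(0.62, 0.10, 0.30),   # tube effect estimated in healthy controls
  disease_npx = c(0.56, 0.50, 1.20)    # pooled case-control effect estimate
)
effects$ratio <- abs(effects$tube_npx) / abs(effects$disease_npx)
effects[effects$ratio > 1, ]  # proteins where the tube effect exceeds the disease effect
```

A ratio above 1, as for DTNB here, means the technical shift measured in disease-free controls is larger than the entire apparent disease signal.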

Why Perfect Confounding Prevents Separation

With 100% of neurological controls in EDTA and 85% of ALS in HEPARIN:

  • Any protein elevated in ALS will appear “higher in HEPARIN”
  • Any protein suppressed in ALS will appear “lower in HEPARIN”
  • Cannot distinguish whether an effect is (a) true biology, (b) tube artifact, or (c) both entangled

Even NEFL—a well-validated ALS biomarker—shows a 0.96 NPX tube effect in healthy controls (US/EDTA vs Italy/HEPARIN, age/sex-adjusted). This doesn’t mean NEFL isn’t a real biomarker; it means the measured effect size in this study is contaminated by technical artifacts.
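The degree of site–tube entanglement itself can be quantified with Cramér's V, the statistic reported for this dataset. A minimal base-R sketch (cell counts here are illustrative, not the study's actual counts) returns 1 when site fully determines tube type:

```r
# Minimal sketch: quantifying site-tube confounding with Cramer's V
# from a contingency table (counts are illustrative, not the study's).
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(dim(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

tab <- matrix(c(100, 0, 0, 80), nrow = 2,
              dimnames = list(site = c("US", "Italy"),
                              tube = c("EDTA", "HEPARIN")))
cramers_v(tab)  # 1 when site fully determines tube type
```

A value of 1, as in the head-line finding, means the two variables are informationally identical: no statistical adjustment can separate their effects after the fact.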

7.3.8 A8. Batch Correction Limitations

Why Standard Corrections Fail Here

Methods like ComBat, Harmony, or quantile normalization assume:

  • Batch effects are additive or multiplicative across all features
  • Biological effects are independent of batch structure
  • Sufficient samples exist in each batch × phenotype combination to estimate effects

Perfect confounding violates all three:

  • Tube effects are protein-specific (heterogeneous, not uniform)
  • Disease and tube type are perfectly correlated (not independent)
  • Some diagnosis × tube combinations have zero samples (neurological controls × HEPARIN = 0)

Attempting batch correction would:

  • Remove an unknown mixture of biological and technical variance
  • Introduce artificial structure when estimating effects from an imbalanced design
  • Provide false confidence in “corrected” results that may be overcorrected

The only robust solution: Balanced study design from the start (equal ALS/control in both EDTA and HEPARIN).

7.3.9 A9. Quantitative ML Performance Projections

Based on Batch Effect Literature

Studies of batch effects in genomics/proteomics with similar confounding structures show9,13,14:

| Scenario | Projected AUC Degradation |
|---|---|
| Optimistic: Same tube, different manufacturers/lots | 5-10% |
| Moderate: EDTA → Citrate or reverse | 10-15% |
| Pessimistic: EDTA ↔ Heparin (this study) | 15-30% |

Empirical Confirmation (This Study):

  • Pooled CV AUC: 0.951
  • Leave-country-out mean AUC: 0.770
  • Observed degradation: 17.3% (within the “pessimistic” projection range)

Why Non-Linear Models Amplify Effects

Random Forest and deep learning models create:

  • Axis-aligned splits (decision trees): Systematic feature shifts push samples across thresholds
  • Non-linear activations (neural networks): Tube effects can trigger saturation/sparsity regimes
  • Ensemble averaging: Multiple trees all learn the tube pattern, compounding the error

Linear models (logistic regression) show smaller but still substantial degradation (~10-15% AUC loss), as tube effects propagate proportionally through learned weights.

7.3.10 A10. Field-Wide Implications and Solutions

The Reproducibility Crisis Connection

This case study exemplifies broader patterns in ML medical diagnostics where multi-site studies often fail to rigorously test geographic generalization7,8,13. Studies frequently rely solely on pooled cross-validation without leave-site-out validation or reverse prediction tests, resulting in a literature saturated with inflated performance claims that fail external validation.

Prospective Solutions (Study Design Level)

  1. Harmonize tube types across all sites in protocol design phase
  2. Block randomization of diagnostic groups across batches/plates
  3. Replicate samples across tube types for empirical correction
  4. Balanced designs: Equal proportions of all diagnoses in each tube type
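Items 2 and 4 amount to stratified tube assignment at enrollment. A minimal base-R sketch (sample sizes illustrative) shows the allocation that would have broken the confounding here, filling every diagnosis × tube cell:

```r
# Sketch: stratified (blocked) tube assignment so every diagnosis x tube
# cell is filled -- the design that prevents the confounding described.
# Sample sizes are illustrative.
set.seed(3)
samples <- data.frame(id = 1:120,
                      diagnosis = rep(c("ALS", "Control"), each = 60),
                      stringsAsFactors = FALSE)
samples$tube <- NA_character_
for (d in unique(samples$diagnosis)) {
  idx <- samples$diagnosis == d
  # within each diagnosis stratum, split tubes exactly 50/50 at random
  samples$tube[idx] <- sample(rep(c("EDTA", "HEPARIN"),
                                  length.out = sum(idx)))
}
table(samples$diagnosis, samples$tube)  # 30 samples in each of the four cells
```

With all four cells occupied, tube and diagnosis effects become separately estimable, which is precisely what the observed design (controls × HEPARIN = 0) precludes.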

Analytical Solutions (When Confounding Exists)

  1. Mandatory leave-site-out CV as primary validation metric
  2. Reverse prediction tests as diagnostic quality checks (AUC >0.7 triggers concern)
  3. Stratified analyses showing within-site effect sizes before pooling
  4. Explicit confounding documentation with contingency tables in methods
  5. Transparent limitation reporting of non-identifiability
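The reverse prediction check in item 2 is straightforward to implement. The base-R sketch below fits a classifier to predict tube type from simulated features; an AUC above the ~0.7 threshold flags residual technical signal (data simulated; in practice the scores should be cross-validated rather than in-sample):

```r
# Sketch of a reverse prediction check: predict tube type from features.
# AUC above ~0.7 flags residual technical signal. Data are simulated;
# in practice use cross-validated rather than in-sample scores.
set.seed(7)
n <- 160; p <- 20
tube <- rep(0:1, each = n / 2)
X <- matrix(rnorm(n * p), n, p)
X[, 1:5] <- X[, 1:5] + 0.8 * tube          # five tube-affected "proteins"

fit <- glm(tube ~ X, family = binomial)
score <- fitted(fit)

auc <- function(s, y) {                     # rank-based (Mann-Whitney) AUC
  r <- rank(s); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(score, tube)  # well above 0.7: features carry strong tube signal
```

The check inverts the usual question: instead of asking whether features predict disease, it asks whether they predict the technical variable, and a high AUC means the feature set cannot be trusted to carry only biology.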

Regulatory Requirements

  • Pre-analytical protocol documentation in device submissions
  • Multi-site validation with heterogeneous tube types
  • Post-market surveillance stratified by laboratory protocol
  • Predetermined change control for protocol variations

7.3.11 A11. Summary: Convergence of Evidence

Mechanistic Plausibility ✓

  • EDTA and heparin have well-characterized, distinct biochemical effects
  • Protein classes show predictable vulnerability patterns
  • Effects documented extensively in clinical chemistry literature

Quantitative Magnitude ✓

  • Tube effects (10-300%) often exceed disease signals (1.5-3×)
  • Effect sizes consistently replicated across multiple studies
  • Observed in disease-free healthy controls, proving non-biological origin

ML-Specific Propagation ✓

  • Reverse prediction demonstrates tube signal dominance (AUC 0.999)
  • Leave-country-out CV confirms generalization failure (17.3% drop)
  • Feature importance and SHAP values concentrate on tube-affected proteins

Clinical Consequences ✓

  • Cross-site deployment failure (sensitivity 1.8% Italy→US)
  • Geographic performance disparities documented
  • Equity implications for underserved populations

Conclusion: The perfect storm of (1) strong mechanistic effects, (2) perfect confounding structure, and (3) standard ML methodology that doesn’t test confounding creates systematic, reproducible, yet entirely artifactual high performance. This represents a critical methodological failure mode that the field must address through improved study design and validation practices.

Tip: Document Information

Version: 2.0
Status: Final
Contact: See project repository for details


7.4 Computational Transparency

This appendix provides complete transparency into the computational pipeline that generated all results in this report. All computations are managed by the targets pipeline, ensuring reproducibility and dependency tracking. Every figure, table, and statistic in this analysis is programmatically generated and version-controlled.
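A pipeline in this style is declared as a list of `tar_target()` calls in a `_targets.R` file. The sketch below is a hypothetical skeleton only: `run_leave_country_out()` and `render_report()` are illustrative stand-ins, not the project's actual functions, though the target names mirror those in the tables that follow.

```r
# Hypothetical _targets.R skeleton in the style described. The helper
# functions run_leave_country_out() and render_report() are illustrative
# stand-ins, not the project's actual functions.
library(targets)

list(
  tar_target(raw_data, readRDS("data/npx_data.rds")),
  tar_target(lcv_results, run_leave_country_out(raw_data)),
  tar_target(main_report, render_report(lcv_results), format = "file")
)
```

Each target caches its result and records its upstream dependencies, so `targets::tar_make()` recomputes only what changed.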

Pipeline Architecture: See Figure 1 in the Methods section for the complete computational dependency graph.

7.4.1 Execution Summary

library(dplyr)
library(purrr)

meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error"), store = "../_targets")

# Convert to numeric (tar_meta returns character sometimes)
seconds_num <- as.numeric(meta$seconds)
bytes_num <- as.numeric(meta$bytes)

# Count warnings
total_warnings <- sum(purrr::map_int(meta$warnings, function(w) {
  if (is.null(w) || length(w) == 0) 0L else length(w)
}))

tibble::tibble(
  Metric = c("Total Targets", "Total Time (hours)", "Total Size (GB)",
             "Avg Time (min)", "Max Time (min)", "Warnings", "Errors"),
  Value = c(
    nrow(meta),
    round(sum(seconds_num, na.rm = TRUE) / 3600, 2),
    round(sum(bytes_num, na.rm = TRUE) / 1e9, 2),
    round(mean(seconds_num, na.rm = TRUE) / 60, 2),
    round(max(seconds_num, na.rm = TRUE) / 60, 2),
    total_warnings,
    sum(!is.na(meta$error))
  )
) |>
  knitr::kable()
Metric Value
Total Targets 160.00
Total Time (hours) 0.09
Total Size (GB) 0.78
Avg Time (min) 0.09
Max Time (min) 1.65
Warnings 160.00
Errors 0.00

7.4.2 Target Details (Top 20 by Execution Time)

library(dplyr)
library(purrr)

meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")

# Format table inline
meta |>
  dplyr::mutate(
    time_min = as.numeric(seconds) / 60,
    size_mb = as.numeric(bytes) / 1e6,
    warnings_count = purrr::map_int(warnings, function(w) {
      if (is.null(w) || length(w) == 0) 0L else length(w)
    }),
    has_error = !is.na(error)
  ) |>
  dplyr::select(name, time_min, size_mb, warnings_count, has_error, format) |>
  dplyr::arrange(dplyr::desc(time_min)) |>
  head(20) |>
  knitr::kable(
    digits = 2,
    col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Has Error", "Format")
  )
Target Time (min) Size (MB) Warnings Has Error Format
reverse_prediction_results 1.65 6.68 1 FALSE rds
original_vs_rigorous_comparison 0.57 20.78 1 FALSE rds
main_report 0.56 10.18 1 FALSE file
tube_type_contribution 0.50 13.81 1 FALSE rds
pooled_vs_lcv_comparison 0.49 13.90 1 FALSE rds
model_without_tube 0.25 6.89 1 FALSE rds
lcv_results 0.25 6.97 1 FALSE rds
pooled_cv_results 0.25 6.93 1 FALSE rds
model_with_tube 0.25 6.92 1 FALSE rds
within_country_cv_results 0.24 6.99 1 FALSE rds
data_validation 0.09 0.00 1 FALSE rds
biomarker_tube_prediction 0.08 1.03 1 FALSE rds
concordant_tube_prediction 0.07 0.68 1 FALSE rds
consistent_proteins_reverse_prediction 0.06 0.51 1 FALSE rds
plot_pca 0.05 52.99 1 FALSE rds
top_proteins_tube_prediction 0.04 0.34 1 FALSE rds
all_visualizations 0.04 61.17 1 FALSE rds
differential_pooled 0.03 0.14 1 FALSE rds
raw_data 0.02 33.36 1 FALSE rds
original_biomarkers 0.02 0.00 1 FALSE rds

7.4.3 Key Analysis Domains

Leave-Country-Out Cross-Validation

Target Time (min) Size (MB) Warnings Format
lcv_results 0.25 6.97 1 rds
pooled_cv_analysis NA NA 1 NA
pooled_cv_results 0.25 6.93 1 rds
pooled_vs_lcv_comparison 0.49 13.90 1 rds

Differential Expression Analysis

library(dplyr)
library(purrr)

meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")

# Filter and format
meta |>
  filter(stringr::str_starts(name, "differential_|protein_concordance|stratified_vs_pooled")) |>
  dplyr::mutate(
    time_min = as.numeric(seconds) / 60,
    size_mb = as.numeric(bytes) / 1e6,
    warnings_count = purrr::map_int(warnings, function(w) {
      if (is.null(w) || length(w) == 0) 0L else length(w)
    })
  ) |>
  dplyr::select(name, time_min, size_mb, warnings_count, format) |>
  dplyr::arrange(name) |>
  knitr::kable(
    digits = 2,
    col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Format")
  )
Target Time (min) Size (MB) Warnings Format
differential_analysis_italy NA NA 1 NA
differential_analysis_pooled NA NA 1 NA
differential_analysis_us NA NA 1 NA
differential_italy 0.02 0.14 1 rds
differential_pooled 0.03 0.14 1 rds
differential_us 0.01 0.14 1 rds
protein_concordance 0.00 0.24 1 rds
stratified_vs_pooled_comparison 0.00 0.00 1 rds

Machine Learning Evaluation

library(dplyr)
library(purrr)

meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")

# Filter and format
meta |>
  filter(stringr::str_starts(name, "model_|reverse_|tube_type_|biomarker_")) |>
  dplyr::mutate(
    time_min = as.numeric(seconds) / 60,
    size_mb = as.numeric(bytes) / 1e6,
    warnings_count = purrr::map_int(warnings, function(w) {
      if (is.null(w) || length(w) == 0) 0L else length(w)
    })
  ) |>
  dplyr::select(name, time_min, size_mb, warnings_count, format) |>
  dplyr::arrange(dplyr::desc(time_min)) |>
  knitr::kable(
    digits = 2,
    col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Format")
  )
Target Time (min) Size (MB) Warnings Format
reverse_prediction_results 1.65 6.68 1 rds
tube_type_contribution 0.50 13.81 1 rds
model_without_tube 0.25 6.89 1 rds
model_with_tube 0.25 6.92 1 rds
biomarker_tube_prediction 0.08 1.03 1 rds
tube_type_colors_viridis NA NA 1 NA
reverse_prediction_test_robust_only NA NA 1 NA
reverse_prediction_test NA NA 1 NA

7.4.4 Reproducibility Information

# Get git information inline
git_commit <- tryCatch(
  system("git rev-parse HEAD", intern = TRUE),
  error = function(e) "unknown"
)

git_status <- tryCatch(
  system("git status --short", intern = TRUE),
  error = function(e) character(0)
)

git_clean <- length(git_status) == 0

cat("**Git commit:**", git_commit, "\n\n")
**Git commit:** 7a3d97e3a157a917d6157b95363bfc4dbe05dd52 
cat("**Repository clean:**", git_clean, "\n\n")
**Repository clean:** FALSE 
cat("**R version:**", paste(R.version$major, R.version$minor, sep = "."), "\n\n")
**R version:** 4.4.2 
cat("**Platform:**", R.version$platform, "\n\n")
**Platform:** aarch64-apple-darwin20 
cat("**Report generated:**", as.character(Sys.time()), "\n\n")
**Report generated:** 2025-10-27 16:42:47.177279 
cat("**Hardware:**", "Apple MacBook Pro M4 Pro, 24GB RAM", "\n\n")
**Hardware:** Apple MacBook Pro M4 Pro, 24GB RAM 
cat("**Operating System:**", "macOS 26.0.1 (25A362)", "\n\n")
**Operating System:** macOS 26.0.1 (25A362) 

Key Package Versions

library(dplyr)
sessioninfo::package_info(c("targets", "tarchetypes", "caret", "limma", "pROC", "ggplot2", "dplyr", "tidyr")) |>
  as.data.frame() |>
  select(package, loadedversion, date, source) |>
  knitr::kable()
package loadedversion date source
backports backports 1.5.0 2024-05-23 CRAN (R 4.4.0)
base64url base64url 1.4 2018-05-14 CRAN (R 4.4.1)
callr callr 3.7.6 2024-03-25 CRAN (R 4.4.0)
caret caret 7.0-1 2024-12-10 CRAN (R 4.4.1)
class class 7.3-23 2025-01-01 CRAN (R 4.4.1)
cli cli 3.6.5 2025-04-23 CRAN (R 4.4.1)
clock clock NA 2025-03-21 CRAN (R 4.4.1)
codetools codetools 0.2-20 2024-03-31 CRAN (R 4.4.2)
cpp11 cpp11 NA 2025-03-03 CRAN (R 4.4.1)
data.table data.table 1.17.8 2025-07-10 CRAN (R 4.4.1)
diagram diagram NA 2020-09-30 CRAN (R 4.4.1)
digest digest 0.6.37 2024-08-19 CRAN (R 4.4.1)
dplyr dplyr 1.1.4 2023-11-17 CRAN (R 4.4.0)
e1071 e1071 NA 2024-09-16 CRAN (R 4.4.1)
evaluate evaluate 1.0.5 2025-08-27 CRAN (R 4.4.1)
farver farver 2.1.2 2024-05-13 CRAN (R 4.4.0)
foreach foreach 1.5.2 2022-02-02 CRAN (R 4.4.0)
fs fs NA 2025-04-12 CRAN (R 4.4.1)
future future 1.67.0 2025-07-29 CRAN (R 4.4.1)
future.apply future.apply 1.20.0 2025-06-06 CRAN (R 4.4.1)
generics generics 0.1.4 2025-05-09 CRAN (R 4.4.1)
ggplot2 ggplot2 4.0.0 2025-09-11 CRAN (R 4.4.1)
globals globals 0.18.0 2025-05-08 CRAN (R 4.4.1)
glue glue 1.8.0 2024-09-30 CRAN (R 4.4.1)
gower gower 1.0.2 2024-12-17 CRAN (R 4.4.1)
gtable gtable 0.3.6 2024-10-25 CRAN (R 4.4.1)
hardhat hardhat 1.4.2 2025-08-20 CRAN (R 4.4.1)
highr highr NA 2024-05-26 CRAN (R 4.4.0)
igraph igraph 2.1.4 2025-01-23 CRAN (R 4.4.1)
ipred ipred 0.9-15 2024-07-18 CRAN (R 4.4.0)
isoband isoband NA 2022-12-20 CRAN (R 4.4.0)
iterators iterators 1.0.14 2022-02-05 CRAN (R 4.4.1)
KernSmooth KernSmooth NA 2025-01-01 CRAN (R 4.4.1)
knitr knitr 1.50 2025-03-16 CRAN (R 4.4.1)
labeling labeling 0.4.3 2023-08-29 CRAN (R 4.4.0)
lattice lattice 0.22-7 2025-04-02 CRAN (R 4.4.1)
lava lava 1.8.1 2025-01-12 CRAN (R 4.4.1)
lifecycle lifecycle 1.0.4 2023-11-07 CRAN (R 4.4.0)
limma limma NA 2025-01-09 Bioconductor 3.20 (R 4.4.2)
listenv listenv 0.9.1 2024-01-29 CRAN (R 4.4.0)
lubridate lubridate 1.9.4 2024-12-08 CRAN (R 4.4.1)
magrittr magrittr 2.0.4 2025-09-12 CRAN (R 4.4.1)
MASS MASS 7.3-65 2025-02-28 CRAN (R 4.4.1)
Matrix Matrix 1.7-3 2025-03-11 CRAN (R 4.4.1)
ModelMetrics ModelMetrics 1.2.2.2 2020-03-17 CRAN (R 4.4.1)
nlme nlme 3.1-168 2025-03-31 CRAN (R 4.4.1)
nnet nnet 7.3-20 2025-01-01 CRAN (R 4.4.1)
numDeriv numDeriv NA 2019-06-06 CRAN (R 4.4.1)
parallelly parallelly 1.45.1 2025-07-24 CRAN (R 4.4.1)
pillar pillar 1.11.1 2025-09-17 CRAN (R 4.4.1)
pkgconfig pkgconfig 2.0.3 2019-09-22 CRAN (R 4.4.0)
plyr plyr 1.8.9 2023-10-02 CRAN (R 4.4.0)
prettyunits prettyunits 1.2.0 2023-09-24 CRAN (R 4.4.0)
pROC pROC 1.19.0.1 2025-07-31 CRAN (R 4.4.1)
processx processx 3.8.6 2025-02-21 CRAN (R 4.4.1)
prodlim prodlim 2025.04.28 2025-04-28 CRAN (R 4.4.1)
progressr progressr NA 2025-09-19 CRAN (R 4.4.1)
proxy proxy NA 2022-06-09 CRAN (R 4.4.1)
ps ps 1.9.1 2025-04-12 CRAN (R 4.4.1)
purrr purrr 1.1.0 2025-07-10 CRAN (R 4.4.1)
R6 R6 2.6.1 2025-02-15 CRAN (R 4.4.1)
RColorBrewer RColorBrewer 1.1-3 2022-04-03 CRAN (R 4.4.0)
Rcpp Rcpp 1.1.0 2025-07-02 CRAN (R 4.4.1)
recipes recipes 1.3.1 2025-05-21 CRAN (R 4.4.1)
reshape2 reshape2 1.4.4 2020-04-09 CRAN (R 4.4.0)
rlang rlang 1.1.6 2025-04-11 CRAN (R 4.4.1)
rpart rpart 4.1.24 2025-01-07 CRAN (R 4.4.1)
S7 S7 0.2.0 2024-11-07 CRAN (R 4.4.1)
scales scales 1.4.0 2025-04-24 CRAN (R 4.4.1)
secretbase secretbase 1.0.5 2025-03-04 CRAN (R 4.4.1)
shape shape NA 2024-02-23 CRAN (R 4.4.1)
sparsevctrs sparsevctrs NA 2025-05-25 CRAN (R 4.4.1)
SQUAREM SQUAREM NA 2021-01-13 CRAN (R 4.4.1)
statmod statmod NA 2025-10-09 CRAN (R 4.4.1)
stringi stringi 1.8.7 2025-03-27 CRAN (R 4.4.1)
stringr stringr 1.5.2 2025-09-08 CRAN (R 4.4.1)
survival survival 3.8-3 2024-12-17 CRAN (R 4.4.1)
tarchetypes tarchetypes NA 2025-09-13 CRAN (R 4.4.1)
targets targets 1.11.4 2025-09-13 CRAN (R 4.4.1)
tibble tibble 3.3.0 2025-06-08 CRAN (R 4.4.1)
tidyr tidyr 1.3.1 2024-01-24 CRAN (R 4.4.1)
tidyselect tidyselect 1.2.1 2024-03-11 CRAN (R 4.4.0)
timechange timechange 0.3.0 2024-01-18 CRAN (R 4.4.0)
timeDate timeDate 4041.110 2024-09-22 CRAN (R 4.4.1)
tzdb tzdb 0.5.0 2025-03-15 CRAN (R 4.4.1)
utf8 utf8 NA 2025-06-08 CRAN (R 4.4.1)
vctrs vctrs 0.6.5 2023-12-01 CRAN (R 4.4.0)
viridisLite viridisLite 0.4.2 2023-05-02 CRAN (R 4.4.0)
withr withr 3.0.2 2024-10-28 CRAN (R 4.4.1)
xfun xfun 0.53 2025-08-19 CRAN (R 4.4.1)
yaml yaml 2.3.10 2024-07-26 CRAN (R 4.4.0)

Complete Session Information

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       macOS 26.0.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2025-10-27
 pandoc   3.6.4 @ /usr/local/bin/ (via rmarkdown)
 quarto   1.8.25 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version    date (UTC) lib source
 P backports      1.5.0      2024-05-23 [?] CRAN (R 4.4.0)
 P base64url      1.4        2018-05-14 [?] CRAN (R 4.4.1)
 P BiocManager    1.30.26    2025-06-05 [?] CRAN (R 4.4.1)
 P callr          3.7.6      2024-03-25 [?] CRAN (R 4.4.0)
 P caret          7.0-1      2024-12-10 [?] CRAN (R 4.4.1)
 P class          7.3-23     2025-01-01 [?] CRAN (R 4.4.1)
 P cli            3.6.5      2025-04-23 [?] CRAN (R 4.4.1)
 P codetools      0.2-20     2024-03-31 [?] CRAN (R 4.4.2)
 P commonmark     2.0.0      2025-07-07 [?] CRAN (R 4.4.1)
 P data.table     1.17.8     2025-07-10 [?] CRAN (R 4.4.1)
 P digest         0.6.37     2024-08-19 [?] CRAN (R 4.4.1)
 P dplyr        * 1.1.4      2023-11-17 [?] CRAN (R 4.4.0)
 P evaluate       1.0.5      2025-08-27 [?] CRAN (R 4.4.1)
 P farver         2.1.2      2024-05-13 [?] CRAN (R 4.4.0)
 P fastmap        1.2.0      2024-05-15 [?] CRAN (R 4.4.0)
 P forcats      * 1.0.1      2025-09-25 [?] CRAN (R 4.4.1)
 P foreach        1.5.2      2022-02-02 [?] CRAN (R 4.4.0)
 P future         1.67.0     2025-07-29 [?] CRAN (R 4.4.1)
 P future.apply   1.20.0     2025-06-06 [?] CRAN (R 4.4.1)
 P generics       0.1.4      2025-05-09 [?] CRAN (R 4.4.1)
 P ggplot2      * 4.0.0      2025-09-11 [?] CRAN (R 4.4.1)
 P ggrepel      * 0.9.6      2024-09-07 [?] CRAN (R 4.4.1)
 P ggtext         0.1.2      2022-09-16 [?] CRAN (R 4.4.0)
 P ggvenn       * 0.1.19     2025-10-08 [?] CRAN (R 4.4.1)
 P globals        0.18.0     2025-05-08 [?] CRAN (R 4.4.1)
 P glue           1.8.0      2024-09-30 [?] CRAN (R 4.4.1)
 P gower          1.0.2      2024-12-17 [?] CRAN (R 4.4.1)
 P gridtext       0.1.5      2022-09-16 [?] CRAN (R 4.4.0)
 P gtable         0.3.6      2024-10-25 [?] CRAN (R 4.4.1)
 P hardhat        1.4.2      2025-08-20 [?] CRAN (R 4.4.1)
 P here         * 1.0.2      2025-09-15 [?] CRAN (R 4.4.1)
 P hms            1.1.3      2023-03-21 [?] CRAN (R 4.4.0)
 P htmltools      0.5.8.1    2024-04-04 [?] CRAN (R 4.4.0)
 P htmlwidgets    1.6.4      2023-12-06 [?] CRAN (R 4.4.0)
 P igraph         2.1.4      2025-01-23 [?] CRAN (R 4.4.1)
 P ipred          0.9-15     2024-07-18 [?] CRAN (R 4.4.0)
 P iterators      1.0.14     2022-02-05 [?] CRAN (R 4.4.1)
 P jsonlite       2.0.0      2025-03-27 [?] CRAN (R 4.4.1)
 P knitr        * 1.50       2025-03-16 [?] CRAN (R 4.4.1)
 P labeling       0.4.3      2023-08-29 [?] CRAN (R 4.4.0)
 P lattice        0.22-7     2025-04-02 [?] CRAN (R 4.4.1)
 P lava           1.8.1      2025-01-12 [?] CRAN (R 4.4.1)
 P lifecycle      1.0.4      2023-11-07 [?] CRAN (R 4.4.0)
 P listenv        0.9.1      2024-01-29 [?] CRAN (R 4.4.0)
 P litedown       0.7        2025-04-08 [?] CRAN (R 4.4.1)
 P lubridate    * 1.9.4      2024-12-08 [?] CRAN (R 4.4.1)
 P magrittr       2.0.4      2025-09-12 [?] CRAN (R 4.4.1)
 P markdown       2.0        2025-03-23 [?] CRAN (R 4.4.1)
 P MASS           7.3-65     2025-02-28 [?] CRAN (R 4.4.1)
 P Matrix         1.7-3      2025-03-11 [?] CRAN (R 4.4.1)
 P mgcv           1.9-3      2025-04-04 [?] CRAN (R 4.4.1)
 P ModelMetrics   1.2.2.2    2020-03-17 [?] CRAN (R 4.4.1)
 P nlme           3.1-168    2025-03-31 [?] CRAN (R 4.4.1)
 P nnet           7.3-20     2025-01-01 [?] CRAN (R 4.4.1)
 P parallelly     1.45.1     2025-07-24 [?] CRAN (R 4.4.1)
 P patchwork    * 1.3.2      2025-08-25 [?] CRAN (R 4.4.1)
 P pillar         1.11.1     2025-09-17 [?] CRAN (R 4.4.1)
 P pkgconfig      2.0.3      2019-09-22 [?] CRAN (R 4.4.0)
 P plyr           1.8.9      2023-10-02 [?] CRAN (R 4.4.0)
 P prettyunits    1.2.0      2023-09-24 [?] CRAN (R 4.4.0)
 P pROC         * 1.19.0.1   2025-07-31 [?] CRAN (R 4.4.1)
 P processx       3.8.6      2025-02-21 [?] CRAN (R 4.4.1)
 P prodlim        2025.04.28 2025-04-28 [?] CRAN (R 4.4.1)
 P ps             1.9.1      2025-04-12 [?] CRAN (R 4.4.1)
 P purrr        * 1.1.0      2025-07-10 [?] CRAN (R 4.4.1)
 P R6             2.6.1      2025-02-15 [?] CRAN (R 4.4.1)
 P ragg           1.5.0      2025-09-02 [?] CRAN (R 4.4.1)
 P RColorBrewer   1.1-3      2022-04-03 [?] CRAN (R 4.4.0)
 P Rcpp           1.1.0      2025-07-02 [?] CRAN (R 4.4.1)
 P readr        * 2.1.5      2024-01-10 [?] CRAN (R 4.4.0)
 P recipes        1.3.1      2025-05-21 [?] CRAN (R 4.4.1)
   renv           1.1.4      2025-03-20 [1] CRAN (R 4.4.1)
 P reshape2       1.4.4      2020-04-09 [?] CRAN (R 4.4.0)
 P rlang          1.1.6      2025-04-11 [?] CRAN (R 4.4.1)
 P rmarkdown      2.30       2025-09-28 [?] CRAN (R 4.4.1)
 P rpart          4.1.24     2025-01-07 [?] CRAN (R 4.4.1)
 P rprojroot      2.1.1      2025-08-26 [?] CRAN (R 4.4.1)
 P rstudioapi     0.17.1     2024-10-22 [?] CRAN (R 4.4.1)
 P S7             0.2.0      2024-11-07 [?] CRAN (R 4.4.1)
 P scales         1.4.0      2025-04-24 [?] CRAN (R 4.4.1)
 P secretbase     1.0.5      2025-03-04 [?] CRAN (R 4.4.1)
 P sessioninfo    1.2.3      2025-02-05 [?] CRAN (R 4.4.1)
 P stringi        1.8.7      2025-03-27 [?] CRAN (R 4.4.1)
 P stringr      * 1.5.2      2025-09-08 [?] CRAN (R 4.4.1)
 P survival       3.8-3      2024-12-17 [?] CRAN (R 4.4.1)
 P systemfonts    1.3.1      2025-10-01 [?] CRAN (R 4.4.1)
 P targets      * 1.11.4     2025-09-13 [?] CRAN (R 4.4.1)
 P textshaping    1.0.4      2025-10-10 [?] CRAN (R 4.4.1)
 P tibble       * 3.3.0      2025-06-08 [?] CRAN (R 4.4.1)
 P tidyr        * 1.3.1      2024-01-24 [?] CRAN (R 4.4.1)
 P tidyselect     1.2.1      2024-03-11 [?] CRAN (R 4.4.0)
 P tidyverse    * 2.0.0      2023-02-22 [?] CRAN (R 4.4.0)
 P timechange     0.3.0      2024-01-18 [?] CRAN (R 4.4.0)
 P timeDate       4041.110   2024-09-22 [?] CRAN (R 4.4.1)
 P tzdb           0.5.0      2025-03-15 [?] CRAN (R 4.4.1)
 P vctrs          0.6.5      2023-12-01 [?] CRAN (R 4.4.0)
 P viridisLite  * 0.4.2      2023-05-02 [?] CRAN (R 4.4.0)
 P visNetwork   * 2.1.4      2025-09-04 [?] CRAN (R 4.4.1)
 P withr          3.0.2      2024-10-28 [?] CRAN (R 4.4.1)
 P xfun           0.53       2025-08-19 [?] CRAN (R 4.4.1)
 P xml2           1.4.0      2025-08-20 [?] CRAN (R 4.4.1)
 P yaml           2.3.10     2024-07-26 [?] CRAN (R 4.4.0)

 [1] /Users/biostochastics/chia_etal_2025_als/renv/library/macos/R-4.4/aarch64-apple-darwin20
 [2] /Users/biostochastics/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

List of Figures

  • Figure 2: Conceptual diagram of confounding structure
  • Figure 3: PCA visualization showing protein expression variance driven by tube type
  • Figure 4: ROC curves for predicting tube type from three protein sets
  • Figure 5: Leave-Country-Out cross-validation ROC curves
  • Figure 6: Performance comparison: Pooled vs Country-stratified vs Within-country cross-validation
  • Figure 7: Distribution of significant proteins by cohort
  • Figure 8: Venn diagram showing overlap of significant proteins
  • Figure 9: Correlation of effect sizes between Italy and US cohorts
  • Figure 10: Geographic consistency analysis
  • Figure 11: Tube type effects in healthy controls
  • Figure 14: Decomposition of observed effects
  • Figure 1: Pipeline dependency graph

List of Tables

  • Table 1: Confounding analysis summary
  • Table 2: Top 10 proteins most predictive of tube type
  • Table 3: Leave-Country-Out cross-validation performance
  • Table 4: Summary of differential expression analysis
  • Table 5: Concordant proteins between cohorts
  • Table 6: Tube type effects in healthy controls
  • Table 7: Summary of effect decomposition

Note: Pipeline Reproducibility

This entire analysis pipeline can be reproduced by running:

# Install dependencies (if needed)
renv::restore()

# Run complete pipeline
targets::tar_make()

# Render report
quarto::quarto_render("reports/confounding_investigation.qmd")

All pipeline targets are cached using the targets package. Only changed dependencies trigger recomputation, ensuring efficient incremental updates.


Biostochastics, LLC | GitHub | © 2025