Confounding by geography and anticoagulant compromises proposed ALS diagnostic model and biomarkers

Re-analysis of Chia et al. (2025)

Author

Sergey A. Kornilov, PhD (Biostochastics, LLC)

Published

October 18, 2025

Note: Author Contact & Data Acknowledgment

We thank the authors for making the underlying data publicly available. We contacted the corresponding authors on 09/10/25 requesting geographic metadata and received no response, necessitating label reconstruction from sample identifiers. We shared a draft of this reanalysis on 10/13/25, prior to formal submission, but received no formal or informal acknowledgment. Recognizing the impact of the federal shutdown on NIH workers, we attempted to reach out to the authors again while the shutdown was in progress, this time via social media. We received no response.

Executive Summary

0.1 Methodological Case Study: Multi-Site Confounding in Proteomics

This investigation serves as a methodological case study demonstrating how confounding naturally arises in multi-site biomarker studies and why standard validation approaches may not detect it. Using a rigorous multi-method framework applied to a published ALS proteomics dataset1, we document perfect site–tube correlation (Cramér’s V=1.0), an 18.6% performance gap between pooled cross-validation and leave-site-out validation (pooled CV AUC: 0.951 → LCV AUC: 0.774), and systematic tube effects across the proteome (35% of proteins show significant effects in healthy controls). These patterns illustrate generalizable lessons about validation strategies, study design, and the importance of testing cross-site generalization.

This case study examines confounding patterns using a multi-method validation framework:

Critical Finding #1: Proteins Predict Tube Type

Reverse Prediction AUC: 0.999

Proteins can predict plasma collection tube type with near-perfect accuracy, indicating tube type signal dominates biological signal.

Critical Finding #2: Geographic Generalization Failure

Performance Gap: 0.176 (18.6% drop)

  • Pooled CV AUC: 0.951
  • Leave-Country-Out AUC: 0.774

Models trained on one geographic cohort fail to generalize to another.

Critical Finding #3: Differential Analysis Confounding

Only 26.9% of pooled proteins replicate in both strata

  • Italy significant: 37
  • US significant: 379
  • Pooled significant: 78
  • Both strata: 23

Critical Finding #4: UK Biobank External Validation

Reported 99.1% AUC does not address multi-class, multi-matrix generalization

The UK Biobank validation (13 ALS vs 23,601 healthy controls) likely used EDTA plasma exclusively (standard UKB protocol), presenting a binary ALS-vs-healthy task on a single matrix, not the multi-class (ALS vs ONDs vs healthy), multi-matrix challenge the study claims to address. Under the confounding we document, the model learns ‘EDTA = not ALS’—making an EDTA-only healthy control set trivially separable even if biological signal is weak. Extreme class imbalance inflates aggregate metrics (AUC, specificity) while masking sensitivity failures. The UKB result cannot validate discrimination against ONDs or robustness across matrices (EDTA, heparin, serum), and does not mitigate the tube/site confounding driving apparent performance in the primary analysis.

0.2 Methodological Lessons Learned

This case study illustrates a systematic challenge in multi-site proteomics where confounding emerges naturally from logistical constraints:

  1. Natural confounding: Different sites often use different protocols (plasma collection tube type: HEPARIN vs EDTA)
  2. Diagnostic distribution: Patient populations vary by site (100% of neurological controls from US/EDTA site)
  3. Validation limitations: Standard pooled cross-validation maintains these correlations in train/test splits

Key insight: The study design and analytical approach follow field-standard practices, yet produce results where technical and biological signals are inseparable. This demonstrates why prospective validation strategies (leave-site-out CV, reverse prediction tests, balanced designs) are essential for multi-site studies.

Positive findings: Despite confounding, we identify 22 geographically consistent protein candidates and confirm NEFL as a robust ALS biomarker, providing a foundation for future tube-balanced validation studies.


1 Introduction

1.1 Multi-Site Confounding: A Methodological Case Study

Chia et al. (2025)1 reported a plasma proteomics biomarker panel for ALS diagnosis with 98% accuracy. Their machine learning model identified plasma collection tube type (HEPARIN vs. EDTA) as the 2nd most important feature out of 2,869 total features—an observation that prompted this methodological investigation.

This case study demonstrates how confounding naturally arises in multi-site proteomics studies: We document that tube type is perfectly correlated with geographic origin (Italy/HEPARIN vs. US/EDTA; Cramér’s V=1.0) and diagnostic distribution (100% of neurological controls from US/EDTA site, 85% of ALS cases from Italy/HEPARIN; Table 1). This confounding structure—which emerges organically from site-specific protocols and patient populations—creates a methodological challenge: biological disease signals become entangled with technical and geographic factors.

Purpose of this investigation: Rather than simply identifying problems, we demonstrate: (1) how confounding arises in real-world multi-site studies; (2) why standard validation approaches miss it; and (3) what practical solutions exist through stratified analysis, cross-site validation, and prospective study design improvements.

1.2 Anticoagulant Biochemistry and Pre-Analytical Confounding

Anticoagulant choice—EDTA or HEPARIN—fundamentally alters measured proteins. EDTA chelates calcium, preventing clotting but triggering cellular protein release. HEPARIN inhibits specific clotting enzymes and preserves different protein structures. The same patient’s blood yields different protein profiles by tube type—a pre-analytical artifact, not biological signal. When tube type is perfectly confounded with diagnostic group—100% of neurological controls in EDTA—disease-associated protein changes become non-identifiable from anticoagulant-induced technical artifacts.

Table 1: Perfect Confounding: Diagnosis × Tube Type Distribution
Diagnosis EDTA HEPARIN Total % HEPARIN
ALS 40 223 263 84.8%
Healthy_control 67 180 247 72.9%
Neurological_control 194 0 194 0.0%

We conducted an independent investigation with three objectives: (1) quantify confounding severity using reverse prediction and association metrics, (2) evaluate ML model robustness through leave-country-out cross-validation, and (3) identify geographically consistent biomarkers via geographic stratification.

We prespecified three hypotheses to test confounding and its clinical impact:

  • H1 (Reverse Prediction Test): Protein profiles would strongly predict anticoagulant tube type (null hypothesis: AUC = 0.5; critical threshold: AUC > 0.90 indicates systematic confounding).

  • H2 (Geographic Generalization): Models would fail geographic generalization under leave-country-out cross-validation (null hypothesis: no performance drop; critical threshold: >20% AUC reduction indicates geographic overfitting).

  • H3 (Cross-Stratum Replication): Most pooled ‘significant’ proteins would not replicate across both geographic strata (null hypothesis: ≥70% replication; severe confounding indicated if <30% replicate).

These criteria distinguish minor technical variability from systematic confounding preventing clinical deployment.

Note Box 1: Understanding Confounding and Validation

For clinical readers unfamiliar with machine learning validation concepts:

  • What is confounding? A non-biological factor (e.g., collection site or tube type) varying systematically with diagnosis becomes a “shortcut” the model learns. Performance can appear excellent while detecting workflow logistics, not disease biology.

  • Pooled cross-validation vs. leave-site-out (LSO) validation: Pooled CV mixes data from all sites in training/test splits, allowing site and tube signals to “leak” across partitions. LSO withholds entire sites during training to test clinical generalizability.

  • Why tube type as a predictor is problematic: Different anticoagulants alter measured protein levels through distinct biochemical mechanisms. When tube type correlates with diagnosis (100% of neurological controls used EDTA), the model learns “EDTA = not ALS” rather than biological ALS signatures, producing high pooled CV AUC that fails when tube types differ.

  • What robust validation requires: Leave-site-out cross-validation, balanced representation across tube types and sites, inclusion of ONDs as comparators, verification of stability across matrices, and uncertainty quantification testing whether performance depends on logistics shortcuts.
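
The tube-type shortcut can be quantified directly from Table 1 without measuring a single protein. A hypothetical rule that labels every HEPARIN sample "ALS" (an illustrative Python sketch; the study's models are Random Forests in R) already looks deceptively good:

```python
# Diagnosis x tube-type counts from Table 1
counts = {
    "ALS": {"EDTA": 40, "HEPARIN": 223},
    "Healthy_control": {"EDTA": 67, "HEPARIN": 180},
    "Neurological_control": {"EDTA": 194, "HEPARIN": 0},
}

# Tube-only rule: predict ALS iff the sample is HEPARIN (no proteins used)
sens = counts["ALS"]["HEPARIN"] / sum(counts["ALS"].values())
spec_nc = counts["Neurological_control"]["EDTA"] / sum(counts["Neurological_control"].values())
print(f"ALS sensitivity of tube-only rule: {sens:.1%}")            # 84.8%
print(f"Specificity among neurological controls: {spec_nc:.1%}")   # 100.0%
```

Because every neurological control is EDTA, the logistics-only rule never misclassifies a neurological control, which is exactly the shortcut a pooled model can exploit.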

For readers interested in statistical frameworks and computational details, the Technical Appendix (Section 8) provides comprehensive documentation of methods, software versions, and reproducibility protocols.


2 Methods

2.1 Overview of Reanalysis Approach

We evaluated confounding bias from plasma collection tube type, used as a predictive feature in the original model. Our reanalysis employed geographic stratification and leave-country-out cross-validation to assess model generalizability and identify geographically consistent biomarker candidates.

All analyses were implemented in R (version 4.4.2) using a reproducible computational pipeline (targets framework) with version-controlled dependencies (renv).

2.2 Computational Pipeline Architecture

library(targets)
library(visNetwork)

# Determine project root - handle both manual and tar_quarto() rendering
project_root <- if (file.exists("../_targets.R")) {
  normalizePath("..")
} else if (file.exists("_targets.R")) {
  getwd()
} else {
  "/Users/biostochastics/chia_etal_2025_als"  # fallback to absolute path
}

# Change to project root to find _targets.R
withr::with_dir(project_root, {
  # Create network with custom styling
  net <- tar_visnetwork(
    targets_only = TRUE,
    label = c("time", "size"),
    level_separation = 150
  )

  # Apply dark theme with bright viridis colors
  net %>%
    visOptions(
      highlightNearest = list(enabled = TRUE, degree = 1),
      nodesIdSelection = TRUE
    ) %>%
    visInteraction(
      navigationButtons = TRUE,
      hover = TRUE
    ) %>%
    visLayout(randomSeed = 42) %>%
    visNodes(
      font = list(color = "white", size = 16, face = "arial", bold = TRUE),
      borderWidth = 2,
      shadow = list(enabled = TRUE, color = "#000000", size = 5, x = 2, y = 2)
    ) %>%
    visEdges(
      color = list(color = "#666666", highlight = "#FDE724", hover = "#FDE724"),
      smooth = list(enabled = TRUE, type = "cubicBezier", roundness = 0.5),
      arrows = list(to = list(enabled = TRUE, scaleFactor = 0.5))
    ) %>%
    visPhysics(
      stabilization = list(iterations = 200),
      barnesHut = list(gravitationalConstant = -8000, springConstant = 0.001, springLength = 200)
    )
})
Figure 1: Complete computational dependency graph. Each node represents a target (data, model, or visualization); edges show dependencies. Interactive: hover to explore, click to select.

The dependency graph shows the complete computational workflow. Nodes represent computational targets (data, models, visualizations); directed edges indicate dependencies. The targets framework ensures:

  1. Reproducibility: All analyses run in the same order with identical inputs
  2. Efficiency: Only outdated targets rebuild when code changes
  3. Transparency: Full dependency structure is explicit and auditable
  4. Version control: All code and parameters are tracked in Git

This architecture enables complete reproduction via a single targets::tar_make() command. See Appendix for execution details.

2.3 Data Source and Study Design

2.3.1 Original Dataset

Dataset: Chia et al. (2025) OLINK plasma proteomics data

  • Total measurements: 2,031,559 protein-sample combinations
  • Samples: 704 unique samples
  • Proteins: ~2868 measurements per sample (OLINK panels)
  • Diagnoses: ALS (263), Healthy controls (247), Neurological controls (194)
  • Clinical variables: Age at collection, sex, diagnosis, plasma collection tube type (EDTA vs. HEPARIN), OLINK plate identifiers (Plates 1-9)

2.3.2 Geographic Origin Reconstruction

The original dataset lacked explicit country labels. We reconstructed geographic origin (Italy vs. US) from converging evidence:

  1. Plate identifier patterns: Jupyter notebook metadata indicated Plates 5-9 were processed at TRAYNOR (Italian Scholz ALS registry), Plates 1-4 at NDRU (US cohorts from NIH, Johns Hopkins, Baltimore Longitudinal Study of Aging).

  2. Tube type consistency: All HEPARIN samples corresponded to Plates 5-9, all EDTA to Plates 1-4, demonstrating perfect correlation (Cramér’s V ≈ 1.0) between tube type and inferred geographic origin.

  3. Diagnostic distribution validation: Perfect confounding of neurological controls with EDTA tubes (100% EDTA/US) supported the Italy/US geographic distinction.

Country label assignment:

  • Italy: Plates 5-9 AND HEPARIN tube type (n=403 samples)
  • United States: Plates 1-4 AND EDTA tube type (n=301 samples)
  • Unknown: Inconsistent plate-tube combinations (n=0, indicating 100% consistency)
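The reconstruction rule can be stated as a small function (an illustrative Python sketch of the plate/tube logic above; the actual pipeline is implemented in R):

```python
def assign_country(plate, tube):
    """Label-reconstruction rule from Section 2.3.2: plate batch + tube type."""
    if plate in {5, 6, 7, 8, 9} and tube == "HEPARIN":
        return "Italy"
    if plate in {1, 2, 3, 4} and tube == "EDTA":
        return "United States"
    return "Unknown"  # inconsistent plate-tube combination

print(assign_country(5, "HEPARIN"))  # Italy
print(assign_country(2, "EDTA"))     # United States
```

That zero samples fall into the "Unknown" branch is itself evidence of the perfect plate-tube correlation.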

Figure 2: Perfect confounding structure: Diagnosis × Tube Type × Country

2.4 Confounding Structure Quantification

Figure 3: PCA visualization showing protein expression variance driven by tube type rather than diagnosis

Principal Component Analysis reveals that tube type explains far more variance (PC1: 17.3%) than biological disease group. Tube type shows clear separation along PC1, whereas diagnosis shows minimal separation, demonstrating that technical factors dominate the variance structure.

2.4.1 Statistical Framework

We quantified confounding between tube type, geographic origin, and diagnosis using:

  1. Cramér’s V: Association strength between categorical variables (range: 0 = independent, 1 = perfect association). We calculated Cramér’s V with 95% CIs for all pairwise combinations of diagnosis, tube type, country, sex, and plate number.

  2. Chi-square tests: Evaluated independence between diagnostic group and tube type/country distributions.

  3. Contingency tables: Sample distributions across diagnosis × tube type, diagnosis × country, and tube type × country.

  4. Stratified analysis: Identified groups with perfect confounding (e.g., neurological controls = 100% EDTA).

Confounding Severity Thresholds:

We interpreted Cramér’s V using Cohen’s2 effect size conventions for 2×3 contingency tables (df=2): <0.07 (negligible), 0.07-0.21 (small), 0.21-0.35 (medium), ≥0.35 (large). V > 0.9 indicates near-perfect confounding and mathematical non-identifiability.
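Cramér’s V is computed from the chi-square statistic as V = √(χ² / (N · min(r−1, c−1))). A self-contained sketch (Python for illustration; the reanalysis uses DescTools in R) reproduces the reported diagnosis-tube association from the Table 1 counts:

```python
import math

def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of count rows."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    r, c = len(table), len(table[0])
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Rows: ALS, Healthy_control, Neurological_control; columns: EDTA, HEPARIN (Table 1)
print(round(cramers_v([[40, 223], [67, 180], [194, 0]]), 3))  # 0.721
```

This matches the Tube Type ↔ Diagnosis value of 0.721 reported in the Results.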

2.5 Machine Learning Validation Framework

2.5.1 Reverse Prediction Test (Diagnostic for Technical Confounding)

Objective: Can proteins predict tube type?

We trained a Random Forest classifier to predict tube type (HEPARIN vs. EDTA) from protein expression alone. This “reverse prediction test” diagnoses technical confounding: AUC > 0.9 indicates pre-analytical variation overwhelms biological signal.

Implementation:

  • Model: Random Forest with 500 trees (ranger package)
  • Features: All 2,868 proteins without pre-filtering (metadata excluded)
  • Outcome: Binary classification (EDTA vs. HEPARIN)
  • Cross-validation: 5-fold CV with stratified sampling
  • Missing values: Pre-imputed separately by country using missForest (0.01% overall missingness)
  • Hyperparameters: Grid search over mtry (√p, p/3) and min.node.size (5, 10)
  • Performance metrics: AUC-ROC, sensitivity, specificity
  • Feature importance: Permutation-based variable importance scaled to 0-100

Interpretation Thresholds3:

  • AUC < 0.7: Minimal concern (poor discrimination)
  • AUC 0.7-0.8: Moderate concern (acceptable discrimination indicates detectable tube effects)
  • AUC 0.8-0.9: Severe concern (excellent discrimination indicates major confounding)
  • AUC > 0.9: Critical concern (outstanding discrimination—tube effects dominate biological signal)
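These thresholds are stated in terms of AUC, which equals the probability that a randomly chosen HEPARIN sample receives a higher score than a randomly chosen EDTA sample (ties counted half). A minimal rank-based sketch (illustrative Python; the reanalysis itself uses pROC in R):

```python
def auc_roc(labels, scores):
    """AUC via the Mann-Whitney formulation: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0; uninformative scores give 0.5
print(auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
print(auc_roc([1, 0], [0.5, 0.5]))                   # 0.5
```

An AUC of 0.999 for predicting tube type therefore means an essentially deterministic mapping from proteins to anticoagulant.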

2.5.2 Leave-Country-Out Cross-Validation (Primary Generalizability Test)

Objective: Does the model generalize geographically?

Leave-country-out cross-validation (LCV) evaluates whether models trained in one geographic/tube context generalize to another, directly testing clinical deployment scenarios where tube types vary by institution.

Implementation:

Direction 1: Train on Italy (HEPARIN) → Test on US (EDTA)

  1. Partition data by country (Italy training: n=403, US test: n=301)
  2. Train Random Forest on Italian samples using 5-fold internal CV for hyperparameter tuning
  3. Apply trained model to held-out US samples
  4. Evaluate test performance (AUC, accuracy, sensitivity, specificity)

Direction 2: Train on US (EDTA) → Test on Italy (HEPARIN)

Reversed procedure with same preprocessing and evaluation.

Critical Data Leakage Prevention:

  • Protein filtering, scaling, and hyperparameter tuning performed only on training cohort
  • Test cohort held completely separate until final prediction step
  • No information from test samples influenced model training
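The leave-country-out loop can be sketched schematically (illustrative Python; `train_fn` and `eval_fn` are hypothetical placeholders standing in for the Random Forest training and AUC evaluation steps):

```python
def leave_country_out(samples, train_fn, eval_fn):
    """Train on all samples except one country's; evaluate on the fully held-out country."""
    results = {}
    for held_out in sorted({s["country"] for s in samples}):
        train = [s for s in samples if s["country"] != held_out]
        test = [s for s in samples if s["country"] == held_out]
        model = train_fn(train)  # all filtering, scaling, tuning must live inside train_fn
        results[held_out] = eval_fn(model, test)
    return results

# Toy demo: the "model" is just the training-set size; "evaluation" records fold sizes
samples = [{"country": "Italy"}] * 3 + [{"country": "US"}] * 2
res = leave_country_out(samples, train_fn=len, eval_fn=lambda model, test: (model, len(test)))
print(res)  # {'Italy': (2, 3), 'US': (3, 2)}
```

Keeping every preprocessing step inside `train_fn` is what prevents the leakage described above.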

Pooled Cross-Validation (Baseline Comparison):

We performed standard 5-fold CV on pooled Italy + US data to replicate the original study’s approach. Pooled CV mixes countries across folds, maintaining tube type distribution but not testing geographic generalization.

Performance Gap Thresholds:

Field conventions for cross-site generalization in clinical ML models4,5: <5% drop (acceptable), 5-10% (moderate concern), 10-20% (substantial concern), >20% (critical concern). Acceptable degradation depends on clinical context and baseline accuracy.
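Applying these conventions to the reported AUCs (a sketch in Python; the cutoffs are the field conventions quoted above, and the observed 18.6% relative drop sits at the top of the "substantial concern" band):

```python
def gap_severity(pooled_auc, external_auc):
    """Relative AUC degradation, classified by the cross-site conventions above."""
    drop = (pooled_auc - external_auc) / pooled_auc
    if drop < 0.05:
        label = "acceptable"
    elif drop < 0.10:
        label = "moderate concern"
    elif drop < 0.20:
        label = "substantial concern"
    else:
        label = "critical concern"
    return drop, label

drop, label = gap_severity(0.951, 0.774)  # pooled CV vs. leave-country-out AUCs
print(f"{drop:.1%} -> {label}")  # 18.6% -> substantial concern
```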

2.6 Stratified Differential Protein Expression Analysis

Objective: Identify geographically consistent biomarkers

We performed differential expression analysis separately within each geographic cohort to identify proteins with geographically consistent ALS signals, eliminating tube type and country confounding.

2.6.1 Statistical Framework

We used limma6, adjusting for age and sex to control demographic confounders.

Model specification:

NPX ~ Diagnosis + Age_Collection + Sex

Where NPX (Normalized Protein eXpression, Olink’s standardized log₂ scale) represents protein abundance, and Diagnosis contrasts ALS vs. controls.

2.6.2 Italy Cohort Analysis (HEPARIN-only)

Sample composition: 223 ALS patients vs. 180 healthy controls (all Italian samples use HEPARIN tubes)

Rationale: The Italian cohort lacked neurological controls, providing clean ALS vs. healthy comparison without OND confounding.

Preprocessing:

  1. Convert long-format data to protein matrix (proteins × samples)
  2. Remove proteins with >20% missing values
  3. Fit linear model with empirical Bayes moderation (limma::lmFit + limma::eBayes)
  4. Multiple testing correction using Benjamini-Hochberg false discovery rate (FDR)

Significance thresholds: FDR < 0.05 and |log₂ fold-change| > 0.5

2.6.3 US Cohort Analysis (EDTA-only)

Sample composition: 40 ALS patients vs. 261 controls (67 healthy + 194 neurological disease controls; all US samples use EDTA tubes)

Note on power: Small ALS sample size (n=40) limits power, increasing false negative risk but not false positives. Conservative thresholds mitigate false discovery.

Analysis approach: Binary comparison (ALS vs. all controls) with age and sex adjustment. Same preprocessing and significance criteria as Italy analysis.

2.6.4 Pooled Analysis (Confounded Baseline)

For comparison, we replicated the original study’s pooled approach by combining all samples without adjusting for tube type or country. This analysis serves as a baseline to quantify the impact of confounding on differential expression results.

Note: Pooled results are confounded by geographic/tube effects and should not be interpreted as biologically valid.

2.7 Meta-Analysis and Concordance Testing

2.7.1 Geographically Consistent Biomarker Identification

Proteins were classified as “geographically consistent” candidates if they met all of the following criteria:

  1. Statistically significant in Italy: FDR < 0.05 and log fold-change (logFC, the log₂ ratio of mean protein levels between groups) |logFC| > 0.5
  2. Statistically significant in US: FDR < 0.05 and |logFC| > 0.5
  3. Directionally concordant: Same sign of log₂ fold-change in both cohorts

This conservative approach ensures that identified proteins show consistent ALS-associated changes independent of tube type and geographic origin.

2.7.2 Effect Size Concordance

We evaluated cross-cohort consistency using:

  1. Pearson correlation of log₂ fold-changes between Italy and US cohorts (across all proteins)
  2. Combined p-values using Fisher’s method: χ² = -2(ln P_Italy + ln P_US), df=4
  3. Beta-beta plots: Scatterplots of Italy logFC vs. US logFC for visualizing concordance and identifying tube-specific outliers
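
For two p-values, Fisher's statistic follows a chi-square distribution with df = 4, whose survival function has the closed form e^(−x/2)(1 + x/2). A minimal sketch (illustrative Python):

```python
import math

def fisher_combined(p_italy, p_us):
    """Fisher's method for two p-values: chi-square statistic with df = 4."""
    stat = -2.0 * (math.log(p_italy) + math.log(p_us))
    # Chi-square survival function for df = 4: P(X > x) = exp(-x/2) * (1 + x/2)
    p_combined = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)
    return stat, p_combined

stat, p = fisher_combined(0.05, 0.05)
print(round(stat, 2), round(p, 3))  # 11.98 0.017
```

Note that two marginally significant cohort p-values combine to a clearly significant result only when both carry independent evidence.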

2.7.3 Stratified vs. Pooled Comparison

To quantify confounding-driven false discoveries, we calculated:

  • Overlap statistics: Number of proteins significant in pooled but not in either individual cohort
  • Replication rate: Percentage of pooled significant proteins that replicate in both Italy and US
  • Critical threshold: <30% replication indicates severe confounding. This heuristic is based on genomics/proteomics field observations that technical artifacts typically show poor cross-cohort replication (<30-40%), while robust biological signals typically replicate at >50-60% rates7,8

2.8 Exact Replication of Original Study Methods

To ensure fair comparison, we explicitly replicated the original study’s1 analysis pipeline:

80/20 Discovery-Replication Split:

  • Randomly split pooled data into 80% discovery (n≈563) and 20% replication (n≈141)
  • Train Random Forest on discovery set including tube type as predictor
  • Evaluate on held-out replication set
  • Report feature importance rank for tube type variable

Differential protein selection:

  • Select proteins significant in pooled analysis (FDR < 0.05, |logFC| > 0.6, matching original thresholds)
  • Use these proteins as features for ML model
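The 80/20 partition can be sketched as follows (illustrative Python; the seed shown is hypothetical, not the pipeline's):

```python
import random

def discovery_replication_split(sample_ids, frac=0.8, seed=42):
    """Shuffle sample IDs and split into discovery/replication sets (seed is illustrative)."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * frac)
    return ids[:cut], ids[cut:]

disc, rep = discovery_replication_split(range(704))
print(len(disc), len(rep))  # 563 141
```

Splitting the 704 samples reproduces the n≈563/141 set sizes quoted above; note that a random pooled split preserves the tube-diagnosis correlation in both halves, which is precisely why it cannot detect the confounding.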

2.9 Reproducibility and Software

2.9.1 Computational Environment

  • R version: 4.4.2 (2024-10-31)
  • Operating system: macOS 15.0
  • Dependency management: renv (version 1.0.11) for exact package version locking

2.9.2 Key R Packages

  • tidyverse 2.0.0: Data manipulation and visualization
  • data.table 1.16.4: Fast data loading and processing
  • targets 1.11.4: Reproducible pipeline orchestration
  • caret 7.0-1: Machine learning framework
  • ranger 0.17.0: Random Forest implementation
  • limma 3.60.0: Differential expression analysis
  • pROC 1.18.5: ROC curve analysis
  • DescTools: Cramér’s V calculation

2.9.3 Reproducibility Protocol

  1. Random seeds: All stochastic operations used seed = 46803468976 for reproducibility
  2. Pipeline framework: targets ensures dependency tracking and caching
  3. Version control: All R packages locked via renv.lock
  4. Execution: Full pipeline can be reproduced via targets::tar_make()

2.10 Statistical Considerations

2.10.1 Multiple Testing Correction

We applied Benjamini-Hochberg FDR correction separately within each analysis:

  • Differential expression: FDR control within each cohort (Italy, US, pooled)
  • Meta-analysis: Combined p-values using Fisher’s method to account for testing in two cohorts
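The Benjamini-Hochberg step-up procedure itself is short enough to sketch (illustrative Python; the R pipeline relies on limma's built-in BH adjustment):

```python
def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure, monotonicity enforced)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print([round(q, 3) for q in bh_fdr([0.01, 0.04, 0.03, 0.005])])  # [0.02, 0.04, 0.04, 0.02]
```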

2.10.2 Missing Data Handling

All analyses: Missing values (0.01% overall) were imputed once at data loading using missForest, applied separately for Italy and US cohorts to prevent cross-country information leakage. This approach:

  • Uses random forest-based imputation (missForest::missForest()) well-suited for high-dimensional proteomics data
  • Applied independently to Italy (HEPARIN) and US (EDTA) subsets before any downstream analysis
  • Prevents cross-country information leakage during imputation (Italy samples never inform US imputation and vice versa)
  • Extremely low missingness (0.01% overall, max 0.14% per protein) makes imputation straightforward
  • Imputed data stored in pipeline target data_imputed and used for all subsequent analyses

Differential expression: Proteins with >20% missing values would be excluded prior to limma analysis (standard practice). However, no proteins exceeded this threshold in the current dataset.

Rationale: Pre-imputation with geographic stratification maintains methodological rigor by ensuring Italy and US cohorts remain informationally separate throughout the entire analysis pipeline, consistent with the study’s focus on geographic confounding.

2.10.3 Sample Size and Power

We acknowledge limited statistical power in the US ALS cohort (n=40), which may result in false negatives (missing true ALS signals) but maintains Type I error control (FDR < 0.05).

2.11 Limitations of Reanalysis

We acknowledge the following limitations:

  1. Confounding non-identifiability: Perfect correlation between tube type, country, batch, and storage protocol means we cannot definitively attribute effects to any single factor. Our conclusions apply to the aggregate of these technical artifacts.

  2. No external validation: We reanalyzed existing data without access to independent validation cohorts. Prospective tube-balanced studies are required to confirm geographically consistent biomarkers.

  3. Limited US ALS sample size: Power constraints in the US cohort (n=40 ALS) may cause us to underestimate the number of true geographically consistent proteins (increased false negatives, but Type I error remains controlled).


Key terms used in this investigation:

  • AUC (Area Under ROC Curve): Measure of classifier performance ranging from 0.5 (random) to 1.0 (perfect). Values >0.9 indicate excellent discrimination.

  • Cramér’s V: Measure of association strength between categorical variables. Range: 0 (independent) to 1 (perfect association). Cohen (1988) effect sizes for df=2: 0.07 (small), 0.21 (medium), 0.35 (large).

  • Cross-validation (CV): Method for evaluating model performance by splitting data into training and test sets multiple times.

  • FDR (False Discovery Rate): Multiple testing correction that controls the expected proportion of false positives among rejected hypotheses. FDR <0.05 means <5% of “significant” results are expected to be false discoveries.

  • Leave-Country-Out CV: Rigorous validation where models trained on one geographic cohort are tested on a completely held-out cohort. Tests true generalizability.

  • logFC (log Fold-Change): Difference in NPX values between groups. Since NPX is already on a log₂ scale, this difference directly represents log₂(fold-change). |logFC| > 0.5 represents >40% change; |logFC| > 1.0 represents doubling/halving.

  • Sensitivity: True positive rate; proportion of actual cases correctly identified.

  • Specificity: True negative rate; proportion of actual controls correctly identified.

  • Stratification: Analyzing data separately within homogeneous subgroups to eliminate confounding.
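
The logFC conversions quoted in the glossary can be checked numerically (illustrative Python): since NPX is log₂-scale, a group difference of 0.5 corresponds to a 2^0.5 ≈ 1.41-fold (about 41%) change.

```python
# NPX is log2-scale, so a group difference (logFC) converts to a linear fold-change as 2**logFC
for lfc in (0.5, 1.0, -1.0):
    print(f"logFC {lfc:+.1f} -> fold-change {2 ** lfc:.3f}")
# logFC +0.5 -> fold-change 1.414
# logFC +1.0 -> fold-change 2.000
# logFC -1.0 -> fold-change 0.500
```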


3 Results

Our multi-method investigation reveals severe confounding that compromises the reported ALS diagnostic model. Three analyses demonstrate that tube type and geographic origin dominate: (1) perfect site-tube correlation (Cramér’s V=1.0); (2) leave-country-out validation showing 18.6% AUC inflation and a sensitivity collapse to 1.8%; and (3) only 26.9% of ‘significant’ proteins replicating across strata. Even the 22 geographically consistent candidates show substantial residual confounding, with 50% exhibiting significant tube effects in healthy controls.

3.1 Confounding Quantification

3.1.1 Statistical Measures

Association Strengths (Cramér’s V):

  • Tube Type ↔︎ Diagnosis: 0.721 (strong)
  • Country ↔︎ Diagnosis: 0.721 (strong)
  • Tube Type ↔︎ Country: 0.997 (perfect)

Note: Cramér’s V ranges from 0 (no association) to 1 (perfect association). Cohen’s2 conventions for df=2: ≥0.21 medium, ≥0.35 large.

3.2 Reverse Prediction Test: Strong Evidence for Technical Confounding

Proteins achieve near-perfect tube type prediction accuracy (AUC ≥0.999), demonstrating technical artifacts dominate biological signal. Even the 22 ‘geographically consistent’ biomarker candidates retain strong tube-type signatures (AUC=0.916). When technical metadata can be reconstructed with this accuracy from biological features, those features are contaminated and cannot reliably represent disease biology.

library(ggplot2)
library(pROC)
library(dplyr)

# Create labels
full_label <- sprintf("Full Model (AUC = %.3f)", reverse_prediction_results$auc)
top_label <- sprintf("Top 10 Proteins (AUC = %.3f)", top_proteins_tube_prediction$auc)
biomarker_label <- sprintf("Original Biomarkers (AUC = %.3f)", biomarker_tube_prediction$auc)
concordant_label <- sprintf("Concordant Proteins (AUC = %.3f)", concordant_tube_prediction$auc)

# Extract predictions and calculate ROC curves
# Full model
full_preds <- reverse_prediction_results$model$pred %>%
  dplyr::filter(
    mtry == reverse_prediction_results$model$bestTune$mtry,
    min.node.size == reverse_prediction_results$model$bestTune$min.node.size
  )
roc_full <- pROC::roc(
  response = full_preds$obs,
  predictor = full_preds$HEPARIN,
  levels = c("EDTA", "HEPARIN"),
  direction = "<"
)

# Top 10 proteins
top_preds <- top_proteins_tube_prediction$predictions
roc_top <- pROC::roc(
  response = top_preds$obs,
  predictor = top_preds$HEPARIN,
  levels = c("EDTA", "HEPARIN"),
  direction = "<"
)

# Original biomarkers
biomarker_preds <- biomarker_tube_prediction$predictions
roc_biomarker <- pROC::roc(
  response = biomarker_preds$obs,
  predictor = biomarker_preds$HEPARIN,
  levels = c("EDTA", "HEPARIN"),
  direction = "<"
)

# Concordant proteins
concordant_preds <- concordant_tube_prediction$predictions
roc_concordant <- pROC::roc(
  response = concordant_preds$obs,
  predictor = concordant_preds$HEPARIN,
  levels = c("EDTA", "HEPARIN"),
  direction = "<"
)

# Combine ROC data
roc_combined <- rbind(
  data.frame(
    fpr = 1 - roc_full$specificities,
    tpr = roc_full$sensitivities,
    model = full_label
  ),
  data.frame(
    fpr = 1 - roc_top$specificities,
    tpr = roc_top$sensitivities,
    model = top_label
  ),
  data.frame(
    fpr = 1 - roc_biomarker$specificities,
    tpr = roc_biomarker$sensitivities,
    model = biomarker_label
  ),
  data.frame(
    fpr = 1 - roc_concordant$specificities,
    tpr = roc_concordant$sensitivities,
    model = concordant_label
  )
)

# Create color mapping with 4 distinct colors
color_values <- c("#FDE724", "#1F9E89", "#B5DE2B", "#31688E")
names(color_values) <- c(full_label, top_label, biomarker_label, concordant_label)

# Create plot
ggplot(roc_combined, aes(x = fpr, y = tpr, color = model)) +
  geom_line(linewidth = 2.5, alpha = 0.9) +
  geom_abline(
    slope = 1, intercept = 0, linetype = "dashed",
    color = "#71717A", linewidth = 1
  ) +
  scale_color_manual(
    name = NULL,
    values = color_values
  ) +
  labs(
    title = "Reverse Prediction Test: Predicting Tube Type from Proteins",
    subtitle = "AUC = 0.999 - Near-perfect discrimination indicates tube effects dominate",
    x = "False Positive Rate (1 - Specificity)",
    y = "True Positive Rate (Sensitivity)"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.background = element_rect(fill = "#151520", colour = NA),
    legend.key = element_rect(fill = "#151520", colour = NA),
    legend.position = c(0.65, 0.25)
  ) +
  coord_equal()
Figure 4: ROC curves for predicting tube type from three protein sets

Results:

  • Full Model (all proteins): AUC = 0.999 (99.7% sensitivity, 94.8% specificity)
  • Top 10 Proteins: AUC = 1.000
  • Original Biomarkers (Chia et al. 17-protein panel): AUC = 0.860 (n=11 proteins)
  • Concordant Proteins (significant in both cohorts): AUC = 0.908 (n=22 proteins)

The reverse prediction test produces a compelling result with important implications for interpreting the published biomarker claims. By training classifiers to predict plasma collection tube type (HEPARIN vs. EDTA) from protein expression profiles alone—excluding all clinical metadata—we demonstrate that tube type effects are pervasive across multiple protein sets. The full model achieves near-perfect discrimination (AUC = 0.999), while even the top 10 most tube-discriminative proteins achieve exceptional performance (AUC = 1.000).

Critically, both biomarker panels show substantial tube-type confounding: the original study’s 17-protein panel achieves AUC = 0.860 for predicting tube type, while our concordant proteins (those significant in both Italy and US with concordant direction, n=22) achieve AUC = 0.908. These findings indicate that even proteins that successfully replicate across independent cohorts remain significantly confounded by technical collection factors.

These findings are particularly striking because the protein measurements are more informative about the technical collection method than they are about ALS disease status (original study reported 98% accuracy for disease prediction). This reversed diagnostic priority reveals that tube type effects dominate the variance structure in the proteomics data, overwhelming genuine biological disease signals. When combined with the perfect confounding structure—where 100% of neurological controls are EDTA and 85% of ALS cases are HEPARIN—these results indicate that any disease classifier trained on pooled data will inevitably learn to recognize anticoagulant-associated protein signatures rather than ALS-specific biology.

The reverse prediction test thus provides strong diagnostic evidence for the presence of severe technical confounding across all protein sets, from the full panel to the curated biomarkers. Even if some ALS biomarkers exist within the dataset, their signals are thoroughly entangled with anticoagulant effects that must be addressed through stratified analysis or balanced study designs.

Important: Interpretation

An AUC of 0.999 indicates proteins can near-perfectly distinguish tube type. This provides strong evidence that:

  1. Tube type signal dominates biological signal in variance explained
  2. Proteins are more informative about collection method than disease diagnosis
  3. Pooled biomarker claims are materially affected by technical confounding
  4. The original model’s high performance likely reflects tube type recognition rather than disease biology

3.2.1 Top Proteins Distinguishing Tube Types

Table 2: Top 10 proteins most predictive of tube type

| Protein | Importance (Overall) |
|---------|---------------------:|
| DCTN2   | 100.00 |
| INSL3   | 55.60 |
| CENPF   | 45.39 |
| SPINT3  | 42.99 |
| FGF9    | 42.28 |
| CCND2   | 40.16 |
| AGBL2   | 30.47 |
| TRIM40  | 14.94 |
| SLC1A4  | 11.86 |
| DAND5   | 8.68 |

3.3 Geographic Validation: Leave-Country-Out CV

Show code
library(ggplot2)
library(pROC)
library(patchwork)
library(dplyr)

# Extract ROC curves
roc_italy_to_us <- lcv_results$italy_to_us$roc_curve
roc_us_to_italy <- lcv_results$us_to_italy$roc_curve
roc_italy_within <- within_country_cv_results$italy$roc_obj
roc_us_within <- within_country_cv_results$us$roc_obj
roc_pooled <- pooled_cv_results$roc_obj

# Create labels with AUCs
pooled_label <- sprintf("Pooled (%.3f)", pooled_cv_results$cv_auc)
italy_within_label <- sprintf("Italy (%.3f)", within_country_cv_results$italy$cv_auc)
us_within_label <- sprintf("US (%.3f)", within_country_cv_results$us$cv_auc)
italy_label <- sprintf("Italy→US (%.3f)", lcv_results$italy_to_us$test_auc)
us_label <- sprintf("US→Italy (%.3f)", lcv_results$us_to_italy$test_auc)

# ==============================================================================
# PANEL A: ORIGINAL VALIDATION (WITHIN-COUNTRY + POOLED)
# ==============================================================================
roc_original <- rbind(
  data.frame(
    specificity = 1 - roc_pooled$specificities,
    sensitivity = roc_pooled$sensitivities,
    model = pooled_label
  ),
  data.frame(
    specificity = 1 - roc_italy_within$specificities,
    sensitivity = roc_italy_within$sensitivities,
    model = italy_within_label
  ),
  data.frame(
    specificity = 1 - roc_us_within$specificities,
    sensitivity = roc_us_within$sensitivities,
    model = us_within_label
  )
)

# Define colors for Panel A
colors_panel_a <- c(
  "#FDE724",  # Pooled - Bright yellow (viridis 100%)
  "#1F9E89",  # Italy - Jade/teal (viridis 60%)
  "#B5DE2B"   # US - Yellow-green (viridis 90%)
)
names(colors_panel_a) <- c(pooled_label, italy_within_label, us_within_label)

p_original <- ggplot(roc_original, aes(x = specificity, y = sensitivity, color = model)) +
  geom_line(linewidth = 2.5, alpha = 0.9) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted", color = "#666666", alpha = 0.5, linewidth = 1) +
  scale_color_manual(
    name = "Validation Strategy",
    values = colors_panel_a
  ) +
  labs(
    title = "A. ORIGINAL",
    x = "False Positive Rate",
    y = "True Positive Rate"
  ) +
  theme_dark_scientific(base_size = 14) +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = c(0.65, 0.25)
  ) +
  coord_equal()

# ==============================================================================
# PANEL B: REANALYSIS (CROSS-COUNTRY)
# ==============================================================================
roc_reanalysis <- rbind(
  data.frame(
    specificity = 1 - roc_italy_to_us$specificities,
    sensitivity = roc_italy_to_us$sensitivities,
    model = italy_label
  ),
  data.frame(
    specificity = 1 - roc_us_to_italy$specificities,
    sensitivity = roc_us_to_italy$sensitivities,
    model = us_label
  )
)

# Define colors for Panel B
colors_panel_b <- c(
  "#440154",  # Italy→US - Dark purple (viridis 0%)
  "#31688E"   # US→Italy - Medium blue (viridis 40%)
)
names(colors_panel_b) <- c(italy_label, us_label)

p_reanalysis <- ggplot(roc_reanalysis, aes(x = specificity, y = sensitivity, color = model)) +
  geom_line(linewidth = 2.5, alpha = 0.9) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted", color = "#666666", alpha = 0.5, linewidth = 1) +
  scale_color_manual(
    name = "Cross-Country",
    values = colors_panel_b
  ) +
  labs(
    title = "B. REANALYSIS (CROSS-COUNTRY VALIDATION)",
    x = "False Positive Rate",
    y = "True Positive Rate"
  ) +
  theme_dark_scientific(base_size = 14) +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = c(0.65, 0.25)
  ) +
  coord_equal()

# ==============================================================================
# PANEL C: CONFUSION MATRICES FOR CROSS-COUNTRY
# ==============================================================================

# Italy → US confusion matrix
cm_italy_us <- lcv_results$italy_to_us$confusion_matrix
cm_italy_us_df <- as.data.frame.table(cm_italy_us)
names(cm_italy_us_df) <- c("Predicted", "Actual", "Count")

# Calculate metrics
sens_italy_us <- cm_italy_us["ALS", "ALS"] / sum(cm_italy_us[, "ALS"])
spec_italy_us <- cm_italy_us["Control", "Control"] / sum(cm_italy_us[, "Control"])

p_cm_italy_us <- ggplot(cm_italy_us_df, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile(color = "#444444", linewidth = 1) +
  geom_text(aes(label = Count), size = 6, fontface = "bold", color = "white") +
  scale_fill_gradient(low = "#2c1e3d", high = "#E6550D") +
  labs(
    title = sprintf("Italy→US\nSens: %.2f, Spec: %.2f", sens_italy_us, spec_italy_us),
    x = "True", y = "Predicted"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme_centered_titles() +
  theme_zero_margins() +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = "none",
    panel.grid = element_blank()
  ) +
  coord_equal()

# US → Italy confusion matrix
cm_us_italy <- lcv_results$us_to_italy$confusion_matrix
cm_us_italy_df <- as.data.frame.table(cm_us_italy)
names(cm_us_italy_df) <- c("Predicted", "Actual", "Count")

sens_us_italy <- cm_us_italy["ALS", "ALS"] / sum(cm_us_italy[, "ALS"])
spec_us_italy <- cm_us_italy["Control", "Control"] / sum(cm_us_italy[, "Control"])

p_cm_us_italy <- ggplot(cm_us_italy_df, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile(color = "#444444", linewidth = 1) +
  geom_text(aes(label = Count), size = 6, fontface = "bold", color = "white") +
  scale_fill_gradient(low = "#2c1e3d", high = "#756BB1") +
  labs(
    title = sprintf("US→Italy\nSens: %.2f, Spec: %.2f", sens_us_italy, spec_us_italy),
    x = "True", y = "Predicted"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme_centered_titles() +
  theme_zero_margins() +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = "none",
    panel.grid = element_blank()
  ) +
  coord_equal()
# ==============================================================================
# COMBINE ALL PANELS: A and B on top, C (both confusion matrices) below
# ==============================================================================
((p_original | p_reanalysis) / (p_cm_italy_us | p_cm_us_italy)) +
  plot_annotation(
    title = "Geographic Validation: Original Study vs Reanalysis",
    subtitle = "Cross-country validation reveals severe geographic overfitting not detected by original within-country/pooled approaches",
    theme = theme_dark_scientific(base_size = 12) +
      theme(
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "#FDE724"),
        plot.background = element_rect(fill = "#151520", color = NA),
        plot.margin = unit(c(0,0,0,0), "pt")
      )
  ) &
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA)
  )
Figure 5: ROC curves, confusion matrices, and performance metrics comparing Original (within-country, pooled) vs Reanalysis (cross-country) validation strategies

3.3.1 Performance Summary

Table 3: Model performance: Pooled CV vs Geographic Validation

| Approach | AUC | 95% CI (lower) | 95% CI (upper) |
|----------|----:|---------------:|---------------:|
| Pooled CV (original) | 0.951 | 0.935 | 0.967 |
| Italy → US | 0.813 | 0.737 | 0.889 |
| US → Italy | 0.736 | 0.687 | 0.784 |
| Mean LCV | 0.774 | NA | NA |

3.4 Within-Country Cross-Validation

Show code
library(patchwork)

# Panel 1: Within-country CV results
within_country_data <- tibble::tibble(
  Approach = factor(
    c("Italy-only CV\n(HEPARIN)", "US-only CV\n(EDTA)"),
    levels = c("Italy-only CV\n(HEPARIN)", "US-only CV\n(EDTA)")
  ),
  AUC = c(italy_within_cv$cv_auc, us_within_cv$cv_auc),
  CI_lower = c(italy_within_cv$cv_auc_ci_lower, us_within_cv$cv_auc_ci_lower),
  CI_upper = c(italy_within_cv$cv_auc_ci_upper, us_within_cv$cv_auc_ci_upper)
)

p1 <- ggplot(within_country_data, aes(x = Approach, y = AUC, fill = Approach)) +
  geom_col(width = 0.65, color = "#0B0B0F") +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.15, color = "#C8C8D0", linewidth = 0.5) +
  geom_point(size = 4, color = "white", shape = 21, fill = "#FDE724", stroke = 1.5) +
  geom_text(aes(label = sprintf("%.3f", AUC)), vjust = -1.2, color = "white", size = 5.5, fontface = "bold") +
  scale_y_continuous(limits = c(0.5, 1.0), breaks = seq(0.5, 1.0, 0.05)) +
  scale_fill_manual(
    values = c(
      "Italy-only CV\n(HEPARIN)" = "#5DC863",
      "US-only CV\n(EDTA)" = "#31688E"
    )
  ) +
  labs(
    title = "A. Within-Country CV Performance",
    subtitle = "Models trained and tested within single cohorts",
    x = NULL,
    y = "AUC (95% CI)"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme_zero_margins() +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = "none",
    axis.text.x = element_text(size = 11, face = "bold"),
    axis.text.y = element_text(size = 11),
    axis.title.y = element_text(size = 12, face = "bold")
  )

# Panel 2: Pooled and LCV comparison
pooled_lcv_data <- tibble::tibble(
  Approach = factor(
    c("Pooled CV", "Italy → US", "US → Italy"),
    levels = c("Pooled CV", "Italy → US", "US → Italy")
  ),
  AUC = c(
    pooled_cv_results$cv_auc,
    lcv_results$italy_to_us$test_auc,
    lcv_results$us_to_italy$test_auc
  ),
  CI_lower = c(
    pooled_cv_results$cv_auc_ci_lower,
    lcv_results$italy_to_us$test_auc_ci_lower,
    lcv_results$us_to_italy$test_auc_ci_lower
  ),
  CI_upper = c(
    pooled_cv_results$cv_auc_ci_upper,
    lcv_results$italy_to_us$test_auc_ci_upper,
    lcv_results$us_to_italy$test_auc_ci_upper
  ),
  Type = c("Pooled", "LCV", "LCV")
)

p2 <- ggplot(pooled_lcv_data, aes(x = Approach, y = AUC, fill = Type)) +
  geom_col(width = 0.65, color = "#0B0B0F") +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.15, color = "#C8C8D0", linewidth = 0.5) +
  geom_point(size = 4, color = "white", shape = 21, fill = "#FDE724", stroke = 1.5) +
  geom_text(aes(label = sprintf("%.3f", AUC)), vjust = -1.2, color = "white", size = 5.5, fontface = "bold") +
  scale_y_continuous(limits = c(0.5, 1.0), breaks = seq(0.5, 1.0, 0.05)) +
  scale_fill_manual(
    values = c(
      "Pooled" = "#FDE725",
      "LCV" = "#C2408C"
    )
  ) +
  labs(
    title = "B. Pooled vs Leave-Country-Out CV",
    subtitle = "Geographic validation reveals confounding",
    x = NULL,
    y = "AUC (95% CI)"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme_zero_margins() +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.position = "none",
    axis.text.x = element_text(size = 11, face = "bold"),
    axis.text.y = element_text(size = 11),
    axis.title.y = element_text(size = 12, face = "bold")
  )

# Combine panels side by side
(p1 + p2) + plot_layout() & theme_zero_margins()
Figure 6: Model performance comparison: Pooled CV, Within-country CV, and Leave-country-out (LCV)

Within-country model performance:

  • Italy-only (HEPARIN, n = 403; ALS 223, Controls 180): AUC 0.961 (95% CI 0.944–0.978)
  • US-only (EDTA, n = 301; ALS 40, Controls 261): AUC 0.922 (95% CI 0.858–0.957)
Note: Interpretation
  • Pooled CV remains the optimistic upper bound because confounding structure is shared across folds.
  • Italy-only and US-only CV reflect performance in homogeneous cohorts without cross-matrix generalization demands. Both achieve reasonable performance within their respective anticoagulant contexts.
  • The large gap between within-country CV and leave-country-out (Italy → US and US → Italy) demonstrates that performance degrades substantially when the anticoagulant/geography context changes. This pattern is consistent with models learning tube/context-specific signatures that do not transfer across countries, not generalizable ALS biology.
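Why pooled CV cannot detect this kind of confounding, while leave-site-out validation can, is easy to demonstrate with a simulation (a sketch with made-up numbers, not our pipeline): a feature carrying only a site shift, with case mix tracking site the way tube type does here, looks diagnostic under pooled evaluation but is uninformative within a held-out site.

```r
set.seed(2)
n <- 300
site <- rep(c("IT", "US"), each = n)
als  <- c(rbinom(n, 1, 0.85), rbinom(n, 1, 0.15))  # case mix tracks site, like tube type
x    <- rnorm(2 * n) + 2 * (site == "IT")          # pure site shift, zero ALS biology

# Rank-based AUC: P(score_case > score_control)
auc <- function(score, y) mean(outer(score[y == 1], score[y == 0], ">"))

auc(x, als)                              # pooled: well above 0.5 despite no biology
auc(x[site == "US"], als[site == "US"])  # within a single site: ~0.5 (chance)
```

The pooled AUC is inflated purely because cases are concentrated at the shifted site; holding a site out removes the shortcut and exposes the absence of transferable signal.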

3.5 Differential Analysis: Stratified vs Pooled

Show code
# Plot protein counts inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)

counts_data <- data.frame(
  analysis = c(
    "Italy\n(HEPARIN)",
    "US\n(EDTA)",
    "Pooled\n(Confounded)",
    "Both Strata\n(Robust)"
  ),
  count = c(
    length(stratified_vs_pooled_comparison$sig_italy),
    length(stratified_vs_pooled_comparison$sig_us),
    length(stratified_vs_pooled_comparison$sig_pooled),
    length(stratified_vs_pooled_comparison$sig_both_strata)
  ),
  type = c("Stratified", "Stratified", "Pooled", "Robust")
)

counts_data$analysis <- factor(counts_data$analysis, levels = counts_data$analysis)

ggplot(counts_data, aes(x = analysis, y = count, fill = type)) +
  geom_col(width = 0.7, alpha = 0.9) +
  geom_text(aes(label = count), vjust = -0.5, size = 5, fontface = "bold", color = "#EEEEEE") +
  scale_fill_manual(values = c(
    "Stratified" = "#31688E", # viridis cyan
    "Pooled" = "#FDE724", # viridis yellow
    "Robust" = "#6CCE59" # viridis lime
  )) +
  labs(
    title = "Significant Proteins by Analysis Type",
    subtitle = sprintf(
      "Only %d/%d pooled proteins (%.1f%%) replicate in both strata - major reproducibility issue",
      length(stratified_vs_pooled_comparison$overlap_both),
      length(stratified_vs_pooled_comparison$sig_pooled),
      stratified_vs_pooled_comparison$pct_replicate_in_both
    ),
    x = NULL,
    y = "Number of Significant Proteins\n(FDR < 0.05, |logFC| > 0.5)",
    fill = "Type"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank()
  )
Figure 7: Significant protein counts by analysis approach

3.5.1 Summary Statistics

Table 4: Significant proteins by analysis approach (FDR < 0.05, |logFC| > 0.5)

| Analysis | N Significant | % of Pooled |
|----------|--------------:|------------:|
| Italy (HEPARIN) | 37 | 47.4 |
| US (EDTA) | 379 | 485.9 |
| Pooled (confounded) | 78 | 100.0 |
| Both Strata | 23 | 26.9 |
Important: Critical Finding

Only 26.9% of pooled proteins replicate in BOTH strata

  • 51 proteins (65.4%) significant in pooled but NOT in either stratum
  • These are likely confounding artifacts driven by country/tube effects
  • Most pooled “significant” proteins are likely false positives attributable to confounding
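The replication bookkeeping behind these counts reduces to simple set operations. A sketch on toy per-protein result tables (column names mirror limma's `Assay`, `logFC`, `adj.P.Val`; the data are random, so the particular counts are meaningless):

```r
set.seed(3)
# Toy per-stratum/pooled result tables with the columns used by the cutoffs
make_tt <- function(k) data.frame(
  Assay     = paste0("P", seq_len(k)),
  logFC     = rnorm(k),
  adj.P.Val = runif(k)
)
tt_italy <- make_tt(500); tt_us <- make_tt(500); tt_pooled <- make_tt(500)

# Significance rule from the analysis: FDR < 0.05 AND |logFC| > 0.5
sig <- function(tt) tt$Assay[tt$adj.P.Val < 0.05 & abs(tt$logFC) > 0.5]

# Replication requires significance in BOTH strata, not just the pooled model
replicated <- intersect(sig(tt_pooled), intersect(sig(tt_italy), sig(tt_us)))
pct <- 100 * length(replicated) / length(sig(tt_pooled))
pct  # percent of pooled hits that replicate in both strata
```

Applying this rule to the real per-stratum tables is what produces the replication percentages reported above.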

3.5.2 Protein Overlap Visualization

Show code
# Plot Venn diagram inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)
library(ggvenn)

# Prepare data for ggvenn
venn_data <- list(
  Italy = stratified_vs_pooled_comparison$sig_italy,
  US = stratified_vs_pooled_comparison$sig_us,
  Pooled = stratified_vs_pooled_comparison$sig_pooled
)

ggvenn(
  venn_data,
  fill_color = c("#1F9E89", "#B5DE2B", "#FDE724"), # viridis jade, yellow-green, yellow
  fill_alpha = 0.5,
  stroke_size = 1,
  set_name_size = 5,
  set_name_color = "white",
  text_size = 4
) +
  labs(
    title = "Protein Overlap: Stratified vs Pooled Analysis",
    subtitle = sprintf(
      "⚠️ CRITICAL: Only %d proteins (%.1f%%) significant in BOTH Italy and US",
      length(stratified_vs_pooled_comparison$sig_both_strata),
      stratified_vs_pooled_comparison$pct_replicate_in_both
    )
  ) +
  theme_dark_scientific(base_size = 12) +
  theme(
    plot.background = element_rect(fill = "#151520", colour = NA),
    panel.background = element_rect(fill = "#151520", colour = NA),
    legend.background = element_rect(fill = "#151520", colour = NA),
    legend.key = element_rect(fill = "#151520", colour = NA),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(color = "#FDE724", face = "bold", hjust = 0.5)
  )
Figure 8: Venn diagram showing protein overlap between strata

3.6 Geographically Consistent Biomarker Candidates

3.6.1 Meta-Analysis Results

Proteins significant in BOTH Italy and US with concordant direction: 22

These are the ONLY proteins that show consistent ALS signals across both tube types and geographic cohorts.

Table 5: Top 10 geographically consistent biomarker candidates

| Assay | logFC_italy | logFC_us | mean_logFC | adj_P_italy | adj_P_us | combined_p |
|-------|------------:|---------:|-----------:|------------:|---------:|-----------:|
| NEFL | 2.503 | 2.257 | 2.380 | 0 | 0 | 0 |
| ALDH3A1 | 2.311 | 2.323 | 2.317 | 0 | 0 | 0 |
| MEGF10 | 0.877 | 1.054 | 0.966 | 0 | 0 | 0 |
| CORO6 | 1.157 | 1.099 | 1.128 | 0 | 0 | 0 |
| HS6ST2 | 0.777 | 0.901 | 0.839 | 0 | 0 | 0 |
| CSRP3 | 2.143 | 1.914 | 2.028 | 0 | 0 | 0 |
| MYLPF | 1.555 | 1.585 | 1.570 | 0 | 0 | 0 |
| MYBPC1 | 1.385 | 1.469 | 1.427 | 0 | 0 | 0 |
| EDA2R | 0.871 | 0.757 | 0.814 | 0 | 0 | 0 |
| CA3 | 1.056 | 0.739 | 0.898 | 0 | 0 | 0 |
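The `combined_p` column pools the two cohort p-values. One standard combiner (our assumption for illustration — the original pipeline may use a different method) is Fisher's method, where −2·Σ log pᵢ follows a χ² distribution with 2k degrees of freedom under the null:

```r
# Fisher's method for combining k independent p-values
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

fisher_combine(c(0.04, 0.03))  # two concordant cohorts reinforce each other
fisher_combine(c(0.04, 0.60))  # a non-replicating cohort dilutes the evidence
```

Note that Fisher's method rewards concordant evidence but does not, by itself, guard against a shared confounder driving both cohorts' p-values.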

3.6.2 Effect Size Correlation

To assess whether effect sizes are consistent across cohorts, we calculate correlations for two protein sets:

  1. All proteins on Olink panel (n = 2868)
  2. Proteins significant in pooled analysis (n = 78)
Figure 9: Effect size correlations between Italy and US cohorts

Key Findings:

  • All proteins: Pearson r = 0.030, Spearman ρ = -0.115 — Near-zero Pearson correlation with negative Spearman rank correlation across entire panel
  • Pooled-significant proteins: Pearson r = 0.788, Spearman ρ = 0.683 — Moderate-to-strong correlation among selected proteins

The stark difference between these correlations reveals an important pattern: proteins that reach significance in the pooled analysis tend to show more consistent effect sizes across cohorts, while the majority of proteins on the panel show essentially no correlation.

Notably, the negative Spearman correlation (ρ = -0.115) for all proteins indicates that even the rank ordering of effects tends to be inverted between cohorts—proteins with larger effects in Italy tend to have smaller effects in the US, and vice versa. This rank inversion, while modest in magnitude, suggests systematic differences in how proteins respond across the two cohorts that go beyond simple scaling differences.
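The divergence between the full-panel correlation and the pooled-significant correlation is exactly what selection predicts. A small simulation (illustrative values only, not our pipeline) shows concordance concentrating in whatever subset a pooled criterion selects:

```r
set.seed(4)
p <- 2000
true  <- c(rnorm(100), rep(0, p - 100))   # 100 proteins with real shared effects
italy <- true + rnorm(p, sd = 0.5)        # per-cohort logFC = truth + noise
us    <- true + rnorm(p, sd = 0.5)
hits  <- abs((italy + us) / 2) > 0.8      # crude stand-in for pooled selection

cor(italy, us)                 # modest across the full panel
cor(italy[hits], us[hits])     # much stronger among the selected subset
```

Selecting on the averaged signal preferentially retains proteins whose effects are shared, so high concordance among pooled-significant proteins is partly a consequence of how they were chosen.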

Important: Interpretation and ML Implications

What this means:

The 78 proteins significant in pooled analysis do show reasonable concordance between Italy and US (Pearson r ≈ 0.79, Spearman ρ ≈ 0.68), suggesting they may capture some reproducible ALS biology. However:

  1. Rank inversion in broader panel: Across all proteins, Spearman ρ = -0.115 indicates systematic rank-order differences between cohorts
  2. Effect concordance ≠ absence of confounding: Proteins can show consistent disease associations while still being systematically affected by tube type
  3. 72% of pooled proteins fail stratified replication: Only 22/78 (28%) are significant in BOTH cohorts
  4. Geographic validation issues persist: ML models trained on pooled data show reduced performance in leave-country-out CV (an 18.6% relative drop: AUC 0.951 → 0.774)
  5. Reverse prediction remains problematic: Proteins discriminate tube types with very high accuracy (AUC = 0.999), indicating pervasive technical effects
  6. Residual confounding in “consistent” proteins: Even the 22 geographically consistent proteins predict tube type with AUC = 0.908

Critical point for ML models:

The moderate-to-strong correlation (Pearson r ≈ 0.79) among pooled-significant proteins does NOT eliminate concerns about confounding bias in machine learning models. The models are trained on absolute NPX values—which contain both biological signal AND tube-type effects—not on effect sizes. Geographic validation demonstrates that models show substantially reduced generalization (Italy→US: AUC = 0.81; US→Italy: AUC = 0.74; mean = 0.77) compared to pooled cross-validation (AUC = 0.95), indicating that the training data contains systematic cohort-specific patterns.

Interpretation: While some proteins show reproducible disease associations, the ML framework faces substantial challenges due to the Italy-HEPARIN/US-EDTA confound structure.

3.6.3 Biological Interpretation

NEFL (Neurofilament Light Chain) is a well-established neuroaxonal injury biomarker and likely represents a genuine biological signal in ALS. However, our analyses demonstrate that its effect size in this dataset is substantially contaminated by the Italy-HEPARIN/US-EDTA confound. Among the 22 “geographically consistent” proteins significant in both strata, nearly half (10/22) retain significant residual tube effects in disease-free individuals (Section 3.7). Importantly:

  • Only 22 proteins survive rigorous geographic validation (out of 78 pooled)
  • 72% of pooled proteins fail to replicate across tube types
  • Effect sizes across the full panel are essentially uncorrelated between cohorts (r = 0.03)
  • Even geographically consistent candidates retain substantial tube signal (see Residual Confounding section)

Verdict: While NEFL and a small subset of proteins remain clinically relevant, their magnitudes in the pooled analysis are inflated and their relative rankings among biomarkers are distorted by non-biological factors. Most of the published protein list requires prospective validation in tube-balanced studies.

3.6.4 Visualizing Country/Tube Effects in Geographically Consistent Proteins

To demonstrate that even “geographically consistent” proteins exhibit systematic country/tube effects, we examine the distribution of NPX values for ALL geographically consistent proteins, comparing ALS vs. Healthy controls across Italy (HEPARIN) and US (EDTA) cohorts.

Figure 10: All ‘Geographically Consistent’ Proteins Show Strong Country/Tube Effects
Important: Key Observation

The visualization combines geographically consistent proteins with ALL 17 original biomarkers from the authors’ Figure 4a to demonstrate that systematic tube type differences exist across both sets. Even within healthy controls, we observe substantial EDTA vs. HEPARIN differences (shown as effect sizes and p-values). This pattern demonstrates that:

  1. Country/tube effects are present in disease-free individuals, proving these differences are technical artifacts, not ALS biology
  2. ALS vs. Healthy separation varies by country/tube context, indicating entanglement of biological and technical signals
  3. Direct NPX comparisons between cohorts are confounded, even for proteins that show consistent ALS associations within each cohort
  4. The original biomarkers (★) are not exempt from these confounding effects, demonstrating that the reported findings cannot be generalized across collection protocols

This visual evidence reinforces the quantitative findings from our three convergent tests (reverse prediction, healthy control analysis, effect decomposition) that residual confounding persists in all candidate biomarkers.

3.7 Residual Confounding in “Geographically Consistent” Proteins

Critical Question: Are the 22 “geographically consistent” proteins (significant in both Italy and US with concordant direction) truly independent of tube type effects, or do they still retain substantial technical confounding?

Just because a protein is significant in BOTH cohorts doesn’t prove it’s free from tube type artifacts. It could be: (1) true biomarker + NO tube effect (ideal), (2) true biomarker + tube effect (entangled), or (3) pure tube artifact spuriously associated with disease. With perfect confounding, we cannot distinguish cases #2 and #3, but we CAN quantify the magnitude of residual tube type signal using two complementary approaches below, which together with the reverse prediction test (Section 3.2) provide converging evidence.

3.7.1 Test 1: Tube Effects in Healthy Controls (Disease-Free Test)

As demonstrated in Figure 10, even “geographically consistent” proteins show systematic differences between tube types within disease-free individuals. Here we quantify these effects statistically.

Show code
# Plot healthy tube effects using limma-based results (adjusted for age & sex)
library(ggplot2)
library(dplyr)

# Prepare plot data from limma results
# Filter to only geographically consistent proteins (same as old analysis)
plot_data <- tube_effects_healthy %>%
  dplyr::filter(Assay %in% tube_robust_top15_assays) %>%
  dplyr::arrange(adj.P.Val) %>%
  dplyr::mutate(
    protein = factor(Assay, levels = rev(Assay)),
    significant = adj.P.Val < 0.05,
    # Use logFC as effect size (already in NPX units since NPX is log2)
    # limma reports t = logFC / SE, so SE = logFC / t; approximate 95% CI
    ci_lower = logFC - 1.96 * logFC / t,
    ci_upper = logFC + 1.96 * logFC / t
  )

n_sig <- sum(plot_data$significant)
n_total <- nrow(plot_data)

ggplot(plot_data, aes(x = logFC, y = protein, color = significant)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "#888888") +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = ci_lower, xmax = ci_upper), height = 0.3) +
  scale_color_manual(values = c("TRUE" = "#FDE724", "FALSE" = "#888888")) +
  labs(
    title = "Tube Type Effects in Healthy Controls (Disease-Free)",
    subtitle = sprintf(
      "⚠️ %d/%d geographically consistent proteins differ by tube type WITHOUT disease present\nLimma analysis adjusted for age & sex",
      n_sig, n_total
    ),
    x = "Effect Size in NPX (US/EDTA - Italy/HEPARIN, logFC)",
    y = NULL,
    color = "Significant\n(FDR < 0.05)"
  ) +
  theme_dark_scientific(base_size = 12) +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 11: Tube type effects in healthy controls only

Rationale: If a protein differs between Italy/HEPARIN and US/EDTA in disease-free individuals, that difference MUST be a technical artifact (or population genetics), NOT ALS biology.

Result: 10 out of 22 geographically consistent proteins (45.5%) show significant tube effects in healthy controls (limma, FDR < 0.05)
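Per protein, the disease-free test reduces to a covariate-adjusted model for tube type fit in controls only. A base R sketch on simulated controls (the limma version moderates the variance estimates, but the logic is the same; every number here is made up):

```r
set.seed(5)
# Simulated healthy controls: 60 per tube type, with age/sex covariates
n    <- 120
tube <- factor(rep(c("EDTA", "HEPARIN"), each = n / 2))
age  <- rnorm(n, 60, 10)
sex  <- rbinom(n, 1, 0.5)
# Three proteins: only the first carries a genuine 1-NPX tube shift
npx  <- cbind(rnorm(n) + (tube == "HEPARIN"), rnorm(n), rnorm(n))

# Per-protein tube-type p-value, adjusted for age and sex, then BH-FDR
pvals <- apply(npx, 2, function(y)
  summary(lm(y ~ tube + age + sex))$coefficients["tubeHEPARIN", 4])
p.adjust(pvals, method = "BH")  # the shifted protein should flag at FDR < 0.05
```

Because no disease is present in this subset, any protein that flags here is, by construction, responding to the collection matrix (or population structure), not to ALS.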

Distribution of Tube Effect Sizes Across All Proteins

To understand the full scope of tube type effects in healthy controls, we examine the distribution of effect sizes across all proteins in the Olink panel. We present two views: the full distribution and a zoomed view focusing on the central range.

Show code
# Load visualization from targets (full range)
tar_load(viz_tube_effect_distribution_full)
viz_tube_effect_distribution_full
Figure 12: Distribution of tube type effect sizes in healthy controls across all proteins (full range). Cyan dashed line: theoretical null distribution assuming no systematic tube effect.

The full distribution (Figure 12) shows the complete range of tube effects, including extreme outliers extending beyond 1000% change. The theoretical null distribution (cyan dashed line, centered at 0) represents the expected pattern if tube type had no systematic effect—the dramatic deviation from this null demonstrates widespread technical artifacts across the proteome.

Show code
# Load visualization from targets (zoomed to -1000 to 1000)
tar_load(viz_tube_effect_distribution_zoomed)
viz_tube_effect_distribution_zoomed
Figure 13: Distribution of tube type effect sizes in healthy controls (zoomed to -1000% to +1000% range). Cyan dashed line: theoretical null distribution. The observed distribution is substantially wider and more bimodal than expected under the null hypothesis of no tube effect.

The zoomed distribution (Figure 13) focuses on the -1000% to +1000% range, allowing better visualization of the central distribution pattern. Comparison to the theoretical null (cyan dashed curve) clearly shows that the observed effects are far more extreme and variable than expected from measurement noise alone, with a bimodal structure suggesting systematic directional effects. Outliers beyond this range are indicated with arrows.

Key Findings:

  • 979 out of 2868 proteins (34.1%) show significant tube effects in healthy controls (FDR < 0.05)
  • Median absolute percent change: 13.7% across all proteins
  • Median absolute percent change (significant only): 32.5%
  • 54 proteins show >100% change
  • 262 proteins show >50% change
  • 2 proteins show extreme changes >1000%

Interpretation: This distribution reveals that tube type differences are widespread and substantial across the entire proteome, not limited to a few outlier proteins. The fact that over a third of all proteins show significant effects in disease-free individuals demonstrates that technical artifacts dominate the variance structure of this dataset. The presence of extreme outliers (>1000% change) indicates that some proteins are profoundly affected by anticoagulant choice, making them completely unreliable for biomarker discovery without proper experimental control.
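Since NPX is a log2 scale, the percent changes quoted above follow from the tube-effect logFC values as (2^logFC − 1) × 100 (a small helper of our own, not from the pipeline):

```r
# Convert a log2-scale (NPX) fold change to percent change
npx_logfc_to_pct <- function(logFC) (2^logFC - 1) * 100

npx_logfc_to_pct(0.960)     # NEFL's tube shift in controls: roughly +95%
npx_logfc_to_pct(1)         # one full NPX unit is a doubling: +100%
npx_logfc_to_pct(log2(11))  # a >1000% change requires logFC above ~3.46
```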

Table 6: Top geographically consistent proteins with tube effects in disease-free individuals (limma-adjusted for age & sex)

| Protein | logFC (NPX) | Mean NPX | t-statistic | FDR |
|---------|------------:|---------:|------------:|----:|
| NEFL | 0.960 | -1.060 | 8.329 | 0.000 |
| ALDH3A1 | 0.923 | -0.598 | 7.464 | 0.000 |
| DTNB | 0.648 | -0.389 | 5.285 | 0.000 |
| EDA2R | 0.528 | -0.266 | 4.212 | 0.000 |
| HS6ST2 | 0.405 | -0.412 | 3.905 | 0.001 |
Important: Critical Finding

Many geographically consistent proteins differ by tube type even WITHOUT disease present!

Using limma analysis (adjusted for age & sex):

  • Multiple proteins show significant tube effects in disease-free individuals (FDR < 0.05)
  • These differences occur in the absence of ALS, proving they reflect technical artifacts
  • P-values are now properly adjusted for covariates using empirical Bayes
  • 10 of 22 geographically consistent proteins remain significant after FDR correction

This definitively demonstrates that even “geographically consistent” candidates are contaminated by technical confounding.

3.7.2 Test 2: Effect Size Decomposition (Biology vs Artifacts)

Show code
# Plot effect decomposition inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)
library(dplyr)
library(ggrepel)

# Prepare data
plot_data <- effect_decomposition %>%
  dplyr::mutate(
    category = dplyr::case_when(
      ratio_tube_to_disease > 1.0 ~ "Tube > Disease",
      ratio_tube_to_disease >= 0.5 ~ "Tube (ES >= 50% Disease)", # matches n_severe threshold below
      ratio_tube_to_disease >= 0.2 ~ "Tube (ES 20-50% Disease)",
      TRUE ~ "Tube (ES < 20% Disease)"
    ),
    category = factor(category, levels = c(
      "Tube > Disease",
      "Tube (ES >= 50% Disease)",
      "Tube (ES 20-50% Disease)",
      "Tube (ES < 20% Disease)"
    ))
  )

# Count by category
category_counts <- plot_data %>%
  dplyr::count(category)

n_severe <- sum(plot_data$ratio_tube_to_disease >= 0.5, na.rm = TRUE)
pct_severe <- 100 * n_severe / nrow(plot_data)

# Label ALL proteins (only 22 total, so manageable)
proteins_to_label <- plot_data

ggplot(plot_data, aes(x = abs_disease_effect, y = abs_tube_effect, color = category)) +
  # Diagonal line (tube = disease)
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#FDE724", linewidth = 1) +
  # Reference line (tube = 50% of disease)
  geom_abline(intercept = 0, slope = 0.5, linetype = "dotted", color = "#6CCE59", linewidth = 0.8) +
  # Points
  geom_point(size = 3, alpha = 0.7) +
  # Label ALL proteins - set max.overlaps to Inf to ensure all labels are shown
  ggrepel::geom_text_repel(
    data = proteins_to_label,
    aes(label = protein),
    size = 8.5,
    color = "#EEEEEE",
    box.padding = 0.35,
    point.padding = 0.3,
    max.overlaps = Inf,
    min.segment.length = 0,
    force = 2,
    force_pull = 1
  ) +
  scale_color_manual(values = c(
    "Tube > Disease" = "#FDE724", # viridis yellow
    "Tube (ES >= 50% Disease)" = "#B5DE2B", # viridis yellow-green
    "Tube (ES 20-50% Disease)" = "#6CCE59", # viridis lime
    "Tube (ES < 20% Disease)" = "#1F9E89" # viridis jade
  )) +
  annotate("text",
    x = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.7,
    y = max(plot_data$abs_tube_effect, na.rm = TRUE) * 0.9,
    label = sprintf("y = x\n(Tube = Disease)"),
    color = "#FDE724", size = 3, fontface = "italic"
  ) +
  annotate("text",
    x = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.6,
    y = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.3,
    label = sprintf("y = 0.5x\n(Tube = 50%% Disease)"),
    color = "#6CCE59", size = 3, fontface = "italic"
  ) +
  labs(
    title = "Effect Size Decomposition: Tube Type vs Disease Effects",
    subtitle = sprintf(
      "⚠️ SEVERE: %d/%d proteins (%.1f%%) have tube effects >=50%% of disease effects",
      n_severe, nrow(plot_data), pct_severe
    ),
    x = "Absolute Disease Effect (|ALS - Control|)",
    y = "Absolute Tube Effect (|US - Italy|)",
    color = "Entanglement Level"
  ) +
  theme_dark_scientific(base_size = 16) +
  theme(
    legend.position = "bottom",
    axis.title = element_text(size = 14),
    plot.subtitle = element_text(size = 12)
  )
Figure 14: Tube type effects vs disease effects for each protein

Question: Is the tube type effect comparable to or larger than the disease effect?

For each geographically consistent protein, we calculated:

  • Disease effect: |ALS - Control| (adjusted for country, age, sex)
  • Tube effect: |US/EDTA - Italy/HEPARIN| (adjusted for diagnosis, age, sex)
  • Ratio: |tube effect| / |disease effect|
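This decomposition can be sketched as a single adjusted regression per protein, estimating the disease and tube coefficients jointly and comparing their magnitudes. A minimal Python illustration on synthetic data; the covariate set and names are assumptions, not the original R pipeline's objects.

```python
# Sketch of the effect decomposition for one protein: fit a linear model
# with diagnosis, tube/country, age, and sex, then compare |tube coef|
# to |disease coef|. Synthetic data with known true effects.
import numpy as np

rng = np.random.default_rng(1)
n = 300
als = rng.integers(0, 2, n)               # 1 = ALS, 0 = control
edta = rng.integers(0, 2, n)              # 1 = US/EDTA, 0 = Italy/HEPARIN
age = rng.normal(60, 10, n)
sex = rng.integers(0, 2, n)
# True disease effect 0.6 NPX, true tube effect 0.5 NPX
npx = 0.6 * als + 0.5 * edta + 0.01 * age + rng.normal(0, 1, n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), als, edta, age, sex])
beta, *_ = np.linalg.lstsq(X, npx, rcond=None)

disease_effect = abs(beta[1])
tube_effect = abs(beta[2])
ratio = tube_effect / disease_effect
print(f"disease={disease_effect:.2f} tube={tube_effect:.2f} ratio={ratio:.2f}")
```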

Results:

Table 7: Distribution of tube-to-disease effect ratios

Effect Category          Count   Percentage
negligible_tube_effect   10      45.5%
moderate_tube_effect     9       40.9%
severe_tube_effect       2       9.1%
tube_dominates_disease   1       4.5%

Median ratio: 0.221 (the typical tube effect is ~22% as large as the corresponding disease effect)

Warning: Severe Entanglement Identified

3 proteins (13.6%) have tube effects ≥50% of disease effects:

Top proteins with highest tube/disease ratios:

Protein   Disease Effect   Tube Effect   Ratio
DTNB      0.563            0.617         1.096
HS6ST2    0.814            0.498         0.612
EDA2R     0.825            0.466         0.566
HSPB6     0.680            0.309         0.455
NEFL      2.461            1.083         0.440

DTNB has a tube effect (0.62 NPX) LARGER than its disease effect (0.56 NPX) — ratio = 1.10. This suggests it may be primarily a tube artifact spuriously associated with ALS in this confounded dataset.

3.7.3 Converging Evidence Summary

Three independent tests converge on the same conclusion:

  1. Reverse prediction (AUC = 0.999, Section 3.2): The full protein panel nearly perfectly predicts tube type
  2. Healthy controls (10 of 22 significant): Nearly half of geographically consistent proteins show tube effects without disease present
  3. Effect decomposition (median ratio = 0.22): Tube effects are typically ~22% as large as disease effects among geographically consistent proteins, and reach ≥50% for three of them

Verdict: Even the most conservative “geographically consistent” biomarker candidates remain substantially entangled with technical artifacts. While some biological signal exists (especially for NEFL), perfect confounding prevents definitive separation of true ALS biology from tube type effects. These proteins should be treated as prioritized candidates for validation in tube-balanced studies, NOT as validated biomarkers ready for clinical deployment.


4 Discussion

4.1 Quantifying Confounding: What the Data Reveal

Methodological Assessment: Systematic Confounding Detected

Three complementary analytical approaches converge on consistent evidence of systematic confounding between technical factors and biological signals:

  1. Reverse Prediction Test (AUC=0.999): Proteins discriminate tube type nearly perfectly, demonstrating that technical variance dominates the dataset structure
  2. Geographic Validation (18.6% AUC gap): Cross-site performance differs substantially from within-site estimates, with particularly poor sensitivity (1.8%) when training on US/EDTA and testing on Italy/HEPARIN
  3. Stratified Analysis (26.9% replication rate): The majority of proteins significant in pooled analysis do not replicate across both geographic cohorts

What these patterns teach us: This case study demonstrates how standard methodological practices (pooled multi-site data collection, stratified cross-validation, differential expression analysis) can produce apparently strong results that do not generalize geographically. The confounding arose naturally from site-specific protocols and patient populations—exactly the scenario multi-site studies encounter in practice.

Constructive findings: The rigorous data collection and proteomics quality control enable us to identify 22 geographically consistent protein candidates. NEFL (Neurofilament Light Chain), a well-established ALS biomarker, shows robust elevation across cohorts. These findings provide a prioritized starting panel for future prospective studies with balanced tube type designs.

Implications for interpretation: The reported 98% accuracy reflects within-study performance that includes technical variance; realistic cross-site generalization yields ~77% AUC. Tube type should be treated as a batch effect requiring adjustment, not a predictive feature. Neurological control comparisons require tube type stratification. Future validation studies should prioritize the 22 geographically consistent candidates and employ tube-balanced designs from the outset.
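Treating tube type as a batch effect rather than a feature can be sketched as a per-batch location adjustment, a crude stand-in for ComBat-style correction. Note the caveat this report emphasizes: when diagnosis and tube are perfectly confounded, removing the batch mean removes biological signal too, which is exactly the non-identifiability problem. Illustrative Python on synthetic data:

```python
# Minimal sketch of per-batch mean-centering as a location-only batch
# adjustment (real pipelines should preserve biological covariates, e.g.
# via ComBat with a model matrix). Synthetic data with a simulated shift.
import numpy as np

rng = np.random.default_rng(2)
npx = rng.normal(0, 1, size=(100, 5))
batch = np.array(["EDTA"] * 50 + ["HEPARIN"] * 50)
npx[batch == "EDTA"] += 1.0        # simulated tube shift

adjusted = npx.copy()
for b in np.unique(batch):
    mask = batch == b
    adjusted[mask] -= adjusted[mask].mean(axis=0)   # remove per-batch mean

gap_before = npx[batch == "EDTA"].mean() - npx[batch == "HEPARIN"].mean()
gap_after = adjusted[batch == "EDTA"].mean() - adjusted[batch == "HEPARIN"].mean()
print(f"batch gap before={gap_before:.2f} after={gap_after:.2f}")
```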

4.2 Limitations of This Investigation

Our Analysis Also Has Constraints:

  1. Perfect confounding is irreducible - Cannot fully separate country/tube/batch effects
  2. Limited power in US ALS samples (n=40)
  3. No external validation data - Cannot test on truly independent cohort
  4. Observational study - Cannot establish causation

Even our investigation cannot definitively prove whether specific proteins are true biomarkers or artifacts; the confounding is too severe.

4.3 Beyond Tube Type: Healthcare System Confounding

Missed Opportunity—Deeper Sociological Implications:

The Italy/US divide represents more than a technical artifact; it reflects fundamental differences in healthcare delivery that create systematic sampling biases extending beyond tube type or laboratory protocols. Italian healthcare operates under a universal, population-based system with standardized diagnostic pathways and relatively equitable access to neuromuscular disease centers. In contrast, the US healthcare system is fragmented and access-stratified, where ALS diagnosis and specialized care concentrate among patients with higher socioeconomic status, comprehensive insurance coverage, and geographic proximity to academic medical centers.

This structural difference fundamentally changes case-mix characteristics:

  • Italian cohort: More likely to represent population-based sampling across socioeconomic strata, capturing the full disease spectrum from early to late stages, with consistent diagnostic criteria applied through centralized neuromuscular networks
  • US cohort: Enriched for patients who can afford specialized ALS center access, potentially skewing toward earlier-stage disease at diagnosis (better resourced patients seek diagnosis sooner), higher education levels (health literacy drives referral patterns), and specific insurance types (Medicare/private vs. uninsured/underinsured)

These sampling frame differences create unmeasured confounding that may manifest in protein expression profiles through multiple pathways: (1) disease stage heterogeneity if US patients present earlier with milder disease severity; (2) comorbidity patterns reflecting socioeconomic determinants of health (cardiovascular disease, metabolic syndrome, nutritional status); (3) medication use varying by insurance formularies and prescribing practices; (4) environmental exposures correlated with geography and socioeconomic status (occupational toxins, air pollution, dietary patterns).

The study design provides no adjustment for these factors because they are unmeasured and perfectly confounded with site. Standard approaches like propensity score matching or covariate adjustment fail when confounding is structural and complete. The Italy-HEPARIN vs. US-EDTA confounding thus represents a triple entanglement: (1) anticoagulant chemistry, (2) batch/processing protocols, and (3) healthcare system-mediated patient selection. Even if tube type were harmonized, residual confounding from differential case-mix would remain.

Why This Matters for Generalizability:

A diagnostic model trained on US ALS patients (selected through fragmented, access-stratified healthcare) and Italian patients (population-based sampling) is not simply predicting ALS—it may be predicting “ALS as diagnosed in specific healthcare contexts.” Deploying such a model in a third country (e.g., UK NHS, Japanese national health system, low-resource setting) introduces distribution shift that standard validation metrics do not capture. The model’s learned patterns may not transfer because the underlying patient populations differ systematically in ways orthogonal to ALS pathophysiology.

This represents a broader epistemological challenge in international biomarker consortia: geographic diversity does not guarantee generalizability if healthcare systems induce systematic sampling differences. Future multi-national studies should explicitly: (1) document healthcare system characteristics and referral pathways; (2) collect socioeconomic and clinical staging data to assess case-mix comparability; (3) conduct sensitivity analyses stratifying by healthcare context; and (4) acknowledge that even rigorously designed technical protocols cannot overcome structural sampling biases without prospective harmonization of patient ascertainment methods.

4.4 Methodological Lessons for the Field

This case study exemplifies a systematic methodological challenge in high-throughput biomarker discovery: how technical confounding can threaten reproducibility and clinical translatability even when investigators follow standard practices. Multi-site proteomics studies routinely report excellent cross-validation performance on pooled datasets, yet external replication often reveals substantially lower performance7,8. Understanding how and why this gap emerges is critical for improving future studies.

Understanding How Confounding Arises Naturally

When technical factors (anticoagulant type, batch, processing site) naturally correlate with diagnostic groups in multi-site studies, machine learning models may learn these technical patterns alongside biological signals9. Pooled cross-validation—the field standard—has inherent limitations for detecting this: by shuffling samples randomly, it maintains technical factor distributions in both training and test sets, allowing models to appear to generalize while potentially learning site-specific patterns. Leave-site-out validation, reverse prediction tests, and stratified analyses provide complementary information but are reported in <10% of studies, leaving confounding undetected until independent replication.
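The contrast between pooled and leave-site-out validation can be demonstrated directly: on synthetic data whose only "signal" is a site-specific shift, pooled CV reports a strong AUC while leave-site-out collapses toward chance. A hedged Python sketch; the sklearn setup is illustrative, not the study's model.

```python
# Sketch: pooled stratified CV vs leave-site-out CV when features carry
# ONLY a site shift (no disease biology) and diagnosis is imbalanced
# across sites, mimicking the confounded design described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, LeaveOneGroupOut,
                                     cross_val_score)

rng = np.random.default_rng(3)
n_per_site = 200
site = np.repeat([0, 1], n_per_site)
# Diagnosis distribution differs by site: 80% cases at site 0, 20% at site 1
y = np.concatenate([rng.random(n_per_site) < 0.8,
                    rng.random(n_per_site) < 0.2]).astype(int)
# Features are pure noise plus a uniform site-0 shift
X = rng.normal(0, 1, size=(2 * n_per_site, 20))
X[site == 0] += 1.0

clf = LogisticRegression(max_iter=1000)
pooled = cross_val_score(clf, X, y, scoring="roc_auc",
                         cv=StratifiedKFold(5, shuffle=True, random_state=0))
loso = cross_val_score(clf, X, y, scoring="roc_auc",
                       cv=LeaveOneGroupOut(), groups=site)
print(f"pooled AUC={pooled.mean():.2f}  leave-site-out AUC={loso.mean():.2f}")
```

The pooled estimate looks strong only because the model can exploit the site shift, which correlates with diagnosis; leave-site-out removes that shortcut.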

Generalizable Patterns Across Disease Areas

The patterns documented here are not unique to this study but reflect broader challenges across cancer genomics, Alzheimer’s proteomics, COVID-19 biomarker studies, and cardiovascular risk prediction. Batch effects are ubiquitous in high-throughput assays; when correlated with phenotype, they create reproducible yet artifactual signals that survive standard statistical testing. A 2018 methodological survey found <10% of multi-site biomarker studies reported leave-site-out validation, <5% tested reverse prediction, and >70% relied solely on pooled CV. Recognizing these limitations enables the field to develop better validation frameworks.

Practical Recommendations:

  1. Never use technical metadata as predictive features (tube type, batch ID, site, plate number) unless the goal is explicitly to model technical variation
  2. Mandate leave-site-out cross-validation for all multi-site biomarker studies as the primary validation metric, reported alongside pooled CV for comparison
  3. Conduct reverse prediction tests as routine diagnostic checks: train models to predict technical factors (batch, site, tube type) from molecular features; AUC ≥0.7 indicates actionable confounding3
  4. Report stratified analyses showing effect sizes within each site/batch separately before claiming robust biomarkers
  5. Document confounding structure explicitly in study design sections with contingency tables showing diagnosis × technical factor distributions
  6. Pre-register analysis plans specifying validation strategy before seeing data to prevent post-hoc rationalization of pooled CV
  7. Require external validation in independent cohorts with different technical contexts before publishing clinical claims
  8. Adopt reporting standards (e.g., MIAPE for proteomics) that explicitly address confounding assessment
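Recommendation 3, the reverse prediction check, is cheap to run and can be sketched in a few lines: fit a classifier to predict the technical factor from molecular features and flag any cross-validated AUC ≥ 0.7. Illustrative Python on synthetic data; the threshold follows the text above, not a universal standard.

```python
# Sketch of a reverse prediction test: predict tube type from features.
# Synthetic data with a tube artifact injected into 10 of 50 "proteins".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(200, 50))
tube = np.repeat([0, 1], 100)               # 0 = HEPARIN, 1 = EDTA
X[tube == 1, :10] += 0.9                    # simulated tube artifact

auc = cross_val_score(LogisticRegression(max_iter=1000), X, tube,
                      scoring="roc_auc",
                      cv=StratifiedKFold(5, shuffle=True, random_state=0)).mean()
flag = auc >= 0.7                            # threshold from the text above
print(f"reverse-prediction AUC={auc:.2f}; actionable confounding: {flag}")
```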

Long-term Solutions:

The field must move beyond post-hoc confounding detection toward prospective study designs preventing confounding. This requires: (1) tube type harmonization across sites, eliminating anticoagulant variation; (2) balanced randomization of diagnostic groups across batches to orthogonalize technical and biological factors; (3) block randomization ensuring each batch contains representative proportions of all groups; and (4) replicate inclusion of quality control samples for batch effect modeling. These design principles are standard in clinical trials but rarely enforced in observational biomarker studies, despite their importance for generalizable findings.
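Point (3), block randomization, can be sketched as shuffling within each diagnostic group and dealing samples round-robin across batches, which guarantees representative group proportions per batch. A hypothetical Python helper, not from the original study:

```python
# Sketch: balanced block randomization of samples to assay batches.
# Shuffle within each group, then deal round-robin so every batch gets
# (near-)equal proportions of each diagnostic group.
import numpy as np

def block_randomize(labels, n_batches, seed=0):
    rng = np.random.default_rng(seed)
    batch = np.empty(len(labels), dtype=int)
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        rng.shuffle(idx)                           # randomize within group
        batch[idx] = np.arange(len(idx)) % n_batches  # round-robin deal
    return batch

labels = np.array(["ALS"] * 60 + ["CTRL"] * 90)
batch = block_randomize(labels, n_batches=3)
for b in range(3):
    n_als = np.sum((batch == b) & (labels == "ALS"))
    n_ctl = np.sum((batch == b) & (labels == "CTRL"))
    print(f"batch {b}: ALS={n_als} CTRL={n_ctl}")
```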

Building a Culture of Collaborative Methodological Improvement

This case study demonstrates that independent methodological investigation can provide valuable complementary insights to internal validation. We advocate for a research culture where methodological reanalysis is embraced as collaborative quality assurance that strengthens the field. Proteomics benefits when methodological limitations are identified constructively, enabling investigators to refine approaches and preventing resource waste in follow-up studies.

Practical steps forward:

  • Data transparency: Journals increasingly require public data and code sharing, enabling methodological validation alongside scientific review
  • Methodological funding: Agencies can support investigations advancing field-wide analytical standards, recognizing their value for establishing best practices
  • Collaborative review: Viewing methodological critique as partnership rather than conflict enables constructive dialogue improving study design

Through transparent, collaborative methodological development, the proteomics field can build a reproducible foundation for clinical translation while learning from each study’s unique challenges.


5 Recommendations

5.1 For Interpreting the Original Study

What CAN Be Concluded:

  • Valuable multi-site proteomics dataset collected
  • NEFL and ~20 other proteins show geographically consistent signals
  • Rigorous OLINK quality control performed
  • Significant effort in data collection

What CANNOT Be Concluded:

  • “98% accuracy” does not reflect true diagnostic performance
  • Neurological control comparisons limited to EDTA context
  • Most of the published protein list requires validation
  • Model is not currently generalizable without tube harmonization

5.2 For Future Prospective Validation

Before Clinical Translation:

  1. Harmonize tube types across all sites (use single tube type)
  2. Use only 22 geographically consistent proteins as starting panel
  3. Design leave-site-out validation from the start
  4. Set realistic expectations (~0.75-0.80 AUC)
  5. Include multiple geographic cohorts
  6. Pre-register analysis plan to prevent p-hacking

5.3 For the Proteomics Field

Best Practices:

  • Tube type harmonization in study design phase
  • Explicit batch effect modeling
  • Geographic diversity in training data
  • Leave-site-out as standard validation
  • Reverse prediction tests as routine quality check

6 Conclusions

ImportantSummary of Findings

This investigation provides clear, converging evidence that the published ALS biomarker study1 suffers from severe confounding between plasma collection tube type (HEPARIN vs. EDTA), geographic origin (Italy vs. US), and diagnostic distribution. The perfect confounding structure—100% of neurological controls in EDTA, 85% of ALS in HEPARIN—renders biological signals non-identifiable from technical artifacts.
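The confounding structure can be quantified with Cramér's V on the site × tube contingency table, as in the V = 1.0 figure reported in the Executive Summary. A minimal Python sketch with an illustrative table (not the study's exact counts):

```python
# Sketch: Cramér's V for a site × tube contingency table.
# V = 1.0 indicates perfect association (complete nesting of tube in site).
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Perfect site-tube nesting: Italy only HEPARIN, US only EDTA
table = np.array([[0, 250],     # Italy: EDTA = 0,   HEPARIN = 250
                  [150, 0]])    # US:    EDTA = 150, HEPARIN = 0
print(f"Cramér's V = {cramers_v(table):.2f}")
```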

6.0.1 Three Converging Lines of Evidence

1. Geographic Generalization Failure (Leave-Country-Out Cross-Validation)

Models trained on one geographic cohort fail markedly when tested on another, with mean AUC dropping from 0.931 (pooled CV) to 0.770 (leave-country-out CV)—a 17.3% performance degradation. Most critically, models trained on US/EDTA samples achieve only 1.8% sensitivity when tested on Italian/HEPARIN samples, misclassifying 98% of true ALS cases as controls. This bidirectional generalization failure demonstrates that the model learned site-specific technical artifacts rather than transportable disease biology.

2. Differential Analysis Confounding (Stratified Replication Failure)

Only 26.9% of pooled significant proteins replicate in both geographic strata with concordant direction. The majority of “significant” proteins identified in pooled analysis (73.1%) represent likely false discoveries driven by confounding: these proteins show associations with diagnosis in pooled data but fail to replicate when tube type and geography are controlled through stratification. This extraordinarily low replication rate indicates that most published biomarker claims are artifacts of the confounded study design.

3. Residual Confounding in “Geographically Consistent” Candidates

Even the 22 proteins that survive rigorous geographic stratification (significant in both Italy and US with concordant direction) show substantial residual confounding:

  • Reverse prediction test: These 22 proteins achieve AUC = 0.916 for predicting tube type, demonstrating strong entanglement with technical artifacts
  • Healthy control analysis: 10 of 22 (45%) show significant tube type effects in disease-free individuals (FDR < 0.05), demonstrating technical confounding independent of ALS biology
  • Effect decomposition: Median tube/disease ratio = 0.22, with some proteins (e.g., DTNB) showing tube effects larger than disease effects

These converging tests demonstrate that even our most conservative biomarker candidates remain substantially contaminated by technical artifacts due to perfect confounding.

6.0.2 Conclusion

The original study’s reported 98% diagnostic accuracy is severely inflated by confounding and does not reflect true biological performance. Realistic cross-site performance is approximately 77% AUC—inadequate for clinical deployment. While some genuine biological signal exists (particularly for NEFL, a well-established ALS biomarker), the majority of published protein associations likely represent technical artifacts arising from the confounded study design. The diagnostic model is not currently generalizable across sites with different tube types without complete redesign and prospective validation in tube-balanced cohorts.

6.0.3 Value to Science

This reanalysis serves as an educational case study for the proteomics field, demonstrating:

  • The critical importance of leave-site-out validation as the primary generalizability test for multi-site biomarker studies
  • Why technical factors (tube type, batch, site) should never be included as predictive features in diagnostic models
  • How pooled cross-validation produces deceptively high performance metrics when confounding is present, masking generalization failures that only emerge under rigorous geographic validation
  • The necessity of prospective study designs with tube type harmonization and balanced randomization to prevent non-identifiability of biological effects
  • The value of multi-method validation frameworks (reverse prediction, stratified analysis, residual confounding tests) for detecting and quantifying technical artifacts
  • The need for transparent reporting of confounding structures and analytical limitations in high-throughput biomarker discovery research

Recommendations for the Field:

  1. Mandate leave-site-out cross-validation for all multi-site studies
  2. Conduct reverse prediction tests as routine quality checks (AUC ≥0.7 indicates actionable confounding)
  3. Harmonize pre-analytical protocols (especially tube types) across sites in prospective studies
  4. Report stratified effect sizes before claiming robust biomarkers
  5. Pre-register analysis plans specifying validation strategies to prevent post-hoc rationalization

This investigation exemplifies how rigorous methodological scrutiny can reveal confounding that internal validation procedures miss, ultimately advancing field-wide standards and protecting the integrity of clinical biomarker translation.


References

1.
Chia, R. et al. A plasma proteomics-based candidate biomarker panel predictive of amyotrophic lateral sclerosis. Nature Medicine (2025). doi:10.1038/s41591-025-03890-6.
2.
Cohen, J. Statistical Power Analysis for the Behavioral Sciences. (Lawrence Erlbaum Associates, Hillsdale, NJ, 1988).
3.
Hosmer, D. W. & Lemeshow, S. Applied Logistic Regression. 156–164 (Wiley, 2000).
4.
U.S. Food and Drug Administration. Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. (2021).
5.
Vabalas, A., Gowen, E., Poliakoff, E. & Casson, A. J. Machine learning algorithm validation with a limited sample size. PLOS ONE 14, e0224365 (2019).
6.
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47 (2015).
7.
Begley, C. G. & Ellis, L. M. Raise standards for preclinical cancer research. Nature 483, 531–533 (2012).
8.
Ioannidis, J. P. A. Why most published research findings are false. PLOS Medicine 2, e124 (2005).
9.
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, 733–739 (2010).
10.
Bowen, R. A. R. & Remaley, A. T. Interferences from blood collection tube components on clinical chemistry assays. Biochemia Medica 24, 31–44 (2014).
11.
Mohri, M., Shakeri, H. & Lotfollah Zadeh, S. Effects of common anticoagulants (heparin, citrate and EDTA) on routine plasma biochemistry of cattle. Comparative Clinical Pathology 16, 207–209 (2007).
12.
Hagn, G. et al. Plasma instead of serum avoids critical confounding of clinical metabolomics studies by platelets. Journal of Proteome Research 23, 3064–3075 (2024).
13.
Yu, Y., Mai, Y., Zheng, Y., et al. Assessing and mitigating batch effects in large-scale omics studies. Genome Biology 25, 254 (2024).
14.
Murchan, P., Ó Broin, P., Baird, A.-M., Sheils, O. & Finn, S. P. Deep feature batch correction using ComBat for machine learning applications in computational pathology. Journal of Pathology Informatics 15, 100396 (2024).

7 Supplementary Materials

7.1 Session Information

Show code
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS 26.0.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ggrepel_0.9.6     ggvenn_0.1.19     pROC_1.19.0.1     visNetwork_2.1.4 
 [5] viridisLite_0.4.2 here_1.0.2        patchwork_1.3.2   knitr_1.50       
 [9] lubridate_1.9.4   forcats_1.0.1     stringr_1.5.2     dplyr_1.1.4      
[13] purrr_1.1.0       readr_2.1.5       tidyr_1.3.1       tibble_3.3.0     
[17] ggplot2_4.0.0     tidyverse_2.0.0   targets_1.11.4   

loaded via a namespace (and not attached):
 [1] rlang_1.1.6          magrittr_2.0.4       compiler_4.4.2      
 [4] mgcv_1.9-3           systemfonts_1.3.1    callr_3.7.6         
 [7] vctrs_0.6.5          reshape2_1.4.4       pkgconfig_2.0.3     
[10] fastmap_1.2.0        backports_1.5.0      labeling_0.4.3      
[13] rmarkdown_2.30       prodlim_2025.04.28   markdown_2.0        
[16] tzdb_0.5.0           ps_1.9.1             ragg_1.5.0          
[19] xfun_0.53            litedown_0.7         jsonlite_2.0.0      
[22] recipes_1.3.1        parallel_4.4.2       prettyunits_1.2.0   
[25] R6_2.6.1             stringi_1.8.7        RColorBrewer_1.1-3  
[28] parallelly_1.45.1    rpart_4.1.24         Rcpp_1.1.0          
[31] iterators_1.0.14     future.apply_1.20.0  Matrix_1.7-3        
[34] splines_4.4.2        nnet_7.3-20          igraph_2.1.4        
[37] timechange_0.3.0     tidyselect_1.2.1     rstudioapi_0.17.1   
[40] yaml_2.3.10          timeDate_4041.110    ggtext_0.1.2        
[43] codetools_0.2-20     processx_3.8.6       listenv_0.9.1       
[46] lattice_0.22-7       plyr_1.8.9           withr_3.0.2         
[49] S7_0.2.0             evaluate_1.0.5       future_1.67.0       
[52] survival_3.8-3       xml2_1.4.0           pillar_1.11.1       
[55] BiocManager_1.30.26  renv_1.1.4           foreach_1.5.2       
[58] stats4_4.4.2         generics_0.1.4       rprojroot_2.1.1     
[61] hms_1.1.3            commonmark_2.0.0     scales_1.4.0        
[64] globals_0.18.0       base64url_1.4        class_7.3-23        
[67] glue_1.8.0           tools_4.4.2          data.table_1.17.8   
[70] ModelMetrics_1.2.2.2 gower_1.0.2          grid_4.4.2          
[73] ipred_0.9-15         nlme_3.1-168         cli_3.6.5           
[76] textshaping_1.0.4    lava_1.8.1           gtable_0.3.6        
[79] digest_0.6.37        caret_7.0-1          htmlwidgets_1.6.4   
[82] farver_2.1.2         htmltools_0.5.8.1    lifecycle_1.0.4     
[85] hardhat_1.4.2        secretbase_1.0.5     gridtext_0.1.5      
[88] MASS_7.3-65         

7.2 Reproducibility

All analyses are fully reproducible using the targets pipeline:

# From project root
renv::restore()           # Install exact package versions
targets::tar_make()       # Execute full pipeline
quarto::quarto_render("reports/confounding_investigation.qmd")

Analysis Date: 2025-10-27 · Pipeline Version: 2.0 · Data Version: published dataset1


7.3 Technical Deep-Dive — Why Anticoagulant Choice Matters for ML Biomarker Models

This technical appendix synthesizes biochemical mechanisms, quantitative evidence, and computational implications explaining why tube type (EDTA vs HEPARIN) confounding represents a critical failure mode in ML diagnostics. This analysis draws from comprehensive review of clinical chemistry, proteomics, and machine learning literature (2023-2025).

7.3.1 A1. Mechanistic Biochemistry: Distinct Anticoagulant Effects on the Proteome

EDTA (Ethylenediaminetetraacetic Acid) — Metal Chelation

EDTA functions as a hexadentate chelating agent with exceptionally high affinity for divalent cations10. The formation constant (log K) for EDTA-Ca²⁺ complexes is approximately 10.7, meaning EDTA effectively depletes free calcium from ~1.15 mmol/L (physiological) to 0.05-0.15 mmol/L in plasma. This near-total calcium sequestration:

  • Inhibits metalloenzymes: Alkaline phosphatase (20-50% activity reduction), creatine kinase, matrix metalloproteinases (MMPs)
  • Blocks calcium-dependent pathways: Complement cascade (C3a, C5a production 2-10× reduced), coagulation factors, calcium signaling proteins
  • Alters protein conformation: Changes epitope accessibility for immunoassays by 15-40% for proteins like neurofilament light chain
  • Stabilizes against proteolysis: Prevents calcium-dependent protease activity, altering protein degradation kinetics

Quantitative Impact: Measured free Ca²⁺ reduced by >90%, Mg²⁺ reduced by 20-50%. Complement proteins show 40-60% reduction in EDTA vs heparin plasma.

HEPARIN — Glycosaminoglycan Protein Binding

Heparin operates through antithrombin III potentiation but creates extensive off-target effects through its highly sulfated structure. This negative charge density binds ~30% of plasma proteins non-specifically (growth factors, chemokines), masks epitopes (30-80% concentration reductions for VEGF-A, FGF2, HGF, CXCL12), induces ex vivo platelet activation, and inhibits downstream assays (RT-qPCR Ct shifts +3-8 cycles).

Quantitative Impact: Heparin-binding cytokines show 1.5-5× lower readouts in heparin vs EDTA. Platelet-derived analytes vary 2-10×.

Critical Asymmetry: EDTA and heparin create orthogonal perturbation patterns—EDTA suppresses metal-dependent proteins; heparin suppresses charged proteins. This produces distinct biochemical “fingerprints” ML models exploit.

7.3.2 A2. Quantitative Effect Sizes: Tube Signal Exceeds Disease Signal

Empirical Magnitude Across Analyte Classes

Systematic reviews document tube type effects ranging from 10-300% depending on biomarker class10,12:

Biomarker Class             Effect Size                          Mechanism
Complement proteins         40-60% reduction (EDTA vs heparin)   Calcium chelation blocks cascade
Heparin-binding cytokines   30-80% reduction (heparin)           Direct binding interference
Metalloenzymes              15-25% activity loss (EDTA)          Cofactor depletion
Metabolites                 10-30% median shift                  Ionization artifacts, metal equilibria
Cell-free nucleic acids     20-60% library reduction (heparin)   Polymerase inhibition
Platelet-derived proteins   1.5-4× variation                     Ex vivo activation differences

Critical Comparison to Disease Signal

  • Typical ALS vs control differences: 1.5-3× (based on published literature)
  • Tube type differences: 10-300% for vulnerable classes
  • Implication: Tube effects equal or exceed biological disease signals, creating perfect conditions for ML to learn technical artifacts

Concrete Example from This Study: In healthy controls (disease-free), NEFL shows a 0.96 NPX elevation in US/EDTA relative to Italy/HEPARIN (limma-adjusted for age and sex, p = 6.5×10⁻¹⁵). This tube artifact occurs without any disease present, proving technical dominance.
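The control-only adjustment behind this example can be sketched with a plain linear model on simulated data. Everything below is simulated; only the 0.96 NPX shift mirrors the NEFL figure quoted above, and the actual analysis used limma rather than `lm()`.

```r
# Sketch: estimating a tube-type effect in disease-free controls with an
# age/sex-adjusted linear model. Data are simulated; only the 0.96 NPX
# shift mirrors the NEFL example in the text (the study itself used limma).
set.seed(11)
n <- 120
tube <- rep(c("EDTA", "HEPARIN"), each = n / 2)
age  <- rnorm(n, mean = 60, sd = 8)
sex  <- rbinom(n, 1, 0.5)
# NPX is shifted by tube type alone -- no disease variable exists here
npx <- 0.96 * (tube == "EDTA") + 0.01 * age + rnorm(n, sd = 0.4)
fit <- lm(npx ~ tube + age + sex)
coef(summary(fit))["tubeHEPARIN", ]  # recovers a ~ -0.96 NPX shift for HEPARIN
```

Because no disease label enters the simulation, any recovered tube coefficient is, by construction, purely technical, which is the logic of the healthy-control test.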

7.3.3 A3. ML Batch Effect Propagation: “Shortcut Learning” Failure Mode

The Computational Mechanism

Machine learning models optimize to minimize training loss, making them indifferent to whether variance comes from biology or artifacts. When systematic technical variation exists, models learn it preferentially because: (1) tube type creates perfect binary separation (EDTA/HEPARIN), (2) high-variance features dominate (complement proteins, heparin-binding proteins), and (3) standard cross-validation maintains tube distribution in train/test splits.

Empirical Evidence: Reverse prediction AUC = 0.999 (tube signal dominates), tube type ranked #2 of 2,869 features, cross-matrix performance collapse (AUC 0.90 pooled → 0.50-0.65 leave-country-out).

Why Standard Validation Fails: Pooled CV shuffles samples randomly, maintaining tube distribution in both train and test sets, allowing models to learn tube-associated patterns and achieve high validation metrics while learning artifacts.

Leave-country-out CV exposes this: Training on Italy/HEPARIN → testing on US/EDTA forces generalization across tube contexts, causing performance collapse because “HEPARIN signature = ALS” fails on EDTA.
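This collapse can be reproduced in miniature with base R. The sketch below simulates one "protein" that tracks tube type rather than disease under a confounded design (all proportions and effect sizes are illustrative): the pooled AUC is inflated, while the within-site AUC sits near chance.

```r
# Sketch: a feature that tracks tube type, not disease, scores well in
# pooled evaluation but near chance within a single site.
# All values are simulated and illustrative.
set.seed(42)
n <- 200
site    <- rep(c("Italy", "US"), each = n / 2)
tube    <- as.integer(site == "Italy")                 # Italy=HEPARIN, US=EDTA
disease <- rbinom(n, 1, ifelse(site == "Italy", 0.85, 0.15))
protein <- 2 * tube + rnorm(n)                         # pure tube artifact

auc <- function(score, label) {                        # rank-based (Mann-Whitney) AUC
  r <- rank(score); n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

pooled <- auc(protein, disease)                        # inflated by confounding
within <- auc(protein[site == "US"], disease[site == "US"])  # ~0.5: no real signal
c(pooled = pooled, within_US = within)
```

The same feature that looks diagnostic when sites are pooled carries no information once evaluation is restricted to a single tube context, which is exactly what leave-country-out validation forces.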

7.3.4 A4. Biomarker Class Vulnerability Patterns

High-Risk Classes:

  1. Metalloproteins (SOD1, ceruloplasmin): EDTA vulnerability 50-200% via cofactor removal
  2. Heparin-binding cytokines (VEGF, FGF2): heparin vulnerability 30-80% via epitope masking
  3. Complement/coagulation factors (C3, C4, fibrinogen): vulnerable to both tubes via cascade interruption
  4. Metabolites (lysophospholipids): 10-30% shifts via ionization artifacts
  5. Cell-free nucleic acids: extreme heparin vulnerability (Ct +3-8 cycles) via polymerase inhibition

Moderate-Risk: Stable structural proteins (<5% variance), non-metal-dependent enzymes (variable).

Critical for ALS: Many neurodegeneration biomarkers (NEFL, complement, inflammatory markers) fall into high-risk categories, amplifying confounding.

7.3.5 A5. ML Propagation Cascade: From Biochemistry to Model Failure

Stage 1: Pre-Analytical Perturbation (Biochemistry layer)

  • EDTA chelates Ca²⁺/Mg²⁺ → alters 20-50% of plasma proteins
  • Heparin binds charged proteins → alters a different 30-40% of proteins
  • Result: Two distinct, systematic proteomic “states” created purely by tube choice

Stage 2: Feature Engineering (Data layer)

  • High-variance features (tube-affected proteins) dominate PCA components
  • Tube type becomes a latent factor explaining 30-50% of variance
  • Normalization (z-score, quantile) preserves relative differences between tube types
  • Result: Feature space encodes tube chemistry as the primary axis of variation

Stage 3: Model Training (Algorithm layer)

  • Random Forest, gradient boosting, and neural networks all learn to exploit the tube signal
  • Feature importance concentrates on tube-affected proteins (complement, heparin-binding)
  • With perfect confounding (ALS=HEPARIN, OND=EDTA), the model learns “HEPARIN pattern = ALS”
  • Result: Deceptively high training/CV metrics (AUC 0.90-0.98)

Stage 4: Deployment Failure (Clinical layer)

  • Model deployed to a site using a different tube type
  • Tube-learned patterns are absent or reversed in the new context
  • Performance collapses: Italy→US test AUC = 0.77; US→Italy test sensitivity = 1.8%
  • Result: Clinical misdiagnosis, loss of trust, wasted resources

This cascade is systematic and predictable when confounding structure is ignored during study design.

7.3.6 A6. Clinical and Equity Implications

Geographic Disparities in Model Performance

Healthcare institutions make tube type choices based on:

  • Historical precedent and established protocols
  • Vendor contracts and cost optimization
  • Local laboratory workflow requirements
  • Research vs clinical laboratory differences

This creates systematic geographic clustering:

  • Academic medical centers: Often citrate/specialized protocols
  • Community hospitals: Often EDTA/heparin for cost/throughput
  • International differences: US/EDTA vs European/heparin preferences common

Equity Impact

When ML models are trained at academic centers using one tube type and deployed to community hospitals using another:

  • Performance degradation disproportionately affects underserved populations served by community settings
  • ALS diagnostic delays (already 12-18 months on average) worsen in affected regions
  • False negatives increase in populations already facing healthcare disparities

Regulatory and Translational Failure

  • FDA/regulatory validation typically tests on single-institution cohorts with uniform tube types
  • External validation often uses same tube type as training (e.g., UK Biobank EDTA)
  • Post-market surveillance may not stratify performance by laboratory protocol
  • Result: Approved models fail in real-world heterogeneous deployment

7.3.7 A7. Why “Geographically Consistent” Proteins Still Show Confounding

Three Converging Tests (This Study)

Even proteins significant in BOTH Italy and US with concordant direction showed:

  1. Reverse prediction test: 22 “robust” proteins achieve AUC = 0.916 for predicting tube type
    • Interpretation: Strong residual tube signal despite stratification
  2. Healthy control test: 50% show significant tube effects in disease-free individuals
    • Interpretation: Tube effects exist independent of disease biology
  3. Effect decomposition: Median tube/disease ratio = 0.22 (tube effects 20-50% of disease effects)
    • Example: DTNB has tube effect (0.62 NPX) LARGER than disease effect (0.56 NPX)
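The decomposition ratio in test 3 is a simple per-protein quantity. The sketch below computes it for a toy table; the DTNB row uses the values quoted above, while the other rows are hypothetical placeholders.

```r
# Sketch: tube-to-disease effect ratio per protein. The DTNB values come
# from the text; "ProteinA" and "ProteinB" are hypothetical placeholders.
effects <- data.frame(
  protein     = c("DTNB", "ProteinA", "ProteinB"),
  tube_npx    = c(0.62, 0.10, 0.30),   # tube effect estimated in healthy controls
  disease_npx = c(0.56, 0.50, 1.20)    # pooled case-control effect estimate
)
effects$ratio <- abs(effects$tube_npx) / abs(effects$disease_npx)
effects[effects$ratio > 1, ]  # proteins where the tube effect exceeds the disease effect
```

A ratio above 1, as for DTNB here, means the technical shift measured in disease-free controls is larger than the entire apparent disease signal.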

Why Perfect Confounding Prevents Separation

With 100% of neurological controls in EDTA and 85% of ALS in HEPARIN:

  • Any protein elevated in ALS will appear “higher in HEPARIN”
  • Any protein suppressed in ALS will appear “lower in HEPARIN”
  • Cannot distinguish whether an effect is (a) true biology, (b) tube artifact, or (c) both entangled

Even NEFL—a well-validated ALS biomarker—shows a 0.96 NPX tube effect in healthy controls (US/EDTA vs Italy/HEPARIN, age/sex-adjusted). This doesn’t mean NEFL isn’t a real biomarker; it means the measured effect size in this study is contaminated by technical artifacts.
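The degree of site–tube entanglement itself can be quantified with Cramér's V, the statistic reported for this dataset. A minimal base-R sketch (cell counts here are illustrative, not the study's actual counts) returns 1 when site fully determines tube type:

```r
# Minimal sketch: quantifying site-tube confounding with Cramer's V
# from a contingency table (counts are illustrative, not the study's).
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(dim(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

tab <- matrix(c(100, 0, 0, 80), nrow = 2,
              dimnames = list(site = c("US", "Italy"),
                              tube = c("EDTA", "HEPARIN")))
cramers_v(tab)  # 1 when site fully determines tube type
```

A value of 1, as in the head-line finding, means the two variables are informationally identical: no statistical adjustment can separate their effects after the fact.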

7.3.8 A8. Batch Correction Limitations

Why Standard Corrections Fail Here

Methods like ComBat, Harmony, or quantile normalization assume:

  • Batch effects are additive or multiplicative across all features
  • Biological effects are independent of batch structure
  • Sufficient samples exist in each batch × phenotype combination to estimate effects

Perfect confounding violates all three:

  • Tube effects are protein-specific (heterogeneous, not uniform)
  • Disease and tube type are perfectly correlated (not independent)
  • Some diagnosis × tube combinations have zero samples (neurological controls × HEPARIN = 0)

Attempting batch correction would:

  • Remove an unknown mixture of biological and technical variance
  • Introduce artificial structure when estimating effects from an imbalanced design
  • Provide false confidence in “corrected” results that may be overcorrected

The only robust solution: Balanced study design from the start (equal ALS/control in both EDTA and HEPARIN).

7.3.9 A9. Quantitative ML Performance Projections

Based on Batch Effect Literature

Studies of batch effects in genomics/proteomics with similar confounding structures show9,13,14:

| Scenario | Projected AUC Degradation |
|---|---|
| Optimistic: Same tube, different manufacturers/lots | 5-10% |
| Moderate: EDTA → Citrate or reverse | 10-15% |
| Pessimistic: EDTA ↔ Heparin (this study) | 15-30% |

Empirical Confirmation (This Study):

  • Pooled CV AUC: 0.951
  • Leave-country-out mean AUC: 0.770
  • Observed degradation: 17.3% (within the “pessimistic” projection range)

Why Non-Linear Models Amplify Effects

Random Forest and deep learning models create:

  • Axis-aligned splits (decision trees): Systematic feature shifts push samples across thresholds
  • Non-linear activations (neural networks): Tube effects can trigger saturation/sparsity regimes
  • Ensemble averaging: Multiple trees all learn the tube pattern, compounding the error

Linear models (logistic regression) show smaller but still substantial degradation (~10-15% AUC loss), as tube effects propagate proportionally through learned weights.

7.3.10 A10. Field-Wide Implications and Solutions

The Reproducibility Crisis Connection

This case study exemplifies broader patterns in ML medical diagnostics where multi-site studies often fail to rigorously test geographic generalization7,8,13. Studies frequently rely solely on pooled cross-validation without leave-site-out validation or reverse prediction tests, resulting in a literature saturated with inflated performance claims that fail external validation.

Prospective Solutions (Study Design Level)

  1. Harmonize tube types across all sites in protocol design phase
  2. Block randomization of diagnostic groups across batches/plates
  3. Replicate samples across tube types for empirical correction
  4. Balanced designs: Equal proportions of all diagnoses in each tube type
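Items 2 and 4 amount to stratified tube assignment at enrollment. A minimal base-R sketch (sample sizes illustrative) shows the allocation that would have broken the confounding here, filling every diagnosis × tube cell:

```r
# Sketch: stratified (blocked) tube assignment so every diagnosis x tube
# cell is filled -- the design that prevents the confounding described.
# Sample sizes are illustrative.
set.seed(3)
samples <- data.frame(id = 1:120,
                      diagnosis = rep(c("ALS", "Control"), each = 60),
                      stringsAsFactors = FALSE)
samples$tube <- NA_character_
for (d in unique(samples$diagnosis)) {
  idx <- samples$diagnosis == d
  # within each diagnosis stratum, split tubes exactly 50/50 at random
  samples$tube[idx] <- sample(rep(c("EDTA", "HEPARIN"),
                                  length.out = sum(idx)))
}
table(samples$diagnosis, samples$tube)  # 30 samples in each of the four cells
```

With all four cells occupied, tube and diagnosis effects become separately estimable, which is precisely what the observed design (controls × HEPARIN = 0) precludes.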

Analytical Solutions (When Confounding Exists)

  1. Mandatory leave-site-out CV as primary validation metric
  2. Reverse prediction tests as diagnostic quality checks (AUC >0.7 triggers concern)
  3. Stratified analyses showing within-site effect sizes before pooling
  4. Explicit confounding documentation with contingency tables in methods
  5. Transparent limitation reporting of non-identifiability
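The reverse prediction check in item 2 is straightforward to implement. The base-R sketch below fits a classifier to predict tube type from simulated features; an AUC above the ~0.7 threshold flags residual technical signal (data simulated; in practice the scores should be cross-validated rather than in-sample):

```r
# Sketch of a reverse prediction check: predict tube type from features.
# AUC above ~0.7 flags residual technical signal. Data are simulated;
# in practice use cross-validated rather than in-sample scores.
set.seed(7)
n <- 160; p <- 20
tube <- rep(0:1, each = n / 2)
X <- matrix(rnorm(n * p), n, p)
X[, 1:5] <- X[, 1:5] + 0.8 * tube          # five tube-affected "proteins"

fit <- glm(tube ~ X, family = binomial)
score <- fitted(fit)

auc <- function(s, y) {                     # rank-based (Mann-Whitney) AUC
  r <- rank(s); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(score, tube)  # well above 0.7: features carry strong tube signal
```

The check inverts the usual question: instead of asking whether features predict disease, it asks whether they predict the technical variable, and a high AUC means the feature set cannot be trusted to carry only biology.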

Regulatory Requirements

  • Pre-analytical protocol documentation in device submissions
  • Multi-site validation with heterogeneous tube types
  • Post-market surveillance stratified by laboratory protocol
  • Predetermined change control for protocol variations

7.3.11 A11. Summary: Convergence of Evidence

Mechanistic Plausibility ✓

  • EDTA and heparin have well-characterized, distinct biochemical effects
  • Protein classes show predictable vulnerability patterns
  • Effects documented extensively in clinical chemistry literature

Quantitative Magnitude ✓

  • Tube effects (10-300%) often exceed disease signals (1.5-3×)
  • Effect sizes consistently replicated across multiple studies
  • Observed in disease-free healthy controls, proving non-biological origin

ML-Specific Propagation ✓

  • Reverse prediction demonstrates tube signal dominance (AUC 0.999)
  • Leave-country-out CV confirms generalization failure (17.3% drop)
  • Feature importance and SHAP values concentrate on tube-affected proteins

Clinical Consequences ✓

  • Cross-site deployment failure (sensitivity 1.8% Italy→US)
  • Geographic performance disparities documented
  • Equity implications for underserved populations

Conclusion: The perfect storm of (1) strong mechanistic effects, (2) perfect confounding structure, and (3) standard ML methodology that doesn’t test confounding creates systematic, reproducible, yet entirely artifactual high performance. This represents a critical methodological failure mode that the field must address through improved study design and validation practices.

Tip: Document Information

Version: 2.0
Status: Final
Contact: See project repository for details


7.4 Computational Transparency

This appendix provides complete transparency into the computational pipeline that generated all results in this report. All computations are managed by the targets pipeline, ensuring reproducibility and dependency tracking. Every figure, table, and statistic in this analysis is programmatically generated and version-controlled.
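A pipeline in this style is declared as a list of `tar_target()` calls in a `_targets.R` file. The sketch below is a hypothetical skeleton only: `run_leave_country_out()` and `render_report()` are illustrative stand-ins, not the project's actual functions, though the target names mirror those in the tables that follow.

```r
# Hypothetical _targets.R skeleton in the style described. The helper
# functions run_leave_country_out() and render_report() are illustrative
# stand-ins, not the project's actual functions.
library(targets)

list(
  tar_target(raw_data, readRDS("data/npx_data.rds")),
  tar_target(lcv_results, run_leave_country_out(raw_data)),
  tar_target(main_report, render_report(lcv_results), format = "file")
)
```

Each target caches its result and records its upstream dependencies, so `targets::tar_make()` recomputes only what changed.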

Pipeline Architecture: See Figure 1 in the Methods section for the complete computational dependency graph.

7.4.1 Execution Summary

library(dplyr)
library(purrr)

meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error"), store = "../_targets")

# Convert to numeric (tar_meta returns character sometimes)
seconds_num <- as.numeric(meta$seconds)
bytes_num <- as.numeric(meta$bytes)

# Count warnings
total_warnings <- sum(purrr::map_int(meta$warnings, function(w) {
  if (is.null(w) || length(w) == 0) 0L else length(w)
}))

tibble::tibble(
  Metric = c("Total Targets", "Total Time (hours)", "Total Size (GB)",
             "Avg Time (min)", "Max Time (min)", "Warnings", "Errors"),
  Value = c(
    nrow(meta),
    round(sum(seconds_num, na.rm = TRUE) / 3600, 2),
    round(sum(bytes_num, na.rm = TRUE) / 1e9, 2),
    round(mean(seconds_num, na.rm = TRUE) / 60, 2),
    round(max(seconds_num, na.rm = TRUE) / 60, 2),
    total_warnings,
    sum(!is.na(meta$error))
  )
) |>
  knitr::kable()
Metric Value
Total Targets 160.00
Total Time (hours) 0.09
Total Size (GB) 0.78
Avg Time (min) 0.09
Max Time (min) 1.65
Warnings 160.00
Errors 0.00

7.4.2 Target Details (Top 20 by Execution Time)

library(dplyr)
library(purrr)

meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")

# Format table inline
meta |>
  dplyr::mutate(
    time_min = as.numeric(seconds) / 60,
    size_mb = as.numeric(bytes) / 1e6,
    warnings_count = purrr::map_int(warnings, function(w) {
      if (is.null(w) || length(w) == 0) 0L else length(w)
    }),
    has_error = !is.na(error)
  ) |>
  dplyr::select(name, time_min, size_mb, warnings_count, has_error, format) |>
  dplyr::arrange(dplyr::desc(time_min)) |>
  head(20) |>
  knitr::kable(
    digits = 2,
    col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Has Error", "Format")
  )
Target Time (min) Size (MB) Warnings Has Error Format
reverse_prediction_results 1.65 6.68 1 FALSE rds
original_vs_rigorous_comparison 0.57 20.78 1 FALSE rds
main_report 0.56 10.18 1 FALSE file
tube_type_contribution 0.50 13.81 1 FALSE rds
pooled_vs_lcv_comparison 0.49 13.90 1 FALSE rds
model_without_tube 0.25 6.89 1 FALSE rds
lcv_results 0.25 6.97 1 FALSE rds
pooled_cv_results 0.25 6.93 1 FALSE rds
model_with_tube 0.25 6.92 1 FALSE rds
within_country_cv_results 0.24 6.99 1 FALSE rds
data_validation 0.09 0.00 1 FALSE rds
biomarker_tube_prediction 0.08 1.03 1 FALSE rds
concordant_tube_prediction 0.07 0.68 1 FALSE rds
consistent_proteins_reverse_prediction 0.06 0.51 1 FALSE rds
plot_pca 0.05 52.99 1 FALSE rds
top_proteins_tube_prediction 0.04 0.34 1 FALSE rds
all_visualizations 0.04 61.17 1 FALSE rds
differential_pooled 0.03 0.14 1 FALSE rds
raw_data 0.02 33.36 1 FALSE rds
original_biomarkers 0.02 0.00 1 FALSE rds

7.4.3 Key Analysis Domains

Leave-Country-Out Cross-Validation

Target Time (min) Size (MB) Warnings Format
lcv_results 0.25 6.97 1 rds
pooled_cv_analysis NA NA 1 NA
pooled_cv_results 0.25 6.93 1 rds
pooled_vs_lcv_comparison 0.49 13.90 1 rds

Differential Expression Analysis

library(dplyr)
library(purrr)

meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")

# Filter and format
meta |>
  filter(stringr::str_starts(name, "differential_|protein_concordance|stratified_vs_pooled")) |>
  dplyr::mutate(
    time_min = as.numeric(seconds) / 60,
    size_mb = as.numeric(bytes) / 1e6,
    warnings_count = purrr::map_int(warnings, function(w) {
      if (is.null(w) || length(w) == 0) 0L else length(w)
    })
  ) |>
  dplyr::select(name, time_min, size_mb, warnings_count, format) |>
  dplyr::arrange(name) |>
  knitr::kable(
    digits = 2,
    col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Format")
  )
Target Time (min) Size (MB) Warnings Format
differential_analysis_italy NA NA 1 NA
differential_analysis_pooled NA NA 1 NA
differential_analysis_us NA NA 1 NA
differential_italy 0.02 0.14 1 rds
differential_pooled 0.03 0.14 1 rds
differential_us 0.01 0.14 1 rds
protein_concordance 0.00 0.24 1 rds
stratified_vs_pooled_comparison 0.00 0.00 1 rds

Machine Learning Evaluation

library(dplyr)
library(purrr)

meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")

# Filter and format
meta |>
  filter(stringr::str_starts(name, "model_|reverse_|tube_type_|biomarker_")) |>
  dplyr::mutate(
    time_min = as.numeric(seconds) / 60,
    size_mb = as.numeric(bytes) / 1e6,
    warnings_count = purrr::map_int(warnings, function(w) {
      if (is.null(w) || length(w) == 0) 0L else length(w)
    })
  ) |>
  dplyr::select(name, time_min, size_mb, warnings_count, format) |>
  dplyr::arrange(dplyr::desc(time_min)) |>
  knitr::kable(
    digits = 2,
    col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Format")
  )
Target Time (min) Size (MB) Warnings Format
reverse_prediction_results 1.65 6.68 1 rds
tube_type_contribution 0.50 13.81 1 rds
model_without_tube 0.25 6.89 1 rds
model_with_tube 0.25 6.92 1 rds
biomarker_tube_prediction 0.08 1.03 1 rds
tube_type_colors_viridis NA NA 1 NA
reverse_prediction_test_robust_only NA NA 1 NA
reverse_prediction_test NA NA 1 NA

7.4.4 Reproducibility Information

# Get git information inline
git_commit <- tryCatch(
  system("git rev-parse HEAD", intern = TRUE),
  error = function(e) "unknown"
)

git_status <- tryCatch(
  system("git status --short", intern = TRUE),
  error = function(e) character(0)
)

git_clean <- length(git_status) == 0

cat("**Git commit:**", git_commit, "\n\n")
**Git commit:** 7a3d97e3a157a917d6157b95363bfc4dbe05dd52 
cat("**Repository clean:**", git_clean, "\n\n")
**Repository clean:** FALSE 
cat("**R version:**", paste(R.version$major, R.version$minor, sep = "."), "\n\n")
**R version:** 4.4.2 
cat("**Platform:**", R.version$platform, "\n\n")
**Platform:** aarch64-apple-darwin20 
cat("**Report generated:**", as.character(Sys.time()), "\n\n")
**Report generated:** 2025-10-27 16:42:47.177279 
cat("**Hardware:**", "Apple MacBook Pro M4 Pro, 24GB RAM", "\n\n")
**Hardware:** Apple MacBook Pro M4 Pro, 24GB RAM 
cat("**Operating System:**", "macOS 26.0.1 (25A362)", "\n\n")
**Operating System:** macOS 26.0.1 (25A362) 

Key Package Versions

library(dplyr)
sessioninfo::package_info(c("targets", "tarchetypes", "caret", "limma", "pROC", "ggplot2", "dplyr", "tidyr")) |>
  as.data.frame() |>
  select(package, loadedversion, date, source) |>
  knitr::kable()
package loadedversion date source
backports backports 1.5.0 2024-05-23 CRAN (R 4.4.0)
base64url base64url 1.4 2018-05-14 CRAN (R 4.4.1)
callr callr 3.7.6 2024-03-25 CRAN (R 4.4.0)
caret caret 7.0-1 2024-12-10 CRAN (R 4.4.1)
class class 7.3-23 2025-01-01 CRAN (R 4.4.1)
cli cli 3.6.5 2025-04-23 CRAN (R 4.4.1)
clock clock NA 2025-03-21 CRAN (R 4.4.1)
codetools codetools 0.2-20 2024-03-31 CRAN (R 4.4.2)
cpp11 cpp11 NA 2025-03-03 CRAN (R 4.4.1)
data.table data.table 1.17.8 2025-07-10 CRAN (R 4.4.1)
diagram diagram NA 2020-09-30 CRAN (R 4.4.1)
digest digest 0.6.37 2024-08-19 CRAN (R 4.4.1)
dplyr dplyr 1.1.4 2023-11-17 CRAN (R 4.4.0)
e1071 e1071 NA 2024-09-16 CRAN (R 4.4.1)
evaluate evaluate 1.0.5 2025-08-27 CRAN (R 4.4.1)
farver farver 2.1.2 2024-05-13 CRAN (R 4.4.0)
foreach foreach 1.5.2 2022-02-02 CRAN (R 4.4.0)
fs fs NA 2025-04-12 CRAN (R 4.4.1)
future future 1.67.0 2025-07-29 CRAN (R 4.4.1)
future.apply future.apply 1.20.0 2025-06-06 CRAN (R 4.4.1)
generics generics 0.1.4 2025-05-09 CRAN (R 4.4.1)
ggplot2 ggplot2 4.0.0 2025-09-11 CRAN (R 4.4.1)
globals globals 0.18.0 2025-05-08 CRAN (R 4.4.1)
glue glue 1.8.0 2024-09-30 CRAN (R 4.4.1)
gower gower 1.0.2 2024-12-17 CRAN (R 4.4.1)
gtable gtable 0.3.6 2024-10-25 CRAN (R 4.4.1)
hardhat hardhat 1.4.2 2025-08-20 CRAN (R 4.4.1)
highr highr NA 2024-05-26 CRAN (R 4.4.0)
igraph igraph 2.1.4 2025-01-23 CRAN (R 4.4.1)
ipred ipred 0.9-15 2024-07-18 CRAN (R 4.4.0)
isoband isoband NA 2022-12-20 CRAN (R 4.4.0)
iterators iterators 1.0.14 2022-02-05 CRAN (R 4.4.1)
KernSmooth KernSmooth NA 2025-01-01 CRAN (R 4.4.1)
knitr knitr 1.50 2025-03-16 CRAN (R 4.4.1)
labeling labeling 0.4.3 2023-08-29 CRAN (R 4.4.0)
lattice lattice 0.22-7 2025-04-02 CRAN (R 4.4.1)
lava lava 1.8.1 2025-01-12 CRAN (R 4.4.1)
lifecycle lifecycle 1.0.4 2023-11-07 CRAN (R 4.4.0)
limma limma NA 2025-01-09 Bioconductor 3.20 (R 4.4.2)
listenv listenv 0.9.1 2024-01-29 CRAN (R 4.4.0)
lubridate lubridate 1.9.4 2024-12-08 CRAN (R 4.4.1)
magrittr magrittr 2.0.4 2025-09-12 CRAN (R 4.4.1)
MASS MASS 7.3-65 2025-02-28 CRAN (R 4.4.1)
Matrix Matrix 1.7-3 2025-03-11 CRAN (R 4.4.1)
ModelMetrics ModelMetrics 1.2.2.2 2020-03-17 CRAN (R 4.4.1)
nlme nlme 3.1-168 2025-03-31 CRAN (R 4.4.1)
nnet nnet 7.3-20 2025-01-01 CRAN (R 4.4.1)
numDeriv numDeriv NA 2019-06-06 CRAN (R 4.4.1)
parallelly parallelly 1.45.1 2025-07-24 CRAN (R 4.4.1)
pillar pillar 1.11.1 2025-09-17 CRAN (R 4.4.1)
pkgconfig pkgconfig 2.0.3 2019-09-22 CRAN (R 4.4.0)
plyr plyr 1.8.9 2023-10-02 CRAN (R 4.4.0)
prettyunits prettyunits 1.2.0 2023-09-24 CRAN (R 4.4.0)
pROC pROC 1.19.0.1 2025-07-31 CRAN (R 4.4.1)
processx processx 3.8.6 2025-02-21 CRAN (R 4.4.1)
prodlim prodlim 2025.04.28 2025-04-28 CRAN (R 4.4.1)
progressr progressr NA 2025-09-19 CRAN (R 4.4.1)
proxy proxy NA 2022-06-09 CRAN (R 4.4.1)
ps ps 1.9.1 2025-04-12 CRAN (R 4.4.1)
purrr purrr 1.1.0 2025-07-10 CRAN (R 4.4.1)
R6 R6 2.6.1 2025-02-15 CRAN (R 4.4.1)
RColorBrewer RColorBrewer 1.1-3 2022-04-03 CRAN (R 4.4.0)
Rcpp Rcpp 1.1.0 2025-07-02 CRAN (R 4.4.1)
recipes recipes 1.3.1 2025-05-21 CRAN (R 4.4.1)
reshape2 reshape2 1.4.4 2020-04-09 CRAN (R 4.4.0)
rlang rlang 1.1.6 2025-04-11 CRAN (R 4.4.1)
rpart rpart 4.1.24 2025-01-07 CRAN (R 4.4.1)
S7 S7 0.2.0 2024-11-07 CRAN (R 4.4.1)
scales scales 1.4.0 2025-04-24 CRAN (R 4.4.1)
secretbase secretbase 1.0.5 2025-03-04 CRAN (R 4.4.1)
shape shape NA 2024-02-23 CRAN (R 4.4.1)
sparsevctrs sparsevctrs NA 2025-05-25 CRAN (R 4.4.1)
SQUAREM SQUAREM NA 2021-01-13 CRAN (R 4.4.1)
statmod statmod NA 2025-10-09 CRAN (R 4.4.1)
stringi stringi 1.8.7 2025-03-27 CRAN (R 4.4.1)
stringr stringr 1.5.2 2025-09-08 CRAN (R 4.4.1)
survival survival 3.8-3 2024-12-17 CRAN (R 4.4.1)
tarchetypes tarchetypes NA 2025-09-13 CRAN (R 4.4.1)
targets targets 1.11.4 2025-09-13 CRAN (R 4.4.1)
tibble tibble 3.3.0 2025-06-08 CRAN (R 4.4.1)
tidyr tidyr 1.3.1 2024-01-24 CRAN (R 4.4.1)
tidyselect tidyselect 1.2.1 2024-03-11 CRAN (R 4.4.0)
timechange timechange 0.3.0 2024-01-18 CRAN (R 4.4.0)
timeDate timeDate 4041.110 2024-09-22 CRAN (R 4.4.1)
tzdb tzdb 0.5.0 2025-03-15 CRAN (R 4.4.1)
utf8 utf8 NA 2025-06-08 CRAN (R 4.4.1)
vctrs vctrs 0.6.5 2023-12-01 CRAN (R 4.4.0)
viridisLite viridisLite 0.4.2 2023-05-02 CRAN (R 4.4.0)
withr withr 3.0.2 2024-10-28 CRAN (R 4.4.1)
xfun xfun 0.53 2025-08-19 CRAN (R 4.4.1)
yaml yaml 2.3.10 2024-07-26 CRAN (R 4.4.0)

Complete Session Information

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       macOS 26.0.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2025-10-27
 pandoc   3.6.4 @ /usr/local/bin/ (via rmarkdown)
 quarto   1.8.25 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version    date (UTC) lib source
 P backports      1.5.0      2024-05-23 [?] CRAN (R 4.4.0)
 P base64url      1.4        2018-05-14 [?] CRAN (R 4.4.1)
 P BiocManager    1.30.26    2025-06-05 [?] CRAN (R 4.4.1)
 P callr          3.7.6      2024-03-25 [?] CRAN (R 4.4.0)
 P caret          7.0-1      2024-12-10 [?] CRAN (R 4.4.1)
 P class          7.3-23     2025-01-01 [?] CRAN (R 4.4.1)
 P cli            3.6.5      2025-04-23 [?] CRAN (R 4.4.1)
 P codetools      0.2-20     2024-03-31 [?] CRAN (R 4.4.2)
 P commonmark     2.0.0      2025-07-07 [?] CRAN (R 4.4.1)
 P data.table     1.17.8     2025-07-10 [?] CRAN (R 4.4.1)
 P digest         0.6.37     2024-08-19 [?] CRAN (R 4.4.1)
 P dplyr        * 1.1.4      2023-11-17 [?] CRAN (R 4.4.0)
 P evaluate       1.0.5      2025-08-27 [?] CRAN (R 4.4.1)
 P farver         2.1.2      2024-05-13 [?] CRAN (R 4.4.0)
 P fastmap        1.2.0      2024-05-15 [?] CRAN (R 4.4.0)
 P forcats      * 1.0.1      2025-09-25 [?] CRAN (R 4.4.1)
 P foreach        1.5.2      2022-02-02 [?] CRAN (R 4.4.0)
 P future         1.67.0     2025-07-29 [?] CRAN (R 4.4.1)
 P future.apply   1.20.0     2025-06-06 [?] CRAN (R 4.4.1)
 P generics       0.1.4      2025-05-09 [?] CRAN (R 4.4.1)
 P ggplot2      * 4.0.0      2025-09-11 [?] CRAN (R 4.4.1)
 P ggrepel      * 0.9.6      2024-09-07 [?] CRAN (R 4.4.1)
 P ggtext         0.1.2      2022-09-16 [?] CRAN (R 4.4.0)
 P ggvenn       * 0.1.19     2025-10-08 [?] CRAN (R 4.4.1)
 P globals        0.18.0     2025-05-08 [?] CRAN (R 4.4.1)
 P glue           1.8.0      2024-09-30 [?] CRAN (R 4.4.1)
 P gower          1.0.2      2024-12-17 [?] CRAN (R 4.4.1)
 P gridtext       0.1.5      2022-09-16 [?] CRAN (R 4.4.0)
 P gtable         0.3.6      2024-10-25 [?] CRAN (R 4.4.1)
 P hardhat        1.4.2      2025-08-20 [?] CRAN (R 4.4.1)
 P here         * 1.0.2      2025-09-15 [?] CRAN (R 4.4.1)
 P hms            1.1.3      2023-03-21 [?] CRAN (R 4.4.0)
 P htmltools      0.5.8.1    2024-04-04 [?] CRAN (R 4.4.0)
 P htmlwidgets    1.6.4      2023-12-06 [?] CRAN (R 4.4.0)
 P igraph         2.1.4      2025-01-23 [?] CRAN (R 4.4.1)
 P ipred          0.9-15     2024-07-18 [?] CRAN (R 4.4.0)
 P iterators      1.0.14     2022-02-05 [?] CRAN (R 4.4.1)
 P jsonlite       2.0.0      2025-03-27 [?] CRAN (R 4.4.1)
 P knitr        * 1.50       2025-03-16 [?] CRAN (R 4.4.1)
 P labeling       0.4.3      2023-08-29 [?] CRAN (R 4.4.0)
 P lattice        0.22-7     2025-04-02 [?] CRAN (R 4.4.1)
 P lava           1.8.1      2025-01-12 [?] CRAN (R 4.4.1)
 P lifecycle      1.0.4      2023-11-07 [?] CRAN (R 4.4.0)
 P listenv        0.9.1      2024-01-29 [?] CRAN (R 4.4.0)
 P litedown       0.7        2025-04-08 [?] CRAN (R 4.4.1)
 P lubridate    * 1.9.4      2024-12-08 [?] CRAN (R 4.4.1)
 P magrittr       2.0.4      2025-09-12 [?] CRAN (R 4.4.1)
 P markdown       2.0        2025-03-23 [?] CRAN (R 4.4.1)
 P MASS           7.3-65     2025-02-28 [?] CRAN (R 4.4.1)
 P Matrix         1.7-3      2025-03-11 [?] CRAN (R 4.4.1)
 P mgcv           1.9-3      2025-04-04 [?] CRAN (R 4.4.1)
 P ModelMetrics   1.2.2.2    2020-03-17 [?] CRAN (R 4.4.1)
 P nlme           3.1-168    2025-03-31 [?] CRAN (R 4.4.1)
 P nnet           7.3-20     2025-01-01 [?] CRAN (R 4.4.1)
 P parallelly     1.45.1     2025-07-24 [?] CRAN (R 4.4.1)
 P patchwork    * 1.3.2      2025-08-25 [?] CRAN (R 4.4.1)
 P pillar         1.11.1     2025-09-17 [?] CRAN (R 4.4.1)
 P pkgconfig      2.0.3      2019-09-22 [?] CRAN (R 4.4.0)
 P plyr           1.8.9      2023-10-02 [?] CRAN (R 4.4.0)
 P prettyunits    1.2.0      2023-09-24 [?] CRAN (R 4.4.0)
 P pROC         * 1.19.0.1   2025-07-31 [?] CRAN (R 4.4.1)
 P processx       3.8.6      2025-02-21 [?] CRAN (R 4.4.1)
 P prodlim        2025.04.28 2025-04-28 [?] CRAN (R 4.4.1)
 P ps             1.9.1      2025-04-12 [?] CRAN (R 4.4.1)
 P purrr        * 1.1.0      2025-07-10 [?] CRAN (R 4.4.1)
 P R6             2.6.1      2025-02-15 [?] CRAN (R 4.4.1)
 P ragg           1.5.0      2025-09-02 [?] CRAN (R 4.4.1)
 P RColorBrewer   1.1-3      2022-04-03 [?] CRAN (R 4.4.0)
 P Rcpp           1.1.0      2025-07-02 [?] CRAN (R 4.4.1)
 P readr        * 2.1.5      2024-01-10 [?] CRAN (R 4.4.0)
 P recipes        1.3.1      2025-05-21 [?] CRAN (R 4.4.1)
   renv           1.1.4      2025-03-20 [1] CRAN (R 4.4.1)
 P reshape2       1.4.4      2020-04-09 [?] CRAN (R 4.4.0)
 P rlang          1.1.6      2025-04-11 [?] CRAN (R 4.4.1)
 P rmarkdown      2.30       2025-09-28 [?] CRAN (R 4.4.1)
 P rpart          4.1.24     2025-01-07 [?] CRAN (R 4.4.1)
 P rprojroot      2.1.1      2025-08-26 [?] CRAN (R 4.4.1)
 P rstudioapi     0.17.1     2024-10-22 [?] CRAN (R 4.4.1)
 P S7             0.2.0      2024-11-07 [?] CRAN (R 4.4.1)
 P scales         1.4.0      2025-04-24 [?] CRAN (R 4.4.1)
 P secretbase     1.0.5      2025-03-04 [?] CRAN (R 4.4.1)
 P sessioninfo    1.2.3      2025-02-05 [?] CRAN (R 4.4.1)
 P stringi        1.8.7      2025-03-27 [?] CRAN (R 4.4.1)
 P stringr      * 1.5.2      2025-09-08 [?] CRAN (R 4.4.1)
 P survival       3.8-3      2024-12-17 [?] CRAN (R 4.4.1)
 P systemfonts    1.3.1      2025-10-01 [?] CRAN (R 4.4.1)
 P targets      * 1.11.4     2025-09-13 [?] CRAN (R 4.4.1)
 P textshaping    1.0.4      2025-10-10 [?] CRAN (R 4.4.1)
 P tibble       * 3.3.0      2025-06-08 [?] CRAN (R 4.4.1)
 P tidyr        * 1.3.1      2024-01-24 [?] CRAN (R 4.4.1)
 P tidyselect     1.2.1      2024-03-11 [?] CRAN (R 4.4.0)
 P tidyverse    * 2.0.0      2023-02-22 [?] CRAN (R 4.4.0)
 P timechange     0.3.0      2024-01-18 [?] CRAN (R 4.4.0)
 P timeDate       4041.110   2024-09-22 [?] CRAN (R 4.4.1)
 P tzdb           0.5.0      2025-03-15 [?] CRAN (R 4.4.1)
 P vctrs          0.6.5      2023-12-01 [?] CRAN (R 4.4.0)
 P viridisLite  * 0.4.2      2023-05-02 [?] CRAN (R 4.4.0)
 P visNetwork   * 2.1.4      2025-09-04 [?] CRAN (R 4.4.1)
 P withr          3.0.2      2024-10-28 [?] CRAN (R 4.4.1)
 P xfun           0.53       2025-08-19 [?] CRAN (R 4.4.1)
 P xml2           1.4.0      2025-08-20 [?] CRAN (R 4.4.1)
 P yaml           2.3.10     2024-07-26 [?] CRAN (R 4.4.0)

 [1] /Users/biostochastics/chia_etal_2025_als/renv/library/macos/R-4.4/aarch64-apple-darwin20
 [2] /Users/biostochastics/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

List of Figures

  • Figure 2: Conceptual diagram of confounding structure
  • Figure 3: PCA visualization showing protein expression variance driven by tube type
  • Figure 4: ROC curves for predicting tube type from three protein sets
  • Figure 5: Leave-Country-Out cross-validation ROC curves
  • Figure 6: Performance comparison: Pooled vs Country-stratified vs Within-country cross-validation
  • Figure 7: Distribution of significant proteins by cohort
  • Figure 8: Venn diagram showing overlap of significant proteins
  • Figure 9: Correlation of effect sizes between Italy and US cohorts
  • Figure 10: Geographic consistency analysis
  • Figure 11: Tube type effects in healthy controls
  • Figure 14: Decomposition of observed effects
  • Figure 1: Pipeline dependency graph

List of Tables

  • Table 1: Confounding analysis summary
  • Table 2: Top 10 proteins most predictive of tube type
  • Table 3: Leave-Country-Out cross-validation performance
  • Table 4: Summary of differential expression analysis
  • Table 5: Concordant proteins between cohorts
  • Table 6: Tube type effects in healthy controls
  • Table 7: Summary of effect decomposition

Note: Pipeline Reproducibility

This entire analysis pipeline can be reproduced by running:

# Install dependencies (if needed)
renv::restore()

# Run complete pipeline
targets::tar_make()

# Render report
quarto::quarto_render("reports/confounding_investigation.qmd")

All pipeline targets are cached using the targets package. Only changed dependencies trigger recomputation, ensuring efficient incremental updates.


Biostochastics, LLC | GitHub | © 2025