Table 1. Sample distribution by diagnosis and anticoagulant tube type.

| Diagnosis | EDTA | HEPARIN | Total | % HEPARIN |
|---|---|---|---|---|
| ALS | 40 | 223 | 263 | 84.8% |
| Healthy control | 67 | 180 | 247 | 72.9% |
| Neurological control | 194 | 0 | 194 | 0.0% |
Confounding by geography and anticoagulant compromises proposed ALS diagnostic model and biomarkers
Re-analysis of Chia et al. (2025)
Executive Summary
0.1 Methodological Case Study: Multi-Site Confounding in Proteomics
This investigation serves as a methodological case study demonstrating how confounding naturally arises in multi-site biomarker studies and why standard validation approaches may not detect it. Using a rigorous multi-method framework applied to a published ALS proteomics dataset [1], we document perfect site–tube correlation (Cramér’s V=1.0), an 18.6% performance gap between pooled cross-validation and leave-site-out validation (pooled CV AUC: 0.951 → LCV AUC: 0.774), and systematic tube effects across the proteome (35% of proteins show significant effects in healthy controls). These patterns illustrate generalizable lessons about validation strategies, study design, and the importance of testing cross-site generalization.
This case study examines confounding patterns using a multi-method validation framework.
0.2 Methodological Lessons Learned
This case study illustrates a systematic challenge in multi-site proteomics where confounding emerges naturally from logistical constraints:
- Natural confounding: Different sites often use different protocols (plasma collection tube type: HEPARIN vs EDTA)
- Diagnostic distribution: Patient populations vary by site (100% of neurological controls from US/EDTA site)
- Validation limitations: Standard pooled cross-validation maintains these correlations in train/test splits
Key insight: The study design and analytical approach follow field-standard practices, yet produce results where technical and biological signals are inseparable. This demonstrates why prospective validation strategies (leave-site-out CV, reverse prediction tests, balanced designs) are essential for multi-site studies.
Positive findings: Despite confounding, we identify 22 geographically consistent protein candidates and confirm NEFL as a robust ALS biomarker, providing a foundation for future tube-balanced validation studies.
1 Introduction
1.1 Multi-Site Confounding: A Methodological Case Study
Chia et al. [1] reported a plasma proteomics biomarker panel for ALS diagnosis with 98% accuracy. Their machine learning model identified plasma collection tube type (HEPARIN vs. EDTA) as the 2nd most important feature out of 2,869 total features, an observation that prompted this methodological investigation.
This case study demonstrates how confounding naturally arises in multi-site proteomics studies: We document that tube type is perfectly correlated with geographic origin (Italy/HEPARIN vs. US/EDTA; Cramér’s V=1.0) and diagnostic distribution (100% of neurological controls from US/EDTA site, 85% of ALS cases from Italy/HEPARIN; Table 1). This confounding structure—which emerges organically from site-specific protocols and patient populations—creates a methodological challenge: biological disease signals become entangled with technical and geographic factors.
Purpose of this investigation: Rather than simply identifying problems, we demonstrate: (1) how confounding arises in real-world multi-site studies; (2) why standard validation approaches miss it; and (3) what practical solutions exist through stratified analysis, cross-site validation, and prospective study design improvements.
1.2 Anticoagulant Biochemistry and Pre-Analytical Confounding
Anticoagulant choice—EDTA or HEPARIN—fundamentally alters measured proteins. EDTA chelates calcium, preventing clotting but triggering cellular protein release. HEPARIN inhibits specific clotting enzymes and preserves different protein structures. The same patient’s blood yields different protein profiles by tube type—a pre-analytical artifact, not biological signal. When tube type is perfectly confounded with diagnostic group—100% of neurological controls in EDTA—disease-associated protein changes become non-identifiable from anticoagulant-induced technical artifacts.
We conducted an independent investigation with three objectives: (1) quantify confounding severity using reverse prediction and association metrics, (2) evaluate ML model robustness through leave-country-out cross-validation, and (3) identify geographically consistent biomarkers via geographic stratification.
We prespecified three hypotheses to test confounding and its clinical impact:
H1 (Reverse Prediction Test): Protein profiles would strongly predict anticoagulant tube type (null hypothesis: AUC = 0.5; critical threshold: AUC > 0.90 indicates systematic confounding).
H2 (Geographic Generalization): Models would fail geographic generalization under leave-country-out cross-validation (null hypothesis: no performance drop; critical threshold: >20% AUC reduction indicates geographic overfitting).
H3 (Cross-Stratum Replication): Most pooled ‘significant’ proteins would not replicate across both geographic strata (null hypothesis: ≥70% replication; severe confounding indicated if <30% replicate).
These criteria distinguish minor technical variability from systematic confounding preventing clinical deployment.
For readers interested in statistical frameworks and computational details, the Technical Appendix (Section 8) provides comprehensive documentation of methods, software versions, and reproducibility protocols.
2 Methods
2.1 Overview of Reanalysis Approach
We evaluated confounding bias from plasma collection tube type, used as a predictive feature in the original model. Our reanalysis employed geographic stratification and leave-country-out cross-validation to assess model generalizability and identify geographically consistent biomarker candidates.
All analyses were implemented in R (version 4.4.2) using a reproducible computational pipeline (targets framework) with version-controlled dependencies (renv).
2.2 Computational Pipeline Architecture
library(targets)
library(visNetwork)
# Determine project root - handle both manual and tar_quarto() rendering
project_root <- if (file.exists("../_targets.R")) {
normalizePath("..")
} else if (file.exists("_targets.R")) {
getwd()
} else {
"/Users/biostochastics/chia_etal_2025_als" # fallback to absolute path
}
# Change to project root to find _targets.R
withr::with_dir(project_root, {
# Create network with custom styling
net <- tar_visnetwork(
targets_only = TRUE,
label = c("time", "size"),
level_separation = 150
)
# Apply dark theme with bright viridis colors
net %>%
visOptions(
highlightNearest = list(enabled = TRUE, degree = 1),
nodesIdSelection = TRUE
) %>%
visInteraction(
navigationButtons = TRUE,
hover = TRUE
) %>%
visLayout(randomSeed = 42) %>%
visNodes(
font = list(color = "white", size = 16, face = "arial", bold = TRUE),
borderWidth = 2,
shadow = list(enabled = TRUE, color = "#000000", size = 5, x = 2, y = 2)
) %>%
visEdges(
color = list(color = "#666666", highlight = "#FDE724", hover = "#FDE724"),
smooth = list(enabled = TRUE, type = "cubicBezier", roundness = 0.5),
arrows = list(to = list(enabled = TRUE, scaleFactor = 0.5))
) %>%
visPhysics(
stabilization = list(iterations = 200),
barnesHut = list(gravitationalConstant = -8000, springConstant = 0.001, springLength = 200)
)
})
The dependency graph shows the complete computational workflow. Nodes represent computational targets (data, models, visualizations); directed edges indicate dependencies. The targets framework ensures:
- Reproducibility: All analyses run in the same order with identical inputs
- Efficiency: Only outdated targets rebuild when code changes
- Transparency: Full dependency structure is explicit and auditable
- Version control: All code and parameters are tracked in Git
This architecture enables complete reproduction via a single targets::tar_make() command. See Appendix for execution details.
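For orientation, a pipeline of this shape boils down to a short `_targets.R` (a hypothetical sketch: `impute_by_country()`, `run_reverse_prediction()`, and `run_leave_country_out()` are illustrative function names, the CSV path is invented, and the real pipeline defines many more targets):

```r
# _targets.R (sketch): each tar_target() is one node in the dependency graph
library(targets)
tar_option_set(packages = c("data.table", "limma", "ranger"))

list(
  tar_target(raw_data, data.table::fread("data/olink_npx.csv")),  # load NPX data
  tar_target(data_imputed, impute_by_country(raw_data)),          # per-country imputation
  tar_target(reverse_prediction_results, run_reverse_prediction(data_imputed)),
  tar_target(lcv_results, run_leave_country_out(data_imputed))
)
```

With this file in place, `targets::tar_make()` builds only the targets whose code or upstream inputs have changed.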
2.3 Data Source and Study Design
2.3.1 Original Dataset
Dataset: Chia et al. (2025) OLINK plasma proteomics data
- Total measurements: 2,031,559 protein-sample combinations
- Samples: 704 unique samples
- Proteins: ~2,868 protein measurements per sample (OLINK panels)
- Diagnoses: ALS (263), Healthy controls (247), Neurological controls (194)
- Clinical variables: Age at collection, sex, diagnosis, plasma collection tube type (EDTA vs. HEPARIN), OLINK plate identifiers (Plates 1-9)
2.3.2 Geographic Origin Reconstruction
The original dataset lacked explicit country labels. We reconstructed geographic origin (Italy vs. US) from converging evidence:
Plate identifier patterns: Jupyter notebook metadata indicated Plates 5-9 were processed at TRAYNOR (Italian Scholz ALS registry), Plates 1-4 at NDRU (US cohorts from NIH, Johns Hopkins, Baltimore Longitudinal Study of Aging).
Tube type consistency: All HEPARIN samples corresponded to Plates 5-9, all EDTA to Plates 1-4, demonstrating perfect correlation (Cramér’s V ≈ 1.0) between tube type and inferred geographic origin.
Diagnostic distribution validation: Perfect confounding of neurological controls with EDTA tubes (100% EDTA/US) supported the Italy/US geographic distinction.
Country label assignment:
- Italy: Plates 5-9 AND HEPARIN tube type (n=403 samples)
- United States: Plates 1-4 AND EDTA tube type (n=301 samples)
- Unknown: Inconsistent plate-tube combinations (n=0, indicating 100% consistency)
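The plate/tube assignment rule described above can be expressed as a small helper (a hedged sketch; `assign_country` is an illustrative function, not part of the published pipeline):

```r
# Label samples by the plate/tube rule; inconsistent combinations fall
# through to "Unknown" (none occur in the actual dataset).
assign_country <- function(plate, tube) {
  ifelse(plate >= 5 & tube == "HEPARIN", "Italy",
         ifelse(plate <= 4 & tube == "EDTA", "United States", "Unknown"))
}

assign_country(c(6, 2, 3), c("HEPARIN", "EDTA", "HEPARIN"))
# "Italy" "United States" "Unknown"
```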
2.4 Confounding Structure Quantification
Principal component analysis shows that tube type, not diagnosis, drives the dominant axis of variation: PC1 (17.3% of variance) cleanly separates tube types, whereas diagnosis shows minimal separation, demonstrating that technical factors dominate the variance structure.
2.4.1 Statistical Framework
We quantified confounding between tube type, geographic origin, and diagnosis using:
Cramér’s V: Association strength between categorical variables (range: 0 = independent, 1 = perfect association). We calculated Cramér’s V with 95% CIs for all pairwise combinations of diagnosis, tube type, country, sex, and plate number.
Chi-square tests: Evaluated independence between diagnostic group and tube type/country distributions.
Contingency tables: Sample distributions across diagnosis × tube type, diagnosis × country, and tube type × country.
Stratified analysis: Identified groups with perfect confounding (e.g., neurological controls = 100% EDTA).
Confounding Severity Thresholds:
We interpreted Cramér’s V using Cohen’s [2] effect size conventions for 2×3 contingency tables (df=2): <0.07 (negligible), 0.07-0.21 (small), 0.21-0.35 (medium), ≥0.35 (large). V > 0.9 indicates near-perfect confounding and mathematical non-identifiability.
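Cramér’s V can be computed from a contingency table with base R alone (a minimal sketch using the cohort sample counts; the pipeline itself uses DescTools):

```r
# Cramér's V: V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  k <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (sum(tab) * (k - 1))))
}

# Tube type vs. country: every HEPARIN sample is Italian, every EDTA sample US
tab <- matrix(c(403, 0, 0, 301), nrow = 2,
              dimnames = list(Tube = c("HEPARIN", "EDTA"),
                              Country = c("Italy", "US")))
cramers_v(tab)  # 1 (perfect association)
```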
2.5 Machine Learning Validation Framework
2.5.1 Reverse Prediction Test (Diagnostic for Technical Confounding)
Objective: Can proteins predict tube type?
We trained a Random Forest classifier to predict tube type (HEPARIN vs. EDTA) from protein expression alone. This “reverse prediction test” diagnoses technical confounding: AUC > 0.9 indicates pre-analytical variation overwhelms biological signal.
Implementation:
- Model: Random Forest with 500 trees (ranger package)
- Features: All 2,868 proteins without pre-filtering (metadata excluded)
- Outcome: Binary classification (EDTA vs. HEPARIN)
- Cross-validation: 5-fold CV with stratified sampling
- Missing values: Pre-imputed separately by country using missForest (0.01% overall missingness)
- Hyperparameters: Grid search over mtry (√p, p/3) and min.node.size (5, 10)
- Performance metrics: AUC-ROC, sensitivity, specificity
- Feature importance: Permutation-based variable importance scaled to 0-100
Interpretation thresholds [3]:
- AUC < 0.7: Minimal concern (poor discrimination)
- AUC 0.7-0.8: Moderate concern (acceptable discrimination indicates detectable tube effects)
- AUC 0.8-0.9: Severe concern (excellent discrimination indicates major confounding)
- AUC > 0.9: Critical concern (outstanding discrimination; tube effects dominate biological signal)
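The logic of the reverse prediction test can be sketched on simulated data (logistic regression stands in here for the Random Forest; the actual pipeline uses ranger via caret, and all values below are simulated for illustration):

```r
# Simulate one protein whose level shifts with tube type (a pre-analytical
# artifact), then check how well it predicts the tube label.
set.seed(1)
n <- 200
tube <- rep(c("EDTA", "HEPARIN"), each = n / 2)
protein <- rnorm(n, mean = ifelse(tube == "HEPARIN", 1.5, 0))

fit <- glm(factor(tube) ~ protein, family = binomial)
scores <- predict(fit, type = "response")

# AUC via the Mann-Whitney interpretation: P(score_HEPARIN > score_EDTA)
auc <- mean(outer(scores[tube == "HEPARIN"], scores[tube == "EDTA"], `>`))
auc  # well above 0.5: the tube label is recoverable from protein levels alone
```

A real pre-analytical artifact of this size would push the reverse-prediction AUC into the "severe" band above even with a single protein.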
2.5.2 Leave-Country-Out Cross-Validation (Primary Generalizability Test)
Objective: Does the model generalize geographically?
Leave-country-out cross-validation (LCV) evaluates whether models trained in one geographic/tube context generalize to another, directly testing clinical deployment scenarios where tube types vary by institution.
Implementation:
Direction 1: Train on Italy (HEPARIN) → Test on US (EDTA)
- Partition data by country (Italy training: n=403, US test: n=301)
- Train Random Forest on Italian samples using 5-fold internal CV for hyperparameter tuning
- Apply trained model to held-out US samples
- Evaluate test performance (AUC, accuracy, sensitivity, specificity)
Direction 2: Train on US (EDTA) → Test on Italy (HEPARIN)
Reversed procedure with same preprocessing and evaluation.
Critical data leakage prevention:
- Protein filtering, scaling, and hyperparameter tuning performed only on training cohort
- Test cohort held completely separate until final prediction step
- No information from test samples influenced model training
Pooled Cross-Validation (Baseline Comparison):
We performed standard 5-fold CV on pooled Italy + US data to replicate the original study’s approach. Pooled CV mixes countries across folds, maintaining tube type distribution but not testing geographic generalization.
Performance Gap Thresholds:
Field conventions for cross-site generalization in clinical ML models [4,5]: <5% drop (acceptable), 5-10% (moderate concern), 10-20% (substantial concern), >20% (critical concern). Acceptable degradation depends on clinical context and baseline accuracy.
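For reference, the headline 18.6% figure is the relative AUC drop from pooled CV to leave-country-out validation:

```r
pooled_auc <- 0.951
lcv_auc    <- 0.774
relative_drop <- (pooled_auc - lcv_auc) / pooled_auc
round(100 * relative_drop, 1)  # 18.6: a substantial-concern drop by field conventions
```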
2.6 Stratified Differential Protein Expression Analysis
Objective: Identify geographically consistent biomarkers
We performed differential expression analysis separately within each geographic cohort to identify proteins with geographically consistent ALS signals, eliminating tube type and country confounding.
2.6.1 Statistical Framework
We used limma [6], adjusting for age and sex to control demographic confounders.
Model specification:
NPX ~ Diagnosis + Age_Collection + Sex
Where NPX (Normalized Protein eXpression, Olink’s standardized log₂ scale) represents protein abundance, and Diagnosis contrasts ALS vs. controls.
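A base-R sketch of the per-protein model follows (lm stands in for limma’s moderated fit, and the data are simulated; in the pipeline, limma fits all proteins jointly with empirical Bayes variance moderation):

```r
# Simulated single-protein data with the covariates named in the model spec
set.seed(2)
d <- data.frame(
  NPX            = rnorm(120),
  Diagnosis      = factor(rep(c("ALS", "Control"), each = 60)),
  Age_Collection = runif(120, 40, 80),
  Sex            = factor(sample(c("F", "M"), 120, replace = TRUE))
)

fit <- lm(NPX ~ Diagnosis + Age_Collection + Sex, data = d)
# The Diagnosis coefficient is the adjusted group difference on the NPX scale
coef(summary(fit))["DiagnosisControl", ]
```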
2.6.2 Italy Cohort Analysis (HEPARIN-only)
Sample composition: 223 ALS patients vs. 180 healthy controls (all Italian samples use HEPARIN tubes)
Rationale: The Italian cohort lacked neurological controls, providing clean ALS vs. healthy comparison without OND confounding.
Preprocessing:
1. Convert long-format data to protein matrix (proteins × samples)
2. Remove proteins with >20% missing values
3. Fit linear model with empirical Bayes moderation (limma::lmFit + limma::eBayes)
4. Apply multiple testing correction using the Benjamini-Hochberg false discovery rate (FDR)
Significance thresholds: FDR < 0.05 and |log₂ fold-change| > 0.5
2.6.3 US Cohort Analysis (EDTA-only)
Sample composition: 40 ALS patients vs. 261 controls (67 healthy + 194 neurological disease controls; all US samples use EDTA tubes)
Note on power: Small ALS sample size (n=40) limits power, increasing false negative risk but not false positives. Conservative thresholds mitigate false discovery.
Analysis approach: Binary comparison (ALS vs. all controls) with age and sex adjustment. Same preprocessing and significance criteria as Italy analysis.
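The power limitation can be quantified approximately with base R’s power.t.test, conservatively using the smaller group size and assuming SD = 1 NPX (an illustrative assumption, not an estimate from the data):

```r
# Two-sample t-test power for detecting |logFC| = 0.5 with 40 per group
# (conservative: the control group is actually much larger, n = 261)
pw <- power.t.test(n = 40, delta = 0.5, sd = 1, sig.level = 0.05)
pw$power  # roughly 0.6: many true effects of this size will be missed
```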
2.6.4 Pooled Analysis (Confounded Baseline)
For comparison, we replicated the original study’s pooled approach by combining all samples without adjusting for tube type or country. This analysis serves as a baseline to quantify the impact of confounding on differential expression results.
Note: Pooled results are confounded by geographic/tube effects and should not be interpreted as biologically valid.
2.7 Meta-Analysis and Concordance Testing
2.7.1 Geographically Consistent Biomarker Identification
Proteins were classified as “geographically consistent” candidates if they met all of the following criteria:
- Statistically significant in Italy: FDR < 0.05 and absolute log fold-change |logFC| > 0.5, where logFC is the log₂ ratio of mean protein levels between groups
- Statistically significant in US: FDR < 0.05 and |logFC| > 0.5
- Directionally concordant: Same sign of log₂ fold-change in both cohorts
This conservative approach ensures that identified proteins show consistent ALS-associated changes independent of tube type and geographic origin.
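The three criteria reduce to a single vectorized filter (a hedged sketch with made-up per-protein summary statistics; only the NEFL row is constructed to pass):

```r
# Toy per-protein results table; values are illustrative, not from the study
res <- data.frame(
  protein     = c("NEFL", "PROT_A", "PROT_B"),
  fdr_italy   = c(0.001, 0.20, 0.01),
  logfc_italy = c(1.20, 0.80, -0.70),
  fdr_us      = c(0.004, 0.01, 0.03),
  logfc_us    = c(0.90, 0.90, 0.60)
)

consistent <- with(res,
  fdr_italy < 0.05 & abs(logfc_italy) > 0.5 &   # significant in Italy
  fdr_us    < 0.05 & abs(logfc_us)    > 0.5 &   # significant in US
  sign(logfc_italy) == sign(logfc_us)           # directionally concordant
)
res$protein[consistent]  # "NEFL"
```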
2.7.2 Effect Size Concordance
We evaluated cross-cohort consistency using:
- Pearson correlation of log₂ fold-changes between Italy and US cohorts (across all proteins)
- Combined p-values using Fisher’s method: χ² = -2(ln P_Italy + ln P_US), df=4
- Beta-beta plots: Scatterplots of Italy logFC vs. US logFC for visualizing concordance and identifying tube-specific outliers
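Fisher’s method for the two cohorts reduces to a few lines (a minimal sketch; the input p-values are illustrative):

```r
# Combine one p-value per cohort: chi-square with df = 2k = 4 for k = 2 tests
fisher_combine <- function(p_italy, p_us) {
  chi2 <- -2 * (log(p_italy) + log(p_us))
  pchisq(chi2, df = 4, lower.tail = FALSE)
}

fisher_combine(0.01, 0.03)  # combined p ~ 0.003, stronger than either input alone
```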
2.7.3 Stratified vs. Pooled Comparison
To quantify confounding-driven false discoveries, we calculated:
- Overlap statistics: Number of proteins significant in pooled but not in either individual cohort
- Replication rate: Percentage of pooled significant proteins that replicate in both Italy and US
- Critical threshold: <30% replication indicates severe confounding. This heuristic is based on genomics/proteomics field observations that technical artifacts typically show poor cross-cohort replication (<30-40%), while robust biological signals typically replicate at >50-60% rates [7,8]
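The replication rate itself is a simple set computation (a sketch with hypothetical protein ID vectors):

```r
# Fraction of pooled-significant proteins significant in BOTH strata
replication_rate <- function(sig_pooled, sig_italy, sig_us) {
  mean(sig_pooled %in% intersect(sig_italy, sig_us))
}

replication_rate(
  sig_pooled = c("P1", "P2", "P3", "P4"),
  sig_italy  = c("P1", "P2", "P5"),
  sig_us     = c("P1", "P6")
)  # 0.25: below the 30% threshold, flagging severe confounding
```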
2.8 Exact Replication of Original Study Methods
To ensure fair comparison, we explicitly replicated the Chia et al. [1] analysis pipeline:
80/20 discovery-replication split:
- Randomly split pooled data into 80% discovery (n≈563) and 20% replication (n≈141)
- Train Random Forest on discovery set including tube type as predictor
- Evaluate on held-out replication set
- Report feature importance rank for tube type variable
Differential protein selection:
- Select proteins significant in pooled analysis (FDR < 0.05, |logFC| > 0.6, matching original thresholds)
- Use these proteins as features for ML model
2.9 Reproducibility and Software
2.9.1 Computational Environment
- R version: 4.4.2 (2024-10-31)
- Operating system: macOS 15.0
- Dependency management: renv (version 1.0.11) for exact package version locking
2.9.2 Key R Packages
- tidyverse 2.0.0: Data manipulation and visualization
- data.table 1.16.4: Fast data loading and processing
- targets 1.11.4: Reproducible pipeline orchestration
- caret 7.0-1: Machine learning framework
- ranger 0.17.0: Random Forest implementation
- limma 3.60.0: Differential expression analysis
- pROC 1.18.5: ROC curve analysis
- DescTools: Cramér’s V calculation
2.9.3 Reproducibility Protocol
- Random seeds: All stochastic operations used seed = 46803468976 for reproducibility
- Pipeline framework: targets ensures dependency tracking and caching
- Version control: All R packages locked via renv.lock
- Execution: Full pipeline can be reproduced via targets::tar_make()
2.10 Statistical Considerations
2.10.1 Multiple Testing Correction
We applied Benjamini-Hochberg FDR correction separately within each analysis: - Differential expression: FDR control within each cohort (Italy, US, pooled) - Meta-analysis: Combined p-values using Fisher’s method to account for testing in two cohorts
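Benjamini-Hochberg adjustment within each analysis uses base R’s p.adjust (toy p-values for illustration):

```r
# Adjust a set of raw p-values and count discoveries at FDR < 0.05
pvals <- c(0.001, 0.008, 0.020, 0.040, 0.300, 0.900)
fdr   <- p.adjust(pvals, method = "BH")
sum(fdr < 0.05)  # 3 tests survive at FDR < 0.05
```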
2.10.2 Missing Data Handling
All analyses: Missing values (0.01% overall) were imputed once at data loading using missForest, applied separately to the Italy and US cohorts to prevent cross-country information leakage. This approach:
- Uses random forest-based imputation (missForest::missForest()), well-suited for high-dimensional proteomics data
- Is applied independently to Italy (HEPARIN) and US (EDTA) subsets before any downstream analysis
- Prevents cross-country information leakage during imputation (Italy samples never inform US imputation and vice versa)
- Benefits from extremely low missingness (0.01% overall, max 0.14% per protein), making imputation straightforward
- Stores imputed data in the pipeline target data_imputed, used for all subsequent analyses
Differential expression: Proteins with >20% missing values would be excluded prior to limma analysis (standard practice). However, no proteins exceeded this threshold in the current dataset.
Rationale: Pre-imputation with geographic stratification maintains methodological rigor by ensuring Italy and US cohorts remain informationally separate throughout the entire analysis pipeline, consistent with the study’s focus on geographic confounding.
2.10.3 Sample Size and Power
We acknowledge limited statistical power in the US ALS cohort (n=40), which may result in false negatives (missing true ALS signals) but maintains Type I error control (FDR < 0.05).
2.11 Limitations of Reanalysis
We acknowledge the following limitations:
Confounding non-identifiability: Perfect correlation between tube type, country, batch, and storage protocol means we cannot definitively attribute effects to any single factor. Our conclusions apply to the aggregate of these technical artifacts.
No external validation: We reanalyzed existing data without access to independent validation cohorts. Prospective tube-balanced studies are required to confirm geographically consistent biomarkers.
Limited US ALS sample size: Power constraints in the US cohort (n=40 ALS) may cause us to underestimate the number of true geographically consistent proteins (increased false negatives, but Type I error remains controlled).
Key terms used in this investigation:
AUC (Area Under ROC Curve): Measure of classifier performance ranging from 0.5 (random) to 1.0 (perfect). Values >0.9 indicate excellent discrimination.
Cramér’s V: Measure of association strength between categorical variables. Range: 0 (independent) to 1 (perfect association). Cohen (1988) effect sizes for df=2: 0.07 (small), 0.21 (medium), 0.35 (large).
Cross-validation (CV): Method for evaluating model performance by splitting data into training and test sets multiple times.
FDR (False Discovery Rate): Multiple testing correction that controls the expected proportion of false positives among rejected hypotheses. FDR <0.05 means <5% of “significant” results are expected to be false discoveries.
Leave-Country-Out CV: Rigorous validation where models trained on one geographic cohort are tested on a completely held-out cohort. Tests true generalizability.
logFC (log Fold-Change): Difference in NPX values between groups. Since NPX is already on a log₂ scale, this difference directly represents log₂(fold-change). |logFC| > 0.5 represents >40% change; |logFC| > 1.0 represents doubling/halving.
Sensitivity: True positive rate; proportion of actual cases correctly identified.
Specificity: True negative rate; proportion of actual controls correctly identified.
Stratification: Analyzing data separately within homogeneous subgroups to eliminate confounding.
3 Results
Our multi-method investigation reveals severe confounding that compromises the reported ALS diagnostic model. Three analyses demonstrate that tube type and geographic origin dominate: (1) perfect site-tube correlation (Cramér’s V=1.0), (2) leave-country-out validation exposing 18.6% AUC inflation and a sensitivity collapse to 1.8%, and (3) only 26.9% of ‘significant’ proteins replicating across strata. Even the 22 geographically consistent candidates show substantial residual confounding, with 50% exhibiting significant tube effects in healthy controls.
3.1 Confounding Quantification
3.1.1 Statistical Measures
Association Strengths (Cramér’s V):
- Tube Type ↔︎ Diagnosis: 0.721 (strong)
- Country ↔︎ Diagnosis: 0.721 (strong)
- Tube Type ↔︎ Country: 0.997 (near-perfect)
Cramér’s V ranges from 0 (no association) to 1 (perfect). Cohen’s [2] conventions for df=2: ≥0.21 medium, ≥0.35 large.
3.2 Reverse Prediction Test: Strong Evidence for Technical Confounding
Proteins achieve near-perfect tube type prediction accuracy (AUC ≥0.999), demonstrating technical artifacts dominate biological signal. Even the 22 ‘geographically consistent’ biomarker candidates retain strong tube-type signatures (AUC=0.908). When technical metadata can be reconstructed with this accuracy from biological features, those features are contaminated and cannot reliably represent disease biology.
library(ggplot2)
library(pROC)
library(dplyr)
# Create labels
full_label <- sprintf("Full Model (AUC = %.3f)", reverse_prediction_results$auc)
top_label <- sprintf("Top 10 Proteins (AUC = %.3f)", top_proteins_tube_prediction$auc)
biomarker_label <- sprintf("Original Biomarkers (AUC = %.3f)", biomarker_tube_prediction$auc)
concordant_label <- sprintf("Concordant Proteins (AUC = %.3f)", concordant_tube_prediction$auc)
# Extract predictions and calculate ROC curves
# Full model
full_preds <- reverse_prediction_results$model$pred %>%
dplyr::filter(
mtry == reverse_prediction_results$model$bestTune$mtry,
min.node.size == reverse_prediction_results$model$bestTune$min.node.size
)
roc_full <- pROC::roc(
response = full_preds$obs,
predictor = full_preds$HEPARIN,
levels = c("EDTA", "HEPARIN"),
direction = "<"
)
# Top 10 proteins
top_preds <- top_proteins_tube_prediction$predictions
roc_top <- pROC::roc(
response = top_preds$obs,
predictor = top_preds$HEPARIN,
levels = c("EDTA", "HEPARIN"),
direction = "<"
)
# Original biomarkers
biomarker_preds <- biomarker_tube_prediction$predictions
roc_biomarker <- pROC::roc(
response = biomarker_preds$obs,
predictor = biomarker_preds$HEPARIN,
levels = c("EDTA", "HEPARIN"),
direction = "<"
)
# Concordant proteins
concordant_preds <- concordant_tube_prediction$predictions
roc_concordant <- pROC::roc(
response = concordant_preds$obs,
predictor = concordant_preds$HEPARIN,
levels = c("EDTA", "HEPARIN"),
direction = "<"
)
# Combine ROC data
roc_combined <- rbind(
data.frame(
fpr = 1 - roc_full$specificities,
tpr = roc_full$sensitivities,
model = full_label
),
data.frame(
fpr = 1 - roc_top$specificities,
tpr = roc_top$sensitivities,
model = top_label
),
data.frame(
fpr = 1 - roc_biomarker$specificities,
tpr = roc_biomarker$sensitivities,
model = biomarker_label
),
data.frame(
fpr = 1 - roc_concordant$specificities,
tpr = roc_concordant$sensitivities,
model = concordant_label
)
)
# Create color mapping with 4 distinct colors
color_values <- c("#FDE724", "#1F9E89", "#B5DE2B", "#31688E")
names(color_values) <- c(full_label, top_label, biomarker_label, concordant_label)
# Create plot
ggplot(roc_combined, aes(x = fpr, y = tpr, color = model)) +
geom_line(linewidth = 2.5, alpha = 0.9) +
geom_abline(
slope = 1, intercept = 0, linetype = "dashed",
color = "#71717A", linewidth = 1
) +
scale_color_manual(
name = NULL,
values = color_values
) +
labs(
title = "Reverse Prediction Test: Predicting Tube Type from Proteins",
subtitle = "AUC = 0.999 - Near-perfect discrimination indicates tube effects dominate",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)"
) +
theme_dark_scientific(base_size = 12) +
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA),
legend.background = element_rect(fill = "#151520", colour = NA),
legend.key = element_rect(fill = "#151520", colour = NA),
legend.position = c(0.65, 0.25)
) +
coord_equal()

Results:
- Full Model (all proteins): AUC = 0.999 (99.7% sensitivity, 94.8% specificity)
- Top 10 Proteins: AUC = 1.000
- Original Biomarkers (Chia et al. 17-protein panel): AUC = 0.860 (n=11 proteins)
- Concordant Proteins (significant in both cohorts): AUC = 0.908 (n=22 proteins)
The reverse prediction test produces a compelling result with important implications for interpreting the published biomarker claims. By training classifiers to predict plasma collection tube type (HEPARIN vs. EDTA) from protein expression profiles alone—excluding all clinical metadata—we demonstrate that tube type effects are pervasive across multiple protein sets. The full model achieves near-perfect discrimination (AUC = 0.999), while even the top 10 most tube-discriminative proteins achieve exceptional performance (AUC = 1.000).
Critically, both biomarker panels show substantial tube-type confounding: the original study’s 17-protein panel achieves AUC = 0.860 for predicting tube type, while our concordant proteins (those significant in both Italy and US with concordant direction, n=22) achieve AUC = 0.908. These findings indicate that even proteins that successfully replicate across independent cohorts remain significantly confounded by technical collection factors.
These findings are particularly striking because the protein measurements are more informative about the technical collection method than they are about ALS disease status (original study reported 98% accuracy for disease prediction). This reversed diagnostic priority reveals that tube type effects dominate the variance structure in the proteomics data, overwhelming genuine biological disease signals. When combined with the perfect confounding structure—where 100% of neurological controls are EDTA and 85% of ALS cases are HEPARIN—these results indicate that any disease classifier trained on pooled data will inevitably learn to recognize anticoagulant-associated protein signatures rather than ALS-specific biology.
The reverse prediction test thus provides strong diagnostic evidence for the presence of severe technical confounding across all protein sets, from the full panel to the curated biomarkers. Even if some ALS biomarkers exist within the dataset, their signals are thoroughly entangled with anticoagulant effects that must be addressed through stratified analysis or balanced study designs.
3.2.1 Top Proteins Distinguishing Tube Types
| Protein | Importance (0-100) |
|---|---|
| DCTN2 | 100.00 |
| INSL3 | 55.60 |
| CENPF | 45.39 |
| SPINT3 | 42.99 |
| FGF9 | 42.28 |
| CCND2 | 40.16 |
| AGBL2 | 30.47 |
| TRIM40 | 14.94 |
| SLC1A4 | 11.86 |
| DAND5 | 8.68 |
3.3 Geographic Validation: Leave-Country-Out CV
library(ggplot2)
library(pROC)
library(patchwork)
library(dplyr)
# Extract ROC curves
roc_italy_to_us <- lcv_results$italy_to_us$roc_curve
roc_us_to_italy <- lcv_results$us_to_italy$roc_curve
roc_italy_within <- within_country_cv_results$italy$roc_obj
roc_us_within <- within_country_cv_results$us$roc_obj
roc_pooled <- pooled_cv_results$roc_obj
# Create labels with AUCs
pooled_label <- sprintf("Pooled (%.3f)", pooled_cv_results$cv_auc)
italy_within_label <- sprintf("Italy (%.3f)", within_country_cv_results$italy$cv_auc)
us_within_label <- sprintf("US (%.3f)", within_country_cv_results$us$cv_auc)
italy_label <- sprintf("Italy→US (%.3f)", lcv_results$italy_to_us$test_auc)
us_label <- sprintf("US→Italy (%.3f)", lcv_results$us_to_italy$test_auc)
# ==============================================================================
# PANEL A: ORIGINAL VALIDATION (WITHIN-COUNTRY + POOLED)
# ==============================================================================
roc_original <- rbind(
data.frame(
specificity = 1 - roc_pooled$specificities,
sensitivity = roc_pooled$sensitivities,
model = pooled_label
),
data.frame(
specificity = 1 - roc_italy_within$specificities,
sensitivity = roc_italy_within$sensitivities,
model = italy_within_label
),
data.frame(
specificity = 1 - roc_us_within$specificities,
sensitivity = roc_us_within$sensitivities,
model = us_within_label
)
)
# Define colors for Panel A
colors_panel_a <- c(
"#FDE724", # Pooled - Bright yellow (viridis 100%)
"#1F9E89", # Italy - Jade/teal (viridis 60%)
"#B5DE2B" # US - Yellow-green (viridis 90%)
)
names(colors_panel_a) <- c(pooled_label, italy_within_label, us_within_label)
p_original <- ggplot(roc_original, aes(x = specificity, y = sensitivity, color = model)) +
geom_line(linewidth = 2.5, alpha = 0.9) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted", color = "#666666", alpha = 0.5, linewidth = 1) +
scale_color_manual(
name = "Validation Strategy",
values = colors_panel_a
) +
labs(
title = "A. ORIGINAL",
x = "False Positive Rate",
y = "True Positive Rate"
) +
theme_dark_scientific(base_size = 14) +
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA),
legend.position = c(0.65, 0.25)
) +
coord_equal()
# ==============================================================================
# PANEL B: REANALYSIS (CROSS-COUNTRY)
# ==============================================================================
roc_reanalysis <- rbind(
data.frame(
specificity = 1 - roc_italy_to_us$specificities,
sensitivity = roc_italy_to_us$sensitivities,
model = italy_label
),
data.frame(
specificity = 1 - roc_us_to_italy$specificities,
sensitivity = roc_us_to_italy$sensitivities,
model = us_label
)
)
# Define colors for Panel B
colors_panel_b <- c(
"#440154", # Italy→US - Dark purple (viridis 0%)
"#31688E" # US→Italy - Medium blue (viridis 40%)
)
names(colors_panel_b) <- c(italy_label, us_label)
p_reanalysis <- ggplot(roc_reanalysis, aes(x = specificity, y = sensitivity, color = model)) +
geom_line(linewidth = 2.5, alpha = 0.9) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted", color = "#666666", alpha = 0.5, linewidth = 1) +
scale_color_manual(
name = "Cross-Country",
values = colors_panel_b
) +
labs(
title = "B. REANALYSIS (CROSS-COUNTRY VALIDATION)",
x = "False Positive Rate",
y = "True Positive Rate"
) +
theme_dark_scientific(base_size = 14) +
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA),
legend.position = c(0.65, 0.25)
) +
coord_equal()
# ==============================================================================
# PANEL C: CONFUSION MATRICES FOR CROSS-COUNTRY
# ==============================================================================
# Italy → US confusion matrix
cm_italy_us <- lcv_results$italy_to_us$confusion_matrix
cm_italy_us_df <- as.data.frame.table(cm_italy_us)
names(cm_italy_us_df) <- c("Predicted", "Actual", "Count")
# Calculate metrics
sens_italy_us <- cm_italy_us["ALS", "ALS"] / sum(cm_italy_us[, "ALS"])
spec_italy_us <- cm_italy_us["Control", "Control"] / sum(cm_italy_us[, "Control"])
p_cm_italy_us <- ggplot(cm_italy_us_df, aes(x = Actual, y = Predicted, fill = Count)) +
geom_tile(color = "#444444", linewidth = 1) +
geom_text(aes(label = Count), size = 6, fontface = "bold", color = "white") +
scale_fill_gradient(low = "#2c1e3d", high = "#E6550D") +
labs(
title = sprintf("Italy→US\nSens: %.2f, Spec: %.2f", sens_italy_us, spec_italy_us),
x = "True", y = "Predicted"
) +
theme_dark_scientific(base_size = 12) +
theme_centered_titles() +
theme_zero_margins() +
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA),
legend.position = "none",
panel.grid = element_blank()
) +
coord_equal()
# US → Italy confusion matrix
cm_us_italy <- lcv_results$us_to_italy$confusion_matrix
cm_us_italy_df <- as.data.frame.table(cm_us_italy)
names(cm_us_italy_df) <- c("Predicted", "Actual", "Count")
sens_us_italy <- cm_us_italy["ALS", "ALS"] / sum(cm_us_italy[, "ALS"])
spec_us_italy <- cm_us_italy["Control", "Control"] / sum(cm_us_italy[, "Control"])
p_cm_us_italy <- ggplot(cm_us_italy_df, aes(x = Actual, y = Predicted, fill = Count)) +
geom_tile(color = "#444444", linewidth = 1) +
geom_text(aes(label = Count), size = 6, fontface = "bold", color = "white") +
scale_fill_gradient(low = "#2c1e3d", high = "#756BB1") +
labs(
title = sprintf("US→Italy\nSens: %.2f, Spec: %.2f", sens_us_italy, spec_us_italy),
x = "True", y = "Predicted"
) +
theme_dark_scientific(base_size = 12) +
theme_centered_titles() +
theme_zero_margins() +
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA),
legend.position = "none",
panel.grid = element_blank()
) +
coord_equal()
# ==============================================================================
# COMBINE ALL PANELS: A and B on top, C (both confusion matrices) below
# ==============================================================================
((p_original | p_reanalysis) / (p_cm_italy_us | p_cm_us_italy)) +
plot_annotation(
title = "Geographic Validation: Original Study vs Reanalysis",
subtitle = "Cross-country validation reveals severe geographic overfitting not detected by original within-country/pooled approaches",
theme = theme_dark_scientific(base_size = 12) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "#FDE724"),
plot.background = element_rect(fill = "#151520", color = NA),
plot.margin = unit(c(0,0,0,0), "pt")
)
) &
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA)
)
3.3.1 Performance Summary
| Approach | AUC | 95% CI (lower) | 95% CI (upper) |
|---|---|---|---|
| Pooled CV (original) | 0.951 | 0.935 | 0.967 |
| Italy → US | 0.813 | 0.737 | 0.889 |
| US → Italy | 0.736 | 0.687 | 0.784 |
| Mean LCV | 0.774 | NA | NA |
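The Mean LCV row and the 18.6% gap quoted in the summary follow arithmetically from the table; a quick base-R check:

```r
# Reproduce the summary numbers: mean leave-country-out AUC and the
# relative gap versus pooled cross-validation
pooled_auc <- 0.951
lcv_aucs   <- c(italy_to_us = 0.813, us_to_italy = 0.736)
mean_lcv   <- mean(lcv_aucs)                        # 0.7745 (reported as 0.774)
rel_gap    <- (pooled_auc - mean_lcv) / pooled_auc  # ~0.186, the "18.6% gap"
```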
3.4 Within-Country Cross-Validation
Show code
library(patchwork)
# Panel 1: Within-country CV results
within_country_data <- tibble::tibble(
Approach = factor(
c("Italy-only CV\n(HEPARIN)", "US-only CV\n(EDTA)"),
levels = c("Italy-only CV\n(HEPARIN)", "US-only CV\n(EDTA)")
),
AUC = c(italy_within_cv$cv_auc, us_within_cv$cv_auc),
CI_lower = c(italy_within_cv$cv_auc_ci_lower, us_within_cv$cv_auc_ci_lower),
CI_upper = c(italy_within_cv$cv_auc_ci_upper, us_within_cv$cv_auc_ci_upper)
)
p1 <- ggplot(within_country_data, aes(x = Approach, y = AUC, fill = Approach)) +
geom_col(width = 0.65, color = "#0B0B0F") +
geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.15, color = "#C8C8D0", linewidth = 0.5) +
geom_point(size = 4, color = "white", shape = 21, fill = "#FDE724", stroke = 1.5) +
geom_text(aes(label = sprintf("%.3f", AUC)), vjust = -1.2, color = "white", size = 5.5, fontface = "bold") +
scale_y_continuous(breaks = seq(0.5, 1.0, 0.05)) +
coord_cartesian(ylim = c(0.5, 1.0)) + # zoom via coords: limits = would censor bars drawn from 0
scale_fill_manual(
values = c(
"Italy-only CV\n(HEPARIN)" = "#5DC863",
"US-only CV\n(EDTA)" = "#31688E"
)
) +
labs(
title = "A. Within-Country CV Performance",
subtitle = "Models trained and tested within single cohorts",
x = NULL,
y = "AUC (95% CI)"
) +
theme_dark_scientific(base_size = 12) +
theme_zero_margins() +
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA),
legend.position = "none",
axis.text.x = element_text(size = 11, face = "bold"),
axis.text.y = element_text(size = 11),
axis.title.y = element_text(size = 12, face = "bold")
)
# Panel 2: Pooled and LCV comparison
pooled_lcv_data <- tibble::tibble(
Approach = factor(
c("Pooled CV", "Italy → US", "US → Italy"),
levels = c("Pooled CV", "Italy → US", "US → Italy")
),
AUC = c(
pooled_cv_results$cv_auc,
lcv_results$italy_to_us$test_auc,
lcv_results$us_to_italy$test_auc
),
CI_lower = c(
pooled_cv_results$cv_auc_ci_lower,
lcv_results$italy_to_us$test_auc_ci_lower,
lcv_results$us_to_italy$test_auc_ci_lower
),
CI_upper = c(
pooled_cv_results$cv_auc_ci_upper,
lcv_results$italy_to_us$test_auc_ci_upper,
lcv_results$us_to_italy$test_auc_ci_upper
),
Type = c("Pooled", "LCV", "LCV")
)
p2 <- ggplot(pooled_lcv_data, aes(x = Approach, y = AUC, fill = Type)) +
geom_col(width = 0.65, color = "#0B0B0F") +
geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.15, color = "#C8C8D0", linewidth = 0.5) +
geom_point(size = 4, color = "white", shape = 21, fill = "#FDE724", stroke = 1.5) +
geom_text(aes(label = sprintf("%.3f", AUC)), vjust = -1.2, color = "white", size = 5.5, fontface = "bold") +
scale_y_continuous(breaks = seq(0.5, 1.0, 0.05)) +
coord_cartesian(ylim = c(0.5, 1.0)) + # zoom via coords: limits = would censor bars drawn from 0
scale_fill_manual(
values = c(
"Pooled" = "#FDE725",
"LCV" = "#C2408C"
)
) +
labs(
title = "B. Pooled vs Leave-Country-Out CV",
subtitle = "Geographic validation reveals confounding",
x = NULL,
y = "AUC (95% CI)"
) +
theme_dark_scientific(base_size = 12) +
theme_zero_margins() +
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA),
legend.position = "none",
axis.text.x = element_text(size = 11, face = "bold"),
axis.text.y = element_text(size = 11),
axis.title.y = element_text(size = 12, face = "bold")
)
# Combine panels side by side
(p1 + p2) + plot_layout() & theme_zero_margins()
Within-country model performance:
- Italy-only (HEPARIN, n = 403; ALS 223, Controls 180): AUC 0.961 (95% CI 0.944–0.978)
- US-only (EDTA, n = 301; ALS 40, Controls 261): AUC 0.922 (95% CI 0.858–0.957)
- Pooled CV remains the optimistic upper bound because confounding structure is shared across folds.
- Italy-only and US-only CV reflect performance in homogeneous cohorts without cross-matrix generalization demands. Both achieve reasonable performance within their respective anticoagulant contexts.
- The large gap between within-country CV and leave-country-out (Italy → US and US → Italy) demonstrates that performance degrades substantially when the anticoagulant/geography context changes. This pattern is consistent with models learning tube/context-specific signatures that do not transfer across countries, not generalizable ALS biology.
3.5 Differential Analysis: Stratified vs Pooled
Show code
# Plot protein counts inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)
counts_data <- data.frame(
analysis = c(
"Italy\n(HEPARIN)",
"US\n(EDTA)",
"Pooled\n(Confounded)",
"Both Strata\n(Robust)"
),
count = c(
length(stratified_vs_pooled_comparison$sig_italy),
length(stratified_vs_pooled_comparison$sig_us),
length(stratified_vs_pooled_comparison$sig_pooled),
length(stratified_vs_pooled_comparison$sig_both_strata)
),
type = c("Stratified", "Stratified", "Pooled", "Robust")
)
counts_data$analysis <- factor(counts_data$analysis, levels = counts_data$analysis)
ggplot(counts_data, aes(x = analysis, y = count, fill = type)) +
geom_col(width = 0.7, alpha = 0.9) +
geom_text(aes(label = count), vjust = -0.5, size = 5, fontface = "bold", color = "#EEEEEE") +
scale_fill_manual(values = c(
"Stratified" = "#31688E", # viridis cyan
"Pooled" = "#FDE724", # viridis yellow
"Robust" = "#6CCE59" # viridis lime
)) +
labs(
title = "Significant Proteins by Analysis Type",
subtitle = sprintf(
"Only %d/%d pooled proteins (%.1f%%) replicate in both strata - major reproducibility issue",
length(stratified_vs_pooled_comparison$overlap_both),
length(stratified_vs_pooled_comparison$sig_pooled),
stratified_vs_pooled_comparison$pct_replicate_in_both
),
x = NULL,
y = "Number of Significant Proteins\n(FDR < 0.05, |logFC| > 0.5)",
fill = "Type"
) +
theme_dark_scientific(base_size = 12) +
theme(
legend.position = "none",
panel.grid.major.x = element_blank()
)
3.5.1 Summary Statistics
| Analysis | N Significant | % of Pooled |
|---|---|---|
| Italy (HEPARIN) | 37 | 47.4 |
| US (EDTA) | 379 | 485.9 |
| Pooled (confounded) | 78 | 100.0 |
| Both Strata | 23 | 26.9 |
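The "Both Strata" row is the intersection of the two stratified hit lists; a toy sketch of the bookkeeping (the real sets live in stratified_vs_pooled_comparison, and the protein names here are illustrative):

```r
# Toy illustration of the replication accounting; only the first three
# names are real proteins from this study, the rest are placeholders
sig_italy  <- c("NEFL", "ALDH3A1", "CORO6", "A1")
sig_us     <- c("NEFL", "ALDH3A1", "CORO6", "B1", "B2")
sig_pooled <- c("NEFL", "ALDH3A1", "CORO6", "A1", "B1", "C1")

# Proteins significant in BOTH strata, and the share of pooled hits they cover
sig_both_strata <- intersect(sig_italy, sig_us)
pct_replicate   <- 100 * length(intersect(sig_pooled, sig_both_strata)) /
  length(sig_pooled)
```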
3.5.2 Protein Overlap Visualization
Show code
# Plot Venn diagram inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)
library(ggvenn)
# Prepare data for ggvenn
venn_data <- list(
Italy = stratified_vs_pooled_comparison$sig_italy,
US = stratified_vs_pooled_comparison$sig_us,
Pooled = stratified_vs_pooled_comparison$sig_pooled
)
ggvenn(
venn_data,
fill_color = c("#1F9E89", "#B5DE2B", "#FDE724"), # viridis jade, yellow-green, yellow
fill_alpha = 0.5,
stroke_size = 1,
set_name_size = 5,
set_name_color = "white",
text_size = 4
) +
labs(
title = "Protein Overlap: Stratified vs Pooled Analysis",
subtitle = sprintf(
"⚠️ CRITICAL: Only %d proteins (%.1f%%) significant in BOTH Italy and US",
length(stratified_vs_pooled_comparison$sig_both_strata),
stratified_vs_pooled_comparison$pct_replicate_in_both
)
) +
theme_dark_scientific(base_size = 12) +
theme(
plot.background = element_rect(fill = "#151520", colour = NA),
panel.background = element_rect(fill = "#151520", colour = NA),
legend.background = element_rect(fill = "#151520", colour = NA),
legend.key = element_rect(fill = "#151520", colour = NA),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(color = "#FDE724", face = "bold", hjust = 0.5)
)
3.6 Geographically Consistent Biomarker Candidates
3.6.1 Meta-Analysis Results
Proteins significant in BOTH Italy and US with concordant direction: 22
These are the ONLY proteins that show consistent ALS signals across both tube types and geographic cohorts.
| Assay | logFC_italy | logFC_us | mean_logFC | adj_P_italy | adj_P_us | combined_p |
|---|---|---|---|---|---|---|
| NEFL | 2.503 | 2.257 | 2.380 | 0 | 0 | 0 |
| ALDH3A1 | 2.311 | 2.323 | 2.317 | 0 | 0 | 0 |
| MEGF10 | 0.877 | 1.054 | 0.966 | 0 | 0 | 0 |
| CORO6 | 1.157 | 1.099 | 1.128 | 0 | 0 | 0 |
| HS6ST2 | 0.777 | 0.901 | 0.839 | 0 | 0 | 0 |
| CSRP3 | 2.143 | 1.914 | 2.028 | 0 | 0 | 0 |
| MYLPF | 1.555 | 1.585 | 1.570 | 0 | 0 | 0 |
| MYBPC1 | 1.385 | 1.469 | 1.427 | 0 | 0 | 0 |
| EDA2R | 0.871 | 0.757 | 0.814 | 0 | 0 | 0 |
| CA3 | 1.056 | 0.739 | 0.898 | 0 | 0 | 0 |
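The text does not state which rule produced the combined_p column; Fisher's method is a standard choice for combining independent p-values and can be sketched in base R (illustrative only, not necessarily the method used here):

```r
# Fisher's method: under H0, -2 * sum(log(p_i)) follows a chi-squared
# distribution with 2k degrees of freedom for k independent p-values
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

fisher_combine(c(0.01, 0.03))  # ~0.0027: stronger than either input alone
```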
3.6.2 Effect Size Correlation
To assess whether effect sizes are consistent across cohorts, we calculate correlations for two protein sets:
- All proteins on Olink panel (n = 2868)
- Proteins significant in pooled analysis (n = 78)
Key Findings:
- All proteins: Pearson r = 0.030, Spearman ρ = -0.115 — Near-zero Pearson correlation with negative Spearman rank correlation across entire panel
- Pooled-significant proteins: Pearson r = 0.788, Spearman ρ = 0.683 — Moderate-to-strong correlation among selected proteins
The stark difference between these correlations reveals an important pattern: proteins that reach significance in the pooled analysis tend to show more consistent effect sizes across cohorts, while the majority of proteins on the panel show essentially no correlation.
Notably, the negative Spearman correlation (ρ = -0.115) for all proteins indicates that the rank ordering of effects is weakly inverted between cohorts: proteins with larger effects in Italy tend, if anything, to have smaller effects in the US. Though modest in magnitude, this rank inversion suggests systematic differences in how proteins respond across the two cohorts that go beyond simple scaling.
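This selection pattern is easy to reproduce by simulation: panel-wide effects that are independent noise in the two cohorts correlate near zero, while a small subset with genuinely shared signal correlates strongly (a toy sketch; all distribution parameters are invented):

```r
# Panel-wide: two cohorts whose per-protein effects are independent noise
set.seed(42)
n_panel  <- 2868
italy_fc <- rnorm(n_panel, 0, 0.2)   # near-null effects, cohort 1
us_fc    <- rnorm(n_panel, 0, 0.2)   # independent near-null effects, cohort 2
r_all    <- cor(italy_fc, us_fc)     # near zero, as seen for the full panel

# Selected subset: 78 "hits" carrying genuine shared signal plus noise
true_fc  <- rnorm(78, 1, 0.5)
r_subset <- cor(true_fc + rnorm(78, 0, 0.2),
                true_fc + rnorm(78, 0, 0.2))  # strong, as for pooled hits
```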
3.6.3 Biological Interpretation
NEFL (Neurofilament Light Chain) is a well-established neuroaxonal injury biomarker and likely represents a genuine biological signal in ALS. However, our analyses demonstrate that its effect size in this dataset is substantially contaminated by the Italy-HEPARIN/US-EDTA confound. Among the 22 “geographically consistent” proteins significant in both strata, nearly half show significant residual tube effects. Importantly:
- Only 22 proteins survive rigorous geographic validation (out of 78 pooled)
- 72% of pooled proteins fail to replicate across tube types
- Panel-wide effect sizes are inconsistent between cohorts (Pearson r = 0.03 across all 2868 proteins)
- Even geographically consistent candidates retain substantial tube signal (see Residual Confounding section)
Verdict: While NEFL and a small subset of proteins remain clinically relevant, their magnitudes in the pooled analysis are inflated and their relative rankings among biomarkers are distorted by non-biological factors. Most of the published protein list requires prospective validation in tube-balanced studies.
3.6.4 Visualizing Country/Tube Effects in Geographically Consistent Proteins
To demonstrate that even “geographically consistent” proteins exhibit systematic country/tube effects, we examine the distribution of NPX values for ALL geographically consistent proteins, comparing ALS vs. Healthy controls across Italy (HEPARIN) and US (EDTA) cohorts.
3.7 Residual Confounding in “Geographically Consistent” Proteins
Critical Question: Are the 22 “geographically consistent” proteins (significant in both Italy and US with concordant direction) truly independent of tube type effects, or do they still retain substantial technical confounding?
Just because a protein is significant in BOTH cohorts doesn’t prove it’s free from tube type artifacts. It could be: (1) true biomarker + NO tube effect (ideal), (2) true biomarker + tube effect (entangled), or (3) pure tube artifact spuriously associated with disease. With perfect confounding, we cannot distinguish cases #2 and #3, but we CAN quantify the magnitude of residual tube type signal using two complementary approaches below, which together with the reverse prediction test (Section 3.2) provide converging evidence.
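The “perfect confounding” invoked here is the Cramér's V = 1.0 site-tube association reported in the Executive Summary; it can be reproduced in base R from the cohort sizes (403 Italy/HEPARIN, 301 US/EDTA):

```r
# Cramér's V for site vs anticoagulant: every Italian sample is HEPARIN and
# every US sample is EDTA, so the association is perfect (V = 1)
site <- c(rep("Italy", 403), rep("US", 301))
tube <- c(rep("HEPARIN", 403), rep("EDTA", 301))
tab  <- table(site, tube)

chi2 <- chisq.test(tab, correct = FALSE)$statistic
cramers_v <- sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
cramers_v  # 1
```

With V = 1, site membership carries exactly the same information as tube type, which is why no covariate adjustment can separate the two.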
3.7.1 Test 1: Tube Effects in Healthy Controls (Disease-Free Test)
As demonstrated in Figure 10, even “geographically consistent” proteins show systematic differences between tube types within disease-free individuals. Here we quantify these effects statistically.
Show code
# Plot healthy tube effects using limma-based results (adjusted for age & sex)
library(ggplot2)
library(dplyr)
# Prepare plot data from limma results
# Filter to only geographically consistent proteins (same as old analysis)
plot_data <- tube_effects_healthy %>%
dplyr::filter(Assay %in% tube_robust_top15_assays) %>%
dplyr::arrange(adj.P.Val) %>%
dplyr::mutate(
protein = factor(Assay, levels = rev(Assay)),
significant = adj.P.Val < 0.05,
# Use logFC as effect size (already in NPX units since NPX is log2)
# Calculate approximate 95% CI using t-distribution
ci_lower = logFC - 1.96 * logFC / t,
ci_upper = logFC + 1.96 * logFC / t
)
n_sig <- sum(plot_data$significant)
n_total <- nrow(plot_data)
ggplot(plot_data, aes(x = logFC, y = protein, color = significant)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "#888888") +
geom_point(size = 3) +
geom_errorbarh(aes(xmin = ci_lower, xmax = ci_upper), height = 0.3) +
scale_color_manual(values = c("TRUE" = "#FDE724", "FALSE" = "#888888")) +
labs(
title = "Tube Type Effects in Healthy Controls (Disease-Free)",
subtitle = sprintf(
"⚠️ %d/%d geographically consistent proteins differ by tube type WITHOUT disease present\nLimma analysis adjusted for age & sex",
n_sig, n_total
),
x = "Effect Size in NPX (US/EDTA - Italy/HEPARIN, logFC)",
y = NULL,
color = "Significant\n(FDR < 0.05)"
) +
theme_dark_scientific(base_size = 12) +
theme(
panel.grid.major.y = element_blank()
)
Rationale: If a protein differs between Italy/HEPARIN and US/EDTA in disease-free individuals, that difference MUST be a technical artifact (or population genetics), NOT ALS biology.
Result: 10 out of 22 geographically consistent proteins (45.5%) show significant tube effects in healthy controls (limma, FDR < 0.05)
Distribution of Tube Effect Sizes Across All Proteins
To understand the full scope of tube type effects in healthy controls, we examine the distribution of effect sizes across all proteins in the Olink panel. We present two views: the full distribution and a zoomed view focusing on the central range.
Show code
# Load visualization from targets (full range)
tar_load(viz_tube_effect_distribution_full)
viz_tube_effect_distribution_full
The full distribution (Figure 12) shows the complete range of tube effects, including extreme outliers extending beyond 1000% change. The theoretical null distribution (cyan dashed line, centered at 0) represents the expected pattern if tube type had no systematic effect—the dramatic deviation from this null demonstrates widespread technical artifacts across the proteome.
Show code
# Load visualization from targets (zoomed to -1000 to 1000)
tar_load(viz_tube_effect_distribution_zoomed)
viz_tube_effect_distribution_zoomed
The zoomed distribution (Figure 13) focuses on the -1000% to +1000% range, allowing better visualization of the central distribution pattern. Comparison to the theoretical null (cyan dashed curve) clearly shows that the observed effects are far more extreme and variable than expected from measurement noise alone, with a bimodal structure suggesting systematic directional effects. Outliers beyond this range are indicated with arrows.
Key Findings:
- 979 out of 2868 proteins (34.1%) show significant tube effects in healthy controls (FDR < 0.05)
- Median absolute percent change: 13.7% across all proteins
- Median absolute percent change (significant only): 32.5%
- 54 proteins show >100% change
- 262 proteins show >50% change
- 2 proteins show extreme changes >1000%
Interpretation: This distribution reveals that tube type differences are widespread and substantial across the entire proteome, not limited to a few outlier proteins. The fact that over a third of all proteins show significant effects in disease-free individuals demonstrates that technical artifacts dominate the variance structure of this dataset. The presence of extreme outliers (>1000% change) indicates that some proteins are profoundly affected by anticoagulant choice, making them completely unreliable for biomarker discovery without proper experimental control.
| Protein | logFC (NPX) | Mean NPX | t-statistic | FDR |
|---|---|---|---|---|
| NEFL | 0.960 | -1.060 | 8.329 | 0.000 |
| ALDH3A1 | 0.923 | -0.598 | 7.464 | 0.000 |
| DTNB | 0.648 | -0.389 | 5.285 | 0.000 |
| EDA2R | 0.528 | -0.266 | 4.212 | 0.000 |
| HS6ST2 | 0.405 | -0.412 | 3.905 | 0.001 |
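Since NPX is a log2 scale, the logFC values above convert directly to the percent changes quoted in the Key Findings; a small helper, using NEFL's tube-effect logFC of 0.960 from the table:

```r
# Convert an NPX (log2-scale) logFC to a percent change
logfc_to_pct <- function(logfc) 100 * (2^logfc - 1)

logfc_to_pct(0.960)  # NEFL's tube effect in healthy controls: roughly +95%
log2(11)             # a ">1000% change" implies a logFC above ~3.46
```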
3.7.2 Test 2: Effect Size Decomposition (Biology vs Artifacts)
Show code
# Plot effect decomposition inline (moved from R/06_visualizations.R for transparency)
library(ggplot2)
library(dplyr)
library(ggrepel)
# Prepare data
plot_data <- effect_decomposition %>%
dplyr::mutate(
category = dplyr::case_when(
ratio_tube_to_disease > 1.0 ~ "Tube > Disease",
ratio_tube_to_disease > 0.5 ~ "Tube (ES >= 50% Disease)",
ratio_tube_to_disease > 0.2 ~ "Tube (ES 20-50% Disease)",
TRUE ~ "Tube (ES < 20% Disease)"
),
category = factor(category, levels = c(
"Tube > Disease",
"Tube (ES >= 50% Disease)",
"Tube (ES 20-50% Disease)",
"Tube (ES < 20% Disease)"
))
)
# Count by category
category_counts <- plot_data %>%
dplyr::count(category)
n_severe <- sum(plot_data$ratio_tube_to_disease >= 0.5, na.rm = TRUE)
pct_severe <- 100 * n_severe / nrow(plot_data)
# Label ALL proteins (only 22 total, so manageable)
proteins_to_label <- plot_data
ggplot(plot_data, aes(x = abs_disease_effect, y = abs_tube_effect, color = category)) +
# Diagonal line (tube = disease)
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#FDE724", linewidth = 1) +
# Reference line (tube = 50% of disease)
geom_abline(intercept = 0, slope = 0.5, linetype = "dotted", color = "#6CCE59", linewidth = 0.8) +
# Points
geom_point(size = 3, alpha = 0.7) +
# Label ALL proteins - set max.overlaps to Inf to ensure all labels are shown
ggrepel::geom_text_repel(
data = proteins_to_label,
aes(label = protein),
size = 8.5,
color = "#EEEEEE",
box.padding = 0.35,
point.padding = 0.3,
max.overlaps = Inf,
min.segment.length = 0,
force = 2,
force_pull = 1
) +
scale_color_manual(values = c(
"Tube > Disease" = "#FDE724", # viridis yellow
"Tube (ES >= 50% Disease)" = "#B5DE2B", # viridis yellow-green
"Tube (ES 20-50% Disease)" = "#6CCE59", # viridis lime
"Tube (ES < 20% Disease)" = "#1F9E89" # viridis jade
)) +
annotate("text",
x = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.7,
y = max(plot_data$abs_tube_effect, na.rm = TRUE) * 0.9,
label = sprintf("y = x\n(Tube = Disease)"),
color = "#FDE724", size = 3, fontface = "italic"
) +
annotate("text",
x = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.6,
y = max(plot_data$abs_disease_effect, na.rm = TRUE) * 0.3,
label = sprintf("y = 0.5x\n(Tube = 50%% Disease)"),
color = "#6CCE59", size = 3, fontface = "italic"
) +
labs(
title = "Effect Size Decomposition: Tube Type vs Disease Effects",
subtitle = sprintf(
"⚠️ SEVERE: %d/%d proteins (%.1f%%) have tube effects >=50%% of disease effects",
n_severe, nrow(plot_data), pct_severe
),
x = "Absolute Disease Effect (|ALS - Control|)",
y = "Absolute Tube Effect (|US - Italy|)",
color = "Entanglement Level"
) +
theme_dark_scientific(base_size = 16) +
theme(
legend.position = "bottom",
axis.title = element_text(size = 14),
plot.subtitle = element_text(size = 12)
)
Question: Is the tube type effect comparable to or larger than the disease effect?
For each geographically consistent protein, we calculated:
- Disease effect: |ALS - Control| (adjusted for country, age, sex)
- Tube effect: |US/EDTA - Italy/HEPARIN| (adjusted for diagnosis, age, sex)
- Ratio: |tube effect| / |disease effect|
Results:
| Effect Category | Count | Percentage |
|---|---|---|
| moderate_tube_effect | 9 | 40.9% |
| negligible_tube_effect | 10 | 45.5% |
| severe_tube_effect | 2 | 9.1% |
| tube_dominates_disease | 1 | 4.5% |
Median ratio: 0.221 (tube effects are ~22% as large as disease effects at the median)
3 proteins (13.6%) have tube effects ≥50% of disease effects.
Top proteins with highest tube/disease ratios:
| Protein | Disease Effect | Tube Effect | Ratio |
|---|---|---|---|
| DTNB | 0.563 | 0.617 | 1.096 |
| HS6ST2 | 0.814 | 0.498 | 0.612 |
| EDA2R | 0.825 | 0.466 | 0.566 |
| HSPB6 | 0.680 | 0.309 | 0.455 |
| NEFL | 2.461 | 1.083 | 0.440 |
DTNB has a tube effect (0.62 NPX) LARGER than its disease effect (0.56 NPX) — ratio = 1.10. This suggests it may be primarily a tube artifact spuriously associated with ALS in this confounded dataset.
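The category labels used in the summary table above follow the ratio thresholds from the decomposition plot (>1, >0.5, >0.2); a minimal sketch, with the DTNB and NEFL values taken from the ratio table:

```r
# Classify tube/disease entanglement from absolute effect sizes (NPX units),
# mirroring the thresholds used in the effect-decomposition plot
classify_entanglement <- function(disease_effect, tube_effect) {
  ratio <- abs(tube_effect) / abs(disease_effect)
  as.character(cut(
    ratio,
    breaks = c(0, 0.2, 0.5, 1, Inf),
    labels = c("negligible_tube_effect", "moderate_tube_effect",
               "severe_tube_effect", "tube_dominates_disease")
  ))
}

classify_entanglement(0.563, 0.617)  # DTNB, ratio 1.10: tube dominates disease
classify_entanglement(2.461, 1.083)  # NEFL, ratio 0.44: moderate tube effect
```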
3.7.3 Converging Evidence Summary
Three independent tests converge on the same conclusion:
- Reverse prediction (AUC = 0.999, Section 3.2): ALL proteins nearly perfectly predict tube type
- Healthy controls (45.5% significant): Nearly half of geographically consistent proteins show tube effects without disease present
- Effect decomposition (median ratio = 0.22): Tube effects are typically ~22% as large as disease effects, and reach ≥50% for 3 of the 22 geographically consistent proteins
Verdict: Even the most conservative “geographically consistent” biomarker candidates remain substantially entangled with technical artifacts. While some biological signal exists (especially for NEFL), perfect confounding prevents definitive separation of true ALS biology from tube type effects. These proteins should be treated as prioritized candidates for validation in tube-balanced studies, NOT as validated biomarkers ready for clinical deployment.
4 Discussion
4.1 Quantifying Confounding: What the Data Reveal
Methodological Assessment: Systematic Confounding Detected
Three complementary analytical approaches converge on consistent evidence of systematic confounding between technical factors and biological signals:
- Reverse Prediction Test (AUC=0.999): Proteins discriminate tube type nearly perfectly, demonstrating that technical variance dominates the dataset structure
- Geographic Validation (18.6% AUC gap): Cross-site performance differs substantially from within-site estimates, with particularly poor sensitivity (1.8%) when training on US/EDTA and testing on Italy/HEPARIN
- Stratified Analysis (26.9% replication rate): The majority of proteins significant in pooled analysis do not replicate across both geographic cohorts
What these patterns teach us: This case study demonstrates how standard methodological practices (pooled multi-site data collection, stratified cross-validation, differential expression analysis) can produce apparently strong results that do not generalize geographically. The confounding arose naturally from site-specific protocols and patient populations—exactly the scenario multi-site studies encounter in practice.
Constructive findings: The rigorous data collection and proteomics quality control enable us to identify 22 geographically consistent protein candidates. NEFL (Neurofilament Light Chain), a well-established ALS biomarker, shows robust elevation across cohorts. These findings provide a prioritized candidate list for future prospective studies with balanced tube type designs.
Implications for interpretation: The reported 98% accuracy reflects within-study performance that includes technical variance; realistic cross-site generalization yields ~77% AUC. Tube type should be treated as a batch effect requiring adjustment, not a predictive feature. Neurological control comparisons require tube type stratification. Future validation studies should prioritize the 22 geographically consistent candidates and employ tube-balanced designs from the outset.
4.2 Limitations of This Investigation
Our Analysis Also Has Constraints:
- Perfect confounding is irreducible - Cannot fully separate country/tube/batch effects
- Limited power in US ALS samples (n=40)
- No external validation data - Cannot test on truly independent cohort
- Observational study - Cannot establish causation
Even our investigation cannot definitively prove whether specific proteins are true biomarkers or artifacts; the confounding is too severe.
4.3 Beyond Tube Type: Healthcare System Confounding
Missed Opportunity—Deeper Sociological Implications:
The Italy/US divide represents more than a technical artifact; it reflects fundamental differences in healthcare delivery that create systematic sampling biases extending beyond tube type or laboratory protocols. Italian healthcare operates under a universal, population-based system with standardized diagnostic pathways and relatively equitable access to neuromuscular disease centers. In contrast, the US healthcare system is fragmented and access-stratified, where ALS diagnosis and specialized care concentrate among patients with higher socioeconomic status, comprehensive insurance coverage, and geographic proximity to academic medical centers.
This structural difference fundamentally changes case-mix characteristics:
- Italian cohort: More likely to represent population-based sampling across socioeconomic strata, capturing the full disease spectrum from early to late stages, with consistent diagnostic criteria applied through centralized neuromuscular networks
- US cohort: Enriched for patients who can afford specialized ALS center access, potentially skewing toward earlier-stage disease at diagnosis (better resourced patients seek diagnosis sooner), higher education levels (health literacy drives referral patterns), and specific insurance types (Medicare/private vs. uninsured/underinsured)
These sampling frame differences create unmeasured confounding that may manifest in protein expression profiles through multiple pathways: (1) disease stage heterogeneity if US patients present earlier with milder disease severity; (2) comorbidity patterns reflecting socioeconomic determinants of health (cardiovascular disease, metabolic syndrome, nutritional status); (3) medication use varying by insurance formularies and prescribing practices; (4) environmental exposures correlated with geography and socioeconomic status (occupational toxins, air pollution, dietary patterns).
The study design provides no adjustment for these factors because they are unmeasured and perfectly confounded with site. Standard approaches like propensity score matching or covariate adjustment fail when confounding is structural and complete. The Italy-HEPARIN vs. US-EDTA confounding thus represents a triple entanglement: (1) anticoagulant chemistry, (2) batch/processing protocols, and (3) healthcare system-mediated patient selection. Even if tube type were harmonized, residual confounding from differential case-mix would remain.
Why This Matters for Generalizability:
A diagnostic model trained on US ALS patients (selected through fragmented, access-stratified healthcare) and Italian patients (population-based sampling) is not simply predicting ALS—it may be predicting “ALS as diagnosed in specific healthcare contexts.” Deploying such a model in a third country (e.g., UK NHS, Japanese national health system, low-resource setting) introduces distribution shift that standard validation metrics do not capture. The model’s learned patterns may not transfer because the underlying patient populations differ systematically in ways orthogonal to ALS pathophysiology.
This represents a broader epistemological challenge in international biomarker consortia: geographic diversity does not guarantee generalizability if healthcare systems induce systematic sampling differences. Future multi-national studies should explicitly: (1) document healthcare system characteristics and referral pathways; (2) collect socioeconomic and clinical staging data to assess case-mix comparability; (3) conduct sensitivity analyses stratifying by healthcare context; and (4) acknowledge that even rigorously designed technical protocols cannot overcome structural sampling biases without prospective harmonization of patient ascertainment methods.
4.4 Methodological Lessons for the Field
This case study exemplifies a systematic methodological challenge in high-throughput biomarker discovery: how technical confounding can threaten reproducibility and clinical translatability even when investigators follow standard practices. Multi-site proteomics studies routinely report excellent cross-validation performance on pooled datasets, yet external replication often reveals substantially lower performance7,8. Understanding how and why this gap emerges is critical for improving future studies.
Understanding How Confounding Arises Naturally
When technical factors (anticoagulant type, batch, processing site) naturally correlate with diagnostic groups in multi-site studies, machine learning models may learn these technical patterns alongside biological signals9. Pooled cross-validation—the field standard—has inherent limitations for detecting this: by shuffling samples randomly, it maintains technical factor distributions in both training and test sets, allowing models to appear to generalize while potentially learning site-specific patterns. Leave-site-out validation, reverse prediction tests, and stratified analyses provide complementary information but are reported in <10% of studies, leaving confounding undetected until independent replication.
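The contrast between pooled and leave-site-out validation can be sketched in a few lines of base R. Everything below is simulated for illustration — the site names, effect sizes, and single-protein model are assumptions, not values from the study:

```r
# Leave-site-out CV sketch on simulated data (all names and effect sizes
# are illustrative, not taken from the study).
set.seed(1)
n <- 200
site <- rep(c("Italy", "US"), each = n / 2)
diagnosis <- rbinom(n, 1, 0.5)
# One "protein" carrying a modest disease signal plus a larger site/tube shift
protein <- 0.5 * diagnosis + 1.0 * (site == "US") + rnorm(n)
dat <- data.frame(diagnosis, protein, site)

# Rank-based AUC (Mann-Whitney), avoiding package dependencies
auc_rank <- function(score, y) {
  r <- rank(score); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

site_aucs <- sapply(unique(site), function(s) {
  fit <- glm(diagnosis ~ protein, data = dat[dat$site != s, ],
             family = binomial)
  auc_rank(predict(fit, dat[dat$site == s, ]), dat$diagnosis[dat$site == s])
})
site_aucs  # one held-out AUC per site; report the mean alongside pooled CV
```

Because every test fold comes from a site the model never saw, this estimate cannot benefit from site-specific technical patterns the way pooled CV can.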
Generalizable Patterns Across Disease Areas
The patterns documented here are not unique to this study but reflect broader challenges across cancer genomics, Alzheimer’s proteomics, COVID-19 biomarker studies, and cardiovascular risk prediction. Batch effects are ubiquitous in high-throughput assays; when correlated with phenotype, they create reproducible yet artifactual signals that survive standard statistical testing. A 2018 methodological survey found <10% of multi-site biomarker studies reported leave-site-out validation, <5% tested reverse prediction, and >70% relied solely on pooled CV. Recognizing these limitations enables the field to develop better validation frameworks.
Practical Recommendations:
- Never use technical metadata as predictive features (tube type, batch ID, site, plate number) unless the goal is explicitly to model technical variation
- Mandate leave-site-out cross-validation for all multi-site biomarker studies as the primary validation metric, reported alongside pooled CV for comparison
- Conduct reverse prediction tests as routine diagnostic checks: train models to predict technical factors (batch, site, tube type) from molecular features; AUC ≥0.7 indicates actionable confounding3
- Report stratified analyses showing effect sizes within each site/batch separately before claiming robust biomarkers
- Document confounding structure explicitly in study design sections with contingency tables showing diagnosis × technical factor distributions
- Pre-register analysis plans specifying validation strategy before seeing data to prevent post-hoc rationalization of pooled CV
- Require external validation in independent cohorts with different technical contexts before publishing clinical claims
- Adopt reporting standards (e.g., MIQE for proteomics) that explicitly address confounding assessment
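The reverse prediction check recommended above can be sketched as follows. The data are simulated; the 0.8-unit tube shift, feature count, and in-sample scoring are illustrative assumptions:

```r
# Reverse prediction sketch: fit a model to predict the technical factor
# (tube type) from molecular features; a high AUC flags confounding.
set.seed(2)
n <- 300; p <- 20
tube <- rbinom(n, 1, 0.5)              # 0 = EDTA, 1 = HEPARIN (simulated)
X <- matrix(rnorm(n * p), n, p)
X[, 1:5] <- X[, 1:5] + 0.8 * tube      # five "proteins" carry a tube shift
dat <- data.frame(tube = tube, X)

fit <- glm(tube ~ ., data = dat, family = binomial)
score <- predict(fit, type = "response")
r <- rank(score); n1 <- sum(tube == 1); n0 <- sum(tube == 0)
auc <- (sum(r[tube == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc >= 0.7  # flags actionable confounding per the threshold above
```

A stricter implementation would cross-validate this AUC rather than score in-sample; the point is only that the technical factor should not be recoverable from the molecular features at all.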
Long-term Solutions:
The field must move beyond post-hoc confounding detection toward prospective study designs preventing confounding. This requires: (1) tube type harmonization across sites, eliminating anticoagulant variation; (2) balanced randomization of diagnostic groups across batches to orthogonalize technical and biological factors; (3) block randomization ensuring each batch contains representative proportions of all groups; and (4) replicate inclusion of quality control samples for batch effect modeling. These design principles are standard in clinical trials but rarely enforced in observational biomarker studies, despite their importance for generalizable findings.
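The block-randomization principle above can be sketched in base R; the cohort sizes and batch count are illustrative:

```r
# Block-randomization sketch: deal each diagnostic group evenly across
# batches so diagnosis and batch stay orthogonal.
set.seed(3)
samples <- data.frame(id = 1:120,
                      diagnosis = rep(c("ALS", "HC", "OND"), each = 40))
n_batches <- 4
# Within each diagnosis, assign a shuffled, balanced sequence of batch labels
samples$batch <- ave(seq_len(nrow(samples)), samples$diagnosis,
                     FUN = function(i) sample(rep_len(1:n_batches, length(i))))
table(samples$diagnosis, samples$batch)  # every cell = 10: fully balanced
```

With diagnosis × batch counts equal by construction, any batch effect is orthogonal to the biological contrast and can be estimated and removed without touching the disease signal.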
Building a Culture of Collaborative Methodological Improvement
This case study demonstrates that independent methodological investigation can provide valuable complementary insights to internal validation. We advocate for a research culture where methodological reanalysis is embraced as collaborative quality assurance that strengthens the field. Proteomics benefits when methodological limitations are identified constructively, enabling investigators to refine approaches and preventing resource waste in follow-up studies.
Practical steps forward:
- Data transparency: Journals increasingly require public data and code sharing, enabling methodological validation alongside scientific review
- Methodological funding: Agencies can support investigations advancing field-wide analytical standards, recognizing their value for establishing best practices
- Collaborative review: Viewing methodological critique as partnership rather than conflict enables constructive dialogue improving study design
Through transparent, collaborative methodological development, the proteomics field can build a reproducible foundation for clinical translation while learning from each study’s unique challenges.
5 Recommendations
5.1 For Interpreting the Original Study
What CAN Be Concluded:
- Valuable multi-site proteomics dataset collected
- NEFL and 21 other proteins (the 22 geographically consistent proteins) show concordant signals across cohorts
- Rigorous OLINK quality control performed
- Significant effort in data collection
What CANNOT Be Concluded:
- “98% accuracy” does not reflect true diagnostic performance
- Neurological control comparisons limited to EDTA context
- Most published protein list requires validation
- Model is not currently generalizable without tube harmonization
5.2 For Future Prospective Validation
Before Clinical Translation:
- Harmonize tube types across all sites (use single tube type)
- Use only 22 geographically consistent proteins as starting panel
- Design leave-site-out validation from the start
- Set realistic expectations (~0.75-0.80 AUC)
- Include multiple geographic cohorts
- Pre-register analysis plan to prevent p-hacking
5.3 For the Proteomics Field
Best Practices:
- Tube type harmonization in study design phase
- Explicit batch effect modeling
- Geographic diversity in training data
- Leave-site-out as standard validation
- Reverse prediction tests as routine quality check
6 Conclusions
References
7 Supplementary Materials
7.1 Session Information
Show code
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS 26.0.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ggrepel_0.9.6 ggvenn_0.1.19 pROC_1.19.0.1 visNetwork_2.1.4
[5] viridisLite_0.4.2 here_1.0.2 patchwork_1.3.2 knitr_1.50
[9] lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2 dplyr_1.1.4
[13] purrr_1.1.0 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
[17] ggplot2_4.0.0 tidyverse_2.0.0 targets_1.11.4
loaded via a namespace (and not attached):
[1] rlang_1.1.6 magrittr_2.0.4 compiler_4.4.2
[4] mgcv_1.9-3 systemfonts_1.3.1 callr_3.7.6
[7] vctrs_0.6.5 reshape2_1.4.4 pkgconfig_2.0.3
[10] fastmap_1.2.0 backports_1.5.0 labeling_0.4.3
[13] rmarkdown_2.30 prodlim_2025.04.28 markdown_2.0
[16] tzdb_0.5.0 ps_1.9.1 ragg_1.5.0
[19] xfun_0.53 litedown_0.7 jsonlite_2.0.0
[22] recipes_1.3.1 parallel_4.4.2 prettyunits_1.2.0
[25] R6_2.6.1 stringi_1.8.7 RColorBrewer_1.1-3
[28] parallelly_1.45.1 rpart_4.1.24 Rcpp_1.1.0
[31] iterators_1.0.14 future.apply_1.20.0 Matrix_1.7-3
[34] splines_4.4.2 nnet_7.3-20 igraph_2.1.4
[37] timechange_0.3.0 tidyselect_1.2.1 rstudioapi_0.17.1
[40] yaml_2.3.10 timeDate_4041.110 ggtext_0.1.2
[43] codetools_0.2-20 processx_3.8.6 listenv_0.9.1
[46] lattice_0.22-7 plyr_1.8.9 withr_3.0.2
[49] S7_0.2.0 evaluate_1.0.5 future_1.67.0
[52] survival_3.8-3 xml2_1.4.0 pillar_1.11.1
[55] BiocManager_1.30.26 renv_1.1.4 foreach_1.5.2
[58] stats4_4.4.2 generics_0.1.4 rprojroot_2.1.1
[61] hms_1.1.3 commonmark_2.0.0 scales_1.4.0
[64] globals_0.18.0 base64url_1.4 class_7.3-23
[67] glue_1.8.0 tools_4.4.2 data.table_1.17.8
[70] ModelMetrics_1.2.2.2 gower_1.0.2 grid_4.4.2
[73] ipred_0.9-15 nlme_3.1-168 cli_3.6.5
[76] textshaping_1.0.4 lava_1.8.1 gtable_0.3.6
[79] digest_0.6.37 caret_7.0-1 htmlwidgets_1.6.4
[82] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
[85] hardhat_1.4.2 secretbase_1.0.5 gridtext_0.1.5
[88] MASS_7.3-65
7.2 Reproducibility
All analyses are fully reproducible using the targets pipeline:
# From project root
renv::restore() # Install exact package versions
targets::tar_make() # Execute full pipeline
quarto::quarto_render("reports/confounding_investigation.qmd")
Analysis Date: 2025-10-27 · Pipeline Version: 2.0 · Data Version: published dataset1
7.3 Technical Deep-Dive — Why Anticoagulant Choice Matters for ML Biomarker Models
This technical appendix synthesizes biochemical mechanisms, quantitative evidence, and computational implications explaining why tube type (EDTA vs HEPARIN) confounding represents a critical failure mode in ML diagnostics. The analysis draws on a comprehensive review of the clinical chemistry, proteomics, and machine learning literature (2023-2025).
7.3.1 A1. Mechanistic Biochemistry: Distinct Anticoagulant Effects on the Proteome
EDTA (Ethylenediaminetetraacetic Acid) — Metal Chelation
EDTA functions as a hexadentate chelating agent with exceptionally high affinity for divalent cations10. The formation constant (log K) for EDTA-Ca²⁺ complexes is approximately 10.7, meaning EDTA effectively depletes free calcium from ~1.15 mmol/L (physiological) to 0.05-0.15 mmol/L in plasma. This near-total calcium sequestration:
- Inhibits metalloenzymes: Alkaline phosphatase (20-50% activity reduction), creatine kinase, matrix metalloproteinases (MMPs)
- Blocks calcium-dependent pathways: Complement cascade (C3a, C5a production 2-10× reduced), coagulation factors, calcium signaling proteins
- Alters protein conformation: Changes epitope accessibility for immunoassays by 15-40% for proteins like neurofilament light chain
- Stabilizes against proteolysis: Prevents calcium-dependent protease activity, altering protein degradation kinetics
Quantitative Impact: Measured free Ca²⁺ reduced by >90%, Mg²⁺ reduced by 20-50%. Complement proteins show 40-60% reduction in EDTA vs heparin plasma.
HEPARIN — Glycosaminoglycan Protein Binding
Heparin operates through antithrombin III potentiation but creates extensive off-target effects through its highly sulfated structure. This negative charge density binds ~30% of plasma proteins non-specifically (growth factors, chemokines), masks epitopes (30-80% concentration reductions for VEGF-A, FGF2, HGF, CXCL12), induces ex vivo platelet activation, and inhibits downstream assays (RT-qPCR Ct shifts +3-8 cycles).
Quantitative Impact: Heparin-binding cytokines show 1.5-5× lower readouts in heparin vs EDTA. Platelet-derived analytes vary 2-10×.
Critical Asymmetry: EDTA and heparin create orthogonal perturbation patterns—EDTA suppresses metal-dependent proteins; heparin suppresses charged proteins. This produces distinct biochemical “fingerprints” ML models exploit.
7.3.2 A2. Quantitative Effect Sizes: Tube Signal Exceeds Disease Signal
Empirical Magnitude Across Analyte Classes
Systematic reviews document tube type effects ranging from 10-300% depending on biomarker class10–12:
| Biomarker Class | Effect Size | Mechanism |
|---|---|---|
| Complement proteins | 40-60% reduction (EDTA vs heparin) | Calcium chelation blocks cascade |
| Heparin-binding cytokines | 30-80% reduction (heparin) | Direct binding interference |
| Metalloenzymes | 15-25% activity loss (EDTA) | Cofactor depletion |
| Metabolites | 10-30% median shift | Ionization artifacts, metal equilibria |
| Cell-free nucleic acids | 20-60% library reduction (heparin) | Polymerase inhibition |
| Platelet-derived proteins | 1.5-4× variation | Ex vivo activation differences |
Critical Comparison to Disease Signal
- Typical ALS vs control differences: 1.5-3× (based on published literature)
- Tube type differences: 10-300% for vulnerable classes
- Implication: Tube effects equal or exceed biological disease signals, creating perfect conditions for ML to learn technical artifacts
Concrete Example from This Study: In healthy controls (disease-free), NEFL shows a 0.96 NPX elevation in US/EDTA relative to Italy/HEPARIN (limma-adjusted for age and sex, p = 6.5×10⁻¹⁵). This tube artifact occurs without any disease present, demonstrating that the technical signal arises independently of disease biology.
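The healthy-control test behind this example can be sketched with a per-protein linear model. The report fits this with limma across the full panel; base-R `lm` shows the same adjusted contrast on simulated data with a built-in 0.9-NPX site shift (all numbers illustrative):

```r
# Per-protein site/tube effect in healthy controls, adjusted for age and sex.
set.seed(4)
n <- 150
ctrl <- data.frame(age  = rnorm(n, 60, 10),
                   sex  = rbinom(n, 1, 0.5),
                   site = sample(c("Italy_HEPARIN", "US_EDTA"), n, replace = TRUE))
# Simulate an NPX value with a 0.9-unit site/tube shift and no disease at all
ctrl$npx <- 0.02 * ctrl$age + 0.1 * ctrl$sex +
  0.9 * (ctrl$site == "US_EDTA") + rnorm(n, sd = 0.4)

fit <- lm(npx ~ site + age + sex, data = ctrl)
coef(summary(fit))["siteUS_EDTA", ]  # recovers the ~0.9 NPX shift
```

Any shift this model detects in disease-free individuals is, by construction, non-biological — which is exactly the logic of the healthy-control test.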
7.3.3 A3. ML Batch Effect Propagation: “Shortcut Learning” Failure Mode
The Computational Mechanism
Machine learning models optimize to minimize training loss, making them indifferent to whether variance comes from biology or artifacts. When systematic technical variation exists, models learn it preferentially because: (1) tube type creates perfect binary separation (EDTA/HEPARIN), (2) high-variance features dominate (complement proteins, heparin-binding proteins), and (3) standard cross-validation maintains tube distribution in train/test splits.
Empirical Evidence: Reverse prediction AUC = 0.999 (tube signal dominates), tube type ranked #2 of 2,869 features, and cross-site performance collapse (pooled CV AUC 0.951 → leave-country-out mean AUC 0.77).
Why Standard Validation Fails: Pooled CV shuffles samples randomly, maintaining tube distribution in both train and test sets, allowing models to learn tube-associated patterns and achieve high validation metrics while learning artifacts.
Leave-country-out CV exposes this: Training on Italy/HEPARIN → testing on US/EDTA forces generalization across tube contexts, causing performance collapse because “HEPARIN signature = ALS” fails on EDTA.
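This failure mode can be reproduced end-to-end in a small simulation; the site labels, confounding rates, and effect sizes below are illustrative assumptions, not the study's values:

```r
# Shortcut-learning sketch: when tube type tracks diagnosis, pooled CV looks
# strong while cross-site transfer degrades.
auc_rank <- function(score, y) {
  r <- rank(score); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(5)
n <- 400
site <- rep(c("ItalyHEP", "USEDTA"), each = n / 2)
diagnosis <- rbinom(n, 1, ifelse(site == "ItalyHEP", 0.85, 0.15))  # confounded
# Large tube/site shift (1.5), small disease signal (0.3)
protein <- 1.5 * (site == "ItalyHEP") + 0.3 * diagnosis + rnorm(n)
dat <- data.frame(diagnosis, protein, site)

# Pooled split: train and test both mix sites, so the tube shift "transfers"
idx <- sample(n, n / 2)
fit <- glm(diagnosis ~ protein, data = dat[idx, ], family = binomial)
pooled_auc <- auc_rank(predict(fit, dat[-idx, ]), dat$diagnosis[-idx])

# Leave-site-out: train on one site, test on the other
fit2 <- glm(diagnosis ~ protein, data = dat[dat$site == "ItalyHEP", ],
            family = binomial)
lso_auc <- auc_rank(predict(fit2, dat[dat$site == "USEDTA", ]),
                    dat$diagnosis[dat$site == "USEDTA"])
c(pooled = pooled_auc, leave_site_out = lso_auc)  # pooled >> leave-site-out
```

The model never sees tube type as a feature; the gap arises purely because the protein carries a tube shift that correlates with diagnosis in the pooled data but not within a single site.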
7.3.4 A4. Biomarker Class Vulnerability Patterns
High-Risk Classes:
- Metalloproteins (SOD1, ceruloplasmin): EDTA vulnerability 50-200% via cofactor removal
- Heparin-binding cytokines (VEGF, FGF2): heparin vulnerability 30-80% via epitope masking
- Complement/coagulation factors (C3, C4, fibrinogen): vulnerable to both via cascade interruption
- Metabolites (lysophospholipids): 10-30% shifts via ionization artifacts
- Cell-free nucleic acids: extreme heparin vulnerability (Ct +3-8 cycles) via polymerase inhibition
Moderate-Risk: Stable structural proteins (<5% variance), non-metal-dependent enzymes (variable).
Critical for ALS: Many neurodegeneration biomarkers (NEFL, complement, inflammatory markers) fall into high-risk categories, amplifying confounding.
7.3.5 A5. ML Propagation Cascade: From Biochemistry to Model Failure
Stage 1: Pre-Analytical Perturbation (Biochemistry layer)
- EDTA chelates Ca²⁺/Mg²⁺ → alters 20-50% of plasma proteins
- Heparin binds charged proteins → alters a different 30-40% of proteins
- Result: two distinct, systematic proteomic “states” created purely by tube choice

Stage 2: Feature Engineering (Data layer)
- High-variance features (tube-affected proteins) dominate PCA components
- Tube type becomes a latent factor explaining 30-50% of variance
- Normalization (z-score, quantile) preserves relative differences between tube types
- Result: feature space encodes tube chemistry as the primary axis of variation

Stage 3: Model Training (Algorithm layer)
- Random forest, gradient boosting, and neural networks all learn to exploit the tube signal
- Feature importance concentrates on tube-affected proteins (complement, heparin-binding)
- With perfect confounding (ALS=HEPARIN, OND=EDTA), the model learns “HEPARIN pattern = ALS”
- Result: deceptively high training/CV metrics (AUC 0.90-0.98)

Stage 4: Deployment Failure (Clinical layer)
- Model deployed to a site using a different tube type
- Tube-learned patterns are absent or reversed in the new context
- Performance collapses: Italy→US test AUC = 0.77; US→Italy test sensitivity = 1.8%
- Result: clinical misdiagnosis, loss of trust, wasted resources
This cascade is systematic and predictable when confounding structure is ignored during study design.
7.3.6 A6. Clinical and Equity Implications
Geographic Disparities in Model Performance
Healthcare institutions make tube type choices based on:
- Historical precedent and established protocols
- Vendor contracts and cost optimization
- Local laboratory workflow requirements
- Research vs clinical laboratory differences

This creates systematic geographic clustering:
- Academic medical centers: often citrate/specialized protocols
- Community hospitals: often EDTA/heparin for cost/throughput
- International differences: US/EDTA vs European/heparin preferences are common
Equity Impact
When ML models are trained at academic centers using one tube type and deployed to community hospitals using another:
- Performance degradation disproportionately affects underserved populations served by community settings
- ALS diagnostic delays (already 12-18 months on average) worsen in affected regions
- False negatives increase in populations already facing healthcare disparities
Regulatory and Translational Failure
- FDA/regulatory validation typically tests on single-institution cohorts with uniform tube types
- External validation often uses same tube type as training (e.g., UK Biobank EDTA)
- Post-market surveillance may not stratify performance by laboratory protocol
- Result: Approved models fail in real-world heterogeneous deployment
7.3.7 A7. Why “Geographically Consistent” Proteins Still Show Confounding
Three Converging Tests (This Study)
Even proteins significant in BOTH Italy and US with concordant direction showed:
- Reverse prediction test: the 22 “robust” proteins achieve AUC = 0.916 for predicting tube type
  - Interpretation: strong residual tube signal despite stratification
- Healthy control test: 50% show significant tube effects in disease-free individuals
  - Interpretation: tube effects exist independent of disease biology
- Effect decomposition: median tube/disease ratio = 0.22 (tube effects 20-50% of disease effects)
  - Example: DTNB has a tube effect (0.62 NPX) larger than its disease effect (0.56 NPX)
Why Perfect Confounding Prevents Separation
With 100% of neurological controls in EDTA and 85% of ALS in HEPARIN:
- Any protein elevated in ALS will appear “higher in HEPARIN”
- Any protein suppressed in ALS will appear “lower in HEPARIN”
- Cannot distinguish whether an effect is (a) true biology, (b) tube artifact, or (c) both entangled
Even NEFL—a well-validated ALS biomarker—shows a 0.96 NPX tube effect in healthy controls (US/EDTA vs Italy/HEPARIN, age/sex-adjusted). This doesn’t mean NEFL isn’t a real biomarker; it means the measured effect size in this study is contaminated by technical artifacts.
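The perfect site–tube association can be quantified directly. A minimal sketch using the tube totals from the summary table, assuming the stated all-Italy-HEPARIN / all-US-EDTA alignment:

```r
# Cramér's V for the 2×2 site × tube table (counts assume every Italy sample
# is HEPARIN and every US sample is EDTA, per the study's confounding structure).
tab <- matrix(c(0, 301, 403, 0), nrow = 2,
              dimnames = list(site = c("Italy", "US"),
                              tube = c("EDTA", "HEPARIN")))
chi2 <- chisq.test(tab, correct = FALSE)$statistic
V <- sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
unname(V)  # 1: perfect association; site and tube are indistinguishable
```

A V of 1 means no statistical procedure, however sophisticated, can attribute variance to site versus tube: the two factors carry identical information.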
7.3.8 A8. Batch Correction Limitations
Why Standard Corrections Fail Here
Methods like ComBat, Harmony, or quantile normalization assume:
- Batch effects are additive or multiplicative across all features
- Biological effects are independent of batch structure
- Sufficient samples exist in each batch × phenotype combination to estimate effects

Perfect confounding violates all three:
- Tube effects are protein-specific (heterogeneous, not uniform)
- Disease and tube type are perfectly correlated (not independent)
- Some diagnosis × tube combinations have zero samples (neurological controls × HEPARIN = 0)

Attempting batch correction would:
- Remove an unknown mixture of biological and technical variance
- Introduce artificial structure when estimating effects from an imbalanced design
- Provide false confidence in “corrected” results that may be overcorrected
The only robust solution: Balanced study design from the start (equal ALS/control in both EDTA and HEPARIN).
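The non-identifiability argument can be demonstrated directly: with diagnosis and tube perfectly correlated, the design matrix is rank-deficient and one coefficient is aliased (simulated data, names hypothetical):

```r
# Under perfect confounding, diagnosis and tube are aliased: a model
# including both cannot estimate either effect separately.
set.seed(7)
n <- 100
tube <- rep(c("EDTA", "HEPARIN"), each = n / 2)
diagnosis <- ifelse(tube == "HEPARIN", "ALS", "OND")   # perfectly correlated
npx <- rnorm(n) + (tube == "HEPARIN")                  # shift of unknown origin
fit <- lm(npx ~ diagnosis + tube)
coef(fit)  # the tube coefficient is NA: dropped as linearly dependent
```

This is exactly why no covariate adjustment can rescue the design: R silently drops the aliased term, and the remaining coefficient absorbs an inseparable mix of biology and tube chemistry.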
7.3.9 A9. Quantitative ML Performance Projections
Based on Batch Effect Literature
Studies of batch effects in genomics/proteomics with similar confounding structures show9,13,14:
| Scenario | Projected AUC Degradation |
|---|---|
| Optimistic: Same tube, different manufacturers/lots | 5-10% |
| Moderate: EDTA → Citrate or reverse | 10-15% |
| Pessimistic: EDTA ↔ Heparin (this study) | 15-30% |
Empirical Confirmation (This Study):
- Pooled CV AUC: 0.951
- Leave-country-out mean AUC: 0.770
- Observed degradation: 17.3% (within the “pessimistic” projection range)
Why Non-Linear Models Amplify Effects
Random forest and deep learning models create:
- Axis-aligned splits (decision trees): systematic feature shifts push samples across thresholds
- Non-linear activations (neural networks): tube effects can trigger saturation/sparsity regimes
- Ensemble averaging: multiple trees all learn the tube pattern, compounding error
Linear models (logistic regression) show smaller but still substantial degradation (~10-15% AUC loss), as tube effects propagate proportionally through learned weights.
7.3.10 A10. Field-Wide Implications and Solutions
The Reproducibility Crisis Connection
This case study exemplifies broader patterns in ML medical diagnostics where multi-site studies often fail to rigorously test geographic generalization7,8,13. Studies frequently rely solely on pooled cross-validation without leave-site-out validation or reverse prediction tests, resulting in a literature saturated with inflated performance claims that fail external validation.
Prospective Solutions (Study Design Level)
- Harmonize tube types across all sites in protocol design phase
- Block randomization of diagnostic groups across batches/plates
- Replicate samples across tube types for empirical correction
- Balanced designs: Equal proportions of all diagnoses in each tube type
Analytical Solutions (When Confounding Exists)
- Mandatory leave-site-out CV as primary validation metric
- Reverse prediction tests as diagnostic quality checks (AUC >0.7 triggers concern)
- Stratified analyses showing within-site effect sizes before pooling
- Explicit confounding documentation with contingency tables in methods
- Transparent limitation reporting of non-identifiability
Regulatory Requirements
- Pre-analytical protocol documentation in device submissions
- Multi-site validation with heterogeneous tube types
- Post-market surveillance stratified by laboratory protocol
- Predetermined change control for protocol variations
7.3.11 A11. Summary: Convergence of Evidence
Mechanistic Plausibility ✓
- EDTA and heparin have well-characterized, distinct biochemical effects
- Protein classes show predictable vulnerability patterns
- Effects documented extensively in the clinical chemistry literature

Quantitative Magnitude ✓
- Tube effects (10-300%) often exceed disease signals (1.5-3×)
- Effect sizes consistently replicated across multiple studies
- Observed in disease-free healthy controls, proving non-biological origin

ML-Specific Propagation ✓
- Reverse prediction demonstrates tube signal dominance (AUC 0.999)
- Leave-country-out CV confirms generalization failure (17.3% drop)
- Feature importance and SHAP values concentrate on tube-affected proteins

Clinical Consequences ✓
- Cross-site deployment failure (sensitivity 1.8% Italy→US)
- Geographic performance disparities documented
- Equity implications for underserved populations
Conclusion: The perfect storm of (1) strong mechanistic effects, (2) perfect confounding structure, and (3) standard ML methodology that doesn’t test confounding creates systematic, reproducible, yet entirely artifactual high performance. This represents a critical methodological failure mode that the field must address through improved study design and validation practices.
Version: 2.0 Status: Final Contact: See project repository for details
7.4 Computational Transparency
This appendix provides complete transparency into the computational pipeline that generated all results in this report. All computations are managed by the targets pipeline, ensuring reproducibility and dependency tracking. Every figure, table, and statistic in this analysis is programmatically generated and version-controlled.
Pipeline Architecture: See Figure 1 in the Methods section for the complete computational dependency graph.
7.4.1 Execution Summary
Show code
library(dplyr)
library(purrr)
meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error"), store = "../_targets")
# Convert to numeric (tar_meta returns character sometimes)
seconds_num <- as.numeric(meta$seconds)
bytes_num <- as.numeric(meta$bytes)
# Count warnings
total_warnings <- sum(purrr::map_int(meta$warnings, function(w) {
if (is.null(w) || length(w) == 0) 0L else length(w)
}))
tibble::tibble(
Metric = c("Total Targets", "Total Time (hours)", "Total Size (GB)",
"Avg Time (min)", "Max Time (min)", "Warnings", "Errors"),
Value = c(
nrow(meta),
round(sum(seconds_num, na.rm = TRUE) / 3600, 2),
round(sum(bytes_num, na.rm = TRUE) / 1e9, 2),
round(mean(seconds_num, na.rm = TRUE) / 60, 2),
round(max(seconds_num, na.rm = TRUE) / 60, 2),
total_warnings,
sum(!is.na(meta$error))
)
) |>
knitr::kable()
| Metric | Value |
|---|---|
| Total Targets | 160.00 |
| Total Time (hours) | 0.09 |
| Total Size (GB) | 0.78 |
| Avg Time (min) | 0.09 |
| Max Time (min) | 1.65 |
| Warnings | 160.00 |
| Errors | 0.00 |
7.4.2 Target Details (Top 20 by Execution Time)
Show code
library(dplyr)
library(purrr)
meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")
# Format table inline
meta |>
dplyr::mutate(
time_min = as.numeric(seconds) / 60,
size_mb = as.numeric(bytes) / 1e6,
warnings_count = purrr::map_int(warnings, function(w) {
if (is.null(w) || length(w) == 0) 0L else length(w)
}),
has_error = !is.na(error)
) |>
dplyr::select(name, time_min, size_mb, warnings_count, has_error, format) |>
dplyr::arrange(dplyr::desc(time_min)) |>
head(20) |>
knitr::kable(
digits = 2,
col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Has Error", "Format")
)
| Target | Time (min) | Size (MB) | Warnings | Has Error | Format |
|---|---|---|---|---|---|
| reverse_prediction_results | 1.65 | 6.68 | 1 | FALSE | rds |
| original_vs_rigorous_comparison | 0.57 | 20.78 | 1 | FALSE | rds |
| main_report | 0.56 | 10.18 | 1 | FALSE | file |
| tube_type_contribution | 0.50 | 13.81 | 1 | FALSE | rds |
| pooled_vs_lcv_comparison | 0.49 | 13.90 | 1 | FALSE | rds |
| model_without_tube | 0.25 | 6.89 | 1 | FALSE | rds |
| lcv_results | 0.25 | 6.97 | 1 | FALSE | rds |
| pooled_cv_results | 0.25 | 6.93 | 1 | FALSE | rds |
| model_with_tube | 0.25 | 6.92 | 1 | FALSE | rds |
| within_country_cv_results | 0.24 | 6.99 | 1 | FALSE | rds |
| data_validation | 0.09 | 0.00 | 1 | FALSE | rds |
| biomarker_tube_prediction | 0.08 | 1.03 | 1 | FALSE | rds |
| concordant_tube_prediction | 0.07 | 0.68 | 1 | FALSE | rds |
| consistent_proteins_reverse_prediction | 0.06 | 0.51 | 1 | FALSE | rds |
| plot_pca | 0.05 | 52.99 | 1 | FALSE | rds |
| top_proteins_tube_prediction | 0.04 | 0.34 | 1 | FALSE | rds |
| all_visualizations | 0.04 | 61.17 | 1 | FALSE | rds |
| differential_pooled | 0.03 | 0.14 | 1 | FALSE | rds |
| raw_data | 0.02 | 33.36 | 1 | FALSE | rds |
| original_biomarkers | 0.02 | 0.00 | 1 | FALSE | rds |
7.4.3 Key Analysis Domains
Leave-Country-Out Cross-Validation
| Target | Time (min) | Size (MB) | Warnings | Format |
|---|---|---|---|---|
| lcv_results | 0.25 | 6.97 | 1 | rds |
| pooled_cv_analysis | NA | NA | 1 | NA |
| pooled_cv_results | 0.25 | 6.93 | 1 | rds |
| pooled_vs_lcv_comparison | 0.49 | 13.90 | 1 | rds |
Differential Expression Analysis
Show code
library(dplyr)
library(purrr)
meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")
# Filter and format
meta |>
filter(stringr::str_starts(name, "differential_|protein_concordance|stratified_vs_pooled")) |>
dplyr::mutate(
time_min = as.numeric(seconds) / 60,
size_mb = as.numeric(bytes) / 1e6,
warnings_count = purrr::map_int(warnings, function(w) {
if (is.null(w) || length(w) == 0) 0L else length(w)
})
) |>
dplyr::select(name, time_min, size_mb, warnings_count, format) |>
dplyr::arrange(name) |>
knitr::kable(
digits = 2,
col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Format")
)
| Target | Time (min) | Size (MB) | Warnings | Format |
|---|---|---|---|---|
| differential_analysis_italy | NA | NA | 1 | NA |
| differential_analysis_pooled | NA | NA | 1 | NA |
| differential_analysis_us | NA | NA | 1 | NA |
| differential_italy | 0.02 | 0.14 | 1 | rds |
| differential_pooled | 0.03 | 0.14 | 1 | rds |
| differential_us | 0.01 | 0.14 | 1 | rds |
| protein_concordance | 0.00 | 0.24 | 1 | rds |
| stratified_vs_pooled_comparison | 0.00 | 0.00 | 1 | rds |
Machine Learning Evaluation
Show code
library(dplyr)
library(purrr)
meta <- targets::tar_meta(fields = c("name", "seconds", "bytes", "warnings", "error", "format"), store = "../_targets")
# Filter and format
meta |>
filter(stringr::str_starts(name, "model_|reverse_|tube_type_|biomarker_")) |>
dplyr::mutate(
time_min = as.numeric(seconds) / 60,
size_mb = as.numeric(bytes) / 1e6,
warnings_count = purrr::map_int(warnings, function(w) {
if (is.null(w) || length(w) == 0) 0L else length(w)
})
) |>
dplyr::select(name, time_min, size_mb, warnings_count, format) |>
dplyr::arrange(dplyr::desc(time_min)) |>
knitr::kable(
digits = 2,
col.names = c("Target", "Time (min)", "Size (MB)", "Warnings", "Format")
)
| Target | Time (min) | Size (MB) | Warnings | Format |
|---|---|---|---|---|
| reverse_prediction_results | 1.65 | 6.68 | 1 | rds |
| tube_type_contribution | 0.50 | 13.81 | 1 | rds |
| model_without_tube | 0.25 | 6.89 | 1 | rds |
| model_with_tube | 0.25 | 6.92 | 1 | rds |
| biomarker_tube_prediction | 0.08 | 1.03 | 1 | rds |
| tube_type_colors_viridis | NA | NA | 1 | NA |
| reverse_prediction_test_robust_only | NA | NA | 1 | NA |
| reverse_prediction_test | NA | NA | 1 | NA |
7.4.4 Reproducibility Information
Show code
# Get git information inline
git_commit <- tryCatch(
system("git rev-parse HEAD", intern = TRUE),
error = function(e) "unknown"
)
git_status <- tryCatch(
system("git status --short", intern = TRUE),
error = function(e) character(0)
)
git_clean <- length(git_status) == 0
cat("**Git commit:**", git_commit, "\n\n")
**Git commit:** 7a3d97e3a157a917d6157b95363bfc4dbe05dd52
Show code
cat("**Repository clean:**", git_clean, "\n\n")
**Repository clean:** FALSE
Show code
cat("**R version:**", paste(R.version$major, R.version$minor, sep = "."), "\n\n")
**R version:** 4.4.2
Show code
cat("**Platform:**", R.version$platform, "\n\n")
**Platform:** aarch64-apple-darwin20
Show code
cat("**Report generated:**", as.character(Sys.time()), "\n\n")
**Report generated:** 2025-10-27 16:42:47.177279
Show code
cat("**Hardware:**", "Apple MacBook Pro M4 Pro, 24GB RAM", "\n\n")
**Hardware:** Apple MacBook Pro M4 Pro, 24GB RAM
Show code
cat("**Operating System:**", "macOS 26.0.1 (25A362)", "\n\n")
**Operating System:** macOS 26.0.1 (25A362)
Key Package Versions
Show code
library(dplyr)
sessioninfo::package_info(c("targets", "tarchetypes", "caret", "limma", "pROC", "ggplot2", "dplyr", "tidyr")) |>
as.data.frame() |>
select(package, loadedversion, date, source) |>
knitr::kable()
| | package | loadedversion | date | source |
|---|---|---|---|---|
| backports | backports | 1.5.0 | 2024-05-23 | CRAN (R 4.4.0) |
| base64url | base64url | 1.4 | 2018-05-14 | CRAN (R 4.4.1) |
| callr | callr | 3.7.6 | 2024-03-25 | CRAN (R 4.4.0) |
| caret | caret | 7.0-1 | 2024-12-10 | CRAN (R 4.4.1) |
| class | class | 7.3-23 | 2025-01-01 | CRAN (R 4.4.1) |
| cli | cli | 3.6.5 | 2025-04-23 | CRAN (R 4.4.1) |
| clock | clock | NA | 2025-03-21 | CRAN (R 4.4.1) |
| codetools | codetools | 0.2-20 | 2024-03-31 | CRAN (R 4.4.2) |
| cpp11 | cpp11 | NA | 2025-03-03 | CRAN (R 4.4.1) |
| data.table | data.table | 1.17.8 | 2025-07-10 | CRAN (R 4.4.1) |
| diagram | diagram | NA | 2020-09-30 | CRAN (R 4.4.1) |
| digest | digest | 0.6.37 | 2024-08-19 | CRAN (R 4.4.1) |
| dplyr | dplyr | 1.1.4 | 2023-11-17 | CRAN (R 4.4.0) |
| e1071 | e1071 | NA | 2024-09-16 | CRAN (R 4.4.1) |
| evaluate | evaluate | 1.0.5 | 2025-08-27 | CRAN (R 4.4.1) |
| farver | farver | 2.1.2 | 2024-05-13 | CRAN (R 4.4.0) |
| foreach | foreach | 1.5.2 | 2022-02-02 | CRAN (R 4.4.0) |
| fs | fs | NA | 2025-04-12 | CRAN (R 4.4.1) |
| future | future | 1.67.0 | 2025-07-29 | CRAN (R 4.4.1) |
| future.apply | future.apply | 1.20.0 | 2025-06-06 | CRAN (R 4.4.1) |
| generics | generics | 0.1.4 | 2025-05-09 | CRAN (R 4.4.1) |
| ggplot2 | ggplot2 | 4.0.0 | 2025-09-11 | CRAN (R 4.4.1) |
| globals | globals | 0.18.0 | 2025-05-08 | CRAN (R 4.4.1) |
| glue | glue | 1.8.0 | 2024-09-30 | CRAN (R 4.4.1) |
| gower | gower | 1.0.2 | 2024-12-17 | CRAN (R 4.4.1) |
| gtable | gtable | 0.3.6 | 2024-10-25 | CRAN (R 4.4.1) |
| hardhat | hardhat | 1.4.2 | 2025-08-20 | CRAN (R 4.4.1) |
| highr | highr | NA | 2024-05-26 | CRAN (R 4.4.0) |
| igraph | igraph | 2.1.4 | 2025-01-23 | CRAN (R 4.4.1) |
| ipred | ipred | 0.9-15 | 2024-07-18 | CRAN (R 4.4.0) |
| isoband | isoband | NA | 2022-12-20 | CRAN (R 4.4.0) |
| iterators | iterators | 1.0.14 | 2022-02-05 | CRAN (R 4.4.1) |
| KernSmooth | KernSmooth | NA | 2025-01-01 | CRAN (R 4.4.1) |
| knitr | knitr | 1.50 | 2025-03-16 | CRAN (R 4.4.1) |
| labeling | labeling | 0.4.3 | 2023-08-29 | CRAN (R 4.4.0) |
| lattice | lattice | 0.22-7 | 2025-04-02 | CRAN (R 4.4.1) |
| lava | lava | 1.8.1 | 2025-01-12 | CRAN (R 4.4.1) |
| lifecycle | lifecycle | 1.0.4 | 2023-11-07 | CRAN (R 4.4.0) |
| limma | limma | NA | 2025-01-09 | Bioconductor 3.20 (R 4.4.2) |
| listenv | listenv | 0.9.1 | 2024-01-29 | CRAN (R 4.4.0) |
| lubridate | lubridate | 1.9.4 | 2024-12-08 | CRAN (R 4.4.1) |
| magrittr | magrittr | 2.0.4 | 2025-09-12 | CRAN (R 4.4.1) |
| MASS | MASS | 7.3-65 | 2025-02-28 | CRAN (R 4.4.1) |
| Matrix | Matrix | 1.7-3 | 2025-03-11 | CRAN (R 4.4.1) |
| ModelMetrics | ModelMetrics | 1.2.2.2 | 2020-03-17 | CRAN (R 4.4.1) |
| nlme | nlme | 3.1-168 | 2025-03-31 | CRAN (R 4.4.1) |
| nnet | nnet | 7.3-20 | 2025-01-01 | CRAN (R 4.4.1) |
| numDeriv | numDeriv | NA | 2019-06-06 | CRAN (R 4.4.1) |
| parallelly | parallelly | 1.45.1 | 2025-07-24 | CRAN (R 4.4.1) |
| pillar | pillar | 1.11.1 | 2025-09-17 | CRAN (R 4.4.1) |
| pkgconfig | pkgconfig | 2.0.3 | 2019-09-22 | CRAN (R 4.4.0) |
| plyr | plyr | 1.8.9 | 2023-10-02 | CRAN (R 4.4.0) |
| prettyunits | prettyunits | 1.2.0 | 2023-09-24 | CRAN (R 4.4.0) |
| pROC | pROC | 1.19.0.1 | 2025-07-31 | CRAN (R 4.4.1) |
| processx | processx | 3.8.6 | 2025-02-21 | CRAN (R 4.4.1) |
| prodlim | prodlim | 2025.04.28 | 2025-04-28 | CRAN (R 4.4.1) |
| progressr | progressr | NA | 2025-09-19 | CRAN (R 4.4.1) |
| proxy | proxy | NA | 2022-06-09 | CRAN (R 4.4.1) |
| ps | ps | 1.9.1 | 2025-04-12 | CRAN (R 4.4.1) |
| purrr | purrr | 1.1.0 | 2025-07-10 | CRAN (R 4.4.1) |
| R6 | R6 | 2.6.1 | 2025-02-15 | CRAN (R 4.4.1) |
| RColorBrewer | RColorBrewer | 1.1-3 | 2022-04-03 | CRAN (R 4.4.0) |
| Rcpp | Rcpp | 1.1.0 | 2025-07-02 | CRAN (R 4.4.1) |
| recipes | recipes | 1.3.1 | 2025-05-21 | CRAN (R 4.4.1) |
| reshape2 | reshape2 | 1.4.4 | 2020-04-09 | CRAN (R 4.4.0) |
| rlang | rlang | 1.1.6 | 2025-04-11 | CRAN (R 4.4.1) |
| rpart | rpart | 4.1.24 | 2025-01-07 | CRAN (R 4.4.1) |
| S7 | S7 | 0.2.0 | 2024-11-07 | CRAN (R 4.4.1) |
| scales | scales | 1.4.0 | 2025-04-24 | CRAN (R 4.4.1) |
| secretbase | secretbase | 1.0.5 | 2025-03-04 | CRAN (R 4.4.1) |
| shape | shape | NA | 2024-02-23 | CRAN (R 4.4.1) |
| sparsevctrs | sparsevctrs | NA | 2025-05-25 | CRAN (R 4.4.1) |
| SQUAREM | SQUAREM | NA | 2021-01-13 | CRAN (R 4.4.1) |
| statmod | statmod | NA | 2025-10-09 | CRAN (R 4.4.1) |
| stringi | stringi | 1.8.7 | 2025-03-27 | CRAN (R 4.4.1) |
| stringr | stringr | 1.5.2 | 2025-09-08 | CRAN (R 4.4.1) |
| survival | survival | 3.8-3 | 2024-12-17 | CRAN (R 4.4.1) |
| tarchetypes | tarchetypes | NA | 2025-09-13 | CRAN (R 4.4.1) |
| targets | targets | 1.11.4 | 2025-09-13 | CRAN (R 4.4.1) |
| tibble | tibble | 3.3.0 | 2025-06-08 | CRAN (R 4.4.1) |
| tidyr | tidyr | 1.3.1 | 2024-01-24 | CRAN (R 4.4.1) |
| tidyselect | tidyselect | 1.2.1 | 2024-03-11 | CRAN (R 4.4.0) |
| timechange | timechange | 0.3.0 | 2024-01-18 | CRAN (R 4.4.0) |
| timeDate | timeDate | 4041.110 | 2024-09-22 | CRAN (R 4.4.1) |
| tzdb | tzdb | 0.5.0 | 2025-03-15 | CRAN (R 4.4.1) |
| utf8 | utf8 | NA | 2025-06-08 | CRAN (R 4.4.1) |
| vctrs | vctrs | 0.6.5 | 2023-12-01 | CRAN (R 4.4.0) |
| viridisLite | viridisLite | 0.4.2 | 2023-05-02 | CRAN (R 4.4.0) |
| withr | withr | 3.0.2 | 2024-10-28 | CRAN (R 4.4.1) |
| xfun | xfun | 0.53 | 2025-08-19 | CRAN (R 4.4.1) |
| yaml | yaml | 2.3.10 | 2024-07-26 | CRAN (R 4.4.0) |
Complete Session Information
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.2 (2024-10-31)
os macOS 26.0.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2025-10-27
pandoc 3.6.4 @ /usr/local/bin/ (via rmarkdown)
quarto 1.8.25 @ /usr/local/bin/quarto
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P backports 1.5.0 2024-05-23 [?] CRAN (R 4.4.0)
P base64url 1.4 2018-05-14 [?] CRAN (R 4.4.1)
P BiocManager 1.30.26 2025-06-05 [?] CRAN (R 4.4.1)
P callr 3.7.6 2024-03-25 [?] CRAN (R 4.4.0)
P caret 7.0-1 2024-12-10 [?] CRAN (R 4.4.1)
P class 7.3-23 2025-01-01 [?] CRAN (R 4.4.1)
P cli 3.6.5 2025-04-23 [?] CRAN (R 4.4.1)
P codetools 0.2-20 2024-03-31 [?] CRAN (R 4.4.2)
P commonmark 2.0.0 2025-07-07 [?] CRAN (R 4.4.1)
P data.table 1.17.8 2025-07-10 [?] CRAN (R 4.4.1)
P digest 0.6.37 2024-08-19 [?] CRAN (R 4.4.1)
P dplyr * 1.1.4 2023-11-17 [?] CRAN (R 4.4.0)
P evaluate 1.0.5 2025-08-27 [?] CRAN (R 4.4.1)
P farver 2.1.2 2024-05-13 [?] CRAN (R 4.4.0)
P fastmap 1.2.0 2024-05-15 [?] CRAN (R 4.4.0)
P forcats * 1.0.1 2025-09-25 [?] CRAN (R 4.4.1)
P foreach 1.5.2 2022-02-02 [?] CRAN (R 4.4.0)
P future 1.67.0 2025-07-29 [?] CRAN (R 4.4.1)
P future.apply 1.20.0 2025-06-06 [?] CRAN (R 4.4.1)
P generics 0.1.4 2025-05-09 [?] CRAN (R 4.4.1)
P ggplot2 * 4.0.0 2025-09-11 [?] CRAN (R 4.4.1)
P ggrepel * 0.9.6 2024-09-07 [?] CRAN (R 4.4.1)
P ggtext 0.1.2 2022-09-16 [?] CRAN (R 4.4.0)
P ggvenn * 0.1.19 2025-10-08 [?] CRAN (R 4.4.1)
P globals 0.18.0 2025-05-08 [?] CRAN (R 4.4.1)
P glue 1.8.0 2024-09-30 [?] CRAN (R 4.4.1)
P gower 1.0.2 2024-12-17 [?] CRAN (R 4.4.1)
P gridtext 0.1.5 2022-09-16 [?] CRAN (R 4.4.0)
P gtable 0.3.6 2024-10-25 [?] CRAN (R 4.4.1)
P hardhat 1.4.2 2025-08-20 [?] CRAN (R 4.4.1)
P here * 1.0.2 2025-09-15 [?] CRAN (R 4.4.1)
P hms 1.1.3 2023-03-21 [?] CRAN (R 4.4.0)
P htmltools 0.5.8.1 2024-04-04 [?] CRAN (R 4.4.0)
P htmlwidgets 1.6.4 2023-12-06 [?] CRAN (R 4.4.0)
P igraph 2.1.4 2025-01-23 [?] CRAN (R 4.4.1)
P ipred 0.9-15 2024-07-18 [?] CRAN (R 4.4.0)
P iterators 1.0.14 2022-02-05 [?] CRAN (R 4.4.1)
P jsonlite 2.0.0 2025-03-27 [?] CRAN (R 4.4.1)
P knitr * 1.50 2025-03-16 [?] CRAN (R 4.4.1)
P labeling 0.4.3 2023-08-29 [?] CRAN (R 4.4.0)
P lattice 0.22-7 2025-04-02 [?] CRAN (R 4.4.1)
P lava 1.8.1 2025-01-12 [?] CRAN (R 4.4.1)
P lifecycle 1.0.4 2023-11-07 [?] CRAN (R 4.4.0)
P listenv 0.9.1 2024-01-29 [?] CRAN (R 4.4.0)
P litedown 0.7 2025-04-08 [?] CRAN (R 4.4.1)
P lubridate * 1.9.4 2024-12-08 [?] CRAN (R 4.4.1)
P magrittr 2.0.4 2025-09-12 [?] CRAN (R 4.4.1)
P markdown 2.0 2025-03-23 [?] CRAN (R 4.4.1)
P MASS 7.3-65 2025-02-28 [?] CRAN (R 4.4.1)
P Matrix 1.7-3 2025-03-11 [?] CRAN (R 4.4.1)
P mgcv 1.9-3 2025-04-04 [?] CRAN (R 4.4.1)
P ModelMetrics 1.2.2.2 2020-03-17 [?] CRAN (R 4.4.1)
P nlme 3.1-168 2025-03-31 [?] CRAN (R 4.4.1)
P nnet 7.3-20 2025-01-01 [?] CRAN (R 4.4.1)
P parallelly 1.45.1 2025-07-24 [?] CRAN (R 4.4.1)
P patchwork * 1.3.2 2025-08-25 [?] CRAN (R 4.4.1)
P pillar 1.11.1 2025-09-17 [?] CRAN (R 4.4.1)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.4.0)
P plyr 1.8.9 2023-10-02 [?] CRAN (R 4.4.0)
P prettyunits 1.2.0 2023-09-24 [?] CRAN (R 4.4.0)
P pROC * 1.19.0.1 2025-07-31 [?] CRAN (R 4.4.1)
P processx 3.8.6 2025-02-21 [?] CRAN (R 4.4.1)
P prodlim 2025.04.28 2025-04-28 [?] CRAN (R 4.4.1)
P ps 1.9.1 2025-04-12 [?] CRAN (R 4.4.1)
P purrr * 1.1.0 2025-07-10 [?] CRAN (R 4.4.1)
P R6 2.6.1 2025-02-15 [?] CRAN (R 4.4.1)
P ragg 1.5.0 2025-09-02 [?] CRAN (R 4.4.1)
P RColorBrewer 1.1-3 2022-04-03 [?] CRAN (R 4.4.0)
P Rcpp 1.1.0 2025-07-02 [?] CRAN (R 4.4.1)
P readr * 2.1.5 2024-01-10 [?] CRAN (R 4.4.0)
P recipes 1.3.1 2025-05-21 [?] CRAN (R 4.4.1)
renv 1.1.4 2025-03-20 [1] CRAN (R 4.4.1)
P reshape2 1.4.4 2020-04-09 [?] CRAN (R 4.4.0)
P rlang 1.1.6 2025-04-11 [?] CRAN (R 4.4.1)
P rmarkdown 2.30 2025-09-28 [?] CRAN (R 4.4.1)
P rpart 4.1.24 2025-01-07 [?] CRAN (R 4.4.1)
P rprojroot 2.1.1 2025-08-26 [?] CRAN (R 4.4.1)
P rstudioapi 0.17.1 2024-10-22 [?] CRAN (R 4.4.1)
P S7 0.2.0 2024-11-07 [?] CRAN (R 4.4.1)
P scales 1.4.0 2025-04-24 [?] CRAN (R 4.4.1)
P secretbase 1.0.5 2025-03-04 [?] CRAN (R 4.4.1)
P sessioninfo 1.2.3 2025-02-05 [?] CRAN (R 4.4.1)
P stringi 1.8.7 2025-03-27 [?] CRAN (R 4.4.1)
P stringr * 1.5.2 2025-09-08 [?] CRAN (R 4.4.1)
P survival 3.8-3 2024-12-17 [?] CRAN (R 4.4.1)
P systemfonts 1.3.1 2025-10-01 [?] CRAN (R 4.4.1)
P targets * 1.11.4 2025-09-13 [?] CRAN (R 4.4.1)
P textshaping 1.0.4 2025-10-10 [?] CRAN (R 4.4.1)
P tibble * 3.3.0 2025-06-08 [?] CRAN (R 4.4.1)
P tidyr * 1.3.1 2024-01-24 [?] CRAN (R 4.4.1)
P tidyselect 1.2.1 2024-03-11 [?] CRAN (R 4.4.0)
P tidyverse * 2.0.0 2023-02-22 [?] CRAN (R 4.4.0)
P timechange 0.3.0 2024-01-18 [?] CRAN (R 4.4.0)
P timeDate 4041.110 2024-09-22 [?] CRAN (R 4.4.1)
P tzdb 0.5.0 2025-03-15 [?] CRAN (R 4.4.1)
P vctrs 0.6.5 2023-12-01 [?] CRAN (R 4.4.0)
P viridisLite * 0.4.2 2023-05-02 [?] CRAN (R 4.4.0)
P visNetwork * 2.1.4 2025-09-04 [?] CRAN (R 4.4.1)
P withr 3.0.2 2024-10-28 [?] CRAN (R 4.4.1)
P xfun 0.53 2025-08-19 [?] CRAN (R 4.4.1)
P xml2 1.4.0 2025-08-20 [?] CRAN (R 4.4.1)
P yaml 2.3.10 2024-07-26 [?] CRAN (R 4.4.0)
[1] /Users/biostochastics/chia_etal_2025_als/renv/library/macos/R-4.4/aarch64-apple-darwin20
[2] /Users/biostochastics/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815
* ── Packages attached to the search path.
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
List of Figures
- Figure 1: Pipeline dependency graph
- Figure 2: Conceptual diagram of confounding structure
- Figure 3: PCA visualization showing protein expression variance driven by tube type
- Figure 4: ROC curves for predicting tube type from three protein sets
- Figure 5: Leave-Country-Out cross-validation ROC curves
- Figure 6: Performance comparison: Pooled vs Country-stratified vs Within-country cross-validation
- Figure 7: Distribution of significant proteins by cohort
- Figure 8: Venn diagram showing overlap of significant proteins
- Figure 9: Correlation of effect sizes between Italy and US cohorts
- Figure 10: Geographic consistency analysis
- Figure 11: Tube type effects in healthy controls
- Figure 14: Decomposition of observed effects
List of Tables
- Table 1: Confounding analysis summary
- Table 2: Top 10 proteins most predictive of tube type
- Table 3: Leave-Country-Out cross-validation performance
- Table 4: Summary of differential expression analysis
- Table 5: Concordant proteins between cohorts
- Table 6: Tube type effects in healthy controls
- Table 7: Summary of effect decomposition
This entire analysis pipeline can be reproduced by running:
# Install dependencies (if needed)
renv::restore()
# Run complete pipeline
targets::tar_make()
# Render report
quarto::quarto_render("reports/confounding_investigation.qmd")

All pipeline targets are cached using the targets package. Only changed dependencies trigger recomputation, ensuring efficient incremental updates.
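The incremental caching comes from targets' dependency tracking: every target's code and upstream inputs are hashed, and `tar_make()` skips any target whose hashes are unchanged. As a minimal sketch of how such a pipeline is declared (hypothetical target names and file paths for illustration only; the actual `_targets.R` in this repository defines many more targets):

```r
# _targets.R -- minimal illustrative sketch, NOT the actual pipeline
library(targets)

list(
  # File target: downstream targets rebuild only if the file's content changes
  tar_target(raw_data_file, "data/proteomics.csv", format = "file"),

  # Each target is hashed on its command and upstream dependencies;
  # unchanged code + unchanged inputs = cached result is reused
  tar_target(raw_data, read.csv(raw_data_file)),
  tar_target(de_results, run_limma(raw_data))  # run_limma(): hypothetical helper
)
```

Running `targets::tar_outdated()` before `tar_make()` lists which targets would be recomputed, which is how one verifies that an edit to, say, the modeling code leaves the upstream data-loading targets cached.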
Biostochastics, LLC | GitHub | © 2025