flowchart TD
subgraph Input
A[Raw LC-MS Data<br/>mzML/mzXML/netCDF]
end
subgraph XCMS["xcms Processing Pipeline"]
B[1. Peak Detection<br/>findChromPeaks<br/>CentWaveParam] --> C[2. RT Correction<br/>adjustRtime<br/>ObiwarpParam]
C --> D[3. Feature Grouping<br/>groupChromPeaks<br/>PeakDensityParam]
D --> E[4. Feature Matrix<br/>featureValues]
E --> F[5. Gap Filling<br/>fillChromPeaks<br/>ChromPeakAreaParam]
end
subgraph Annotation["Feature Annotation"]
F --> G[CAMERA<br/>xsAnnotate]
G --> H[Adduct/Isotope Grouping<br/>findIsotopes/findAdducts]
H --> I[MetaboCoreUtils<br/>Mass Calculations]
end
subgraph Output["Feature Matrix"]
I --> J[Deconvoluted Feature Table]
end
subgraph Stats
J --> K[Quality Control<br/>CV, Missing Values]
K --> L[Normalization<br/>Log, PQN, etc.]
L --> M[Scaling<br/>Auto, Pareto, etc.]
M --> N[Univariate<br/>t-test, ANOVA]
N --> O[Pathway Analysis<br/>Enrichment]
end
A --> B
style Input fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
style XCMS fill:#FBE0FA,stroke:#B000B0,stroke-width:3px,color:#102A43
style Annotation fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
style Output fill:#FBE0FA,stroke:#B000B0,stroke-width:3px,color:#102A43
style Stats fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
9 Metabolomics Data Analysis
Metabolomics involves the comprehensive analysis of small molecules (metabolites) in biological systems. This chapter covers LC-MS metabolomics data processing using the R for Mass Spectrometry ecosystem, with emphasis on the xcms package, Spectra integration, and downstream statistical and biological interpretation.
9.1 Conceptual Overview of LC-MS Metabolomics
9.1.1 Overview of the Metabolomics Pipeline
Mass Spectrometry (MS)-based metabolomics aims to identify and quantify the complete set of small molecules (the metabolome) within a biological system.[1, 2] The data analysis workflow is a complex, multi-stage process that transforms raw instrument data into actionable biological knowledge. While specific tools and platforms vary, the canonical workflow is highly conserved across the field.[3] This pipeline, which applies to both Liquid Chromatography (LC)-MS and Gas Chromatography (GC)-MS data, can be summarized in the following sequential steps:
- Experimental Design and Sample Collection: A non-computational but critical planning stage that includes defining biological groups, determining sample size, and establishing standardized operating procedures (SOPs) for sample collection, processing, and metabolite extraction.[4–6]
- Raw Data Acquisition: Samples are analyzed using LC-MS or GC-MS to separate metabolites and detect them based on their mass-to-charge ratio (m/z) and retention time.[1, 3]
- Raw Data Preprocessing: The first major computational step. Signals are extracted from raw files using noise reduction, peak detection (peak picking), and retention time alignment.[3, 7, 8]
- Data Normalization and Scaling: Processed data are adjusted to correct for systematic, non-biological variation between samples, such as differences in sample loading or instrument sensitivity.[1, 3]
- Statistical Analysis: Clean, normalized data are subjected to univariate (e.g., t-tests, ANOVA) and multivariate (e.g., PCA, PLS-DA) methods to identify patterns and significantly different features between groups.[3, 6, 9]
- Metabolite Identification and Annotation: Statistically significant features (defined by m/z and retention time) are matched against spectral databases (e.g., HMDB, METLIN, KEGG) to determine chemical identity.[3, 7]
- Biological Interpretation: The final list of identified metabolites is mapped to metabolic pathways and networks (pathway analysis) to provide systems-level biological context.[1, 3, 6]
Each step is a discrete module whose output serves as the input for the next, forming a chain of analytical dependencies.
9.1.2 The R-Based Ecosystem vs. “Black Box” Solutions
Researchers have two primary avenues for executing this workflow: user-friendly GUI/web platforms (e.g., XCMS Online, MetaboAnalyst web server)[10, 11] or a programmable, modular workflow built within R.[12]
Web-based tools are accessible but often act as “black boxes,” obscuring parameter settings and limiting flexibility.[10] In contrast, an R-based pipeline:
- Is open, reproducible, and auditable.[12]
- Provides granular control over every parameter, which is critical because metabolomics results are highly sensitive to parameter choices.[15]
- Produces a runnable script that ensures reproducibility, a non-negotiable standard for high-impact publications.
This chapter focuses on constructing a programmatic R workflow centered on a core stack of Bioconductor packages:
xcms for preprocessing, CAMERA for feature annotation, and MetaboAnalystR (and related tools) for statistical analysis and pathway interpretation.[16–18]
9.1.3 The Propagation of Error: A Core Challenge
The metabolomics pipeline is best understood as a sequence of dependencies, like a house of cards. An error introduced at any step—such as improper peak picking—will not be corrected downstream; instead, it will be propagated and amplified, leading to invalid conclusions.[15, 19]
Examples of this cascade:
- If peak detection parameters are mis-specified, noise can be identified as features, or low-abundance signals can be missed.[15]
- If retention time alignment fails, the same metabolite in different samples may be treated as different compounds.
- If feature annotation (e.g., via CAMERA) is skipped, a single metabolite (e.g., glucose) appearing as multiple adducts and isotopes may be counted as multiple independent “significant” features, inflating false positives.[19]
- If this flawed feature list is used in pathway analysis, the resulting biological interpretation is based on noise.
The R modules introduced below are explicitly designed to mitigate and correct issues introduced in previous stages.
9.1.4 Key R Packages in the Metabolomics Workflow
| Package Name | Primary Function | Repository | Key References |
|---|---|---|---|
| xcms | Raw data preprocessing: peak detection, RT alignment, grouping | Bioconductor | [17, 20] |
| CAMERA | Feature deconvolution and annotation (adducts, isotopes, fragments) | Bioconductor | [18, 21, 22] |
| MetaboAnalystR | Statistical analysis, normalization, pathway analysis | GitHub | [16, 23] |
| mzrtsim | Simulated data generation (ground-truth LC-MS data) | GitHub | [24, 25] |
9.2 Setting Up the Metabolomics Environment in R
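The core stack is installed from Bioconductor, with MetaboAnalystR coming from its GitHub repository. A minimal setup sketch (package selection follows the chapter; exact dependencies may vary by R version):

```r
## Install the Bioconductor packages used throughout this chapter
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("xcms", "CAMERA", "Spectra", "MsExperiment"))

## MetaboAnalystR is distributed via GitHub, not Bioconductor
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("xia-lab/MetaboAnalystR")

library(xcms)
library(MsExperiment)
```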
9.3 Visual Overview of the LC-MS Workflow
A typical untargeted metabolomics workflow consists of:
- Peak detection – Identifying m/z–RT features across samples.
- Retention time alignment – Correcting RT drift between samples.
- Correspondence – Matching features across samples to form a feature matrix.
- Gap filling – Integrating missing peaks.
- Annotation – Identifying metabolites or compound groups.
- Statistical analysis – Finding differential metabolites and patterns.
- Pathway analysis – Mapping metabolites to biological pathways.
XCMS Workflow Key Parameters
- CentWave: ppm (5–25), peakwidth (5–30 sec), snthresh (3–10)
- Alignment: binSize (e.g., 1), other alignment parameters in ObiwarpParam
- Grouping: bw (1–30), minFraction (0.3–0.7 or higher), sampleGroups
- Gap Filling: expandMz and expandRt define integration windows
| Step | Process | Package | Output |
|---|---|---|---|
| 1 | Peak Detection | xcms | Chromatographic peaks |
| 2 | Retention Time Correction | xcms | Aligned peaks |
| 3 | Correspondence | xcms | Feature matrix |
| 4 | Gap Filling | xcms | Complete matrix |
| 5 | Annotation | CAMERA/CompoundDb | Putative IDs |
| 6 | Statistical Analysis | limma/mixOmics | Differential metabolites |
9.4 Preprocessing LC-MS Data with xcms
9.4.1 Data Import into R Objects
Modern mass spectrometers export data in formats such as mzML, mzXML, or netCDF.[17, 20] The RforMassSpectrometry ecosystem uses Spectra and MsExperiment for flexible and modern MS data handling.
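A minimal import sketch using MsExperiment::readMsExperiment; the file names and sample annotations below are placeholders:

```r
library(MsExperiment)

## Placeholder raw files plus per-sample metadata
files <- c("sample_ctrl_1.mzML", "sample_case_1.mzML")
pd <- data.frame(sample_name = c("ctrl_1", "case_1"),
                 group = c("control", "case"))

## Import: spectra are read on demand via an on-disk backend,
## so even large experiments stay memory-friendly
mse <- readMsExperiment(spectraFiles = files, sampleData = pd)
mse
```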
9.4.2 Chromatographic Peak Detection
Peak detection identifies chromatographic peaks (ions at specific m/z and RT) and discriminates them from noise.[7, 8] The CentWave algorithm is standard for high-resolution data and is implemented via CentWaveParam and findChromPeaks.[18, 27]
Key parameters:
- peakwidth – expected RT width; too narrow splits peaks, too wide merges them.
- ppm – mass accuracy; must match instrument performance.
- snthresh – signal-to-noise threshold; too low → noise, too high → missing true signals.
- prefilter – discards regions without enough signal.[20]
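A peak detection sketch with CentWave; the parameter values are illustrative starting points, not recommendations for any particular instrument, and `mse` is assumed to be an imported MsExperiment object:

```r
library(xcms)

## Illustrative CentWave settings: tune ppm to instrument accuracy
## and peakwidth to the observed peak widths (in seconds)
cwp <- CentWaveParam(ppm = 20,
                     peakwidth = c(5, 30),
                     snthresh = 5,
                     prefilter = c(3, 1000))

xdata <- findChromPeaks(mse, param = cwp)
head(chromPeaks(xdata))   # one row per detected chromatographic peak
```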
9.4.3 Retention Time Alignment
Retention time drift arises from changes in column pressure, temperature, or solvent composition.[7, 8, 28] It must be corrected before samples can be compared.
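Alignment with the Obiwarp method can be sketched as follows; `xdata` is assumed to hold detected peaks, and the `binSize` value is illustrative:

```r
## Obiwarp warps each sample's RT axis onto a reference sample;
## binSize sets the m/z bin width of the profile matrix it uses
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 1))

## Inspect the adjusted retention times
head(adjustedRtime(xdata))
```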
9.4.4 Feature Grouping (Correspondence)
After peak detection and RT correction, features are grouped across samples to form a consistent feature matrix.[8, 20, 29]
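A correspondence sketch using the peak density method; `sampleGroups` must list one group label per sample, and the `bw`/`minFraction` values are illustrative:

```r
## Group peaks across samples by their density along the RT axis
pdp <- PeakDensityParam(sampleGroups = c("control", "case"),
                        bw = 5,
                        minFraction = 0.5)
xdata <- groupChromPeaks(xdata, param = pdp)

## One row per feature (m/z-RT pair matched across samples)
featureDefinitions(xdata)
```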
9.4.5 Filling Missing Peaks
Optional gap filling integrates signal for peaks that fell below the detection threshold in some samples.[8, 29]
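Gap filling and extraction of the final matrix can be sketched as:

```r
## Integrate raw signal in each feature's m/z-RT region for samples
## where no peak was detected, then extract the feature matrix
xdata <- fillChromPeaks(xdata, param = ChromPeakAreaParam())

fmat <- featureValues(xdata, value = "into")   # features x samples
head(fmat)
```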
9.5 Simulated Metabolomics Dataset for Demonstration
To demonstrate downstream steps (QC, normalization, PCA, pathway analysis), we construct a simulated dataset with a true biological signal in one pathway (Glycolysis).
Synthetic dataset: 20 samples and 50 metabolites
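One way to build such a dataset; all names and parameters below are illustrative, chosen to match the description (20 samples, 50 metabolites, a 3-fold change in a Glycolysis subset):

```r
set.seed(42)
n_samp <- 20
n_met  <- 50
group  <- rep(c("control", "case"), each = n_samp / 2)

## Log-normal baseline intensities, features in rows
base <- matrix(rlnorm(n_met * n_samp, meanlog = 10, sdlog = 0.3),
               nrow = n_met,
               dimnames = list(paste0("met_", seq_len(n_met)),
                               paste0("samp_", seq_len(n_samp))))

## Spike a 3-fold increase into the first five "Glycolysis"
## metabolites of the case group
glyco <- 1:5
base[glyco, group == "case"] <- base[glyco, group == "case"] * 3
```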
9.6 Quality Assessment and Feature Filtering
9.6.1 Peak Quality Assessment

9.6.2 Peak Filtering
Filtering results:
CV filter: 24 passed
Detection rate filter: 50 passed
Intensity filter: 9 passed
Combined filter: 3 passed
total_features passed_filter filter_rate
1 50 3 6
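Filters of this kind can be sketched as below; the thresholds are illustrative, and `base` and `group` denote the simulated intensity matrix and sample labels:

```r
## Coefficient of variation per feature, across all samples
cv <- apply(base, 1, function(x) sd(x) / mean(x) * 100)

## Illustrative thresholds for the three filters
keep_cv  <- cv < 30                                         # CV filter
keep_det <- rowMeans(base > 0) >= 0.8                       # detection rate
keep_int <- rowMeans(base) > quantile(rowMeans(base), 0.8)  # intensity

keep <- keep_cv & keep_det & keep_int
data.frame(total_features = nrow(base),
           passed_filter  = sum(keep),
           filter_rate    = round(100 * sum(keep) / nrow(base)))
```

Note that filter_rate is reported as a percentage of all features.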
9.7 Normalization and Scaling
9.7.1 Normalization Methods
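Two common methods can be sketched directly in base R; `base` denotes an intensity matrix with features in rows:

```r
## Log transformation stabilizes variance
logX <- log2(base + 1)

## Probabilistic Quotient Normalization (PQN): divide each sample by
## the median of its feature-wise quotients to a reference spectrum
ref  <- apply(base, 1, median)            # reference = median spectrum
quot <- sweep(base, 1, ref, "/")          # quotients per feature
pqnX <- sweep(base, 2, apply(quot, 2, median), "/")
```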

9.7.2 Scaling Methods
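Autoscaling and Pareto scaling operate feature-wise; a self-contained sketch on a stand-in log-transformed matrix:

```r
set.seed(1)
X <- matrix(rnorm(50 * 20), nrow = 50)   # stand-in matrix, features in rows

## Autoscaling: mean 0, unit variance per feature
auto <- t(scale(t(X)))

## Pareto scaling: divide by sqrt(sd), damping dominance of
## high-variance features less aggressively than autoscaling
pareto <- t(scale(t(X), scale = sqrt(apply(X, 1, sd))))
```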

9.8 Multivariate Analysis
9.8.1 Principal Component Analysis (PCA)
PCA provides an unsupervised view of the main variance sources. If samples separate mainly by batch, there is a technical problem; if they separate by biological group, the signal is promising.
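A PCA sketch with base R; `base` and `group` denote the simulated matrix and sample labels:

```r
## PCA on log-transformed, autoscaled data; prcomp expects samples
## in rows, hence the transpose
pca <- prcomp(t(log2(base + 1)), center = TRUE, scale. = TRUE)

## Variance explained by the first two components
summary(pca)$importance[, 1:2]

## Score plot colored by group: batch-driven clustering signals a
## technical problem, group-driven clustering is promising
plot(pca$x[, 1], pca$x[, 2], col = factor(group),
     xlab = "PC1", ylab = "PC2")
```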

9.8.2 Partial Least Squares Discriminant Analysis (PLS-DA) and VIP
PLS-DA uses group labels to maximize separation but is prone to overfitting, so cross-validation and permutation tests are essential.
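The logic of a permutation test can be sketched generically: recompute a separation statistic under shuffled labels and compare it to the observed value. Here a simple between/within variance ratio on toy scores stands in for the PLS-DA model (a deliberate simplification; MetaboAnalystR and mixOmics provide dedicated routines):

```r
set.seed(7)
scores <- matrix(rnorm(20 * 2), ncol = 2)
scores[11:20, 1] <- scores[11:20, 1] + 2      # shift one group
labels <- rep(c("A", "B"), each = 10)

## Separation statistic: squared group-mean difference over the
## mean within-group variance, on the first score dimension
sep_stat <- function(s, lab) {
    m <- tapply(s[, 1], lab, mean)
    diff(range(m))^2 / mean(tapply(s[, 1], lab, var))
}

obs  <- sep_stat(scores, labels)
perm <- replicate(999, sep_stat(scores, sample(labels)))
mean(c(obs, perm) >= obs)   # empirical permutation p-value
```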


9.9 Metabolite Identification
9.9.1 Accurate Mass Matching
No matches found in database
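Accurate mass matching reduces to computing ppm differences between observed feature m/z values and theoretical adduct masses. A toy sketch; the database entries, observed masses, and tolerance are illustrative:

```r
## Toy database of [M+H]+ masses
db <- data.frame(name = c("Glucose", "Lactate", "Citrate"),
                 mz   = c(181.0707, 91.0390, 193.0343))

observed <- c(181.0712, 147.0532)
tol_ppm  <- 10

for (mz in observed) {
    ppm <- abs(db$mz - mz) / db$mz * 1e6
    hit <- which(ppm <= tol_ppm)
    if (length(hit))
        cat(sprintf("%.4f -> %s (%.1f ppm)\n", mz, db$name[hit], ppm[hit]))
    else
        cat(sprintf("%.4f -> no match\n", mz))
}
```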
9.9.2 Simulated MS/MS Spectrum

9.10 Pathway and Enrichment Analysis
9.10.1 Conceptual Background
Once significant features have putative metabolite IDs, pathway analysis tests whether these metabolites are overrepresented in particular biological pathways (e.g., Glycolysis, TCA cycle).[1, 3, 6, 9] The correct background set should be all detected metabolites in the experiment, not the entire KEGG database, to maintain proper statistical context.[19]
9.10.2 Simple Pathway Enrichment on the Simulated Dataset
First, identify significantly changing metabolites via t-tests with FDR correction:
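A sketch of this step; `base` and `group` denote the simulated matrix and sample labels, and the FDR cutoff follows the chapter:

```r
## Per-metabolite Welch t-tests between groups, BH-adjusted
pvals <- apply(base, 1, function(x)
    t.test(x[group == "case"], x[group == "control"])$p.value)
fdr <- p.adjust(pvals, method = "BH")
sig <- rownames(base)[fdr < 0.05]
length(sig)
```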
Number of significant metabolites (FDR < 0.05): 0
Define a toy pathway database:
Enrichment function using Fisher’s exact test:
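A minimal sketch of such a function, assuming `sig` is the vector of significant metabolite names, `detected` the background of all detected metabolites, and `pathways` a named list of metabolite sets:

```r
enrich_fisher <- function(sig, detected, pathways) {
    res <- lapply(names(pathways), function(p) {
        pw <- intersect(pathways[[p]], detected)
        a  <- sum(sig %in% pw)                     # significant, in pathway
        b  <- sum(!(sig %in% pw))                  # significant, not in pathway
        c_ <- sum(setdiff(detected, sig) %in% pw)  # not significant, in pathway
        d  <- length(detected) - a - b - c_        # neither
        ft <- fisher.test(matrix(c(a, b, c_, d), nrow = 2))
        data.frame(pathway = p, hits = a, p.value = ft$p.value)
    })
    do.call(rbind, res)
}
```

Using all detected metabolites (not the whole KEGG database) as `detected` keeps the background appropriate, as discussed above.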
No significant pathway enrichment found.
With the simulated 3-fold change in Glycolysis metabolites, the Glycolysis pathway would be expected to show significant enrichment in a well-powered run; the null result above illustrates how small sample sizes and multiple-testing correction can mask even strong simulated effects.
9.11 Simulated Ground-Truth Data and Pipeline Benchmarking
9.11.1 The Ground-Truth Paradox
For real experimental datasets, the true underlying biological differences are unknown.[36] This makes it impossible to objectively score pipeline performance (e.g., how many true positives, false positives). The solution is to create simulated datasets with known ground truth and use them to benchmark parameter choices and pipelines.[36, 37]
9.11.2 Template from the MVAPACK Study
A benchmark design includes:
- Known features: predefined peak locations and intensities.
- Known group differences: specified fold changes and acceptable CVs.
- Controlled noise: Gaussian noise at different levels (0%, 5%, 10%).
- Controlled missingness: random deletion of a known proportion of features.[36]
9.11.3 Simulating mzML Files with mzrtsim (Example)
9.11.4 Key Simulation Parameters
| Parameter | Example Value | Effect |
|---|---|---|
| pheight | pheight_case vector | Biological fold change between groups |
| rtime | rt_values | Base retention time for each compound |
| rnorm() on RT | rt_values + rnorm(10) | RT drift (technical variation) |
| tailingfactor | 1.5 | Chromatographic peak tailing |
| pwidth | 10 | Peak width |
| matrixmz | c(101.1, 205.3, ...) | Background chemical noise |
9.11.5 Benchmark Metrics
With ground truth, one can compute:
- True Positives (TP), False Positives (FP), False Negatives (FN)
- Sensitivity (Recall): TP / (TP + FN)
- Positive Predictive Value (PPV): TP / (TP + FP)
- F1 Score: harmonic mean of Recall and PPV.[2, 36]
These metrics allow objective comparison of parameter choices and software pipelines.
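The metrics above can be computed directly; `truth` and `called` are assumed to be logical vectors over the same candidate feature set:

```r
benchmark <- function(truth, called) {
    tp <- sum(truth & called)
    fp <- sum(!truth & called)
    fn <- sum(truth & !called)
    recall <- tp / (tp + fn)
    ppv    <- tp / (tp + fp)
    c(TP = tp, FP = fp, FN = fn,
      Recall = recall, PPV = ppv,
      F1 = 2 * recall * ppv / (recall + ppv))
}

benchmark(truth  = c(TRUE, TRUE, FALSE, FALSE),
          called = c(TRUE, FALSE, TRUE, FALSE))
## TP = FP = FN = 1; Recall = PPV = F1 = 0.5
```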
9.12 Common Pitfalls and R-Based Solutions
9.12.1 Batch Effects Left Uncorrected
- Problem: Systematic variation between analytical batches can overshadow biological signal.[19, 31, 38, 39]
- Diagnosis: PCA colored by batch rather than group shows separation by batch.
- Solution: Use sva::ComBat or mixed-effects models to adjust for batch before downstream analysis.
9.12.2 Normalization That Creates Artifacts
- Problem: Naive normalization (e.g., TIC) can erase true biological differences or create false ones.[19]
- Solution: Use robust methods (log transformation, PQN, mean-centering) as implemented in MetaboAnalystR::Normalization, and check their effects on variance and PCA.
9.12.3 Adduct and Isotope Peaks Counted Multiple Times
- Problem: One metabolite appears as many redundant features (adducts, isotopes), inflating significance.[19]
- Solution: Deconvolve features using CAMERA to group related peaks into pseudospectra before statistics.
9.12.4 Poor Handling of Zero Inflation and Imputation
- Problem: Zeros can represent true absence or “below detection”. Treating all zeros as technical artifacts can bias results.[19, 31]
- Solution: Simple methods (e.g., replacing with 1/5 minimum) are a first pass; more principled models use zero-inflated or hurdle frameworks.
9.12.5 Overinterpreting Unannotated Peaks
- Problem: Publishing strong biological narratives on Level 3/4 features (mass-only matches) leads to fragile conclusions.[19]
- Solution: Respect MSI levels: treat unannotated features as candidates; confirm with MS/MS or standards for Level 2/1.
9.12.6 False Pathway Analysis
- Problem: Using all of KEGG as background inflates significance.
- Solution: Define background as all metabolites detected in your experiment.
9.12.7 Overfitting in Supervised Models
- Problem: PLS-DA can separate random noise if not validated.
- Solution: Always perform cross-validation and permutation testing (e.g., MetaboAnalystR::PLSDA.CV, PLSDA.Permut).
9.13 Quality Control and Batch-Oriented Checks
9.13.1 Simulated QC Samples

Percentage of stable features: 100 %
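Stability here can be assessed as the fraction of features whose CV across repeated QC injections stays below a threshold (30% is a common but illustrative cutoff); the simulation parameters below are assumptions:

```r
set.seed(3)
## Six simulated QC injections of 50 features with small technical
## noise around fixed true intensities
qc <- matrix(rlnorm(50 * 6, meanlog = 10, sdlog = 0.05), nrow = 50)

qc_cv  <- apply(qc, 1, function(x) sd(x) / mean(x) * 100)
stable <- mean(qc_cv < 30) * 100   # ~100% at this low noise level
cat("Percentage of stable features:", stable, "%\n")
```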
9.14 Exercises
- Process a real LC-MS dataset with xcms, from raw files to a feature matrix.
- Compare median, TIC, and quantile normalization on your dataset; inspect effects via PCA.
- Build a full identification workflow using accurate mass and MS/MS database matching.
- Perform a time-course metabolomics analysis and visualize trajectories.
- Implement a complete pipeline with QC samples, batch correction, and pathway enrichment.
9.15 Summary
This chapter presented:
A conceptual overview of the LC-MS metabolomics pipeline and the importance of reproducible, script-based analysis in R.
Practical code examples for:
- Preprocessing with xcms (peak picking, RT correction, grouping, gap filling).
- Quality assessment, filtering, normalization, and scaling.
- Multivariate analysis (PCA, PLS-DA, VIP).
- Metabolite identification and simulated MS/MS.
- Pathway enrichment analysis on simulated data with known biology.
- QC simulation and stability assessment.
A discussion of ground-truth simulation and benchmarking strategies.
A structured review of common pitfalls and R-based solutions.
Combining a transparent, modular R workflow with rigorous simulation-based validation provides a robust foundation for turning raw spectra into reliable, biologically meaningful metabolomic insights.