flowchart TD
subgraph Input
A[Raw LC-MS Data<br/>mzML/mzXML/netCDF]
end
subgraph XCMS["xcms Processing Pipeline"]
B[1. Peak Detection<br/>findChromPeaks<br/>CentWaveParam] --> C[2. RT Correction<br/>adjustRtime<br/>ObiwarpParam]
C --> D[3. Feature Grouping<br/>groupChromPeaks<br/>PeakDensityParam]
D --> E[4. Feature Matrix<br/>featureValues]
E --> F[5. Gap Filling<br/>fillChromPeaks<br/>ChromPeakAreaParam]
end
subgraph Annotation["Feature Annotation"]
F --> G[CAMERA<br/>xsAnnotate]
G --> H[Adduct/Isotope Grouping<br/>findIsotopes/findAdducts]
H --> I[MetaboCoreUtils<br/>Mass Calculations]
end
subgraph Output["Feature Matrix"]
I --> J[Deconvoluted Feature Table]
end
subgraph Stats
J --> K[Quality Control<br/>CV, Missing Values]
K --> L[Normalization<br/>Log, PQN, etc.]
L --> M[Scaling<br/>Auto, Pareto, etc.]
M --> N[Univariate<br/>t-test, ANOVA]
N --> O[Pathway Analysis<br/>Enrichment]
end
A --> B
style Input fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
style XCMS fill:#FBE0FA,stroke:#B000B0,stroke-width:3px,color:#102A43
style Annotation fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
style Output fill:#FBE0FA,stroke:#B000B0,stroke-width:3px,color:#102A43
style Stats fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
9 Metabolomics Data Analysis
Metabolomics involves the comprehensive analysis of small molecules (metabolites) in biological systems. This chapter covers LC-MS metabolomics data processing using the R for Mass Spectrometry ecosystem, with emphasis on the xcms package, Spectra integration, and downstream statistical and biological interpretation.
9.1 Conceptual Overview of LC-MS Metabolomics
9.1.1 Overview of the Metabolomics Pipeline
Mass Spectrometry (MS)-based metabolomics aims to identify and quantify the complete set of small molecules (the metabolome) within a biological system.[1, 2] The data analysis workflow is a complex, multi-stage process that transforms raw instrument data into actionable biological knowledge. While specific tools and platforms vary, the canonical workflow is highly conserved across the field.[3] This pipeline, which applies to both Liquid Chromatography (LC)-MS and Gas Chromatography (GC)-MS data, can be summarized in the following sequential steps:
- Experimental Design and Sample Collection: A non-computational but critical planning stage that includes defining biological groups, determining sample size, and establishing standardized operating procedures (SOPs) for sample collection, processing, and metabolite extraction.[4–6]
- Raw Data Acquisition: Samples are analyzed using LC-MS or GC-MS to separate metabolites and detect them based on their mass-to-charge ratio (m/z) and retention time.[1, 3]
- Raw Data Preprocessing: The first major computational step. Signals are extracted from raw files using noise reduction, peak detection (peak picking), and retention time alignment.[3, 7, 8]
- Data Normalization and Scaling: Processed data are adjusted to correct for systematic, non-biological variation between samples, such as differences in sample loading or instrument sensitivity.[1, 3]
- Statistical Analysis: Clean, normalized data are subjected to univariate (e.g., t-tests, ANOVA) and multivariate (e.g., PCA, PLS-DA) methods to identify patterns and significantly different features between groups.[3, 6, 9]
- Metabolite Identification and Annotation: Statistically significant features (defined by m/z and retention time) are matched against spectral databases (e.g., HMDB, METLIN, KEGG) to determine chemical identity.[3, 7]
- Biological Interpretation: The final list of identified metabolites is mapped to metabolic pathways and networks (pathway analysis) to provide systems-level biological context.[1, 3, 6]
Each step is a discrete module whose output serves as the input for the next, forming a chain of analytical dependencies.
9.1.2 The R-Based Ecosystem vs. “Black Box” Solutions
Researchers have two primary avenues for executing this workflow: user-friendly GUI/web platforms (e.g., XCMS Online, MetaboAnalyst web server)[10, 11] or a programmable, modular workflow built within R.[12]
Web-based tools are accessible but often act as “black boxes,” obscuring parameter settings and limiting flexibility.[10] In contrast, an R-based pipeline:
- Is open, reproducible, and auditable.[12]
- Provides granular control over every parameter, which is critical because metabolomics results are highly sensitive to parameter choices.[15]
- Produces a runnable script that ensures reproducibility, a non-negotiable standard for high-impact publications.
This chapter focuses on constructing a programmatic R workflow centered on a core stack of Bioconductor packages:
xcms for preprocessing, CAMERA for feature annotation, and MetaboAnalystR (and related tools) for statistical analysis and pathway interpretation.[16–18]
9.1.3 The Propagation of Error: A Core Challenge
The metabolomics pipeline is best understood as a sequence of dependencies, like a house of cards. An error introduced at any step—such as improper peak picking—will not be corrected downstream; instead, it will be propagated and amplified, leading to invalid conclusions.[15, 19]
Examples of this cascade:
- If peak detection parameters are mis-specified, noise can be identified as features, or low-abundance signals can be missed.[15]
- If retention time alignment fails, the same metabolite in different samples may be treated as different compounds.
- If feature annotation (e.g., via CAMERA) is skipped, a single metabolite (e.g., glucose) appearing as multiple adducts and isotopes may be counted as multiple independent “significant” features, inflating false positives.[19]
- If this flawed feature list is used in pathway analysis, the resulting biological interpretation is based on noise.
The R modules introduced below are explicitly designed to mitigate and correct issues introduced in previous stages.
9.1.4 Key R Packages in the Metabolomics Workflow
| Package Name | Primary Function | Repository | Key References |
|---|---|---|---|
| xcms | Raw data preprocessing: peak detection, RT alignment, grouping | Bioconductor | [17, 20] |
| CAMERA | Feature deconvolution and annotation (adducts, isotopes, fragments) | Bioconductor | [18, 21, 22] |
| MetaboAnalystR | Statistical analysis, normalization, pathway analysis | GitHub | [16, 23] |
| mzrtsim | Simulated data generation (ground-truth LC-MS data) | GitHub | [24, 25] |
9.2 Setting Up the Metabolomics Environment in R
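The core stack is installed from Bioconductor, with MetaboAnalystR coming from its GitHub repository. A minimal setup sketch (package selection follows the chapter; exact dependencies may vary by R version):

```r
## Install the Bioconductor packages used throughout this chapter
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("xcms", "CAMERA", "Spectra", "MsExperiment"))

## MetaboAnalystR is distributed via GitHub, not Bioconductor
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("xia-lab/MetaboAnalystR")

library(xcms)
library(MsExperiment)
```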
9.3 Visual Overview of the LC-MS Workflow
A typical untargeted metabolomics workflow consists of:
- Peak detection – Identifying m/z–RT features across samples.
- Retention time alignment – Correcting RT drift between samples.
- Correspondence – Matching features across samples to form a feature matrix.
- Gap filling – Integrating missing peaks.
- Annotation – Identifying metabolites or compound groups.
- Statistical analysis – Finding differential metabolites and patterns.
- Pathway analysis – Mapping metabolites to biological pathways.
XCMS Workflow Key Parameters
- CentWave: ppm (5–25), peakwidth (5–30 sec), snthresh (3–10)
- Alignment: binSize (e.g., 1), other alignment parameters in ObiwarpParam
- Grouping: bw (1–30), minFraction (0.3–0.7 or higher), sampleGroups
- Gap Filling: expandMz and expandRt define integration windows
| Step | Process | Package | Output |
|---|---|---|---|
| 1 | Peak Detection | xcms | Chromatographic peaks |
| 2 | Retention Time Correction | xcms | Aligned peaks |
| 3 | Correspondence | xcms | Feature matrix |
| 4 | Gap Filling | xcms | Complete matrix |
| 5 | Annotation | CAMERA/CompoundDb | Putative IDs |
| 6 | Statistical Analysis | limma/mixOmics | Differential metabolites |
9.4 Preprocessing LC-MS Data with xcms
9.4.1 Data Import into R Objects
Modern mass spectrometers export data in formats such as mzML, mzXML, or netCDF.[17, 20] The RforMassSpectrometry ecosystem uses Spectra and MsExperiment for flexible and modern MS data handling.
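A minimal import sketch using MsExperiment::readMsExperiment; the file names and sample annotations below are placeholders:

```r
library(MsExperiment)

## Placeholder raw files plus per-sample metadata
files <- c("sample_ctrl_1.mzML", "sample_case_1.mzML")
pd <- data.frame(sample_name = c("ctrl_1", "case_1"),
                 group = c("control", "case"))

## Import: spectra are read on demand via an on-disk backend,
## so even large experiments stay memory-friendly
mse <- readMsExperiment(spectraFiles = files, sampleData = pd)
mse
```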
9.4.2 Chromatographic Peak Detection
Peak detection identifies chromatographic peaks (ions at specific m/z and RT) and discriminates them from noise.[7, 8] The CentWave algorithm is standard for high-resolution data and is implemented via CentWaveParam and findChromPeaks.[18, 27]
Key parameters:
- peakwidth – expected RT width; too narrow splits peaks, too wide merges them.
- ppm – mass accuracy; must match instrument performance.
- snthresh – signal-to-noise threshold; too low → noise, too high → missing true signals.
- prefilter – discards regions without enough signal.[20]
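A peak detection sketch with CentWave; the parameter values are illustrative starting points, not recommendations for any particular instrument, and `mse` is assumed to be an imported MsExperiment object:

```r
library(xcms)

## Illustrative CentWave settings: tune ppm to instrument accuracy
## and peakwidth to the observed peak widths (in seconds)
cwp <- CentWaveParam(ppm = 20,
                     peakwidth = c(5, 30),
                     snthresh = 5,
                     prefilter = c(3, 1000))

xdata <- findChromPeaks(mse, param = cwp)
head(chromPeaks(xdata))   # one row per detected chromatographic peak
```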
9.4.3 Retention Time Alignment
Retention time drift arises from changes in column pressure, temperature, or solvent composition.[7, 8, 28] It must be corrected before samples can be compared.
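Alignment with the Obiwarp method can be sketched as follows; `xdata` is assumed to hold detected peaks, and the `binSize` value is illustrative:

```r
## Obiwarp warps each sample's RT axis onto a reference sample;
## binSize sets the m/z bin width of the profile matrix it uses
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 1))

## Inspect the adjusted retention times
head(adjustedRtime(xdata))
```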
9.4.4 Feature Grouping (Correspondence)
After peak detection and RT correction, features are grouped across samples to form a consistent feature matrix.[8, 20, 29]
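A correspondence sketch using the peak density method; `sampleGroups` must list one group label per sample, and the `bw`/`minFraction` values are illustrative:

```r
## Group peaks across samples by their density along the RT axis
pdp <- PeakDensityParam(sampleGroups = c("control", "case"),
                        bw = 5,
                        minFraction = 0.5)
xdata <- groupChromPeaks(xdata, param = pdp)

## One row per feature (m/z-RT pair matched across samples)
featureDefinitions(xdata)
```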
9.4.5 Filling Missing Peaks
Optional gap filling integrates signal for peaks that fell below the detection threshold in some samples.[8, 29]
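Gap filling and extraction of the final matrix can be sketched as:

```r
## Integrate raw signal in each feature's m/z-RT region for samples
## where no peak was detected, then extract the feature matrix
xdata <- fillChromPeaks(xdata, param = ChromPeakAreaParam())

fmat <- featureValues(xdata, value = "into")   # features x samples
head(fmat)
```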
9.5 Simulated Metabolomics Dataset for Demonstration
To demonstrate downstream steps (QC, normalization, PCA, pathway analysis), we construct a simulated dataset with a true biological signal in one pathway (Glycolysis).
Synthetic dataset: 20 samples and 50 metabolites
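One way to build such a dataset; all names and parameters below are illustrative, chosen to match the description (20 samples, 50 metabolites, a 3-fold change in a Glycolysis subset):

```r
set.seed(42)
n_samp <- 20
n_met  <- 50
group  <- rep(c("control", "case"), each = n_samp / 2)

## Log-normal baseline intensities, features in rows
base <- matrix(rlnorm(n_met * n_samp, meanlog = 10, sdlog = 0.3),
               nrow = n_met,
               dimnames = list(paste0("met_", seq_len(n_met)),
                               paste0("samp_", seq_len(n_samp))))

## Spike a 3-fold increase into the first five "Glycolysis"
## metabolites of the case group
glyco <- 1:5
base[glyco, group == "case"] <- base[glyco, group == "case"] * 3
```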
9.6 Quality Assessment and Feature Filtering
9.6.1 Peak Quality Assessment

9.6.2 Peak Filtering
Filtering results:
CV filter: 24 passed
Detection rate filter: 50 passed
Intensity filter: 9 passed
Combined filter: 3 passed
total_features passed_filter filter_rate
1 50 3 6
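Filters of this kind can be sketched as below; the thresholds are illustrative, and `base` and `group` denote the simulated intensity matrix and sample labels:

```r
## Coefficient of variation per feature, across all samples
cv <- apply(base, 1, function(x) sd(x) / mean(x) * 100)

## Illustrative thresholds for the three filters
keep_cv  <- cv < 30                                         # CV filter
keep_det <- rowMeans(base > 0) >= 0.8                       # detection rate
keep_int <- rowMeans(base) > quantile(rowMeans(base), 0.8)  # intensity

keep <- keep_cv & keep_det & keep_int
data.frame(total_features = nrow(base),
           passed_filter  = sum(keep),
           filter_rate    = round(100 * sum(keep) / nrow(base)))
```

Note that filter_rate is reported as a percentage of all features.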
9.7 Normalization and Scaling
9.7.1 Normalization Methods
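Two common methods can be sketched directly in base R; `base` denotes an intensity matrix with features in rows:

```r
## Log transformation stabilizes variance
logX <- log2(base + 1)

## Probabilistic Quotient Normalization (PQN): divide each sample by
## the median of its feature-wise quotients to a reference spectrum
ref  <- apply(base, 1, median)            # reference = median spectrum
quot <- sweep(base, 1, ref, "/")          # quotients per feature
pqnX <- sweep(base, 2, apply(quot, 2, median), "/")
```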

9.7.2 Scaling Methods
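Autoscaling and Pareto scaling operate feature-wise; a self-contained sketch on a stand-in log-transformed matrix:

```r
set.seed(1)
X <- matrix(rnorm(50 * 20), nrow = 50)   # stand-in matrix, features in rows

## Autoscaling: mean 0, unit variance per feature
auto <- t(scale(t(X)))

## Pareto scaling: divide by sqrt(sd), damping dominance of
## high-variance features less aggressively than autoscaling
pareto <- t(scale(t(X), scale = sqrt(apply(X, 1, sd))))
```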

9.8 Multivariate Analysis
9.8.1 Principal Component Analysis (PCA)
PCA provides an unsupervised view of the main variance sources. If samples separate mainly by batch, there is a technical problem; if they separate by biological group, the signal is promising.
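A PCA sketch with base R; `base` and `group` denote the simulated matrix and sample labels:

```r
## PCA on log-transformed, autoscaled data; prcomp expects samples
## in rows, hence the transpose
pca <- prcomp(t(log2(base + 1)), center = TRUE, scale. = TRUE)

## Variance explained by the first two components
summary(pca)$importance[, 1:2]

## Score plot colored by group: batch-driven clustering signals a
## technical problem, group-driven clustering is promising
plot(pca$x[, 1], pca$x[, 2], col = factor(group),
     xlab = "PC1", ylab = "PC2")
```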

9.8.2 Partial Least Squares Discriminant Analysis (PLS-DA) and VIP
PLS-DA uses group labels to maximize separation but is prone to overfitting, so cross-validation and permutation tests are essential.
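The logic of a permutation test can be sketched generically: recompute a separation statistic under shuffled labels and compare it to the observed value. Here a simple between/within variance ratio on toy scores stands in for the PLS-DA model (a deliberate simplification; MetaboAnalystR and mixOmics provide dedicated routines):

```r
set.seed(7)
scores <- matrix(rnorm(20 * 2), ncol = 2)
scores[11:20, 1] <- scores[11:20, 1] + 2      # shift one group
labels <- rep(c("A", "B"), each = 10)

## Separation statistic: squared group-mean difference over the
## mean within-group variance, on the first score dimension
sep_stat <- function(s, lab) {
    m <- tapply(s[, 1], lab, mean)
    diff(range(m))^2 / mean(tapply(s[, 1], lab, var))
}

obs  <- sep_stat(scores, labels)
perm <- replicate(999, sep_stat(scores, sample(labels)))
mean(c(obs, perm) >= obs)   # empirical permutation p-value
```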


9.9 Metabolite Identification
9.9.1 Accurate Mass Matching
No matches found in database
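Accurate mass matching reduces to computing ppm differences between observed feature m/z values and theoretical adduct masses. A toy sketch; the database entries, observed masses, and tolerance are illustrative:

```r
## Toy database of [M+H]+ masses
db <- data.frame(name = c("Glucose", "Lactate", "Citrate"),
                 mz   = c(181.0707, 91.0390, 193.0343))

observed <- c(181.0712, 147.0532)
tol_ppm  <- 10

for (mz in observed) {
    ppm <- abs(db$mz - mz) / db$mz * 1e6
    hit <- which(ppm <= tol_ppm)
    if (length(hit))
        cat(sprintf("%.4f -> %s (%.1f ppm)\n", mz, db$name[hit], ppm[hit]))
    else
        cat(sprintf("%.4f -> no match\n", mz))
}
```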
9.9.2 Simulated MS/MS Spectrum

9.10 Pathway and Enrichment Analysis
9.10.1 Conceptual Background
Once significant features have putative metabolite IDs, pathway analysis tests whether these metabolites are overrepresented in particular biological pathways (e.g., Glycolysis, TCA cycle).[1, 3, 6, 9] The correct background set should be all detected metabolites in the experiment, not the entire KEGG database, to maintain proper statistical context.[19]
9.10.2 Simple Pathway Enrichment on the Simulated Dataset
First, identify significantly changing metabolites via t-tests with FDR correction:
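A sketch of this step; `base` and `group` denote the simulated matrix and sample labels, and the FDR cutoff follows the chapter:

```r
## Per-metabolite Welch t-tests between groups, BH-adjusted
pvals <- apply(base, 1, function(x)
    t.test(x[group == "case"], x[group == "control"])$p.value)
fdr <- p.adjust(pvals, method = "BH")
sig <- rownames(base)[fdr < 0.05]
length(sig)
```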
Number of significant metabolites (FDR < 0.05): 0
Define a toy pathway database:
Enrichment function using Fisher’s exact test:
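A minimal sketch of such a function, assuming `sig` is the vector of significant metabolite names, `detected` the background of all detected metabolites, and `pathways` a named list of metabolite sets:

```r
enrich_fisher <- function(sig, detected, pathways) {
    res <- lapply(names(pathways), function(p) {
        pw <- intersect(pathways[[p]], detected)
        a  <- sum(sig %in% pw)                     # significant, in pathway
        b  <- sum(!(sig %in% pw))                  # significant, not in pathway
        c_ <- sum(setdiff(detected, sig) %in% pw)  # not significant, in pathway
        d  <- length(detected) - a - b - c_        # neither
        ft <- fisher.test(matrix(c(a, b, c_, d), nrow = 2))
        data.frame(pathway = p, hits = a, p.value = ft$p.value)
    })
    do.call(rbind, res)
}
```

Using all detected metabolites (not the whole KEGG database) as `detected` keeps the background appropriate, as discussed above.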
No significant pathway enrichment found.
With the simulated 3-fold change in Glycolysis metabolites, the Glycolysis pathway would be expected to show significant enrichment in a well-powered run; the null result above illustrates how small sample sizes and multiple-testing correction can mask even strong simulated effects.
9.11 Simulated Ground-Truth Data and Pipeline Benchmarking
9.11.1 The Ground-Truth Paradox
For real experimental datasets, the true underlying biological differences are unknown.[36] This makes it impossible to objectively score pipeline performance (e.g., how many true positives, false positives). The solution is to create simulated datasets with known ground truth and use them to benchmark parameter choices and pipelines.[36, 37]
9.11.2 Template from the MVAPACK Study
A benchmark design includes:
- Known features: predefined peak locations and intensities.
- Known group differences: specified fold changes and acceptable CVs.
- Controlled noise: Gaussian noise at different levels (0%, 5%, 10%).
- Controlled missingness: random deletion of a known proportion of features.[36]
9.11.3 Simulating mzML Files with mzrtsim (Example)
9.11.4 Key Simulation Parameters
| Parameter | Example Value | Effect |
|---|---|---|
| pheight | pheight_case vector | Biological fold change between groups |
| rtime | rt_values | Base retention time for each compound |
| rnorm() on RT | rt_values + rnorm(10) | RT drift (technical variation) |
| tailingfactor | 1.5 | Chromatographic peak tailing |
| pwidth | 10 | Peak width |
| matrixmz | c(101.1, 205.3, ...) | Background chemical noise |
9.11.5 Benchmark Metrics
With ground truth, one can compute:
- True Positives (TP), False Positives (FP), False Negatives (FN)
- Sensitivity (Recall): TP / (TP + FN)
- Positive Predictive Value (PPV): TP / (TP + FP)
- F1 Score: harmonic mean of Recall and PPV.[2, 36]
These metrics allow objective comparison of parameter choices and software pipelines.
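The metrics above can be computed directly; `truth` and `called` are assumed to be logical vectors over the same candidate feature set:

```r
benchmark <- function(truth, called) {
    tp <- sum(truth & called)
    fp <- sum(!truth & called)
    fn <- sum(truth & !called)
    recall <- tp / (tp + fn)
    ppv    <- tp / (tp + fp)
    c(TP = tp, FP = fp, FN = fn,
      Recall = recall, PPV = ppv,
      F1 = 2 * recall * ppv / (recall + ppv))
}

benchmark(truth  = c(TRUE, TRUE, FALSE, FALSE),
          called = c(TRUE, FALSE, TRUE, FALSE))
## TP = FP = FN = 1; Recall = PPV = F1 = 0.5
```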
9.12 Common Pitfalls and R-Based Solutions
9.12.1 Batch Effects Left Uncorrected
- Problem: Systematic variation between analytical batches can overshadow biological signal.[19, 31, 38, 39]
- Diagnosis: PCA colored by batch rather than group shows separation by batch.
- Solution: Use sva::ComBat or mixed-effects models to adjust for batch before downstream analysis.
9.12.2 Normalization That Creates Artifacts
- Problem: Naive normalization (e.g., TIC) can erase true biological differences or create false ones.[19]
- Solution: Use robust methods (log transformation, PQN, mean-centering) as implemented in MetaboAnalystR::Normalization, and check their effects on variance and PCA.
9.12.3 Adduct and Isotope Peaks Counted Multiple Times
- Problem: One metabolite appears as many redundant features (adducts, isotopes), inflating significance.[19]
- Solution: Deconvolve features using CAMERA to group related peaks into pseudospectra before statistics.
9.12.4 Poor Handling of Zero Inflation and Imputation
- Problem: Zeros can represent true absence or “below detection”. Treating all zeros as technical artifacts can bias results.[19, 31]
- Solution: Simple methods (e.g., replacing with 1/5 minimum) are a first pass; more principled models use zero-inflated or hurdle frameworks.
9.12.5 Overinterpreting Unannotated Peaks
- Problem: Publishing strong biological narratives on Level 3/4 features (mass-only matches) leads to fragile conclusions.[19]
- Solution: Respect MSI levels: treat unannotated features as candidates; confirm with MS/MS or standards for Level 2/1.
9.12.6 False Pathway Analysis
- Problem: Using all of KEGG as background inflates significance.
- Solution: Define background as all metabolites detected in your experiment.
9.12.7 Overfitting in Supervised Models
- Problem: PLS-DA can separate random noise if not validated.
- Solution: Always perform cross-validation and permutation testing (e.g., MetaboAnalystR::PLSDA.CV, PLSDA.Permut).
9.13 Quality Control and Batch-Oriented Checks
9.13.1 Simulated QC Samples

Percentage of stable features: 100 %
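Stability here can be assessed as the fraction of features whose CV across repeated QC injections stays below a threshold (30% is a common but illustrative cutoff); the simulation parameters below are assumptions:

```r
set.seed(3)
## Six simulated QC injections of 50 features with small technical
## noise around fixed true intensities
qc <- matrix(rlnorm(50 * 6, meanlog = 10, sdlog = 0.05), nrow = 50)

qc_cv  <- apply(qc, 1, function(x) sd(x) / mean(x) * 100)
stable <- mean(qc_cv < 30) * 100   # ~100% at this low noise level
cat("Percentage of stable features:", stable, "%\n")
```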
9.14 Exercises
- Process a real LC-MS dataset with xcms, from raw files to a feature matrix.
- Compare median, TIC, and quantile normalization on your dataset; inspect effects via PCA.
- Build a full identification workflow using accurate mass and MS/MS database matching.
- Perform a time-course metabolomics analysis and visualize trajectories.
- Implement a complete pipeline with QC samples, batch correction, and pathway enrichment.
9.15 Summary
This chapter presented:
A conceptual overview of the LC-MS metabolomics pipeline and the importance of reproducible, script-based analysis in R.
Practical code examples for:
- Preprocessing with xcms (peak picking, RT correction, grouping, gap filling).
- Quality assessment, filtering, normalization, and scaling.
- Multivariate analysis (PCA, PLS-DA, VIP).
- Metabolite identification and simulated MS/MS.
- Pathway enrichment analysis on simulated data with known biology.
- QC simulation and stability assessment.
A discussion of ground-truth simulation and benchmarking strategies.
A structured review of common pitfalls and R-based solutions.
Combining a transparent, modular R workflow with rigorous simulation-based validation provides a robust foundation for turning raw spectra into reliable, biologically meaningful metabolomic insights.