8  Statistical Analysis of MS Data

Statistical analysis is what turns an MS feature matrix into defensible biological conclusions. This chapter covers statistical methods that integrate with the R for Mass Spectrometry ecosystem, from descriptive statistics and univariate hypothesis tests to multivariate approaches such as PCA and clustering.

flowchart LR
    subgraph Input["MS Feature Matrix"]
        A[Features × Samples<br/>Intensity Data]
    end
    
    subgraph QC["Quality Control"]
        B[Missing Value<br/>Assessment]
        C[CV Analysis<br/>Technical Replicates]
        D[Outlier Detection<br/>PCA/Clustering]
    end
    
    subgraph Norm["Normalization"]
        E[Total Ion Current<br/>TIC]
        F[Internal Standard<br/>IS]
        G[Median/Quantile<br/>Normalization]
    end
    
    subgraph Univariate["Univariate Tests"]
        H[t-test / Wilcoxon]
        I[ANOVA / Kruskal-Wallis]
        J[Linear Models<br/>limma]
    end
    
    subgraph Multivariate["Multivariate Analysis"]
        K[PCA<br/>Dimensionality Reduction]
        L[PLS-DA<br/>Supervised]
        M[Hierarchical<br/>Clustering]
    end
    
    subgraph Results["Results & Interpretation"]
        N[Volcano Plot<br/>FC vs p-value]
        O[Heatmap<br/>Expression Patterns]
        P[Pathway Analysis<br/>Enrichment]
    end
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    G --> K
    H --> I
    I --> J
    K --> L
    L --> M
    J --> N
    M --> O
    N --> P
    O --> P
    
  style Input fill:#D7E6FB,stroke:#27408B,stroke-width:2px,color:#102A43
  style QC fill:#FBE0FA,stroke:#B000B0,stroke-width:2px,color:#102A43
  style Norm fill:#D7E6FB,stroke:#27408B,stroke-width:2px,color:#102A43
  style Univariate fill:#FBE0FA,stroke:#B000B0,stroke-width:2px,color:#102A43
  style Multivariate fill:#D7E6FB,stroke:#27408B,stroke-width:2px,color:#102A43
  style Results fill:#FBE0FA,stroke:#B000B0,stroke-width:2px,color:#102A43

Statistical Analysis Best Practices
  1. Quality Control First: Remove low-quality features before analysis
  2. Appropriate Normalization: Choose method based on experimental design
  3. Multiple Testing Correction: Always correct for multiple testing when testing many features, e.g. Benjamini-Hochberg FDR or Bonferroni
  4. Effect Size: Report fold-changes alongside p-values
  5. Validation: Confirm findings with orthogonal methods

8.1 Setting Up the Statistical Environment

Dataset created:
  Samples: 30 
  Features: 100 
  Design: 2 conditions × 3 timepoints × 5 replicates
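
The code that generated this dataset is not shown, so the following is only a minimal sketch of how a dataset with the same shape could be simulated. The object names (`design`, `intensity`), the factor labels, and the log-normal parameters are assumptions made for illustration, not the chapter's actual settings.

```r
# Hypothetical simulation of a 100 x 30 feature matrix; all names and
# distribution parameters here are illustrative assumptions.
set.seed(42)

# Experimental design: 2 conditions x 3 timepoints x 5 replicates = 30 samples
design <- expand.grid(
  replicate = 1:5,
  timepoint = c("T0", "T1", "T2"),
  condition = c("Control", "Treatment"),
  stringsAsFactors = TRUE
)
design$sample <- sprintf("Sample_%02d", seq_len(nrow(design)))

# Feature matrix: features in rows, samples in columns, with right-skewed
# (log-normal) intensities as is typical for MS peak areas
n_features <- 100
intensity <- matrix(
  rlnorm(n_features * nrow(design), meanlog = 10, sdlog = 1),
  nrow = n_features,
  dimnames = list(paste0("Feature_", seq_len(n_features)), design$sample)
)

cat("Samples:", ncol(intensity), "\n")
cat("Features:", nrow(intensity), "\n")
```

Keeping the design in a separate data frame, one row per sample in the same order as the matrix columns, makes it easy to pass grouping factors to the tests later on.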

8.2 Descriptive Statistics

8.2.1 Basic Summary Statistics

   feature     mean   median        sd    cv (%)       min      max
 Feature_1 38539.95 30587.06  36945.31  95.86238  7683.549 154764.7
 Feature_2 43532.54 32455.23  40455.91  92.93257  6380.475 205635.6
 Feature_3 37879.54 27653.87  34593.29  91.32447  5644.510 184580.6
 Feature_4 36481.56 23965.46  31656.02  86.77265  6037.189 131991.5
 Feature_5 33341.32 23482.69  30827.79  92.46123  4261.610 118194.5
 Feature_6 57532.78 31559.10 104430.56 181.51490  8037.520 568784.5
 Feature_7 38466.44 23346.28  37830.52  98.34684  7718.372 165906.0
 Feature_8 38978.13 25492.49  34694.03  89.00897  8192.558 173304.9
 Feature_9 50905.26 27036.39  52624.87 103.37805 11720.590 224033.0
Feature_10 42747.82 33157.06  28850.95  67.49104  5802.359 116445.9
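
A table of this form can be computed with base R alone. The sketch below assumes a features × samples matrix called `intensity` (a hypothetical name, filled here with simulated values) and reports the coefficient of variation in percent, i.e. 100 × sd / mean.

```r
set.seed(1)
# Hypothetical stand-in for the feature matrix (features in rows)
intensity <- matrix(rlnorm(10 * 30, meanlog = 10, sdlog = 1), nrow = 10,
                    dimnames = list(paste0("Feature_", 1:10), NULL))

feature_mean <- rowMeans(intensity, na.rm = TRUE)
feature_sd   <- apply(intensity, 1, sd, na.rm = TRUE)

summary_stats <- data.frame(
  feature = rownames(intensity),
  mean    = feature_mean,
  median  = apply(intensity, 1, median, na.rm = TRUE),
  sd      = feature_sd,
  cv      = 100 * feature_sd / feature_mean,  # coefficient of variation (%)
  min     = apply(intensity, 1, min, na.rm = TRUE),
  max     = apply(intensity, 1, max, na.rm = TRUE),
  row.names = NULL  # avoid repeating the feature name as a row name
)

print(summary_stats)
```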

8.2.2 Distribution Analysis
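
In the summary table above, every feature's mean exceeds its median and the CVs are large, which is the usual signature of right-skewed intensities. One way to check this, sketched here on simulated data with a hand-written `skewness()` helper (both are assumptions, not the chapter's own code), is to compare skewness before and after a log2 transformation.

```r
set.seed(3)
# Hypothetical right-skewed intensities (features x samples)
intensity <- matrix(rlnorm(100 * 30, meanlog = 10, sdlog = 1), nrow = 100)

# Sample skewness (third standardized moment); about 0 for symmetric data
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

log_intensity <- log2(intensity)

cat(sprintf("Skewness (raw):  %.2f\n", skewness(as.vector(intensity))))
cat(sprintf("Skewness (log2): %.2f\n", skewness(as.vector(log_intensity))))
```

A skewness close to zero after the transformation suggests that tests assuming approximate normality are better applied to the log-scale values.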

8.2.3 Missing Value Analysis
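
A minimal sketch of a missing value assessment follows. The 10% missingness injected here and the 80% presence threshold are illustrative assumptions; missing values are placed completely at random, whereas missingness in real MS data often depends on intensity.

```r
set.seed(2)
# Hypothetical feature matrix (features x samples)
intensity <- matrix(rlnorm(100 * 30, meanlog = 10, sdlog = 1), nrow = 100)

# Set 300 of the 3000 values to NA, completely at random (illustration only)
intensity[sample(length(intensity), 300)] <- NA

# Fraction of missing values per feature (rows) and per sample (columns)
missing_per_feature <- rowMeans(is.na(intensity))
missing_per_sample  <- colMeans(is.na(intensity))

cat(sprintf("Overall missingness: %.1f%%\n", 100 * mean(is.na(intensity))))
cat(sprintf("Worst sample: %.1f%% missing\n", 100 * max(missing_per_sample)))

# Keep features observed in at least 80% of samples (assumed threshold)
keep     <- missing_per_feature <= 0.2
filtered <- intensity[keep, , drop = FALSE]
cat("Features retained:", nrow(filtered), "of", nrow(intensity), "\n")
```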

8.3 Hypothesis Testing

8.3.1 Two-Sample t-tests

Number of significant features (FDR < 0.05): 0

No feature passes the 5% FDR threshold, so the table of significant results is empty:

[1] p.value       statistic     estimate_diff feature       p.adjusted
<0 rows> (or 0-length row.names)
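
The per-feature testing step can be sketched as below. The column names mirror the output above, but the data are simulated without any group effect, and the use of Welch's t-test with Benjamini-Hochberg adjustment is an assumption about how the original results were produced.

```r
set.seed(4)
# Hypothetical log-scale data: 100 features x 30 samples, two groups of 15
group <- factor(rep(c("Control", "Treatment"), each = 15))
log_intensity <- matrix(rnorm(100 * 30, mean = 14, sd = 1), nrow = 100,
                        dimnames = list(paste0("Feature_", 1:100), NULL))

# Welch two-sample t-test for each feature (Treatment vs Control)
t_results <- do.call(rbind, lapply(rownames(log_intensity), function(f) {
  y  <- log_intensity[f, ]
  tt <- t.test(y[group == "Treatment"], y[group == "Control"])
  data.frame(
    p.value       = tt$p.value,
    statistic     = unname(tt$statistic),
    estimate_diff = unname(tt$estimate[1] - tt$estimate[2]),  # Treatment - Control
    feature       = f
  )
}))

# Benjamini-Hochberg adjustment controls the false discovery rate
t_results$p.adjusted <- p.adjust(t_results$p.value, method = "BH")

significant <- t_results[t_results$p.adjusted < 0.05, ]
cat("Number of significant features (FDR < 0.05):", nrow(significant), "\n")
```

Swapping `method = "BH"` for `"bonferroni"` or `"holm"` in `p.adjust()` gives the more conservative family-wise corrections.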

8.3.2 Volcano Plot
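
A volcano plot places the log2 fold change on the x-axis and -log10 of the p-value on the y-axis. The base-graphics sketch below uses simulated data in which the first five features carry a spiked-in effect; the effect size and the thresholds (|log2 FC| > 1, FDR < 0.05) are illustrative assumptions.

```r
set.seed(5)
group <- rep(c("Control", "Treatment"), each = 15)
log2_intensity <- matrix(rnorm(100 * 30, mean = 14, sd = 1), nrow = 100)

# Spike a 3-unit log2 shift into the first five features (illustration only)
treated <- group == "Treatment"
log2_intensity[1:5, treated] <- log2_intensity[1:5, treated] + 3

# Effect size: difference of group means on the log2 scale
log2_fc <- rowMeans(log2_intensity[, treated]) -
  rowMeans(log2_intensity[, !treated])

# Significance: Welch t-test per feature, then BH adjustment
p_value <- apply(log2_intensity, 1, function(y) {
  t.test(y[treated], y[!treated])$p.value
})
p_adj <- p.adjust(p_value, method = "BH")

# Highlight features passing both the effect-size and FDR thresholds
hit <- p_adj < 0.05 & abs(log2_fc) > 1
plot(log2_fc, -log10(p_value), pch = 19,
     col = ifelse(hit, "red", "grey50"),
     xlab = "log2 fold change (Treatment / Control)",
     ylab = "-log10(p-value)")
abline(v = c(-1, 1), lty = 2)
```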

8.4 ANOVA for Multiple Groups

Number of significant features (ANOVA FDR < 0.05): 0 
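
A minimal per-feature one-way ANOVA can be sketched as follows. The grouping factor used in the original analysis is not shown, so grouping by timepoint here is an assumption, as are the simulated data.

```r
set.seed(6)
# Hypothetical three-level factor: 10 samples per timepoint
timepoint <- factor(rep(c("T0", "T1", "T2"), each = 10))
log_intensity <- matrix(rnorm(100 * 30, mean = 14, sd = 1), nrow = 100)

# One-way ANOVA p-value for the timepoint effect, one feature at a time
anova_p <- apply(log_intensity, 1, function(y) {
  fit <- aov(y ~ timepoint)
  summary(fit)[[1]][["Pr(>F)"]][1]  # first row holds the factor's F-test
})

anova_fdr <- p.adjust(anova_p, method = "BH")
cat("Number of significant features (ANOVA FDR < 0.05):",
    sum(anova_fdr < 0.05), "\n")
```

For non-normal data, `kruskal.test(y ~ timepoint)$p.value` is a rank-based alternative that drops into the same loop.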

8.5 Correlation Analysis

8.5.1 Feature-Feature Correlations
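
Feature-feature correlations can be computed with `cor()` after transposing the matrix, because `cor()` correlates columns. The sketch below uses simulated data and Spearman correlation; both choices are assumptions made for illustration.

```r
set.seed(7)
# Hypothetical log-scale data: 20 features x 30 samples
log_intensity <- matrix(rnorm(20 * 30), nrow = 20,
                        dimnames = list(paste0("Feature_", 1:20), NULL))

# Transpose so that features are in columns, then correlate
feature_cor <- cor(t(log_intensity), method = "spearman")

# Find the most strongly correlated pair, using the upper triangle only
upper <- feature_cor
upper[!upper.tri(upper)] <- NA
top <- which(abs(upper) == max(abs(upper), na.rm = TRUE), arr.ind = TRUE)[1, ]

cat(sprintf("Strongest pair: %s vs %s (rho = %.2f)\n",
            rownames(feature_cor)[top[1]],
            colnames(feature_cor)[top[2]],
            feature_cor[top[1], top[2]]))
```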

8.5.2 Correlation with Experimental Factors

8.6 Principal Component Analysis (PCA)

8.6.1 Performing PCA
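
A minimal PCA sketch with `prcomp()` follows. The data are simulated, and centering plus unit-variance scaling of each feature is an assumed preprocessing choice rather than the chapter's confirmed setting.

```r
set.seed(8)
# Hypothetical log-scale data: 100 features x 30 samples
log_intensity <- matrix(rnorm(100 * 30, mean = 14, sd = 1), nrow = 100,
                        dimnames = list(paste0("Feature_", 1:100),
                                        paste0("Sample_", 1:30)))

# prcomp() expects observations (samples) in rows, so transpose first
pca <- prcomp(t(log_intensity), center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component (scree plot input)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

scores   <- pca$x[, 1:2]         # sample coordinates on PC1 and PC2
loadings <- pca$rotation[, 1:2]  # feature contributions to PC1 and PC2

cat(sprintf("PC1: %.1f%%, PC2: %.1f%% of variance\n",
            100 * var_explained[1], 100 * var_explained[2]))
```

The same object serves the following subsections: `scores` for the sample plot, `var_explained` for the scree plot, and `loadings` for identifying the features that drive each component.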

8.6.2 PCA Visualization

8.6.3 Scree Plot

8.6.4 PCA Loadings

8.7 Clustering Analysis

8.7.1 Hierarchical Clustering
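
Hierarchical clustering of samples can be sketched with `dist()` and `hclust()`. Euclidean distance on z-scored features, Ward linkage, and a cut into three clusters are all assumptions chosen for illustration.

```r
set.seed(9)
# Hypothetical log-scale data: 100 features x 30 samples
log_intensity <- matrix(rnorm(100 * 30), nrow = 100,
                        dimnames = list(NULL, paste0("Sample_", 1:30)))

# Samples in rows; scale() then z-scores each feature (column)
scaled <- scale(t(log_intensity))

# Euclidean distances between samples, followed by Ward linkage
sample_dist <- dist(scaled)
hc <- hclust(sample_dist, method = "ward.D2")

# Cut the dendrogram into three clusters and tabulate their sizes
clusters <- cutree(hc, k = 3)
print(table(clusters))
```

Calling `plot(hc)` draws the dendrogram, which is usually the first thing to inspect before deciding where to cut.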

8.7.2 K-means Clustering

8.7.3 Cluster Validation

cluster size ave.sil.width
      1   22          0.10
      2    7         -0.03
      3    1          0.00

Average silhouette width: 0.071 
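
Silhouette widths range from -1 to 1; values near zero, as reported above, indicate weak cluster structure, and a negative cluster average suggests samples that sit closer to a neighboring cluster. The sketch below shows one way to obtain such a summary with k-means and the `cluster` package. The data are simulated and k = 3 is assumed to match the table.

```r
library(cluster)  # recommended package shipped with standard R installations

set.seed(10)
# Hypothetical data with samples in rows: 30 samples x 100 features
samples <- matrix(rnorm(30 * 100), nrow = 30)

# k-means with several random starts to reduce dependence on initialization
km <- kmeans(samples, centers = 3, nstart = 25)

# Silhouette width of each sample given its cluster and pairwise distances
sil <- silhouette(km$cluster, dist(samples))
sil_summary <- summary(sil)

print(data.frame(
  cluster       = names(sil_summary$clus.avg.widths),
  size          = as.vector(sil_summary$clus.sizes),
  ave.sil.width = round(as.vector(sil_summary$clus.avg.widths), 2)
))
cat("Average silhouette width:", round(sil_summary$avg.width, 3), "\n")
```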

8.8 Heat Map Analysis

8.8.1 Feature Heat Map

8.9 Power Analysis

8.9.1 Sample Size Calculation
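
A sample size calculation for a two-group comparison can be sketched with `power.t.test()` from the stats package. The planning values below (a 1-unit log2 difference, i.e. a 2-fold change, with a within-group SD of 1 log2 unit, 80% power) are assumptions for illustration only.

```r
# Assumed planning values on the log2 scale (illustrative, not from the chapter)
effect_log2 <- 1  # 2-fold change
sd_log2     <- 1  # within-group standard deviation

# Per-group sample size for a single two-sided, two-sample t-test
single_test <- power.t.test(delta = effect_log2, sd = sd_log2,
                            sig.level = 0.05, power = 0.8,
                            type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(single_test$n)

# Testing 100 features calls for a stricter threshold; a Bonferroni-adjusted
# significance level gives a conservative upper bound on the sample size
strict_test <- power.t.test(delta = effect_log2, sd = sd_log2,
                            sig.level = 0.05 / 100, power = 0.8,
                            type = "two.sample", alternative = "two.sided")
n_strict <- ceiling(strict_test$n)

cat("Samples per group (single test):", n_per_group, "\n")
cat("Samples per group (Bonferroni, 100 features):", n_strict, "\n")
```

FDR control is less stringent than Bonferroni, so the required sample size under FDR typically falls between these two numbers.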

8.10 Exercises

  1. Perform statistical analysis on your own MS dataset
  2. Implement different multiple testing correction methods and compare results
  3. Conduct time-series analysis for longitudinal MS data
  4. Apply machine learning classification to distinguish sample groups
  5. Develop quality control metrics based on statistical properties

8.11 Summary

This chapter covered essential statistical methods for MS data analysis: descriptive statistics, hypothesis testing with multiple testing correction, correlation analysis, PCA, clustering with cluster validation, heat maps, and power analysis. Applied after careful quality control and normalization, these methods support reproducible biological conclusions from mass spectrometry experiments.