10  Proteomics Data Analysis

Proteomics focuses on the large-scale study of proteins, including their identification, quantification, and functional analysis. This chapter covers computational methods for bottom-up proteomics data analysis using the R for Mass Spectrometry ecosystem.

10.1 Setting Up Proteomics Environment

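The R for Mass Spectrometry ecosystem provides specialized packages for proteomics analysis. The workflow in this chapter draws on Spectra for raw MS data, PSMatch for peptide-spectrum matches, and QFeatures for quantitative data, alongside limma or DEqMS for differential analysis. The example files available in this environment are listed below.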

Available proteomics files:
1 : MRM-standmix-5.mzML.gz 
2 : MS3TMT10_01022016_32917-33481.mzML.gz 
3 : MS3TMT11.mzML 
4 : TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz 
5 : TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz 

Selected file: MRM-standmix-5.mzML.gz 
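
A listing like the one above can be produced by enumerating example files and picking one. The following is a minimal sketch, assuming the files are the example data shipped with the Bioconductor msdata package (an assumption, since the original code is not shown); `list_example_files` is a hypothetical helper introduced here for illustration.

```r
# Hypothetical helper: number a vector of file paths for display.
list_example_files <- function(paths) {
  paste(seq_along(paths), ":", basename(paths))
}

# Assumption: the example files come from the Bioconductor 'msdata' package.
# The lookup is skipped when the package is not installed.
if (requireNamespace("msdata", quietly = TRUE)) {
  files <- msdata::proteomics(full.names = TRUE)
  cat("Available proteomics files:\n")
  cat(list_example_files(files), sep = "\n")

  # Select the first file for the examples that follow.
  selected <- files[1]
  cat("\nSelected file:", basename(selected), "\n")
}
```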

10.2 Understanding Proteomics Workflows

10.2.1 Bottom-up Proteomics Pipeline

The typical bottom-up proteomics workflow involves:

  1. Sample preparation: Protein extraction, digestion (usually with trypsin)
  2. LC-MS/MS analysis: Liquid chromatography coupled to tandem mass spectrometry
  3. Database searching: Matching MS/MS spectra to peptide sequences
  4. Protein inference: Assembling peptides into protein identifications
  5. Quantitative analysis: Comparing protein abundances across samples

flowchart TD
    subgraph Sample["Sample Preparation"]
        A[Protein Extraction] --> B[Reduction & Alkylation]
        B --> C[Enzymatic Digestion<br/>Trypsin]
        C --> D[Peptide Cleanup<br/>Desalting]
    end
    
    subgraph MS["LC-MS/MS Analysis"]
        D --> E[LC Separation<br/>Reverse Phase]
        E --> F[MS1 Scan<br/>Precursor Selection]
        F --> G[MS2 Fragmentation<br/>HCD/CID/ETD]
        G --> H[Raw Data<br/>mzML Files]
    end
    
    subgraph Search["Database Searching"]
        H --> I[Spectra Object<br/>R/Spectra]
        I --> J{Search Engine}
        J --> K1[Mascot]
        J --> K2[MaxQuant]
        J --> K3[MSFragger]
        K1 --> L[PSM Table<br/>PSMatch]
        K2 --> L
        K3 --> L
    end
    
    subgraph Inference["Protein Inference"]
        L --> M[Filter PSMs<br/>FDR < 1%]
        M --> N[Peptide Assembly<br/>Unique + Shared]
        N --> O[Protein Grouping<br/>Parsimony Principle]
    end
    
    subgraph Quant["Quantification"]
        O --> P{Quant Method?}
        P -->|Label-Free| Q1[XIC Integration<br/>MS1 Intensity]
        P -->|TMT/iTRAQ| Q2[Reporter Ions<br/>MS2 Intensity]
        P -->|SILAC| Q3[Heavy/Light Ratio<br/>MS1 Intensity]
        Q1 --> R[QFeatures Object]
        Q2 --> R
        Q3 --> R
    end
    
    subgraph Analysis["Statistical Analysis"]
        R --> S[PSM → Peptide<br/>Aggregation]
        S --> T[Peptide → Protein<br/>Summarization]
        T --> U[Differential Analysis<br/>limma/DEqMS]
        U --> V[Results<br/>Volcano/Heatmap]
    end
    
  style Sample fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
  style MS fill:#FBE0FA,stroke:#B000B0,stroke-width:3px,color:#102A43
  style Search fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
  style Inference fill:#FBE0FA,stroke:#B000B0,stroke-width:3px,color:#102A43
  style Quant fill:#D7E6FB,stroke:#27408B,stroke-width:3px,color:#102A43
  style Analysis fill:#FBE0FA,stroke:#B000B0,stroke-width:3px,color:#102A43

Key Proteomics Concepts
  • PSM (Peptide-Spectrum Match): One MS/MS spectrum matched to one peptide sequence
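  • FDR (False Discovery Rate): Typically controlled at 1% using a target-decoy approach
  • Protein Parsimony: The minimal set of proteins that explains the observed peptides
  • Missing Values: Can occur at the PSM, peptide, or protein level; decide explicitly whether to filter or impute them, because intensities missing due to low abundance are not missing at random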

10.2.2 Data Structures in Proteomics

Proteomics data has a hierarchical structure:
  • Spectra: Raw MS and MS/MS data
  • PSMs: Peptide-Spectrum Matches from database search
  • Peptides: Unique peptide sequences
  • Proteins: Protein groups inferred from peptides
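
The hierarchy can be illustrated with plain data frames. This is a minimal sketch in base R using a made-up PSM table (the scan identifiers and protein labels are hypothetical); it rolls PSMs up to peptides and peptides up to proteins.

```r
# Toy PSM table (made-up values): one row per spectrum-to-peptide match.
psms <- data.frame(
  spectrum_id = c("scan_12", "scan_15", "scan_31", "scan_40", "scan_52"),
  peptide     = c("LVNELTEFAK", "LVNELTEFAK", "YLYEIAR", "AEFVEVTK", "YLYEIAR"),
  protein     = c("P1", "P1", "P1", "P2", "P1"),
  stringsAsFactors = FALSE
)

# Peptide level: unique sequences with the number of supporting PSMs.
peptides <- aggregate(spectrum_id ~ peptide + protein, data = psms, FUN = length)
names(peptides)[names(peptides) == "spectrum_id"] <- "n_psms"

# Protein level: number of distinct peptides per protein.
proteins <- aggregate(peptide ~ protein, data = peptides, FUN = length)
names(proteins)[names(proteins) == "peptide"] <- "n_peptides"

print(peptides)
print(proteins)
```

Each level is a summary of the one below it, which is why filtering decisions made at the PSM level propagate to peptide and protein results.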

10.3 MS/MS Spectral Data Processing

10.3.1 Loading and Examining MS/MS Data

Note: The selected file could not be loaded in this run (attributed to mzR compatibility issues), so the remainder of the chapter uses synthetic data.
Error details: BiocParallel errors
  1 remote errors, element index: 1
  0 unevaluated and other errors
  first remote error:
Error in DataFrame(..., check.names = FALSE): different row counts implied by arguments
 

Dataset summary:
Total spectra: 200 
MS levels: 2, 1 
Scan range: 1 200 
RT range: 100 4500 seconds

MS2 spectra: 164 
Precursor m/z range: 400.4 1595.69 
Charge state distribution:

 2  3  4 
49 58 57 
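
A summary of this kind can be computed from a table of spectrum metadata. The sketch below uses base R and freshly simulated values (the column names and the simulation settings are assumptions, so the numbers will differ from the output above). With a Spectra object, the accessors `msLevel()`, `rtime()`, `precursorMz()` and `precursorCharge()` return the analogous vectors.

```r
set.seed(42)  # reproducible toy data; values are synthetic, not measurements
n <- 200
spectra <- data.frame(
  scan     = seq_len(n),
  ms_level = sample(c(1L, 2L), n, replace = TRUE, prob = c(0.2, 0.8)),
  rtime    = seq(100, 4500, length.out = n)
)

# Precursor information only exists for MS2 spectra.
is_ms2 <- spectra$ms_level == 2L
spectra$precursor_mz <- NA_real_
spectra$precursor_mz[is_ms2] <- runif(sum(is_ms2), 400, 1600)
spectra$charge <- NA_integer_
spectra$charge[is_ms2] <- sample(2:4, sum(is_ms2), replace = TRUE)

cat("Total spectra:", nrow(spectra), "\n")
cat("MS levels:", paste(unique(spectra$ms_level), collapse = ", "), "\n")
cat("RT range:", range(spectra$rtime), "seconds\n")
cat("MS2 spectra:", sum(is_ms2), "\n")
cat("Precursor m/z range:",
    round(range(spectra$precursor_mz, na.rm = TRUE), 2), "\n")
cat("Charge state distribution:\n")
print(table(spectra$charge))
```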

10.3.2 MS/MS Spectrum Quality Assessment
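
Common per-spectrum quality metrics are the number of peaks, the total ion current (TIC), and the base peak intensity. The following is a minimal base R sketch on synthetic spectra; the representation of a spectrum as a two-column matrix and the minimum of 10 peaks are assumptions chosen for illustration.

```r
# Toy MS/MS spectra (synthetic): each spectrum is a two-column matrix of
# m/z values and intensities.
set.seed(1)
make_spectrum <- function(n_peaks) {
  cbind(mz = sort(runif(n_peaks, 100, 1500)),
        intensity = rexp(n_peaks, rate = 1 / 1000))
}
spectra <- lapply(c(5, 40, 120, 12), make_spectrum)

# Per-spectrum quality metrics.
qc <- data.frame(
  n_peaks   = vapply(spectra, nrow, integer(1)),
  tic       = vapply(spectra, function(s) sum(s[, "intensity"]), numeric(1)),
  base_peak = vapply(spectra, function(s) max(s[, "intensity"]), numeric(1))
)

# Flag spectra with too few fragment peaks (the cutoff of 10 is an assumption).
qc$pass <- qc$n_peaks >= 10
print(qc)
```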

10.3.3 Spectrum Preprocessing

Processed 50 MS/MS spectra
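
Typical preprocessing steps are removing low-intensity peaks, keeping the most intense peaks, and scaling intensities to the base peak. The sketch below shows one possible implementation in base R; `preprocess_spectrum` and its default thresholds are hypothetical and are not taken from the original code.

```r
# Hypothetical helper: clean one spectrum given as a matrix with columns
# 'mz' and 'intensity'.
preprocess_spectrum <- function(peaks, min_rel_intensity = 0.01, top_n = 50) {
  # 1. Drop peaks below a fraction of the base peak intensity.
  rel <- peaks[, "intensity"] / max(peaks[, "intensity"])
  peaks <- peaks[rel >= min_rel_intensity, , drop = FALSE]

  # 2. Keep at most the top_n most intense peaks, preserving m/z order.
  if (nrow(peaks) > top_n) {
    keep <- order(peaks[, "intensity"], decreasing = TRUE)[seq_len(top_n)]
    peaks <- peaks[sort(keep), , drop = FALSE]
  }

  # 3. Scale intensities so that the base peak equals 100.
  peaks[, "intensity"] <- 100 * peaks[, "intensity"] / max(peaks[, "intensity"])
  peaks
}

# Made-up spectrum with two noise-level peaks.
raw <- cbind(mz = c(150.1, 300.2, 450.3, 600.4, 750.5),
             intensity = c(5, 1000, 20, 400, 2))
clean <- preprocess_spectrum(raw, min_rel_intensity = 0.01, top_n = 3)
print(clean)
```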

10.4 Protein Identification

10.4.1 Peptide Spectral Matching

Created database with 100 proteins and 995 peptides
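
A peptide database of this kind is usually built by in-silico digestion of protein sequences. The sketch below applies the standard trypsin rule (cleave after K or R unless the next residue is P) in base R; the two protein sequences are made up, `digest_trypsin` is a hypothetical helper, and missed cleavages are not modeled.

```r
# Hypothetical helper: fully tryptic peptides of one protein sequence.
digest_trypsin <- function(sequence, min_length = 6, max_length = 30) {
  aa <- strsplit(sequence, "")[[1]]
  next_aa <- c(aa[-1], "")
  # Cleavage sites: after K or R, unless followed by P.
  cut <- which(aa %in% c("K", "R") & next_aa != "P")
  ends <- unique(c(cut, length(aa)))
  starts <- c(1, head(ends, -1) + 1)
  peptides <- substring(sequence, starts, ends)
  peptides[nchar(peptides) >= min_length & nchar(peptides) <= max_length]
}

# Made-up protein sequences for illustration.
proteins <- c(P1 = "MSTAGKVIKCKAAVLWELKR", P2 = "GGSLPRPEDTAVYYCAK")

# Build a peptide database: one row per protein-peptide pair.
database <- do.call(rbind, lapply(names(proteins), function(id) {
  peps <- digest_trypsin(proteins[[id]], min_length = 3)
  data.frame(protein = rep(id, length(peps)), peptide = peps,
             stringsAsFactors = FALSE)
}))

cat("Created database with", length(proteins), "proteins and",
    nrow(database), "peptides\n")
print(database)
```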

10.4.2 Simulate Peptide-Spectrum Matches (PSMs)

Generated 0 PSMs
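
One simple way to generate candidate PSMs is to match each precursor's neutral mass against the peptide masses in the database within a ppm tolerance. The following base R sketch is an illustration under stated assumptions, not the code that produced the output above: the peptides and precursors are made up, the 10 ppm tolerance is arbitrary, and fragment-level scoring is omitted. Note that the measured m/z has to be converted to a neutral mass using the charge before comparison.

```r
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
residue_mass <- c(
  G = 57.02146, A = 71.03711, S = 87.03203, P = 97.05276, V = 99.06841,
  T = 101.04768, C = 103.00919, L = 113.08406, I = 113.08406, N = 114.04293,
  D = 115.02694, Q = 128.05858, K = 128.09496, E = 129.04259, M = 131.04049,
  H = 137.05891, F = 147.06841, R = 156.10111, Y = 163.06333, W = 186.07931
)
WATER  <- 18.01056
PROTON <- 1.007276

# Neutral monoisotopic mass of an unmodified peptide.
peptide_mass <- function(peptide) {
  sum(residue_mass[strsplit(peptide, "")[[1]]]) + WATER
}

peptides <- c("LVNELTEFAK", "YLYEIAR", "AEFVEVTK")
masses <- vapply(peptides, peptide_mass, numeric(1))

# Toy precursors: two derived from database peptides, one unrelated.
precursors <- data.frame(
  scan   = 1:3,
  charge = c(2L, 2L, 3L),
  mz     = c((masses[["LVNELTEFAK"]] + 2 * PROTON) / 2,
             (masses[["YLYEIAR"]] + 2 * PROTON) / 2,
             512.3456)
)
# Convert m/z to neutral mass: M = z * (m/z) - z * proton.
precursors$mass <- precursors$charge * (precursors$mz - PROTON)

# Return database peptides within tol_ppm of a neutral mass, or NULL.
match_precursor <- function(mass, tol_ppm = 10) {
  ppm <- (mass - masses) / masses * 1e6
  hit <- which(abs(ppm) <= tol_ppm)
  if (length(hit) == 0) return(NULL)
  data.frame(peptide = names(masses)[hit], ppm_error = unname(ppm[hit]),
             stringsAsFactors = FALSE)
}

psms <- do.call(rbind, lapply(seq_len(nrow(precursors)), function(i) {
  m <- match_precursor(precursors$mass[i])
  if (is.null(m)) NULL else cbind(scan = precursors$scan[i], m)
}))
print(psms)
```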

10.4.3 PSM Quality Assessment and Filtering
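
Before filtering, it helps to compare score and mass-error distributions between target and decoy matches: decoys should score lower and show no concentration of mass errors around zero. The sketch below does this in base R on simulated PSMs; the column names and distribution parameters are assumptions chosen only to make the contrast visible.

```r
set.seed(7)  # synthetic PSMs, not search engine output
psms <- data.frame(
  score     = c(rnorm(300, mean = 40, sd = 10), rnorm(100, mean = 20, sd = 6)),
  is_decoy  = rep(c(FALSE, TRUE), times = c(300, 100)),
  ppm_error = c(rnorm(300, mean = 0, sd = 2), runif(100, -10, 10))
)

# Median score per class: targets are expected to score higher than decoys.
score_summary <- aggregate(score ~ is_decoy, data = psms, FUN = median)
print(score_summary)

# Median absolute precursor mass error per class.
error_summary <- tapply(abs(psms$ppm_error), psms$is_decoy, median)
print(round(error_summary, 2))
```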

10.4.4 PSM Filtering and FDR Control
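
In the target-decoy approach, the FDR at a given score threshold is estimated as the number of decoy hits divided by the number of target hits above that threshold; the q-value of a PSM is the smallest FDR at which it would still be accepted. The following is a minimal base R sketch, assuming higher scores are better and that targets and decoys were searched together; `target_decoy_qvalues` is a hypothetical helper and the scores are made up.

```r
# Hypothetical helper: q-values from a concatenated target-decoy search.
target_decoy_qvalues <- function(score, is_decoy) {
  ord <- order(score, decreasing = TRUE)
  decoys  <- cumsum(is_decoy[ord])
  targets <- cumsum(!is_decoy[ord])
  fdr <- decoys / pmax(targets, 1)
  # q-value: minimum FDR over this threshold and all more permissive ones.
  q_sorted <- rev(cummin(rev(fdr)))
  q <- numeric(length(score))
  q[ord] <- q_sorted
  q
}

# Made-up PSM scores with three decoy hits.
psms <- data.frame(
  score    = c(95, 90, 88, 80, 75, 70, 60, 55, 50, 40),
  is_decoy = c(FALSE, FALSE, FALSE, FALSE, TRUE,
               FALSE, FALSE, TRUE, FALSE, TRUE)
)
psms$q_value <- target_decoy_qvalues(psms$score, psms$is_decoy)

# Keep target PSMs passing a 1% FDR threshold.
accepted <- psms[!psms$is_decoy & psms$q_value <= 0.01, ]
print(psms)
cat("Accepted PSMs at 1% FDR:", nrow(accepted), "\n")
```

With only ten PSMs the estimate is coarse; on real data sets the same calculation runs over thousands of matches.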

10.5 Protein Inference and Quantification

10.5.1 Protein Grouping
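
The parsimony principle asks for a minimal set of proteins that explains all observed peptides. An exact solution is a set cover problem, so a greedy approximation is often used. The sketch below is one such greedy approach in base R, on a made-up peptide-to-protein map; it is not necessarily the algorithm used by any particular tool.

```r
# Toy peptide-to-protein map (made up). Shared peptides list several proteins.
pep2prot <- list(
  pepA = c("P1"),
  pepB = c("P1", "P2"),
  pepC = c("P2", "P3"),
  pepD = c("P3"),
  pepE = c("P1", "P4")
)

# Greedy set cover: repeatedly pick the protein that explains the most
# still-unexplained peptides. Ties go to the alphabetically first protein.
parsimonious_proteins <- function(pep2prot) {
  unexplained <- names(pep2prot)
  selected <- character(0)
  while (length(unexplained) > 0) {
    counts <- table(unlist(pep2prot[unexplained]))
    best <- names(counts)[which.max(counts)]
    selected <- c(selected, best)
    explained <- vapply(pep2prot[unexplained],
                        function(p) best %in% p, logical(1))
    unexplained <- unexplained[!explained]
  }
  selected
}

print(parsimonious_proteins(pep2prot))
```

Here P2 and P4 are dropped because every peptide mapping to them is already explained by P1 or P3.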

10.5.2 Label-Free Quantification
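
In label-free quantification, peptide-level MS1 intensities are summarized to protein-level values. A minimal base R sketch follows, summing peptide intensities per protein on a made-up matrix; summation is only one of several summarization choices (median and robust summarization are common alternatives). In the QFeatures package, `aggregateFeatures()` serves this purpose.

```r
# Toy peptide intensities (made-up values): rows = peptides, cols = samples.
pep_int <- matrix(
  c(1.0e6, 1.2e6, 2.1e6, 2.0e6,
    5.0e5, NA,    9.0e5, 1.1e6,
    3.0e6, 2.8e6, 3.1e6, 2.9e6,
    4.0e5, 4.4e5, NA,    4.1e5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("pepA", "pepB", "pepC", "pepD"),
                  c("ctrl_1", "ctrl_2", "trt_1", "trt_2"))
)
protein <- c(pepA = "P1", pepB = "P1", pepC = "P2", pepD = "P2")

# Sum peptide intensities per protein, ignoring missing values.
prot_int <- rowsum(pep_int, group = protein, na.rm = TRUE)

# A sum of only-missing values is returned as 0; restore it to NA.
all_missing <- rowsum(1 * !is.na(pep_int), group = protein) == 0
prot_int[all_missing] <- NA

print(prot_int)
```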

10.5.3 Data Normalization and Preprocessing
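
A common preprocessing sequence is log2 transformation, median centering of each sample, and removal of proteins with too many missing values. The sketch below shows these steps in base R on made-up intensities; the 50% missingness cutoff is an assumption, and imputation is deliberately left out.

```r
# Toy protein intensities (made-up values): rows = proteins, cols = samples.
prot_int <- matrix(
  c(1000, 2000, 1100, 2200,
     500, 1000,   NA, 1000,
    8000, 16000, 8200, 15000,
      NA,   NA,   NA,  300,
     250,  500,  260,  520),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("P", 1:5), c("s1", "s2", "s3", "s4"))
)

# 1. Log2 transform to stabilize variance.
log_int <- log2(prot_int)

# 2. Median centering: shift each sample so all sample medians coincide.
sample_median <- apply(log_int, 2, median, na.rm = TRUE)
centered <- sweep(log_int, 2, sample_median - median(sample_median))

# 3. Drop proteins missing in more than half of the samples (assumed cutoff).
keep <- rowMeans(is.na(centered)) <= 0.5
filtered <- centered[keep, , drop = FALSE]

print(round(filtered, 2))
```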

10.6 Differential Expression Analysis

10.6.1 Statistical Testing with limma
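
limma fits a linear model per protein and moderates the variance estimates across proteins with an empirical Bayes step. The following is a minimal sketch on simulated log2 intensities, assuming a two-group design with three replicates each; the data, group names, and effect size are made up, and the limma calls run only if the package is installed.

```r
set.seed(123)  # synthetic log2 intensities, not real measurements
n_prot <- 50
group <- factor(rep(c("control", "treated"), each = 3))
expr <- matrix(rnorm(n_prot * 6, mean = 20, sd = 0.3), nrow = n_prot,
               dimnames = list(paste0("P", seq_len(n_prot)),
                               paste0(group, "_", rep(1:3, 2))))
# Spike a 2 log2-unit increase into the first five proteins.
expr[1:5, group == "treated"] <- expr[1:5, group == "treated"] + 2

# Design matrix with one column per group.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::lmFit(expr, design)
  contrast <- limma::makeContrasts(treated - control, levels = design)
  fit <- limma::eBayes(limma::contrasts.fit(fit, contrast))
  # Full results table with Benjamini-Hochberg adjusted p-values.
  results <- limma::topTable(fit, coef = 1, number = Inf, adjust.method = "BH")
  print(head(results))
}
```

The resulting table contains `logFC`, `P.Value`, and `adj.P.Val` columns, which are the usual inputs for downstream visualization.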

10.6.2 Volcano Plot
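
A volcano plot shows the log2 fold change against the negative log10 of the (adjusted) p-value, so that large and statistically supported changes appear in the upper corners. The sketch below uses base R graphics on a simulated results table; the column names follow the limma `topTable()` convention, and the thresholds (|log2 FC| of at least 1, adjusted p below 0.05) are assumptions.

```r
set.seed(99)  # synthetic results table for illustration
res <- data.frame(
  logFC     = c(rnorm(95, mean = 0, sd = 0.4), rnorm(5, mean = 2.5, sd = 0.3)),
  adj.P.Val = c(runif(95, 0.05, 1), runif(5, 1e-6, 1e-3))
)

# Classify proteins using assumed thresholds.
lfc_cut <- 1
p_cut <- 0.05
res$status <- "not significant"
res$status[res$adj.P.Val < p_cut & res$logFC >=  lfc_cut] <- "up"
res$status[res$adj.P.Val < p_cut & res$logFC <= -lfc_cut] <- "down"

cols <- c("not significant" = "grey60", up = "firebrick", down = "steelblue")
plot(res$logFC, -log10(res$adj.P.Val),
     col = cols[res$status], pch = 19,
     xlab = "log2 fold change", ylab = "-log10 adjusted p-value",
     main = "Volcano plot")
# Dashed guides at the fold-change and significance thresholds.
abline(v = c(-lfc_cut, lfc_cut), h = -log10(p_cut), lty = 2)
legend("topleft", legend = names(cols), col = cols, pch = 19, bty = "n")
```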

10.6.3 Protein Set Analysis
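
A simple form of protein set analysis is over-representation analysis: for each set, test whether it contains more significant proteins than expected by chance using the hypergeometric distribution. The following base R sketch uses made-up protein identifiers and two hypothetical sets; `ora` is a helper introduced here for illustration.

```r
# Made-up universe of quantified proteins and significant hits.
universe <- paste0("P", 1:100)
significant <- paste0("P", 1:10)

# Hypothetical protein sets (names and members are illustrative only).
protein_sets <- list(
  set_A = paste0("P", c(1:6, 50:53)),   # 6 of 10 members are significant
  set_B = paste0("P", c(10, 60:78))     # 1 of 20 members is significant
)

# One-sided hypergeometric test for enrichment of a set among the hits.
ora <- function(set, significant, universe) {
  set <- intersect(set, universe)
  k <- length(intersect(set, significant))  # significant proteins in the set
  m <- length(set)                          # set size
  n <- length(universe) - m                 # proteins outside the set
  d <- length(significant)                  # number of significant proteins
  p <- phyper(k - 1, m, n, d, lower.tail = FALSE)  # P(X >= k)
  c(overlap = k, set_size = m, p_value = p)
}

res <- as.data.frame(t(vapply(protein_sets, ora, numeric(3),
                              significant = significant,
                              universe = universe)))
res$p_adj <- p.adjust(res$p_value, method = "BH")
print(res)
```

Restricting the universe to proteins that were actually quantified, rather than the whole proteome, avoids inflating the enrichment.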

10.7 Data Visualization and Reporting

10.7.1 Heat Map of Significant Proteins
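
Heat maps of significant proteins are usually drawn on row-wise z-scores, so that color reflects each protein's pattern across samples rather than its absolute abundance. The sketch below uses the base R `heatmap()` function on simulated log2 intensities; the data, group labels, and color choices are assumptions made for illustration.

```r
set.seed(2024)  # synthetic log2 intensities for eight "significant" proteins
groups <- rep(c("control", "treated"), each = 3)
mat <- matrix(rnorm(8 * 6, mean = 20, sd = 0.3), nrow = 8,
              dimnames = list(paste0("P", 1:8), paste0(groups, "_", 1:3)))
mat[1:4, groups == "treated"] <- mat[1:4, groups == "treated"] + 2  # up
mat[5:8, groups == "treated"] <- mat[5:8, groups == "treated"] - 2  # down

# Row-wise z-scores: center and scale each protein across samples.
z <- t(scale(t(mat)))

heatmap(z,
        scale = "none",  # already z-scored above
        col = colorRampPalette(c("steelblue", "white", "firebrick"))(50),
        ColSideColors = ifelse(groups == "treated", "firebrick", "steelblue"),
        margins = c(8, 6))
```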

10.8 Exercises

  1. Analyze real proteomics data from a public repository
  2. Implement different protein inference algorithms
  3. Compare various normalization methods for label-free quantification
  4. Perform time-course proteomics analysis
  5. Integrate proteomics with other omics data types

10.9 Summary

This chapter covered comprehensive proteomics data analysis workflows, including MS/MS data processing, protein identification, quantification, and differential expression analysis. These methods are essential for extracting biological insights from bottom-up proteomics experiments.