This chapter covers advanced techniques and specialized applications in mass spectrometry data analysis using R, including backend management, computational considerations, and specialized workflows inspired by the R for Mass Spectrometry ecosystem.
The Spectra package uses different backends to store and access MS data efficiently. Understanding these backends is crucial for handling large-scale datasets.
Note: Using synthetic data due to mzR compatibility issues
Backend class: MsBackendDataFrame
Total spectra: 100
Data access type: In-memory
Backend information:
Backend class: MsBackendDataFrame
Data origin: synthetic_data.mzML
Retrieved peaks for 5 spectra
First spectrum has 80 peaks
Current backend class: MsBackendDataFrame
Data is in memory for fast access
   user  system elapsed
   0.02    0.00    0.03
In-memory backend provides fast repeated access
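The backend report and timing above can be reproduced along the following lines. This is a minimal sketch, assuming the Bioconductor packages Spectra and S4Vectors are installed; the three toy spectra and all variable names are illustrative and are not the chapter's 100-spectrum dataset.

```r
# Requires the Bioconductor packages Spectra and S4Vectors (assumed installed).
library(Spectra)
library(S4Vectors)

# Toy spectra: one row per spectrum, peaks stored as list columns.
spd <- DataFrame(msLevel = c(1L, 2L, 2L), rtime = c(100.0, 100.5, 101.0))
spd$mz <- list(c(100.1, 150.2, 200.3), c(80.0, 120.5), c(90.2, 95.1, 130.7, 180.4))
spd$intensity <- list(c(10, 200, 30), c(5, 50), c(12, 8, 90, 40))

# Build a Spectra object on the in-memory data frame backend.
sps <- Spectra(spd, backend = MsBackendDataFrame())

cat("Backend class:", class(sps@backend), "\n")
cat("Total spectra:", length(sps), "\n")

# peaksData() returns one m/z-intensity matrix per spectrum.
pks <- peaksData(sps)
cat("First spectrum has", nrow(pks[[1]]), "peaks\n")

# Time repeated peak access; an in-memory backend needs no file I/O.
timing <- system.time(for (i in 1:10) peaksData(sps))
print(timing)

# Switch to another in-memory backend without changing the data.
sps_mem <- setBackend(sps, MsBackendMemory())
```

Switching backends with `setBackend()` leaves the spectra unchanged, which makes it possible to prototype in memory and move to an on-disk backend once the data no longer fit.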
HDF5 backend benefits:
- Efficient storage for large datasets
- Fast partial data loading
- Cross-platform compatibility
- Reduced memory footprint
Note: MsBackendHdf5Peaks is provided by the Spectra package itself; it additionally requires rhdf5 (install with BiocManager::install('rhdf5'))
Available cores: 28
   user  system elapsed
   0.25    0.00    0.27
   user  system elapsed
   0.24    0.00    0.25
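The two timings above are nearly identical (0.27 s and 0.25 s elapsed): on a dataset this small, the cost of starting and feeding workers likely offsets most of the gain. Parallel processing can significantly speed up large-scale operations, where the work per task outweighs that overhead.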
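A sequential-versus-parallel comparison of this kind can be sketched with the base `parallel` package. The synthetic intensity vectors and the `summarise_spectrum()` helper are hypothetical stand-ins, and the two-worker socket cluster is an assumption chosen for portability; the chapter's own code may have used a different mechanism.

```r
library(parallel)  # ships with base R

set.seed(42)
# Synthetic spectra: a list of intensity vectors (illustrative stand-in for real peaks).
spectra <- replicate(200, abs(rnorm(500, mean = 100, sd = 30)), simplify = FALSE)

# Hypothetical per-spectrum task: total ion current and base peak intensity.
summarise_spectrum <- function(x) c(tic = sum(x), base_peak = max(x))

cat("Available cores:", detectCores(), "\n")

# Sequential baseline.
t_seq <- system.time(res_seq <- lapply(spectra, summarise_spectrum))

# Parallel version on a small socket cluster (works on all platforms).
cl <- makeCluster(2)
t_par <- system.time(res_par <- parLapply(cl, spectra, summarise_spectrum))
stopCluster(cl)

print(t_seq)
print(t_par)

# Both approaches must give the same answer.
stopifnot(identical(res_seq, res_par))
```

For a task this cheap the parallel run is often no faster than the sequential one, because sending data to the workers dominates the runtime.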
  batch n_spectra        rt_range ms_levels mean_peaks
1     1        20       100-810.1       2,1         97
2     2        20  847.47-1557.58       1,2        104
3     3        20 1594.95-2305.05       2,1        103
4     4        20 2342.42-3052.53       2,1        106
5     5        20       3089.9-3800       2,1        106
Batch processing helps manage memory for large datasets
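A batch summary like the table above can be sketched in base R. The metadata below are synthetic and chosen only to resemble that output (100 spectra, retention times from 100 to 3800 s, batches of 20); `summarise_batch()` is a hypothetical helper.

```r
set.seed(1)
# Synthetic per-spectrum metadata (stand-in for values pulled from a Spectra object).
n <- 100
meta <- data.frame(
  rtime   = seq(100, 3800, length.out = n),
  msLevel = sample(1:2, n, replace = TRUE),
  n_peaks = sample(50:150, n, replace = TRUE)
)

batch_size <- 20
# Assign consecutive spectra to batches 1, 2, ...
batch_id <- ceiling(seq_len(n) / batch_size)

# Summarise one batch at a time so only that batch's rows are worked on.
summarise_batch <- function(idx) {
  b <- meta[idx, ]
  data.frame(
    batch      = batch_id[idx[1]],
    n_spectra  = nrow(b),
    rt_range   = paste(round(range(b$rtime), 2), collapse = "-"),
    ms_levels  = paste(unique(b$msLevel), collapse = ","),
    mean_peaks = round(mean(b$n_peaks))
  )
}

batch_summary <- do.call(rbind, lapply(split(seq_len(n), batch_id), summarise_batch))
print(batch_summary)
```

With an on-disk backend the same loop would load peak data for one batch only, which is what keeps peak memory use bounded.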
Applied custom processing pipeline to 10 spectra
Processing steps include: smoothing, peak picking, and normalization
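The three steps named above can be illustrated with a small base R pipeline. This is a sketch under stated assumptions: a 5-point moving average for smoothing, strict local maxima for peak picking, and scaling to a base peak of 100 for normalization. The chapter's actual pipeline may use different methods, and every function name here is hypothetical.

```r
# Moving-average smoothing; edge points without a full window become NA and are set to 0.
smooth_intensity <- function(int, k = 5) {
  sm <- as.numeric(stats::filter(int, rep(1 / k, k), sides = 2))
  sm[is.na(sm)] <- 0
  sm
}

# Peak picking: indices of strict local maxima above a minimum intensity.
pick_peaks <- function(int, min_int = 0) {
  i <- 2:(length(int) - 1)
  i[int[i] > int[i - 1] & int[i] > int[i + 1] & int[i] > min_int]
}

# Normalization: scale so the most intense peak equals 100.
normalise <- function(int) 100 * (int / max(int))

process_spectrum <- function(mz, int) {
  sm  <- smooth_intensity(int)
  idx <- pick_peaks(sm, min_int = 0.05 * max(sm))
  data.frame(mz = mz[idx], intensity = normalise(sm[idx]))
}

# Synthetic profile spectrum: two Gaussian peaks plus uniform noise.
set.seed(7)
mz  <- seq(100, 110, by = 0.01)
int <- 1000 * dnorm(mz, 103, 0.05) + 400 * dnorm(mz, 107, 0.05) +
  runif(length(mz), 0, 20)

peaks <- process_spectrum(mz, int)
print(peaks)
```

The 5% intensity threshold is what removes the small local maxima that noise leaves behind after smoothing.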
Network statistics:
Nodes: 20
Edges: 0
Connected components: 20
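With 0 edges, each of the 20 nodes forms its own component, meaning no pair of spectra passed the similarity cut-off. The sketch below shows how such statistics can be derived in base R from pairwise cosine similarity. The binned toy spectra, the 0.7 threshold, and the flood-fill component search are assumptions for illustration; a graph package such as igraph would normally do this work.

```r
# Cosine similarity between two intensity vectors binned on a common m/z grid.
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

set.seed(3)
pattern <- runif(10, 10, 100)
binned <- matrix(0, nrow = 6, ncol = 40)
# Spectra 1-3 share one peak pattern with up to 10% intensity variation.
for (i in 1:3) binned[i, 1:10] <- pattern * runif(10, 0.9, 1.1)
# Spectra 4-6 have peaks in bins that no other spectrum uses.
binned[4, 11:20] <- runif(10, 10, 100)
binned[5, 21:30] <- runif(10, 10, 100)
binned[6, 31:40] <- runif(10, 10, 100)

n <- nrow(binned)
sim <- matrix(0, n, n)
for (i in seq_len(n)) for (j in seq_len(n)) {
  sim[i, j] <- cosine_sim(binned[i, ], binned[j, ])
}

threshold <- 0.7  # assumed similarity cut-off
adj <- sim >= threshold
diag(adj) <- FALSE
n_edges <- sum(adj[upper.tri(adj)])

# Connected components: flood outward from each node not yet assigned.
component <- rep(NA_integer_, n)
current <- 0L
for (start in seq_len(n)) {
  if (!is.na(component[start])) next
  current <- current + 1L
  queue <- start
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    if (!is.na(component[node])) next
    component[node] <- current
    queue <- c(queue, which(adj[node, ] & is.na(component)))
  }
}

cat("Nodes:", n, "\nEdges:", n_edges, "\nConnected components:", current, "\n")
```

Lowering the threshold adds edges and merges components, so an all-singleton network is often a sign that the cut-off is too strict for the data at hand.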

Strategies for integrating external resources:
1. GNPS (Global Natural Products Social Molecular Networking):
   - Use the GNPS REST API for spectral library matching
   - Export data in GNPS-compatible formats
2. MassBank:
   - Access curated reference spectra
   - Use RMassBank for compound identification
3. ChemSpider/PubChem:
   - Retrieve compound information
   - Use the webchem package for programmatic access
4. MetaboLights/PRIDE:
   - Access public datasets
   - Use appropriate R packages for data retrieval (e.g. rpx for ProteomeXchange/PRIDE)
Exported spectra in multiple formats:
- Metadata: CSV format
- Spectral data: mzML/MGF (commented out)
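The export step can be sketched in base R as follows. The toy metadata, the peak lists, and the `to_mgf()` helper are hypothetical; the MGF writer covers only the basic BEGIN IONS / END IONS layout. For real data a dedicated tool (for instance the MsBackendMgf package) would be the more robust choice.

```r
# Hypothetical toy data: per-spectrum metadata plus a list of peak matrices.
meta <- data.frame(
  spectrum_id  = c("spec_1", "spec_2"),
  rtime        = c(100.0, 137.4),
  precursor_mz = c(445.12, 512.30),
  charge       = c(1L, 2L)
)
peaks <- list(
  cbind(mz = c(100.1, 150.2, 200.3), intensity = c(10, 200, 30)),
  cbind(mz = c(80.0, 120.5), intensity = c(5, 50))
)

# Metadata as CSV.
csv_file <- tempfile(fileext = ".csv")
write.csv(meta, csv_file, row.names = FALSE)

# Spectral data as MGF: one BEGIN IONS / END IONS block per spectrum.
to_mgf <- function(meta_row, pk) {
  c("BEGIN IONS",
    paste0("TITLE=", meta_row$spectrum_id),
    paste0("RTINSECONDS=", meta_row$rtime),
    paste0("PEPMASS=", meta_row$precursor_mz),
    paste0("CHARGE=", meta_row$charge, "+"),
    paste(pk[, "mz"], pk[, "intensity"]),
    "END IONS",
    "")
}
mgf_lines <- unlist(lapply(seq_len(nrow(meta)),
                           function(i) to_mgf(meta[i, ], peaks[[i]])))
mgf_file <- tempfile(fileext = ".mgf")
writeLines(mgf_lines, mgf_file)

cat("Wrote", csv_file, "and", mgf_file, "\n")
```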
Created synthetic dataset:
Samples: 50
Features: 100
Classes: Disease, Healthy
Preprocessed features: log2 transformation and normalization
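The output does not say which normalization was applied, so the sketch below assumes per-sample median normalization after a log2 transformation with a +1 offset. The intensity matrix is synthetic, sized to match the 50 samples and 100 features reported above.

```r
set.seed(10)
# Synthetic intensity matrix: 50 samples (rows) x 100 features (columns).
x <- matrix(rlnorm(50 * 100, meanlog = 10, sdlog = 1), nrow = 50,
            dimnames = list(paste0("S", 1:50), paste0("F", 1:100)))

# log2 transformation; the +1 offset (an assumption) guards against log2(0).
x_log <- log2(x + 1)

# Median normalization (assumed method): shift every sample to a common median.
sample_medians <- apply(x_log, 1, median)
x_norm <- sweep(x_log, 1, sample_medians - median(sample_medians))

# After normalization all sample medians coincide.
print(range(apply(x_norm, 1, median)))
```

Median normalization corrects for global differences in signal between samples, such as unequal loading, while leaving the relative feature pattern within each sample intact.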

No significant features found. Using top 50 features by p-value.
Selected 50 features for ML
Training set: 35 samples
Test set: 15 samples
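The fallback and the 35/15 split reported above can be sketched like this. The Welch t-test, the Benjamini-Hochberg adjustment, the 0.05 cut-off, and the simple random 70/30 split are assumptions; the data are pure noise, so in most runs no feature passes the adjusted threshold and the fallback branch is taken.

```r
set.seed(123)
n_samples <- 50
n_features <- 100
# Synthetic feature matrix with no real class difference.
x <- matrix(rnorm(n_samples * n_features), nrow = n_samples,
            dimnames = list(NULL, paste0("F", seq_len(n_features))))
group <- factor(rep(c("Disease", "Healthy"), each = 25))

# Per-feature Welch t-test, then Benjamini-Hochberg adjustment.
p_values <- apply(x, 2, function(f) t.test(f ~ group)$p.value)
p_adj <- p.adjust(p_values, method = "BH")

selected <- names(p_adj)[p_adj < 0.05]
if (length(selected) == 0) {
  # Fallback, as in the output above: keep the 50 smallest raw p-values.
  cat("No significant features found. Using top 50 features by p-value.\n")
  selected <- names(sort(p_values))[1:50]
}

# Simple random 70/30 split (35 training and 15 test samples).
train_idx <- sample(n_samples, size = round(0.7 * n_samples))
test_idx  <- setdiff(seq_len(n_samples), train_idx)
cat("Training set:", length(train_idx), "samples\n")
cat("Test set:", length(test_idx), "samples\n")
```

Selecting features on all samples before splitting lets information from the test set leak into the model; running the selection inside the training set only gives a more honest performance estimate.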
Trained Random Forest and SVM models
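Random Forest Performance:
   Accuracy Sensitivity Specificity   Precision
  0.6000000   0.3333333   1.0000000   1.0000000
AUC: 0.222
SVM Performance:
   Accuracy Sensitivity Specificity   Precision
  0.7333333   0.5555556   1.0000000   1.0000000
AUC: 0.204
Note: both AUC values fall well below 0.5 although every positive call is correct (precision 1.0). This pattern usually indicates that the ROC curve was computed with the positive class or the score direction reversed. If so, the corrected values would be about 0.778 and 0.796; the class orientation should be verified before these figures are interpreted.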
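The reported metrics follow directly from the confusion matrix, and AUC can be computed from ranks. The sketch below uses hypothetical scores for a 15-sample test set, and both helper functions are illustrative. It also shows how passing the score for the wrong class turns an AUC into its complement.

```r
# Confusion-matrix metrics; `positive` names the class treated as positive.
classification_metrics <- function(truth, predicted, positive) {
  tp <- sum(predicted == positive & truth == positive)
  tn <- sum(predicted != positive & truth != positive)
  fp <- sum(predicted == positive & truth != positive)
  fn <- sum(predicted != positive & truth == positive)
  c(Accuracy    = (tp + tn) / length(truth),
    Sensitivity = tp / (tp + fn),
    Specificity = tn / (tn + fp),
    Precision   = tp / (tp + fp))
}

# AUC via the rank-sum (Mann-Whitney) identity. `score` must be the predicted
# probability of the positive class; otherwise the result is 1 - AUC.
auc <- function(truth, score, positive) {
  pos <- truth == positive
  r <- rank(score)
  (sum(r[pos]) - sum(pos) * (sum(pos) + 1) / 2) / (sum(pos) * sum(!pos))
}

# Hypothetical test set: 9 Disease and 6 Healthy samples with made-up scores.
truth <- c(rep("Disease", 9), rep("Healthy", 6))
score <- c(0.90, 0.80, 0.70, 0.45, 0.40, 0.35, 0.30, 0.60, 0.55,
           0.20, 0.10, 0.25, 0.15, 0.05, 0.42)
predicted <- ifelse(score > 0.5, "Disease", "Healthy")

metrics <- classification_metrics(truth, predicted, positive = "Disease")
auc_ok  <- auc(truth, score, positive = "Disease")
# Using the score of the other class flips the ranking and the AUC.
auc_rev <- auc(truth, 1 - score, positive = "Disease")

print(round(metrics, 3))
cat("AUC:", round(auc_ok, 3), " reversed:", round(auc_rev, 3), "\n")
```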


Created IMS dataset with 50 scans

Detected 1653 peaks in example scan


Call:
coxph(formula = surv_object ~ protein_A + protein_B + protein_C +
    age + gender + stage, data = survival_data)

  n= 200, number of events= 195

                coef  exp(coef)   se(coef)      z Pr(>|z|)
protein_A -0.0381232  0.9625943  0.0065579 -5.813 6.12e-09 ***
protein_B -0.0009910  0.9990095  0.0009303 -1.065    0.287
protein_C -0.0038282  0.9961791  0.0026701 -1.434    0.152
age        0.0011187  1.0011194  0.0073091  0.153    0.878
genderM    0.0755537  1.0784811  0.1492284  0.506    0.613
stageII    0.1418220  1.1523715  0.2093917  0.677    0.498
stageIII  -0.0572136  0.9443923  0.2076276 -0.276    0.783
stageIV    0.4426659  1.5568521  0.2545010  1.739    0.082 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          exp(coef) exp(-coef) lower .95 upper .95
protein_A    0.9626     1.0389    0.9503     0.975
protein_B    0.9990     1.0010    0.9972     1.001
protein_C    0.9962     1.0038    0.9910     1.001
age          1.0011     0.9989    0.9869     1.016
genderM      1.0785     0.9272    0.8050     1.445
stageII      1.1524     0.8678    0.7645     1.737
stageIII     0.9444     1.0589    0.6287     1.419
stageIV      1.5569     0.6423    0.9454     2.564

Concordance= 0.606  (se = 0.024 )
Likelihood ratio test= 52.71  on 8 df,   p=1e-08
Wald test            = 39.87  on 8 df,   p=3e-06
Score (logrank) test = 40.17  on 8 df,   p=3e-06
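In the model above, protein_A is the only predictor significant at the 0.05 level (hazard ratio 0.963 per unit, 95% CI 0.950 to 0.975), so each unit increase is associated with roughly a 3.7% lower hazard. The sketch below fits a comparable model with the survival package. The simulated cohort, the reduced covariate set, and the built-in effect size of -0.04 are assumptions chosen for illustration.

```r
library(survival)  # recommended package shipped with R

set.seed(2024)
n <- 200
# Synthetic cohort; variable names mirror the model above, values are made up.
survival_data <- data.frame(
  protein_A = rnorm(n, mean = 50, sd = 10),
  age       = round(rnorm(n, mean = 60, sd = 10)),
  stage     = factor(sample(c("I", "II", "III", "IV"), n, replace = TRUE))
)

# Simulate event times whose hazard falls as protein_A rises (true log-HR = -0.04).
event_time  <- rexp(n, rate = 0.01 * exp(-0.04 * (survival_data$protein_A - 50)))
censor_time <- rexp(n, rate = 0.001)
survival_data$time   <- pmin(event_time, censor_time)
survival_data$status <- as.integer(event_time <= censor_time)

fit <- coxph(Surv(time, status) ~ protein_A + age + stage, data = survival_data)

# Hazard ratios with 95% confidence intervals: exp(coef) and exp(confint).
hr_table <- cbind(HR = exp(coef(fit)), exp(confint(fit)))
print(round(hr_table, 3))
```

A hazard ratio below 1 with a confidence interval that excludes 1 marks a protective association; an interval that spans 1, as for stage IV above, does not support a conclusion in either direction.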

[1] "Precision Assessment:"
# A tibble: 3 × 2
  concentration repeatability_cv
          <dbl>            <dbl>
1             1             3.53
2            10             3.36
3            50             2.81
[1] "Accuracy Assessment:"
# A tibble: 3 × 3
  spiked_concentration mean_recovery sd_recovery
                 <dbl>         <dbl>       <dbl>
1                  0.5          98.8       10.0
2                  5           102.         9.55
3                 50            99.3       10.5
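Both assessments reduce to simple formulas: repeatability CV (%) = 100 × sd / mean of replicate measurements at one level, and recovery (%) = 100 × measured / spiked concentration. The base R sketch below applies them to hypothetical replicate data (5 replicates per level, simulated noise); the tibble output above suggests the original used dplyr instead.

```r
set.seed(5)
# Hypothetical replicate measurements: 5 replicates at each of 3 levels.
precision_data <- data.frame(
  concentration = rep(c(1, 10, 50), each = 5),
  measured = c(rnorm(5, 1, 0.03), rnorm(5, 10, 0.3), rnorm(5, 50, 1.5))
)

# Repeatability CV (%) = 100 * sd / mean within each concentration level.
cv <- aggregate(measured ~ concentration, data = precision_data,
                FUN = function(x) 100 * sd(x) / mean(x))
names(cv)[2] <- "repeatability_cv"

# Hypothetical spike-recovery data at three spiked levels.
accuracy_data <- data.frame(
  spiked_concentration = rep(c(0.5, 5, 50), each = 5),
  measured = c(rnorm(5, 0.5, 0.05), rnorm(5, 5, 0.5), rnorm(5, 50, 5))
)

# Recovery (%) = 100 * measured / spiked.
accuracy_data$recovery <-
  100 * accuracy_data$measured / accuracy_data$spiked_concentration

recovery <- data.frame(
  spiked_concentration = sort(unique(accuracy_data$spiked_concentration)),
  mean_recovery = as.numeric(tapply(accuracy_data$recovery,
                                    accuracy_data$spiked_concentration, mean)),
  sd_recovery   = as.numeric(tapply(accuracy_data$recovery,
                                    accuracy_data$spiked_concentration, sd))
)

print(cv)
print(recovery)
```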

This chapter covered advanced topics in mass spectrometry data analysis: Spectra backend management, parallel and batch processing, custom processing pipelines, spectral network analysis, integration with external resources, machine learning, ion mobility spectrometry, survival analysis, and analytical method validation. Together these techniques help scale analyses to large, complex MS datasets and support method development and validation.