2 Accessing Open MS Data with MsDataHub

Reproducibility and access to public datasets are cornerstones of modern computational biology. The Bioconductor project facilitates this through the ExperimentHub infrastructure, which provides a centralized way to access curated data from various experiments. For the mass spectrometry community, the MsDataHub package serves as a dedicated portal to a wide range of proteomics and metabolomics datasets.

This chapter introduces MsDataHub and demonstrates how to use it to find, download, and load example MS data for analysis within the R for Mass Spectrometry ecosystem.

2.1 What is MsDataHub?

The MsDataHub package provides a collection of mass spectrometry datasets, including:

- Raw MS data (mzML, CDF)
- Peptide-spectrum matching (PSM) results (mzid)
- Quantitative proteomics and metabolomics data tables
- Contaminant FASTA databases (cRAP)

Data is downloaded and cached locally on your machine, ensuring that you only need to download each file once.

2.2 Installation

MsDataHub is a Bioconductor package. To install it, use BiocManager:
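
A minimal sketch of the usual Bioconductor installation pattern follows. It assumes a working internet connection and an R version supported by the current Bioconductor release; the first step only installs BiocManager if it is not already present.

```r
# Install BiocManager from CRAN if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install MsDataHub (and its dependencies) from Bioconductor
BiocManager::install("MsDataHub")
```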

2.3 Exploring Available Datasets

To see a complete list of available datasets, you can call the MsDataHub() function. This returns a data frame with metadata for each resource.
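
The call could look like the sketch below. The variable name `mdh` is only illustrative, and the filtering step assumes that resource names are stored in a column called `Title`, which is worth confirming with `colnames()` on your installed version.

```r
library(MsDataHub)

# Retrieve the metadata table describing all available resources
mdh <- MsDataHub()

# Inspect its size and the available metadata columns
dim(mdh)
colnames(mdh)

# Assumption: resource names live in a `Title` column.
# If so, list the resources whose name mentions mzML.
if ("Title" %in% colnames(mdh)) {
    mzml_resources <- mdh[grepl("mzML", mdh$Title, fixed = TRUE), ]
    print(mzml_resources$Title)
}
```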

We can then display this as an interactive table using DT::datatable.
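
A short sketch of that step, assuming the DT package is installed (it is a CRAN package and is not installed together with MsDataHub):

```r
library(MsDataHub)

# Render the metadata table as a searchable, sortable HTML widget
tbl <- DT::datatable(MsDataHub())
tbl
```

In an interactive session or a rendered document, printing `tbl` shows the widget; in a plain script it is only created.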

The table includes details such as the resource title, data type, species, and the function required to access the data.

2.4 Accessing Data Examples

MsDataHub creates an accessor function for each dataset, named after the file it retrieves. For example, the file PestMix1_DDA.mzML is accessed by calling PestMix1_DDA.mzML(). Let’s explore a few examples.

2.4.1 Example 1: Raw MS Data

Here, we load a raw DDA (Data-Dependent Acquisition) file from a TripleTOF instrument. The accessor function returns a file path, which we can then load into a Spectra object.
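
The loading step might be written as in the sketch below. It assumes the Spectra package is installed, along with mzR, which the default MsBackendMzR backend needs in order to read mzML files. The variable names `f` and `sp` are illustrative.

```r
library(MsDataHub)
library(Spectra)

# The accessor downloads the file on first use (cached afterwards)
# and returns the path to the local copy
f <- PestMix1_DDA.mzML()

# Build a Spectra object; for file paths, Spectra() uses the
# MsBackendMzR backend by default
sp <- Spectra(f)
sp
```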

MSn data (Spectra) with 7602 spectra in a MsBackendMzR backend:
       msLevel     rtime scanIndex
     <integer> <numeric> <integer>
1            1     0.231         1
2            1     0.351         2
3            1     0.471         3
4            1     0.591         4
5            1     0.711         5
...        ...       ...       ...
7598         1   899.491      7598
7599         1   899.613      7599
7600         1   899.747      7600
7601         1   899.872      7601
7602         1   899.993      7602
 ... 34 more variables/columns.

file(s):
16e443a06a7_7861

2.4.2 Example 2: Peptide-Spectrum Matches (PSM)

This example retrieves peptide-spectrum matching results for a dataset from the PRIDE repository (accession PXD000001). The .mzid file can be loaded with the PSM() function from the PSMatch package.
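
A sketch of this step is given below. The accessor name used here is an assumption about how the PXD000001 identification file is titled in MsDataHub; check it against the table returned by MsDataHub() and adjust it if your version lists the file under a different name.

```r
library(MsDataHub)
library(PSMatch)

# Assumed accessor name for the PXD000001 .mzid file; verify it in the
# MsDataHub() metadata table before relying on it
mzid_file <- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.20141210.mzid()

# Parse the identification results into a PSM object
psm <- PSM(mzid_file)
psm
```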

PSM with 5802 rows and 35 columns.
names(35): sequence spectrumID ... subReplacementResidue subLocation

2.4.3 Example 3: Quantitative Proteomics Data

MsDataHub also provides processed quantitative data. Here, we access a CPTAC (Clinical Proteomic Tumor Analysis Consortium) dataset. The accessor returns the path to a tab-delimited text file, which can be read with the QFeatures package, here into a SummarizedExperiment object.
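
One way to do this is sketched below with readSummarizedExperiment() from QFeatures. The accessor name cptac_peptides.txt() is an assumption and should be checked against the MsDataHub() listing. Selecting the quantitative columns by the prefix "Intensity." is a choice based on the column names seen in this file, and the `quantCols` argument assumes a recent QFeatures release (older versions called it `ecol`).

```r
library(MsDataHub)
library(QFeatures)

# Assumed accessor name for the CPTAC peptide-level table
f <- cptac_peptides.txt()

# Read the tab-delimited file into a data.frame
tab <- read.delim(f)

# Quantitative columns: those whose names start with "Intensity."
quant_cols <- grep("^Intensity\\.", colnames(tab))

# Quantitative columns become the assay; all others become rowData
se <- readSummarizedExperiment(tab, quantCols = quant_cols)
se
```

The same table could instead be passed to readQFeatures() when a QFeatures object is preferred for downstream aggregation.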

class: SummarizedExperiment 
dim: 11466 45 
metadata(0):
assays(1): ''
rownames(11466): 1 2 ... 11465 11466
rowData names(143): Sequence N.term.cleavage.window ...
  Oxidation..M..site.IDs MS.MS.Count
colnames(45): Intensity.6A_1 Intensity.6A_2 ... Intensity.6E_8
  Intensity.6E_9
colData names(0):

2.5 Contributing to MsDataHub

MsDataHub is an open-source project, and contributions are welcome. If you have a dataset that you believe would be a valuable addition, you can open an issue on the MsDataHub GitHub repository to start the process.

By providing a simple, unified interface to a diverse range of MS data, MsDataHub lowers the barrier to entry for researchers who want to learn new analysis techniques or benchmark their methods on established datasets.