5  Spectral Data Preprocessing

5.1 Introduction: The Stochastic and Deterministic Nature of Spectral Interference

The acquisition of spectral data, whether through vibrational spectroscopy (Near-Infrared, Fourier Transform Infrared, Raman) or mass spectrometry, is fundamentally an exercise in signal estimation amidst a complex background of physical interference. The recorded spectrum S(\lambda) is rarely a direct representation of the chemical analyte of interest A(\lambda). Instead, it is a superposition of the chemical signal, deterministic physical artifacts (such as scattering or fluorescence), and stochastic noise components arising from instrumental limitations. Consequently, direct multivariate regression or classification on raw spectral data is a mathematically ill-posed problem without rigorous preprocessing intervention.

The imperative for preprocessing arises from the core assumption of linear chemometric models like Principal Component Analysis (PCA) and Partial Least Squares (PLS): that the variance in the data matrix is linearly correlated with the property of interest (e.g., concentration). Physical artifacts introduce non-linearities and variance components that are orthogonal or, worse, collinear with the chemical signal, thereby degrading model robustness. For instance, in Near-Infrared (NIR) spectroscopy, variations in sample particle size can induce multiplicative scattering effects that scale the entire spectrum, effectively masquerading as concentration changes. Similarly, in Raman spectroscopy, fluorescence backgrounds can be orders of magnitude more intense than the inelastic scattering signal, obscuring the vibrational fingerprint.

This report provides an exhaustive analysis of the state-of-the-art in spectral preprocessing. It moves beyond superficial descriptions to explore the mathematical mechanics of smoothing algorithms, the iterative logic of baseline correction, and the physics-based models for scatter correction. Furthermore, it synthesizes these techniques into domain-specific pipelines, examining their application in high-stakes scenarios such as forensic ink analysis, pharmaceutical quality control, and metabolomic biomarker discovery.

5.2 Theoretical Mechanics of Signal Smoothing and Denoising

The first line of defense in spectral processing is the attenuation of high-frequency noise. This noise, often characterized as white noise or 1/f flicker noise, originates from detector electronics, thermal fluctuations, and photon statistics (shot noise). The challenge in smoothing is the preservation of spectral fidelity; aggressive noise reduction invariably risks distorting peak shapes, reducing heights, and broadening widths, which directly compromises quantitative accuracy.

5.2.1 The Savitzky-Golay Filter: Polynomial Convolution

The Savitzky-Golay (SG) filter stands as the preeminent algorithm for spectral smoothing, favored for its ability to preserve the higher moments of spectral peaks (such as width and height) better than simple moving average filters. Unlike a boxcar filter that effectively fits a zero-order polynomial (a flat line) to a data window, the SG filter fits a polynomial of order o to a window of w points via the method of linear least squares.

Mathematical Derivation and Gram Polynomials

The fundamental operation of the SG filter is convolution. For a spectral point x_i, the smoothed value x_i^* is calculated as the weighted sum of the raw values within the window defined by 2m+1 points (where w = 2m+1):

x_i^* = \sum_{j=-m}^{m} c_j x_{i+j}

Here, c_j are the convolution coefficients. These coefficients are not arbitrary; they are derived from the least-squares fit of a polynomial. Historically, calculating these coefficients required solving the normal equations for every window position. However, utilizing the properties of orthogonal Gram polynomials allows for the recursive calculation of these coefficients, significantly reducing computational overhead and allowing for the derivation of filters of any arbitrary order and length.

The polynomial fit is constrained by the parameters w and o. The window must contain at least as many points as the polynomial has coefficients (w \ge o + 1), and w is chosen odd so the window is centered on the point being smoothed. If w = o + 1, the polynomial passes exactly through every point in the window, resulting in an identity transform with zero smoothing; genuine smoothing therefore requires w > o + 1, and the effect intensifies as w grows relative to o.
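These constraints can be checked concretely with scipy's savgol_coeffs (from the scipy.signal module discussed later in the software section), which exposes the convolution coefficients c_j directly; the degenerate case w = o + 1 visibly collapses to the identity filter. A minimal sketch, assuming numpy and scipy are available:

```python
import numpy as np
from scipy.signal import savgol_coeffs

# Convolution coefficients c_j for a quadratic fit over a 7-point window.
# For deriv=0 the coefficients are symmetric and sum to 1 (constants pass through).
c = savgol_coeffs(7, 2)
assert np.isclose(c.sum(), 1.0)

# Degenerate case w = o + 1: the polynomial interpolates every window point,
# so the filter reduces to the identity (center weight 1, all others 0).
c_id = savgol_coeffs(5, 4)
assert np.allclose(c_id, [0, 0, 1, 0, 0], atol=1e-7)
```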

Parameter Sensitivity and Spectral Artifacts

The selection of w and o is a trade-off between noise suppression and signal distortion.

  • Window Size (w): A larger window incorporates more data points into the fit, averaging out random noise more effectively. However, if the window exceeds the natural Full Width at Half Maximum (FWHM) of the spectral peaks, the filter will suppress the peak height and artificially broaden the base. In Raman spectroscopy, where peaks are naturally narrow, large windows (e.g., >15 points) can be destructive. In NIR, where bands are broad overtones, larger windows are permissible.
  • Polynomial Order (o): Higher-order polynomials (e.g., cubic or quartic) can track sharper curvature, preserving narrow peaks better than quadratic fits. However, they are less effective at removing noise.
  • Gibbs Phenomenon: A known artifact of the SG filter is the introduction of “ringing” or side-lobes around sharp spectral features. This manifests as artificial minima flanking a strong peak, the result of the polynomial attempting to fit a high-frequency transition. This “phase reversal”, in which high-frequency noise is inverted rather than removed, is a limiting factor of high-order filters.

Table 1: Comparative Impact of Savitzky-Golay Parameters on Spectral Integrity

Filter Configuration      | Noise Attenuation | Feature Preservation | Risk Profile
Small Window / Low Order  | Low               | High                 | Insufficient denoising; noise remains dominant.
Large Window / Low Order  | High              | Low                  | Peak broadening; loss of resolution; height reduction.
Small Window / High Order | Very Low          | Very High            | Overfitting of noise; effectively no smoothing.
Large Window / High Order | Moderate          | Moderate             | Introduction of high-frequency artifacts (ringing).

5.2.2 Wavelet Transform Denoising

While SG filters operate in the time (wavenumber) domain, Wavelet Transform (WT) denoising operates in the time-frequency domain. This allows for spatially localized denoising, which is superior for non-stationary signals where noise characteristics or peak widths vary across the spectrum.

Discrete Wavelet Transform (DWT) and Thresholding

The DWT decomposes the spectrum into approximation coefficients (low-frequency trend) and detail coefficients (high-frequency noise/features) at various scales. Denoising is achieved by thresholding the detail coefficients.

  • Donoho Thresholding: This method applies a universal threshold derived from the median absolute deviation of the coefficients. Coefficients below this threshold are set to zero (hard thresholding) or shrunk (soft thresholding).
  • Wavelet Basis Selection: The choice of mother wavelet is crucial. The ‘bior4.4’ (biorthogonal) wavelet has been shown to yield optimal predictions in NIR analysis of pine seeds (P. koraiensis), achieving an R^2 of 0.9485 in PLS models, significantly outperforming raw data models.
  • Application: WT is particularly effective for removing cosmic rays (which appear as high-frequency singularities) and instrument noise while preserving the broad baseline and sharp peaks simultaneously.
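The decompose-threshold-reconstruct cycle described above can be sketched in plain numpy. For self-containment this uses the simple Haar wavelet (rather than bior4.4) and Donoho's universal threshold estimated from the finest detail scale; the structure is the same for any mother wavelet:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.exp(-0.5 * ((np.arange(1024) - 512) / 20.0) ** 2)   # single band
noisy = clean + rng.normal(0, 0.05, clean.size)

def haar_dwt(y):
    """One level of the Haar DWT: approximation and detail coefficients."""
    even, odd = y[0::2], y[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_idwt(a, d):
    """Exact inverse of haar_dwt."""
    y = np.empty(2 * a.size)
    y[0::2] = (a + d) / np.sqrt(2)
    y[1::2] = (a - d) / np.sqrt(2)
    return y

# Two-level decomposition
a1, d1 = haar_dwt(noisy)
a2, d2 = haar_dwt(a1)

# Universal (Donoho) threshold: sigma from the MAD of the finest detail scale
sigma = np.median(np.abs(d1)) / 0.6745
thr = sigma * np.sqrt(2 * np.log(noisy.size))
soft = lambda c: np.sign(c) * np.maximum(np.abs(c) - thr, 0.0)

# Soft-threshold the detail coefficients, keep the approximation intact
denoised = haar_idwt(haar_idwt(a2, soft(d2)), soft(d1))
```

Thresholding only the detail coefficients suppresses the high-frequency noise while the low-frequency approximation (baseline and broad features) passes through unchanged.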

5.2.3 Cosmic Ray Removal Strategies

Raman and CCD-based spectroscopies are plagued by cosmic rays—high-energy particles that strike the detector, causing single-pixel spikes of immense intensity. These artifacts are non-Gaussian and can skew normalization and integration steps if not removed early in the pipeline.

5.2.4 Z-Score and Nearest Neighbor Algorithms

  • Modified Z-Score: This statistical method applies a robust Z-score, z_i = 0.6745\,(\nabla x_i - \text{median}(\nabla x)) / \text{MAD}(\nabla x), to the differenced spectrum \nabla x, where MAD is the median absolute deviation. Points whose score exceeds a chosen threshold are flagged as spikes. The method is robust but can mistake sharp Raman peaks for spikes if the threshold is too aggressive.
  • Nearest Neighbor Comparison (NNC): This approach compares the intensity of a pixel to the average of its immediate neighbors. If the deviation is significant, the pixel is replaced by the interpolated value. This single-scan method avoids read noise amplification but requires careful tuning of sensitivity thresholds.
  • Deep Learning Approaches: Recent advancements utilize Convolutional Neural Networks (CNNs) trained to recognize the distinct shape of cosmic rays (single pixel width) versus Raman peaks (Gaussian/Lorentzian shape), offering automated cleaning without manual thresholding.
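A minimal numpy sketch of the modified Z-score despiker described above (the threshold and replacement-window values are illustrative, not canonical):

```python
import numpy as np

def despike(y, threshold=6.0, window=2):
    """Flag cosmic-ray spikes via a modified Z-score on the differenced
    spectrum, then replace each flagged pixel by the mean of its
    non-flagged neighbours."""
    d = np.diff(y)
    mad = np.median(np.abs(d - np.median(d)))
    z = 0.6745 * (d - np.median(d)) / mad
    spikes = np.zeros(len(y), dtype=bool)
    spikes[1:] = np.abs(z) > threshold      # a large jump flags the next pixel
    out = y.astype(float).copy()
    for i in np.where(spikes)[0]:
        lo, hi = max(i - window, 0), min(i + window + 1, len(y))
        good = ~spikes[lo:hi]
        if good.any():
            out[i] = y[lo:hi][good].mean()
    return out

rng = np.random.default_rng(1)
spec = np.sin(np.linspace(0, 3, 500)) + rng.normal(0, 0.01, 500)
spec[250] += 50.0                           # single-pixel cosmic ray
clean = despike(spec)
```

Note that a single-pixel spike produces two flagged jumps (up, then down), which is why both affected pixels are excluded from the replacement average.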

5.3 Baseline Correction: Geometric and Iterative Approaches

Baseline drift is perhaps the most pervasive artifact in vibrational spectroscopy. In Raman, it arises from sample fluorescence; in FT-IR, from scattering and instrument drift; in NMR, from incomplete water suppression. The goal of baseline correction is to estimate the low-frequency background B(\lambda) and subtract it from the measured spectrum S(\lambda) to recover the pure analyte signal.

5.4 Modified Polynomial Fitting (ModPoly)

Polynomial fitting is the classical approach. A polynomial of order n is fitted to the spectrum. However, a standard least-squares fit would pass through the middle of the peaks, effectively removing half the signal.

ModPoly Logic: The Modified Polynomial algorithm uses an iterative approach.

  1. Fit a polynomial to the raw data.
  2. Compare the fit to the data. Since spectral peaks are (usually) positive additions to the baseline, data points significantly above the polynomial fit are assumed to be peaks.
  3. These “peak” points are excluded or replaced by the fitted value.
  4. Re-fit the polynomial to the modified dataset.
  5. Repeat until convergence.

Limitations: ModPoly struggles with complex baselines. High-order polynomials can introduce “Runge’s phenomenon”—oscillations in the baseline at the edges of the spectrum or in featureless regions, creating artificial bands.
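The five ModPoly steps above reduce to a short iterative loop. The numpy sketch below (function name and convergence criterion are illustrative) clips the working spectrum down to each successive polynomial fit, so peaks are progressively excluded:

```python
import numpy as np

def modpoly(y, x, order=2, max_iter=100, tol=1e-4):
    """Iterative modified polynomial baseline (sketch). Each pass fits a
    polynomial and clips the working spectrum down to the fit, so points
    above the fit (peaks) are progressively excluded."""
    work = np.asarray(y, dtype=float).copy()
    fit = work
    for _ in range(max_iter):
        fit = np.polyval(np.polyfit(x, work, order), x)
        clipped = np.minimum(work, fit)          # step 3: replace peak points
        if np.linalg.norm(clipped - work) <= tol * np.linalg.norm(work):
            break                                # step 5: convergence
        work = clipped
    return fit

x = np.linspace(0, 1, 500)
baseline = 0.5 + 0.3 * x + 0.2 * x ** 2          # quadratic background
peak = np.exp(-0.5 * ((x - 0.4) / 0.01) ** 2)    # narrow band
est = modpoly(baseline + peak, x, order=2)
```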

5.4.1 Asymmetric Least Squares (ALS) Smoothing

Asymmetric Least Squares (ALS) has largely superseded polynomial fitting owing to its flexibility and its freedom from assumptions about the functional form of the baseline. It treats baseline estimation as a penalized least squares problem.

The ALS Objective Function

The ALS algorithm seeks to find a baseline vector \mathbf{z} that minimizes the following cost function:

S = \sum_i w_i (y_i - z_i)^2 + \lambda \sum_i (\Delta^2 z_i)^2

The first term measures the fidelity (how close the baseline is to the data). The second term measures roughness (the second derivative of the baseline). The parameter \lambda (lambda) controls the smoothness.

  • Asymmetry via Weights (w_i): The “asymmetric” magic happens in the weights. If the measured signal y_i is greater than the estimated baseline z_i, the weight w_i is set to a very small value p (e.g., 0.001). This allows the baseline to “ignore” the peaks. If y_i \le z_i, the weight is set to 1-p (e.g., 0.999), forcing the baseline to adhere tightly to the non-peak regions.

Parameter Tuning: Lambda and P

The performance of ALS is critically dependent on two parameters:

  1. Lambda (\lambda): A smoothing parameter. Values typically range from 10^2 to 10^9. A low \lambda allows the baseline to snake into the peaks (overfitting), while a high \lambda forces a linear fit (underfitting). For Raman spectra with fluorescence, \lambda \approx 10^5 is a common starting point.
  2. Asymmetry (p): Determines the penalty for positive deviations. A value of 0.001 to 0.01 is standard for positive peaks.

Issues: The requirement to manually tune \lambda and p makes standard ALS difficult to automate for high-throughput screening.
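The penalized objective and asymmetric reweighting described above fit in a few lines using scipy's sparse solvers. This is a sketch of the classic Eilers-Boelens scheme, not a production implementation; the synthetic data mimic a fluorescence-like drift under a Raman band:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def als_baseline(y, lam=1e5, p=0.001, n_iter=10):
    """Asymmetric least squares baseline (sketch). lam controls smoothness;
    p is the tiny weight given to points above the current baseline."""
    n = len(y)
    # Second-difference operator for the roughness penalty
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    z = y
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve((W + penalty).tocsc(), w * y)
        w = np.where(y > z, p, 1 - p)            # asymmetric reweighting
    return z

rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
base = 0.001 * x + 0.2 * np.sin(x / 300)         # slow fluorescence-like drift
peak = np.exp(-0.5 * ((x - 500) / 10) ** 2)      # Raman band
y = base + peak + rng.normal(0, 0.01, x.size)
z = als_baseline(y, lam=1e5, p=0.001)
```

The baseline estimate z hugs the non-peak regions (weight 1 - p) and glides under the band (weight p), illustrating why both parameters must be tuned to the data.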

5.4.2 Adaptive Iteratively Reweighted Penalized Least Squares (airPLS)

To address the parameter tuning bottleneck of ALS, the airPLS (adaptive iteratively reweighted penalized least squares) algorithm was developed. It removes the need for the asymmetry parameter p entirely.

Mechanism: Instead of a fixed binary weight based on whether y > z, airPLS assigns weights adaptively based on the magnitude of the residual.

w_i^{(t)} = \begin{cases} 0 & y_i \ge z_i \\ \exp\left(\frac{t(y_i - z_i)}{|\mathbf{d}|}\right) & y_i < z_i \end{cases}

Here, t is the iteration number and |\mathbf{d}| is the sum of the absolute values of the negative residuals (the elements y_i - z_i for which y_i < z_i). The weight therefore decays exponentially as the signal drops further below the baseline estimate, creating a “soft” exclusion that adapts to the noise level of the spectrum.

Advantages: airPLS is computationally fast, requires only the \lambda parameter (which is robust across orders of magnitude), and avoids the baseline “drop-off” artifacts seen in ModPoly. It is widely considered the gold standard for automated Raman baseline correction.

5.4.3 Improved Asymmetric Least Squares (IAsLS)

IAsLS is a further refinement that incorporates the first and second derivatives into the weighting scheme to better distinguish between peak regions and baseline regions. By considering the local slope, IAsLS can identify the start and end of peaks more accurately than intensity-based methods alone. This results in baselines that do not cut into the “shoulders” of broad peaks, a common failure mode of standard ALS.

Table 2: Comparative Analysis of Baseline Correction Algorithms

Algorithm  | Complexity | User Parameters                | Strengths                                | Weaknesses
ModPoly    | Low        | Polynomial order               | Simple, intuitive.                       | Oscillations (Runge’s); poor fit for complex shapes.
ALS (AsLS) | Medium     | \lambda, p                     | Flexible, smooth baselines.              | Requires manual tuning of p; sensitive to noise below baseline.
airPLS     | Medium     | \lambda                        | Adaptive weights, no p needed, fast.     | Can struggle at very low SNR if noise exceeds peak height.
IAsLS      | High       | \lambda, derivative thresholds | High accuracy; preserves peak shoulders. | Computationally more intensive.

5.5 Scatter Correction and Normalization Architectures

In Near-Infrared (NIR) spectroscopy, and to a lesser extent in Raman, the interaction of light with physical matter (particles, fibers, granules) creates scattering effects that dominate the spectral variance. According to the Kubelka-Munk theory, reflectance spectra are a function of both absorption (chemical) and scattering (physical). Variations in particle size, packing density, and surface roughness cause multiplicative scaling and additive offsets in the spectra, which must be corrected to linearize the relationship between absorbance and concentration.

5.5.1 Standard Normal Variate (SNV)

Standard Normal Variate (SNV) is a row-wise normalization technique. It assumes that the scattering effects manifest as a scaling and shifting of the spectrum.

Mathematical Formulation:

For a measured spectrum x_{ij} (sample i, wavelength j), the SNV corrected value is:

x_{ij}^{SNV} = \frac{x_{ij} - \bar{x}_i}{s_i}

where \bar{x}_i is the mean intensity of the i-th spectrum and s_i is the standard deviation of the i-th spectrum.

Operational Characteristics:

  • Independence: SNV operates on each spectrum individually. It does not require a reference spectrum or the statistics of the entire dataset. This makes it ideal for online/real-time monitoring where samples are processed sequentially.
  • Effect: It removes the constant offset (centering) and the multiplicative scaling factor (standardization).
  • Caveat: Because s_i includes the variance from the chemical peaks themselves, SNV can inadvertently scale down strong chemical signals if they dominate the spectral variance. However, in NIR, scattering variance usually dwarfs chemical variance, making this a safe assumption.
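SNV is a one-line row operation. The numpy sketch below confirms the key property implied by the formula: an offset-and-scaled copy of a spectrum maps onto the same corrected trace:

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: centre and scale each spectrum (row)
    by its own mean and standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

base = np.sin(np.linspace(0, 3, 200)) + 2.0
X = np.vstack([base, 1.5 * base + 0.3])     # multiplicative + additive scatter
Xs = snv(X)
```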

5.5.2 Multiplicative Scatter Correction (MSC)

Multiplicative Scatter Correction (MSC) is a model-based approach that separates physical scattering from chemical absorption by regressing each spectrum against an “ideal” reference spectrum (typically the mean spectrum of the calibration set).

Mathematical Formulation:

MSC assumes a linear relationship between the sample spectrum \mathbf{x}_i and the reference spectrum \overline{\mathbf{x}}:

\mathbf{x}_i = a_i + b_i \overline{\mathbf{x}} + \mathbf{e}_i

  • a_i: The additive offset (baseline shift).
  • b_i: The multiplicative scatter coefficient (path length correction).

The coefficients a_i and b_i are estimated via ordinary least squares regression. The corrected spectrum is then retrieved by inverting the model:

\mathbf{x}_{i, MSC} = \frac{\mathbf{x}_i - a_i}{b_i}

Strategic Implementation:

MSC is theoretically superior to SNV when the “ideal” reference spectrum is well-defined. However, its dependence on the dataset mean introduces a vulnerability: Data Leakage. If the mean spectrum is calculated using the entire dataset (including the test set), information from the test set leaks into the training process. In proper chemometric validation, the reference spectrum must be the mean of the training set only, and this same reference must be applied to the test/validation samples.

Comparison: Studies show that SNV and MSC often yield virtually identical results in terms of prediction accuracy (e.g., RMSEP). The choice is often a matter of preference or software availability, with SNV preferred for its sample independence.
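A numpy sketch of the MSC regress-and-invert cycle (the helper name msc is illustrative). As discussed above, in a validated workflow the reference spectrum must come from the training rows only:

```python
import numpy as np

def msc(X, ref=None):
    """Multiplicative Scatter Correction sketch: regress each spectrum on
    the reference and invert the fitted offset (a) and slope (b)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if ref is None else ref   # training-set mean in practice
    out = np.empty_like(X)
    for i, xi in enumerate(X):
        b, a = np.polyfit(ref, xi, 1)              # xi ~ a + b * ref
        out[i] = (xi - a) / b
    return out, ref

wl = np.linspace(0, 10, 300)
base = np.exp(-0.5 * ((wl - 5) / 1.0) ** 2)
X = np.vstack([0.8 * base + 0.10, 1.3 * base - 0.05])   # scatter-distorted copies
Xc, ref = msc(X)
```

Both distorted copies collapse onto the reference, which is exactly the behavior MSC exploits to remove additive offsets and multiplicative path-length effects.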

5.5.3 Extended Multiplicative Scatter Correction (EMSC)

Standard MSC assumes that the scattering coefficient b is constant across all wavelengths. However, physics dictates that scattering is wavelength-dependent (e.g., Rayleigh scattering is proportional to \lambda^{-4}, Mie scattering varies with particle size).

EMSC Innovation: EMSC extends the linear model to include polynomial terms that account for wavelength-dependent scattering and explicit interference spectra.

\mathbf{x}_i = a_i + b_i \overline{\mathbf{x}} + d_i \lambda + e_i \lambda^2 + \sum_k g_{ik} \mathbf{c}_k + \mathbf{r}_i

Here, d_i and e_i capture the linear and quadratic scattering effects (slope and curvature). The term \mathbf{c}_k represents the spectra of known chemical interferents (e.g., water, CO2). By explicitly modeling and subtracting these, EMSC acts as both a scatter correction and a spectral filter.

Application: EMSC is particularly powerful in biological spectroscopy (e.g., FTIR of tissues) where scattering is complex and variable. It has been used to successfully recover absorption spectra from cylindrical domains in heterogeneous samples based on Mie theory approximations.
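The extended model above can be fitted by ordinary least squares against the design matrix [1, ref, \lambda, \lambda^2]. The sketch below standardizes the wavelength axis for numerical conditioning and omits the known-interferent terms c_k for brevity:

```python
import numpy as np

def emsc(X, ref, wavelengths):
    """EMSC sketch: OLS fit of each spectrum to [1, ref, lambda, lambda^2];
    the polynomial terms absorb wavelength-dependent scatter and are
    removed before dividing by the multiplicative coefficient b."""
    lam = (wavelengths - wavelengths.mean()) / wavelengths.std()  # conditioning
    M = np.column_stack([np.ones_like(lam), ref, lam, lam ** 2])
    out = np.empty_like(np.asarray(X, dtype=float))
    for i, xi in enumerate(X):
        a, b, d, e = np.linalg.lstsq(M, xi, rcond=None)[0]
        out[i] = (xi - a - d * lam - e * lam ** 2) / b
    return out

wl = np.linspace(1000, 2500, 400)
ref = np.exp(-0.5 * ((wl - 1700) / 60.0) ** 2)
# Spectrum with multiplicative scatter plus offset and quadratic drift
xi = 1.4 * ref + 0.2 + 0.05 * (wl / 1000) ** 2
Xc = emsc(xi[None, :], ref, wl)
```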

5.5.4 Probabilistic Quotient Normalization (PQN)

In mass spectrometry-based metabolomics, particularly for biofluids like urine, sample concentration varies wildly due to biological dilution (hydration state). Standard normalization (like Total Area Normalization) fails because a single massive peak (e.g., glucose in a diabetic) can skew the total area, forcing all other metabolite signals down.

PQN Mechanism:

  1. Calculate a Reference Spectrum (usually the median of all QC samples).
  2. For each variable (m/z feature), calculate the quotient of the sample intensity to the reference intensity.
  3. The normalization factor for the sample is the median of these quotients.

f_i = \text{median}\left(\frac{x_{ij}}{x_{ref,j}}\right)

  4. Divide the entire spectrum by f_i.

Rationale: The median is a robust statistic. While some metabolites change due to biology, the majority should remain constant relative to the reference. The median quotient, therefore, accurately reflects the physical dilution factor, ignoring the biological outliers. PQN is widely cited as the most robust method for urine metabolomics.
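A numpy sketch of the PQN recipe above. In the toy data, the median quotient recovers the simulated dilution factors even when one feature is grossly elevated:

```python
import numpy as np

def pqn(X, ref=None):
    """Probabilistic Quotient Normalization sketch: the per-sample factor
    is the median feature-wise quotient against a reference profile
    (default: the median spectrum across samples)."""
    X = np.asarray(X, dtype=float)
    ref = np.median(X, axis=0) if ref is None else ref
    factors = np.median(X / ref, axis=1)
    return X / factors[:, None], factors

rng = np.random.default_rng(2)
profile = rng.lognormal(1, 0.5, 200)            # "true" metabolite profile
dilutions = np.array([0.5, 1.0, 2.0])
X = dilutions[:, None] * profile                # same biology, three dilutions
X[2, 0] *= 50                                   # one grossly elevated metabolite
Xn, factors = pqn(X)
```

The 50-fold outlier would have skewed a total-area normalization, but the median quotient ignores it and returns the true dilution factors.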

5.6 Feature Enhancement: The Calculus of Derivatives

Spectral derivatives are instrumental in enhancing resolution and eliminating baseline drifts. By transforming the signal into its slope (first derivative) or curvature (second derivative), broad overlapping bands can be separated into distinct peaks.

5.6.1 Savitzky-Golay vs. Norris-Williams Derivatives

There are two dominant schools of thought in computing derivatives: the polynomial-based Savitzky-Golay (SG) and the finite-difference-based Norris-Williams (NW).

Savitzky-Golay (SG) Derivatives

The SG filter can directly output the derivative of the fitted polynomial.

\frac{dy}{dx} \approx \frac{d}{dx} P(x)

Because the polynomial P(x) is a smooth function fitted to the window, its derivative is well-behaved. SG derivatives combine smoothing and differentiation in a single step, minimizing the noise amplification inherent in differentiation.

Parameters: The user defines the derivative order (d). A first derivative (d=1) removes additive baselines. A second derivative (d=2) removes linear baselines.

Consensus: SG derivatives are the standard in modern chemometrics due to their mathematical rigor and controllable smoothing.
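scipy's savgol_filter exposes this combined smooth-and-differentiate operation through its deriv argument. The sketch below shows the characteristic signatures on a synthetic Gaussian band: a zero-crossing at the peak center for d = 1 and a negative minimum for d = 2:

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.arange(200, dtype=float)
y = np.exp(-0.5 * ((x - 100) / 8.0) ** 2)       # isolated Gaussian band

# Smoothing and differentiation in a single convolution pass
d1 = savgol_filter(y, window_length=11, polyorder=3, deriv=1)
d2 = savgol_filter(y, window_length=11, polyorder=3, deriv=2)
```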

5.6.2 Norris-Williams (NW) Derivatives (Gap-Segment)

The NW method, also known as “gap-segment” derivatives, is prevalent in agricultural NIR (e.g., Foss instruments).

Algorithm:

  1. Smoothing: Apply a boxcar (moving average) filter of length S (segment).
  2. Gap Difference: Calculate the difference between points separated by a gap G.

\text{Deriv}_i = \frac{\text{Smooth}_{i+G} - \text{Smooth}_{i-G}}{2G}

Comparison: NW is computationally simpler and faster than SG. Some studies suggest it is superior for detecting trace components in particulate systems (e.g., enzyme granules) because the “gap” acts as a tunable frequency filter, enhancing features of a specific width while suppressing high-frequency noise. However, NW derivatives can be “blocky” and less smooth than SG derivatives.
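A numpy sketch of the gap-segment recipe above (boxcar smoothing, then a central difference across the gap); edge points where the gap is undefined are returned as NaN. The function name and defaults are illustrative:

```python
import numpy as np

def norris_williams(y, segment=5, gap=5):
    """Norris-Williams gap-segment first derivative (sketch):
    boxcar smoothing of length `segment`, then a central
    difference across `gap` points on either side."""
    kernel = np.ones(segment) / segment
    smooth = np.convolve(y, kernel, mode='same')
    d = np.full_like(y, np.nan, dtype=float)
    d[gap:-gap] = (smooth[2 * gap:] - smooth[:-2 * gap]) / (2 * gap)
    return d

y = 2.0 * np.arange(100) + 3.0          # straight line, slope 2
d = norris_williams(y, segment=5, gap=5)
```

On a straight line the interior derivative is exactly the slope, which is a quick sanity check for gap/segment bookkeeping.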

5.6.3 Interpretability and Risks

  • First Derivative: Peaks become zero-crossings. Inflection points become maxima/minima. Useful for detecting peak positions.
  • Second Derivative: Peaks become negative minima. Shoulders (inflections) become peaks. This effectively “deconvolves” overlapping bands.
  • Noise Amplification: Differentiation amplifies high-frequency noise. The signal-to-noise ratio (SNR) of a second derivative is significantly lower than the raw spectrum. Therefore, the smoothing window in SG or the segment size in NW must be increased when taking higher-order derivatives.

5.7 Domain-Specific Preprocessing Pipelines

The choice of preprocessing algorithms is not arbitrary; it is dictated by the physical nature of the sample and the spectroscopic technique.

5.7.1 Near-Infrared (NIR) Pipeline: The Scattering Problem

NIR spectra are dominated by broad overtones and strong scattering from solid samples.

Standard Protocol:

  1. Transformation: Convert Reflectance (R) to Absorbance (\log(1/R)) to linearize the Beer-Lambert relationship.
  2. Scatter Correction: Apply SNV or MSC to correct for particle size variations. This aligns the global intensity of the spectra.
  3. Derivatives: Apply SG Second Derivative to remove linear baseline drifts and resolve overlapping O-H and C-H bands.
  4. Smoothing: Implicit in the SG derivative step.
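Steps 1-3 of the protocol chain naturally into a single function. A numpy/scipy sketch, under the assumption of a (samples × wavelengths) reflectance matrix with values in (0, 1]:

```python
import numpy as np
from scipy.signal import savgol_filter

def nir_pipeline(R, window=15, polyorder=3):
    """Sketch of the standard NIR chain: log(1/R) -> SNV -> SG 2nd derivative."""
    A = np.log10(1.0 / R)                                         # absorbance
    A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)  # SNV
    return savgol_filter(A, window, polyorder, deriv=2, axis=1)   # 2nd derivative

rng = np.random.default_rng(3)
R = np.clip(0.5 + 0.3 * rng.random((10, 256)), 1e-3, 1.0)   # synthetic reflectance
X = nir_pipeline(R)
```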

Case Study (P. koraiensis Seeds): In the prediction of seed viability, a combination of Wavelet Transform (for denoising) followed by Mean Centering and PLS yielded the highest accuracy (R^2 = 0.9485). The wavelet filter ‘bior4.4’ was specifically effective at preserving the seed’s chemical features while removing surface scattering noise.

Case Study (Pharmaceutical Tablets): For detecting active ingredients in tablets, SNV is critical to normalize for the variable pressure used in tableting, which changes the density and scattering properties of the pill.

5.7.2 Raman Pipeline: The Fluorescence Problem

Raman signals are weak and sit atop varying fluorescence backgrounds.

Standard Protocol:

  1. Despiking: Apply Modified Z-Score or Nearest Neighbor filter to remove cosmic rays. Critical: This must be done before any smoothing, or the spike will be smeared into a broad artifact.
  2. Baseline Correction: Apply airPLS or IAsLS to remove the broad fluorescence background.
  3. Denoising: Apply mild SG smoothing (small window, e.g., 5-9 points) to reduce thermal noise without broadening the sharp Raman peaks.
  4. Normalization: Normalize to the area of a standard band (e.g., Phenylalanine ring breathing at 1003 cm⁻¹ in biological samples) or Vector Normalization (Unit Length) to correct for laser power fluctuations.
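Step 4's vector (unit-length) normalization is a one-liner. This sketch shows two scans of the same sample taken at different laser powers collapsing onto one trace:

```python
import numpy as np

def vector_normalize(X):
    """Unit-length (vector) normalization of each spectrum, removing
    global intensity variation such as laser power fluctuation."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

band = np.exp(-0.5 * ((np.arange(300) - 150) / 5.0) ** 2)
X = np.vstack([band, 2.5 * band])       # same sample, different laser power
Xn = vector_normalize(X)
```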

Case Study (Coffee Origin): Raman spectroscopy was used to classify coffee. The “fluorescence effect” was the main hurdle. Preprocessing with Weighted Least Squares (a baseline method) and Normalization eliminated this effect, enabling successful classification.

5.7.3 Mass Spectrometry (Metabolomics) Pipeline

MS data suffers from axis drift (m/z instability) and concentration variance.

Standard Protocol:

  1. Binning/Centroiding: Reduce raw profile data to peak centroids.
  2. Alignment (Warping): Use Dynamic Time Warping (DTW) or Correlation Optimized Warping (COW) to align elution peaks across samples. Algorithms like RANSAC are used to robustly match peaks and define the warping function, improving alignment accuracy by up to 88%.
  3. Filtering: Remove peaks that appear in fewer than x% of samples (noise reduction).
  4. Normalization: Apply PQN to correct for dilution effects in biofluids.
  5. Transformation: Log-transform to stabilize variance (heteroscedasticity correction).
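Steps 3 and 5 of the protocol can be sketched together in numpy (the prevalence cutoff and log offset are illustrative choices, not fixed conventions):

```python
import numpy as np

def filter_and_log(X, min_frac=0.5, eps=1.0):
    """Keep features detected (non-zero) in at least `min_frac` of samples,
    then log-transform to stabilise variance. `eps` is a small offset so
    zero intensities remain defined."""
    keep = (X > 0).mean(axis=0) >= min_frac
    return np.log(X[:, keep] + eps), keep

rng = np.random.default_rng(4)
X = rng.lognormal(6, 2, (20, 100))              # 20 samples x 100 features
X[:, :30] *= rng.random((20, 30)) < 0.2         # 30 sparsely detected features
Xt, keep = filter_and_log(X)
```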

5.8 Software Implementation and Ecosystems

The implementation of these algorithms is supported by open-source libraries in Python and R, which allow for reproducible and automated workflows.

5.9 Python Ecosystem

  • scipy.signal.savgol_filter: The standard implementation of the Savitzky-Golay filter. It is fast and efficient but requires manual parameter setting.
  • pybaselines: A specialized library containing implementations of airPLS, IAsLS, ModPoly, and SNIP. It provides a unified API for testing different baseline algorithms.
  • chemotools: A library designed to integrate chemometric preprocessing (MSC, SNV, SG) into scikit-learn pipelines. This is crucial for machine learning workflows, ensuring that preprocessing parameters (like the MSC reference spectrum) are learned during fit() and applied during transform(), preventing data leakage.
  • ramanspy: A comprehensive toolkit for Raman spectroscopy, offering modules for despiking, baseline subtraction, and spectral unmixing.

5.9.1 R Ecosystem

  • prospectr: The go-to package for NIR preprocessing. It includes implementations of Norris-Williams derivatives (gapDer), Savitzky-Golay (savitzkyGolay), Standard Normal Variate (standardNormalVariate), and Detrending.
  • EMSC: A dedicated package for Extended Multiplicative Signal Correction, allowing users to build complex interference models easily.
  • VPdtw: Implements Variable Penalty Dynamic Time Warping for aligning chromatographic data, critical for LC-MS analysis.
  • mdatools: A comprehensive package for chemometrics that provides a unified interface for preprocessing (autoscaling, SNV, MSC, SG) and modeling (PCA, PLS, SIMCA), mirroring functionality often found in commercial software.

Table 3: Feature Matrix of Spectral Preprocessing Libraries

Library     | Language | Primary Domain | Key Algorithms             | Integration
scipy       | Python   | General signal | SG, convolutions           | General
pybaselines | Python   | Vibrational    | airPLS, AsLS, ModPoly      | numpy
chemotools  | Python   | Chemometrics   | MSC, SNV, SG               | scikit-learn
ramanspy    | Python   | Raman          | Despiking, unmixing        | matplotlib
prospectr   | R        | NIR/Agri       | Norris, gap-segment, SNV   | caret / pls
EMSC        | R        | Biological     | EMSC, interferent modeling | Base R

5.10 Applied R Implementations: Vibrational Spectroscopy

This section focuses on essential R implementations for NIR and Raman spectroscopy, including smoothing, scatter correction, and baseline correction.

5.10.1 Smoothing and Derivatives: Savitzky-Golay

The prospectr package provides a highly efficient C++ implementation of the Savitzky-Golay filter.

Code
library(prospectr)

# Load example NIR soil data
data(NIRsoil)
spectra <- NIRsoil$spc

# Apply Savitzky-Golay Filter
# m = 0 (smoothing only), p = 3 (3rd order polynomial), w = 11 (window size)
sg_smooth <- savitzkyGolay(X = spectra, m = 0, p = 3, w = 11)

# Apply 1st Derivative
# m = 1 (1st derivative), p = 3, w = 11
sg_deriv1 <- savitzkyGolay(X = spectra, m = 1, p = 3, w = 11)

# Plotting the result
# Note: savitzkyGolay trims (w - 1)/2 points from each edge, so the x-axis
# must come from the filtered matrix, not the original spectra
matplot(as.numeric(colnames(sg_deriv1)), t(sg_deriv1[1:5, ]), type = 'l',
        main = "1st Derivative Spectra (Savitzky-Golay)",
        xlab = "Wavelength (nm)", ylab = "dA/dlambda")

5.10.2 Scatter Correction: SNV and MSC

Standard Normal Variate (SNV) and Multiplicative Scatter Correction (MSC) are standard for NIR data. prospectr handles SNV, while MSC is available in prospectr or pls.

Code
# Standard Normal Variate (SNV)
# Operates row-wise: (x - mean) / sd
snv_spectra <- standardNormalVariate(X = spectra)

# Multiplicative Scatter Correction (MSC)
# Requires a reference spectrum (default is the column means)
msc_spectra <- msc(X = spectra, ref_spectrum = colMeans(spectra))

# Note: Ideally, 'ref_spectrum' should be calculated from the TRAINING set only
# to prevent data leakage during model validation.
train_idx <- 1:500
ref_spec <- colMeans(spectra[train_idx, ])
msc_spectra_val <- msc(X = spectra, ref_spectrum = ref_spec)

5.10.3 Baseline Correction: airPLS

The airPLS algorithm is available via GitHub and utilizes sparse matrices for speed. It is ideal for removing fluorescence backgrounds in Raman spectra.

Code
# Install airPLS from GitHub if not available on CRAN
# library(devtools)
# install_github("zmzhang/airPLS_R")
library(airPLS)

# Simulate a spectrum with a baseline drift
wavenumbers <- seq(0, 1000, length.out = 1000)
pure_signal <- exp(-0.5 * (wavenumbers - 500)^2 / 20^2)
baseline_drift <- 0.001 * wavenumbers + 0.1 * sin(wavenumbers/100)
raw_signal <- pure_signal + baseline_drift + rnorm(1000, sd = 0.01)

# Apply airPLS baseline correction
# lambda: smoothness parameter (adjust based on noise level)
baseline_est <- airPLS(raw_signal, lambda = 1000, differences = 2, itermax = 20)
corrected_signal <- raw_signal - baseline_est

# Visualization
plot(wavenumbers, raw_signal, type = 'l', col = 'black', main = "airPLS Baseline Correction")
lines(wavenumbers, baseline_est, col = 'red', lwd = 2) # The estimated baseline
lines(wavenumbers, corrected_signal, col = 'blue', lty = 2) # Corrected signal
legend("topright", legend=c("Raw", "Baseline", "Corrected"), col=c("black", "red", "blue"), lty=c(1,1,2))

5.11 Comprehensive R Workflow for Mass Spectrometry

Preprocessing is a critical step in MS data analysis that improves data quality and enables accurate downstream analysis. This chapter covers baseline correction, smoothing, normalization, and peak picking using modern R for Mass Spectrometry tools.

5.11.1 Loading Required Libraries

Code
library(Spectra)           # Core MS data handling
library(MsCoreUtils)       # MS processing utilities
library(msdata)            # Example datasets
library(ProtGenerics)      # Generic functions
library(ggplot2)           # Visualization
library(dplyr)             # Data manipulation
library(patchwork)         # Plot composition

5.11.2 Understanding Raw Spectral Data: Loading and Inspecting Spectra

Code
# Load example proteomics data
# Note: Using setBackend to convert to in-memory storage to avoid mzR issues
ms_file <- msdata::proteomics(full.names = TRUE)

# First load with MsBackendMzR, then convert to DataFrame backend
tryCatch({
  ms_data <- Spectra(ms_file, backend = MsBackendMzR())
  # Convert to in-memory backend for better compatibility
  ms_data <- setBackend(ms_data, backend = MsBackendDataFrame())
  cat("Successfully loaded real MS data\n")
}, error = function(e) {
  # If mzR fails, create synthetic data for demonstration
  cat("Note: Using synthetic data due to mzR compatibility issues\n")
  cat("Error was:", conditionMessage(e), "\n\n")

  # Create synthetic MS2 spectra using proper format
  set.seed(123)
  n_spectra <- 100

  # Create peaks data - separate m/z and intensity lists
  mz_list <- lapply(1:n_spectra, function(i) {
    n_peaks <- sample(50:200, 1)
    sort(runif(n_peaks, 100, 2000))  # Already sorted
  })

  intensity_list <- lapply(mz_list, function(mz_vals) {
    rlnorm(length(mz_vals), meanlog = 8, sdlog = 2)
  })

  # Create spectra data frame with metadata
  library(S4Vectors)
  library(IRanges)
  spd <- DataFrame(
    msLevel = rep(2L, n_spectra),
    rtime = seq(100, 6000, length.out = n_spectra),
    precursorMz = runif(n_spectra, 400, 1500),
    precursorCharge = sample(2:3, n_spectra, replace = TRUE),
    polarity = rep(1L, n_spectra)
  )

  # Add list columns using NumericList
  spd$mz <- NumericList(mz_list)
  spd$intensity <- NumericList(intensity_list)

  # Create Spectra object from DataFrame backend
  backend <- MsBackendDataFrame()
  backend <- backendInitialize(backend, spd)
  ms_data <<- Spectra(backend)
})

# Focus on MS2 spectra for preprocessing examples
ms2_data <- filterMsLevel(ms_data, 2)
cat("Total MS2 spectra:", length(ms2_data), "\n")

# Select a representative spectrum
spectrum <- ms2_data[10]

# Extract peak data
peaks <- peaksData(spectrum)[[1]]
mz_vals <- peaks[, 1]
int_vals <- peaks[, 2]

cat("Spectrum contains", length(mz_vals), "peaks\n")
cat("m/z range:", round(range(mz_vals), 2), "\n")
cat("Intensity range:", sprintf("%.2e - %.2e", min(int_vals), max(int_vals)), "\n")

5.11.3 Visualizing Raw Data

Code
# Create a visualization of raw spectrum
raw_df <- data.frame(mz = mz_vals, intensity = int_vals)

p_raw <- ggplot(raw_df, aes(x = mz, y = intensity)) +
  geom_segment(aes(xend = mz, yend = 0), color = "steelblue", alpha = 0.6) +
  labs(title = "Raw MS2 Spectrum",
       subtitle = paste("Precursor:", round(precursorMz(spectrum), 3), "m/z"),
       x = "m/z", y = "Intensity") +
  theme_minimal()

print(p_raw)

5.11.4 Baseline Correction

Understanding Baseline Issues:

Code
# Use the already loaded ms_data from above to avoid reloading
# Get a representative spectrum
spectrum <- ms_data[min(100, length(ms_data))]
peaks <- peaksData(spectrum)[[1]]
mz_vals <- peaks[, 1]
int_vals <- peaks[, 2]

# Plot raw spectrum
plot(mz_vals, int_vals, type = "l",
     main = "Raw Spectrum",
     xlab = "m/z", ylab = "Intensity")

Baseline Removal Methods:

Code
# Simple baseline correction using quantile-based approach
baseline_estimate <- quantile(int_vals, 0.05)  # 5th percentile
corrected_intensity <- pmax(int_vals - baseline_estimate, 0)

# Plot corrected spectrum
plot(mz_vals, corrected_intensity, type = "l",
     main = "Baseline Corrected Spectrum",
     xlab = "m/z", ylab = "Intensity")
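The single-quantile subtraction above assumes the baseline is flat across the whole m/z range. When the baseline drifts, a windowed (rolling) quantile gives a local estimate instead. The sketch below is a minimal base-R illustration on synthetic data; the window width and quantile level are assumptions to be tuned for each instrument:

Code
```r
# Local (rolling) baseline: for every point, take a low quantile of the
# intensities inside a surrounding window, then subtract that estimate.
rolling_baseline <- function(intensity, window = 101, prob = 0.05) {
  n <- length(intensity)
  half <- floor(window / 2)
  baseline <- numeric(n)
  for (i in seq_len(n)) {
    lo <- max(1, i - half)
    hi <- min(n, i + half)
    baseline[i] <- quantile(intensity[lo:hi], prob, names = FALSE)
  }
  baseline
}

# Synthetic demonstration: one Gaussian peak riding on a linear drift
x <- seq_len(500)
signal <- 0.02 * x + 50 * exp(-(x - 250)^2 / 50)
bl <- rolling_baseline(signal, window = 151)
corrected <- pmax(signal - bl, 0)
```

Wider windows suppress broad background humps more aggressively, but at the risk of eating into genuinely wide peaks.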

5.11.5 Smoothing Techniques

Smoothing reduces noise while preserving spectral features. The Spectra package provides built-in smoothing methods.

Savitzky-Golay Smoothing:

Code
# Apply Savitzky-Golay smoothing using Spectra
# halfWindowSize must be an integer
smoothed_spectrum <- smooth(spectrum, method = "SavitzkyGolay", halfWindowSize = 2L)

# Extract smoothed data
smoothed_peaks <- peaksData(smoothed_spectrum)[[1]]

# Compare original and smoothed
comparison_df <- data.frame(
  mz = c(mz_vals, smoothed_peaks[, 1]),
  intensity = c(int_vals, smoothed_peaks[, 2]),
  type = rep(c("Original", "Smoothed"), c(length(mz_vals), nrow(smoothed_peaks)))
)

p_smooth <- ggplot(comparison_df, aes(x = mz, y = intensity, color = type)) +
  geom_line(alpha = 0.7) +
  scale_color_manual(values = c("Original" = "gray60", "Smoothed" = "red")) +
  labs(title = "Savitzky-Golay Smoothing",
       x = "m/z", y = "Intensity", color = "Type") +
  theme_minimal() +
  theme(legend.position = "bottom")

print(p_smooth)

Moving Average Smoothing:

Code
# Custom moving average implementation
moving_average <- function(x, window = 5) {
  n <- length(x)
  smoothed <- numeric(n)
  half_window <- floor(window / 2)

  for (i in 1:n) {
    start_idx <- max(1, i - half_window)
    end_idx <- min(n, i + half_window)
    smoothed[i] <- mean(x[start_idx:end_idx])
  }
  return(smoothed)
}

# Apply moving average
ma_intensity <- moving_average(int_vals, window = 5)

# Visualize comparison
ma_df <- data.frame(
  mz = mz_vals,
  original = int_vals,
  moving_avg = ma_intensity
)

ggplot(ma_df) +
  geom_line(aes(x = mz, y = original), color = "gray60", alpha = 0.7) +
  geom_line(aes(x = mz, y = moving_avg), color = "blue", linewidth = 1) +
  labs(title = "Moving Average Smoothing (window = 5)",
       x = "m/z", y = "Intensity") +
  theme_minimal()

Gaussian Smoothing:

Code
# Gaussian smoothing implementation
gaussian_smooth <- function(x, sigma = 1) {
  n <- length(x)
  kernel_size <- ceiling(3 * sigma)
  kernel <- exp(-((-kernel_size):kernel_size)^2 / (2 * sigma^2))
  kernel <- kernel / sum(kernel)

  # Apply convolution (simplified)
  smoothed <- stats::filter(x, kernel, sides = 2)
  smoothed[is.na(smoothed)] <- x[is.na(smoothed)]  # Handle edges
  return(as.numeric(smoothed))
}

# Apply Gaussian smoothing to the original intensity values
gaussian_smoothed <- gaussian_smooth(int_vals, sigma = 2)

plot(mz_vals, gaussian_smoothed, type = "l", col = "blue",
     main = "Gaussian Smoothed Spectrum", xlab = "m/z", ylab = "Intensity")
lines(mz_vals, int_vals, col = "gray", lty = 2)
legend("topright", c("Gaussian Smoothed", "Original"),
       col = c("blue", "gray"), lty = c(1, 2))

5.11.6 Peak Detection

Simple Peak Picking Algorithm:

Code
# Use the moving average smoothed data for peak detection
smoothed_intensity <- ma_intensity

# Simple peak detection function
detect_peaks <- function(mz, intensity, min_intensity = 1000, min_distance = 0.1) {
  n <- length(intensity)
  peaks <- logical(n)

  for (i in 2:(n-1)) {
    if (intensity[i] > intensity[i-1] &&
        intensity[i] > intensity[i+1] &&
        intensity[i] > min_intensity) {
      peaks[i] <- TRUE
    }
  }

  # Enforce a minimum m/z spacing between reported peaks, keeping the
  # left-most peak in each cluster
  peak_indices <- which(peaks)
  if (length(peak_indices) > 1) {
    keep <- logical(length(peak_indices))
    keep[1] <- TRUE
    last_kept_mz <- mz[peak_indices[1]]

    for (i in 2:length(peak_indices)) {
      if (mz[peak_indices[i]] - last_kept_mz > min_distance) {
        keep[i] <- TRUE
        last_kept_mz <- mz[peak_indices[i]]
      }
    }
    peak_indices <- peak_indices[keep]
  }

  return(list(
    mz = mz[peak_indices],
    intensity = intensity[peak_indices],
    indices = peak_indices
  ))
}

# Detect peaks
peaks <- detect_peaks(mz_vals, smoothed_intensity, min_intensity = 5000)

# Plot spectrum with detected peaks
plot(mz_vals, smoothed_intensity, type = "l",
     main = "Peak Detection Results", xlab = "m/z", ylab = "Intensity")
points(peaks$mz, peaks$intensity, col = "red", pch = 19)

Peak Statistics:

Code
# Analyze detected peaks
cat("Number of peaks detected:", length(peaks$mz), "\n")
cat("Peak m/z range:", range(peaks$mz), "\n")
cat("Peak intensity range:", range(peaks$intensity), "\n")

# Create peak list data frame
peak_list <- data.frame(
  mz = peaks$mz,
  intensity = peaks$intensity,
  relative_intensity = peaks$intensity / max(peaks$intensity) * 100
)

head(peak_list)

5.11.7 Normalization

Total Ion Current (TIC) Normalization:

Code
# TIC normalization
tic_normalize <- function(intensity) {
  tic <- sum(intensity)
  return(intensity / tic * 1e6)  # Scale to parts per million
}

normalized_intensity <- tic_normalize(smoothed_intensity)

# Compare before and after normalization
par(mfrow = c(2, 1))
plot(mz_vals, smoothed_intensity, type = "l",
     main = "Before TIC Normalization", xlab = "m/z", ylab = "Intensity")
plot(mz_vals, normalized_intensity, type = "l",
     main = "After TIC Normalization", xlab = "m/z", ylab = "Normalized Intensity")
par(mfrow = c(1, 1))

Base Peak Normalization:

Code
# Base peak normalization
base_peak_normalize <- function(intensity) {
  base_peak <- max(intensity)
  return(intensity / base_peak * 100)
}

bp_normalized <- base_peak_normalize(smoothed_intensity)

plot(mz_vals, bp_normalized, type = "l",
     main = "Base Peak Normalized Spectrum",
     xlab = "m/z", ylab = "Relative Intensity (%)")
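Both TIC and base-peak scaling are sensitive to a few dominant peaks. Probabilistic quotient normalization (PQN), mentioned in the synthesis at the end of this chapter, instead estimates a per-spectrum dilution factor as the median intensity ratio against a reference spectrum. This minimal sketch assumes the spectra have already been binned onto a common m/z axis (rows = spectra, columns = shared bins); the values are hypothetical:

Code
```r
# PQN: estimate each spectrum's dilution factor as the median ratio of its
# intensities to a reference spectrum, then divide it out.
pqn_normalize <- function(X) {
  # Integral (TIC) normalization first, as in the original PQN recipe
  X <- sweep(X, 1, rowSums(X), "/")
  reference <- apply(X, 2, median)        # median spectrum as reference
  quotients <- sweep(X, 2, reference, "/")
  dilution <- apply(quotients, 1, median, na.rm = TRUE)
  sweep(X, 1, dilution, "/")
}

# Example: the same underlying spectrum at two dilutions, plus a copy
# carrying one large contaminating peak
base   <- c(10, 20, 5, 40, 25)
spiked <- base + c(0, 0, 0, 200, 0)
X <- rbind(base, spiked, 2 * base)
X_pqn <- pqn_normalize(X)
```

Rows 1 and 3 differ only by dilution and are mapped onto each other, while the contaminating peak in row 2 no longer distorts the scaling of the other channels as it would under pure TIC normalization.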

5.11.8 Processing Multiple Spectra: Batch Processing Function

Code
# Note: peak detection runs after base-peak normalization, so the intensity
# threshold must be on the relative (0-100) scale, not on raw counts
process_spectrum <- function(spec, baseline_quantile = 0.05,
                            smooth_window = 5, min_peak_intensity = 5) {
  mz_vals <- mz(spec)[[1]]
  int_vals <- intensity(spec)[[1]]

  # Baseline correction
  baseline <- quantile(int_vals, baseline_quantile)
  int_vals <- pmax(int_vals - baseline, 0)

  # Smoothing
  int_vals <- moving_average(int_vals, smooth_window)

  # Normalization
  int_vals <- base_peak_normalize(int_vals)

  # Peak detection
  peaks <- detect_peaks(mz_vals, int_vals, min_peak_intensity)

  return(list(
    mz = mz_vals,
    intensity = int_vals,
    peaks = peaks
  ))
}

# Process first 10 spectra
processed_results <- list()
for (i in 1:min(10, length(ms_data))) {
  processed_results[[i]] <- process_spectrum(ms_data[i])
}

Quality Assessment:

Code
# Assess processing quality
peak_counts <- sapply(processed_results, function(x) length(x$peaks$mz))
cat("Peak counts across processed spectra:\n")
summary(peak_counts)

# Plot peak count distribution
hist(peak_counts, breaks = 10,
     main = "Distribution of Peak Counts",
     xlab = "Number of Peaks", ylab = "Frequency")

5.11.9 Advanced Preprocessing Techniques

Mass Calibration Concepts:

Code
# Simple single-point mass calibration example (theoretical)
calibrate_mass <- function(observed_mz, reference_mz, expected_mz) {
  # Multiplicative (single-point) correction: rescale all m/z values by the
  # ratio of the expected to the observed reference mass
  calibration_factor <- expected_mz / reference_mz
  calibrated_mz <- observed_mz * calibration_factor
  return(calibrated_mz)
}

# Example usage (with hypothetical values)
observed <- c(100.0, 200.1, 300.2)
reference <- 200.1
expected <- 200.0

calibrated <- calibrate_mass(observed, reference, expected)
cat("Original m/z:", observed, "\n")
cat("Calibrated m/z:", calibrated, "\n")
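The single-point correction above can only rescale the axis. With two or more reference (lock) masses, an affine calibration with both a slope and an intercept can be fitted by least squares; the reference values below are hypothetical:

Code
```r
# Affine calibration: fit expected ~ observed over the reference masses,
# then apply the fitted line to all observed m/z values
calibrate_mass_linear <- function(observed_mz, observed_refs, expected_refs) {
  fit <- lm(expected_refs ~ observed_refs)
  unname(predict(fit, newdata = data.frame(observed_refs = observed_mz)))
}

# Hypothetical lock masses carrying a small systematic error
observed_refs <- c(200.12, 800.35)
expected_refs <- c(200.10, 800.30)
observed   <- c(100.0, 400.0, 1000.0)
calibrated <- calibrate_mass_linear(observed, observed_refs, expected_refs)
```

With exactly two reference masses the fit is exact at those points; with more, the residuals give a useful diagnostic of calibration quality.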

5.11.10 Preprocessing Pipeline: Complete Function

Code
complete_preprocessing <- function(spectra_obj,
                                  baseline_quantile = 0.05,
                                  smooth_sigma = 2,
                                  normalization = "base_peak",
                                  min_peak_intensity = 5) {
  # Note: choose min_peak_intensity on the scale the selected normalization
  # produces (0-100 for "base_peak", parts per million for "tic")

  processed_spectra <- list()

  for (i in seq_along(spectra_obj)) {
    spec <- spectra_obj[i]
    mz_vals <- mz(spec)[[1]]
    int_vals <- intensity(spec)[[1]]

    # Step 1: Baseline correction
    baseline <- quantile(int_vals, baseline_quantile)
    int_vals <- pmax(int_vals - baseline, 0)

    # Step 2: Smoothing
    int_vals <- gaussian_smooth(int_vals, sigma = smooth_sigma)

    # Step 3: Normalization
    if (normalization == "tic") {
      int_vals <- tic_normalize(int_vals)
    } else if (normalization == "base_peak") {
      int_vals <- base_peak_normalize(int_vals)
    }

    # Step 4: Peak detection
    peaks <- detect_peaks(mz_vals, int_vals, min_peak_intensity)

    processed_spectra[[i]] <- list(
      original_index = i,
      mz = mz_vals,
      intensity = int_vals,
      peaks = peaks,
      metadata = list(
        processing_date = Sys.time(),
        parameters = list(
          baseline_quantile = baseline_quantile,
          smooth_sigma = smooth_sigma,
          normalization = normalization,
          min_peak_intensity = min_peak_intensity
        )
      )
    )
  }

  return(processed_spectra)
}

5.12 Exercises

  • Apply different baseline correction methods and compare results.
  • Experiment with various smoothing parameters.
  • Implement and test different peak detection algorithms.
  • Compare TIC and base peak normalization approaches.
  • Create a quality control function for preprocessing results.

5.12.1 Summary

This section covered essential preprocessing techniques for MS data, including baseline correction, smoothing, peak detection, and normalization. These steps are fundamental for preparing data for downstream analysis and ensuring reliable results.

5.13 Synthesis and Future Directions

Spectral preprocessing is not merely a “cleanup” step; it is a transformation of the data space that fundamentally alters the performance of chemometric models. The choice of algorithm—ModPoly vs. airPLS, SNV vs. MSC, SG vs. Norris—must be grounded in the physical reality of the sample and the noise structure of the instrument.

5.13.1 The “Order of Operations” Consensus

A recurring debate in the literature concerns the order of operations. The consensus emerging from recent reviews suggests:

  1. Artifact Removal: Cosmic rays and detector spikes must go first.
  2. Scatter/Pathlength Correction: SNV or MSC should generally precede derivatives. Derivatives remove the “magnitude” information that SNV needs to calculate the scaling factor.
  3. Baseline/Spectral Filtering: Derivatives or Baseline Correction (airPLS) are applied next to remove chemical/fluorescence backgrounds.
  4. Scaling: Mean centering is the final step before PCA/PLS.
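As a minimal base-R illustration of this ordering (using a simple first difference as a stand-in for a Savitzky-Golay derivative, and omitting the despiking step), consider two synthetic spectra that differ only by additive offsets and multiplicative scatter:

Code
```r
# Step 2: SNV - center and scale each spectrum by its own mean and sd
snv <- function(X) t(apply(X, 1, function(s) (s - mean(s)) / sd(s)))

# Step 3: crude derivative - first difference along the wavelength axis
first_derivative <- function(X) t(apply(X, 1, diff))

# Step 4: mean centering per variable, across samples
mean_center <- function(X) scale(X, center = TRUE, scale = FALSE)

# Two synthetic spectra: same chemistry, different additive offsets and
# multiplicative scatter factors
wl <- seq_len(200)
peak <- exp(-(wl - 100)^2 / 200)
X <- rbind(1.0 + 2 * peak, 3.0 + 4 * peak)
X_proc <- mean_center(first_derivative(snv(X)))
```

Because the two spectra share the same underlying shape, SNV maps them onto each other, and after the derivative and mean centering the residual matrix is numerically zero: exactly the behavior a scatter-insensitive pipeline should show when no chemical variation is present.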

5.13.2 Future Frontiers: Deep Learning

The future of preprocessing lies in automation. Neural networks, specifically 1D Convolutional Neural Networks (CNNs), are beginning to replace manual preprocessing pipelines. A CNN can learn to effectively “despike,” “smooth,” and “correct” data as part of its feature extraction layers, optimizing the preprocessing strategy specifically for the prediction task at hand. However, for regulatory environments (pharma, forensics), the explainability of classical algorithms like SG and SNV ensures their continued dominance.

In conclusion, the rigorous application of spectral preprocessing is the differentiator between a model that fits noise and a model that captures chemistry. By leveraging adaptive algorithms like airPLS and robust normalization like PQN, researchers can extract precise molecular insights from the chaotic superposition of signals that constitutes a raw spectrum.
