Peakbin assembly and quantification

There is no fixed order in which this task and the preceding one (peak detection) should be performed: peak alignment may be followed by peak detection [5], or vice versa [21]. Peak (or spectra) alignment tries to match similar peaks detected across all the spectra. Again, because of measurement-induced noise, the exact $m/z$ value of a peak can differ from one spectrum to another, and slight deviations or shifts may appear across runs even when the same sample is analyzed. This shift is widely known as the mass error effect [19]. To align the peaks, each spectrum is modified by shifting the signal until the peaks match.

All these modifications may introduce signal shifts and potentially hide isotopic formations or very close compounds. Moreover, this effect is more likely when dealing with very complex mixtures.

Figure: Peakbin assembling algorithm. The threshold $\rho$ is the minimum correlation between two consecutive peaks or peakbins required to merge them. Matrices $\mathbf{P}$ and $\mathbf{V}$ are the computed list of peakbins and the spectra values for those bins, respectively.


Step 1. For each peak/peakbin $p_i$ and for each spectrum $s_j$, compute the intensity value $v_{ij}=f(p_i, s_j)$.

Step 2. Compute the linear correlation matrix $\mathbf{R}$ between each pair of subset values $v_{i\cdot} = f(p_i,\cdot)$ and $v_{i+1\cdot} = f(p_{i+1},\cdot)$.

Step 3. If all values $\mathbf{R}(i,i+1) < \rho$, then return $\mathbf{P} = [p_i]$ and $\mathbf{V} = [v_{ij}]$. Otherwise, for each pair $p_i$ and $p_{i+1}$ for which $\mathbf{R}(i,i+1) \geqslant \rho$, combine $p_i$ and $p_{i+1}$ into a single peakbin and go to Step 1.
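
For reference, the entry $\mathbf{R}(i,i+1)$ used in Steps 2 and 3 can be written out explicitly. The expression below assumes that the linear correlation is the standard Pearson coefficient computed across the $n$ spectra; the symbols $n$ and $\bar{v}_{i}$ (the mean of $v_{ij}$ over $j$) are introduced here only for illustration.

$$
\mathbf{R}(i,i+1) = \frac{\sum_{j=1}^{n} (v_{ij}-\bar{v}_{i})\,(v_{i+1,j}-\bar{v}_{i+1})}{\sqrt{\sum_{j=1}^{n} (v_{ij}-\bar{v}_{i})^{2}}\;\sqrt{\sum_{j=1}^{n} (v_{i+1,j}-\bar{v}_{i+1})^{2}}},
\qquad \bar{v}_{i} = \frac{1}{n}\sum_{j=1}^{n} v_{ij}.
$$

Two consecutive peakbins are merged when this value reaches $\rho$, that is, when their intensities rise and fall together across the spectra.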

To overcome this artificial shifting, we propose assembling peakbins of different widths. In this way, a set of close peaks on the $m/z$ axis across different spectra is clustered into the same peakbin if their intensity levels are similar. Classical clustering approaches have already been used to tackle this problem [2,24,10,21]. Instead, our preprocessing pipeline uses the Pearson linear correlation coefficient to group the peaks, because its computation time and memory demands are much lower. Peakbins are scanned recursively, and their signal values are quantified as the maximum value found in the bin [13]. The stopping criterion is met when no pair of consecutive peaks or peakbins shows a correlation value greater than or equal to a given threshold $\rho$. Figure 1.5 details the assembling algorithm. The output of this final preprocessing stage is thus a list of peakbins, each with a starting and an ending point on the $m/z$ axis, coupled with the maximum signal value within each spectrum.
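
The assembling procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pearson`, `quantify`, `assemble_peakbins`), the representation of a peakbin as an inclusive pair of $m/z$ indices, the left-to-right merging of non-overlapping pairs within one pass, and the convention that a constant intensity vector has correlation 0 are all assumptions made for this example. The quantification $f(p_i, s_j)$ is taken as the maximum value in the bin, as described above.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two vectors; 0.0 if either is constant (assumed convention)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def quantify(spectra, bins):
    """V[i, j] = maximum intensity of spectrum j inside peakbin i (inclusive index range)."""
    return np.array([spectra[:, lo:hi + 1].max(axis=1) for lo, hi in bins])

def assemble_peakbins(spectra, peak_idx, rho):
    """Merge consecutive peakbins until no adjacent pair correlates at rho or above.

    spectra  : array of shape (n_spectra, n_mz_points)
    peak_idx : m/z indices of the detected peaks
    rho      : minimum correlation required to merge two consecutive peakbins
    """
    bins = [(int(i), int(i)) for i in sorted(peak_idx)]  # each peak starts as its own bin
    while True:
        V = quantify(spectra, bins)                      # Step 1
        r = [pearson(V[i], V[i + 1]) for i in range(len(bins) - 1)]  # Step 2
        if all(x < rho for x in r):                      # Step 3: stopping criterion
            return bins, V
        merged, i = [], 0
        while i < len(bins):
            if i < len(bins) - 1 and r[i] >= rho:
                merged.append((bins[i][0], bins[i + 1][1]))  # combine p_i and p_{i+1}
                i += 2
            else:
                merged.append(bins[i])
                i += 1
        bins = merged                                    # go back to Step 1

# Toy data: 4 spectra, 6 m/z points. Points 1 and 2 belong to the same
# (slightly shifted) signal; point 4 is an unrelated compound.
spectra = np.array([
    [0.0, 1.0, 0.9, 0.0, 4.0, 0.0],
    [0.0, 2.0, 2.1, 0.0, 1.0, 0.0],
    [0.0, 3.0, 2.8, 0.0, 3.0, 0.0],
    [0.0, 4.0, 4.2, 0.0, 2.0, 0.0],
])
bins, V = assemble_peakbins(spectra, peak_idx=[1, 2, 4], rho=0.9)
# bins -> [(1, 2), (4, 4)]
```

In this toy case the peaks at indices 1 and 2 vary together across the four spectra and are merged into one peakbin, while the peak at index 4 stays separate. Each pass either merges at least one pair or stops, so the loop terminates after at most as many passes as there are initial peaks.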