Peakbin assembly and quantification

There is no fixed order in which this task and the previous one (peak detection) should be performed: peak alignment may be followed by peak detection [5], or vice versa [21]. Peak (or spectra) alignment tries to match similar peaks detected across all the spectra. Again, due to measurement-induced noise, the exact value of a peak can differ from one spectrum to another, and there may be slight deviations or shifts between runs, even when analyzing the same sample. This shift is widely known as the mass error effect [19]. In order to align the peaks, each spectrum is modified by shifting the signal until the peaks match.

These modifications may introduce signal shifts and potentially hide isotopic formations or very close compounds. Moreover, this effect is more likely when dealing with very complex mixtures.


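Step 1. For each peak/peakbin and for each spectrum, compute the intensity value (the maximum signal value found within the bin).

Step 2. Compute the linear correlation matrix between the intensity values of each pair of peaks/peakbins.

Step 3. If no correlation value is greater than the threshold $\rho$, then return the peakbins and their intensity values. Else, for each pair of peaks/peakbins whose correlation value is greater than $\rho$, combine the two into a single peakbin. Go to Step 1.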

To overcome this artificial shifting, we propose to assemble peakbins of different widths. In this way, a set of close peaks on the axis across different spectra is clustered into the same peakbin if their intensity levels are similar. Classical clustering approaches have already been used to tackle this problem [2,24,10,21]. Instead, our preprocessing pipeline uses the Pearson linear correlation coefficient to group the peaks, as the computation time and memory demands are much lower. Peakbins are scanned recursively, and their signal values are quantified as the maximum value found in the bin [13]. The stopping criterion is met when no pair of peaks or peakbins shows a correlation value greater than a given threshold $\rho$. Figure 1.5 details the assembling algorithm. The output of this final preprocessing stage is thus a list of peakbins, each with a starting and ending point on the axis, coupled with the maximum signal value within each spectrum.
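As an illustration only, the assembling loop can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the pipeline's actual implementation. The names `quantify` and `assemble_peakbins` are hypothetical, as is the data layout: spectra stored as rows of a 2-D array, and peakbins given as inclusive, non-overlapping `(start, end)` index intervals on the axis. The sketch also assumes that only neighbouring peakbins are candidates for merging, which is one reading of "close peaks on the axis".

```python
import numpy as np


def quantify(spectra, bins):
    """Step 1: maximum intensity of every bin in every spectrum.

    Returns an array of shape (n_spectra, n_bins).
    """
    return np.column_stack(
        [spectra[:, start:end + 1].max(axis=1) for start, end in bins]
    )


def assemble_peakbins(spectra, bins, rho):
    """Merge neighbouring peakbins while their Pearson correlation exceeds rho.

    Assumptions (not from the source): `bins` holds inclusive, non-overlapping
    (start, end) index pairs, and only adjacent bins are compared.
    """
    bins = sorted(bins)
    while True:
        values = quantify(spectra, bins)
        if len(bins) < 2:
            return bins, values
        # Step 2: Pearson correlation matrix between bin intensity columns.
        # A constant column yields NaN, which is treated here as correlation 0.
        with np.errstate(invalid="ignore", divide="ignore"):
            corr = np.corrcoef(values, rowvar=False)
        neighbour_corr = np.nan_to_num(np.diag(corr, k=1))
        # Step 3: stop when no neighbouring pair is correlated above rho.
        if not (neighbour_corr > rho).any():
            return bins, values
        # Otherwise merge every correlated neighbouring pair, then repeat.
        merged = [bins[0]]
        for k in range(1, len(bins)):
            if neighbour_corr[k - 1] > rho:
                merged[-1] = (merged[-1][0], bins[k][1])
            else:
                merged.append(bins[k])
        bins = merged


# Toy data: 4 spectra, 6 axis points, 4 single-point peaks (made-up values).
demo_spectra = np.array([
    [1.0, 1.1, 0.0, 5.0, 0.0, 2.0],
    [2.0, 2.1, 0.0, 3.0, 0.0, 9.0],
    [3.0, 2.9, 0.0, 4.0, 0.0, 1.0],
    [4.0, 4.2, 0.0, 1.0, 0.0, 4.0],
])
demo_bins = [(0, 0), (1, 1), (3, 3), (5, 5)]
final_bins, final_values = assemble_peakbins(demo_spectra, demo_bins, rho=0.9)
print(final_bins)
print(final_values)
```

In this toy example the first two peaks vary together across the four spectra, so they end up in one wider peakbin, while the other two peaks stay separate. Each pass removes at least one peakbin when a merge occurs, so the loop always terminates.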