Peak detection

The peak detection problem addressed here consists of identifying the $m/z$ positions that correspond to true peaks in the spectrum. Many biological studies define a true peak as one that is actually associated with a peptide in the biological sample [7]. This definition is useful when the peptide composition of the sample is known beforehand and the spectra are small. For more complicated mixtures (e.g. blood serum), however, the real peaks are generally unknown.

Therefore, our aim at this stage is to prescreen a large number of candidate peaks that can later be grouped into peakbins. Because the machine learning analysis applied afterwards will separate relevant from non-relevant peaks, including some artifacts at this early stage is not critical. The peak detection algorithm is therefore applied to each spectrum individually, yielding a list of candidate peaks per spectrum.

Even though peak detection is by far the hottest issue in the MS preprocessing field, there is agreement on only three conditions that a candidate peak must meet [2]:

  1. the peak must have higher intensity than its neighbours;
  2. the peak must be above a chosen threshold;
  3. the peak must have an associated signal-to-noise ratio (SNR) higher than a set threshold.
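
As a minimal illustration, the three conditions can be written as a single predicate on one point of a spectrum. The sketch below is not taken from [2]: the function name, the parameter names and the way the noise level is supplied (as a precomputed value `noise`) are assumptions made only for this example.

```python
import numpy as np

def meets_basic_conditions(intensity, i, min_intensity, noise, min_snr):
    """Check the three generic conditions for the point at index i.

    All names are hypothetical; `noise` is assumed to be a noise
    estimate computed elsewhere for the region around i.
    """
    y = np.asarray(intensity, dtype=float)
    # Condition 1: strictly higher than both immediate neighbours
    # (a missing neighbour at the spectrum boundary is ignored).
    left = y[i - 1] if i > 0 else -np.inf
    right = y[i + 1] if i < len(y) - 1 else -np.inf
    is_local_max = y[i] > left and y[i] > right
    # Condition 2: intensity above the chosen threshold.
    above_threshold = y[i] > min_intensity
    # Condition 3: signal-to-noise ratio above the set threshold.
    high_snr = noise > 0 and y[i] / noise > min_snr
    return bool(is_local_max and above_threshold and high_snr)
```

For the toy signal `[1, 2, 10, 2, 1]` with `min_intensity=5`, `noise=1` and `min_snr=3`, only the central point passes all three checks.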

We take the peak detection algorithm proposed in [13] as the starting point. Our algorithm follows the same top-down scheme, starting with the highest point of the overall signal and iteratively evaluating the lower points. To decide whether a point $p$ is a peak, we set a stricter criterion: there must exist a point $l$ on its left, before the previous peak, and a point $r$ on its right, before the next peak, such that two conditions hold. First, the value of the candidate point $p$ must be higher than a sensitivity threshold $T$; second, $p$ must have an SNR higher than or equal to 3 within the intensity window framed by $l$ and $r$.
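
The top-down scheme can be sketched as follows. This is an illustrative reading of the description above, not the implementation of [13] or of our algorithm, and it rests on several assumptions that the text does not fix: $l$ and $r$ are taken to be the lowest points between $p$ and the neighbouring accepted peaks (or the spectrum boundaries), a candidate must be a strict local maximum, the SNR is the point's height divided by the unscaled MAD of the window, and a window with zero MAD is treated as noise-free. All function and parameter names are hypothetical.

```python
import bisect
import numpy as np

def mad_snr(y, p, l, r):
    """Height of y[p] divided by the unscaled MAD of y[l..r] (assumed SNR)."""
    window = y[l:r + 1]
    mad = np.median(np.abs(window - np.median(window)))
    # Assumption: a window with zero MAD is treated as noise-free.
    return np.inf if mad == 0 else y[p] / mad

def detect_peaks_top_down(intensity, threshold, min_snr=3.0):
    """Return sorted indices of accepted peaks in one spectrum."""
    y = np.asarray(intensity, dtype=float)
    n = len(y)
    peaks = []  # accepted peak indices, kept sorted
    # Visit the points from the highest to the lowest intensity.
    for p in np.argsort(-y, kind="stable"):
        p = int(p)
        if y[p] <= threshold:
            break  # every remaining point is at or below T as well
        # Assumption: a candidate must be a strict local maximum.
        if (p > 0 and y[p - 1] >= y[p]) or (p < n - 1 and y[p + 1] >= y[p]):
            continue
        # Search range: between the previous and the next accepted peak.
        k = bisect.bisect_left(peaks, p)
        lo = peaks[k - 1] + 1 if k > 0 else 0
        hi = peaks[k] - 1 if k < len(peaks) else n - 1
        if lo >= p or hi <= p:
            continue  # no point available for l or for r
        # Assumption: l and r are the lowest points on each side of p.
        l = lo + int(np.argmin(y[lo:p]))
        r = p + 1 + int(np.argmin(y[p + 1:hi + 1]))
        if mad_snr(y, p, l, r) >= min_snr:
            bisect.insort(peaks, p)
    return peaks
```

Because points on the flank of an already accepted peak are not local maxima, they are skipped without being evaluated, so each accepted peak frames the search window of the lower candidates around it.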

The SNR of a signal window is computed as the ratio between the point's height and the median absolute deviation (MAD) of the window $[l,r]$ under consideration [20]. The requirement that the SNR be at least three is borrowed from the image analysis field and has been widely used in microarray quality metrics [4].
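
A minimal sketch of this SNR estimate is given below. It assumes that the point's height is its raw intensity (no baseline subtraction), that the MAD is the unscaled median of absolute deviations from the window median, and that a window with zero MAD yields an infinite SNR; the function name `window_snr` is hypothetical.

```python
import numpy as np

def window_snr(intensity, p, l, r):
    """SNR of the point at index p within the window [l, r] (inclusive)."""
    y = np.asarray(intensity, dtype=float)
    window = y[l:r + 1]
    # Median absolute deviation around the window median (unscaled).
    mad = np.median(np.abs(window - np.median(window)))
    if mad == 0:
        return np.inf  # assumption: no measurable noise in the window
    return y[p] / mad

# Example: a single high point in a low-amplitude background.
signal = [1, 2, 1, 2, 10, 2, 1, 2, 1]
snr = window_snr(signal, p=4, l=0, r=8)
is_candidate = snr >= 3
```

In this example the window median is 2 and the MAD is 1, so the central point has an SNR of 10 and satisfies the criterion.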

The main advantage of this peak detection algorithm is that it takes into account the individual characteristics of every spectrum, rather than evaluating an average spectrum that could hide independent features [5]. In addition, the spectra keep their original $m/z$ shape, obviating the need for a shifting or alignment process. On the downside, the computation time increases linearly with the number of spectra, since every spectrum is analysed separately.