A coherent mathematical characterization of isotope trace extraction, isotopic envelope extraction, and LC-MS correspondence
Abstract
Background
Liquid chromatography-mass spectrometry is a popular technique for high-throughput protein, lipid, and metabolite comparative analysis. Such statistical comparison of millions of data points requires the generation of an inter-run correspondence. Though many techniques for generating this correspondence exist, few, if any, address certain well-known run-to-run LC-MS behaviors such as elution order swaps, unbounded retention time swaps, missing data, and significant differences in abundance. Moreover, not all extant correspondence methods leverage the rich discriminating information offered by isotope envelope extraction informed by isotope trace extraction. To date, no attempt has been made to create a formal generalization of extant algorithms for these problems.
Results
By enumerating extant objective functions for these problems, we elucidate discrepancies between known LC-MS data behavior and extant approaches. We propose novel objective functions that more closely model known LC-MS behavior.
Conclusions
Through instantiating the proposed objective functions in the form of novel algorithms, practitioners can more accurately capture the known behavior of isotope traces, isotopic envelopes, and replicate LC-MS data, ultimately providing for improved quantitative accuracy.
Keywords
Proteomics, Correspondence, Alignment, Lipidomics
Background
Liquid chromatography-mass spectrometry (LC-MS) is a popular technique for elucidating the composition of liquid samples. Data processing considerations are essential to accurately determine the identity of molecules (analytes such as lipids or peptides) contained in the sample (a process called identification), as well as their quantity in the sample (a process called quantification).
Information about sample quantity is captured directly in survey scans, or MS (aka MS1) data. Fragmentation spectra of one or more analytes constitute MS/MS (or MS2) data, and this information is typically used to corroborate or ascertain the identity of a molecule. Partitioning/clustering MS1 signal from complex samples and mapping the signal to other analyses (correspondence) is challenging. Some quantification strategies bypass these challenges by using information derived directly or indirectly from MS/MS data. These methods include spectral counting [1] and isobaric tags for relative and absolute quantification (iTRAQ) [2]. Though these methods have been successful, the amount of quantifiable signal embedded in MS1 data is estimated to far exceed what is currently available by MS/MS [3]; however, most MS1 data remains unused by current software. Hence, improving methods for partitioning and mapping MS1 signal stands to significantly (roughly 10-fold) increase the sensitivity of a typical label-free or isotope-labeling MS-omics experiment, both for experiments currently being run and for past experiments where raw data is still available.
Mass spectrometry data, in its raw form, is not ideal for isotope trace extraction or subsequent processing. After internally accumulating signal over discrete time slices, the mass spectrometer outputs raw data condensed into the form of many narrow profiles wherever signal is present. Conversion to centroid mode integrates the abundance of each of these profiles into a single tuple called a centroid. This is considered a routine conversion for which ample software is readily available. We adopt the typical convention of using centroid data.
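As a rough illustration of the conversion just described, the following Python sketch collapses one profile (a short run of m/z and intensity samples from a single scan) into a centroid tuple. The names `Centroid` and `centroid_profile`, the use of summed intensity as the abundance, and the intensity-weighted mean m/z are assumptions made for illustration only; available conversion software may use different peak models.

```python
from typing import NamedTuple, Sequence


class Centroid(NamedTuple):
    mz: float         # mu: mass-to-charge ratio
    rt: float         # tau: retention time
    abundance: float  # alpha: integrated signal


def centroid_profile(mzs: Sequence[float],
                     intensities: Sequence[float],
                     rt: float) -> Centroid:
    """Collapse one profile peak into a single centroid (illustrative only)."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("profile carries no signal")
    # Assumed model: intensity-weighted mean m/z, abundance = summed intensity.
    mean_mz = sum(m * i for m, i in zip(mzs, intensities)) / total
    return Centroid(mz=mean_mz, rt=rt, abundance=total)


# Hypothetical three-sample profile recorded at RT 12.5.
c = centroid_profile([500.00, 500.01, 500.02], [10.0, 80.0, 10.0], rt=12.5)
```

The resulting tuple has the (µ, τ, α) layout used for centroids throughout the rest of the text.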
Despite the ubiquity of LC-MS experiments, to the best of our knowledge, no concise, complete description of the LC-MS isotope trace and isotopic envelope extraction problems exists. Here, we describe constructs for isotope traces and isotopic envelopes, as well as formally describe the relationship of centroids, isotope traces and isotopic envelopes. In this context, we review extant objective functions for isotope trace extraction, isotopic envelope extraction, and correspondence. Finally, we propose novel objective functions for each of these tasks that address shortcomings in current approaches.
Results and discussion
Isotope trace extraction
The most important data processing step in a typical quantitative LC-MS pipeline is isotope trace extraction [4]. Clustering centroids into isotope traces is a nontrivial problem due to the many sources of noise affecting centroid mass and abundance. Sources of noise affecting centroids include chemistry effects due to chromatography, abundance inaccuracy due to ionization efficiencies, mass-to-charge ratio (m/z) deviation due to machine calibration, occlusion/adulteration of low-abundance signal due to dynamic range limitations, and compounded inaccuracies in m/z and abundance due to centroid construction. Of course, these complications are propagated from the clustering of isotope traces to the clustering of isotopic envelopes to the identification of cross-experiment correspondence.
A centroid is denoted as c = (µ, τ, α) where µ, τ, α are values for m/z, retention time (RT), and abundance, respectively. A single MS run produces a set of centroids $C={\left\{{c}_{i}\right\}}_{i=0}^{n}$, where n can readily reach into the millions.
where c^{ α } is the abundance of centroid c and c^{ µ } is the m/z of centroid c.
Note that the behavior of isotope traces is dependent on all three MS dimensions, although many common approaches to isotope trace extraction ignore one or more of these dimensions. For example, most proprietary MS software uses hard m/z bins for isotope trace extraction.
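To make the limitation of hard m/z bins concrete, the sketch below bins centroids by m/z alone, ignoring RT and abundance. The function name `hard_bin`, the bin width, and the example values are hypothetical and are not taken from any particular software package.

```python
import math
from collections import defaultdict


def hard_bin(centroids, bin_width):
    """Assign (mz, rt, abundance) centroids to fixed-width m/z bins."""
    bins = defaultdict(list)
    for mz, rt, abundance in centroids:
        bins[math.floor(mz / bin_width)].append((mz, rt, abundance))
    return dict(bins)


# Three consecutive centroids that differ by well under 1 ppm in m/z.
centroids = [
    (500.0099, 10.0, 1.0e5),
    (500.0101, 10.5, 1.2e5),
    (500.0098, 11.0, 0.9e5),
]
bins = hard_bin(centroids, bin_width=0.01)
```

Here the three centroids land in two different bins because a bin edge falls at m/z 500.01, even though they plausibly belong to a single isotope trace. This is the kind of artifact that an extraction method using all three dimensions can avoid.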
Extant objective functions
The prominent algorithms for isotope trace extraction include centWave [5], MatchedFilter [5], centroidPicker [6], massifquant [7], and MaxQuant [8].
with isotope trace-specific scaling parameter b_{ F } and translation parameter t_{ F } chosen to maximize the convolutional fit over isotope trace F .
for some intensity threshold θ and centroid distance function δ_{ c }, resulting in G being composed of one or more connected components, each considered one isotope trace. Thus, $\mathcal{F}=\left\{F_{i} \mid \forall c_{k}\in F_{i},\ \exists_{c_{l}\in F_{i}}\left\{c_{l}\in \Upsilon\left(c_{k}\right)\right\}\right\}$, where the neighborhood function ϒ(c) returns the set of nodes connected to c (and is symmetric because G is undirected).
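The graph-based formulation above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any cited tool: the distance function `delta_c` (a scaled Euclidean distance in m/z and RT), the cutoff `max_dist`, the choice to apply θ as a node filter, and all numeric values are hypothetical.

```python
from itertools import combinations


def delta_c(a, b, mz_scale=0.01, rt_scale=1.0):
    """Hypothetical centroid distance: scaled Euclidean distance in (m/z, RT)."""
    return (((a[0] - b[0]) / mz_scale) ** 2
            + ((a[1] - b[1]) / rt_scale) ** 2) ** 0.5


def connected_traces(centroids, theta, max_dist):
    """Group (mz, rt, abundance) centroids into connected components of G."""
    nodes = [c for c in centroids if c[2] >= theta]  # assumed use of theta
    parent = list(range(len(nodes)))                 # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]            # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(nodes)), 2):
        if delta_c(nodes[i], nodes[j]) <= max_dist:  # undirected edge
            parent[find(i)] = find(j)

    traces = {}
    for i, c in enumerate(nodes):
        traces.setdefault(find(i), []).append(c)
    return list(traces.values())


centroids = [
    (500.000, 10.0, 1.0e4),
    (500.002, 10.5, 2.0e4),
    (500.001, 11.0, 1.5e4),
    (620.300, 10.2, 3.0e4),
    (500.001, 10.7, 50.0),   # below theta, so excluded from G
]
traces = connected_traces(centroids, theta=1e3, max_dist=1.0)
```

In this example the first and third centroids are not directly connected, but both connect to the second, so all three form one component. The pairwise loop is quadratic in the number of centroids; for runs with millions of centroids a spatial index would be needed.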
The objective functions for massifquant and MaxQuant define $\mathcal{F}$ as the set of all F formed by iterating over values of time t, and adding c if c^{ τ } = t and $\left|c^{\mu}-c_{*}^{\mu}\right|<\epsilon$, where c_{∗} ∈ F and $c^{\tau}-c_{*}^{\tau}\le c^{\tau}-c_{j}^{\tau}$ for all c_{ j } ∈ F. For massifquant, the tolerance ϵ is prescribed by a Kalman filter induced from the variance in c^{ µ } and c^{ α } for all c_{ j } ∈ F such that $c_{j}^{\tau}<t$, with the added constraint that c^{ τ } be unique in F . MaxQuant defines ϵ simply as a distance threshold of 7 ppm m/z.
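The time-ordered construction described above can be sketched with a fixed ppm tolerance. This is a simplified illustration rather than the MaxQuant or massifquant implementation: the function name `extend_traces` is hypothetical, the tolerance is a constant rather than a Kalman filter estimate, the unique-RT constraint is borrowed from the massifquant description, and traces are never closed after a gap in time.

```python
def extend_traces(centroids, ppm_tol=7.0):
    """Greedy, time-ordered trace building from (mz, rt, abundance) centroids."""
    traces = []
    for c in sorted(centroids, key=lambda c: c[1]):   # iterate over time t
        mz, rt, _ = c
        best = None
        for trace in traces:
            last_mz, last_rt, _ = trace[-1]           # c_*: latest centroid in F
            if last_rt == rt:
                continue                              # keep RT unique within F
            ppm = abs(mz - last_mz) / last_mz * 1e6
            if ppm < ppm_tol and (best is None or ppm < best[0]):
                best = (ppm, trace)
        if best is None:
            traces.append([c])                        # start a new trace
        else:
            best[1].append(c)                         # extend the closest trace
    return traces


centroids = [
    (500.0000, 1.0, 1.0e4),
    (500.0010, 2.0, 2.0e4),
    (500.0500, 2.0, 5.0e3),   # about 100 ppm away: separate trace
    (500.0020, 3.0, 1.0e4),
]
traces = extend_traces(centroids)
```

Each accepted centroid is within 7 ppm of the most recent centroid of its trace, so the m/z of a trace may drift over time without violating the tolerance at any single step.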
Proposed objective functions
where, again, centroid clustering $\mathcal{F}$ and retention time means F^{ t } are chosen to minimize the Gaussian fit error; however, rather than using a single global variance in the RT dimension, each isotope trace F has a local variance σ_{ F }; in addition, the scaling factors have become time-dependent scalar functions b_{ F }(·). The second Gaussian factor, parameterized by mean F^{ µ } and variance function h(·), models the m/z width of the isotope trace, which is a function of the abundance α. Isotope traces splay at low abundance and narrow at high abundance; thus, both the variance h(·) and the scaling factors a_{ F }(·) are modeled as functions dependent on the abundance α. Note that while variance is trace-independent (depending only on abundance), each isotope trace has its own scaling function (which in turn is dependent on abundance).
Alleviating current limitations in isotope trace extraction
Current objective functions for isotope trace extraction fail to capture the isotope trace behavior formalized in this section: namely, a pattern of centroids forming a generally tight distribution through time around a specific m/z, with variation occurring as a factor of abundance, with normal abundance traces splaying at the beginning and end of elution, and lower abundance traces displaying high m/z variance in general. Moreover, isotope traces are skewed in time, with a sharp onset of intensity followed by a long post-peak tail. The shape of traces is almost never strictly Gaussian (or even symmetric), as chromatography almost always deviates from the Gaussian in heading (which is steeper) and in tailing (which is less steep). Our objective functions account for each of these behaviors.
Isotopic envelope extraction
In other words, 1) all centroids are assigned to an isotope trace; 2) isotope traces can't share centroids. Because any sensor's detection of a physical system will deviate somewhat from the true physical system, we can expect MS detections to contain extraneous centroids. However, all signal ought to be accounted for (even if some identified "traces" eventually are identified as noise) and, in a platonic model, ought to be assigned to an isotope trace.
The choice of partitions φ and ψ is guided by a set of distance functions Δ that define distances between centroids, isotope traces, isotopic envelopes, etc. and objective functions λ_{ F } and λ_{ E } that describe "good" isotope traces and isotopic envelopes, respectively. The choice of distance and objective functions, along with choice of optimization procedure, characterizes an algorithmic approach for solving this clustering problem. A defining general property of isotopic envelopes, however, is the regular spacing between component isotope traces. In addition, for virtually all molecules from biological sources we expect that if there is an isotope with index j and an isotope with index j + 2, then there exists an isotope with index j + 1.
where $\tilde{m}$ is the uncharged molecular weight of the ion.
Every isotope trace consists of signal from at least one isotopic envelope, and, in the case of overlapping isotopic envelopes, an isotope trace may be composed of signal from more than one isotopic envelope.
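The regular spacing and contiguity properties described above can be checked with a short sketch. The constants are the 13C/12C mass difference and the proton mass; treating the isotope spacing as exactly that mass difference divided by the charge, assuming positive-mode ions carrying z protons, and the names `infer_charge` and `neutral_mass` are all assumptions made for illustration.

```python
ISOTOPE_SPACING = 1.0033548  # Da; 13C - 12C mass difference (assumed spacing)
PROTON_MASS = 1.00727647     # Da


def infer_charge(trace_mzs, max_charge=6, tol=0.005):
    """Return the charge z whose 1/z spacing explains every adjacent gap."""
    mzs = sorted(trace_mzs)
    if len(mzs) < 2:
        return None
    for z in range(1, max_charge + 1):
        expected = ISOTOPE_SPACING / z
        # Every adjacent gap must equal one isotope step for this charge.
        if all(abs((b - a) - expected) <= tol for a, b in zip(mzs, mzs[1:])):
            return z
    return None


def neutral_mass(monoisotopic_mz, z):
    """Uncharged mass, assuming a positive-mode ion carrying z protons."""
    return z * (monoisotopic_mz - PROTON_MASS)


z = infer_charge([650.3400, 650.8417, 651.3434])
mass = neutral_mass(650.3400, z)
no_fit = infer_charge([650.34, 650.60])  # gap matches no charge in 1..6
```

Note that spacing alone is ambiguous: a doubly charged envelope whose middle isotope trace is missing has the same gap as a singly charged envelope, which is one reason the contiguity expectation matters.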
Extant objective functions
where the G_{ E } compute a comparison between the (µ, τ, α) values for a centroid and the expected centroid values obtained from a heuristic isotopic envelope shape. Note that isotopic trace extraction is ignored.
where the notation c ∈^{ τ } F means that c ∈ F at time τ, E is the maximal intensity (instantaneous) isotopic envelope (at time τ), $\widehat{P}\left(\cdot \right)$ is the ratio of the intensity of isotope trace F (at time τ) to the total intensity of all isotope traces F ∈ E (at time τ), and P_{ m }(·) is the value of the Poisson distribution at c^{ µ }.
Proposed objective functions
where F^{ τ } could be defined analogously to Equation 7, could be the maximum intensity for isotopic trace F or could be some other reasonable definition for isotopic trace elution time.
We want to optimize ε and the z_{ E } so that λ_{ E } is minimized; that is, we want to find charge-state/isotopic-envelope pairs such that the errors in expected m/z and co-elution time are minimized.
The isotopic envelope extraction segment of the MaxQuant [8] algorithm is one of the possible instantiations of this objective function, though many possibilities exist for how to set the allowable m/z and RT error and how to generate the prerequisite list of isotope traces.
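One hypothetical way to score a candidate isotopic envelope in the spirit of the objective above is to sum a squared error in m/z spacing and a squared error in co-elution time, and then pick the charge state with the lowest score. The function names, the error scales `mz_scale` and `rt_scale`, and the use of the mean elution time as the reference are assumptions; the sketch optimizes only the charge for a fixed grouping of isotope traces, not the grouping itself.

```python
ISOTOPE_SPACING = 1.0033548  # Da; assumed spacing between adjacent isotopes


def envelope_error(traces, z, mz_scale=0.005, rt_scale=0.1):
    """Hypothetical lambda_E: m/z-gap error plus co-elution error.

    traces: list of (mz, elution_time) summaries, one per isotope trace.
    """
    traces = sorted(traces)
    expected_gap = ISOTOPE_SPACING / z
    mz_err = sum(((b[0] - a[0] - expected_gap) / mz_scale) ** 2
                 for a, b in zip(traces, traces[1:]))
    mean_rt = sum(t[1] for t in traces) / len(traces)
    rt_err = sum(((t[1] - mean_rt) / rt_scale) ** 2 for t in traces)
    return mz_err + rt_err


def best_charge(traces, charges=range(1, 7)):
    """Pick the charge state that minimizes the envelope error."""
    return min(charges, key=lambda z: envelope_error(traces, z))


candidate = [(650.3400, 30.10), (650.8417, 30.12), (651.3434, 30.09)]
z_best = best_charge(candidate)
```

For these three co-eluting traces, the gaps of roughly 0.5017 m/z fit a charge of 2 far better than any other charge in the tested range.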
Alleviating current limitations in isotopic envelope extraction
Isotopic envelopes are rich with data: the expectation of contiguous isotope traces with a uniform m/z charge gap, and similar maximal abundance across all isotope traces. Accounting for this behavior is not possible without adopting an isotope trace-centric approach to data extraction. Reliance upon maximal elution time alone, an approach that is susceptible to conflation with overlapping envelopes in complex samples, is not sensitive for envelopes of lower abundance, where maximal elution times are not pronounced. Moreover, by first finding the isotope traces, the exact m/z of each isotope trace can be calculated using a weighted average, alleviating the need for larger than theoretically justified isotope trace gaps, which will not be sensitive in complex samples with overlapping isotopic envelopes. Instead, the proposed objective functions leverage a precise and reliable m/z charge gap and adjacency of isotope traces along with maximal elution times, using all the information in the data.
Correspondence
The combination of noise from within one run (enumerated above) and noise from run to run makes LC-MS correspondence nontrivial. Run-to-run noise is most notable in retention time shifts, where an isotopic envelope appears at a different retention time or with a compressed or stretched RT length compared to another run.
The correspondence mapping should again optimize an objective function which, in turn, characterizes an algorithm choice for solving the correspondence problem.
Extant objective functions
where δ()^{ τ,µ } is a distance function defined over RT and m/z.
where D is the set of observed runs.
Proposed objective functions
In contrast to existing LC-MS correspondence objective functions, the objective functions suggested here use the entire isotopic envelope. This allows greater discrimination by using isotope trace quantity and spacing to match isotopic envelopes from different runs. This extra discrimination is essential given the amount of RT variance and (to a lesser degree) m/z variance present in the data.
Let R be a set of runs, each of which has an associated set of isotopic envelopes $\epsilon_{r}=\left\{E_{i}^{r}\right\}_{i=1}^{p_{r}}$, $1\le r\le \left|R\right|$, and let $\tilde{\epsilon}=\cup_{r}\epsilon_{r}$. We seek to find a binary equivalence relation ρ that induces a set of correspondence classes over $\tilde{\epsilon}$ that is reflexive (an envelope corresponds with itself), symmetric (if envelope E_{1} from run 1 corresponds with envelope E_{2} from run 2, then E_{2} also corresponds with E_{1}) and transitive (if envelope E_{1} from run 1 corresponds with envelope E_{2} from run 2 and envelope E_{2} corresponds with envelope E_{3} from run 3, then E_{1} corresponds with E_{3}); and if $\rho\left(E_{i}^{r},E_{j}^{s}\right)=\text{TRUE}$, then for k ≠ i, $\rho\left(E_{k}^{r},E_{j}^{s}\right)=\text{FALSE}$ and for k ≠ j, $\rho\left(E_{i}^{r},E_{k}^{s}\right)=\text{FALSE}$ (an envelope from one run may have 0 or 1 matches from any other run; note that due to reflexivity, this also means that two nonidentical envelopes from the same run never correspond).
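The structural constraints on ρ can be checked mechanically if the relation is represented by its correspondence classes. In the sketch below, each class is a set of hypothetical `(run_id, envelope_id)` pairs; representing ρ this way makes reflexivity, symmetry, and transitivity automatic, so only disjointness and the one-envelope-per-run condition need to be tested. The function name and data layout are assumptions.

```python
from collections import Counter


def is_valid_correspondence(classes):
    """Check that classes are disjoint and hold at most one envelope per run.

    classes: iterable of collections of (run_id, envelope_id) pairs.
    """
    seen = set()
    for cls in classes:
        cls = set(cls)
        runs = Counter(run for run, _ in cls)
        if any(n > 1 for n in runs.values()):
            return False          # two envelopes from one run correspond
        if seen & cls:
            return False          # an envelope appears in two classes
        seen |= cls
    return True


good = [{(1, "a"), (2, "b"), (3, "c")}, {(1, "d")}, {(2, "e"), (3, "f")}]
bad_same_run = [{(1, "a"), (1, "d"), (2, "b")}]
bad_overlap = [{(1, "a"), (2, "b")}, {(2, "b"), (3, "c")}]
```

In the `good` example, envelope `d` of run 1 forms a singleton class, which corresponds to an envelope with no match in any other run.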
The relation ρ should minimize the following quantities:

The difference in charge state between corresponding isotopic envelopes, ${\delta}_{charge}$.

The difference in m/z between isotope traces in corresponding isotopic envelopes, ${\delta}_{m{z}_{it}}$.

The difference in elution duration between isotope traces in corresponding isotopic envelopes, ${\delta}_{dur}$.

The difference in isotope abundance ratios between corresponding isotopic envelopes, ${\delta}_{ratio}$.

The difference in m/z between corresponding isotopic envelopes, ${\delta}_{m{z}_{ie}}$.

The number of singleton correspondence classes, ${\delta}_{orphan}$.

The difference in retention time between corresponding isotopic envelopes, ${\delta}_{rt}$.
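As one hypothetical way to combine the quantities listed above, the sketch below scores a pair of isotopic envelopes with a weighted sum of the pairwise differences and treats the orphan term as a fixed penalty for leaving an envelope unmatched. The dictionary layout, the weights, the penalty value, and the function names are all assumptions made for illustration, and each envelope is assumed to contain at least one isotope trace. The greedy `best_match` does not by itself enforce the one-to-one and transitivity constraints on the full relation.

```python
def pair_cost(e1, e2, weights):
    """Hypothetical weighted sum of the pairwise difference terms."""
    n = min(len(e1["trace_mzs"]), len(e2["trace_mzs"]))
    terms = {
        "charge": abs(e1["charge"] - e2["charge"]),
        "mz_it": sum(abs(a - b) for a, b in
                     zip(e1["trace_mzs"], e2["trace_mzs"])) / n,
        "dur": abs(e1["duration"] - e2["duration"]),
        "ratio": sum(abs(a - b) for a, b in
                     zip(e1["ratios"], e2["ratios"])) / n,
        "mz_ie": abs(e1["mz"] - e2["mz"]),
        "rt": abs(e1["rt"] - e2["rt"]),
    }
    return sum(weights[k] * v for k, v in terms.items())


def best_match(query, candidates, weights, orphan_penalty):
    """Return the lowest-cost candidate, or None if no match is cheaper."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: pair_cost(query, c, weights))
    return best if pair_cost(query, best, weights) < orphan_penalty else None


weights = {"charge": 10.0, "mz_it": 100.0, "dur": 1.0,
           "ratio": 5.0, "mz_ie": 100.0, "rt": 0.5}
query = {"charge": 2, "mz": 650.340, "rt": 30.1, "duration": 0.50,
         "trace_mzs": [650.340, 650.842, 651.343], "ratios": [0.55, 0.30, 0.15]}
near = {"charge": 2, "mz": 650.341, "rt": 31.0, "duration": 0.55,
        "trace_mzs": [650.341, 650.843, 651.344], "ratios": [0.53, 0.31, 0.16]}
far = {"charge": 3, "mz": 650.400, "rt": 30.2, "duration": 0.20,
       "trace_mzs": [650.400, 650.734], "ratios": [0.70, 0.30]}
match = best_match(query, [near, far], weights, orphan_penalty=5.0)
```

In this example `far` is closer to the query in retention time, but `near` is selected because its charge state, isotope trace m/z values, and abundance ratios agree. If only `far` were available, the query would be left unmatched.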
Alleviating current limitations in correspondence
Recently, several ubiquitous shortcomings were identified in a review of over 50 LC-MS correspondence algorithms [11]. The most significant of these shortcomings was the fact that all current LC-MS correspondence algorithms make model assumptions that fail to capture common behavior. In other words, each algorithm is constructed in such a way that the algorithm is guaranteed to get the wrong answer under certain conditions that are common to real LC-MS data. The behaviors discussed included the ideas that:

Not all analytes appear in all replicates.

Elution order can swap.

Shifts occur in m/z as well as in RT.
Some correspondence methods reduce isotopic envelopes to a single point representation. This deprives the method of a rich source of distinguishing data found in full isotopic envelopes: the expectation of contiguous isotope traces with a uniform m/z charge gap, the number of isotope traces, and the relative abundance ratio of isotope traces. Similarly, most correspondence algorithms conduct an initial RT alignment, where signals (almost always much reduced from the full isotopic envelope, and rarely built up from isotope traces to isotopic envelopes) are shifted up or down in RT (preserving original order) in order to most closely match a reference run. This is invariably followed by direct matching. The problem is that the initial warping is a lossy procedure that adulterates the original RT, which would be useful for probabilistically ascertaining the closest corresponding isotopic envelope.
The proposed objective function does not force matches between runs, as it is very common in differential studies for species either to be absent or to fall below the signal-to-noise threshold. Instead, the proposed objective function leverages the full breadth of isotopic envelope information, allowing a rigorous direct comparison of candidate correspondences based on all available data to select the most likely correspondence (in the sense of minimizing error), or no correspondence at all if that is the most likely case given the data.
Conclusions
We present a concise attempt to formalize LC-MS data clustering problems, describing the constructs of isotope traces and isotopic envelopes and their relational structure. We provide a review of current approaches to isotope trace extraction and LC-MS correspondence, and propose novel objective functions for both tasks that address shortcomings in current methods.
References
1. Choi H, Fermin D, Nesvizhskii AI: Significance analysis of spectral count data in label-free shotgun proteomics. Mol Cell Proteomics. 2008, 7 (12): 2373-2385. 10.1074/mcp.M800203-MCP200.
2. Wiese S, Reidegeld KA, Meyer HE, Warscheid B: Protein labeling by iTRAQ: a new tool for quantitative mass spectrometry in proteome research. Proteomics. 2007, 7 (3): 340-350. 10.1002/pmic.200600422.
3. Michalski A, Cox J, Mann M: More than 100,000 Detectable Peptide Species Elute in Single Shotgun Proteomics Runs but the Majority is Inaccessible to Data-Dependent LC-MS/MS. Journal of Proteome Research. 2011, 10: 1785-1793. 10.1021/pr101060v.
4. Cappadona S, Baker PR, Cutillas PR, Heck AJ, van Breukelen B: Current challenges in software solutions for mass spectrometry-based quantitative proteomics. Amino Acids. 2012, 43 (3): 1087-1108. 10.1007/s00726-012-1289-8.
5. Tautenhahn R, Bottcher C, Neumann S: Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics. 2008, 9 (1): 504. 10.1186/1471-2105-9-504.
6. Pluskal T, Castillo S, Villar-Briones A, Oresic M: MZmine 2: Modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics. 2010, 11 (1): 395. 10.1186/1471-2105-11-395.
7. Conley CJ, Smith R, Torgrip RJ, Taylor RM, Tautenhahn R, Prince JT: Massifquant: open-source Kalman filter-based XCMS isotope trace feature detection. Bioinformatics. 2014, 359.
8. Cox J, Mann M: MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology. 2008, 26 (12): 1367-1372. 10.1038/nbt.1511.
9. Weisser H, Nahnsen S, Grossmann J, Nilse L, Quandt A, Brauer H, Sturm M, Kenar E, Kohlbacher O, Aebersold R, et al: An automated pipeline for high-throughput label-free quantitative proteomics. Journal of Proteome Research. 2013.
10. Bellew M, Coram M, Fitzgibbon M, Igra M, Randolph T, Wang P, May D, Eng J, Fang R, Lin C, et al: A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics. 2006, 22 (15): 1902-1909. 10.1093/bioinformatics/btl276.
11. Smith R, Ventura D, Prince JT: LC-MS Alignment in Theory and Practice: A Comprehensive Algorithmic Review. Briefings in Bioinformatics. 2013.
12. Listgarten J, Neal RM, Roweis ST, Wong P, Emili A: Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics. 2007, 23 (2): 198-204. 10.1093/bioinformatics/btl553.
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.