An Automated, High-Throughput Method for Interpreting the Tandem Mass Spectra of Glycosaminoglycans
The biological interactions between glycosaminoglycans (GAGs) and other biomolecules are heavily influenced by structural features of the glycan. The structure of GAGs can be assigned using tandem mass spectrometry (MS2), but analysis of these data, to date, requires manual interpretation, a slow process that presents a bottleneck to the broader deployment of this approach to solving biologically relevant problems. Automated interpretation remains a challenge, as GAG biosynthesis is not template-driven, and therefore, one cannot predict structures from genomic data, as is done with proteins. The lack of a structure database, a consequence of the non-template biosynthesis, requires a de novo approach to interpretation of the mass spectral data. We propose a model for rapid, high-throughput GAG analysis by using an approach in which candidate structures are scored for the likelihood that they would produce the features observed in the mass spectrum. To make this approach tractable, a genetic algorithm is used to greatly reduce the search space of isomeric structures that are considered. The time required for analysis is significantly reduced compared to an approach in which every possible isomer is considered and scored. The model is coded in a software package using the MATLAB environment. This approach was tested on tandem mass spectrometry data for long-chain, moderately sulfated chondroitin sulfate oligomers that were derived from the proteoglycan bikunin. The bikunin data were previously interpreted manually. Our approach examines glycosidic fragments to localize SO3 modifications to specific residues and yields the same structures reported in the literature, only much more quickly.
Keywords: Glycosaminoglycan; Fourier transform mass spectrometry; Tandem mass spectrometry; Automated interpretation
Glycosaminoglycans (GAGs) are linear, polydisperse carbohydrates consisting of a repeating uronic sugar and amino sugar copolymer. GAGs serve a multitude of roles in biology including cell-cell and cell-matrix interactions, generation of energy, changes in protein binding conformation, and molecular recognition [1, 2, 3]. Certain GAGs have also been observed as potential biomarkers for disease states. The degree of GAG-protein binding has been shown to be highly dependent on GAG structure and, more specifically, on the position of modifications within the generic repeating copolymer chain [5, 6].
Despite the simple polymeric backbone in GAGs, a single sugar residue can exhibit varying levels of three key modifications, namely O-sulfation, N-deacetylation/sulfation, and uronic sugar stereochemistry. Moreover, the biosynthesis of GAGs is not template driven, resulting in non-uniform dispersion of these modifications across the chain [7, 8]. Database-derived approaches are widely used for protein mass spectra assignment (either top-down or bottom-up) due to the predictability of amino acid sequences from genome sequences but fail when applied to biomolecules whose production is not template-derived [9, 10]. In contrast to the approaches that are successful for protein/peptide analysis, a de novo approach is required for the computer-based analysis of the tandem mass spectra of GAGs.
Considerable progress has been made in GAG analysis using mass spectrometry [1, 11]. At the MS1 level, accurate mass measurement at the parts-per-million level, using high-resolution instruments such as Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS), allows assignment of composition, from which GAG chain length and the number and types of modifications can be determined. Tandem MS (MS2) of GAGs using various ion activation methods, such as collision-induced dissociation (CID) [13, 14, 15], infrared multiphoton dissociation [16, 17, 18, 19], electron-detachment dissociation (EDD) [16, 18, 19, 20, 21, 22, 23, 24], and negative-electron transfer dissociation (NETD) [25, 26, 27], yields structurally informative fragment ions. Glycosidic bond fragmentation provides monosaccharide composition, while cross-ring fragmentation is used to assign the location of modifications within each residue. Because this is a de novo analytical approach, complete structure analysis requires an information-rich mass spectrum that contains sufficient fragment peaks to fully assign all the variable features. Recent developments in ion activation for GAGs have led to a variety of approaches to produce informative MS2 spectra [21, 23, 28, 30]. However, the interpretation of such complex mass spectra is generally a tedious manual process that relies upon the expertise of the data analyst. A better understanding of the structural features that promote GAG activity would benefit from an automated, accurate, and high-throughput analytical process.
The complexity of the data sets and the time required for analysis increase dramatically as the chain length and the number of modifications increase. Two families of GAGs, heparin/heparan sulfate (Hp/HS) and chondroitin/dermatan sulfate (CS/DS), often contain large numbers of labile sulfate modifications. For these compounds, conventional MS2 methods are often inadequate for complete structural determination, either because they do not produce a comprehensive set of fragment ions required to assign all variable features or because they lead to decomposition products that confound the analysis [8, 31]. For example, fragmentation can be accompanied by decomposition of sulfomodifications, producing peaks that are reduced in mass by multiples of 80 mass units but match the mass of standard glycosidic fragments of their counterparts with fewer sulfate modifications [28, 32]. If one does not recognize the peaks that arise from such decomposition, incorrect structural assignments will result. Common de novo strategies that have been successful for protein sequencing [25, 33, 34, 35] will inevitably be exposed to substantially more false positives due to the high likelihood of SO3 loss fragments in GAG MS and MS2. Na+/H+ exchange has been shown to decrease SO3 loss and makes characterization of highly sulfated species possible; however, SO3 loss is almost always observed in MS2 spectra.
Tools for comparison of user-input structures with fragment peaks from tandem MS have been developed [12, 36, 37], but the requirement for a known starting structure limits applicability for high-throughput analysis.
To address this bottleneck for high-throughput sequencing of GAGs, computer-assisted methods aim to improve the speed of analysis and to reduce the amount of user input and supervision. Several software packages have been developed to overcome current challenges in GAG analysis, although a few require additional steps at the experimental level for optimal software performance. The heparin/HS oligosaccharide sequencing tool (HOST) is a computational tool designed for sequencing heparin/HS oligosaccharides using enzymatic digestion combined with ESI-MSn. The method scores and returns the best matching sequences of GAGs based on disaccharide composition analysis, yielding predicted compositions and calculating expected fragmentation patterns in silico. The theoretical fragments are then compared to heparin/HS oligosaccharide MSn data and scored to return the most likely sequence. However, disaccharide analysis requires complete enzymatic digestion of the GAG using heparin lyases I, II, and III over multiple hours of incubation (16 h), limiting the method’s overall speed and applicability in a high-throughput GAG analysis platform.
Another piece of software, known as GAG-ID, has been shown to discriminate and identify 21 synthetic tetrasaccharides eluted from LC-MS/MS using a scoring system based on peak intensities. It is the first of its kind to automate the interpretation of mixtures when coupled to LC-MS/MS but requires complete chemical derivatization of the GAG, replacing all labile sulfate modifications with more stable acetyl groups. As with the enzymatic digestion required by HOST, derivatization may not be a viable option for universal GAG analysis.
HS-SEQ is a de novo GAG sequencing computational framework that has been used to automate the structural identification of HS of dp4, 5, 6, 8, and 15. The method determines a precursor sequence (unmodified GAG backbone) and uses information from the tandem MS to best assign possible sulfate and acetate modifications. Assignments are made based on confidence values and are used to generate a list of top candidates. This is the first GAG software that requires only the tandem MS for sequence information. While HS-SEQ is certainly a high-throughput option, structural assignment conflicts can arise from sulfate loss fragments, internal fragments, or random matches. The authors of HS-SEQ note that the software resolves conflicting assignments by removing those with lower confidence, and they acknowledge that this may produce false hits when examining samples extracted from biological sources.
The software developed in our laboratory is designed to sequence GAGs of indefinite length by comparing fragments of theoretical structures (in silico) against experimental data without the need for construction of a database, instead using a genetic algorithm optimization technique to limit the number of permutations while keeping analysis time to a maximum of a few minutes. The method assigns structures based on greatest likelihood using fragment ion products as a critical parameter for the genetic algorithm fitness criterion. Fragments that are in direct conflict with the highest scoring structure(s) are not discarded but reviewed again for possible additional components. We have tested this approach on MS2 data from intact CS chains released from the proteoglycan, bikunin. These chains vary in length from 27 to 43 saccharide residues, and vary in the degree of O-sulfomodification from 4 to 7, and thus represent a challenging test of this automated procedure.
Mass Spectrometry Analysis
Previously reported bikunin GAG MS and MS2 data were used as a proof-of-principle data set for testing the efficacy of the genetic algorithm. The monoisotopic peaks were selected via the SNAP algorithm from Bruker DataAnalysis software. Analysis of the MS2 data was performed with the software alone and with no user supervision or assistance.
MS1 analysis of parent ion mass is performed using a composition assignment software module written in the MATLAB coding environment. Monoisotopic peaks and charge states are acquired from Bruker DataAnalysis and deconvoluted to a neutral mass. A composition is derived from one or more neutral mass(es) by searching a data matrix of possible chain lengths, degrees of sulfation, deacetylation, and sodium/hydrogen exchange. The user input also includes the possibility of reducing end modifications, and nonreducing ends that can terminate in unsaturated uronic acids, as is common in enzymatically produced GAG oligomers. Theoretical neutral masses in the data matrix are compared against user-specified masses with a user-defined mass tolerance. The matching compositions are then used for performing the MS2 analysis.
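The deconvolution and tolerance-matching steps described above can be sketched as follows. This is a minimal illustration in Python, not the MATLAB module itself. It assumes negative-mode ions of the form [M − zH]z−, a chain with equal numbers of hexuronic acid and N-acetylhexosamine residues, and an illustrative cap of two sulfates per disaccharide. The function names and the `end_mass` parameter are hypothetical, and sodium/hydrogen exchange, deacetylation, and unsaturated termini are omitted for brevity.

```python
# Monoisotopic masses in Da (standard values; residue masses exclude water).
PROTON = 1.007276
HEXA = 176.03209      # hexuronic acid residue, C6H8O6
HEXNAC = 203.07937    # N-acetylhexosamine residue, C8H13NO5
SO3 = 79.95682
WATER = 18.010565

def neutral_mass(mz, z):
    """Deconvolute an [M - zH]^z- ion (z given as a positive count) to neutral mass."""
    return z * (mz + PROTON)

def theoretical_mass(n_hexa, n_hexnac, n_so3, end_mass=WATER):
    """Neutral mass of a chain; end_mass is water for a free oligosaccharide
    or the mass of a reducing-end group such as a linker (assumption)."""
    return n_hexa * HEXA + n_hexnac * HEXNAC + n_so3 * SO3 + end_mass

def assign_composition(mass, max_disaccharides=25, tol_ppm=5.0, end_mass=WATER):
    """Search a grid of chain lengths and sulfate counts for matches within tol_ppm."""
    hits = []
    for n_di in range(1, max_disaccharides + 1):
        for n_so3 in range(0, 2 * n_di + 1):  # illustrative cap: 2 sulfates per disaccharide
            theo = theoretical_mass(n_di, n_di, n_so3, end_mass)
            err_ppm = (mass - theo) / theo * 1e6
            if abs(err_ppm) <= tol_ppm:
                hits.append((n_di, n_so3, round(err_ppm, 2)))
    return hits
```

Each hit is a tuple of (number of disaccharides, number of sulfates, mass error in ppm). A real composition search would extend the grid with the additional variables listed in the text.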
For MS2 assignment, we implement a genetic algorithm based on fundamental aspects common to all genetic algorithms [42, 43, 44]. The software uses a binary vector to represent glycan structures, where on-bits denote an occupied site of SO3 modification. The first step generates two glycan structures at random that fit the expected composition (initialization step) and then proceeds to “breed” these structures into a new generation of candidates (crossover step). The new generation is also subject to potential mutations in structure in the form of exchanges between on- and off-bits (mutation step) in an effort to avoid converging upon a local maximum. Theoretical structures created in the crossover and mutation steps are then tested against the experimental MS2 data, where the score of each structure is determined based on a closeness-of-fit paradigm (fitness). The scoring system is subject to various factors that will be discussed in detail in future papers. In the case of bikunin, the score of a structure is a naïve model that determines the top candidate based on the number of matching glycosidic fragments. The primary three steps (crossover, mutation, and fitness) are iterated until the maximum fitness value does not change over a set number of cycles. The number of unchanged iterations required before termination of the algorithm can be defined by the user but defaults to 3. The structure(s) with the highest scores are then examined using additional data interpretation tools that assign fragment peak masses alongside their charge, intensity, and mass error (in ppm).
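The initialization, crossover, mutation, and fitness cycle can be sketched as below. This is a generic illustration in Python under stated assumptions, not the MATLAB implementation. The brood size, the single-point crossover with a repair step that restores the sulfate count, the mutation rate, and all function names are hypothetical choices. The `fitness` argument stands in for the scoring of a candidate against MS2 data, and `patience` plays the role of the user-defined termination count (default 3).

```python
import random

def random_structure(n_sites, n_so3, rng):
    """Random binary vector with exactly n_so3 on-bits (initialization step)."""
    bits = [1] * n_so3 + [0] * (n_sites - n_so3)
    rng.shuffle(bits)
    return tuple(bits)

def repair(bits, n_so3, rng):
    """Flip randomly chosen bits until the vector again has n_so3 on-bits."""
    bits = list(bits)
    while sum(bits) != n_so3:
        want = 0 if sum(bits) > n_so3 else 1
        idx = rng.choice([i for i, b in enumerate(bits) if b != want])
        bits[idx] = want
    return tuple(bits)

def crossover(a, b, n_so3, rng):
    """Single-point crossover followed by repair to preserve composition."""
    cut = rng.randrange(1, len(a))  # requires at least two sites
    return repair(a[:cut] + b[cut:], n_so3, rng)

def mutate(bits, rng, rate=0.5):
    """With probability `rate`, exchange one on-bit with one off-bit."""
    bits = list(bits)
    on = [i for i, b in enumerate(bits) if b]
    off = [i for i, b in enumerate(bits) if not b]
    if on and off and rng.random() < rate:
        bits[rng.choice(on)] = 0
        bits[rng.choice(off)] = 1
    return tuple(bits)

def genetic_search(n_sites, n_so3, fitness, brood=20, patience=3,
                   seed=0, max_generations=1000):
    """Iterate crossover, mutation, and scoring until the best score has not
    improved for `patience` consecutive generations."""
    rng = random.Random(seed)
    parents = [random_structure(n_sites, n_so3, rng) for _ in range(2)]
    best = max(parents, key=fitness)
    stale = 0
    for _ in range(max_generations):
        if stale >= patience:
            break
        children = [mutate(crossover(parents[0], parents[1], n_so3, rng), rng)
                    for _ in range(brood)]
        ranked = sorted(set(children) | set(parents), key=fitness, reverse=True)
        parents = ranked[:2] if len(ranked) >= 2 else ranked * 2
        if fitness(ranked[0]) > fitness(best):
            best, stale = ranked[0], 0
        else:
            stale += 1
    return best
```

The mutation here swaps an on-bit with an off-bit rather than flipping a single bit, so every candidate keeps the sulfate count fixed by the MS1 composition and no score is spent on structures that cannot match the precursor.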
Experimental MS2 data collected by FTICR are extracted from Bruker Apex user interface software using the SNAP peak-picking algorithm. Monoisotopic peak masses and intensities are extracted in the form of comma-separated value (.csv) files. The MATLAB software prompts the user for a .csv file containing mass-to-charge in column 1 and intensity in column 2, with mass-to-charge sorted in ascending order. Parent ion mass and charge must be provided by the user, as well as the mass of any linker region on the reducing end. Composition details (chain length and the numbers of sulfate modifications, N-acetyl groups, and Na-H exchanges) are calculated by a composition calculation module and then given to the software in the preliminary step before initializing the genetic algorithm.
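The expected input layout can be illustrated with a short reader. This is a hypothetical Python sketch of the file format described above (mass-to-charge in column 1, intensity in column 2, ascending mass-to-charge), not the MATLAB routine. The function name and the choice to skip non-numeric rows such as a header are assumptions.

```python
import csv
import io

def load_peak_list(handle):
    """Read (m/z, intensity) pairs from an open CSV file handle."""
    peaks = []
    for row in csv.reader(handle):
        try:
            mz, intensity = float(row[0]), float(row[1])
        except (ValueError, IndexError):
            continue  # skip a header line, blank line, or malformed row
        peaks.append((mz, intensity))
    mz_values = [mz for mz, _ in peaks]
    if mz_values != sorted(mz_values):
        raise ValueError("m/z values must be sorted in ascending order")
    return peaks

# Example with an in-memory file standing in for an exported .csv
example = io.StringIO("m/z,intensity\n300.1,1000\n450.2,2500\n")
peaks = load_peak_list(example)
```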
For bikunin proteoglycan a linker mass of 641.1473 (Gal4S-Gal-Xyl-Serine) was used with the remainder of the bikunin chain length represented as a binary vector.
The software integrates separate functional modules that calculate the masses of theoretical fragment ions, perform the standard genetic algorithm operations, and score theoretical structures against experimental data.
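As an illustration of the fragment mass module, the following Python sketch builds a ladder of reducing-end (Y-type) neutral fragment masses from a binary sulfation vector and the bikunin linker mass of 641.1473 quoted above. Several points are assumptions made only for this example: residues are listed from the reducing end outward, the first residue after the linker is a hexuronic acid alternating with N-acetylhexosamine, and one sulfation bit is allotted per residue. The function names are hypothetical.

```python
# Monoisotopic masses in Da (standard values; residue masses exclude water).
HEXA = 176.03209      # hexuronic acid residue, C6H8O6
HEXNAC = 203.07937    # N-acetylhexosamine residue, C8H13NO5
SO3 = 79.95682
PROTON = 1.007276
LINKER = 641.1473     # Gal4S-Gal-Xyl-Serine linker mass from the text

def y_fragment_masses(sulfation_bits, linker=LINKER):
    """Neutral masses of successive reducing-end fragments.

    sulfation_bits[i] is 1 if the i-th residue from the reducing end
    carries an SO3 group (one bit per residue is an assumption).
    """
    masses = []
    running = linker
    for i, bit in enumerate(sulfation_bits):
        running += HEXA if i % 2 == 0 else HEXNAC
        running += SO3 * bit
        masses.append(running)
    return masses

def to_mz(neutral, z):
    """m/z of the [M - zH]^z- ion for a neutral mass (z as a positive count)."""
    return (neutral - z * PROTON) / z
```

Because the linker mass is carried through every Y-type fragment, any peak in this ladder is tagged by the reducing end, which is the property the scoring scheme relies on when it weights reducing-end fragments most heavily.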
Results and Discussion
As GAG chain length and the number of modifications increase, the number of possible structural permutations exceeds a value suitable for practical, computationally efficient search methods. For the chondroitin sulfate oligomers studied here, the number of structural possibilities is as large as 3.7E22 for an oligomer of length 50 (Eq. (2)). The number of possibilities is narrowed down when composition can be assigned and the number of sulfate modifications is known. While the paradigm for comparing theoretical structures against experimental data can differ, a minimum number of elements such as fragment type, fragment intensity, and sequence coverage must be considered for complete GAG characterization. Thus, instead of trying to shortcut these facets of analysis, we chose an approach that reduces the total search space. Hundreds of millions of structures may exist for a specific GAG composition, but for a pure sample, only one of these structures is a valid assignment. The impracticality of searching through a massive number of incorrect structures is reduced dramatically when a genetic algorithm search heuristic is applied.
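The effect of knowing the composition on the size of the search space can be illustrated with a simple count. Eq. (2) is not reproduced in this text, so the sketch below is illustrative only: it assumes each candidate sulfation site is an independent on/off bit, giving 2^n structures without constraints and "n choose k" structures once the number of sulfates k is fixed. The site count of 75 is a hypothetical value, chosen here only because 2^75 is of the same order as the figure quoted above; the authors' equation may count sites differently.

```python
import math

def unconstrained_count(n_sites):
    """Number of binary sulfation patterns when the sulfate count is unknown."""
    return 2 ** n_sites

def constrained_count(n_sites, n_so3):
    """Number of patterns once the sulfate count is fixed by the composition."""
    return math.comb(n_sites, n_so3)

n_sites = 75  # hypothetical number of candidate sites
for n_so3 in (4, 5, 6, 7):
    print(n_so3, constrained_count(n_sites, n_so3))
print("unconstrained:", unconstrained_count(n_sites))
```

Even with the sulfate count fixed, the number of isomers grows rapidly with each additional sulfate, which is why exhaustive scoring becomes impractical for long chains.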
Unambiguous mass tags such as the linker region dictate that greater emphasis should be placed on reducing-end fragments (Y and Z fragments), which provide a more reliable structural assignment. The primary fitness of a structure is therefore based on its calculated f1 value, which considers the number of glycosidic fragments from the reducing end (NRE) that are matched in the experimental data. The software then checks whether any match is potentially a sulfate decomposition peak by adding the mass of an SO3-H exchange (79.9568 Da) to the matched mass and searching the experimental data again for a peak at the resulting mass. The value of f1 is then reduced by the number of peaks determined to be products of sulfate decomposition (NRE + SO3).
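Read literally, the description above amounts to f1 = NRE − (NRE + SO3 count), which can be sketched as follows. This Python fragment is one possible reading of the text, not the authors' MATLAB code. It assumes that both the theoretical fragments and the observed peaks have already been converted to neutral masses and that matching uses a symmetric ppm window. The function names and the default tolerance are hypothetical.

```python
import bisect

SO3_SHIFT = 79.9568  # Da, value quoted in the text

def has_peak(sorted_masses, target, tol_ppm):
    """True if any observed mass lies within tol_ppm of target."""
    tol = target * tol_ppm * 1e-6
    i = bisect.bisect_left(sorted_masses, target - tol)
    return i < len(sorted_masses) and sorted_masses[i] <= target + tol

def f1_score(theoretical_masses, observed_masses, tol_ppm=10.0):
    """Matched reducing-end fragments minus those flagged as possible SO3 loss."""
    observed = sorted(observed_masses)
    n_re = 0          # reducing-end fragments found in the data
    n_re_so3 = 0      # matches that also have a partner peak one SO3 higher
    for mass in theoretical_masses:
        if has_peak(observed, mass, tol_ppm):
            n_re += 1
            if has_peak(observed, mass + SO3_SHIFT, tol_ppm):
                n_re_so3 += 1
    return n_re - n_re_so3
```

The same routine could be applied to non-reducing-end fragment masses to obtain the secondary score.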
If the value of f1 is tied among multiple structures, a secondary ranking is then determined with f2, the value of which is based on the number of glycosidic matches from the non-reducing end (B and C fragments). As in the calculation of f1, potential sulfate decomposition peaks are taken into account. Non-reducing end fragments are ranked a tier below reducing end fragments since they could potentially match internal fragments due to the lack of an unambiguous mass tag. Incorrect assignment of internal fragments as non-reducing end fragments limits the validity of the assignment.
A tertiary score f3 is used after matching glycosidic fragments from both reducing and non-reducing ends. Typically, a small selection of candidate structures (2–4) may end up with equal f1 and f2 values, in which case the summation of the intensities of all matched glycosidic fragments is the tiebreaker. This simple algorithm can and should be continuously fine-tuned for other purposes as software development continues but is sufficient for proof-of-principle purposes.
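The three-tier ranking can be expressed compactly as a lexicographic sort, in which f2 matters only when f1 is tied and f3 only when both are tied. The Python sketch below assumes the three scores have already been computed for each candidate. The `ScoredStructure` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoredStructure:
    name: str
    f1: int      # reducing-end glycosidic matches, corrected for SO3 loss
    f2: int      # non-reducing-end glycosidic matches, corrected for SO3 loss
    f3: float    # summed intensity of all matched glycosidic fragments

def rank(structures):
    """Sort best-first by f1, then f2, then f3 (tuple comparison is lexicographic)."""
    return sorted(structures, key=lambda s: (s.f1, s.f2, s.f3), reverse=True)

candidates = [
    ScoredStructure("A", 10, 4, 1.0e6),
    ScoredStructure("B", 10, 5, 2.0e5),
    ScoredStructure("C", 9, 9, 9.0e9),
    ScoredStructure("D", 10, 5, 3.0e5),
]
ranked_names = [s.name for s in rank(candidates)]
```

In this toy example, candidate C has the largest f2 and f3 but still ranks last because its f1 is lower, which reflects the priority given to reducing-end evidence.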
Of particular significance, the efficiency of this approach is found to increase as the total number of permutations increases. For a pure sample, only a single structure can be assigned to the MS2 spectrum, but the number of structures with drastically different modification patterns increases with respect to chain length. An increase in chain length also increases the number of GAG structures that could potentially share a feature not observed in the MS2. Structures containing these features drop out of the algorithm as possible options once a single structure of that particular type is scored.
Calculations shown here were run on a 2.4-GHz dual-core processor with 4 GB of RAM, a standard laptop or desktop computer. Speed of calculations can increase with more powerful hardware such as a GPU workstation or computer cluster. It is important to note that the genetic algorithm in MATLAB is operated with separate function calls at each step of the algorithm’s cycle. Parallelization of these function calls is particularly attractive for samples of higher chain length and, in theory, could mean that spectral interpretation is no longer the bottleneck for structural elucidation of GAGs. Additional GAG structures determined using this genetic-algorithm-based GAG analysis software have been reported.
The software performance is limited by two factors: (1) the quality of the MS2 data and (2) the specificity of the fitness function. The former limitation can be reduced by using a high-performance instrument such as an FTICR or Orbitrap mass spectrometer. Some fragment mass values differ by less than 1 Da, increasing the possibility of ambiguity in low-performance instruments. High-resolution mass spectra with single-digit or lower ppm mass error minimize margins for incorrect assignment. Acquisition conditions must also be optimized for glycan fragmentation and ideally limit production of confounding fragments such as SO3 loss or internal cleavages.
The latter factor, specificity of the fitness function in the genetic algorithm, is one that can be fine-tuned to GAG analysis by tandem mass spectrometry. The fitness function presented in this paper is simple, arbitrary, and based on the basics of glycan analysis. This approach works for the examples selected here because only glycosidic bond cleavage was assigned. Higher level structure analysis based on cross-ring cleavages requires a more sophisticated fitness function. A more complete and non-arbitrary scoring algorithm is being developed that assigns statistical weights and importance factors to various fragment peaks. Additionally, peak intensity, while not considered heavily in this iteration of the code, can also signify important characteristics in GAG structure. Details for creating an optimized scoring algorithm will be discussed in future work.
Peak picking for GAG fragmentation is not discussed in this paper but is an important consideration moving forward. Bikunin fragment peaks were selected by the SNAP algorithm using averagine and manually validated; this approach is practical for lowly sulfated samples, but averagine is insufficient for highly sulfated compounds due to contributions of sulfur to the A + 2 isotope peak. A fully automated and GAG-specific peak picking system is currently in development.
The software is applicable to GAGs across a range of sulfation levels, from lowly sulfated samples such as bikunin to moderately and highly sulfated samples, for both CS/DS and HS/Hp. Structures of short-chain HS with more than one SO3 modification per disaccharide and of long-chain chondroitin sulfate such as decorin, with approximately 1 SO3 per disaccharide, have been determined using our software [52, 53].
The uronic sugar stereochemistry is a variable modification in GAGs that is difficult to observe using mass spectrometry alone. EDD data of heparin and heparan sulfate GAGs have produced a small subset of diagnostic fragments capable of distinguishing between glucuronic and iduronic acid epimers. Chemometric applications have yielded a diagnostic fragment ratio that can definitively determine the C5 stereochemistry. Application of this ratio can be integrated into the software after basic structural features have been assigned using the approach presented here.
The authors gratefully acknowledge funding from the National Institutes of Health, grants P41GM103390 and R21HL136271.
- 1. Xie, B., Costello, C.E.: Carbohydrate structure determination by mass spectrometry. Carbohydr. Chem. Biol. Med. Appl. 29–57 (2008)
- 6. Zhao, Y.J., Singh, A., Xu, Y.M., Zong, C.L., Zhang, F.M., Boons, G.J., Liu, J., Linhardt, R.J., Woods, R.J., Amster, I.J.: Gas-phase analysis of the complex of fibroblast growth factor 1 with heparan sulfate: a traveling wave ion mobility spectrometry (TWIMS) and molecular modeling study. J. Am. Soc. Mass Spectrom. 28, 96–109 (2017)
- 13. Kailemia, M.J., Patel, A.B., Johnson, D.T., Li, L.Y., Linhardt, R.J., Amster, I.J.: Differentiating chondroitin sulfate glycosaminoglycans using collision-induced dissociation; uronic acid cross-ring diagnostic fragments in a single stage of tandem mass spectrometry. Eur. J. Mass Spectrom. 21, 275–285 (2015)
- 14. Flangea, C., Serb, A.F., Schiopu, C., Tudor, S., Sisu, E., Seidler, D.G., Zamfir, A.D.: Discrimination of GalNAc (4S/6S) sulfation sites in chondroitin sulfate disaccharides by chip-based nanoelectrospray multistage mass spectrometry. Cent. Eur. J. Chem. 7, 752–759 (2009)
- 17. Bin Oh, H., Leach, F.E., Arungundram, S., Al-Mafraji, K., Venot, A., Boons, G.J., Amster, I.J.: Multivariate analysis of electron detachment dissociation and infrared multiphoton dissociation mass spectra of heparan sulfate tetrasaccharides differing only in hexuronic acid stereochemistry. J. Am. Soc. Mass Spectrom. 22, 582–590 (2011)
- 19. Wolff, J.J., Laremore, T.N., Busch, A.M., Linhardt, R.J., Amster, I.J.: Influence of charge state and sodium cationization on the electron detachment dissociation and infrared multiphoton dissociation of glycosaminoglycan oligosaccharides. J. Am. Soc. Mass Spectrom. 19, 790–798 (2008)
- 20. Leach, F.E., Ly, M., Laremore, T.N., Wolff, J.J., Perlow, J., Linhardt, R.J., Amster, I.J.: Hexuronic acid stereochemistry determination in chondroitin sulfate glycosaminoglycan oligosaccharides by electron detachment dissociation. J. Am. Soc. Mass Spectrom. 23, 1488–1497 (2012)
- 21. Leach, F.E., Wolff, J.J., Laremore, T.N., Linhardt, R.J., Amster, I.J.: Evaluation of the experimental parameters which control electron detachment dissociation, and their effect on the fragmentation efficiency of glycosaminoglycan carbohydrates. Int. J. Mass Spectrom. 276, 110–115 (2008)
- 26. Leach, F.E., Riley, N.M., Westphall, M.S., Coon, J.J., Amster, I.J.: Negative electron transfer dissociation sequencing of increasingly sulfated glycosaminoglycan oligosaccharides on an orbitrap mass spectrometer. J. Am. Soc. Mass Spectrom. 28, 1844–1854 (2017)
- 37. Maxwell, E., Tan, Y., Tan, Y., Hu, H., Benson, G., Aizikov, K., Conley, S., Staples, G.O., Slysz, G.W., Smith, R.D., Zaia, J.: GlycReSoft: a software package for automated recognition of glycans from LC/MS data. PLoS One. 7, (2012)
- 43. Fogel, L.J., Owens, A.J., Walsh, M.J.: Artificial intelligence through a simulation of evolution. Proceedings of the Second Cybernetic Sciences Symposium: Biophysics and cybernetic systems. 131–155 (1965)
- 53. Singh, A., Kett, W.C., Severin, I.C., Agyekum, I., Duan, J.N., Amster, I.J., Proudfoot, A.E.I., Coombe, D.R., Woods, R.J.: The interaction of heparin tetrasaccharides with chemokine CCL5 is modulated by sulfation pattern and pH. J. Biol. Chem. 290, 15421–15436 (2015)
- 54. Agyekum, I., Patel, A.B., Zong, C.L., Boons, G.J., Amster, I.J.: Assignment of hexuronic acid stereochemistry in synthetic heparan sulfate tetrasaccharides with 2-O-sulfo uronic acids using electron detachment dissociation. Int. J. Mass Spectrom. 390, 163–169 (2015)