N-GlycositeAtlas: a database resource for mass spectrometry-based human N-linked glycoprotein and glycosylation site mapping
N-linked glycoproteins are a highly interesting class of proteins for clinical and biological research. The large-scale characterization of N-linked glycoproteins accomplished by mass spectrometry-based glycoproteomics has provided valuable insights into the interdependence of glycoprotein structure and protein function. However, these studies focused mainly on the analysis of specific sample types and lack the integration of glycoproteomic data from different tissues, body fluids or cell types.
In this study, we collected the human glycosite-containing peptides identified through their de-glycosylated forms by mass spectrometry from over 100 publications and unpublished datasets generated from our laboratory. A database resource termed N-GlycositeAtlas was created and further used for the distribution analyses of glycoproteins among different human cells, tissues and body fluids. Finally, a web interface of N-GlycositeAtlas was created to maximize the utility and value of the database.
The N-GlycositeAtlas database contains more than 30,000 glycosite-containing peptides (representing > 14,000 N-glycosylation sites) from more than 7200 N-glycoproteins from different biological sources including human-derived tissues, body fluids and cell lines from over 100 studies.
The entire human N-glycoproteome database as well as 22 sub-databases associated with individual tissues or body fluids can be downloaded from the N-GlycositeAtlas website at http://nglycositeatlas.biomarkercenter.org.
N-linked glycosylation site
liquid chromatography combined with tandem-mass spectrometry
peripheral blood mononuclear cell
Post-translational modifications (PTMs) are among the most important factors that increase the diversity of proteins in terms of both structure and function. The expression analysis of proteins and their PTMs is a key step for the functional characterization of genes and proteins. In the last decade, mass spectrometry has become the most important tool for large-scale proteomic and PTM analysis. Due to the rapid accumulation of a vast amount of proteomic data, many proteome, sub-proteome, and protein modification databases have been created in recent years to facilitate proteomic and PTM studies. These databases include ProteomicsDB, Human Proteome Map, GPMDB and PeptideAtlas for global proteomes; PhosphoSitePlus for phosphorylation, acetylation, and ubiquitination sites; Unipep for N-glycosite-containing peptides; and Cell Surface Protein Atlas for cell surface proteins. The public availability of these databases has facilitated the progress of several studies in their corresponding fields.
Glycosylation is one of the most common PTMs and plays important roles in many biological processes. Aberrant glycosylation is associated with the pathological progression of many diseases. N-linked glycosylation is a common feature shared by a large fraction of transmembrane proteins, cell surface proteins, and proteins secreted in body fluids [9, 10]. Transmembrane or cell surface glycoproteins are easily accessible to therapeutic drugs, antibodies, and ligands. The glycoproteins secreted in body fluids such as serum, cerebrospinal fluid, and urine are easily accessible and are thought to provide a detailed window into the state of health of an individual. These features make glycoproteins a highly interesting class of proteins for clinical and biological research.
In the last decade, thousands of N-linked glycoproteins have been identified through their glycosite-containing peptides using mass spectrometry. These data have facilitated a better understanding of the glycoprotein contents in humans and other organisms. However, these studies only analyzed specific tissue types, body fluids or cell lines. Unipep is the only database specifically dedicated to predicted and identified N-glycosite-containing peptides, but it does not record the sources of the identified glycopeptides. Hence, a systematic and integrated analysis of these identified glycoproteins and glycosites is urgently needed.
In this study, we collected more than 30,000 unique human glycosite-containing peptides (de-glycosylated) identified by mass spectrometry, representing > 14,000 unique N-glycosites from > 7200 N-glycoproteins, from over 100 publications and unpublished datasets. A database resource termed N-GlycositeAtlas was created and further used for the distribution analyses of glycoproteins among different human cells, tissues and body fluids. Finally, a web interface of N-GlycositeAtlas (http://nglycositeatlas.biomarkercenter.org) was created to maximize the utility and value of the database by providing an online search platform as well as a comprehensive and tissue- or body fluid-specific glycoprotein database that can be downloaded.
Collection of N-linked human glycosite-containing peptides
The glycosite-containing peptides identified by mass spectrometry from human sources (including tissues, body fluids, and cell lines) were obtained from two main resources: (1) 34 datasets generated in our laboratory (15 published and 19 unpublished); and (2) 70 papers published by other groups since 2003 (collected in November 2015). These publications were collected based on their citation of one of the following glycoproteomics technology papers: (1) hydrazide chemistry [12, 13, 14, 15]; (2) lectin enrichment; (3) hydrophilic affinity; (4) size exclusion chromatography; and (5) FASP-based lectin enrichment. All unpublished glycosite-containing peptides were enriched from different human-related samples using the hydrazide chemistry (SPEG) method [12, 13]. It should be noted that only glycosite-containing peptides identified by their de-glycosylated forms were collected; glycoproteins identified through intact glycopeptides or other non-glycosylated peptides were not included in this study. After collection from the published papers, the data were further filtered by the N-X-S/T motif (X can be any amino acid except proline) with deamidation (de-glycosylated form) at the asparagine residue. To preserve the original records from the published papers, no further quality control step was performed prior to database assembly.
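The motif filter described above can be illustrated with a short sketch. This is not the authors' code; it assumes, purely for the example, that a deamidated asparagine is written as a lowercase `n` in the peptide string, and the function names are hypothetical.

```python
import re

# N-X-S/T sequon where X is any residue except proline. The lookahead lets
# overlapping sequons (for example "NNST") be reported individually.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def deamidated_sequon_positions(peptide: str) -> list[int]:
    """Return 0-based positions of deamidated Asn residues that sit in a sequon.

    Assumption (hypothetical notation): a deamidated asparagine is written as
    lowercase 'n'; every other residue is an uppercase one-letter code.
    """
    upper = peptide.upper()
    return [
        match.start()
        for match in SEQUON.finditer(upper)
        if peptide[match.start()] == "n"
    ]


def passes_filter(peptide: str) -> bool:
    """Keep a peptide only if at least one deamidated Asn lies in a sequon."""
    return bool(deamidated_sequon_positions(peptide))
```

A peptide-level check like this cannot see a sequon that runs past the C-terminus of the peptide; deciding those cases requires the flanking protein sequence.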
Among the unpublished datasets generated in our laboratory, eleven were generated before 2008 and have been included in the Unipep website (http://www.unipep.org) and/or the PeptideAtlas website (http://www.peptideatlas.org). These samples were enriched by the SPEG method and analyzed on an LTQ ion trap (Thermo Fisher, San Jose, CA) or a Q-TOF (Waters, Beverly, MA) mass spectrometer, and the spectra were searched with the SEQUEST algorithm against a human International Protein Index (IPI) database. The peptide mass tolerance was 2.0 Da. Carbamidomethylation (C, + 57.0215 Da) was set as a static modification; oxidation (M, + 15.9949 Da) and deamidation (N, + 0.98 Da) were set as dynamic modifications. The output files were further evaluated by INTERACT and ProteinProphet [23, 24]. The identified peptides were filtered by a PeptideProphet probability score ≥ 0.9 and by deamidation (de-glycosylated form) of the asparagine (N) in the N-X-S/T motif.
The other eight large datasets were generated using Orbitrap Velos and/or Q-Exactive mass spectrometers (Thermo Fisher Scientific, Bremen, Germany) after enrichment of formerly glycosylated peptides by the SPEG method, and were searched against an NCBI Reference Sequence (RefSeq) human protein database using SEQUEST in Proteome Discoverer v1.4 (Thermo Fisher Scientific). The database searching parameters for glycosite-containing peptide identification were set as follows: two missed cleavages were allowed for trypsin digestion, with 10 ppm precursor mass tolerance and 0.06 Da fragment mass tolerance. Carbamidomethylation (C) was set as a static modification, while oxidation (M) and deamidation (N) were set as dynamic modifications. For iTRAQ-labeled samples, iTRAQ-4plex (peptide N-terminal) and iTRAQ-4plex (K) were added as dynamic modifications. The glycosite-containing peptide identifications were filtered by 1% FDR and by deamidation in the N-X-S/T motif of the peptides. Four of these unpublished datasets (raw data) have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository with the dataset identifier PXD005143. Another glycoproteome dataset is accessible through the Clinical Proteomic Tumor Analysis Consortium (CPTAC) website (https://cptac-data-portal.georgetown.edu/cptac/s/S020).
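To make the two precursor tolerances concrete, the following sketch compares a 10 ppm window with the 2.0 Da window for a peptide and its deamidated form. It is only an illustration: the 1500 Da peptide mass is a made-up value, the helper names are hypothetical, and 0.984016 Da is the standard monoisotopic deamidation shift that the text rounds to 0.98 Da.

```python
# Monoisotopic mass shifts in Da. The carbamidomethylation and oxidation values
# are those quoted in the text; 0.984016 is the unrounded deamidation shift.
CARBAMIDOMETHYL_C = 57.0215
OXIDATION_M = 15.9949
DEAMIDATION_N = 0.984016


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error of an observed mass relative to a theoretical mass."""
    return (observed - theoretical) / theoretical * 1e6


def tolerance_da(mass: float, tol_ppm: float = 10.0) -> float:
    """Half-width in Da of a ppm tolerance window at the given mass."""
    return mass * tol_ppm * 1e-6


def within_tolerance(observed: float, theoretical: float, tol_ppm: float = 10.0) -> bool:
    """True if the observed mass lies inside the ppm window around the theoretical mass."""
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


# Hypothetical peptide: 1500.0 Da unmodified, one deamidated asparagine.
unmodified = 1500.0
deamidated = unmodified + DEAMIDATION_N

# At 10 ppm the window is about 0.015 Da wide on each side, so the unmodified
# mass is rejected as a match for the deamidated peptide.
narrow_match = within_tolerance(unmodified, deamidated, tol_ppm=10.0)

# With a 2.0 Da window both forms fall inside the same precursor window.
wide_match = abs(unmodified - deamidated) <= 2.0
```

Under these assumptions `narrow_match` is false and `wide_match` is true, which is why a deamidation assignment made with a wide precursor window rests on the fragment spectrum alone.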
Glycoprotein mapping and database assembly
All identified glycosite-containing peptides from the published papers and unpublished datasets were matched to the UniProt human protein database (downloaded on November 3, 2015 from http://www.uniprot.org) using in-house software. With this software, all glycosite-containing peptides were first mapped to the reviewed UniProt database, and unmatched peptides were then mapped to the un-reviewed UniProt database. The matched protein IDs, gene names, protein names, glycosylation site locations, and peptide sequences with ± 20 amino acids surrounding each glycosite (N-X-S/T motif, X ≠ P) were extracted and assembled into a human glycoprotein and glycosite database, termed N-GlycositeAtlas. When a peptide matched more than one protein, all protein records were included in the database. In addition, only peptides containing the typical N-X-S/T N-glycosylation motif were included in the database.
The N-GlycositeAtlas is accessible at http://nglycositeatlas.biomarkercenter.org. Users can download the entire human glycoprotein database and the 22 tissue- or body fluid-specific databases from the website.
Results and discussion
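The mapping step can be sketched as below. The authors' in-house software is not described in detail, so this is an assumed, simplified reconstruction: proteins are held in plain dictionaries of accession to sequence, peptides are matched by exact substring search, and every sequon asparagine inside the matched peptide is reported. All function names and accessions are hypothetical.

```python
def find_glycosites(peptide, proteins, flank=20):
    """Locate a peptide in each protein and report its N-X-S/T (X != P) sites.

    Returns (accession, 1-based site position, flanking window) tuples. The
    motif is checked on the protein sequence, so a sequon that extends past
    the end of the peptide is still recognised.
    """
    records = []
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            for offset, residue in enumerate(peptide):
                if residue != "N":
                    continue
                pos = start + offset  # 0-based index in the protein
                motif = sequence[pos:pos + 3]
                if len(motif) == 3 and motif[1] != "P" and motif[2] in "ST":
                    window = sequence[max(0, pos - flank):pos + flank + 1]
                    records.append((accession, pos + 1, window))
            start = sequence.find(peptide, start + 1)
    return records


def map_peptide(peptide, reviewed, unreviewed, flank=20):
    """Search reviewed entries first; use unreviewed ones only if unmatched."""
    matched_reviewed = any(peptide in seq for seq in reviewed.values())
    database = reviewed if matched_reviewed else unreviewed
    return find_glycosites(peptide, database, flank)


# Toy sequences and accessions, invented for illustration only.
reviewed = {"P00001": "MKTAANGSLLPQR", "P00002": "GGAANGSLLAA"}
unreviewed = {"A0A000": "QQAANGSLLQQ"}
shared = map_peptide("AANGSLL", reviewed, unreviewed, flank=2)
```

Because the peptide occurs in both reviewed toy proteins, `shared` holds one record per protein, mirroring the rule that all matching protein records are kept.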
Assembly of N-GlycositeAtlas
To expand the database, we also collected human glycosite-containing peptides from all papers on human glycosite-containing peptide analysis published since 2003. Using the same strategy as above, we eventually collected 22,618 glycosite-containing peptides that belong to 8818 unique glycosites from 70 papers published by other laboratories [7, 14, 15, 17, 18, 33, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104]. Altogether, the N-GlycositeAtlas contains 30,872 unique glycosite-containing peptides that match 14,644 unique glycosites in 7204 glycoproteins (Fig. 1 and Additional file 1: Table S1).
The confidence of these identified glycosites, especially those identified only once, can be further estimated from the original mass spectrometry data and the subsequent analytical methods. Owing to the substantial improvements in mass spectrometry technology, liquid chromatography (LC) separation, analysis software and glycoprotein/glycosite-containing peptide isolation methods in recent years, the number of confidently identified N-linked glycosite-containing peptides has increased dramatically. As most of the glycosite-containing peptides in N-GlycositeAtlas were identified in their de-glycosylated forms with deamidation (+ 0.98 Da) at the former glycosites (after PNGase F treatment), the high resolution and mass accuracy of the mass spectrometers used in recent studies greatly increased the identification confidence of the glycosites and glycoproteins and increased the number of glycosite-containing peptides identified at pre-determined false discovery rates (FDR). To estimate the confidence of the glycosite-containing peptides in the database using this information, we simply analyzed the data according to their date of publication. Our results showed that although the numbers of identified human glycoproteins and glycosites have been steadily increasing since 2003, when the first two glycoproteomic studies were published [12, 16], the largest increase occurred in recent years (Fig. 2b). We found that the majority of the glycosites (83.4%) in the database were published during 2010–2015 (Fig. 2c), and these sites were most likely identified with high confidence using high-resolution, accurate-mass spectrometry.
Additional information about the detailed mass spectrometers and search parameters for the identification of a given glycosite or glycoprotein can be obtained from the original publications listed in the database.
Distribution of glycoproteins and glycosites across tissues and biological fluids
As N-linked glycoproteins account for a large portion of the protein content in serum and other body fluids, identifying the glycoprotein components in these body fluids is essential for their clinical utility. N-GlycositeAtlas contains 2645 and 1845 glycoproteins that were identified from urine and serum, respectively (Fig. 3). Thus, more glycoproteins were identified from urine than from serum. A possible reason is that serum contains many high-abundance glycoproteins, which may mask the identification of low-abundance glycoproteins. Removal of these high-abundance proteins before mass spectrometry-based proteomic or sub-proteomic analyses would increase the number of identified serum glycoproteins. Several hundred glycoproteins have also been identified from saliva and cerebrospinal fluid (CSF). In addition, > 1000 glycoproteins have been detected from platelets and T-cell lines, and > 500 glycoproteins have been identified from B-cell lines (Fig. 3).
The glycoprotein and glycosite databases associated with individual tissues or body fluids can be downloaded from the N-GlycositeAtlas website.
Comparison of serum and urinary glycoproteins with tissue-derived glycoproteins
Body fluids other than serum, such as urine and CSF, are also important specimens for clinical tests. Given the clinical utility of urine, we also analyzed urine-derived glycoproteins and compared them with glycoproteins from eight different tissues. The results indicated that many glycoproteins were identified in both urine and tissues, with an average of 63.1 ± 12.1% of tissue-derived glycoproteins overlapping with urine-derived glycoproteins (Fig. 4b). More tissue-derived glycoproteins were detected in urine than in serum, which could be attributed to the larger number of glycoproteins identified in urine than in serum. To further investigate the potential of urine in clinical tests and biomarker discovery, we also compared the glycoproteins between urine and serum. Among the 1845 glycoproteins identified in serum, 827 (44.8%) were also identified in urine. The abundant glycoprotein content of urine and the high percentage of urinary glycoproteins that overlap with tissue-derived glycoproteins suggest the high potential of urine in clinical detection and biomarker discovery. However, additional studies are required to confirm whether these urinary glycoproteins change with disease and reflect different pathological states within different parts of the human body.
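The overlap figures quoted above are simple set statistics, which the following sketch reproduces under stated assumptions. The tissue names and accession sets are toy data, the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form of the "±" value was reported.

```python
from statistics import mean, stdev


def percent_overlap(reference: set, other: set) -> float:
    """Percentage of entries in `reference` that also occur in `other`."""
    if not reference:
        raise ValueError("reference set is empty")
    return 100.0 * len(reference & other) / len(reference)


def summarize(tissue_sets: dict, fluid: set) -> tuple:
    """Mean and sample standard deviation of per-tissue overlap with a fluid."""
    values = [percent_overlap(proteins, fluid) for proteins in tissue_sets.values()]
    return mean(values), stdev(values)


# Toy accession sets, invented for illustration only.
tissues = {"liver": {"A", "B", "C", "D"}, "kidney": {"A", "X"}}
urine = {"A", "B", "X"}
average, spread = summarize(tissues, urine)

# The serum/urine figure in the text follows from the reported counts alone.
serum_in_urine = round(100.0 * 827 / 1845, 1)
```

With the counts reported in the text, `serum_in_urine` evaluates to 44.8, matching the stated percentage.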
N-GlycositeAtlas web interface
We designed two layers of display pages to present the results. The first layer shows only general information about the glycoproteins, including glycoprotein accession numbers (UniProt), gene names, protein names and identified glycosylation sites (Fig. 5b). Additional information for each glycoprotein is available on the second display page, which is reached by clicking the corresponding glycoprotein accession number. On the second display page, the user obtains the tissue/body fluid/cell line types in which the glycoprotein was identified (Fig. 5c), all glycosite-containing peptides identified at each glycosite with the reference information (Fig. 5d), and the highlighted locations of the identified glycosites and glycosite-containing peptides in the protein sequence (Fig. 5e).
In addition, the entire human glycoprotein and glycosite database as well as the glycoprotein database for each individual tissue or body fluid can also be downloaded from the N-GlycositeAtlas website in a Microsoft Excel format. The following information is included in the database: (1) UniProt accession numbers of glycoproteins; (2) whether the protein has been reviewed in the UniProt database; (3) protein names; (4) gene names; (5) location of the glycosylation sites; (6) identified glycosite-containing peptides; (7) the protein sequence at ± 20 amino acids surrounding the identified glycosylation site; (8) names of tissues/body fluids/cell lines where the glycosite-containing peptide was identified; (9) year of publication; and (10) references. It should be noted that each line of text only contains one glycosite-containing peptide and one glycosite location. When a peptide contains more than one glycosite, each glycosite is displayed on a separate line. In addition, different proteins are also listed on separate lines when one glycosite-containing peptide was matched to more than one protein. The detailed information for each identified glycosite or glycosite-containing protein can be acquired from their original publications that are listed after each record.
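Because each row of the download holds one peptide, one glycosite and one protein, counting glycosites per protein requires de-duplicating rows. The sketch below shows one way to do this; it assumes the Excel sheet has been exported to CSV and uses hypothetical column names (`accession`, `site`, `peptide`, `source`) that are not taken from the actual file.

```python
import csv
import io
from collections import defaultdict


def glycosites_by_protein(handle) -> dict:
    """Collect the unique glycosite positions recorded for each accession.

    Rows repeat a site whenever several peptides or sources support it, so
    positions are gathered in a set before being returned as a sorted list.
    """
    sites = defaultdict(set)
    for row in csv.DictReader(handle):
        sites[row["accession"]].add(int(row["site"]))
    return {accession: sorted(positions) for accession, positions in sites.items()}


# Toy export with invented accessions, peptides and sources.
EXAMPLE = """accession,site,peptide,source
P00001,6,AANGSLLPQR,serum
P00001,6,TAANGSLLPQR,urine
P00001,40,LNVTK,serum
P00002,5,AANGSLLPQR,serum
"""

summary = glycosites_by_protein(io.StringIO(EXAMPLE))
```

Here the two rows for site 6 of the first toy protein collapse into a single glycosite, while the peptide shared between the two proteins still contributes one site to each.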
In this study, we created a human glycoprotein and glycosite database containing > 14,000 N-glycosites and more than 7200 N-glycoproteins that were identified through their de-glycosylated forms of glycosite-containing peptides by mass spectrometry from over 100 publications or unpublished datasets. Based on the data in the database, we observed that although several thousand glycoproteins could be identified from one single tissue, there were still many tissues where no mass spectrometry-based glycosite data has been generated yet. A considerable amount of additional work is still needed to profile the human glycoproteomes at the human genomic level. Many common glycoproteins identified between tissues and serum confirmed the high value of serum in clinical tests, while the large proportion of common glycoproteins between different tissues and urine suggested the high potential of urine for clinical detection and biomarker discovery. Finally, the web interface of N-GlycositeAtlas (http://nglycositeatlas.biomarkercenter.org) was created to maximize the utility and value of the database by providing an online search platform as well as a comprehensive and tissue- or body fluid-specific glycoprotein database that can be downloaded.
SS collected and analyzed human glycosites with support from other co-authors; YH and MA wrote the in-house program for glycosite mapping and developed the web interface of the database; SS, PS, JC, WY, XJ, YT and HZ provided unpublished and newly generated datasets; ST provided suggestions on data analysis and manuscript preparation. SS and HZ prepared the manuscript. All authors read and approved the final manuscript.
This work was supported by the National Natural Science Foundation of China (Grant Nos. 91853123, 81773180, and 21705127), and Natural Science Foundation of Shaanxi Province (Grant No: 2018JM7086074). HZ was supported by the National Institutes of Health (Grant Nos.: U01CA152813, U24CA210985, P01HL107153, and R21AI122382).
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 6. Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2011. https://doi.org/10.1093/nar/gkr1122.
- 20. Farrah T, Deutsch EW, Omenn GS, Campbell DS, Sun Z, Bletz JA, Mallick P, Katz JE, Malmström J, Ossola R. A high-confidence human plasma proteome reference set with estimated concentrations in PeptideAtlas. Mol Cell Proteomics. 2011. https://doi.org/10.1074/mcp.M110.006353.
- 27. Sun S, Shah P, Eshghi ST, Yang W, Trikannad N, Yang S, Chen L, Aiyetan P, Hoti N, Zhang Z, et al. Comprehensive analysis of protein glycosylation by solid-phase extraction of N-linked glycans and glycosite-containing peptides. Nat Biotechnol. 2015. (Advance online publication).
- 29. Shah P, Wang X, Yang W, Eshghi ST, Sun S, Hoti N, Chen L, Yang S, Pasay J, Rubin A, et al. Integrated proteomic and glycoproteomic analyses of prostate cancer cells reveal glycoprotein alteration in protein abundance and glycosylation. Mol Cell Proteomics. 2015;14(10):2753–63.
- 31. Liu Y, Chen J, Sethi A, Li QK, Chen L, Collins B, Gillet LC, Wollscheid B, Zhang H, Aebersold R. Glycoproteomic analysis of prostate cancer tissues by SWATH mass spectrometry discovers N-acylethanolamine acid amidase and protein tyrosine kinase 7 as signatures for tumor aggressiveness. Mol Cell Proteomics. 2014;13(7):1753–68.
- 36. Almaraz RT, Tian Y, Bhattarcharya R, Tan E, Chen S-H, Dallas MR, Chen L, Zhang Z, Zhang H, Konstantopoulos K. Metabolic flux increases glycoprotein sialylation: implications for cell adhesion and cancer metastasis. Mol Cell Proteomics. 2012. https://doi.org/10.1074/mcp.M112.017558.
- 48. Goyallon A, Cholet S, Chapelle M, Junot C, Fenaille F. Evaluation of a combined glycomics and glycoproteomics approach for studying the major glycoproteins present in biofluids: application to cerebrospinal fluid. Rapid Commun Mass Spectrom. 2015;29(6):461–73.
- 49. Cheow ESH, Sim KH, de Kleijn D, Lee CN, Sorokin V, Sze SK. Simultaneous enrichment of plasma soluble and extracellular vesicular glycoproteins using prolonged ultracentrifugation-electrostatic repulsion–hydrophilic interaction chromatography (PUC-ERLIC) approach. Mol Cell Proteomics. 2015;14(6):1657–71.
- 51. Zhang Z, Sun Z, Zhu J, Liu J, Huang G, Ye M, Zou H. High-throughput determination of the site-specific N-sialoglycan occupancy rates by differential oxidation of glycoproteins followed with quantitative glycoproteomics analysis. Anal Chem. 2014;86(19):9830–7.
- 53. Xu Y, Bailey U-M, Punyadeera C, Schulz BL. Identification of salivary N-glycoproteins and measurement of glycosylation site occupancy by boronate glycoprotein enrichment and liquid chromatography/electrospray ionization tandem mass spectrometry. Rapid Commun Mass Spectrom. 2014;28(5):471–82.
- 54. Weng Y, Qu Y, Jiang H, Wu Q, Zhang L, Yuan H, Zhou Y, Zhang X, Zhang Y. An integrated sample pretreatment platform for quantitative N-glycoproteome analysis with combination of on-line glycopeptide enrichment, deglycosylation and dimethyl labeling. Anal Chim Acta. 2014;833:1–8.
- 65. Hirao Y, Matsuzaki H, Iwaki J, Kuno A, Kaji H, Ohkura T, Togayachi A, Abe M, Nomura M, Noguchi M, et al. Glycoproteomics approach for identifying glycobiomarker candidate molecules for tissue type classification of non-small cell lung carcinoma. J Proteome Res. 2014;13(11):4705–16.
- 74. Li X, Jiang J, Zhao X, Wang J, Han H, Zhao Y, Peng B, Zhong R, Ying W, Qian X. N-glycoproteome analysis of the secretome of human metastatic hepatocellular carcinoma cell lines combining hydrazide chemistry, HILIC enrichment and mass spectrometry. PLoS ONE. 2013;8(12):e81921.
- 75. Kaji H, Ocho M, Togayachi A, Kuno A, Sogabe M, Ohkura T, Nozaki H, Angata T, Chiba Y, Ozaki H, et al. Glycoproteomic discovery of serological biomarker candidates for HCV/HBV infection-associated liver fibrosis and hepatocellular carcinoma. J Proteome Res. 2013;12(6):2630–40.
- 77. Zhu J, Wang F, Chen R, Cheng K, Xu B, Guo Z, Liang X, Ye M, Zou H. Centrifugation assisted microreactor enables facile integration of trypsin digestion, hydrophilic interaction chromatography enrichment, and on-column deglycosylation for rapid and sensitive N-glycoproteome analysis. Anal Chem. 2012;84(11):5146–53.
- 80. Whitmore TE, Peterson A, Holzman T, Eastham A, Amon L, McIntosh M, Ozinsky A, Nelson PS, Martin DB. Integrative analysis of N-linked human glycoproteomic data sets reveals PTPRF ectodomain as a novel plasma biomarker candidate for prostate cancer. J Proteome Res. 2012;11(5):2653–65.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.