The local-balanced model for improved machine learning outcomes on mass spectrometry data sets and other instrumental data

This article has been updated

Abstract

A unifying challenge in classifying biological samples with mass spectrometry data is overcoming sample-to-sample variability so that differences between groups, such as a healthy set and a disease set, can be identified. Moreover, even when the same sample is re-analyzed under identical conditions, instrument signals can fluctuate by more than 10%. This signal inconsistency makes it difficult to identify subtle differences across a set of samples, and it weakens the mass spectrometrist’s ability to leverage data effectively in domains as diverse as proteomics, metabolomics, glycomics, and imaging. We selected challenging data sets in the fields of glycomics, mass spectrometry imaging, and bacterial typing to study within-group signal variability, and we adapted a 30-year-old statistical approach to address it. The solution, the “local-balanced model,” relies on using balanced subsets of training data to classify test samples. This analysis strategy was assessed on ESI-MS data of IgG-based glycopeptides, MALDI-MS imaging data of endogenous lipids, and MALDI-MS data of bacterial proteins. Two preliminary examples on non-mass spectrometry data sets are also included to show the potential generality of the method outside the field of MS analysis. We demonstrate that this approach is superior to simple normalization methods, generalizable to multiple mass spectrometry domains, and potentially appropriate in fields as diverse as physics and satellite imaging. In some cases, improvements in classification can be dramatic, with accuracy rising from 60% with normalization alone to over 90% with the additional development described herein.
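
The idea of classifying each test sample with a balanced subset of the training data can be illustrated with a minimal Python sketch. This is one plausible reading of the abstract, not the authors' implementation: the Euclidean distance, the parameter `k`, the nearest-centroid rule, the function name `local_balanced_predict`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def local_balanced_predict(train_X, train_y, x, k=3):
    """Classify x using a balanced, local subset of the training data.

    Assumptions (not from the article): Euclidean distance, k neighbors
    per class, and a nearest-centroid rule fitted on the local subset.
    """
    # Group training samples by class label.
    by_class = defaultdict(list)
    for features, label in zip(train_X, train_y):
        by_class[label].append(features)

    best_label, best_dist = None, math.inf
    for label, samples in by_class.items():
        # Keep only the k samples of this class closest to x, so every
        # class contributes the same number of samples (balanced),
        # provided each class has at least k samples.
        nearest = sorted(samples, key=lambda s: math.dist(s, x))[:k]
        # Local "model": the centroid of this class's local subset.
        centroid = [sum(col) / len(nearest) for col in zip(*nearest)]
        d = math.dist(centroid, x)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Hypothetical, imbalanced toy data: five "healthy" and two "disease" samples.
train_X = [[1.0, 0.0], [1.2, 0.1], [0.9, -0.1], [5.0, 5.0], [5.1, 4.9],
           [3.0, 0.0], [3.2, 0.1]]
train_y = ["healthy"] * 5 + ["disease"] * 2

print(local_balanced_predict(train_X, train_y, [2.8, 0.0], k=2))
```

In this sketch, a separate local model is built for every test sample, and the majority class cannot dominate because each class contributes exactly `k` training samples. Any classifier could replace the centroid rule.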

Change history

  • 16 April 2020

    Springer Nature’s version of this paper was updated to present the correct Electronic Supplementary Material files.

Funding

This work was funded by NIH grant R35GM130354 to H.D.

Author information

Corresponding author

Correspondence to Heather Desaire.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

ESM 1 (TXT 3.66 kb)

ESM 2 (TXT 1.06 kb)

ESM 3 (TXT 1.74 kb)

ESM 4 (DOCX 14.6 kb)

About this article

Cite this article

Desaire, H., Patabandige, M.W. & Hua, D. The local-balanced model for improved machine learning outcomes on mass spectrometry data sets and other instrumental data. Anal Bioanal Chem (2021). https://doi.org/10.1007/s00216-020-03117-2

Keywords

  • Software
  • Genomics/proteomics
  • Mass spectrometry
  • Machine learning
  • Imaging
  • Glycoprotein