Semi-Supervised Graph Embedding Scheme with Active Learning (SSGEAL): Classifying High Dimensional Biomedical Data

  • George Lee
  • Anant Madabhushi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6282)


In this paper, we present a new dimensionality reduction (DR) method (SSGEAL) which integrates Graph Embedding (GE) with semi-supervised and active learning to provide a low dimensional data representation that allows for better class separation. Unsupervised DR methods such as Principal Component Analysis and GE have previously been applied to the classification of high dimensional biomedical datasets (e.g. DNA microarrays and digitized histopathology) in the reduced dimensional space. However, these methods do not incorporate class label information, often leading to embeddings with significant overlap between the data classes. Semi-supervised dimensionality reduction (SSDR) methods have recently been proposed which utilize both labeled and unlabeled instances for learning the optimal low dimensional embedding. However, in several problems involving biomedical data, obtaining class labels may be difficult and/or expensive. SSGEAL utilizes labels from instances, identified as “hard to classify” by a support vector machine based active learning algorithm, to drive an updated SSDR scheme while reducing labeling cost. Real world biomedical data from 7 gene expression studies and 3900 digitized images of prostate cancer needle biopsies were used to show the superior performance of SSGEAL compared to both GE and SSAGE (a recently popular SSDR method) in terms of both the Silhouette Index (SI) (SI = 0.35 for GE, SI = 0.31 for SSAGE, and SI = 0.50 for SSGEAL) and the Area Under the Receiver Operating Characteristic Curve (AUC) for a Random Forest classifier (AUC = 0.85 for GE, AUC = 0.93 for SSAGE, AUC = 0.94 for SSGEAL).
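The Silhouette Index reported above measures how well the classes separate in the embedded space: for each sample, the mean distance a to its own class is compared with the smallest mean distance b to any other class, giving s = (b - a) / max(a, b), and the index is the mean of s over all samples. The following is a minimal sketch of that standard definition (Rousseeuw, 1987), not the authors' evaluation code. The function name `silhouette_index`, the use of Euclidean distance, and the toy 2-D points are assumptions made for illustration.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette width over all points (standard definition).

    points: list of equal-length coordinate tuples (e.g. embedded samples).
    labels: list of class labels, one per point.
    Illustrative helper; name and Euclidean metric are assumptions.
    """
    n = len(points)
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("silhouette needs at least two classes")

    scores = []
    for i in range(n):
        # Mean distance from point i to the members of each class,
        # leaving point i itself out of its own class.
        mean_dist = {}
        for c in classes:
            dists = [math.dist(points[i], points[j])
                     for j in range(n) if labels[j] == c and j != i]
            mean_dist[c] = sum(dists) / len(dists) if dists else None

        a = mean_dist[labels[i]]
        if a is None:
            # A class with a single member gets silhouette 0 by convention.
            scores.append(0.0)
            continue

        # Smallest mean distance to any other class.
        b = min(d for c, d in mean_dist.items() if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)

    return sum(scores) / n


# Toy embedding: two tight groups that are far apart.
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated = silhouette_index(embedding, [0, 0, 1, 1])  # labels match groups
mixed = silhouette_index(embedding, [0, 1, 0, 1])      # labels cut across groups
print(separated, mixed)
```

Values near 1 indicate compact, well separated classes, while values near 0 or below indicate overlap, which is the sense in which a higher SI for one embedding suggests better class separation than another.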





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • George Lee (1)
  • Anant Madabhushi (1)
  1. Department of Biomedical Engineering, Rutgers, The State University of New Jersey, Piscataway, USA
