Feature Extraction Methods and Manifold Learning Methods

Part of the Advanced Information and Knowledge Processing book series (AI&KP)

In the previous chapters we presented several learning algorithms for classification and regression tasks. In many practical problems the data cannot be fed directly to learning algorithms; they must first undergo preprocessing. To illustrate this concept, consider the following example. Suppose we want to build an automatic handwritten character recognizer, that is, a system able to associate a given bitmap with the correct letter of the alphabet or digit. We assume that all the data have the same size, namely that they are bitmaps of n × m pixels; for the sake of simplicity we assume n = m = 2^8. Therefore each bitmap is a vector of 2^8 × 2^8 = 2^16 pixel values. This implies that a learning machine fed directly with character bitmaps will perform poorly, since a representative training set cannot be built. A common approach to overcoming this problem consists in representing each bitmap by a vector of d (with d ≪ nm) measures computed on the bitmap, called features, and then feeding the learning machine with the feature vector. The feature vector aims to represent concisely the distinctive characteristics of each letter. The better the features capture the distinctive characteristics of each character, the higher the performance of the learning machine. In machine learning, the preprocessing stage that converts the data into feature vectors is called feature extraction. One of the main aims of feature extraction is to obtain the most representative feature vector using as few features as possible. Using more features than strictly necessary leads to several problems. One problem is the space needed to store the data: as the amount of available information increases, compression for storage purposes becomes even more important. Another is speed: the computational cost of a learning machine depends on the dimension of its input vectors, so reducing the dimension can reduce computation time.
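As a rough illustration of the bitmap-to-feature-vector step described above, the following Python sketch computes simple "zoning" features: the mean ink density in each cell of a coarse grid laid over the bitmap. The choice of zoning features, the function name `zoning_features`, the grid size, and the small synthetic bitmap are all assumptions made for illustration; the chapter does not prescribe them.

```python
import numpy as np

def zoning_features(bitmap, grid=4):
    """Return grid*grid mean ink densities, one per rectangular zone.

    Hypothetical helper: zoning is only one possible feature set.
    """
    bitmap = np.asarray(bitmap, dtype=float)
    n, m = bitmap.shape
    # Split row and column indices into `grid` nearly equal contiguous chunks.
    row_chunks = np.array_split(np.arange(n), grid)
    col_chunks = np.array_split(np.arange(m), grid)
    # One feature per zone: the fraction of "ink" pixels inside that zone.
    return np.array([bitmap[np.ix_(r, c)].mean()
                     for r in row_chunks for c in col_chunks])

# Synthetic 16 x 16 binary bitmap containing a vertical bar (arbitrary size).
bitmap = np.zeros((16, 16), dtype=np.uint8)
bitmap[2:14, 7:9] = 1

features = zoning_features(bitmap, grid=4)
print(bitmap.size, "pixels ->", features.size, "features")
```

Here the learning machine would receive d = 16 values instead of nm = 256 pixels, which also reduces the storage and computation per example.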
The most important problem is the sparsity of data when the dimensionality of the features is high. Sparsity implies that it is usually hard to build learning machines with good performance when the dimensionality of the input data (that is, the feature dimensionality) is high. This phenomenon, discovered by Bellman, is called the curse of dimensionality [7].
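The sparsity effect can be made concrete with a small numerical experiment. The sketch below, written under illustrative assumptions (uniform random data in the unit hypercube, 10 bins per axis, 1000 samples, and the hypothetical function name `occupied_fraction`), divides the space into equal cells and measures the fraction of cells that contain at least one sample as the dimension grows.

```python
import numpy as np

def occupied_fraction(n_samples, dim, bins=10, seed=0):
    """Fraction of the bins**dim grid cells hit by at least one sample."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_samples, dim))        # uniform in [0, 1)^dim
    cells = np.floor(points * bins).astype(int)  # cell index along each axis
    n_occupied = len(np.unique(cells, axis=0))   # distinct occupied cells
    return n_occupied / bins ** dim

for dim in (1, 2, 3, 4, 5, 6):
    print(dim, occupied_fraction(1000, dim))
```

With a fixed number of samples the number of cells grows as 10^d, so the occupied fraction can be at most 1000 / 10^d; at d = 6 no more than 0.1% of the cells contain any data.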


Keywords: Independent Component; Independent Component Analysis; Feature Extraction Method; Blind Source Separation; Locally Linear Embedding




  1. Principal Component Analysis. Springer-Verlag, 1986.
  2. F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(1):1-48, 2002.
  3. P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2(1):53-58, 1989.
  4. A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.
  5. M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
  6. A. Bell and T. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.
  7. R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
  8. C. Bishop. Neural Networks for Pattern Recognition. Cambridge University Press, 1995.
  9. L. Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory, 39(3):999-1013, 1993.
  10. J. Bruske and G. Sommer. Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):572-575, May 1998.
  11. F. Camastra. Data dimensionality estimation methods: A survey. Pattern Recognition, 36(12):2945-2954, December 2003.
  12. F. Camastra and A. Vinciarelli. Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(10):1404-1407, October 2002.
  13. J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12):3017-3030, 1996.
  14. G. Cayton. Algorithms for manifold learning. Technical report, Computer Science and Engineering department, University of California, San Diego, 2005.
  15. C. L. Chang and R. C. T. Lee. A heuristic relaxation method for nonlinear mapping in cluster analysis. IEEE Transactions on Computers, C-23:178-184, February 1974.
  16. P. Comon. Independent component analysis - a new concept? Signal Processing, 36(?):287-314, 1994.
  17. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
  18. J. Costa and A. O. Hero. Geodesic entropic graphs for dimension and entropy estimation in manifold learning. IEEE Transactions on Signal Processing, 52(8):2210-2221, 2004.
  19. T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, 1991.
  20. P. Demartines and J. Herault. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping in cluster analysis. IEEE Transactions on Neural Networks, 8(1):148-154, January 1997.
  21. R. A. DeVore. Degree of nonlinear approximation. In Approximation Theory, Vol. VI, pages 175-201. Academic Press, 1991.
  22. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley, 2001.
  23. J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics, 57(3):617-659, 1985.
  24. J. P. Eckmann and D. Ruelle. Fundamental limitations for estimating dimensions and Lyapunov exponents in dynamical systems. Physica, D-56:185-187, 1992.
  25. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
  26. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.
  27. D. Fotheringhame and R. J. Baddeley. Nonlinear principal component analysis of neuronal spike train data. Biological Cybernetics, 77(4):282-288, 1997.
  28. J. H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249-260, 1987.
  29. J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9):881-890, 1974.
  30. K. Fukunaga. Intrinsic dimensionality extraction. In Classification, Pattern Recognition and Reduction of Dimensionality, Vol. 2 of Handbook of Statistics, pages 347-362. North Holland, 1982.
  31. K. Fukunaga. An Introduction to Statistical Pattern Recognition. Academic Press, 1990.
  32. K. Fukunaga and D. R. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, 20(2):165-171, 1976.
  33. F. Girosi. Regularization theory, radial basis functions and networks. In From Statistics to Neural Networks, pages 166-187. Springer-Verlag, 1994.
  34. F. Girosi and G. Anzellotti. Rates of convergence of approximation by translates. Technical report, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1993.
  35. P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica, D9(1-2):189-208, 1983.
  36. F. Hausdorff. Dimension und äusseres Mass. Math. Annalen, 79(1-2):157-179, 1918.
  37. A. Heyting and H. Freudenthal. Collected Works of L. E. J. Brouwer. North-Holland Elsevier, 1975.
  38. P. Huber. Projection pursuit. The Annals of Statistics, 13(2):435-475, 1985.
  39. U. Hübner, C. O. Weiss, N. B. Abraham, and D. Tang. Lorenz-like chaos in NH3-FIR lasers. In Time Series Prediction. Forecasting the Future and Understanding the Past, pages 73-104. Addison Wesley, 1994.
  40. A. Hyvärinen. New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems 10, pages 273-279. MIT Press, 1998.
  41. A. Hyvärinen. The fixed-point algorithm and maximum likelihood for independent component analysis. Neural Processing Letters, 10(1):1-5, 1999.
  42. A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492, 1997.
  43. A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.
  44. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
  45. L. K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Journal of the Royal Statistical Society, 20(1):608-613, March 1992.
  46. C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1):1-10, 1991.
  47. D. Kaplan and L. Glass. Understanding Nonlinear Dynamics. Springer-Verlag, 1995.
  48. J. Karhunen and J. Joutsensalo. Representations and separation of signals using nonlinear PCA type learning. Neural Networks, 7(1):113-127, 1994.
  49. J. Karhunen, E. Oja, L. Wang, R. Vigario, and J. Joutsensalo. A class of neural networks for independent component analysis. IEEE Transactions on Neural Networks, 8(3):486-504, 1997.
  50. B. Kégl. Intrinsic dimension estimation using packing numbers. In Advances in Neural Information Processing 15, pages 681-688. MIT Press, 2003.
  51. M. Kirby. Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns. John Wiley, 2001.
  52. T. Kohonen. Self-Organizing Map. Springer-Verlag, 1995.
  53. G. A. Korn and T. M. Korn. Mathematical Handbook for Scientists and Engineers. Dover, 1961.
  54. J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27, 1964.
  55. J. B. Kruskal. Comments on a nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-20:1614, December 1971.
  56. J. B. Kruskal. Linear transformation of multivariate data to reveal clustering. In Multidimensional Scaling, vol. I, pages 101-115. Academic Press, 1972.
  57. J. B. Kruskal and J. D. Carroll. Geometrical models and badness-of-fit functions. In Multivariate Analysis, vol. 2, pages 639-671. Academic Press, 1969.
  58. E. Levina and P. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing 17, pages 777-784. MIT Press, 2005.
  59. Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1):84-95, 1980.
  60. G. G. Lorentz. Approximation of Functions. Chelsea Publishing, 1986.
  61. E. C. Malthouse. Limitations of nonlinear PCA as performed with generic neural networks. IEEE Transactions on Neural Networks, 9(1):165-173, 1998.
  62. B. Mandelbrot. Fractals: Form, Chance and Dimension. Freeman, 1977.
  63. T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1994.
  64. B. Mohar. Laplace eigenvalues of graphs: a survey. Discrete Mathematics, 109(1-3):171-183, 1992.
  65. J.-P. Nadal and N. Parga. Nonlinear neurons in the low noise limit: a factorial code maximizes information transfer. Networks, 5(4):565-581, 1994.
  66. E. Ott. Chaos in Dynamical Systems. Cambridge University Press, 1993.
  67. B. A. Pearlmutter and L. C. Parra. Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In Advances in Neural Information Processing 9, pages 613-619. MIT Press, 1997.
  68. K. Pettis, T. Bailey, T. Jain, and R. Dubes. An intrinsic dimensionality estimator from near-neighbor information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(1):25-37, 1979.
  69. D.-T. Pham, P. Garrat, and C. Jutten. Separation of a mixture of independent sources through a maximum likelihood approach. In Proceedings EUSIPCO92, pages 771-774, 1992.
  70. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1989.
  71. A. K. Romney, R. N. Shepard, and S. B. Nerlove. Multidimensional Scaling, vol. 2, Applications. Seminar Press, 1972.
  72. A. K. Romney, R. N. Shepard, and S. B. Nerlove. Multidimensional Scaling, vol. I, Theory. Seminar Press, 1972.
  73. O. Samko, A. D. Marshall, and P. L. Rosin. Selection of the optimal parameter value for the Isomap algorithm. Pattern Recognition Letters, 27(9):968-979, 2006.
  74. J. W. Jr. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409, May 1969.
  75. L. K. Saul and S. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119-155, June 2003.
  76. R. N. Shepard. The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27(3):219-246, June 1962.
  77. R. N. Shepard. Representation of structure in similarity data problems and prospects. Psychometrika, 39(4):373-421, December 1974.
  78. R. N. Shepard and J. D. Carroll. Parametric representation of nonlinear data structures. In Multivariate Analysis, pages 561-592. Academic Press, 1969.
  79. L. A. Smith. Intrinsic limits on dimension calculations. Physics Letters, A133(6):283-288, 1988.
  80. R. L. Smith. Optimal estimation of fractal dimension. In Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity vol. XII, pages 115-135. Addison Wesley, 1992.
  81. F. Takens. On the numerical determination of the dimension of an attractor. In Dynamical Systems and Bifurcations, Proceedings Groningen 1984, pages 99-106. Springer-Verlag, 1984.
  82. J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(12):2319-2323, December 2000.
  83. J. Theiler. Lacunarity in a best estimator of fractal dimension. Physics Letters, A133(4-5):195-200, 1988.
  84. J. Theiler. Statistical precision of dimension estimators. Physical Review, A41:3038-3051, 1990.
  85. J. Theiler, S. Eubank, A. Longtin, B. Galdrikian, and J. D. Farmer. Testing for nonlinearity in time series: the method of surrogate data. Physica, D58(1-4):77-94, 1992.
  86. G. V. Trunk. Statistical estimation of the intrinsic dimensionality of a noisy signal collection. IEEE Transactions on Computers, 25(2):165-171, 1976.
  87. P. J. Verveer and R. Duin. An evaluation of intrinsic dimensionality estimators. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):81-86, January 1995.
  88. W. H. Wolberg and O. Mangasarian. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, U.S.A., 87(1):9193-9196, 1990.

Copyright information

© Springer 2008
