Abstract
Feature Extraction, also known as Multidimensional Scaling, is a basic primitive associated with indexing, clustering, nearest neighbor searching and visualization. We consider the problem of feature extraction when the data-points are complex and the distance function is very expensive to evaluate. Examples of expensive distance evaluations include computing the Hausdorff distance between polygons in a spatial database, or the edit distance between macromolecules in a DNA or protein database.
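To illustrate why such distances are costly, the following is a minimal sketch of unit-cost edit distance computed with the textbook dynamic program. The function name `edit_distance` is chosen here for illustration and does not come from the paper; sequence comparison in a real protein database would typically use a scored alignment rather than unit costs.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance between two strings.

    Runs in O(len(a) * len(b)) time, which is why a single evaluation
    on long sequences is expensive.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

Each call fills a table with one cell per pair of characters, so comparing two sequences of several thousand symbols already costs millions of elementary steps. This is the reason the number of distance evaluations is treated as a primary cost measure.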
We propose Cofe, a method for sparse feature extraction which is based on novel random non-linear projections. We evaluate Cofe on real data and find that it performs very well in terms of quality of features extracted, number of distances evaluated, number of database scans performed and total run time. We further propose Cofe-GR, which matches Cofe in terms of distance evaluations and run time, but outperforms it in terms of quality of features extracted.
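The abstract does not spell out how Cofe's projections are built. As a rough illustration of the general idea of a random non-linear projection, the sketch below uses a generic reference-set (Lipschitz-style) embedding, where each feature is the distance from an object to its nearest member of a randomly chosen reference set. This is an assumed stand-in, not the Cofe algorithm; the names `reference_set_features`, `num_features` and `set_size` are hypothetical.

```python
import random


def reference_set_features(objects, dist, num_features, set_size, seed=0):
    """Map each object to a vector of distances to random reference sets.

    Feature k of an object is the minimum of dist(object, r) over the
    members r of the k-th reference set. Returns the feature vectors and
    the number of distance evaluations performed.
    """
    rng = random.Random(seed)
    # One random reference set per output feature (hypothetical choice).
    ref_sets = [rng.sample(objects, set_size) for _ in range(num_features)]

    evaluations = 0
    features = []
    for obj in objects:
        row = []
        for refs in ref_sets:
            evaluations += len(refs)
            row.append(min(dist(obj, r) for r in refs))
        features.append(row)
    return features, evaluations


if __name__ == "__main__":
    # Toy data: points on a line with absolute difference as the distance.
    points = [0, 3, 7, 12, 20]
    vectors, calls = reference_set_features(
        points, lambda x, y: abs(x - y), num_features=3, set_size=2
    )
    print(vectors, calls)
```

When `dist` is a metric, the triangle inequality guarantees that no single feature differs between two objects by more than their original distance, so the projection never exaggerates distances coordinate-wise. The returned evaluation count equals the number of objects times `num_features` times `set_size`, which makes the trade-off between feature quality and distance evaluations explicit.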
© 2000 Springer-Verlag Berlin Heidelberg
Hristescu, G., Farach-Colton, M. (2000). Cofe: A Scalable Method for Feature Extraction from Complex Objects. In: Kambayashi, Y., Mohania, M., Tjoa, A.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2000. Lecture Notes in Computer Science, vol 1874. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44466-1_36
DOI: https://doi.org/10.1007/3-540-44466-1_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67980-6
Online ISBN: 978-3-540-44466-4
eBook Packages: Springer Book Archive