Abstract
We consider the problem of statistical pattern recognition in a heterogeneous, high-dimensional setting. In particular, we search for meaningful cross-category associations in a heterogeneous text document corpus. Our approach involves “iterative denoising”: iteratively extracting (corpus-dependent) features and partitioning the document collection into sub-corpora. We present an anecdote wherein this methodology discovers a meaningful cross-category association in a heterogeneous collection of scientific documents.
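The loop described in the abstract (extract features that depend on the current sub-corpus, partition, then repeat on each part) can be illustrated with a minimal sketch. This is not the authors' implementation: TF-IDF weighting and a spherical 2-means split are stand-ins chosen for brevity, and every name here (`tfidf`, `split_in_two`, `iterative_denoise`, `min_size`, `max_depth`) is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Corpus-dependent features: weights are recomputed for each sub-corpus."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        # Terms present in every document of this sub-corpus carry no signal.
        v = {t: c * math.log(n / df[t]) for t, c in tf.items() if df[t] < n}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def centroid(vecs):
    """Unit-length mean direction of a group of sparse vectors."""
    c = Counter()
    for v in vecs:
        for t, w in v.items():
            c[t] += w
    norm = math.sqrt(sum(w * w for w in c.values())) or 1.0
    return {t: w / norm for t, w in c.items()}


def split_in_two(vecs, n_iter=10):
    """Spherical 2-means, seeded with document 0 and the one least similar to it."""
    far = min(range(len(vecs)), key=lambda i: cosine(vecs[0], vecs[i]))
    centers = [vecs[0], vecs[far]]
    labels = None
    for _ in range(n_iter):
        new = [0 if cosine(v, centers[0]) >= cosine(v, centers[1]) else 1
               for v in vecs]
        if new == labels:
            break
        labels = new
        for k in (0, 1):
            members = [v for v, lab in zip(vecs, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return labels


def iterative_denoise(docs, ids=None, min_size=2, depth=0, max_depth=3):
    """Recursively re-extract features and partition; returns leaf sub-corpora as index lists."""
    ids = list(range(len(docs))) if ids is None else ids
    if len(ids) <= min_size or depth >= max_depth:
        return [ids]
    vecs = tfidf([docs[i] for i in ids])  # features depend on this sub-corpus only
    labels = split_in_two(vecs)
    left = [i for i, lab in zip(ids, labels) if lab == 0]
    right = [i for i, lab in zip(ids, labels) if lab == 1]
    if not left or not right:  # no useful split found; stop here
        return [ids]
    return (iterative_denoise(docs, left, min_size, depth + 1, max_depth)
            + iterative_denoise(docs, right, min_size, depth + 1, max_depth))
```

Calling `iterative_denoise(docs)` on a list of strings returns the leaf sub-corpora as lists of document indices. The point of the sketch is only that features are recomputed inside each sub-corpus, so terms that were uninformative in the full collection can become discriminating after a split.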
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Priebe, C.E. et al. (2004). Iterative Denoising for Cross-Corpus Discovery. In: Antoch, J. (eds) COMPSTAT 2004 — Proceedings in Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-2656-2_31
DOI: https://doi.org/10.1007/978-3-7908-2656-2_31
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-1554-2
Online ISBN: 978-3-7908-2656-2
eBook Packages: Springer Book Archive