Abstract
Text mining is contrasted with data mining in the context of automated prediction. Models are constructed by training on samples of unstructured documents, and the results are projected to new text. A standard data format for input to prediction methods is described. The key objective of data preparation is to transform text into a numerical format, so that it eventually shares a common representation with numerical data mining. Several text-mining problems that fit within the prediction framework are introduced: document classification, information retrieval, document clustering, information extraction, and performance evaluation.
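To make the idea of transforming text into a numerical format concrete, the following is a minimal sketch rather than the chapter's own method. It assumes a crude letters-only tokenizer and raw word counts as features; the function names (`tokenize`, `build_vocabulary`, `to_count_vectors`) and the two sample documents are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Return the sorted list of distinct tokens seen across all documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def to_count_vectors(documents, vocabulary):
    """Map each document to a row of word counts, one column per vocabulary word."""
    column = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for word, n in counts.items():
            if word in column:  # words outside the vocabulary are ignored
                row[column[word]] = n
        rows.append(row)
    return rows


# Hypothetical sample documents, not taken from the chapter.
docs = ["Stocks rise as markets rally", "Markets fall; stocks fall further"]
vocab = build_vocabulary(docs)
matrix = to_count_vectors(docs, vocab)

for row in matrix:
    print(row)
```

Each row of `matrix` is a fixed-length numeric vector, which is the kind of spreadsheet-like input that numerical prediction methods expect. Real systems would typically add steps such as stopword removal, stemming, or term weighting, none of which are shown here.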
Copyright information
© 2010 Springer-Verlag London Limited
About this chapter
Cite this chapter
Weiss, S.M., Indurkhya, N., Zhang, T. (2010). Overview of Text Mining. In: Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84996-226-1_1
DOI: https://doi.org/10.1007/978-1-84996-226-1_1
Publisher Name: Springer, London
Print ISBN: 978-1-84996-225-4
Online ISBN: 978-1-84996-226-1
eBook Packages: Computer Science, Computer Science (R0)