Abstract
In the data understanding phase we explored all available data and carefully checked whether they satisfy our assumptions and correspond to our expectations. We intend to apply various modeling techniques to extract models from the data. Although we have not yet discussed any modeling technique in detail (see Chaps. 7ff), we already caught a glimpse of some fundamental techniques and potential pitfalls in the previous chapter. Before we start modeling, we have to prepare our dataset appropriately, that is, modify it so that the modeling techniques are supported as well as possible while introducing as little bias as possible.
Notes
1. The reduction of inflected or derived words to their root (or stem) is called stemming. So-called stemmers are computer programs that try to automate this step.
2. Bayesian classifiers can handle numerical data directly by imposing some assumptions on their distribution. If such assumptions cannot be justified, discretization may be a better alternative.
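
As a rough illustration of the two notes above, the following Python sketch shows a toy suffix-stripping stemmer and an equal-width discretization of a numerical attribute. It is not taken from the chapter: the suffix list, the function names (`naive_stem`, `equal_width_bins`, `discretize`) and the choice of equal-width binning are assumptions made for illustration only. Real stemmers use far richer rule sets, and supervised discretization methods may suit a classifier better.

```python
from bisect import bisect_right

# Hypothetical suffix list, for illustration only; real stemmers
# apply many more rules and exceptions.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def equal_width_bins(values, k):
    """Return the k-1 cut points that split [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def discretize(value, cuts):
    """Map a numerical value to the index of the interval it falls into."""
    return bisect_right(cuts, value)
```

For example, `equal_width_bins([0, 10], 4)` yields the cut points 2.5, 5.0 and 7.5, and `discretize` then replaces each numerical value by an interval index that a classifier can treat as a categorical value.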
© 2010 Springer-Verlag London Limited
About this chapter
Cite this chapter
Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. (2010). Data Preparation. In: Guide to Intelligent Data Analysis. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84882-260-3_6
DOI: https://doi.org/10.1007/978-1-84882-260-3_6
Publisher Name: Springer, London
Print ISBN: 978-1-84882-259-7
Online ISBN: 978-1-84882-260-3
eBook Packages: Computer Science, Computer Science (R0)