
Part of the book series: Texts in Computer Science (TCS)

Abstract

In the data understanding phase we have explored all available data and carefully checked whether they satisfy our assumptions and correspond to our expectations. We now intend to apply various modeling techniques to extract models from the data. Although we have not yet discussed any modeling technique in greater detail (see Chaps. 7ff), we have already caught a glimpse of some fundamental techniques and potential pitfalls in the previous chapter. Before we start modeling, we have to prepare our data set appropriately, that is, we modify it so that the modeling techniques are supported as well as possible and biased as little as possible.


Notes

  1. The reduction of inflected or derived words to their root (or stem) is called stemming. So-called stemmers are computer programs that try to automate this step; a minimal stemming sketch follows these notes.

  2. Bayesian classifiers can handle numerical data directly by imposing some assumptions on their distribution. If such assumptions cannot be justified, discretization may be a better alternative; the second sketch below contrasts the two options.
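
A minimal sketch of the stemming step described in Note 1, here using the Porter stemmer from the NLTK library (an assumption made for illustration; the note does not prescribe any particular stemmer):

```python
# Reduce inflected/derived word forms to a common stem, so that
# e.g. "connected" and "connections" count as the same term.
from nltk.stem import PorterStemmer  # requires: pip install nltk

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))
# All five forms map to the stem "connect".
```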
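
And a sketch of the two options mentioned in Note 2, using scikit-learn (again an assumed toolkit): a Gaussian naive Bayes classifier, which imposes a normality assumption on each numeric feature per class, versus discretizing the feature first and treating the bin indices as categories:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic numeric feature: class 0 ~ N(0,1), class 1 ~ N(2,1).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

# Option (a): assume the feature is Gaussian within each class.
gnb = GaussianNB().fit(X, y)

# Option (b): make no distributional assumption; discretize into
# quantile-based bins and treat the bin index as a categorical value.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
Xd = disc.fit_transform(X)
cnb = CategoricalNB().fit(Xd, y)

print("Gaussian assumption :", gnb.score(X, y))
print("Discretized feature :", cnb.score(Xd, y))
```

In this toy example the Gaussian assumption holds by construction; on skewed or multimodal data the discretized variant often fares better, which is the point of the note.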


Author information


Corresponding author

Correspondence to Michael R. Berthold.


Copyright information

© 2010 Springer-Verlag London Limited

About this chapter

Cite this chapter

Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. (2010). Data Preparation. In: Guide to Intelligent Data Analysis. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84882-260-3_6

Download citation

  • DOI: https://doi.org/10.1007/978-1-84882-260-3_6

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84882-259-7

  • Online ISBN: 978-1-84882-260-3

  • eBook Packages: Computer Science, Computer Science (R0)
