Abstract
This paper describes the techniques used for categorizing variables in Snout, an intelligent assistant, currently under development, for exploratory data analysis of survey and similar data sets. We begin by reviewing existing work on category formation in data mining, which has been mainly concerned with enabling decision tree programs to handle numeric variables. We argue that there are other important but neglected aspects of category formation, notably the formation of new categorizations of nominal variables. We report the limited success achieved in categorizing variables from survey data using either endogenous methods or exogenous methods that maximise the association with only one dependent variable. We then describe the categorization technique used in Snout: a procedure that selects a partition that maximises both the number of variables associated with the partitioned variable and the strength of those associations. Finally, we report on the success achieved using this procedure in exploring real survey data.
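The selection criterion described above can be illustrated with a minimal sketch. The abstract does not say which association measure Snout uses or how the two objectives are combined, so everything here is an assumption: association is measured with Cramér's V, a variable counts as "associated" when V reaches a fixed threshold of 0.3, and candidate partitions are ranked first by the number of associated variables and then by their mean strength. The function names (cramers_v, score_partition, best_partition) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of category labels."""
    n = len(xs)
    row = Counter(xs)
    col = Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for r, row_count in row.items():
        for c, col_count in col.items():
            expected = row_count * col_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def score_partition(values, partition, others, threshold=0.3):
    """Return (number of associated variables, mean strength among them).

    `partition` maps each original value to a group label (assumed format).
    The threshold of 0.3 is an illustrative assumption, not from the paper.
    """
    recoded = [partition[v] for v in values]
    strengths = [cramers_v(recoded, other) for other in others.values()]
    strong = [s for s in strengths if s >= threshold]
    mean_strength = sum(strong) / len(strong) if strong else 0.0
    return len(strong), mean_strength


def best_partition(values, candidates, others, threshold=0.3):
    """Pick the candidate with the highest score; tuples compare count first."""
    return max(candidates,
               key=lambda p: score_partition(values, p, others, threshold))


# Toy data: one nominal variable with four values and two other variables.
values = ["a", "b", "c", "d", "a", "b", "c", "d"]
others = {"x": [0, 0, 1, 1, 0, 0, 1, 1],
          "y": [0, 0, 1, 1, 0, 0, 1, 1]}
p1 = {"a": 0, "b": 0, "c": 1, "d": 1}  # groups {a, b} and {c, d}
p2 = {"a": 0, "b": 1, "c": 0, "d": 1}  # groups {a, c} and {b, d}
chosen = best_partition(values, [p2, p1], others)
```

On this toy data the grouping p1 is perfectly associated with both other variables while p2 is independent of them, so p1 is chosen. A lexicographic ranking is only one way to combine the two objectives; a weighted score would be an equally plausible reading of the abstract.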
Copyright information
© 1997 Springer-Verlag
About this paper
Cite this paper
Scott, P.D., Williams, R.J., Ho, K.M. (1997). Forming categories in exploratory data analysis and data mining. In: Liu, X., Cohen, P., Berthold, M. (eds) Advances in Intelligent Data Analysis. Reasoning about Data. IDA 1997. Lecture Notes in Computer Science, vol 1280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0052844
DOI: https://doi.org/10.1007/BFb0052844
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63346-4
Online ISBN: 978-3-540-69520-2
eBook Packages: Springer Book Archive