Efficient Parallel Classification Using Dimensional Aggregates

  • Sanjay Goil
  • Alok Choudhary
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1759)

Abstract

Multidimensional aggregates are frequently computed to improve query performance in Online Analytical Processing (OLAP) applications. We present a new method for decision-tree-based classification that uses the aggregates computed in the multidimensional data model. The structure imposed on data in an explicit multidimensional storage mechanism leads to efficient dimensional operations. Decision-tree-based classification algorithms perform computations to find the best split point at each node of the tree. The split can be computed efficiently from the one-dimensional aggregates if the cell values are the class-id values and counts are maintained for each class. These aggregates are used repeatedly at the nodes of the decision tree to calculate splits and manage data. Previous parallel approaches to decision-tree-based classification use sorted attribute lists and hash tables to compute the split point and split the data appropriately. In the worst case, the amount of data they communicate at each level of the tree is proportional to the product of the number of records in the training set and the number of dimensions. The parallel formulation of our approach communicates data proportional to the product of the sum of the cardinalities of all dimensions and the number of non-classified nodes at each level of the tree. Communication volume is therefore greatly reduced, and by coalescing messages it is carried out in a single communication phase at each level of the tree. Preliminary results from our experiments on a coarse-grained, distributed-memory parallel machine (IBM SP2) show good performance.
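
The idea of computing a split from per-class counts in a one-dimensional aggregate can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it assumes the Gini index as the split criterion (the term appears in the keywords, but the abstract does not name the criterion), an ordered dimension with binary splits, and hypothetical names (`gini`, `best_split`, `merge_aggregates`). Here `agg[v][c]` is taken to be the number of training records with dimension value `v` and class `c`.

```python
from typing import List, Tuple


def gini(counts: List[int]) -> float:
    """Gini index of a class-count vector (0.0 for an empty partition)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def best_split(agg: List[List[int]]) -> Tuple[int, float]:
    """Return (v, score): splitting into values <= v and values > v
    gives the lowest weighted Gini index. Assumes an ordered dimension."""
    n_classes = len(agg[0])
    total = [sum(row[c] for row in agg) for c in range(n_classes)]
    n = sum(total)
    left = [0] * n_classes
    best_idx, best_score = -1, float("inf")
    for v in range(len(agg) - 1):
        # Running prefix sum of class counts for the left partition.
        left = [l + a for l, a in zip(left, agg[v])]
        right = [t - l for t, l in zip(total, left)]
        n_left = sum(left)
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue  # skip splits that leave one side empty
        score = (n_left * gini(left) + n_right * gini(right)) / n
        if score < best_score:
            best_idx, best_score = v, score
    return best_idx, best_score


def merge_aggregates(parts: List[List[List[int]]]) -> List[List[int]]:
    """Element-wise sum of per-processor count tables (assumed layout)."""
    return [[sum(cells) for cells in zip(*rows)] for rows in zip(*parts)]


# Hypothetical counts: 4 dimension values, 2 classes.
agg = [[4, 0], [3, 1], [0, 5], [0, 2]]
idx, score = best_split(agg)
```

Under these assumptions the scan touches only one count vector per dimension value, never the individual records. If each processor holds a partial count table of this shape, combining them as in `merge_aggregates` exchanges a volume that depends on the dimension's cardinality rather than on the number of training records, which is consistent with the communication bound stated in the abstract.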

Keywords

Active Node, Gini Index, Categorical Attribute, Distribute Hash Table, Split Point
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Sanjay Goil (1)
  • Alok Choudhary (2)
  1. Performance Technologies Group, Sun Microsystems Inc., USA
  2. Department of Electrical & Computer Engineering, Northwestern University, USA
