Compiler and Middleware Support for Scalable Data Mining

  • Gagan Agrawal
  • Ruoming Jin
  • Xiaogang Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2624)


The parallelizing compiler community has traditionally focused its efforts on scientific applications. This paper gives an overview of a compiler/runtime project targeting parallel and scalable execution of data mining algorithms. To the best of our knowledge, this is the first project with such a focus.

Data mining is the process of analyzing large datasets to extract novel and useful patterns or models. Although considerable effort has gone into developing parallel algorithms for data mining tasks, the expertise and effort currently required to implement, maintain, and performance-tune a parallel data mining application impede the wide use of parallel computers for data mining.

We have developed a data parallel dialect of Java that can be used to express common data mining algorithms at a high level. Our compiler generates a middleware specification from this dialect of Java. The middleware supports both distributed memory and shared memory parallelization and performs a number of I/O optimizations to support efficient processing of disk-resident datasets. Our final goal is to start from declarative mining operators and translate them to data parallel Java.

In this paper, we describe the commonality among different data mining algorithms, the middleware and its interface, the data parallel dialect of Java, and the compilation techniques required for generating the middleware specification. Experimental evaluations of the middleware and the compiler are also presented.
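
As a rough illustration of the kind of commonality such a system could exploit, the sketch below assumes that a mining task can be written as a local reduction over each data partition followed by a global combination of the partial results. The abstract does not spell out the commonality; "local reduction" appears only among the page's keywords, so this structure is an assumption. The code is plain Java, not the paper's dialect or middleware interface, and the names LocalReductionSketch, localReduce, globalReduce, and countItems are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only: not the paper's Java dialect or middleware API.
class LocalReductionSketch {

    // Local reduction: count item occurrences within a single partition (chunk).
    static Map<String, Integer> localReduce(List<String[]> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] transaction : chunk) {
            for (String item : transaction) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Global reduction: combine the per-partition counts into one result.
    static Map<String, Integer> globalReduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((item, n) -> total.merge(item, n, Integer::sum));
        }
        return total;
    }

    // Each chunk is reduced independently (here via a parallel stream),
    // then the partial results are merged sequentially.
    static Map<String, Integer> countItems(List<List<String[]>> chunks) {
        List<Map<String, Integer>> partials = chunks.parallelStream()
                .map(LocalReductionSketch::localReduce)
                .collect(Collectors.toList());
        return globalReduce(partials);
    }
}
```

Because each partition is reduced without touching shared state, the same two-step shape could in principle be mapped to threads on a shared memory machine or to processes on a distributed memory machine; how the actual middleware does this is not described in the abstract.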


Data Mining; Association Rule; Data Mining Algorithm; Local Reduction; Data Mining Application
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Gagan Agrawal (1)
  • Ruoming Jin (1)
  • Xiaogang Li (1)
  1. Department of Computer and Information Sciences, University of Delaware, Newark
