The National Scalable Cluster Project: Three Lessons about High Performance Data Mining and Data Intensive Computing

  • Chapter
Handbook of Massive Data Sets

Part of the book series: Massive Computing ((MACO,volume 4))

Abstract

We discuss three lessons learned from experience with the National Scalable Cluster Project. First, storing, managing, and mining massive data requires systems that exploit parallelism; this can be achieved with shared-nothing clusters and careful attention to I/O paths. Second, exploiting data parallelism at the file and record level maps data-intensive problems efficiently onto clusters and is particularly well suited to data mining. Third, the repetitive nature of data mining demands special attention to data layout on the hardware and to software access patterns, while maintaining a storage schema easily derived from the legacy form of the data.
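The file- and record-level data parallelism described above can be illustrated with a minimal sketch, not taken from the chapter: each worker mines its own data partition independently (as each node in a shared-nothing cluster would scan files on its local disk), and the partial results are merged into a global result. The partition contents and the frequency-count "mining" step are hypothetical stand-ins for illustration.

```python
# Sketch of shared-nothing, partition-level data parallelism.
# Each worker processes only its own partition (no shared I/O path),
# then partial results are combined -- the pattern the abstract describes.
from multiprocessing import Pool
from collections import Counter

# Hypothetical partitions: in a real cluster, each would be a file on a
# node's local disk, so no I/O crosses node boundaries during mining.
PARTITIONS = [
    ["a", "b", "a"],
    ["b", "c", "c"],
    ["a", "c", "a"],
]

def mine_partition(records):
    """Local mining step: here, just a frequency count over one partition."""
    return Counter(records)

def merge(partials):
    """Combine per-partition results into a single global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # One worker per partition; map() distributes the partitions.
    with Pool(processes=len(PARTITIONS)) as pool:
        partials = pool.map(mine_partition, PARTITIONS)
    print(merge(partials))
```

Because each partition is mined without coordination, the same structure scales from processes on one machine to nodes in a cluster; only the merge step requires communication.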

Copyright information

© 2002 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Grossman, R., Hollebeek, R. (2002). The National Scalable Cluster Project: Three Lessons about High Performance Data Mining and Data Intensive Computing. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_23

  • DOI: https://doi.org/10.1007/978-1-4615-0005-6_23

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-4882-5

  • Online ISBN: 978-1-4615-0005-6

  • eBook Packages: Springer Book Archive
