Abstract
We discuss three principles learned from experience with the National Scalable Cluster Project. First, storing, managing, and mining massive data requires systems that exploit parallelism; this can be achieved with shared-nothing clusters and careful attention to I/O paths. Second, exploiting data parallelism at the file and record level maps data-intensive problems efficiently onto clusters and is particularly well suited to data mining. Finally, the repetitive nature of data mining demands special attention to data layout on the hardware and to software access patterns, while maintaining a storage schema easily derived from the legacy form of the data.
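The second principle above can be sketched in a few lines. The following is an illustrative example, not code from the chapter: records are partitioned across independent workers in the shared-nothing style, each worker scans only its own partition, and partial results are merged at the end. All names here (`scan_partition`, `merge`, the toy key/value records) are assumptions made for the sketch.

```python
# Sketch of file/record-level data parallelism (shared-nothing style).
# Each worker scans its own partition independently; nothing is shared
# during the scan, and partial results are merged only at the end.
from multiprocessing import Pool

def scan_partition(records):
    """Mine one partition independently, e.g. sum values per key."""
    counts = {}
    for key, value in records:
        counts[key] = counts.get(key, 0) + value
    return counts

def merge(partials):
    """Combine per-partition results into one answer."""
    total = {}
    for part in partials:
        for key, value in part.items():
            total[key] = total.get(key, 0) + value
    return total

if __name__ == "__main__":
    # Hypothetical records standing in for data laid out across nodes.
    records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    partitions = [records[i::2] for i in range(2)]  # record-level split
    with Pool(2) as pool:
        partials = pool.map(scan_partition, partitions)
    print(merge(partials))  # {'a': 4, 'b': 2, 'c': 4}
```

Because each partition is processed without coordination, the same pattern scales from one multiprocessing pool to many cluster nodes, which is the mapping the abstract argues suits data mining well.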
© 2002 Springer Science+Business Media Dordrecht
Cite this chapter
Grossman, R., Hollebeek, R. (2002). The National Scalable Cluster Project: Three Lessons about High Performance Data Mining and Data Intensive Computing. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_23
Print ISBN: 978-1-4613-4882-5
Online ISBN: 978-1-4615-0005-6