Abstract
We discuss three principles learned from experience with the National Scalable Cluster Project. First, storing, managing, and mining massive data requires systems that exploit parallelism; this can be achieved with shared-nothing clusters and careful attention to I/O paths. Second, exploiting data parallelism at the file and record level maps data-intensive problems efficiently onto clusters and is particularly well suited to data mining. Finally, the repetitive nature of data mining demands special attention to data layout on the hardware and to software access patterns, while maintaining a storage schema easily derived from the legacy form of the data.
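The second principle above can be sketched in a few lines. The following is an illustrative example, not code from the chapter: records are partitioned across independent workers in the shared-nothing style, each worker scans only its own partition, and partial results are merged at the end. All names here (`scan_partition`, `merge`, the toy key/value records) are assumptions made for the sketch.

```python
# Sketch of file/record-level data parallelism (shared-nothing style).
# Each worker scans its own partition independently; nothing is shared
# during the scan, and partial results are merged only at the end.
from multiprocessing import Pool

def scan_partition(records):
    """Mine one partition independently, e.g. sum values per key."""
    counts = {}
    for key, value in records:
        counts[key] = counts.get(key, 0) + value
    return counts

def merge(partials):
    """Combine per-partition results into one answer."""
    total = {}
    for part in partials:
        for key, value in part.items():
            total[key] = total.get(key, 0) + value
    return total

if __name__ == "__main__":
    # Hypothetical records standing in for data laid out across nodes.
    records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    partitions = [records[i::2] for i in range(2)]  # record-level split
    with Pool(2) as pool:
        partials = pool.map(scan_partition, partitions)
    print(merge(partials))  # {'a': 4, 'b': 2, 'c': 4}
```

Because each partition is processed without coordination, the same pattern scales from one multiprocessing pool to many cluster nodes, which is the mapping the abstract argues suits data mining well.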
© 2002 Springer Science+Business Media Dordrecht
Cite this chapter
Grossman, R., Hollebeek, R. (2002). The National Scalable Cluster Project: Three Lessons about High Performance Data Mining and Data Intensive Computing. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_23
Print ISBN: 978-1-4613-4882-5
Online ISBN: 978-1-4615-0005-6