Big Data Analytics
An in-depth analysis of data can reveal many interesting properties that help predict the data's future characteristics. The objective of this chapter is to illustrate some of the meaningful changes that may occur in a data set as it evolves into big data. To make this objective practical and interesting, a split-merge-split framework is developed, presented, and applied in this chapter. The framework uses a set of file-split, file-merge, and feature-split tasks, and it helps explore how patterns evolve as a data set is transformed into big data. Four digital images are used to create data sets, and statistical and geometrical techniques are applied within the split-merge-split framework to understand the evolution of patterns under different class-characteristic, domain-characteristic, and error-characteristic scenarios.
Keywords: Scatter Plot, Minority Class, Data Domain, Imbalanced Data, Hadoop Distributed File System
I would like to thank my daughter Prattheeba Suthaharan for generating the “MyPrismaColors” data set (image) presented in Fig. 3.12. I would also like to thank my graduate student Tyler Wendell, who took my Big Data Analytics and Machine Learning course (CSC495/693) in fall 2014 at the Department of Computer Science, the University of North Carolina at Greensboro, and then extended the application of the Split-Merge-Split technique to his music data to classify country music and classical music from other genres (e.g., blues and jazz) using the Hadoop Distributed File System and the Java platform. I greatly appreciate Mr. Brent Ladd and Mr. Robert Brown for their support in developing the Big Data Analytics and Machine Learning course through a subaward approved by the National Science Foundation.