An in-depth analysis of data can reveal many interesting properties that help us predict its future characteristics. The objective of this chapter is to illustrate some of the meaningful changes that may occur in a set of data as it evolves into big data. To make this objective practical and interesting, a split-merge-split framework is developed, presented, and applied in this chapter. The framework uses a set of file-split, file-merge, and feature-split tasks, and it helps explore how patterns evolve as a set of data is transformed into a set of big data. Four digital images are used to create data sets, and statistical and geometrical techniques are applied with the split-merge-split framework to understand the evolution of patterns under different class, domain, and error characteristics.
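The three tasks named above can be pictured with a small sketch. The abstract does not specify how the chapter implements them, so everything below is an assumption made for illustration: the function names `file_split`, `file_merge`, and `feature_split` are hypothetical, the toy rows stand in for data derived from the images, and the particular choices (random row-wise splitting, pairwise merging of subsets, one column per feature) are one plausible reading rather than the author's method.

```python
import random


def file_split(rows, n_parts, seed=0):
    """Shuffle the rows of one file and deal them into n_parts subsets (assumed behaviour)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def file_merge(parts_a, parts_b):
    """Merge the i-th subset of one file with the i-th subset of another (assumed behaviour)."""
    return [a + b for a, b in zip(parts_a, parts_b)]


def feature_split(rows, n_features):
    """Split merged rows column-wise: one (values, labels) pair per feature (assumed behaviour)."""
    labels = [row[-1] for row in rows]
    return [([row[j] for row in rows], labels) for j in range(n_features)]


# Two toy "files"; each row is (feature_0, feature_1, class_label).
file_a = [(i, 2 * i, "A") for i in range(6)]
file_b = [(i + 100, i + 200, "B") for i in range(6)]

parts_a = file_split(file_a, 3, seed=1)   # split
parts_b = file_split(file_b, 3, seed=2)
merged = file_merge(parts_a, parts_b)     # merge
columns = feature_split(merged[0], 2)     # split again, by feature
```

Each entry of `columns` pairs one feature's values with the class labels, which is the shape a per-feature plot or summary statistic would need. Varying the number of parts or the mix of classes in each merged subset is one way to mimic different class-characteristic scenarios.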


Scatter Plot, Minority Class, Data Domain, Imbalanced Data, Hadoop Distributed File System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



I would like to thank my daughter Prattheeba Suthaharan for generating the “MyPrismaColors” data set (image) presented in Fig. 3.12. I would also like to thank my graduate student Tyler Wendell, who took my Big Data Analytics and Machine Learning course (CSC495/693) in fall 2014 at the Department of Computer Science, the University of North Carolina at Greensboro, and then extended the split-merge-split technique to his music data to distinguish country music and classical music from other genres (e.g., blues and jazz) using the Hadoop distributed file system and the Java platform. I greatly appreciate the support of Mr. Brent Ladd and Mr. Robert Brown in developing the Big Data Analytics and Machine Learning course through a subaward approved by the National Science Foundation.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Shan Suthaharan (1)
  1. Department of Computer Science, UNC Greensboro, Greensboro, USA
