Case Study II: Data Classification using Scalding and Spark

  • K G Srinivasa
  • Anil Kumar Muppalla
Part of the Computer Communications and Networks book series (CCN)


It is important to characterize learning problems by the type of data they use. Knowledge about the data matters because similar learning techniques can be applied to similar data types. For example, Natural Language Processing and Bioinformatics use very similar string-processing tools, applied to natural language text and DNA sequences, respectively. The most basic type of data entity is the vector. For example, an insurance corporation may use a vector of patient details, such as blood pressure, heart rate, height, weight, cholesterol, smoking status, and gender, to infer a patient's life expectancy. A farmer might determine the ripeness of fruit from a vector of size, weight, and spectral data. An electrical engineer may want to find the dependency between voltage and current. A search engine might use a vector of counts describing the frequency of words.
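To make the vector representation concrete, the following is a minimal sketch in plain Scala (the language underlying Scalding and Spark) of the fruit-ripeness example above: each fruit is a feature vector, and a simple nearest-centroid rule assigns it a class. The feature values, class names, and the `VectorClassification` object are hypothetical illustrations, not part of the chapter's case study.

```scala
// A minimal sketch of vector-based classification, assuming hypothetical
// fruit features (size in cm, weight in g, a spectral index in [0, 1]).
object VectorClassification {
  type Vec = Array[Double]

  // Euclidean distance between two feature vectors of equal length.
  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Assign a vector to the class whose centroid is nearest.
  def classify(x: Vec, centroids: Map[String, Vec]): String =
    centroids.minBy { case (_, c) => dist(x, c) }._1

  def main(args: Array[String]): Unit = {
    // Hypothetical class centroids learned from labelled examples.
    val centroids = Map(
      "ripe"   -> Array(8.0, 150.0, 0.9),
      "unripe" -> Array(6.0, 100.0, 0.3)
    )
    val sample: Vec = Array(7.8, 145.0, 0.85)
    println(classify(sample, centroids))
  }
}
```

The same idea scales to the other examples in the text: the patient record and the word-count representation are simply longer vectors fed to the same kind of distance or model computation, which is what makes a common toolset like Spark applicable across domains.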







Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. M.S. Ramaiah Institute of Technology, Bangalore, India
