Part of the book series: Integrated Series in Information Systems (ISIS, volume 36)


Abstract

The main objective of this chapter is to explain two important dimensionality reduction techniques, feature hashing and principal component analysis, that can support the scaling up of machine learning. Both the standard and the flagged feature hashing approaches are described in detail. Feature hashing suffers from the hash-collision problem, which is also examined in this chapter, and two collision controllers, feature binning and feature mitigation, are proposed to address it. Principal component analysis relies on the concepts of eigenvalues and eigenvectors; these terms are explained with examples, the technique itself is illustrated with a simple two-dimensional example, and several coding examples are presented.
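As a hedged illustration of the first technique (a minimal sketch, not the chapter's own code), the standard hashing trick together with the signed "flagged" variant that reduces collision bias might look like the following Python; the feature names, vector size, and choice of hash function are assumptions made for the example.

```python
import hashlib

def hash_features(tokens, dim=8, flagged=True):
    """Map string features into a fixed-size vector (the hashing trick).

    Standard hashing adds +1 to the hashed bucket; the flagged variant
    also hashes a sign bit, so colliding features tend to cancel out
    instead of piling up in the same bucket.
    """
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % dim                                      # bucket index
        sign = -1.0 if flagged and (h >> 8) % 2 else 1.0   # flag bit
        vec[idx] += sign
    return vec

# Example: four distinct features forced into four buckets, so collisions can occur.
print(hash_features(["cat", "dog", "cat", "fish"], dim=4))
```

Similarly, the two-dimensional principal component analysis mentioned in the abstract can be sketched with NumPy; the data values below are hypothetical, chosen only to show the eigenvalue and eigenvector steps.

```python
import numpy as np

# A small hypothetical two-dimensional data set.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order

# The first principal component is the eigenvector with the largest
# eigenvalue; projecting onto it reduces the 2-D data to 1-D.
pc1 = eigvecs[:, np.argmax(eigvals)]
projected = Xc @ pc1
print("eigenvalues:", eigvals)
print("first PC:", pc1)
print("1-D projection:", projected)
```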

Copyright information

© 2016 Springer Science+Business Media New York

About this chapter

Cite this chapter

Suthaharan, S. (2016). Dimensionality Reduction. In: Machine Learning Models and Algorithms for Big Data Classification. Integrated Series in Information Systems, vol 36. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7641-3_14
