Part of the book series: Integrated Series in Information Systems (ISIS, volume 36)


Abstract

The main objective of this chapter is to explain two important dimensionality reduction techniques, feature hashing and principal component analysis, that can support the scaling up of machine learning. Both the standard and the flagged feature hashing approaches are described in detail. Feature hashing suffers from the hash-collision problem, which is also examined in this chapter, and two collision controllers, feature binning and feature mitigation, are proposed to address it. Principal component analysis relies on the concepts of eigenvalues and eigenvectors; these terms are explained with examples, the technique itself is illustrated with a simple two-dimensional example, and several coding examples are presented.
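As a hedged illustration of the first technique (a minimal sketch, not the chapter's own code), the standard hashing trick together with the signed "flagged" variant that reduces collision bias might look like the following Python; the feature names, vector size, and choice of hash function are assumptions made for the example.

```python
import hashlib

def hash_features(tokens, dim=8, flagged=True):
    """Map string features into a fixed-size vector (the hashing trick).

    Standard hashing adds +1 to the hashed bucket; the flagged variant
    also hashes a sign bit, so colliding features tend to cancel out
    instead of piling up in the same bucket.
    """
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % dim                                      # bucket index
        sign = -1.0 if flagged and (h >> 8) % 2 else 1.0   # flag bit
        vec[idx] += sign
    return vec

# Example: four distinct features forced into four buckets, so collisions can occur.
print(hash_features(["cat", "dog", "cat", "fish"], dim=4))
```

Similarly, the two-dimensional principal component analysis mentioned in the abstract can be sketched with NumPy; the data values below are hypothetical, chosen only to show the eigenvalue and eigenvector steps.

```python
import numpy as np

# A small hypothetical two-dimensional data set.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order

# The first principal component is the eigenvector with the largest
# eigenvalue; projecting onto it reduces the 2-D data to 1-D.
pc1 = eigvecs[:, np.argmax(eigvals)]
projected = Xc @ pc1
print("eigenvalues:", eigvals)
print("first PC:", pc1)
print("1-D projection:", projected)
```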

Copyright information

© 2016 Springer Science+Business Media New York

About this chapter

Cite this chapter

Suthaharan, S. (2016). Dimensionality Reduction. In: Machine Learning Models and Algorithms for Big Data Classification. Integrated Series in Information Systems, vol 36. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7641-3_14
