Clustering and Classification to Evaluate Data Reduction via Johnson-Lindenstrauss Transform

  • Conference paper

Published in: Advances in Information and Communication (FICC 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1130)

Abstract

A dataset is an \(n\times d\) matrix X, where n is the number of observations and d is the number of variables (dimensions). Johnson and Lindenstrauss showed that a transformation exists that maps X to an \(n\times k\) matrix, \(k \ll d\), such that certain geometric properties of the original matrix are preserved. The property we seek is that, for every pair of points in X, the distance between them is the same, within a given small acceptable level of distortion, as the distance between the corresponding pair of points in the reduced dataset. Does it follow that the semantic content of the data is preserved by the transformation? We answer in the affirmative: meaning in the original dataset was preserved in the reduced dataset, as confirmed by comparing clustering and classification results on the original and reduced datasets.
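For concreteness, the guarantee the abstract invokes is that for every pair of rows u, v of X, the projection f satisfies \((1-\varepsilon)\lVert u-v\rVert^2 \le \lVert f(u)-f(v)\rVert^2 \le (1+\varepsilon)\lVert u-v\rVert^2\). The sketch below is ours, not the authors' code: it uses scikit-learn's GaussianRandomProjection and johnson_lindenstrauss_min_dim on synthetic data with illustrative sizes, and a k-means comparison via the adjusted Rand index as one simple stand-in for the paper's fuller clustering-and-classification evaluation.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n, d, eps = 500, 10_000, 0.3        # illustrative sizes, not the paper's data
X = rng.standard_normal((n, d))     # stand-in for the original n x d matrix

# Smallest k that guarantees eps-distortion for n points (JL lower bound).
k = johnson_lindenstrauss_min_dim(n_samples=n, eps=eps)
X_red = GaussianRandomProjection(n_components=k, random_state=0).fit_transform(X)

# Geometric check: squared pairwise distances should stay within (1 +/- eps).
ratios = pdist(X_red) ** 2 / pdist(X) ** 2
print(f"k = {k}; squared-distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")

# Semantic check (proxy): cluster both versions and compare the partitions.
labels_orig = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
labels_red = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_red)
print("adjusted Rand index:", adjusted_rand_score(labels_orig, labels_red))
```

An adjusted Rand index near 1 means the original and reduced spaces induce essentially the same partition of the observations, which is the sense in which "meaning" is said to be preserved.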


References

  1. Wang, A., Gehan, E.A.: Gene selection for microarray data analysis using principal component analysis. Stat. Med. 24(13), 2069–2087 (2005)

  2. Fedoruk, J., Schmuland, B., Johnson, J., Heo, G.: Dimensionality reduction via the Johnson-Lindenstrauss lemma: theoretical and empirical bounds on embedding dimension. J. Supercomput. 74(8), 3933–3949 (2018)

  3. Cannings, T.I., Samworth, R.J.: Random-projection ensemble classification. J. Roy. Stat. Soc. B (Stat. Methodol.) 79(4), 959–1035 (2017)

  4. Dasgupta, S.: Experiments with random projection. CoRR, vol. abs/1301.3849 (2013)

  5. Fern, X., Brodley, C.: Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML) (2003)

  6. Klopotek, M.A.: Machine learning friendly set version of Johnson-Lindenstrauss lemma. CoRR, vol. abs/1703.01507 (2017)

  7. Hoogman, M., Bralten, J., Hibar, D.P., Mennes, M., Zwiers, M.P., Schweren, L.S.J., van Hulzen, K.J.E., Medland, S.E., Shumskaya, E., Jahanshad, N., de Zeeuw, P., Szekely, E., Sudre, G., Wolfers, T., Onnink, A.M.H., Dammers, J.T., Mostert, J.C., Vives-Gilabert, Y., Kohls, G., Oberwelland, E., Seitz, J., Schulte-Rüther, M., Ambrosino, S., Doyle, A.E., Høvik, M.F., Dramsdahl, M., Tamm, L., van Erp, T.G.M., Dale, A., Schork, A., Conzelmann, A., Zierhut, K., Baur, R., McCarthy, H., Yoncheva, Y.N., Cubillo, A., Chantiluke, K., Mehta, M.A., Paloyelis, Y., Hohmann, S., Baumeister, S., Bramati, I., Mattos, P., Tovar-Moll, F., Douglas, P., Banaschewski, T., Brandeis, D., Kuntsi, J., Asherson, P., Rubia, K., Kelly, C., Martino, A.D., Milham, M.P., Castellanos, F.X., Frodl, T., Zentis, M., Lesch, K.-P., Reif, A., Pauli, P., Jernigan, T.L., Haavik, J., Plessen, K.J., Lundervold, A.J., Hugdahl, K., Seidman, L.J., Biederman, J., Rommelse, N., Heslenfeld, D.J., Hartman, C.A., Hoekstra, P.J., Oosterlaan, J., von Polier, G., Konrad, K., Vilarroya, O., Ramos-Quiroga, J.A., Soliva, J.C., Durston, S., Buitelaar, J.K., Faraone, S.V., Shaw, P., Thompson, P.M., Franke, B.: Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults: a cross-sectional mega-analysis. The Lancet Psychiatry 4(4), 310–319 (2017)

  8. Sun, H., Chen, Y., Huang, Q., Lui, S., Huang, X., Shi, Y., Xu, X., Sweeney, J.A., Gong, Q.: Psychoradiologic utility of MR imaging for diagnosis of attention deficit hyperactivity disorder: a radiomics analysis. Radiology 287(2), 620–630 (2018). PMID: 29165048

  9. Li, T., Ma, S., Ogihara, M.: Wavelet Methods in Data Mining, pp. 553–571. Springer, Boston (2010)

  10. Agarwal, D., Agrawal, R., Khanna, R., Kota, N.: Estimating rates of rare events with multiple hierarchies through scalable log-linear models. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2010, pp. 213–222. ACM, New York (2010)

  11. Hand, D.J.: Data mining. Based in part on the article “Data mining” by David Hand, which appeared in the Encyclopedia of Environmetrics (2013)

  12. Xi, X., Ueno, K., Keogh, E., Lee, D.-J.: Converting non-parametric distance-based classification to anytime algorithms. Pattern Anal. Appl. 11(3), 321–336 (2008)

  13. Lalitha, Y.S., Latte, M.V.: Lossless and lossy compression of DICOM images with scalable ROI. IJCSNS Int. J. Comput. Sci. Netw. Secur. 10(7), 276–281 (2010)

  14. Du, K.-L., Swamy, M.N.S.: Recurrent Neural Networks, pp. 337–353. Springer, London (2014)

  15. Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J., Steinberg, D.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)

  16. Suthaharan, S.: Machine Learning Models and Algorithms for Big Data Classification, vol. 36. Springer, Boston (2016)

  17. Noble, W.S.: What is a support vector machine? Nat. Biotechnol. 24(12), 1565–1567 (2006)

  18. Stein, G., Chen, B., Wu, A.S., Hua, K.A.: Decision tree classifier for network intrusion detection with GA-based feature selection. In: Proceedings of the 43rd Annual Southeast Regional Conference - Volume 2, ACM-SE 43, pp. 136–141. ACM, New York (2005)

  19. Mukherjee, S., Sharma, N.: Intrusion detection using Naive Bayes classifier with feature reduction. Procedia Technol. 4, 119–128 (2012). 2nd International Conference on Computer, Communication, Control and Information Technology (C3IT-2012) on February 25–26, 2012

  20. Deshmukh, S., Rajeswari, K., Patil, R.: Analysis of simple K-means with multiple dimensions using WEKA. Int. J. Comput. Appl. 110(1), 14–17 (2015)

  21. Zarzour, H., Al-Sharif, Z., Al-Ayyoub, M., Jararweh, Y.: A new collaborative filtering recommendation algorithm based on dimensionality reduction and clustering techniques, pp. 102–106 (2018)

  22. Nilashi, M., Ibrahim, O., Ahmadi, H., Shahmoradi, L., Samad, S., Bagherifard, K.: A recommendation agent for health products recommendation using dimensionality reduction and prediction machine learning techniques. J. Soft Comput. Decis. Support Syst. 5, 7–15 (2018)

  23. Wang, S., Lu, J., Gu, X., Du, H., Yang, J.: Semi-supervised linear discriminant analysis for dimension reduction and classification. Pattern Recogn. 57, 179–189 (2016)

  24. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26(7), 189–206 (1984)

  25. Bellec, P., Chu, C., Chouinard-Decorte, F., Benhajali, Y., Margulies, D.S., Craddock, R.C.: The Neuro Bureau ADHD-200 preprocessed repository. NeuroImage 144, 275–286 (2017). Data Sharing Part II

  26. Matoušek, J.: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2), 142–156 (2008)

  27. Bengio, Y., Grandvalet, Y.: No unbiased estimator of the variance of K-fold cross-validation. J. Mach. Learn. Res. 5, 1089–1105 (2004)

  28. Markatou, M., Tian, H., Biswas, S., Hripcsak, G.: Analysis of variance of cross-validation estimators of the generalization error. J. Mach. Learn. Res. 6, 1127–1168 (2005)

Author information

Corresponding author

Correspondence to Julia Johnson.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ghalib, A., Jessup, T.D., Johnson, J., Monemian, S. (2020). Clustering and Classification to Evaluate Data Reduction via Johnson-Lindenstrauss Transform. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Advances in Information and Communication. FICC 2020. Advances in Intelligent Systems and Computing, vol 1130. Springer, Cham. https://doi.org/10.1007/978-3-030-39442-4_16
