Clustering and Classification to Evaluate Data Reduction via Johnson-Lindenstrauss Transform

  • Conference paper

Published in: Advances in Information and Communication (FICC 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1130)

Abstract

A dataset is an \(n\times d\) matrix X, where n is the number of observations and d is the number of variables (dimensions). Johnson and Lindenstrauss showed that a transformation exists that maps X to an \(n\times k\) matrix, \(k \ll d\), such that certain geometric properties of the original matrix are preserved. The property we seek is that, for every pair of points in X, the distance between them is the same, within a given small acceptable level of distortion, as the distance between the corresponding pair of points in the reduced dataset. Does it follow that the semantic content of the data is preserved by the transformation? We answer in the affirmative: meaning in the original dataset was preserved in the reduced dataset, as confirmed by comparing clustering and classification results on the original and reduced datasets.
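For concreteness, the guarantee the abstract invokes is that for every pair of rows u, v of X, the projection f satisfies \((1-\varepsilon)\lVert u-v\rVert^2 \le \lVert f(u)-f(v)\rVert^2 \le (1+\varepsilon)\lVert u-v\rVert^2\). The sketch below is ours, not the authors' code: it uses scikit-learn's GaussianRandomProjection and johnson_lindenstrauss_min_dim on synthetic data with illustrative sizes, and a k-means comparison via the adjusted Rand index as one simple stand-in for the paper's fuller clustering-and-classification evaluation.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n, d, eps = 500, 10_000, 0.3        # illustrative sizes, not the paper's data
X = rng.standard_normal((n, d))     # stand-in for the original n x d matrix

# Smallest k that guarantees eps-distortion for n points (JL lower bound).
k = johnson_lindenstrauss_min_dim(n_samples=n, eps=eps)
X_red = GaussianRandomProjection(n_components=k, random_state=0).fit_transform(X)

# Geometric check: squared pairwise distances should stay within (1 +/- eps).
ratios = pdist(X_red) ** 2 / pdist(X) ** 2
print(f"k = {k}; squared-distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")

# Semantic check (proxy): cluster both versions and compare the partitions.
labels_orig = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
labels_red = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_red)
print("adjusted Rand index:", adjusted_rand_score(labels_orig, labels_red))
```

An adjusted Rand index near 1 means the original and reduced spaces induce essentially the same partition of the observations, which is the sense in which "meaning" is said to be preserved.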


References

  1. Wang, A., Gehan, E.A.: Gene selection for microarray data analysis using principal component analysis. Stat. Med. 24(13), 2069–2087 (2005)

  2. Fedoruk, J., Schmuland, B., Johnson, J., Heo, G.: Dimensionality reduction via the Johnson-Lindenstrauss lemma: theoretical and empirical bounds on embedding dimension. J. Supercomput. 74(8), 3933–3949 (2018)

  3. Cannings, T.I., Samworth, R.J.: Random-projection ensemble classification. J. Roy. Stat. Soc. B (Stat. Methodol.) 79(4), 959–1035 (2017)

  4. Dasgupta, S.: Experiments with random projection. CoRR, vol. abs/1301.3849 (2013)

  5. Fern, X., Brodley, C.: Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML) (2003)

  6. Klopotek, M.A.: Machine learning friendly set version of Johnson-Lindenstrauss lemma. CoRR, vol. abs/1703.01507 (2017)

  7. Hoogman, M., Bralten, J., Hibar, D.P., Mennes, M., Zwiers, M.P., Schweren, L.S.J., van Hulzen, K.J.E., Medland, S.E., Shumskaya, E., Jahanshad, N., de Zeeuw, P., Szekely, E., Sudre, G., Wolfers, T., Onnink, A.M.H., Dammers, J.T., Mostert, J.C., Vives-Gilabert, Y., Kohls, G., Oberwelland, E., Seitz, J., Schulte-Rüther, M., Ambrosino, S., Doyle, A.E., Høvik, M.F., Dramsdahl, M., Tamm, L., van Erp, T.G.M., Dale, A., Schork, A., Conzelmann, A., Zierhut, K., Baur, R., McCarthy, H., Yoncheva, Y.N., Cubillo, A., Chantiluke, K., Mehta, M.A., Paloyelis, Y., Hohmann, S., Baumeister, S., Bramati, I., Mattos, P., Tovar-Moll, F., Douglas, P., Banaschewski, T., Brandeis, D., Kuntsi, J., Asherson, P., Rubia, K., Kelly, C., Martino, A.D., Milham, M.P., Castellanos, F.X., Frodl, T., Zentis, M., Lesch, K.-P., Reif, A., Pauli, P., Jernigan, T.L., Haavik, J., Plessen, K.J., Lundervold, A.J., Hugdahl, K., Seidman, L.J., Biederman, J., Rommelse, N., Heslenfeld, D.J., Hartman, C.A., Hoekstra, P.J., Oosterlaan, J., von Polier, G., Konrad, K., Vilarroya, O., Ramos-Quiroga, J.A., Soliva, J.C., Durston, S., Buitelaar, J.K., Faraone, S.V., Shaw, P., Thompson, P.M., Franke, B.: Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults: a cross-sectional mega-analysis. The Lancet Psychiatry 4(4), 310–319 (2017)

  8. Sun, H., Chen, Y., Huang, Q., Lui, S., Huang, X., Shi, Y., Xu, X., Sweeney, J.A., Gong, Q.: Psychoradiologic utility of MR imaging for diagnosis of attention deficit hyperactivity disorder: a radiomics analysis. Radiology 287(2), 620–630 (2018). PMID: 29165048

  9. Li, T., Ma, S., Ogihara, M.: Wavelet Methods in Data Mining, pp. 553–571. Springer, Boston (2010)

  10. Agarwal, D., Agrawal, R., Khanna, R., Kota, N.: Estimating rates of rare events with multiple hierarchies through scalable log-linear models. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2010, pp. 213–222. ACM, New York (2010)

  11. Hand, D.J.: Data mining. Based in part on the article “Data mining” by David Hand, which appeared in the Encyclopedia of Environmetrics (2013)

  12. Xi, X., Ueno, K., Keogh, E., Lee, D.-J.: Converting non-parametric distance-based classification to anytime algorithms. Pattern Anal. Appl. 11(3), 321–336 (2008)

  13. Lalitha, Y.S., Latte, M.V.: Lossless and lossy compression of DICOM images with scalable ROI. IJCSNS Int. J. Comput. Sci. Netw. Secur. 10(7), 276–281 (2010)

  14. Du, K.-L., Swamy, M.N.S.: Recurrent Neural Networks, pp. 337–353. Springer, London (2014)

  15. Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J., Steinberg, D.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)

  16. Suthaharan, S.: Machine Learning Models and Algorithms for Big Data Classification, vol. 36. Springer, Boston (2016)

  17. Noble, W.S.: What is a support vector machine? Nat. Biotechnol. 24(12), 1565–1567 (2006)

  18. Stein, G., Chen, B., Wu, A.S., Hua, K.A.: Decision tree classifier for network intrusion detection with GA-based feature selection. In: Proceedings of the 43rd Annual Southeast Regional Conference - Volume 2, ACM-SE 43, pp. 136–141. ACM, New York (2005)

  19. Mukherjee, S., Sharma, N.: Intrusion detection using Naive Bayes classifier with feature reduction. Procedia Technol. 4, 119–128 (2012). 2nd International Conference on Computer, Communication, Control and Information Technology (C3IT-2012) on February 25–26, 2012

  20. Deshmukh, S., Rajeswari, K., Patil, R.: Analysis of simple K-means with multiple dimensions using WEKA. Int. J. Comput. Appl. 110(1), 14–17 (2015)

  21. Zarzour, H., Al-Sharif, Z., Al-Ayyoub, M., Jararweh, Y.: A new collaborative filtering recommendation algorithm based on dimensionality reduction and clustering techniques, pp. 102–106 (2018)

  22. Nilashi, M., Ibrahim, O., Ahmadi, H., Shahmoradi, L., Samad, S., Bagherifard, K.: A recommendation agent for health products recommendation using dimensionality reduction and prediction machine learning techniques. J. Soft Comput. Decis. Support Syst. 5, 7–15 (2018)

  23. Wang, S., Lu, J., Gu, X., Du, H., Yang, J.: Semi-supervised linear discriminant analysis for dimension reduction and classification. Pattern Recogn. 57, 179–189 (2016)

  24. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26(7), 189–206 (1984)

  25. Bellec, P., Chu, C., Chouinard-Decorte, F., Benhajali, Y., Margulies, D.S., Craddock, R.C.: The Neuro Bureau ADHD-200 preprocessed repository. NeuroImage 144, 275–286 (2017). Data Sharing Part II

  26. Matoušek, J.: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2), 142–156 (2008)

  27. Bengio, Y., Grandvalet, Y.: No unbiased estimator of the variance of K-fold cross-validation. J. Mach. Learn. Res. 5, 1089–1105 (2004)

  28. Markatou, M., Tian, H., Biswas, S., Hripcsak, G.: Analysis of variance of cross-validation estimators of the generalization error. J. Mach. Learn. Res. 6, 1127–1168 (2005)

Author information

Corresponding author

Correspondence to Julia Johnson.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ghalib, A., Jessup, T.D., Johnson, J., Monemian, S. (2020). Clustering and Classification to Evaluate Data Reduction via Johnson-Lindenstrauss Transform. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Advances in Information and Communication. FICC 2020. Advances in Intelligent Systems and Computing, vol 1130. Springer, Cham. https://doi.org/10.1007/978-3-030-39442-4_16
