A Regression-Based SVD Parallelization Using Overlapping Folds for Textual Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10004)

Abstract

One of the most difficult issues in text mining is the high dimensionality caused by a large number of features (keywords). Multivariate analyses such as PCA and SVD (called LSI in information retrieval) have been developed to address this curse of dimensionality, but they are computationally costly. This paper investigates a regression-based reconstruction method that enables parallelization of PCA/SVD by decomposing a document-term matrix into a set of sub-matrices that take overlapping terms into account and then re-assembling them using a regression technique. To evaluate the method, we use two text datasets from the UCI Machine Learning Repository, “Bag of Words” and “Reuter 50 50”. Cosine similarity is used to measure the closeness between two documents, and accuracy is measured as rank order mismatch. The results show that matrix decomposition and re-assembly can preserve the quality of the relations/representation.
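The evaluation idea in the abstract (cosine similarity for document closeness, rank order mismatch for accuracy) can be illustrated with a small sketch. The abstract does not give the exact formula for rank order mismatch, so the definition below, the fraction of document pairs whose relative order differs between two similarity rankings, is an assumption made for illustration only. The function names and the toy vectors are hypothetical and do not come from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_scores(docs, query_index):
    """Similarity of every other document to docs[query_index], keyed by document index."""
    query = docs[query_index]
    return {
        i: cosine_similarity(query, doc)
        for i, doc in enumerate(docs)
        if i != query_index
    }


def rank_order_mismatch(scores_a, scores_b):
    """Fraction of document pairs ordered differently by the two score maps.

    ASSUMED definition for illustration; the paper's exact formula may differ.
    """
    pairs = list(combinations(sorted(scores_a), 2))
    if not pairs:
        return 0.0
    mismatches = 0
    for i, j in pairs:
        diff_a = scores_a[i] - scores_a[j]
        diff_b = scores_b[i] - scores_b[j]
        if diff_a * diff_b < 0:  # the pair is ranked in opposite order
            mismatches += 1
    return mismatches / len(pairs)


# Toy document vectors (hypothetical): a reference representation and an
# approximate one, standing in for "full" versus "re-assembled" representations.
reference_docs = [[2, 1, 0], [1, 1, 0], [0, 1, 2], [0, 0, 3]]
approx_docs = [[2, 1, 0], [1, 1, 0], [0, 0.2, 2], [0.5, 0.5, 3]]

reference_scores = similarity_scores(reference_docs, query_index=0)
approx_scores = similarity_scores(approx_docs, query_index=0)
mismatch = rank_order_mismatch(reference_scores, approx_scores)
print(f"rank order mismatch: {mismatch:.3f}")
```

Under this assumed definition, a value of 0 means the approximate representation ranks documents by similarity to the query exactly as the reference does, and larger values mean more reordered pairs.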

Acknowledgement

This work is financially funded and supported by Sirindhorn International Institute of Technology, Thammasat University and Burapha University.

Author information

Corresponding author

Correspondence to Uraiwan Buatoom.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Buatoom, U., Theeramunkong, T., Kongprawechnon, W. (2017). A Regression-Based SVD Parallelization Using Overlapping Folds for Textual Data. In: Numao, M., Theeramunkong, T., Supnithi, T., Ketcham, M., Hnoohom, N., Pramkeaw, P. (eds) Trends in Artificial Intelligence: PRICAI 2016 Workshops. PRICAI 2016. Lecture Notes in Computer Science, vol 10004. Springer, Cham. https://doi.org/10.1007/978-3-319-60675-0_3

  • DOI: https://doi.org/10.1007/978-3-319-60675-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-60674-3

  • Online ISBN: 978-3-319-60675-0

  • eBook Packages: Computer Science (R0)
