Document Analysis Based on Multi-view Intact Space Learning with Manifold Regularization

Zhan, Zengrong; Ma, Zhengming

doi:10.1007/978-3-319-67777-4_4

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10559)

Included in the following conference series:

International Conference on Intelligent Science and Big Data Engineering

Abstract

Document analysis plays an important role in everyday life, but traditional models such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) cannot handle data drawn from multiple sources. Multi-view learning methods such as Multi-view Intact Space Learning (MISL), which integrates the complementary information in multiple views to discover a latent intact representation of the data, have proven effective in image and video applications. However, MISL has not been applied to multi-lingual documents, and it does not consider the intrinsic geometrical and discriminating structure of document data. To address both gaps, we assume that documents that are close in the original representation should also be close in the intact space representation, and we add a manifold regularization term to MISL so that the data vary more smoothly in the latent space. We conduct classification experiments on 10505 Wiki documents that we crawled, and the results show that the proposed method outperforms TFIDF, LSI, LDA, and MISL.
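The closeness assumption in the abstract can be illustrated with a generic graph-Laplacian penalty. The abstract does not give the paper's exact regularizer, graph construction, or edge weights, so the following is only a minimal sketch under stated assumptions: a symmetric k-nearest-neighbour graph with binary weights built from the original document features, and the penalty tr(Zᵀ L Z) on a latent representation Z. The names `knn_graph_laplacian`, `manifold_penalty`, and the parameter `k` are hypothetical and do not come from the paper.

```python
import numpy as np


def knn_graph_laplacian(features, k=5):
    """Unnormalized Laplacian L = D - W of a symmetric kNN graph.

    Assumption: binary edge weights and Euclidean distance; the paper
    may use a different graph or weighting scheme.
    """
    n = features.shape[0]
    # Pairwise squared Euclidean distances between rows.
    sq = np.sum(features ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (features @ features.T)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Connect every point to its k nearest neighbours.
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)  # make the graph undirected

    return np.diag(W.sum(axis=1)) - W


def manifold_penalty(latent, laplacian):
    """tr(Z^T L Z), which equals 0.5 * sum_ij W_ij * ||z_i - z_j||^2.

    Rows of `latent` are the latent points. The value is small when
    points that are neighbours in the graph stay close in latent space.
    """
    return float(np.trace(latent.T @ laplacian @ latent))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic clusters stand in for document feature vectors.
    docs = np.vstack([
        rng.normal(0.0, 0.1, (10, 20)),
        rng.normal(1.0, 0.1, (10, 20)),
    ])
    lap = knn_graph_laplacian(docs, k=3)

    faithful = docs[:, :2]                 # keeps neighbours together
    scrambled = rng.permutation(faithful)  # shuffles rows, breaking neighbourhoods
    print("faithful :", manifold_penalty(faithful, lap))
    print("scrambled:", manifold_penalty(scrambled, lap))
```

In a regularized objective, a term of this kind would be added to the reconstruction loss with a trade-off weight, so that minimizing the total cost favours latent representations that respect the neighbourhood structure of the original features.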

Author information

Authors and Affiliations

School of Electronic and Information Engineering, Sun Yat-sen University, Guangzhou, China
Zengrong Zhan & Zhengming Ma
School of Information Engineering, Guangzhou Panyu Polytechnic, Guangzhou, China
Zengrong Zhan

Corresponding author

Correspondence to Zengrong Zhan.

Editor information

Editors and Affiliations

Dalian University of Technology, Dalian, China
Yi Sun
Dalian University of Technology, Dalian, China
Huchuan Lu
Dalian University of Technology, Dalian, China
Lihe Zhang
Nanjing University of Science and Technology, Nanjing, China
Jian Yang
Beijing Institute of Technology, Beijing, China
Hua Huang

About this paper

Cite this paper

Zhan, Z., Ma, Z. (2017). Document Analysis Based on Multi-view Intact Space Learning with Manifold Regularization. In: Sun, Y., Lu, H., Zhang, L., Yang, J., Huang, H. (eds) Intelligence Science and Big Data Engineering. IScIDE 2017. Lecture Notes in Computer Science, vol 10559. Springer, Cham. https://doi.org/10.1007/978-3-319-67777-4_4

DOI: https://doi.org/10.1007/978-3-319-67777-4_4
Published: 14 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67776-7
Online ISBN: 978-3-319-67777-4
eBook Packages: Computer Science, Computer Science (R0)
