Abstract
Document analysis plays an important role in everyday life, but traditional models such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are not designed to handle data drawn from multiple sources. Multi-view learning methods such as Multi-view Intact Space Learning (MISL), which integrates the complementary information in multiple views to discover a latent intact representation of the data, are effective for image and video applications. However, MISL has not been applied to multi-lingual documents and does not consider the intrinsic geometrical and discriminative structure of document data. To address these limitations, we assume that documents that are close in the original representation should also be close in the intact space representation, and we introduce a manifold regularization term into MISL so that the data vary more smoothly in the latent space. We conduct classification experiments on 10505 Wiki documents that we crawled, and the results show that the proposed method outperforms TF-IDF, LSI, LDA, and MISL.
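The abstract does not state the exact form of the regularization term, so the following is only a generic sketch of a graph-Laplacian smoothness penalty in the style of manifold regularization, not the authors' formulation. It assumes a binary k-nearest-neighbour graph built from the original document features; the function names (`knn_affinity`, `graph_laplacian`, `manifold_penalty`) and the toy data are hypothetical. The penalty is small when documents that are neighbours in the original representation also have nearby latent vectors.

```python
import numpy as np


def knn_affinity(features, k=2):
    """Symmetric 0/1 k-nearest-neighbour affinity matrix (rows are documents).

    Assumption: binary weights and squared Euclidean distance; a real system
    might use cosine similarity or heat-kernel weights instead.
    """
    n = features.shape[0]
    # Squared Euclidean distance between every pair of documents.
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.sum(diffs ** 2, axis=2)
    np.fill_diagonal(dist, np.inf)  # a document is not its own neighbour
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist[i])[:k]] = 1.0
    return np.maximum(W, W.T)  # make the graph undirected


def graph_laplacian(W):
    """Unnormalised graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W


def manifold_penalty(latent, L):
    """tr(X^T L X), with one latent vector per row of X.

    For symmetric W this equals 0.5 * sum_ij W_ij * ||x_i - x_j||^2.
    """
    return float(np.trace(latent.T @ L @ latent))


# Toy usage: 6 hypothetical documents, 4 original features, 3 latent dimensions.
rng = np.random.default_rng(0)
original = rng.normal(size=(6, 4))
latent = rng.normal(size=(6, 3))
L = graph_laplacian(knn_affinity(original, k=2))
penalty = manifold_penalty(latent, L)
```

In a full model, a term like `penalty` would be added, with a weighting coefficient, to the reconstruction objective that learns the latent vectors, so that the optimizer trades reconstruction accuracy against smoothness over the neighbourhood graph.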
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhan, Z., Ma, Z. (2017). Document Analysis Based on Multi-view Intact Space Learning with Manifold Regularization. In: Sun, Y., Lu, H., Zhang, L., Yang, J., Huang, H. (eds) Intelligence Science and Big Data Engineering. IScIDE 2017. Lecture Notes in Computer Science, vol 10559. Springer, Cham. https://doi.org/10.1007/978-3-319-67777-4_4
DOI: https://doi.org/10.1007/978-3-319-67777-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67776-7
Online ISBN: 978-3-319-67777-4
eBook Packages: Computer Science, Computer Science (R0)