Refine the Corpora Based on Document Manifold

  • Chengwei Yao
  • Yilin Wang
  • Gencai Chen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8346)


It is challenging to track and make use of the overwhelming volume of news generated on the internet. One approach is to use topic models, such as pLSI, LDA, LPI, LapPLSI, and LTM, to discover news topics automatically. In many real applications, however, the topics inferred by these models are of limited use, because a proportion of the documents actually belong to no topic. In this paper, we propose a new technique to refine the document corpora before topic modeling. Inspired by manifold theory, we use Laplacian eigenmaps to discover the submanifold structure of the document space, identify the documents that are only loosely related to other documents, and exclude them from the corpora. Experiments show that topic models combined with our algorithm produce topics of significantly higher quality.
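
The abstract does not spell out the refinement procedure, so the following Python sketch is only one possible reading of it, not the authors' algorithm. It builds a k-nearest-neighbour cosine-similarity graph over the documents, computes a Laplacian eigenmaps embedding, and drops the documents that lie farthest from their neighbours in that embedding. The function name `refine_corpus`, the parameters `k`, `n_components` and `drop_fraction`, and the isolation score itself are all assumptions made for illustration.

```python
import numpy as np


def refine_corpus(X, k=5, n_components=2, drop_fraction=0.1):
    """Hypothetical helper: return (kept_indices, isolation_scores).

    X is an (n_docs, n_terms) non-negative term-weight matrix (e.g. TF-IDF).
    """
    n = X.shape[0]

    # Cosine similarity between every pair of documents.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    S = np.clip(Xn @ Xn.T, 0.0, None)
    np.fill_diagonal(S, 0.0)

    # Symmetric k-nearest-neighbour graph weighted by cosine similarity.
    nbrs = np.argsort(-S, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    W = np.zeros((n, n))
    W[rows, nbrs.ravel()] = S[rows, nbrs.ravel()]
    W = np.maximum(W, W.T)

    # Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Laplacian eigenmaps: eigenvectors for the smallest eigenvalues,
    # skipping the first (trivial) one, rescaled to solve L y = lambda D y.
    _, vecs = np.linalg.eigh(L)
    Y = vecs[:, 1:n_components + 1] * d_inv_sqrt[:, None]

    # Isolation score (an assumption): mean distance to the k nearest
    # neighbours in the embedded space.
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    score = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)

    # Drop the most isolated documents and keep the rest.
    n_drop = int(round(drop_fraction * n))
    keep = np.sort(np.argsort(score)[:n - n_drop])
    return keep, score
```

Under these assumptions, the rows of `X` selected by the returned indices would be passed on to the topic model of choice. The dense pairwise computations are for clarity only; a corpus of realistic size would need sparse matrices and a sparse eigensolver.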


topic model, manifold, graph Laplacian, document clustering





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Chengwei Yao (1)
  • Yilin Wang (2)
  • Gencai Chen (1)
  1. College of Computer Science and Technology, Zhejiang University, China
  2. University of Nottingham, Nottingham, UK
