A Probabilistic Model Based on Uncertainty for Data Clustering

  • Yaxin Yu
  • Xinhua Zhu
  • Miao Li
  • Guoren Wang
  • Dan Luo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7607)


Real-world data of all kinds has grown explosively in recent years. To manage this data, the dataspace has emerged as a universal platform that holds unstructured, semi-structured, and structured data together. How to cluster the data in a dataspace efficiently and accurately, so that users can manage and explore it, remains a difficult problem. Previous work has not sufficiently considered the uncertain relationship between terms and topics. Among the many techniques for handling this problem, probability theory provides an effective way to deal with the uncertainty in clustering. We therefore propose a novel probabilistic model based on topic terms, the Probabilistic Term Similarity Model (PTSM), to capture the uncertainty between terms and topics. The model considers not only the terms drawn from the various kinds of data but also the structure information of semi-structured and structured data. Each term is assigned a probability indicating how relevant it is to the topic. From these per-term probabilities, a probabilistic matrix is built and used to cluster the data. Extensive experimental results show that the clustering method based on this model performs well and outperforms several classical algorithms.
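
The abstract does not give the PTSM formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a stand-in weighting (row-normalised term counts, so each row of the matrix is a probability distribution over terms), treats field names of structured records as additional terms, and groups rows with a greedy cosine-similarity pass. The function names, the threshold, and the sample items are all hypothetical.

```python
from collections import Counter
from math import sqrt


def extract_terms(item):
    """Collect terms from content and, for structured items, from field names."""
    if isinstance(item, str):                 # unstructured text
        return item.lower().split()
    terms = []
    for field, value in item.items():         # structured or semi-structured record
        terms.append(field.lower())           # structure information used as a term
        terms.extend(str(value).lower().split())
    return terms


def probability_matrix(items):
    """Row-normalised term counts (assumed weighting, not the PTSM formula)."""
    vocab = sorted({t for item in items for t in extract_terms(item)})
    rows = []
    for item in items:
        counts = Counter(extract_terms(item))
        total = sum(counts.values()) or 1     # guard against empty items
        rows.append([counts[t] / total for t in vocab])
    return vocab, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(rows, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed row is similar enough."""
    clusters = []                             # each cluster is a list of row indices
    for i, row in enumerate(rows):
        for members in clusters:
            if cosine(row, rows[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical mixed items: two plain-text strings and two structured records.
items = [
    "probabilistic topic model for text clustering",
    {"title": "topic model clustering", "year": 2012},
    "stock prices fell sharply on monday",
    {"headline": "stock prices rally", "day": "monday"},
]
vocab, rows = probability_matrix(items)
labels = cluster(rows)
print(labels)
```

In this toy input the two topic-model items share three of their six terms, as do the two stock-market items, so each pair falls into its own cluster. A real system would replace the normalised counts with the model's term-topic probabilities and the greedy pass with a proper clustering algorithm.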


uncertainty, probability, topic, data clustering, dataspace





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Yaxin Yu (1)
  • Xinhua Zhu (2)
  • Miao Li (1)
  • Guoren Wang (1)
  • Dan Luo (2)
  1. College of Information Science and Engineering, Northeastern University, China
  2. QCIS, University of Technology, Sydney, Australia
