Document Clustering Using Local and Universal Knowledge

  • Kazem Qazanfari
  • Abdou Youssef
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10934)


In real-world text clustering problems, the distribution of the repository samples rarely matches the true distribution of the clusters’ concepts, which reduces the accuracy of document clustering methods. Let U(f) and L(f) be the distribution functions of the extracted features based on universal knowledge and local (repository) knowledge, respectively. Ideally, U(f) and L(f) would be identical; in practice, they are not equal and may differ substantially. In this paper, we show how the difference between these two distribution functions can decrease the accuracy of document clustering algorithms. To address this issue, we propose two methods that combine information from the local and the universal knowledge efficiently. In the first method, a special transform T is introduced to combine the similarities of each pair of documents derived from the local and the universal knowledge. In the second method, the local and the universal knowledge are combined, per document, by concatenating each document’s feature vector derived from the local knowledge with its feature vector derived from the universal knowledge. The impact of the proposed methods on clustering is tested on two well-known datasets, Reuters and 20-Newsgroups. Experimental results show that when either local or universal knowledge alone is used to generate the feature vectors, some documents can be assigned to the wrong cluster. The proposed methods significantly improve document clustering performance, demonstrating the benefit of enhancing local knowledge with universal knowledge in an efficient way.
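The two combination strategies described above can be illustrated with a minimal Python sketch. The abstract does not define the transform T, so `combine_similarities` below uses a convex combination with a hypothetical weight `alpha` purely as a stand-in; it should not be read as the authors' transform. The function names, the toy vectors, and the choice of cosine similarity are likewise assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all document vectors."""
    return [[cosine(a, b) for b in vectors] for a in vectors]


def combine_similarities(sim_local, sim_universal, alpha=0.5):
    """Method 1 sketch: merge the two pairwise similarity matrices.

    ASSUMPTION: a convex combination stands in for the transform T,
    which the abstract does not specify. `alpha` is a hypothetical weight.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * l + (1.0 - alpha) * u for l, u in zip(row_l, row_u)]
        for row_l, row_u in zip(sim_local, sim_universal)
    ]


def concatenate_features(local_vecs, universal_vecs):
    """Method 2 sketch: per document, join the local and universal vectors.

    ASSUMPTION: each view is L2-normalized first so neither dominates
    by scale; the abstract only states that the vectors are concatenated.
    """
    return [
        l2_normalize(l) + l2_normalize(u)
        for l, u in zip(local_vecs, universal_vecs)
    ]


# Toy data (hypothetical): three documents, two views of each.
local_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
universal_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

sim_local = similarity_matrix(local_vecs)
sim_universal = similarity_matrix(universal_vecs)
sim_combined = combine_similarities(sim_local, sim_universal, alpha=0.5)
joint_vecs = concatenate_features(local_vecs, universal_vecs)
```

In this toy example, documents 0 and 1 share no local features (local similarity 0) but have identical universal vectors (universal similarity 1), so their combined similarity is 0.5. Either `sim_combined` or `joint_vecs` could then be passed to any standard clustering algorithm.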


Keywords: Text mining, Document clustering, Transfer learning



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. The George Washington University, Washington, DC, USA
