Document Clustering Using Local and Universal Knowledge

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10934)

Abstract

In most real-world text clustering problems, the distribution of the samples in the repository differs from the true distribution of the concepts underlying the clusters, which reduces the accuracy of document clustering methods. Let U(f) and L(f) be the distribution functions of the features extracted from universal knowledge and from local (repository) knowledge, respectively. Ideally U(f) and L(f) would be identical; in practice they are not equal, and they can differ substantially. In this paper, we show how the difference between these two distribution functions can decrease the accuracy of document clustering algorithms. To address this issue, we propose two methods that combine information from local and universal knowledge. The first method introduces a special transform T that combines, for each pair of documents, the similarity derived from local knowledge with the similarity derived from universal knowledge. The second method combines the two sources per document by concatenating the document's feature vector derived from local knowledge with its feature vector derived from universal knowledge. The impact of the proposed methods on clustering is tested on two well-known datasets, Reuters and 20-Newsgroups. Experimental results show that when the feature vectors are generated from local knowledge alone or from universal knowledge alone, some documents may be assigned to the wrong cluster. The proposed methods significantly improve document clustering performance, demonstrating the benefit of enhancing local knowledge with universal knowledge.
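The two combination strategies described in the abstract can be illustrated with a minimal sketch. The abstract does not define the transform T, so the weighted average used below is only a hypothetical stand-in, not the authors' transform. The function names, the weight `alpha`, the use of cosine similarity, the row normalization before concatenation, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(X):
    """Scale each row to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    Xn = l2_normalize(X)
    return Xn @ Xn.T


def combine_similarities(S_local, S_universal, alpha=0.5):
    """Hypothetical stand-in for the transform T: a convex combination
    of the local and universal pairwise similarities (assumption)."""
    return alpha * S_local + (1.0 - alpha) * S_universal


def concatenate_features(X_local, X_universal):
    """Second strategy: join each document's local and universal vectors.
    Normalizing each part first is an assumption, made so that neither
    source dominates purely because of its scale."""
    return np.hstack([l2_normalize(X_local), l2_normalize(X_universal)])


# Toy data: 3 documents, 3 local features and 2 universal features.
X_local = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0]])
X_universal = np.array([[0.5, 0.5],
                        [1.0, 0.0],
                        [0.4, 0.6]])

S_combined = combine_similarities(
    cosine_similarity_matrix(X_local),
    cosine_similarity_matrix(X_universal),
)
X_concat = concatenate_features(X_local, X_universal)
```

Either result could then be passed to a standard clustering algorithm: `S_combined` to one that accepts a precomputed similarity matrix, and `X_concat` to one that works on feature vectors.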



Corresponding author

Correspondence to Kazem Qazanfari.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

Cite this paper

Qazanfari, K., Youssef, A. (2018). Document Clustering Using Local and Universal Knowledge. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2018. Lecture Notes in Computer Science, vol 10934. Springer, Cham. https://doi.org/10.1007/978-3-319-96136-1_14

  • DOI: https://doi.org/10.1007/978-3-319-96136-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96135-4

  • Online ISBN: 978-3-319-96136-1

  • eBook Packages: Computer Science (R0)
