Pairwise-Constrained Deep Document Clustering

  • Conference paper
Reliability and Statistics in Transportation and Communication (RelStat 2019)

Abstract

Standard clustering uses no side information, but users may want to supply additional information to steer the result. In document clustering, this information can take the form of pairwise constraints: must-link and cannot-link constraints, which indicate respectively that the two documents in a pair belong to the same cluster or to different clusters. In this paper, we propose a novel deep document clustering framework that exploits pairwise constraints while learning document representations, in order to obtain better-tailored results. In this framework, data representations (obtained through an autoencoder) and cluster representatives are learned jointly through backpropagation. The main contribution of this paper is a fully differentiable deep clustering framework that can use pairwise constraints. Experiments on five public datasets show the gains in clustering performance that the resulting approach can yield.
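
As an illustration only, the idea of combining a differentiable clustering objective with must-link and cannot-link constraints can be sketched as below. This is not the loss used in the paper, which the abstract does not specify. The softmax-based soft assignment, the dot-product agreement penalty, and the names `soft_assignments`, `constrained_clustering_loss`, `alpha` and `lam` are all assumptions made for this sketch, and the embeddings `z` stand in for the output of an autoencoder's encoder.

```python
import numpy as np


def soft_assignments(z, centers, alpha=10.0):
    """Differentiable stand-in for hard cluster assignment (an assumption).

    z:       (n, d) array of document embeddings
    centers: (k, d) array of cluster representatives
    Returns the (n, k) soft assignment matrix and the (n, k) squared distances.
    """
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -alpha * d2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True), d2


def constrained_clustering_loss(z, centers, must_link, cannot_link,
                                alpha=10.0, lam=1.0):
    """Soft k-means-style term plus a hypothetical pairwise penalty."""
    q, d2 = soft_assignments(z, centers, alpha)
    clustering = (q * d2).sum(axis=1).mean()

    penalty = 0.0
    for i, j in must_link:
        # Penalise must-link pairs whose soft assignments disagree.
        penalty += 1.0 - float(q[i] @ q[j])
    for i, j in cannot_link:
        # Penalise cannot-link pairs whose soft assignments agree.
        penalty += float(q[i] @ q[j])

    return float(clustering + lam * penalty)


# Toy embeddings: two tight groups, with one representative per group.
z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.05, 0.0], [5.05, 5.0]])

loss_consistent = constrained_clustering_loss(
    z, centers, must_link=[(0, 1)], cannot_link=[(0, 2)])
loss_violated = constrained_clustering_loss(
    z, centers, must_link=[(0, 2)], cannot_link=[(0, 1)])
```

Every term here is a smooth function of `z` and `centers`, so in an automatic-differentiation framework the same expression could be minimised jointly with an autoencoder reconstruction loss. On the toy data, constraints that contradict the current grouping give a larger loss than constraints that agree with it.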

Notes

  1. https://github.com/MaziarMF/deep-k-means

  2. http://qwone.com/~jason/20Newsgroups/

  3. http://www.daviddlewis.com/resources/testcollections/reuters21578/

Acknowledgement

This research was partly funded by the ANR project LOCUST and the AURA project AISUA.

Author information

Correspondence to Maziar Moradi Fard.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Fard, M.M., Thonet, T., Gaussier, E. (2020). Pairwise-Constrained Deep Document Clustering. In: Kabashkin, I., Yatskiv, I., Prentkovskis, O. (eds) Reliability and Statistics in Transportation and Communication. RelStat 2019. Lecture Notes in Networks and Systems, vol 117. Springer, Cham. https://doi.org/10.1007/978-3-030-44610-9_2
