Pairwise-Constrained Deep Document Clustering

Fard, Maziar Moradi; Thonet, Thibaut; Gaussier, Eric

doi:10.1007/978-3-030-44610-9_2

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 117)

Included in the following conference series:

International Conference on Reliability and Statistics in Transportation and Communication

Abstract

Standard clustering uses no side information, but users may want to provide some to influence the result. In document clustering, this information can take the form of pairwise constraints: a user labels pairs of documents as must-link or cannot-link, indicating respectively that the two documents belong to the same cluster or to different clusters. In this paper, we propose a novel deep document clustering framework that exploits pairwise constraints while learning document representations, in order to obtain better-tailored results. In the proposed framework, the data representations (obtained through an autoencoder) and the cluster representatives are learned jointly through backpropagation. The main contribution of this paper is a fully differentiable deep clustering framework that can use pairwise constraints. Experiments conducted on five public datasets show the gain in clustering performance that the resulting approach can yield.
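
To illustrate how must-link and cannot-link constraints can enter a differentiable objective, here is a minimal sketch. It is not the loss used in the paper, which the abstract does not specify. It assumes a contrastive-style penalty: must-link pairs are penalized by their squared distance in the embedding space, and cannot-link pairs are penalized only when they lie closer than a hypothetical `margin`. The function names, the margin value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_constraint_penalty(embeddings, must_link, cannot_link, margin=1.0):
    """Contrastive-style penalty over constrained pairs (illustrative only).

    embeddings  : list of vectors, e.g. autoencoder codes of the documents
    must_link   : list of (i, j) index pairs that should share a cluster
    cannot_link : list of (i, j) index pairs that should not share a cluster
    margin      : hypothetical minimum distance for cannot-link pairs
    """
    penalty = 0.0
    # Must-link: pull the two documents together.
    for i, j in must_link:
        penalty += squared_distance(embeddings[i], embeddings[j])
    # Cannot-link: push the two documents apart, up to the margin.
    for i, j in cannot_link:
        dist = math.sqrt(squared_distance(embeddings[i], embeddings[j]))
        penalty += max(0.0, margin - dist) ** 2
    return penalty


# Toy embeddings: documents 0 and 1 coincide, document 2 is far away.
codes = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]

# Constraints that agree with the geometry incur no penalty.
satisfied = pairwise_constraint_penalty(codes, [(0, 1)], [(0, 2)])

# Constraints that contradict the geometry are penalized.
violated = pairwise_constraint_penalty(codes, [(0, 2)], [(0, 1)])
```

In a joint framework of the kind the abstract describes, a term like this would presumably be added to the reconstruction and clustering losses, so that gradients from the constraints also shape the learned representations. How the terms are weighted is a design choice this sketch leaves open.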

Acknowledgement

This research was partly funded by the ANR project LOCUST and the AURA project AISUA.

Author information

Authors and Affiliations

Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000, Grenoble, France
Maziar Moradi Fard & Eric Gaussier
NAVER LABS Europe, 6 Chemin de Maupertuis, 38240, Meylan, France
Thibaut Thonet

Corresponding author

Correspondence to Maziar Moradi Fard.

Editor information

Editors and Affiliations

Telematics and Logistics, Transport and Telecommunication Institute, Riga, Latvia
Igor Kabashkin
Transport and Telecommunication Institute, Riga, Latvia
Irina Yatskiv
Vilnius Gediminas Technical University, Vilnius, Lithuania
Olegas Prentkovskis

About this paper

Cite this paper

Fard, M.M., Thonet, T., Gaussier, E. (2020). Pairwise-Constrained Deep Document Clustering. In: Kabashkin, I., Yatskiv, I., Prentkovskis, O. (eds) Reliability and Statistics in Transportation and Communication. RelStat 2019. Lecture Notes in Networks and Systems, vol 117. Springer, Cham. https://doi.org/10.1007/978-3-030-44610-9_2

DOI: https://doi.org/10.1007/978-3-030-44610-9_2
Published: 29 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-44609-3
Online ISBN: 978-3-030-44610-9
eBook Packages: Intelligent Technologies and Robotics (R0)