A document representation framework with interpretable features using pre-trained word embeddings

  • Narendra Babu Unnam
  • P. Krishna Reddy
Regular Paper


We propose an improved framework for document representation using word embeddings. Existing models represent a document as a single position vector in the same space as the word embeddings. As a result, they cannot capture the multiple aspects of a document or its broader context, and their low representational power leads to poor performance at document classification. Furthermore, the document vectors produced by such methods have uninterpretable features. In this paper, we propose an improved document representation framework that captures multiple aspects of a document with interpretable features. In this framework, a document is represented in a different feature space in which each dimension corresponds to a potential feature word with relatively high discriminating power, and the document is modeled as the distances between these feature words and the document. We propose two criteria for selecting potential feature words and a distance function to measure the distance between a feature word and a document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The approach is simple and represents a document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and opens the scope for new approaches that improve the performance of document representation and its applications.
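The core idea above can be sketched in a few lines: pick a set of discriminative feature words, then score each document along one dimension per feature word, where the score is the distance between that feature word and the document in the embedding space. The following is a minimal illustrative sketch only; the paper defines its own feature-word selection criteria and distance function, so the minimum cosine distance used here and the toy embeddings are assumptions, not the authors' method.

```python
import numpy as np

def document_vector(doc_words, feature_words, embeddings):
    """Represent a document as distances from feature words to the document.

    Each output dimension corresponds to one feature word; the value is taken
    here as the minimum cosine distance from that feature word to any word in
    the document (an illustrative stand-in for the paper's distance function).
    """
    def cosine_distance(u, v):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Keep only words for which we have pre-trained embeddings.
    doc_vecs = [embeddings[w] for w in doc_words if w in embeddings]
    return np.array([
        min(cosine_distance(embeddings[f], d) for d in doc_vecs)
        for f in feature_words
    ])

# Toy 2-D embeddings; real use would load pre-trained vectors (word2vec, GloVe, ...).
emb = {
    "football": np.array([1.0, 0.1]),
    "goal":     np.array([0.9, 0.2]),
    "sports":   np.array([1.0, 0.0]),
    "politics": np.array([0.0, 1.0]),
}

doc = ["football", "goal"]
features = ["sports", "politics"]
vec = document_vector(doc, features, emb)
# Each dimension is interpretable: a small value on the "sports" dimension
# means the document sits close to the feature word "sports".
```

Note how interpretability arises directly from the construction: unlike a doc2vec-style latent vector, each coordinate of `vec` is readable as "how far is this document from this specific word".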


Text mining · Feature engineering · Document representation · Document classification · Word embeddings




Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Kohli Centre on Intelligent Systems, IIIT Hyderabad, Hyderabad, India
