
Deep Learning for Document Representation

A chapter in Handbook of Deep Learning Applications, part of the book series Smart Innovation, Systems and Technologies (SIST, volume 136).

Abstract

While machines can discover semantic relationships in natural written language, they depend on humans to supply the necessary parameters. Precise and satisfactory document representation is key to helping computational models access the underlying meaning of written language. Automated text classification, where the objective is to assign a set of categories to documents, is a classic problem. Studies in text classification vary widely, from designing sophisticated document representations to building the best possible classifiers. A common representation in text classification is bag-of-words, where each document is represented by a vector of the words that appear in it. Although bag-of-words is very simple to generate, the main challenge with such a representation is that the resulting vector is very large and sparse. This sparsity, together with the need for semantic understanding of text documents, is the major challenge in text categorization. Deep learning-based approaches instead represent words and documents as fixed-length vectors in a continuous space. This chapter reviews available document representation methods, including five deep learning-based approaches: Word2Vec, Doc2Vec, GloVe, LSTM, and CNN.
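To make the contrast concrete, the following is a minimal sketch (not from the chapter) of the representations the abstract describes, using scikit-learn for bag-of-words and gensim for Word2Vec and Doc2Vec; the toy corpus and all parameter values are illustrative assumptions.

# Minimal sketch: sparse bag-of-words versus dense learned embeddings.
# Assumes scikit-learn and gensim 4.x are installed; the corpus is a toy example.
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "deep learning builds dense document representations",
    "bag of words document vectors are large and sparse",
]

# Bag-of-words: one dimension per vocabulary word, so the vector
# grows with the vocabulary and is mostly zeros.
bow = CountVectorizer()
X = bow.fit_transform(docs)
print(X.shape)         # (2, vocabulary_size)
print(X.toarray()[0])  # mostly zeros for any realistic corpus

# Word2Vec: each word maps to a fixed-length dense vector (50-d here),
# independent of vocabulary size.
tokens = [d.split() for d in docs]
w2v = Word2Vec(sentences=tokens, vector_size=50, min_count=1, window=2)
print(w2v.wv["document"].shape)  # (50,)

# Doc2Vec: the same idea extended to whole documents.
tagged = [TaggedDocument(words=t, tags=[str(i)]) for i, t in enumerate(tokens)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
print(d2v.dv["0"].shape)  # (50,) -- one fixed-length vector per document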



Author information


Correspondence to Mehran Kamkarhaghighi.


Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Kamkarhaghighi, M., Gultepe, E., Makrehchi, M. (2019). Deep Learning for Document Representation. In: Balas, V., Roy, S., Sharma, D., Samui, P. (eds) Handbook of Deep Learning Applications. Smart Innovation, Systems and Technologies, vol 136. Springer, Cham. https://doi.org/10.1007/978-3-030-11479-4_5
