Abstract
While machines can discover semantic relationships in natural written language, they depend on human intervention to provide the necessary parameters. Precise and satisfactory document representation is key to enabling computer models to access the underlying meaning of written language. Automated text classification, where the objective is to assign a set of categories to documents, is a classic problem. Studies in text classification range from devising sophisticated approaches to document representation to developing the best possible classifiers. A common representation approach in text classification is bag-of-words, in which each document is represented by a vector of the words it contains. Although bag-of-words vectors are very simple to generate, the main challenge with such a representation is that the resulting vectors are very large and sparse. This sparsity, together with the need to capture the semantics of text documents, is the major challenge in text categorization. Deep learning-based approaches address this by representing words and documents as fixed-length vectors in a continuous space. This chapter reviews available document representation methods, including five deep learning-based approaches: Word2Vec, Doc2Vec, GloVe, LSTM, and CNN.
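As a minimal illustration of the bag-of-words representation described above, the following sketch (assuming simple whitespace tokenization; the function name is for illustration only) builds a vocabulary-sized count vector for each document. The vector length equals the vocabulary size, so for any realistic corpus most entries are zero, which is the sparsity problem the abstract notes.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a shared vocabulary from all documents, then represent each
    document as a vector of word counts over that vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)  # one slot per vocabulary word -> mostly zeros
        for word, count in Counter(doc.lower().split()).items():
            vec[index[word]] = count
        vectors.append(vec)
    return vocab, vectors

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab, vectors = bag_of_words(docs)
# Each vector has len(vocab) entries; dense embedding methods such as
# Word2Vec or Doc2Vec instead map each document to a short fixed-length
# vector (e.g. 100-300 dimensions) regardless of vocabulary size.
```

Note that the vector dimensionality grows with the vocabulary, whereas the deep learning-based methods reviewed in this chapter keep the representation at a fixed, corpus-independent length.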
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this chapter
Kamkarhaghighi, M., Gultepe, E., Makrehchi, M. (2019). Deep Learning for Document Representation. In: Balas, V., Roy, S., Sharma, D., Samui, P. (eds) Handbook of Deep Learning Applications. Smart Innovation, Systems and Technologies, vol 136. Springer, Cham. https://doi.org/10.1007/978-3-030-11479-4_5
Print ISBN: 978-3-030-11478-7
Online ISBN: 978-3-030-11479-4
eBook Packages: Intelligent Technologies and Robotics