
Deep Learning for Document Representation

A chapter in Handbook of Deep Learning Applications, part of the book series Smart Innovation, Systems and Technologies (SIST, volume 136).

Abstract

While machines can discover semantic relationships in natural written language, they depend on humans to supply the necessary parameters. Precise and satisfactory document representation is key to helping computational models access the underlying meaning of written language. Automated text classification, where the objective is to assign a set of categories to documents, is a classic problem. Studies in text classification vary widely, from designing sophisticated document representations to building the best possible classifiers. A common representation in text classification is bag-of-words, where each document is represented by a vector of the words that appear in it. Although bag-of-words is very simple to generate, the main challenge with such a representation is that the resulting vector is very large and sparse. This sparsity, together with the need for semantic understanding of text documents, is the major challenge in text categorization. Deep learning-based approaches instead represent words and documents as fixed-length vectors in a continuous space. This chapter reviews available document representation methods, including five deep learning-based approaches: Word2Vec, Doc2Vec, GloVe, LSTM, and CNN.
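To make the contrast concrete, the following is a minimal sketch (not from the chapter) of the representations the abstract describes, using scikit-learn for bag-of-words and gensim for Word2Vec and Doc2Vec; the toy corpus and all parameter values are illustrative assumptions.

# Minimal sketch: sparse bag-of-words versus dense learned embeddings.
# Assumes scikit-learn and gensim 4.x are installed; the corpus is a toy example.
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "deep learning builds dense document representations",
    "bag of words document vectors are large and sparse",
]

# Bag-of-words: one dimension per vocabulary word, so the vector
# grows with the vocabulary and is mostly zeros.
bow = CountVectorizer()
X = bow.fit_transform(docs)
print(X.shape)         # (2, vocabulary_size)
print(X.toarray()[0])  # mostly zeros for any realistic corpus

# Word2Vec: each word maps to a fixed-length dense vector (50-d here),
# independent of vocabulary size.
tokens = [d.split() for d in docs]
w2v = Word2Vec(sentences=tokens, vector_size=50, min_count=1, window=2)
print(w2v.wv["document"].shape)  # (50,)

# Doc2Vec: the same idea extended to whole documents.
tagged = [TaggedDocument(words=t, tags=[str(i)]) for i, t in enumerate(tokens)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
print(d2v.dv["0"].shape)  # (50,) -- one fixed-length vector per document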



Author information


Correspondence to Mehran Kamkarhaghighi.


Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Kamkarhaghighi, M., Gultepe, E., Makrehchi, M. (2019). Deep Learning for Document Representation. In: Balas, V., Roy, S., Sharma, D., Samui, P. (eds) Handbook of Deep Learning Applications. Smart Innovation, Systems and Technologies, vol 136. Springer, Cham. https://doi.org/10.1007/978-3-030-11479-4_5
