Abstract
In essence, text is a string of characters. Characters form words, and words form phrases, sentences, paragraphs, and documents. To enable computers to process natural language efficiently, an ideal method of formally representing text is needed. First, the representation should faithfully reflect the content of the text, including its theme, domain, structure, and semantics. Second, it should be able to distinguish different texts.
Although text is unstructured data, it is organized by grammar; however, the meaning conveyed through that grammar cannot be used directly by statistical machine learning models. The text must therefore be transformed into a format that machine learning algorithms can process, for example, vectors. This process of formalizing text is called text representation.
This chapter introduces representative text representation methods for statistical machine learning.
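As a concrete illustration of turning text into vectors, the sketch below builds a classic bag-of-words vector space representation for two toy documents: each document becomes a count vector over a shared vocabulary. The documents, tokenizer, and function names are illustrative stand-ins, not material from the chapter.

```python
# A minimal sketch of one classic text representation: the bag-of-words
# vector space model. Each document becomes a vector over a shared
# vocabulary, with raw term counts as components.
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build a shared vocabulary: one dimension per distinct word.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_vector(doc: str) -> list[int]:
    """Map a document to its term-count vector over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in documents]
for doc, vec in zip(documents, vectors):
    print(doc, "->", vec)
```

Richer schemes (TF-IDF weighting, word embeddings, sentence and document encoders) replace the raw counts, but the goal is the same: a fixed-size numeric representation that a statistical model can consume.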
Notes
- 1. The word embeddings are usually randomly initialized and updated during training.
- 2. \(p_\theta(w \mid w_i) = \frac{\exp\{\boldsymbol{h} \cdot \boldsymbol{e}(w)\}}{\sum_{k=1}^{|V|} \exp\{\boldsymbol{h} \cdot \boldsymbol{e}(w_k)\}} = \frac{\exp\{\boldsymbol{h} \cdot \boldsymbol{e}(w)\}}{z(w)}\), where \(z(w)\) is usually set to the constant 1.0 in NCE (see the sketch after these notes).
- 3. 30% satisfy and 70% partially satisfy the semantic composition property.
- 4. Paragraph Vector with the sentence as Distributed Memory.
- 5. Distributed Bag-of-Words version of the Paragraph Vector.
- 6. This is also called a filter; it performs information filtering over a window-sized context (see the sketch after these notes).
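The following is a minimal NumPy sketch of the softmax in Note 2, using random stand-in values for the context vector h and the output embeddings e(w_k); it also shows the NCE shortcut of treating the normalizer z(w) as the constant 1.0. The sizes and values are placeholders, not the chapter's settings.

```python
# Softmax word probability from Note 2: p(w | w_i) = exp(h . e(w)) / sum_k exp(h . e(w_k)).
# E and h are random stand-ins for the output embeddings and the context vector.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
E = rng.normal(size=(vocab_size, dim))   # output embeddings e(w_k), one row per word
h = rng.normal(size=dim)                 # context/hidden vector h

scores = E @ h                                        # h . e(w_k) for every word
exact_probs = np.exp(scores) / np.exp(scores).sum()   # full softmax, sums to 1

# NCE-style shortcut: fix z(w) to 1.0, so the model outputs exp(h . e(w))
# directly and avoids summing over the whole vocabulary.
nce_scores = np.exp(scores)

print(exact_probs.sum())   # 1.0
print(nce_scores[:3])      # unnormalized values, not a distribution
```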
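Below is a minimal sketch of the window-sized filter mentioned in Note 6: a single convolutional filter slides over consecutive word embeddings and produces one activation per window. The embeddings, filter weights, and window size are random placeholders chosen for illustration.

```python
# One convolutional filter over a sequence of word embeddings (Note 6):
# each k-word window is summarized by a single activation, and max-over-time
# pooling keeps the strongest response.
import numpy as np

rng = np.random.default_rng(1)
sent_len, dim, window = 7, 5, 3
X = rng.normal(size=(sent_len, dim))   # one embedding per word in the sentence
W = rng.normal(size=(window, dim))     # a single filter covering a 3-word window
b = 0.0

# Slide the filter over every window of `window` consecutive words.
features = np.array([
    np.tanh(np.sum(W * X[i:i + window]) + b)
    for i in range(sent_len - window + 1)
])

pooled = features.max()   # max-over-time pooling: one value per filter
print(features.shape, pooled)
```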
Copyright information
© 2021 Tsinghua University Press
About this chapter
Cite this chapter
Zong, C., Xia, R., Zhang, J. (2021). Text Representation. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_3
DOI: https://doi.org/10.1007/978-981-16-0100-2_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0099-9
Online ISBN: 978-981-16-0100-2
eBook Packages: Computer Science, Computer Science (R0)