
Text Representation

Chapter in: Text Data Mining
Abstract

At its core, text is a string of characters. Characters form words, and words in turn form phrases, sentences, paragraphs, and documents. To enable computers to process natural language efficiently, we need a suitable method for representing text formally. First, this representation should faithfully reflect the content of the text, including its theme, domain, structure, and semantics. Second, it should be able to distinguish between different texts.

Although text is a type of unstructured data, it is organized by grammar. The meaning conveyed through that grammar, however, cannot be used directly by statistical machine learning models. The text must therefore be transformed into a format that machine learning algorithms can process, for example, vectors. This process of formalizing text is called text representation.

This chapter introduces representative text representation methods for statistical machine learning.
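
To make the idea of formalizing text as vectors concrete, here is a minimal sketch in Python (the function names and toy documents are ours, not the chapter's): it maps each document to a bag-of-words count vector over a shared vocabulary, in the spirit of the vector space model.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the sorted set of word types across all documents."""
    return sorted({token for doc in documents for token in doc.split()})

def bag_of_words_vector(document, vocab):
    """Represent one document as a fixed-length vector of word counts."""
    counts = Counter(document.split())
    return [counts.get(word, 0) for word in vocab]

docs = ["text mining finds patterns in text",
        "machine learning models process vectors"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words_vector(d, vocab) for d in docs]
# Each document is now a fixed-length vector that a statistical model can consume.
```

Documents that use different words map to different vectors, which is exactly the distinguishing property asked of a text representation above.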


Notes

  1. The word embeddings are usually randomly initialized and updated during training.

  2. \(p_\theta(w \mid w_i) = \frac{\exp\{\boldsymbol{h} \cdot \boldsymbol{e}(w)\}}{\sum_{k=1}^{|V|} \exp\{\boldsymbol{h} \cdot \boldsymbol{e}(w_k)\}} = \frac{\exp\{\boldsymbol{h} \cdot \boldsymbol{e}(w)\}}{z(w_i)}\), where the normalization term \(z(w_i)\) is usually set to the constant 1.0 in NCE; a numerical sketch of this computation follows these notes.

  3. 30% satisfy and 70% partially satisfy the semantic composition property.

  4. The Paragraph Vector model with the sentence as Distributed Memory.

  5. The Distributed Bag-of-Words version of the Paragraph Vector; a brief sketch of both Paragraph Vector variants also follows these notes.

  6. This is also called a filter; it performs information filtering over a window-sized context.
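
As a numerical illustration of notes 1 and 2 (a minimal sketch with assumed names: E for the output embedding table, h for the context vector; none of this is taken from the chapter), the snippet below computes the exact softmax probability \(p_\theta(w \mid w_i)\) and the NCE shortcut that fixes the normalizer to 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100

# Note 1: embeddings are randomly initialized and later updated during training.
E = rng.normal(scale=0.1, size=(vocab_size, dim))   # output word embeddings e(w)
h = rng.normal(scale=0.1, size=dim)                  # hidden/context vector for w_i

def p_softmax(w, h, E):
    """Exact p_theta(w | w_i): exp{h . e(w)} normalized over the whole vocabulary."""
    scores = E @ h                        # h . e(w_k) for every word in V
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w]

def score_nce(w, h, E):
    """NCE shortcut from note 2: treat the normalizer z as the constant 1.0,
    so the 'probability' is just the unnormalized term exp{h . e(w)}."""
    return np.exp(h @ E[w])

print(p_softmax(42, h, E), score_nce(42, h, E))
```

The exact probability requires a sum over all |V| words for every prediction; fixing \(z\) to 1.0 is what lets NCE avoid that cost and train against noise samples instead.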
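
For notes 4 and 5, one practical way to obtain such paragraph (here, sentence) vectors is the Doc2Vec implementation in gensim, which provides both Paragraph Vector variants. This is an assumed external tool shown only for illustration, not something the chapter prescribes.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["text", "mining", "finds", "patterns"], tags=["d0"]),
        TaggedDocument(words=["models", "process", "vectors"], tags=["d1"])]

# dm=1 trains the Distributed Memory variant (note 4);
# dm=0 trains the Distributed Bag-of-Words variant (note 5).
pv_dm   = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=20, dm=1)
pv_dbow = Doc2Vec(docs, vector_size=50, min_count=1, epochs=20, dm=0)

print(pv_dm.dv["d0"][:5])   # learned vector for the first sentence
```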



Copyright information

© 2021 Tsinghua University Press

About this chapter


Cite this chapter

Zong, C., Xia, R., Zhang, J. (2021). Text Representation. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_3


  • DOI: https://doi.org/10.1007/978-981-16-0100-2_3


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-0099-9

  • Online ISBN: 978-981-16-0100-2

  • eBook Packages: Computer Science (R0)
