Abstract
In Chap. 3, we introduced various methods for learning distributed representations of words, phrases, sentences, and documents. These distributed representations are applied in downstream text data mining tasks such as named entity recognition, text/sentiment classification, relation extraction, and text summarization. Although these representation learning methods improve performance on downstream tasks, their potential is limited by several critical issues. First, the neural network model employed cannot be made sufficiently deep, because the lack of large-scale training data makes it difficult to optimize a massive number of model parameters. Second, the distributed representations are usually fixed (static) after learning and therefore cannot handle polysemy. For example, the word star can mean a famous person or a luminous celestial body, and a static representation cannot distinguish between the two senses in different contexts. Third, different tasks usually employ different models for representation learning, so knowledge sharing across tasks is not fully exploited.
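To make the polysemy point concrete, the following is a minimal sketch, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint (neither is part of this chapter), that extracts contextual vectors for the word star in two different sentences; a static embedding table would return the same vector in both cases, whereas a contextual encoder produces clearly different ones.

```python
# Sketch only: compare contextual vectors of "star" in two contexts.
# Assumes the Hugging Face `transformers` package and the public
# bert-base-uncased checkpoint; these are illustrative choices, not the chapter's.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The movie star signed autographs for her fans.",   # "star" = famous person
    "The brightest star in the night sky is Sirius.",   # "star" = celestial body
]

star_id = tokenizer.convert_tokens_to_ids("star")
vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]    # (seq_len, hidden_size)
    position = (inputs["input_ids"][0] == star_id).nonzero()[0, 0]
    vectors.append(hidden[position])                     # contextual vector of "star"

# A similarity well below 1.0 shows the two occurrences are represented differently.
print(torch.cosine_similarity(vectors[0], vectors[1], dim=0).item())
```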
Recently, a new paradigm, called pretraining and fine-tuning, has been proposed and has become widely used in natural language processing. In the task-independent pretraining step, a large neural network is designed, and its parameters are pretrained on huge amounts of unannotated text (easily available on the Internet) by optimizing a language modeling or other self-supervised objective. Because a very large text corpus is used for training, the pretrained model is robust and encodes language regularities in its parameters. In the subsequent fine-tuning step, small-scale task-specific annotated data are used to fine-tune the pretrained model so that it performs well on specific downstream tasks. By effectively exploiting both massive unannotated text and small amounts of task-dependent labeled data, this paradigm achieves state-of-the-art performance on many text-processing tasks. This chapter briefly introduces some well-known methods, including ELMo, GPT, BERT, XLNet, and UniLM.
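As a rough illustration of the paradigm (not the chapter's own code), the sketch below, assuming the Hugging Face transformers package, a bert-base-uncased checkpoint, and a toy two-sentence labeled dataset, loads a pretrained encoder and fine-tunes all of its parameters for binary sentence classification with a small learning rate.

```python
# Sketch only: the pretrain-then-fine-tune paradigm for text classification.
# The library, checkpoint, and toy data are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # pretrained encoder + a fresh classification head
)

# Toy annotated data standing in for a small task-specific training set.
train_texts = ["what a wonderful film", "a dull and lifeless plot"]
train_labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small LR for fine-tuning
model.train()
for epoch in range(3):                                      # a few epochs usually suffice
    batch = tokenizer(train_texts, padding=True, return_tensors="pt")
    loss = model(**batch, labels=train_labels).loss         # cross-entropy on task labels
    loss.backward()                                         # gradients flow into all pretrained parameters
    optimizer.step()
    optimizer.zero_grad()
```

The key design choice the paradigm rests on is that only the small classification head is initialized from scratch; every other parameter starts from the values learned during large-scale self-supervised pretraining.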
Notes
- 1. Codes and models can be found at https://allennlp.org/elmo.
- 2.
- 3. Model and codes can be found at https://github.com/tensorflow/tensor2tensor.
- 4. The self-attention sublayer computes the i-th representation of the upper layer by letting the i-th hidden state of the current layer attend to all positions, including itself; the resulting attention weights are then used to linearly combine all representations of the current layer (see the sketch after these notes). It will be formally defined later in the chapter.
- 5. The codes and models are available at https://github.com/openai/gpt-2.
- 6. The models and examples are available at https://github.com/openai/gpt-3.
- 7. Codes and pretrained models are available at https://github.com/google-research/bert.
- 8. The codes and pretrained models are available at https://github.com/zihangdai/xlnet.
- 9. They have named the reimplementation RoBERTa; details are available at https://github.com/pytorch/fairseq.
- 10. Codes and models can be found at https://github.com/microsoft/unilm.
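The following is a minimal NumPy sketch of the self-attention computation described in note 4. The projection matrices, dimensions, and random inputs are illustrative assumptions, not the chapter's exact formulation.

```python
# Sketch only: single-head self-attention as described in note 4.
# Each position attends to every position (including itself); the softmax-normalized
# weights then linearly combine the current layer's representations.
import numpy as np

def self_attention(H, Wq, Wk, Wv):
    """H: (seq_len, d_model) hidden states of the current layer."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv                  # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over all neighbors
    return weights @ V                                # weighted combination per position

rng = np.random.default_rng(0)
d_model, d_k = 8, 8
H = rng.normal(size=(5, d_model))                     # 5 tokens in the sequence
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(H, Wq, Wk, Wv).shape)            # -> (5, 8)
```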
References
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems (Vol. 33).
Conneau, A., & Lample, G. (2019). Cross-lingual language model pretraining. Advances in Neural Information Processing Systems, 32, 7059–7069.
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 2978–2988).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).
Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., et al. (2019). Unified language model pre-training for natural language understanding and generation. In Proceedings of NeurIPS.
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., et al. (2019). TinyBERT: Distilling BERT for natural language understanding. Preprint, arXiv:1909.10351.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. Preprint, arXiv:1909.11942.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach. Preprint, arXiv:1907.11692.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., et al. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 2227–2237).
Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., & Huang, X. (2020). Pre-trained models for natural language processing: A survey. Preprint, arXiv:2003.08271.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. Technical report, OpenAI.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. Preprint, arXiv:1910.01108.
Song, K., Tan, X., Qin, T., Lu, J., & Liu, T.-Y. (2019). MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of ICML.
Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., et al. (2019). ERNIE: Enhanced representation through knowledge integration. Preprint, arXiv:1904.09223.
Sun, Y., Wang, S., Li, Y., Feng, S., Tian, H., Wu, H., et al. (2020). ERNIE 2.0: A continual pre-training framework for language understanding. In Proceedings of AAAI.
Tang, R., Lu, Y., Liu, L., Mou, L., Vechtomova, O., & Lin, J. (2019). Distilling task-specific knowledge from bert into simple neural networks. Preprint, arXiv:1903.12136.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. In Proceedings of NeurIPS.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (Vol. 32, pp. 5753–5763).
Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced language representation with informative entities. Preprint, arXiv:1905.07129.
Zhou, L., Zhang, J., & Zong, C. (2019). Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics, 7, 91–105.
Copyright information
© 2021 Tsinghua University Press
Cite this chapter
Zong, C., Xia, R., Zhang, J. (2021). Text Representation with Pretraining and Fine-Tuning. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_4
DOI: https://doi.org/10.1007/978-981-16-0100-2_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0099-9
Online ISBN: 978-981-16-0100-2