Development of a Song Lyric Corpus for the English Language

Rodrigues, Matheus Augusto Gonzaga; de Paiva Oliveira, Alcione; Moreira, Alexandra

doi:10.1007/978-3-030-23281-8_33

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11608)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

Web scraping tools are simplifying the task of creating large databases for various applications, such as the construction of corpora for developing natural language processing (NLP) applications. Many of these applications require a large amount of data, and in that sense the Web is an important data source. Among the various tasks within the scope of NLP, one of the most challenging is automatic text generation, where the objective is to generate syntactically and semantically correct texts after training on a particular corpus. This article presents the construction of an English song lyrics corpus, extracted from the Web, that can be used to train applications for the automatic generation of lyrics and poems, or for other NLP-related tasks. An analysis of the normalized corpus is presented, along with analyses performed on the corpus vectorization (embedding) generated with two current techniques.
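As a rough illustration of the kind of normalization and corpus analysis the abstract mentions, the following minimal Python sketch lowercases lyrics, drops bracketed section tags, tokenizes, and computes simple token statistics. The abstract does not describe the authors' actual pipeline, so every step here is an assumption, and the function names `normalize_lyric` and `corpus_stats` are hypothetical.

```python
import re
from collections import Counter


def normalize_lyric(text):
    """Return a list of lowercase word tokens for one lyric.

    Assumed normalization steps (not taken from the paper):
    remove bracketed section tags such as "[Chorus]", lowercase,
    and keep alphabetic words with an optional apostrophe suffix.
    """
    text = re.sub(r"\[[^\]]*\]", " ", text)  # drop tags like [Chorus]
    text = text.lower()
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)


def corpus_stats(lyrics):
    """Count tokens and distinct word types over an iterable of lyrics."""
    counts = Counter()
    for lyric in lyrics:
        counts.update(normalize_lyric(lyric))
    total = sum(counts.values())
    return {
        "tokens": total,
        "types": len(counts),
        # Type/token ratio is one simple measure of lexical variety.
        "type_token_ratio": len(counts) / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = ["[Chorus] Don't stop, DON'T stop!"]
    print(normalize_lyric(sample[0]))
    print(corpus_stats(sample))
```

Token lists produced this way could then be passed to an embedding library to train word vectors; which library and settings the authors used is not stated in the abstract.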

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and also by the funding agencies FAPEMIG and CNPq.

References

Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Ellis, R.J., Xing, Z., Fang, J., Wang, Y.: Quantifying lexical novelty in song lyrics. In: ISMIR, pp. 694–700 (2015)
Habernal, I., Zayed, O., Gurevych, I.: C4corpus: multilingual web-size corpus with free license. In: LREC, pp. 914–922 (2016)
Honnibal, M., Montani, I.: spacy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing (2017)
Kuznetsov, S.: 55000+ song lyrics. https://www.kaggle.com/mousehead/songlyrics. Accessed March 2019
Miethaner, U.: The blur (blues lyrics collected at the University of Regensburg) corpus: blues lyricism and the African American literary tradition. Curr. Objectives Postgrad. Am. Stud. 2 (2001). https://doi.org/10.5283/copas.64
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Milev, P.: Conceptual approach for development of web scraping application for tracking information. Econ. Altern. (3), 475–485 (2017)
Nishina, Y.: A study of pop songs based on the billboard corpus. Int. J. Lang. Linguist. 4(2), 125–134 (2017)
Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, Valletta, Malta, May 2010. http://is.muni.cz/publication/884893/en
Seitner, J., et al.: A large database of hypernymy relations extracted from the web. In: LREC, pp. 360–367 (2016)

Author information

Authors and Affiliations

Universidade Federal de Vicosa, Vicosa, MG, 36570900, Brazil
Matheus Augusto Gonzaga Rodrigues, Alcione de Paiva Oliveira & Alexandra Moreira

Corresponding author

Correspondence to Alcione de Paiva Oliveira.

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais
University of Salford, Salford, UK
Farid Meziane
University of Salford, Salford, UK
Sunil Vadera
Oakland University, Rochester, MI, USA
Vijayan Sugumaran
CSE, University of Salford, Salford, UK
Mohamad Saraee

Cite this paper

Rodrigues, M.A.G., de Paiva Oliveira, A., Moreira, A. (2019). Development of a Song Lyric Corpus for the English Language. In: Métais, E., Meziane, F., Vadera, S., Sugumaran, V., Saraee, M. (eds) Natural Language Processing and Information Systems. NLDB 2019. Lecture Notes in Computer Science, vol 11608. Springer, Cham. https://doi.org/10.1007/978-3-030-23281-8_33

DOI: https://doi.org/10.1007/978-3-030-23281-8_33
Published: 21 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23280-1
Online ISBN: 978-3-030-23281-8
eBook Packages: Computer Science, Computer Science (R0)
