Development of a Song Lyric Corpus for the English Language

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11608)

Abstract

Web scraping tools simplify the task of creating large databases for various applications, such as the construction of corpora for developing natural language processing (NLP) applications. Many of these applications require a large amount of data, and in that sense the Web is an important data source. Among the various NLP tasks, one of the most challenging is automatic text generation, where the objective is to generate syntactically and semantically correct texts after a training process on a particular corpus. This article presents the construction of an English song lyrics corpus, extracted from the Web, that can be used to train applications for automatic generation of lyrics, poems, or other NLP-related tasks. An analysis of the normalized corpus is presented, together with analyses performed on the vectorized corpus (embeddings) generated with two current techniques.
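The normalization and corpus analysis mentioned in the abstract might look roughly like the sketch below. It is only an illustration: this preview does not describe the authors' actual normalization steps, so the rules used here (lowercasing, dropping bracketed section tags such as "[Chorus]", stripping punctuation) and the names `normalize` and `corpus_stats` are assumptions, and the sample lines are invented.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Normalize one song's lyrics into a lowercase, space-separated string.

    Assumed rules (not taken from the paper): Unicode NFKC normalization,
    removal of bracketed section tags, lowercasing, punctuation stripping.
    """
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'")        # map curly apostrophes to ASCII
    text = re.sub(r"\[[^\]]*\]", " ", text)   # drop tags such as [Chorus]
    text = text.lower()
    text = re.sub(r"[^a-z0-9'\s]", " ", text) # keep letters, digits, apostrophes
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def corpus_stats(lyrics):
    """Return token count, vocabulary size and type/token ratio for a corpus."""
    counts = Counter()
    for song in lyrics:
        counts.update(normalize(song).split())
    total = sum(counts.values())
    return {
        "tokens": total,
        "types": len(counts),
        "type_token_ratio": len(counts) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Invented sample lines standing in for scraped lyrics.
    sample = ["Oh,  BABY!\n[Chorus]\nDon't go", "La la la, la LA love"]
    print(normalize(sample[0]))
    print(corpus_stats(sample))
```

The normalized, whitespace-tokenized lines produced this way are the kind of input that word embedding trainers typically expect, which is why normalization usually precedes the vectorization step.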

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and also by the funding agencies FAPEMIG and CNPq.

Author information

Corresponding author

Correspondence to Alcione de Paiva Oliveira.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Rodrigues, M.A.G., de Paiva Oliveira, A., Moreira, A. (2019). Development of a Song Lyric Corpus for the English Language. In: Métais, E., Meziane, F., Vadera, S., Sugumaran, V., Saraee, M. (eds) Natural Language Processing and Information Systems. NLDB 2019. Lecture Notes in Computer Science, vol 11608. Springer, Cham. https://doi.org/10.1007/978-3-030-23281-8_33

  • DOI: https://doi.org/10.1007/978-3-030-23281-8_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23280-1

  • Online ISBN: 978-3-030-23281-8

  • eBook Packages: Computer Science, Computer Science (R0)
