Skip to main content

Automatic Word Embeddings-Based Glossary Term Extraction from Large-Sized Software Requirements

  • Conference paper
  • First Online:
Requirements Engineering: Foundation for Software Quality (REFSQ 2020)

Part of the book series: Lecture Notes in Computer Science ((LNPSE,volume 12045))

Abstract

[Context and Motivation] Requirements glossary defines specialized and technical terms used in a requirements document. A requirements glossary helps in improving the quality and understandability of requirements documents. [Question/Problem] Manual extraction of glossary terms from a large body of requirements is an expensive and time-consuming task. This paper proposes a fundamentally new approach for automated extraction of glossary terms from large-sized requirements documents. [Principal Ideas/Result] Firstly, our technique extracts the candidate glossary terms by applying text chunking. Next, we apply a novel word embeddings based semantic filter for reducing the number of candidate glossary terms. Since word embeddings are very effective in identifying terms that are semantically very similar, this filter ensures that only domain-specific terms are present in the final set of glossary terms. We create a domain-specific reference corpus for home automation by Wikipedia crawling and use it for computing the semantic similarity scores of candidate glossary terms. We apply our technique to a large-sized requirements document, i.e., a CrowdRE dataset with around 3000 crowd-generated requirements for smart home applications. Semantic filtering reduces the number of glossary terms by 92.7%. To evaluate the quality of our extracted glossary terms we manually create the ground truth data from CrowdRE dataset and use it for computing precision and recall. Additionally, we also compute the requirements coverage of these extracted glossary terms. [Contributions] Our detailed experiments show that word embeddings based semantic filtering can be very useful for extracting glossary terms from a large body of requirements.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://selenium-python.readthedocs.io/.

  2. 2.

    https://en.wikipedia.org/wiki/Home_automation.

  3. 3.

    https://petscan.wmflabs.org/.

  4. 4.

    https://www.nltk.org/.

  5. 5.

    https://www.ranks.nl/stopwords.

  6. 6.

    https://www.nltk.org/_modules/nltk/tag.html.

  7. 7.

    https://www.nltk.org/_modules/nltk/tag/perceptron.html#PerceptronTagger.

  8. 8.

    https://www.nltk.org/_modules/nltk/stem/wordnet.html.

  9. 9.

    https://github.com/SibaMishra/Automatic-Glossary-Terms-Extraction-Using-Word-Embeddings.

References

  1. Aguilera, C., Berry, D.M.: The use of a repeated phrase finder in requirements extraction. J. Syst. Softw. 13(3), 209–230 (1990)

    Article  Google Scholar 

  2. Arora, C., Sabetzadeh, M., Briand, L.C., Zimmer, F.: Automated extraction and clustering of requirements glossary terms. IEEE Trans. Softw. Eng. 43(10), 918–945 (2017)

    Article  Google Scholar 

  3. Dwarakanath, A., Ramnani, R.R., Sengupta, S.: Automatic extraction of glossary terms from natural language requirements. In: 21st IEEE International Requirements Engineering Conference (RE), pp. 314–319, July 2013

    Google Scholar 

  4. Ferrari, A., Donati, B., Gnesi, S.: Detecting domain-specific ambiguities: an NLP approach based on Wikipedia crawling and word embeddings. In: 25th IEEE International Requirements Engineering Conference Workshops (REW), pp. 393–399, September 2017

    Google Scholar 

  5. Ferrari, A., Esuli, A., Gnesi, S.: Identification of cross-domain ambiguity with language models. In: 5th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), pp. 31–38, August 2018

    Google Scholar 

  6. Gemkow, T., Conzelmann, M., Hartig, K., Vogelsang, A.: Automatic glossary term extraction from large-scale requirements specifications. In: 26th IEEE International Requirements Engineering Conference, pp. 412–417. IEEE Computer Society (2018)

    Google Scholar 

  7. Goldin, L., Berry, D.M.: AbstFinder, a prototype natural language text abstraction finder for use in requirements elicitation. Autom. Softw. Eng. 4(4), 375–412 (1997). https://doi.org/10.1023/A:1008617922496

    Article  Google Scholar 

  8. Hull, M.E.C., Jackson, K., Dick, J.: Requirements Engineering, 2nd edn. Springer, Heidelberg (2005). https://doi.org/10.1007/b138335

    Book  MATH  Google Scholar 

  9. Justeson, J.S., Katz, S.M.: Technical terminology: some linguistic properties and an algorithm for identification in text. Nat. Lang. Eng. 1(1), 9–27 (1995)

    Article  Google Scholar 

  10. Kof, L.: Natural language processing for requirements engineering: applicability to large requirements documents. In: Workshop on Automated Software Engineering (2004)

    Google Scholar 

  11. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS 2014, pp. 2177–2185 (2014)

    Google Scholar 

  12. Ménard, P.A., Ratté, S.: Concept extraction from business documents for software engineering projects. Autom. Softw. Eng. 23(4), 649–686 (2015). https://doi.org/10.1007/s10515-015-0184-4

    Article  Google Scholar 

  13. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. NIPS 2013, pp. 3111–3119 (2013)

    Google Scholar 

  15. Mishra, S., Sharma, A.: On the use of word embeddings for identifying domain specific ambiguities in requirements. In: 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), pp. 234–240 (2019)

    Google Scholar 

  16. Murukannaiah, P.K., Ajmeri, N., Singh, M.P.: Toward automating crowd RE. In: 25th IEEE International Requirements Engineering Conference (RE), pp. 512–515, September 2017

    Google Scholar 

  17. Murukannaiah, P.K., Ajmeri, N., Singh, M.P.: Acquiring creative requirements from the crowd: understanding the influences of individual personality and creative potential in crowd RE. In: 24th IEEE International Requirements Engineering Conference (RE), pp. 176–185, September 2016

    Google Scholar 

  18. Park, Y., Byrd, R.J., Boguraev, B.K.: Automatic glossary extraction: beyond terminology identification. In: COLING 2002: The 19th International Conference on Computational Linguistics (2002). https://www.aclweb.org/anthology/C02-1142

  19. Pohl, K.: Requirements Engineering - Fundamentals, Principles, and Techniques, 1st edn. Springer, Heidelberg (2010)

    Google Scholar 

  20. Pohl, K., Böckle, G., van der Linden, F.J.: Software Product Line Engineering: Foundations. Principles and Techniques. Springer, Heidelberg (2005). https://doi.org/10.1007/3-540-28901-1

    Book  MATH  Google Scholar 

  21. Popescu, D., Rugaber, S., Medvidovic, N., Berry, D.M.: Reducing ambiguities in requirements specifications via automatically created object-oriented models. In: Paech, B., Martell, C. (eds.) Monterey Workshop 2007. LNCS, vol. 5320, pp. 103–124. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-89778-1_10

    Chapter  Google Scholar 

  22. Romero, F.P., Olivas, J.A., Genero, M., Piattini, M.: Automatic extraction of the main terminology used in empirical software engineering through text mining techniques. In: Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. ESEM 2008, pp. 357–358 (2008)

    Google Scholar 

  23. Zou, X., Settimi, R., Cleland-Huang, J.: Improving automated requirements trace retrieval: a study of term-based enhancement methods. Empir. Softw. Eng. 15(2), 119–146 (2010). https://doi.org/10.1007/s10664-009-9114-z

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Arpit Sharma .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Mishra, S., Sharma, A. (2020). Automatic Word Embeddings-Based Glossary Term Extraction from Large-Sized Software Requirements. In: Madhavji, N., Pasquale, L., Ferrari, A., Gnesi, S. (eds) Requirements Engineering: Foundation for Software Quality. REFSQ 2020. Lecture Notes in Computer Science(), vol 12045. Springer, Cham. https://doi.org/10.1007/978-3-030-44429-7_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-44429-7_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-44428-0

  • Online ISBN: 978-3-030-44429-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics