Using semantic roles to improve text classification in the requirements domain

  • Original Paper
  • Published in: Language Resources and Evaluation 52, 801–837 (2018)

Abstract

Engineering activities often produce considerable documentation as a by-product of the development process. Due to the complexity of these documents, technical analysts can benefit from text processing techniques able to identify concepts of interest and analyze deficiencies of the documents in an automated fashion. In practice, text sentences from the documentation are usually transformed into a vector space model, which is suitable for traditional machine learning classifiers. However, such transformations suffer from problems of synonymy and ambiguity that cause classification mistakes. To alleviate these problems, there has been growing interest in the semantic enrichment of text. Unfortunately, using general-purpose thesauri and encyclopedias to enrich technical documents belonging to a given domain (e.g. requirements engineering) often introduces noise and does not improve classification. In this work, we aim at boosting text classification by exploiting information about semantic roles. We have explored this approach when building a multi-label classifier for identifying special concepts, called domain actions, in textual software requirements. After evaluating various combinations of semantic roles and text classification algorithms, we found that this kind of semantically enriched data leads to improvements of up to 18% in both precision and recall, compared to non-enriched data. Our enrichment strategy based on semantic roles also allowed classifiers to reach acceptable accuracy levels with small training sets. Moreover, semantic roles outperformed Wikipedia- and WordNet-based enrichments, which failed to boost requirements classification with several techniques. These results drove the development of two requirements tools, which we successfully applied in the processing of textual use cases.
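As a rough illustration of the enrichment idea (a sketch under our own assumptions, not the authors' implementation), the snippet below appends PropBank-style role labels, as produced by a semantic role labeler, to the tokens of a requirement sentence before the usual bag-of-words vectorization. The (token, role) input format and the feature scheme are assumed for the example.

```python
# Minimal sketch of role-based enrichment, assuming a semantic role labeler
# has already produced PropBank-style labels (V, A0, A1, ...) per token.
# The input format and feature scheme are our assumptions, not the paper's code.

def enrich_with_roles(tagged_tokens):
    """Expand (token, role) pairs into features that preserve role information.

    A plain bag-of-words loses who-does-what-to-whom; emitting role-qualified
    tokens lets a classifier separate, e.g., 'system' acting as the agent (A0)
    from 'system' acting as the affected entity (A1)."""
    features = []
    for token, role in tagged_tokens:
        features.append(token.lower())                   # plain lexical feature
        if role is not None:
            features.append(f"{token.lower()}|{role}")   # role-qualified token
            features.append(f"role={role}")              # bare role indicator
    return features

# Use-case step: "The system validates the entered data."
tagged = [("The", None), ("system", "A0"), ("validates", "V"),
          ("the", None), ("entered", None), ("data", "A1")]
print(enrich_with_roles(tagged))
# ['the', 'system', 'system|A0', 'role=A0', 'validates', 'validates|V',
#  'role=V', 'the', 'entered', 'data', 'data|A1', 'role=A1']
```

Features such as 'system|A0' versus 'system|A1' let a classifier distinguish the acting entity from the affected one, which is precisely the information a plain vector space model discards.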



Notes

  1. For simplicity, a supervised approach is considered.

  2. The original requirements documents of the case studies can be downloaded from: http://www.alejandrorago.com.ar/files/assets/dataset-Source.zip.

  3. http://www.comp.lancs.ac.uk/~greenwop/tao/.

  4. http://sce.uhcl.edu/helm/RUP_course_example/courseregistrationproject/indexcourse.htm.

  5. http://sce.uhcl.edu/helm/rationalunifiedprocess/examples/csports/.

  6. The complete dataset can be found at http://www.alejandrorago.com.ar/files/assets/dataset-DomainActions.zip.

  7. https://code.google.com/p/mate-tools/.

  8. http://verbs.colorado.edu/~mpalmer/projects/ace.html.

  9. http://www.cs.waikato.ac.nz/ml/weka/.

  10. http://mulan.sourceforge.net/index.html.

  11. https://code.google.com/p/reassistant/.

  12. http://ucrefactoring.googlecode.com/.


Acknowledgements

This work was partially supported by ANPCyT (Argentina) through PICT Project 2015 No. 2565. The authors are grateful to the doctoral students who helped to manually tag the sentences of the case studies with DAs. The authors especially thank Paula Frade, Miguel Ruival, German Attanasio and Rodrigo Gonzalez for testing the DA classifier and helping us adjust the implementation. The authors also thank the anonymous reviewers, whose feedback helped improve the quality of the manuscript.

Author information

Correspondence to Alejandro Rago, Claudia Marcos or J. Andres Diaz-Pace.

Appendix: Description of domain actions

  • Process: Represents interactions that involve CPU-demanding activities (of a system).

    • Verification: Covers interactions associated with checks of user input, validation of stored information, and consistency of data.

    • Calculation: Covers interactions associated with the analysis of information and the synthesis of new results.

    • Communication: Covers all types of interaction with subsystems or foreign software/hardware.

      • Internal: Groups interactions linked to data sharing with subsystems.

      • External: Groups interactions linked to data sharing with other systems.

  • Data: Represents interactions that involve data-related activities, such as persistence and cache operations.

    • Read: Covers interactions associated with the retrieval of data.

      • Single: Groups retrieval interactions of single values, often linked to parameters and object representations.

      • Multiple: Groups retrieval interactions of many tuples of information, often materialized as a complex query.

    • Write: Covers interactions associated with the storage of data, by either adding, modifying, or removing it.

      • Create: Groups interactions aimed at incorporating new information into the system.

      • Update: Groups interactions aimed at altering pre-existing information in the system.

      • Delete: Groups interactions aimed at removing information from the system.

  • Use Case: Represents interactions commonly used in use case scenarios to manage the execution flow.

    • Begin: Groups interactions frequently used to denote the start of a use case flow.

    • End: Groups interactions frequently used to denote the end of a use case flow.

    • Control: Groups interactions that denote a jump from one use case step to another.

  • Input/Output: Represents interactions that involve communication between the described system and human actors (or other systems).

    • Input: Covers interactions associated with feeding information into the system.

      • Entry: Groups interactions related to feeding in data via a physical/virtual interface.

      • Selection: Groups interactions related to choosing data from a list of options.

    • Output: Covers interactions associated with the delivery of information to end users.

      • Display: Groups interactions related to the presentation of data on a physical/virtual display.

      • Notification: Groups interactions related to status changes or warning messages.
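
To make the hierarchy above easier to reuse, the following sketch (ours, not part of the paper's tooling) encodes it as a nested mapping and flattens it into its 16 leaf categories, which is the kind of label set a multi-label classifier over domain actions could predict.

```python
# Sketch (ours, not the paper's tooling): the domain-action taxonomy from the
# appendix as a nested mapping, plus a helper that flattens it into the leaf
# categories a multi-label classifier could use as its label set.

DA_TAXONOMY = {
    "Process": {
        "Verification": {},
        "Calculation": {},
        "Communication": {"Internal": {}, "External": {}},
    },
    "Data": {
        "Read": {"Single": {}, "Multiple": {}},
        "Write": {"Create": {}, "Update": {}, "Delete": {}},
    },
    "Use Case": {"Begin": {}, "End": {}, "Control": {}},
    "Input/Output": {
        "Input": {"Entry": {}, "Selection": {}},
        "Output": {"Display": {}, "Notification": {}},
    },
}

def leaf_labels(tree, prefix=""):
    """Return fully qualified leaf categories, e.g. 'Data > Write > Update'."""
    labels = []
    for name, children in tree.items():
        path = f"{prefix} > {name}" if prefix else name
        labels.extend(leaf_labels(children, path) if children else [path])
    return labels

print(len(leaf_labels(DA_TAXONOMY)))  # 16 leaf categories
```

Under this representation, a use-case step such as "The system stores the new record and notifies the clerk" could plausibly be labeled with both 'Data > Write > Create' and 'Input/Output > Output > Notification', which is why the classification problem is multi-label.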


About this article


Cite this article

Rago, A., Marcos, C. & Diaz-Pace, J.A. Using semantic roles to improve text classification in the requirements domain. Lang Resources & Evaluation 52, 801–837 (2018). https://doi.org/10.1007/s10579-017-9406-7

