What is the Message About? Automatic Multi-label Classification of Open Source Repository Messages into Content Types

  • Daniel Campbell
  • Luis Adrián Cabrera-Diego
  • Yannis Korkontzelos
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1153)


Users of Open Source Software (OSS) projects discuss a diverse range of topics online. The content of a post often corresponds to one or more context-sensitive content types, e.g. a suggestion for a solution, a request for further clarification, or an indication that a proposed solution did not work. Detecting content types can benefit software developers in several ways. For instance, content types can serve as indicators that summarise the content of messages. These indicators can be exploited as part of a developer-centric knowledge mining platform, allowing developers and project managers to create action alerts concerning new bugs found outside a bug tracker, or they can be combined with other metrics to assess the quality of an OSS project. We present a multi-label classifier able to classify messages exchanged on communication channels about OSS, together with detailed evaluation results. We experimented with two state-of-the-art multi-label classification approaches, HOMER (Hierarchy Of Multilabel classifiER) and RAkEL (RAndom k-labELsets), as these met the technical requirements of the CROSSMINER project. We used a manually annotated, threaded corpus of posts from newsgroup discussions, bug tracking systems and forums related to Eclipse projects. The results are promising and suggest that the task merits further, deeper research.


Content classification, Multi-label classification, Open Source Software, Natural language, Text mining, Machine learning, Information retrieval



This research work is part of the CROSSMINER Project, which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No. 732223.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Daniel Campbell (1)
  • Luis Adrián Cabrera-Diego (1)
  • Yannis Korkontzelos (1)
  1. Department of Computer Science, Edge Hill University, Ormskirk, UK
