Mining Cohesive Domain Topics from Source Code

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7925)

Abstract

Using topic models to mine domain topics from source code is a promising way to help developers comprehend the functional concerns implemented in a software system. However, not all topics mined from source code are domain topics that represent functional concerns of the software. Other topics may represent cross-cutting concerns or other concerns, and these are noise when the goal is to help developers comprehend functional concerns. In this paper, we propose an approach that filters out noise and mines Cohesive Domain Topics (CDTs) from source code. A topic is a CDT if its associated words represent a functional concern and its associated source code elements collaboratively implement that concern. First, we propose a series of Filtering Heuristics to remove programming-related information in source code that may introduce noise. Then, we mine raw topics from the source code using Latent Dirichlet Allocation. Finally, based on the structural relationships among the source code elements associated with a topic, we propose a novel metric called Topic Cohesion to identify CDTs among the raw topics. Experimental results on a set of open source software systems show that our approach can effectively filter out noise and obtain CDTs from source code.
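To make the first and last steps of the pipeline concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's method: the abstract does not define the Filtering Heuristics or the Topic Cohesion formula, so the stop list `PROGRAMMING_WORDS`, the helpers `split_identifier` and `filter_words`, and the density-style definition in `topic_cohesion` are all hypothetical. The Latent Dirichlet Allocation step is omitted; the sketch assumes each raw topic already comes with the set of source code elements associated with it.

```python
import re

# Hypothetical stop list of programming-related words. The paper's actual
# Filtering Heuristics are not specified in the abstract.
PROGRAMMING_WORDS = {"get", "set", "is", "int", "string", "list", "impl", "util"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def filter_words(identifiers):
    """Keep only words that are not on the programming stop list."""
    words = []
    for identifier in identifiers:
        for word in split_identifier(identifier):
            if word not in PROGRAMMING_WORDS:
                words.append(word)
    return words


def topic_cohesion(elements, relations):
    """Assumed definition: the fraction of element pairs within a topic that
    are linked by a structural relationship (for example a call or a
    reference). This is a plain graph density, used here as a stand-in for
    the paper's Topic Cohesion metric."""
    elements = set(elements)
    if len(elements) < 2:
        return 0.0
    # Count each undirected pair once, ignoring relations that leave the topic.
    linked = {
        frozenset(pair)
        for pair in relations
        if len(set(pair)) == 2 and set(pair) <= elements
    }
    possible_pairs = len(elements) * (len(elements) - 1) // 2
    return len(linked) / possible_pairs


if __name__ == "__main__":
    print(filter_words(["getAccountBalance", "parseXMLDocument", "account_id"]))
    topic_elements = {"Account", "Ledger", "Transfer"}
    structural_relations = [
        ("Account", "Ledger"),
        ("Ledger", "Transfer"),
        ("Account", "Logger"),  # Logger is outside the topic, so it is ignored
    ]
    print(topic_cohesion(topic_elements, structural_relations))
```

Under this assumed definition, a raw topic whose elements rarely reference one another scores near 0 and would be treated as noise, while a topic whose elements are densely connected scores near 1. Any threshold used to separate the two would be another assumption.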

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xie, B., Li, M., Jin, J., Zhao, J., Zou, Y. (2013). Mining Cohesive Domain Topics from Source Code. In: Favaro, J., Morisio, M. (eds) Safe and Secure Software Reuse. ICSR 2013. Lecture Notes in Computer Science, vol 7925. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38977-1_16

  • DOI: https://doi.org/10.1007/978-3-642-38977-1_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38976-4

  • Online ISBN: 978-3-642-38977-1

  • eBook Packages: Computer Science, Computer Science (R0)