Abstract
Docstrings play an important role in software development and maintenance because they document specific segments of source code. In dynamic languages, docstrings are often used to annotate the types of parameters and return values.
Docstrings can remind developers of the expected types of a parameter without the time-consuming process of comprehending the surrounding context. In this study, we propose an automatic approach to recognize the potential types of a parameter from its description.
In our approach, we use feature selection to choose useful features for classifier training. We then apply four kinds of classifiers to recognize potential types and evaluate their performance using seven metrics.
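The feature selection step can be illustrated with a small sketch. It assumes type descriptions are tokenized into sets of terms, that each description carries a set of type labels, and that a term is scored by its largest 2x2 contingency chi-square statistic over all labels. The scoring rule, the function names, and the toy data are illustrative assumptions, not details taken from the study.

```python
from math import ceil


def chi2_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - b * c) ** 2 / denom


def term_scores(docs, labels):
    """Score each term by its maximum chi-square over all labels.

    docs:   list of token sets, one per type description (hypothetical input)
    labels: list of label sets, aligned with docs
    Taking the maximum over labels is an assumption made for this sketch.
    """
    vocab = set().union(*docs)
    all_labels = set().union(*labels)
    scores = {}
    for term in vocab:
        best = 0.0
        for lab in all_labels:
            a = sum(1 for doc, labs in zip(docs, labels) if term in doc and lab in labs)
            b = sum(1 for doc, labs in zip(docs, labels) if term in doc and lab not in labs)
            c = sum(1 for doc, labs in zip(docs, labels) if term not in doc and lab in labs)
            d = len(docs) - a - b - c
            best = max(best, chi2_2x2(a, b, c, d))
        scores[term] = best
    return scores


def select_top(scores, fraction):
    """Keep the top `fraction` of terms, ranked by score (ties broken alphabetically)."""
    k = max(1, ceil(len(scores) * fraction))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy data, invented for illustration only.
docs = [
    {"a", "list", "of", "str"},
    {"int", "or", "none"},
    {"list", "of", "int"},
    {"str"},
]
labels = [{"list"}, {"int", "NoneType"}, {"list"}, {"str"}]

scores = term_scores(docs, labels)
print(select_top(scores, 0.2))  # terms kept when selecting the top 20%
```

Only the selected terms would then be used as classifier features; the fraction passed to `select_top` plays the role of the feature percentage.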
We collect a dataset of 314 type descriptions from ten prevalent Python projects. Our experimental results show that the Decision Tree classifier performs best among the four studied classifiers: its precision, recall, F1-score, Jaccard index, Hamming loss, accuracy, and MRR are 0.681, 0.548, 0.582, 0.542, 1.234, 0.432, and 0.778, respectively. The multi-layer perceptron performs worst. Furthermore, we find that the four classifiers achieve their best performance when the top 20% or 40% of features with the highest \(\chi^2\) statistic are selected.
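Three of the multi-label metrics named above can be sketched as follows. The abstract does not give their exact definitions, so these are assumptions: the Jaccard index is averaged per sample, MRR uses the rank of the first correct label in a ranked prediction list, and Hamming loss is computed as a per-sample count of mismatched labels. The count form is assumed because the reported value of 1.234 exceeds 1, which a loss normalized by the number of labels cannot do. All data below is invented for illustration.

```python
def jaccard_index(true, pred):
    """Size of the intersection over size of the union of two label sets."""
    if not true and not pred:
        return 1.0
    return len(true & pred) / len(true | pred)


def hamming_count(true, pred):
    """Number of labels on which the two sets disagree (unnormalized; an assumption)."""
    return len(true ^ pred)


def reciprocal_rank(true, ranked):
    """1 / rank of the first correct label in `ranked`, or 0.0 if none is correct."""
    for rank, label in enumerate(ranked, start=1):
        if label in true:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical ground truth, predicted label sets, and ranked predictions.
y_true = [{"str"}, {"int", "NoneType"}]
y_pred = [{"str", "list"}, {"int"}]
y_ranked = [["list", "str", "int"], ["int", "NoneType", "str"]]

mean_jaccard = mean(jaccard_index(t, p) for t, p in zip(y_true, y_pred))
mean_hamming = mean(hamming_count(t, p) for t, p in zip(y_true, y_pred))
mrr = mean(reciprocal_rank(t, r) for t, r in zip(y_true, y_ranked))

print(mean_jaccard, mean_hamming, mrr)  # 0.5 1.0 0.75
```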
This study contributes a dataset of type descriptions and proposes a framework for automatically recognizing the potential types of a parameter from its description.
Acknowledgments
This work is supported by the National Key R&D Program of China (2018YFB1003900), the Natural Science Foundation of Jiangsu Province of China (BK20140611), the National Natural Science Foundation of China (61872177, 61772263, 61432001), and the Program B for Outstanding PhD Candidates of Nanjing University.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Luo, Y., Ma, W., Li, Y., Chen, Z., Chen, L. (2018). Recognizing Potential Runtime Types from Python Docstrings. In: Bu, L., Xiong, Y. (eds) Software Analysis, Testing, and Evolution. SATE 2018. Lecture Notes in Computer Science, vol 11293. Springer, Cham. https://doi.org/10.1007/978-3-030-04272-1_5
DOI: https://doi.org/10.1007/978-3-030-04272-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04271-4
Online ISBN: 978-3-030-04272-1
eBook Packages: Computer Science, Computer Science (R0)