Recognizing Potential Runtime Types from Python Docstrings

Luo, Yang; Ma, Wanwangying; Li, Yanhui; Chen, Zhifei; Chen, Lin

doi:10.1007/978-3-030-04272-1_5

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11293)

Included in the following conference series:

International Conference on Software Analysis, Testing, and Evolution

Abstract

Docstrings play an important role in software development and maintenance because they are used in source code to document specific segments of code. In dynamic languages, docstrings are commonly used to annotate the types of parameters and return values.
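As an illustration of a docstring that annotates types, the sketch below uses the reST field-list style (`:type name:`), which is only one of several conventions in use. The function `scale` and the helper `type_descriptions` are hypothetical names introduced here, not part of the paper.

```python
import re


def scale(values, factor):
    """Multiply every element of ``values`` by ``factor``.

    :param values: numbers to scale
    :type values: list of int or float
    :param factor: the multiplier
    :type factor: int or float
    :rtype: list of float
    """
    return [float(v) * factor for v in values]


# Matches reST-style ":type <name>: <description>" fields, one per line.
TYPE_FIELD = re.compile(r"^\s*:type\s+(\w+):\s*(.+?)\s*$", re.MULTILINE)


def type_descriptions(func):
    """Map each documented parameter name to its free-text type description."""
    return dict(TYPE_FIELD.findall(func.__doc__ or ""))


descriptions = type_descriptions(scale)
print(descriptions)
```

The extracted values are free text such as "list of int or float", not type objects; this is the kind of type description from which potential types would then have to be recognized.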

Docstrings can remind developers of the expected types of a parameter without the time-consuming process of comprehending the surrounding context. In this study, we propose an automatic approach to recognize the potential types of a parameter from its description.

In our approach, we use feature selection to choose useful features for classifier training. We then adopt four kinds of classifiers to recognize potential types and evaluate their performance using seven metrics.
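The feature selection step can be sketched as follows, under several assumptions that the abstract does not confirm: features are the words of a description, each word is scored with the textbook \(\chi^2\) statistic of a 2x2 word/label contingency table, and a word's overall score is its maximum over all labels. The toy descriptions and all function names are hypothetical.

```python
import math
import re


def chi2_score(docs, labels, term, label):
    """Chi-squared statistic of the 2x2 table (term present?) x (label assigned?)."""
    a = b = c = d = 0
    for words, types in zip(docs, labels):
        has_term, has_label = term in words, label in types
        if has_term and has_label:
            a += 1
        elif has_term:
            b += 1
        elif has_label:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_top_features(docs, labels, fraction):
    """Keep the given fraction of words with the highest chi-squared score."""
    vocab = sorted(set().union(*docs))
    all_labels = sorted(set().union(*labels))
    # Assumption: a word's score is its best score over all labels.
    scores = {
        t: max(chi2_score(docs, labels, t, lab) for lab in all_labels)
        for t in vocab
    }
    k = max(1, math.ceil(fraction * len(vocab)))
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))  # ties broken by name
    return ranked[:k]


# Hypothetical toy data: (type description, set of potential types).
samples = [
    ("a string or list of strings", {"str", "list"}),
    ("integer number of rows", {"int"}),
    ("list of column names", {"list"}),
    ("the file name string", {"str"}),
    ("number of items, an integer", {"int"}),
]
docs = [set(re.findall(r"[a-z]+", text.lower())) for text, _ in samples]
labels = [types for _, types in samples]

selected = select_top_features(docs, labels, 0.2)
print(selected)
```

Only the selected words would then be passed on as features to a classifier; the classifier itself is omitted from this sketch.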

We collect a dataset of 314 type descriptions from ten prevalent Python projects. Our experimental results show that the Decision Tree classifier performs best among the four studied classifiers: its precision, recall, F1-score, Jaccard index, Hamming loss, accuracy and MRR reach 0.681, 0.548, 0.582, 0.542, 1.234, 0.432 and 0.778, respectively. The multi-layer perceptron performs worst. Furthermore, we find that the four classifiers perform best when the top 20% or 40% of features with the highest \(\chi^2\) statistic are selected.
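The abstract does not define the seven metrics, so the following is only a sketch using common example-based definitions for multi-label prediction. Two assumptions are worth stating: the Hamming loss is computed as the unnormalized number of wrongly assigned or missed types per description (the reported value of 1.234 exceeds 1, which a normalized variant could not), and MRR presumes that each prediction is a ranked list. The example data and function names are hypothetical.

```python
def score_example(true_types, ranked_prediction):
    """Score one description: true_types is a set, ranked_prediction a list (best first)."""
    predicted = set(ranked_prediction)
    hits = len(true_types & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(true_types) if true_types else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    union = true_types | predicted
    jaccard = hits / len(union) if union else 1.0
    # Assumption: unnormalized Hamming loss (size of the symmetric difference).
    hamming = float(len(true_types ^ predicted))
    accuracy = 1.0 if true_types == predicted else 0.0  # exact-match accuracy
    reciprocal_rank = 0.0
    for rank, label in enumerate(ranked_prediction, start=1):
        if label in true_types:
            reciprocal_rank = 1.0 / rank
            break
    return {
        "precision": precision, "recall": recall, "f1": f1,
        "jaccard": jaccard, "hamming": hamming,
        "accuracy": accuracy, "mrr": reciprocal_rank,
    }


def average_metrics(examples):
    """Average each per-example metric over all (true, ranked prediction) pairs."""
    rows = [score_example(true, ranked) for true, ranked in examples]
    return {name: sum(row[name] for row in rows) / len(rows) for name in rows[0]}


# Hypothetical predictions for three parameter descriptions.
examples = [
    ({"str", "list"}, ["str", "list"]),  # exact match
    ({"int"}, ["float", "int"]),         # correct type ranked second
    ({"dict"}, ["list"]),                # complete miss
]
results = average_metrics(examples)
print(results)
```

Under these definitions a lower Hamming loss is better, while higher values are better for the other six metrics.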

This study archives a dataset of type descriptions and proposes a framework for automatically recognizing the potential types of a parameter from its description.

Notes

1. https://github.com/Instagram/MonkeyType

Acknowledgments

The work is supported by the National Key R&D Program of China (2018YFB1003900), the Natural Science Foundation of Jiangsu Province of China (BK20140611), the National Natural Science Foundation of China (61872177, 61772263, 61432001), and the program B for Outstanding PhD candidate of Nanjing University.

Author information

Authors and Affiliations

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Yang Luo, Wanwangying Ma, Yanhui Li, Zhifei Chen & Lin Chen

Corresponding author

Correspondence to Lin Chen.

Editor information

Editors and Affiliations

Nanjing University, Nanjing, China
Lei Bu
Peking University, Peking, China
Yingfei Xiong

About this paper

Cite this paper

Luo, Y., Ma, W., Li, Y., Chen, Z., Chen, L. (2018). Recognizing Potential Runtime Types from Python Docstrings. In: Bu, L., Xiong, Y. (eds) Software Analysis, Testing, and Evolution. SATE 2018. Lecture Notes in Computer Science, vol 11293. Springer, Cham. https://doi.org/10.1007/978-3-030-04272-1_5

DOI: https://doi.org/10.1007/978-3-030-04272-1_5
Published: 20 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04271-4
Online ISBN: 978-3-030-04272-1
eBook Packages: Computer Science, Computer Science (R0)
