TBCNN for Programs’ Abstract Syntax Trees

Mou, Lili; Jin, Zhi

doi:10.1007/978-981-13-1870-2_4

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

In this chapter, we apply the tree-based convolutional neural network (TBCNN) to the source code of programming languages, a task we call programming language processing. Programming language processing is an active research topic in software engineering and has attracted growing interest in the artificial intelligence community. A distinctive characteristic of a program is that it contains rich, explicit, and complicated structural information, which calls for more intensive modeling of structure. We propose a TBCNN variant for programming language processing, in which the convolution kernel is designed for programs’ abstract syntax trees. We show the effectiveness of TBCNN on two program analysis tasks: classifying programs by functionality and detecting code snippets that match certain patterns. TBCNN outperforms the baseline methods, including several neural models for NLP.
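The idea of convolving over an abstract syntax tree can be illustrated with a deliberately simplified sketch. It is not the kernel defined in this chapter: it assumes Python's standard `ast` module as a stand-in parser (the chapter works with C programs), a window that covers only a node and its direct children, child vectors that are simply averaged, random untrained weights, and max pooling over all nodes to obtain a fixed-size vector. The names `embed`, `convolve`, `encode`, `DIM`, and `FILTERS` are hypothetical and chosen only for this illustration.

```python
import ast
import math
import random

DIM = 4       # size of each node-type vector (illustrative assumption)
FILTERS = 3   # number of feature detectors (illustrative assumption)

rng = random.Random(0)
embeddings = {}  # node-type name -> vector, created lazily


def embed(node):
    """Return a fixed random vector for the node's AST type.

    This stands in for learned node representations; nothing is trained here.
    """
    name = type(node).__name__
    if name not in embeddings:
        embeddings[name] = [rng.uniform(-1, 1) for _ in range(DIM)]
    return embeddings[name]


def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Separate (random, untrained) weights for the window's parent and children.
W_PARENT = random_matrix(FILTERS, DIM)
W_CHILD = random_matrix(FILTERS, DIM)


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def convolve(node):
    """Apply one window covering a node and its direct children."""
    out = matvec(W_PARENT, embed(node))
    children = list(ast.iter_child_nodes(node))
    if children:
        # Simplification: average the children instead of weighting by position.
        mean_child = [
            sum(embed(c)[i] for c in children) / len(children)
            for i in range(DIM)
        ]
        out = [a + b for a, b in zip(out, matvec(W_CHILD, mean_child))]
    return [math.tanh(o) for o in out]


def encode(source):
    """Slide the window over every AST node, then max-pool each feature."""
    tree = ast.parse(source)
    features = [convolve(node) for node in ast.walk(tree)]
    return [max(column) for column in zip(*features)]


vec_a = encode("def f(x):\n    return x + 1\n")
vec_b = encode("for i in range(10):\n    if i % 2:\n        print(i)\n")
print(len(vec_a), len(vec_b))
```

Both programs map to vectors of the same length even though their trees differ in size and shape, which is what lets a fixed-size classifier sit on top of the pooled features. In a real model the node vectors and weights would be learned rather than drawn at random.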

The contents of this chapter were published in [16]. Copyright © 2016, Association for the Advancement of Artificial Intelligence (https://www.aaai.org). Implementation code and the collected dataset are available through our website (https://sites.google.com/site/treebasedcnn/).

Notes

1. Parsed by pycparser (https://pypi.python.org/pypi/pycparser/).
2. In their original paper, they do not deal with varying-length data, but their method extends naturally to this scenario. Their method is also mathematically equivalent to average pooling.
3. http://programming.grids.cn. The data are available on our website (Footnote 1 of this chapter).
4. We do not use the pretrained vector representations, which are inimical to the recursive neural network: the weight \(W_\text {code}\) encodes children’s representations into that of their candidate parent; however, the high-level nodes in programs (e.g., a function definition) are typically non-informative.
5. Earlier versions can be found at https://arxiv.org/pdf/1409.3348v1 and https://arxiv.org/pdf/1409.5718v1.

References

1. Baxter, I., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: Proceedings of the International Conference on Software Maintenance, pp. 368–377 (1998)
2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
3. Bettenburg, N., Begel, A.: Deciphering the story of software development through frequent pattern mining. In: Proceedings of the 35th International Conference on Software Engineering, pp. 1197–1200 (2013)
4. Chilowicz, M., Duris, E., Roussel, G.: Syntax tree fingerprinting for source code similarity detection. In: Proceedings of the IEEE International Conference on Program Comprehension, pp. 243–247 (2009)
5. Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167 (2008)
6. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
7. Dahl, G., Mohamed, A., Hinton, G.: Phone recognition with the mean-covariance restricted Boltzmann machine. In: Advances in Neural Information Processing Systems, pp. 469–477 (2010)
8. Dietz, L., Dallmeier, V., Zeller, A., Scheffer, T.: Localizing bugs in program executions with graphical models. In: Advances in Neural Information Processing Systems, pp. 468–476 (2009)
9. Ghabi, A., Egyed, A.: Code patterns for automatically validating requirements-to-code traces. In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pp. 200–209 (2012)
10. Hao, D., Lan, T., Zhang, H., Guo, C., Zhang, L.: Is this a bug or an obsolete test? In: Proceedings of the European Conference on Object-Oriented Programming, pp. 602–628 (2013)
11. Hermann, K.M., Blunsom, P.: Multilingual models for compositional distributed semantics. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 58–68 (2014)
12. Hindle, A., Barr, E.T., Su, Z., Gabel, M., Devanbu, P.: On the naturalness of software. In: Proceedings of the 34th International Conference on Software Engineering, pp. 837–847 (2012)
13. Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 655–665 (2014)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
16. Mou, L., Li, G., Zhang, L., Wang, T., Jin, Z.: Convolutional neural networks over tree structures for programming language processing. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1287–1293 (2016)
17. Mou, L., Peng, H., Li, G., Xu, Y., Zhang, L., Jin, Z.: Discriminative neural sentence modeling by tree-based convolution. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2315–2325 (2015)
18. Pane, J., Ratanamahatana, C., Myers, B.: Studying the language and structure in non-programmers’ solutions to programming problems. Int. J. Hum. Comput. Stud. 54(2), 237–264 (2001)
19. Peng, H., Mou, L., Li, G., Liu, Y., Zhang, L., Jin, Z.: Building program vector representations for deep learning. In: Proceedings of the 8th International Conference on Knowledge Science, Engineering and Management, pp. 547–553 (2015)
20. Pinker, S.: The Language Instinct: The New Science of Language and Mind. Penguin Press (1994)
21. Socher, R., Huang, E., Pennington, J., Manning, C., Ng, A.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Advances in Neural Information Processing Systems, pp. 801–809 (2011)
22. Socher, R., Karpathy, A., Le, Q., Manning, C., Ng, A.Y.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)
23. Socher, R., Pennington, J., Huang, E., Ng, A., Manning, C.: Semi-supervised recursive autoencoders for predicting sentiment distributions. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 151–161 (2011)
24. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
25. Steidl, D., Gode, N.: Feature-based detection of bugs in clones. In: Proceedings of the 7th International Workshop on Software Clones, pp. 76–82 (2013)
26. Yamaguchi, F., Lottmann, M., Rieck, K.: Generalized vulnerability extrapolation using abstract syntax trees. In: Proceedings of the 28th Annual Computer Security Applications Conference, pp. 359–368 (2012)

Author information

Authors and Affiliations

AdeptMind Research, Toronto, ON, Canada
Lili Mou
Institute of Software, Peking University, Beijing, China
Zhi Jin

Corresponding author

Correspondence to Lili Mou.

About this chapter

Cite this chapter

Mou, L., Jin, Z. (2018). TBCNN for Programs’ Abstract Syntax Trees. In: Tree-Based Convolutional Neural Networks. SpringerBriefs in Computer Science. Springer, Singapore. https://doi.org/10.1007/978-981-13-1870-2_4

DOI: https://doi.org/10.1007/978-981-13-1870-2_4
Published: 02 October 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1869-6
Online ISBN: 978-981-13-1870-2
eBook Packages: Computer Science, Computer Science (R0)
