Abstract
In this chapter, we apply the tree-based convolutional neural network (TBCNN) to the source code of programming languages, a task we call programming language processing. Programming language processing is an active research topic in software engineering and has attracted growing interest in the artificial intelligence community. A distinct characteristic of a program is that it contains rich, explicit, and complicated structural information, which calls for more intensive modeling of structure. We propose a TBCNN variant for programming language processing in which a convolution kernel is designed over programs’ abstract syntax trees. We show the effectiveness of TBCNN on two program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms the baseline methods, including several neural models for NLP.
The contents of this chapter were published in [16]. Copyright © 2016, Association for the Advancement of Artificial Intelligence (https://www.aaai.org). Implementation code and the collected dataset are available through our website (https://sites.google.com/site/treebasedcnn/).
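To make the idea of a convolution kernel over an abstract syntax tree more concrete, the following is a minimal sketch, not the chapter's implementation. It assumes a simplified kernel with one weight matrix for the window's parent node and one shared matrix for its direct children (averaged over the number of children), followed by max pooling over all nodes. The `Node` class, the function names, the toy weights, and the 2-dimensional feature vectors are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node: a symbol name, a feature vector, and children."""
    symbol: str
    vec: List[float]
    children: List["Node"] = field(default_factory=list)


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def conv_at(node, W_parent, W_child, bias):
    """Apply one depth-2 window: the node plus its direct children."""
    out = [p + b for p, b in zip(matvec(W_parent, node.vec), bias)]
    if node.children:
        # Assumption: children share one weight matrix and are averaged.
        scale = 1.0 / len(node.children)
        for child in node.children:
            contrib = matvec(W_child, child.vec)
            out = [o + scale * c for o, c in zip(out, contrib)]
    return [math.tanh(o) for o in out]


def tree_conv_max_pool(root, W_parent, W_child, bias):
    """Slide the window over every node, then max-pool each feature dimension."""
    pooled = [-math.inf] * len(bias)
    stack = [root]
    while stack:
        node = stack.pop()
        feat = conv_at(node, W_parent, W_child, bias)
        pooled = [max(p, f) for p, f in zip(pooled, feat)]
        stack.extend(node.children)
    return pooled


# Toy AST for the expression `a + b` (feature values are made up).
ast = Node("BinaryOp", [1.0, 0.0],
           [Node("ID", [0.0, 1.0]), Node("ID", [0.0, 1.0])])
W_parent = [[0.5, 0.0], [0.0, 0.5]]
W_child = [[0.0, 1.0], [1.0, 0.0]]
bias = [0.0, 0.0]

print(tree_conv_max_pool(ast, W_parent, W_child, bias))
```

Because pooling takes a per-dimension maximum over all window outputs, the result has the same length as `bias` regardless of how many nodes the tree contains, which is what lets a fixed-size classifier sit on top of programs of varying size.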
Notes
1. Parsed by pycparser (https://pypi.python.org/pypi/pycparser/).
2. In their original paper, they do not deal with varying-length data, but their method extends naturally to this scenario. Their method is also mathematically equivalent to average pooling.
3. http://programming.grids.cn. The data are available on our website (Footnote 1 of this Chapter).
4. We do not use the pretrained vector representations, which are inimical to the recursive neural network: the weight \(W_\text{code}\) encodes children’s representations into their candidate parent’s; however, the high-level nodes in programs (e.g., a function definition) are typically non-informative.
5. Earlier versions can be found at https://arxiv.org/pdf/1409.3348v1 and https://arxiv.org/pdf/1409.5718v1.
References
Baxter, I., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: Proceedings of the International Conference on Software Maintenance, pp. 368–377 (1998)
Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
Bettenburg, N., Begel, A.: Deciphering the story of software development through frequent pattern mining. In: Proceedings of the 35th International Conference on Software Engineering, pp. 1197–1200 (2013)
Chilowicz, M., Duris, E., Roussel, G.: Syntax tree fingerprinting for source code similarity detection. In: Proceedings of the IEEE International Conference on Program Comprehension, pp. 243–247 (2009)
Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of 25th International Conference on Machine Learning, pp. 160–167 (2008)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
Dahl, G., Mohamed, A., Hinton, G.: Phone recognition with the mean-covariance restricted Boltzmann machine. In: Advances in Neural Information Processing Systems, pp. 469–477 (2010)
Dietz, L., Dallmeier, V., Zeller, A., Scheffer, T.: Localizing bugs in program executions with graphical models. In: Advances in Neural Information Processing Systems, pp. 468–476 (2009)
Ghabi, A., Egyed, A.: Code patterns for automatically validating requirements-to-code traces. In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pp. 200–209 (2012)
Hao, D., Lan, T., Zhang, H., Guo, C., Zhang, L.: Is this a bug or an obsolete test? In: Proceedings of the European Conference on Object-Oriented Programming, pp. 602–628 (2013)
Hermann, K.M., Blunsom, P.: Multilingual models for compositional distributed semantics. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 58–68 (2014)
Hindle, A., Barr, E.T., Su, Z., Gabel, M., Devanbu, P.: On the naturalness of software. In: Proceedings of the 34th International Conference on Software Engineering, pp. 837–847 (2012)
Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 655–665 (2014)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Mou, L., Li, G., Zhang, L., Wang, T., Jin, Z.: Convolutional neural networks over tree structures for programming language processing. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1287–1293 (2016)
Mou, L., Peng, H., Li, G., Xu, Y., Zhang, L., Jin, Z.: Discriminative neural sentence modeling by tree-based convolution. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2315–2325 (2015)
Pane, J., Ratanamahatana, C., Myers, B.: Studying the language and structure in non-programmers’ solutions to programming problems. Int. J. Hum. Comput. Stud. 54(2), 237–264 (2001)
Peng, H., Mou, L., Li, G., Liu, Y., Zhang, L., Jin, Z.: Building program vector representations for deep learning. In: Proceedings of the 8th International Conference on Knowledge Science, Engineering and Management, pp. 547–553 (2015)
Pinker, S.: The Language Instinct: The New Science of Language and Mind. Penguin Press (1994)
Socher, R., Huang, E., Pennington, J., Manning, C., Ng, A.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Advances in Neural Information Processing Systems, pp. 801–809 (2011)
Socher, R., Karpathy, A., Le, Q., Manning, C., Ng, A.Y.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)
Socher, R., Pennington, J., Huang, E., Ng, A., Manning, C.: Semi-supervised recursive autoencoders for predicting sentiment distributions. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 151–161 (2011)
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
Steidl, D., Gode, N.: Feature-based detection of bugs in clones. In: Proceedings of the 7th International Workshop on Software Clones, pp. 76–82 (2013)
Yamaguchi, F., Lottmann, M., Rieck, K.: Generalized vulnerability extrapolation using abstract syntax trees. In: Proceedings of 28th Annual Computer Security Applications Conference, pp. 359–368 (2012)
Copyright information
© 2018 The Author(s)
About this chapter
Cite this chapter
Mou, L., Jin, Z. (2018). TBCNN for Programs’ Abstract Syntax Trees. In: Tree-Based Convolutional Neural Networks. SpringerBriefs in Computer Science. Springer, Singapore. https://doi.org/10.1007/978-981-13-1870-2_4
DOI: https://doi.org/10.1007/978-981-13-1870-2_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1869-6
Online ISBN: 978-981-13-1870-2
eBook Packages: Computer Science (R0)