Detecting Programming Language from Source Code Using Bayesian Learning Techniques

Khasnabish, Jyotiska Nath; Sodhi, Mitali; Deshmukh, Jayati; Srinivasaraghavan, G.

doi:10.1007/978-3-319-08979-9_39


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8556)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition


Abstract

With dozens of popular programming languages in use worldwide, the number of source code files publicly available online is massive. However, most blogs, forums, and online Q&A websites offer poor searchability for source code in a specific programming language. Programming language editors invariably rely on naive rules of thumb based on the file extension, if any, to drive syntax highlighting, indentation, and other readability aids. A more systematic way to identify the language in which a given source file was written would be of immense value. We believe that simple Bayesian models are adequate for this task, given the intrinsic syntactic structure of any programming language. In this paper, we present Bayesian learning models that identify, with high probability, the programming language in which a given piece of source code was written. We used 20,000 source code files across 10 programming languages to train and test three Bayesian classifier models: Naive Bayes, Bayesian Network, and Multinomial Naive Bayes. Finally, we compare the three models in terms of classification accuracy on the test data.
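As a rough illustration of the Multinomial Naive Bayes idea named in the abstract, the sketch below scores a snippet against per-language token counts with Laplace smoothing. It is not the authors' implementation: the abstract does not describe their feature extraction or tooling, so the tokenizer (`tokenize`), the class `MultinomialNB`, the smoothing constant `alpha`, and the six toy training snippets are all assumptions made up for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: runs of letters/underscores, or single punctuation
# characters. Digits and whitespace are ignored.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(source):
    """Split source text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(source)


class MultinomialNB:
    """Toy multinomial Naive Bayes over token counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing constant (assumed)
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()               # label -> number of files
        self.vocab = set()

    def fit(self, samples):
        for source, label in samples:
            tokens = tokenize(source)
            self.token_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_scores(self, source):
        """Return log P(label) + sum of log P(token | label) for each label."""
        tokens = tokenize(source)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                # Tokens never seen for this label still get mass alpha / denom.
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, source):
        scores = self.log_scores(source)
        return max(scores, key=scores.get)


# Made-up training snippets, for illustration only (not the paper's data set).
TRAIN = [
    ("def add(a, b):\n    return a + b\n", "Python"),
    ("import os\nfor name in os.listdir('.'):\n    print(name)\n", "Python"),
    ("public class Main { public static void main(String[] args) "
     "{ System.out.println(1); } }", "Java"),
    ("private int add(int a, int b) { return a + b; }", "Java"),
    ("#include <stdio.h>\nint main(void) { printf(\"hi\\n\"); return 0; }", "C"),
    ("#include <stdlib.h>\nvoid swap(int *a, int *b) "
     "{ int t = *a; *a = *b; *b = t; }", "C"),
]

model = MultinomialNB().fit(TRAIN)
print(model.predict("def greet(name):\n    print(name)\n"))
print(model.predict("public static void run() { System.out.println(); }"))
```

Working in log space avoids numeric underflow when many small token probabilities are multiplied. With only six training snippets the predictions are illustrative at best; a realistic setup would need a corpus on the scale the abstract mentions.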




Author information

Authors and Affiliations

International Institute of Information Technology, Bangalore, Bangalore, 560100, India
Jyotiska Nath Khasnabish, Mitali Sodhi, Jayati Deshmukh & G. Srinivasaraghavan


Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, IBaI, Leipzig, Germany
Petra Perner

Cite this paper

Khasnabish, J.N., Sodhi, M., Deshmukh, J., Srinivasaraghavan, G. (2014). Detecting Programming Language from Source Code Using Bayesian Learning Techniques. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2014. Lecture Notes in Computer Science, vol 8556. Springer, Cham. https://doi.org/10.1007/978-3-319-08979-9_39


DOI: https://doi.org/10.1007/978-3-319-08979-9_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08978-2
Online ISBN: 978-3-319-08979-9
eBook Packages: Computer Science, Computer Science (R0)
