Personality Recognition from Source Code Based on Lexical, Syntactic and Semantic Features
Automatic personality recognition from source code is a scarcely explored problem. We propose personality recognition with handcrafted features based on lexical, syntactic and semantic properties of source code. Of the 35 proposed features, 22 are completely novel. We also show that n-gram features are simple but surprisingly good predictors of personality, and we present results arising from the joint usage of both handcrafted and baseline features. Additionally, we compare our results with scores obtained within the Personality Recognition in SOurce COde track of the Forum for Information Retrieval Evaluation 2016 and establish state-of-the-art results for the conscientiousness and neuroticism traits.
Keywords: Automatic personality recognition · Personality traits · Source code · Feature engineering
Personality influences many aspects of human behaviour, e.g. the decisions we make, our propensity for communicating with other people, the way we write, or the music we listen to . In the context of computer science, personality may influence the organization of the source code a person creates, or the choice of software project a person takes part in.
While automatic personality recognition from text has attracted remarkable attention , personality recognition from source code is still a scarcely explored problem.
Automatic personality recognition can be useful for customizing the learning process or for assessing cultural fit in a company. Each company has a different culture  – there are workplaces where programmers are expected to contact clients frequently, and others where they talk only to their supervisors. Firms also differ in workplace organization – an open-plan office, a small room, or remote work. Cultural fit is the degree to which an employee is satisfied with these and other aspects of the workplace. A person who fits into their company is more involved in their work, more satisfied with what they accomplish, and more productive, which benefits both the employee and the employer. Cultural fit depends on one's personality, so an automatic personality recognition system that detects whether a person fits into a company's environment, based on a programming assessment completed during recruitment, could save both the employee's and the employer's time and stress. Both in academia and in industry, the psychological and sociological predispositions of programmers could be examined to better recognize their soft skills and guide job choice.
This paper proposes personality recognition from source code with a random forest on the basis of 35 handcrafted features derived from lexical, syntactic and semantic properties of source code. Of these 35 features, 22 are novel and have not been used before in personality recognition from source code. We compare the handcrafted features with n-gram features serving as a baseline and present results arising from the joint usage of both feature types. Finally, we compare our results with scores obtained within the Personality Recognition in SOurce COde (PR-SOCO) track of the Forum for Information Retrieval Evaluation (PAN@FIRE 2016).
Conscientiousness (C) - consistency, persistence, good organizational skills.
Agreeableness (A) - attitude towards others (whether a person is suspicious or trustful, modest, willing to compromise).
Neuroticism (N) - impulsiveness, susceptibility to stress and anxiety.
Openness to experience (O) - intellectual curiosity, willingness to explore, rich imagination, searching for original solutions rather than following in someone's footsteps.
Extroversion (E) - assertiveness, building relationships with ease.
Each personality trait can be divided further into six facets, but facets are beyond the scope of our work.
2 Related Work
Deep learning personality predictors require no feature engineering and no preprocessing, scanning, or parsing of source code. An example of such an approach is an LSTM neural network which reads source code byte by byte . The low amount of learning data is, however, especially problematic for this approach, as not only the correct predictor (a classifier or a regressor) but also the relevant features must be learned from the data. Handcrafted features examined in prior work include:
number of files submitted by each programmer,
mean number of lines in programs,
mean length of variables,
mean number of classes,
mean length of classes (computed on the basis of the number of lines of code),
mean number of attributes, methods in a class,
number of programs implementing the same class,
number of errors,
Halstead complexity measures (e.g. difficulty and time needed for implementation and understanding),
duplicated fragments of source code,
frequency of occurrence of comments and their length,
occurrence of comments written exclusively in capital letters,
number of comments in classes,
number of words inside comments,
usage of punctuation marks inside comments,
number of lines with missing white characters inside arithmetic expressions,
number of import declarations, which import the whole content of module (usage of * instead of concrete classes),
used white characters,
ways of indentation and formatting used by the programmer,
number of empty lines between methods, blocks of code and number of white characters between parentheses,
occurrence of digits, capital and small letters and symbol _ in names, as well as length of names.
In , the frequency distribution of different types of nodes in an abstract syntax tree was examined, yielding, however, low results, only slightly above baseline approaches.
Another type of feature is character n-grams – versatile, easy-to-implement features which are language independent and have a wide range of applications in classification tasks, including authorship attribution [14, 35], author profiling , authorship verification [9, 20] and plagiarism detection . They may also provide convenient features for a baseline solution to PR-SOCO. In the context of personality recognition from source code, character n-grams were used in [17, 32].
The choice of a predictor (a regressor or a classifier) is a more standard procedure and mainly includes linear regression [16, 25, 32], support vector regression [7, 11], decision trees , nearest neighbours [25, 36] and neural networks [12, 36].
Research concerning personality recognition from source code is scarce, and the extraction of novel features will likely extend the possibilities of distinguishing traits.
3 Proposed Features
Features of source code used as predictors of personality traits. Novel features are marked with \(*\)
- length of lines – average, 80th percentile
- length of variables – average, 80th percentile
- length of methods (in lines) – average, maximum, 80th percentile
- \(*\) length of comments (in characters, not taking into account code which was commented out) – average, maximum, 80th percentile
- length of comments (in lines) in ratio to length of code (in lines)
- \(*\) length of code (in lines) which was commented out
- \(*\) number of lines with more than one instruction
- \(*\) number of consecutive lines with aligned characters, e.g. ( or = (4 features)
- number of white characters
- \(*\) number of occurrences of \(-1\) (special value)
- \(*\) ratio of the number of words in English to the number of words in languages other than English in names of variables, methods and classes
- \(*\) preserving naming convention (e.g. PascalCase or snake_case)
- \(*\) consistency in using curly brackets in the next line or at the end of a line containing a method or class declaration (2 features)
- \(*\) consistent application of curly brackets around one-line branches of code
- \(*\) level of indentation – whether it increases with the beginning of a new block of code (and only then)
- \(*\) degree of exploitation of language syntax (e.g. using various syntax of the for loop, using lambdas)
- \(*\) depth of references to fields and methods of fields or results of methods of objects (e.g. Cubiculos.get(i).Casilleros.get(j)) – maximum, 80th percentile
- number of methods in a class – average, maximum, 80th percentile
- \(*\) number of used switch instructions
- number of separated logic blocks of code within methods
- \(*\) number of code duplications
- \(*\) maximum nesting depth of instructions
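As an illustration, two of the simpler lexical features above (length of lines and length of variables) can be computed roughly as follows. This is only a sketch: the nearest-rank percentile and the declaration regex are our own simplifying assumptions, not the paper's implementation.

```python
import math
import re

def percentile_80(values):
    """80th percentile via the nearest-rank method (a simplifying assumption)."""
    ordered = sorted(values)
    rank = math.ceil(0.8 * len(ordered)) - 1
    return ordered[max(rank, 0)]

def line_length_features(source):
    """Average and 80th-percentile length of non-empty lines."""
    lengths = [len(line) for line in source.splitlines() if line.strip()]
    return {"avg_line_len": sum(lengths) / len(lengths),
            "p80_line_len": percentile_80(lengths)}

def variable_length_features(source):
    """Average and 80th-percentile length of variable names.
    Crude heuristic: match declarations of the form 'Type name'."""
    names = re.findall(r'\b(?:int|long|double|float|boolean|char|String)\s+(\w+)',
                       source)
    lengths = [len(n) for n in names]
    return {"avg_var_len": sum(lengths) / len(lengths) if lengths else 0.0,
            "p80_var_len": percentile_80(lengths) if lengths else 0}
```

In a real extractor, variable names would come from the javalang syntax tree rather than a regex; the regex merely keeps the sketch self-contained.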
The proposed features are grounded in an extension of the lexical hypothesis to programming languages. The lexical hypothesis  says that the most important differences in personality are reflected in the natural language and vocabulary a person uses. According to , the more important the difference, the more likely it is to be reflected in a single word. We suppose that in the domain of programming code a conscientious person will likely apply consistent indentation throughout the code; a person high in openness might use richer vocabulary, while an extrovert might use longer names for variables, methods and classes. Additionally, a correlation between personality traits and programming style has been found in , according to which persons high in openness prefer a breadth-first programming style, while persons low in openness prefer a depth-first programming style.
We describe three features in detail, as they are the most complicated to implement: the number of code duplications, the length of comments in characters, and the level of indentation.
Detection of code duplication is quite a complex task, which could even be cast as another machine learning problem, provided suitable learning data were available or generated . We adopted a simpler solution consuming fewer computing resources – syntax tree rewriting . Two pieces of code, one being a duplicate of the other, exhibit the same structure but differ in the names of constants, variables or methods.
The syntax tree generated with the javalang parser is transformed into a topologically equivalent syntax tree whose nodes are simplified to reflect only the structure of the code and discard irrelevant data. For instance, the name of a declared method is discarded, but the structure of its body, the types of its formal parameters and its return type are retained. For blocks of instructions, information about entrance conditions is discarded. Detection of code duplication in one block is performed on the basis of such a simplified tree: a list of all subtrees in the block is created, and subtrees which serialize to the same expression are treated as duplications.
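The subtree-serialization step can be illustrated on a toy tree representation. Here nodes carry only a structural label, standing in for the simplified javalang tree described above (names and literals already discarded); the `Node` class, the canonical string form and the `min_size` threshold are our own assumptions, not the paper's code.

```python
from collections import Counter

class Node:
    """A node of the simplified syntax tree: a structural label, no identifiers."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def serialize(node):
    # Canonical string form of a subtree; identical structures serialize identically.
    return "(%s %s)" % (node.label, " ".join(serialize(c) for c in node.children))

def count_duplications(root, min_size=2):
    """Count subtrees (with at least min_size nodes) occurring more than once."""
    counts = Counter()

    def visit(node):
        size = 1 + sum(visit(c) for c in node.children)
        if size >= min_size:
            counts[serialize(node)] += 1
        return size

    visit(root)
    return sum(c - 1 for c in counts.values() if c > 1)
```

Two loops that differ only in variable names reduce to the same simplified subtree and are therefore reported as a duplication.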
Computing the length of a comment, otherwise simple, requires detecting whether the comment contains fragments of source code. Parsing a comment with a full Java parser would end in failure, as programmers usually comment out a few lines of code or single methods rather than entire programs. To solve this problem, besides the main parser of the whole program, parsers of smaller grammatical units of a program are used.
As white characters are discarded during lexical analysis, and even less information is passed to the parser, the level of indentation feature was implemented as a state machine (separate from the parser) which reads tokens one by one and tracks the level of indentation. One difficulty in implementing this feature lies in distinguishing between correct and wrong indentation after a sequence of empty lines. Although based only on the finite automata formalism, the state machine has to roughly understand the syntax of Java – it tracks the number of opening parentheses and curly brackets, closes an open block at the correct indentation level, and reopens a block of code at a wrong indentation level. The state machine also knows which instructions require indentation. Additional difficulty arises from one-line bodies of if and for instructions, where curly brackets are not required. This seemingly simple task becomes a complex programming problem due to the great number of cases which must be considered.
Due to the above difficulties, the implementation of this feature skips checking the level of indentation in conditional instructions and loops whose bodies contain only one line of code, and in switch instructions. For the switch instruction it is even impossible to determine which notation is correct, as the flat form was used by programmers mainly in the past, while the indented form of the switch instruction predominates today.
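The core idea of the indentation check can be sketched in a drastically simplified, line-based form: track brace depth and flag lines whose leading indentation disagrees with it. This is only a stand-in for the token-level state machine described above; the fixed indent width and the line-based view are our assumptions, and none of the hard cases (empty-line sequences, braceless one-line bodies, switch) are handled.

```python
def indentation_violations(source, indent_width=4):
    """Count lines whose leading indentation disagrees with the current
    curly-brace depth. A much-simplified, line-based stand-in for the
    token-level state machine."""
    depth = 0
    violations = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # empty lines carry no indentation information
        # A line starting with a closing brace is expected one level shallower.
        expected = (depth - 1 if stripped.startswith("}") else depth) * indent_width
        actual = len(line) - len(line.lstrip(" "))
        if actual != expected:
            violations += 1
        depth += stripped.count("{") - stripped.count("}")
    return violations
```

Even this toy version shows why the real feature is hard: braces inside string literals or comments, tabs mixed with spaces, and braceless bodies all break the line-based view, which is why the paper's implementation works on tokens instead.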
3.1 Choice of a Parser
Parsing time (wall time) with the ANTLR, JavaParser and javalang parsers, in seconds. Results have been rounded up to the nearest integer
4 Dataset and Evaluation Methodology

As learning and evaluation data we used the corpus of source code released for the PR-SOCO track , which accompanied PAN@FIRE 2016. The track was aimed at automatic personality recognition of programmers on the basis of Java source code they authored. In the PR-SOCO corpus, personality was modelled with the Big Five, and each trait was given a value from [20, 80]. The corpus contains 2492 source code programs written in Java by 70 computer science students, along with the values of their personality traits. The values of the personality traits were determined on the basis of a 25-item BFI questionnaire, called the Big Five locator, completed by the students. The students made their code submissions through a web-based online judge for grading. The judge system has no tools for style correction; however, it is not known whether the students used an IDE before submission. The training and test sets contain the source code of 49 and 21 programmers, respectively. During the PR-SOCO contest, the personalities of the 21 persons from the test set were concealed from the participating teams. Each team was allowed to submit 6 trial solutions (shots). A single solution predicted five traits for each of the 21 persons from the test set.
We followed the PR-SOCO track and used two measures to assess our solution and compare it with existing personality predictors: Root Mean Square Error (RMSE) and the Pearson Product-Moment Correlation coefficient (PCC).
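Both measures are standard; for concreteness, a plain-Python sketch of how they are computed over a trait's true and predicted values:

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error between true and predicted trait values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def pcc(x, y):
    """Pearson Product-Moment Correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

RMSE rewards predictions close to the true values, while PCC measures whether the predictor ranks programmers correctly even when its scale is off; reporting both, as PR-SOCO does, guards against trivial mean-value predictors with low RMSE but zero correlation.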
5 Experiments and Results
In the experiments we took the profile-based approach, i.e., all source code of a programmer was treated as one learning instance. Since personality traits take continuous values, personality recognition was cast as a regression problem with random forest regression  from the scikit-learn package  as the prediction module. Random forest regressors were trained on 85% of the original training set; the remaining 15% was reserved for the model selection procedure. We examined random forest regression with the number of decision trees varying from 64 to 128 and their depth varying from 2 to 6. Optimal values of these hyperparameters were selected separately for each personality trait with grid search . Mean Square Error (MSE) was used as the function measuring the quality of a split.
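This model-selection loop can be sketched with scikit-learn. The concrete grid values and the fixed random seed below are our own assumptions within the stated ranges, not the paper's exact configuration; the default split criterion of `RandomForestRegressor` is MSE, which matches the paper.

```python
from itertools import product

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def select_trait_model(X, y, seed=0):
    """Pick forest size and depth for one personality trait on a held-out
    15% validation split, as described above. Grid values are assumptions."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.15, random_state=seed)  # 85/15 split
    best, best_mse = None, float("inf")
    for n_trees, depth in product([64, 96, 128], [2, 3, 4, 5, 6]):
        model = RandomForestRegressor(n_estimators=n_trees, max_depth=depth,
                                      random_state=seed)
        model.fit(X_tr, y_tr)
        mse = mean_squared_error(y_val, model.predict(X_val))
        if mse < best_mse:
            best, best_mse = model, mse
    return best, best_mse
```

Running this once per trait yields five independently tuned regressors, mirroring the per-trait hyperparameter selection described in the text.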
Besides regression with 35 handcrafted features, we used \(N=1500\) n-gram features as our baseline: the \(N_1=1000\) most frequent character trigrams (n-grams with \(n=3\)) and the \(N_2=500\) most frequent token trigrams. By tokens we mean the lexical units returned by the Java scanner. Finally, we tried personality recognition with all 1535 features, both handcrafted and n-gram.
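The character-trigram part of the baseline can be sketched in a few lines; selecting the vocabulary by total frequency over the whole document collection is our assumption about how "most frequent" was determined.

```python
from collections import Counter

def char_trigram_features(documents, top_n=1000):
    """Build a frequency-based character-trigram vocabulary over all documents
    and return (vocabulary, count vectors), one vector per document."""
    totals = Counter()
    per_doc = []
    for doc in documents:
        grams = Counter(doc[i:i + 3] for i in range(len(doc) - 2))
        per_doc.append(grams)
        totals.update(grams)
    vocab = [g for g, _ in totals.most_common(top_n)]
    return vocab, [[grams[g] for g in vocab] for grams in per_doc]
```

Token trigrams would be built the same way, only over the token stream of the Java scanner instead of raw characters.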
Table 3 presents the results of personality recognition obtained with three sets of features: the proposed handcrafted features, n-grams, and handcrafted features together with n-grams. For comparison, the best results, medians and means of the FIRE competitors are given in Table 4 (the summary of the FIRE competition  also shows the first, second and third quartiles, all extreme values, and detailed results of all participating teams). Additionally, we computed confidence intervals with the pairs bootstrap method .
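The pairs bootstrap for an RMSE confidence interval resamples (truth, prediction) pairs with replacement and recomputes the statistic on each resample; a minimal sketch follows, where the number of resamples (2000) is an arbitrary assumption.

```python
import math
import random

def bootstrap_rmse_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """(1 - alpha) confidence interval for RMSE via the pairs bootstrap."""
    rng = random.Random(seed)
    pairs = list(zip(y_true, y_pred))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        stats.append(math.sqrt(sum((t - p) ** 2 for t, p in sample)
                               / len(sample)))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With only 21 test programmers, such intervals are naturally wide, which is consistent with the wide intervals reported below.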
For the conscientiousness trait, the model with handcrafted features obtained an RMSE of 8.17 (95% confidence interval [6.00, 9.98]), which is lower than the minimum error achieved in the competition. The obtained PCC value is 0.33 (95% confidence interval \([-0.02, 0.65]\)) and equals the corresponding maximum PCC value of the PR-SOCO competitors.
Additionally, for openness we obtained an RMSE lower than the median and close to the best PR-SOCO score. For all personality traits, the absolute values of the obtained PCCs were higher than the PR-SOCO median values.
Performance measures (RMSE and PCC) of automatic personality recognition with different sets of features
Summary of personality recognition results obtained during PR-SOCO track of FIRE 2016 competition
using conventional indentation
average length of method
average length of comments
number of methods in a class (80th percentile)
length of method (80th percentile)
average length of names
ratio of words in English to words in other languages
length of names (80th percentile)
number of white characters
maximum length of comments.
The effect of the joint usage of handcrafted features and n-grams is a reduced error (in comparison to using only one type of features) for extroversion and agreeableness, although it does not establish new state-of-the-art results.
Finally, we examined the statistical significance of the obtained trait predictions (RMSEs). Statistical tests, conducted on the STAC platform , were computed for 14 algorithms (11 solutions from the PR-SOCO task and our three solutions: handcrafted features, n-grams, and all features) and five datasets (predictions for each of the five traits were counted as a separate dataset). For solutions from the PR-SOCO task we always chose the best shot. As the omnibus test we used the Friedman F-test  for testing the hypothesis \(H_0\) that the mean results of two or more algorithms are the same, followed by the Nemenyi test  as the post-hoc test for pairwise comparison of predictors. At the significance level \(\alpha = 0.05\), hypothesis \(H_0\) should be rejected, but the pairwise comparison revealed no pair of algorithms with a statistically significant difference in results.
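The omnibus step can be illustrated with a pure-Python computation of the Friedman chi-square statistic over a datasets-by-algorithms table of RMSEs. STAC performs this test (and the Nemenyi post-hoc) as a service; the sketch below is only illustrative, and the example input is synthetic.

```python
def friedman_statistic(results):
    """Friedman chi-square statistic.
    `results` is a list of datasets, each a list of k algorithm scores
    (lower = better); ranks are averaged over ties."""
    n, k = len(results), len(results[0])
    rank_sums = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1  # extend the tie group
            avg = (i + j) / 2 + 1  # average rank of the tie group (1-based)
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # chi^2 = 12 / (n k (k+1)) * sum R_j^2  -  3 n (k+1)
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))
```

When the algorithm ordering is identical on every dataset, the statistic reaches its maximum \(n(k-1)\); the rejection decision then follows from the chi-square (or F) distribution, as computed by STAC.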
In this work we proposed new features for automatic personality recognition from source code. The handcrafted features turned out to be most useful for predicting openness and conscientiousness, traits (together with extroversion) connected with programming aptitude . These features, despite their low number, achieved state-of-the-art results for conscientiousness. The lowest error in conscientiousness prediction is in line with the fact that conscientiousness (and extroversion) are easily inferred even from slices of behaviour [6, 18].
N-gram features are surprisingly good predictors of personality; at the same time, they are easy to implement and language independent.
While programmers' personalities may be connected with the code they write, we could not fully capture the relation between them. The results we achieved in neuroticism and conscientiousness recognition are the state of the art in personality recognition from source code, yet still insufficient to state that such a correlation exists.
The large confidence intervals of the RMSEs and PCCs, together with the conducted statistical tests, show that larger datasets are needed to increase the statistical power of our results, as well as of the other methods proposed so far. New datasets should cover more programming languages and more programmers, including professional programmers.
The research presented in this paper was supported by the funds assigned to AGH University of Science and Technology by the Polish Ministry of Science and Higher Education. Paolo Rosso, Francisco Rangel and Felipe Restrepo-Calle are acknowledged for making the PR-SOCO corpus available for our research and for information about its construction.
- 1. Allport, G.W., Odbert, H.: Trait names: a psycho-lexical study. Psychol. Monogr. 47(1), i–171 (1936). https://doi.org/10.1037/h0093360
- 2. Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M.: N-gram: new groningen author-profiling model. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum (2017)
- 3. Bilan, I., Saller, E., Roth, B., Krytchak, M.: CAPS-PRC: a system for personality recognition in programming code. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 21–24 (2016)
- 7. Castellanos, H.A.: Personality recognition applying machine learning techniques on source code metrics. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 25–29 (2016)
- 8. Claesen, M., Moor, B.D.: Hyperparameter search in machine learning. CoRR abs/1502.02127 (2015)
- 9. van Dam, M.: A basic character N-gram approach to authorship verification notebook for PAN at CLEF 2013. In: Forner, P., Navigli, R., Tufis, D., Ferro, N. (eds.) Working Notes for CLEF 2013 Conference (2013)
- 11. Delair, R., Mahajan, R.: A supervised approach for personality recognition in source code using code analysis tool at FIRE 2016. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 30–32 (2016)
- 12. Doval, Y., Gómez-Rodríguez, C., Vilares, J.: Shallow recurrent neural network for personality recognition in source code. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 33–37 (2016)
- 13. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. No. 57 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton (1993)
- 14. Escalante, H.J., Solorio, T., Montes-y-Gómez, M.: Local histograms of character N-grams for authorship attribution. In: Lin, D., Matsumoto, Y., Mihalcea, R. (eds.) The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 288–298 (2011)
- 16. Ghosh, K., Parui, S.K.: Indian Statistical Institute Kolkata at PR-SOCO 2016: a simple linear regression based approach. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 48–51 (2016)
- 17. Giménez, M., Paredes, R.: PRHLT at PR-SOCO: a regression model for predicting personality traits from source code. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 38–42 (2016)
- 18. Gnambs, T.: The elusive general factor of personality: the acquaintance effect. Eur. J. Pers. 27(5), 507–520 (2013)
- 20. Houvardas, J., Stamatatos, E.: N-gram feature selection for authorship identification. In: Euzenat, J., Domingue, J. (eds.) 12th International Conference on Artificial Intelligence: Methodology, Systems, and Applications, AIMSA 2006. LNCS (LNAI), vol. 4183, pp. 77–86. Springer, Heidelberg (2006). https://doi.org/10.1007/11861461_10
- 21. John, O.P., Srivastava, S.: The big five trait taxonomy: history, measurement, and theoretical perspectives, pp. 102–138. Guilford Press (1999)
- 23. Kuta, M., Kitowski, J.: Optimisation of character n-gram profiles method for intrinsic plagiarism detection. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2014. LNCS (LNAI), vol. 8468, pp. 500–511. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07176-3_44
- 24. Li, L., Feng, H., Zhuang, W., Meng, N., Ryder, B.G.: CCLearner: a deep learning-based clone detection approach. In: 2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017, pp. 249–260 (2017)
- 25. Liebeck, M., Modaresi, P., Askinadze, A., Conrad, S.: Pisco: a computational approach to predict personality types from Java source code. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 43–47 (2016)
- 29. Nemenyi, P.: Distribution-free multiple comparisons. Ph.D. thesis, Princeton University (1963)
- 30. Parr, T.: Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages. Pragmatic Bookshelf, Raleigh (2009)
- 32. Phani, S., Lahiri, S., Biswas, A.: Personality recognition in source code working note: team BESUMich. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 16–20 (2016)
- 33. Rangel, F., González, F., Restrepo, F., Montes, M., Rosso, P.: PAN@FIRE: overview of the PR-SOCO track on personality recognition in SOurce COde. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds.) FIRE 2016. LNCS, vol. 10478, pp. 1–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73606-8_1
- 34. Rodríguez-Fdez, I., Canosa, A., Mucientes, M., Bugarín, A.: STAC: a web platform for the comparison of algorithms using statistical tests. In: 2015 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE, pp. 1–8 (2015). https://doi.org/10.1109/FUZZ-IEEE.2015.7337889
- 35. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character N-grams are created equal: a study in authorship attribution. In: Mihalcea, R., Chai, J.Y., Sarkar, A. (eds.) NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–102 (2015)
- 36. Vázquez, E.V., et al.: UAEMex system for identifying personality traits from source code. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J., Ghosh, K. (eds.) Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, pp. 52–55 (2016)