Empirical, Human-Centered Evaluation of Programming and Programming Language Constructs: Controlled Experiments

  • Conference paper
Grand Timely Topics in Software Engineering (GTTSE 2015)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10223)

Abstract

While the application of empirical methods has a long tradition in domains such as performance evaluation, the application of empirical methods with human subjects to evaluate the usability of programming techniques, programming language constructs, or whole programming languages is relatively new (or, at least, running such studies is becoming more common). Despite the urgent need for such usability studies, few researchers are well-versed in these techniques, certainly when compared to the large number of researchers inventing new programming techniques or formal approaches. The main goal of this text is to introduce empirical methods for evaluating programming language constructs, with a strong focus on quantitative methods. The paper concludes by explaining how and why a series of controlled experiments was gradually designed to study the usability of type systems.


Notes

  1.

    According to Hanenberg [14], the phrase software science is used to describe research on software artifacts in general. While the term software engineering is used much more often, the programming language community in particular, as well as people doing performance measurements, feel that this term does not adequately describe their domains. We think that the term software science, although originally used by Halstead [12] for something different, is more appropriate for describing the whole domain of software-related research.

  2.

    Sheil even called the study of programming, as practiced by computer science, ‘an unholy mixture of mathematics, literary criticism, and folklore’ [37, p. 102].

  3.

    It should be noted that just recently a study appeared which was not able to reveal a measurable benefit of lambda expressions in C++. Instead, the study showed, at least for non-professional programmers, a measurable disadvantage (see [45]).

  4.

    Again, to get an impression of how far from mainstream this is: according to Kaijanaho, the number of randomized controlled trials on the human-factors comparative evaluation of language features up to 2012 was 22 (see [22, p. 143]).

  5.

    Additionally, the collection of Victor Basili’s papers edited by Boehm et al. [1] gives a larger set of examples of controlled trials that have been performed.

  6.

    www.ibm.com/software/analytics/spss/.

  7.

    https://www.r-project.org/.

  8.

    The corresponding non-parametric tests [5] are valid here, too, i.e. it is possible to analyse the crossover trial using a U-test and a Wilcoxon test (a minimal sketch of such an analysis is given after these notes).

  9.

    The rather arbitrary choice of .05 is probably so common because it was originally proposed by Fisher [8], although some other disciplines use a different alpha level.

  10.

    The points are word-by-word citations from Souza and Figueiredo [39].

  11.

    Two other questions are formulated, which are skipped here for reasons of simplification.

  12.

    It is understandable that the authors do not apply inference-statistical methods: a huge number of different words is being tested, and it sounds plausible that traditional approaches from inference statistics would not have revealed any differences at all because of the high number of variables (see the second sketch after these notes).

  13.

    The authors distinguish in their paper between a third and a fourth study, which we present here as one because the hypotheses and applied analysis methods were identical.

  14.

    At least, this statement can be found in the work by Kaijanaho [22].

  15.

    The result of the experiment was that the additional type annotations of generic Java helped when using an undocumented API – which (again) confirmed the previous findings – but it also showed a situation where generic types reduced the extensibility of an API (see [20]).

  16.

    Which was the result of the replication study by Kleinschmager et al. [16, 24].
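
To illustrate note 8, the following is a minimal sketch of a non-parametric analysis of a two-period crossover trial. It uses Python with SciPy and entirely hypothetical task-completion times; both the tooling and the data are assumptions for illustration only (the chapter itself only names the tests and points to tools such as SPSS and R).

    # Minimal sketch (hypothetical data): non-parametric analysis of a
    # two-period crossover trial. Group A uses technique X first, then Y;
    # group B uses Y first, then X. Values are task-completion times in minutes.
    from scipy.stats import mannwhitneyu, wilcoxon

    group_a_period1_x = [12.4, 15.1, 9.8, 14.0, 11.2]   # group A, technique X
    group_a_period2_y = [10.9, 13.5, 9.1, 12.2, 10.0]   # group A, technique Y
    group_b_period1_y = [11.5, 14.2, 10.3, 13.8, 12.9]  # group B, technique Y
    group_b_period2_x = [13.0, 15.9, 11.8, 15.1, 13.4]  # group B, technique X

    # Between-subjects comparison of the first period only (unpaired):
    # the Mann-Whitney U-test, the non-parametric analogue of the t-test.
    u_stat, p_between = mannwhitneyu(group_a_period1_x, group_b_period1_y,
                                     alternative="two-sided")

    # Within-subjects comparison across both periods (paired):
    # the Wilcoxon signed-rank test on per-participant differences X - Y.
    x_times = group_a_period1_x + group_b_period2_x
    y_times = group_a_period2_y + group_b_period1_y
    w_stat, p_within = wilcoxon(x_times, y_times)

    print(f"U-test (first period, between groups): p = {p_between:.3f}")
    print(f"Wilcoxon test (within subjects):       p = {p_within:.3f}")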
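To illustrate note 12, the following minimal sketch shows why testing a large number of variables at once makes it hard for classical inference statistics to reveal differences. It assumes a Bonferroni correction and hypothetical numbers, neither of which is taken from the study discussed in the note.

    # Minimal sketch (hypothetical numbers): with a family-wise error rate of
    # 0.05 and many tested variables (here: words), the Bonferroni-corrected
    # per-test alpha becomes so small that individual tests rarely reach it.
    alpha = 0.05          # family-wise error rate
    num_words = 500       # hypothetical number of tested words
    per_test_alpha = alpha / num_words

    print(f"Per-test alpha with {num_words} words: {per_test_alpha:.5f}")
    # Any single word's difference would need p < 0.0001 to count as significant.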

References

  1. Boehm, B., Rombach, H.D., Zelkowitz, M.V.: Foundations of Empirical Software Engineering: The Legacy of Victor R. Basili. Springer, Heidelberg (2005)

  2. Bracha, G.: Pluggable type systems. In: OOPSLA’04 Workshop on Revival of Dynamic Languages (2004)

  3. Bruce, K.B.: Foundations of Object-Oriented Languages: Types and Semantics. MIT Press, Cambridge (2002)

  4. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences. L. Erlbaum Associates, Hillsdale (1988)

  5. Conover, W.J.: Practical Nonparametric Statistics, 3rd edn. Wiley, New York (1998)

  6. Endrikat, S., Hanenberg, S., Robbes, R., Stefik, A.: How do API documentation and static typing affect API usability? In: 36th International Conference on Software Engineering, ICSE 2014, Hyderabad, India - 31 May–07 June 2014, pp. 632–642 (2014)

  7. Fischer, L., Hanenberg, S.: An empirical investigation of the effects of type systems and code completion on API usability using TypeScript and JavaScript in MS Visual Studio. In: Proceedings of the Dynamic Languages Symposium, DLS 2015. Accepted for publication (2015)

  8. Fisher, R.A.: Statistical Methods for Research Workers. Cosmo Study Guides. Cosmo Publications, New Delhi (1925)

  9. Gannon, J.D.: An experimental evaluation of data type conventions. Commun. ACM 20(8), 584–595 (1977)

  10. Georges, A., Buytaert, D., Eeckhout, L.: Statistically rigorous Java performance evaluation. SIGPLAN Not. 42(10), 57–76 (2007)

  11. Glaser, B.G., Strauss, A.L.: The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine Publishing Company, Chicago (1967). Observations

  12. Halstead, M.H.: Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York (1977)

  13. Hanenberg, S.: An experiment about static and dynamic type systems: Doubts about the positive impact of static type systems on development time. In: Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA, pp. 22–35. ACM, New York (2010)

  14. Hanenberg, S.: Faith, hope, and love: An essay on software science’s neglect of human factors. In: Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages And Applications, OOPSLA 2010, pp. 933–946. Reno/Tahoe, Nevada, October 2010

  15. Hanenberg, S.: Why do we know so little about programming languages, and what would have happened if we had known more? In: Proceedings of the 10th ACM Symposium on Dynamic Languages, DLS 2014, p. 1. ACM, New York (2014)

  16. Hanenberg, S., Kleinschmager, S., Robbes, R., Tanter, É., Stefik, A.: An empirical study on the impact of static typing on software maintainability. Empirical Softw. Eng. 19(5), 1335–1382 (2014)

  17. Hanenberg, S., Stefik, A.: On the need to define community agreements for controlled experiments with human subjects - a discussion paper. In: Submitted to PLATEAU 2015 (2015)

  18. Harlow, L.L., Mulaik, S.A., Steiger, J.H.: What If There Were No Significance Tests? Multivariate Applications Book Series. Lawrence Erlbaum Associates Publishers, Hillsdale (1997)

  19. Hoda, R., Noble, J., Marshall, S.: Developing a grounded theory to explain the practices of self-organizing agile teams. Empirical Softw. Eng. 17(6), 609–639 (2012)

  20. Hoppe, M., Hanenberg, S.: Do developers benefit from generic types? An empirical comparison of generic and raw types in Java. In: Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2013, pp. 457–474. ACM, New York (2013)

  21. Juristo, N., Moreno, A.M.: Basics of Software Engineering Experimentation. Springer, Heidelberg (2001)

  22. Kaijanaho, A.-J.: Evidence-based programming language design: A philosophical and methodological exploration. Number 222 in Jyväskylä Studies in Computing. University of Jyväskylä, Finland (2015)

  23. Kirk, R.E.: Experimental Design: Procedures for the Behavioral Sciences. SAGE Publications, Thousand Oaks (2012)

  24. Kleinschmager, S., Hanenberg, S., Robbes, R., Tanter, É., Stefik, A.: Do static type systems improve the maintainability of software systems? An empirical study. In: IEEE 20th International Conference on Program Comprehension, ICPC 2012, Passau, Germany, pp. 153–162, 11–13 June 2012

  25. Ko, A.J., LaToza, T.D., Burnett, M.M.: A practical guide to controlled experiments of software engineering tools with human participants. Empirical Softw. Eng. 20(1), 110–141 (2015)

  26. Laprie, J.-C.: Dependability of computer systems: concepts, limits, improvements. In: Sixth International Symposium on Software Reliability Engineering, ISSRE 1995, Toulouse, France, 24–27 October 1995, pp. 2–11 (1995)

  27. Mayer, C., Hanenberg, S., Robbes, R., Tanter, É., Stefik, A.: An empirical study of the influence of static type systems on the usability of undocumented software. In: Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2012, part of SPLASH 2012, Tucson, AZ, USA, 21–25 October 2012, pp. 683–702. ACM (2012)

  28. McConnell, S.: What does 10x mean? Measuring variations in programmer productivity. In: Oram, A., Wilson, G. (eds.) Making Software: What Really Works, and Why We Believe It, O’Reilly Series, pp. 567–575. O’Reilly Media (2010)

  29. Okon, S., Hanenberg, S.: Can we enforce a benefit for dynamically typed languages in comparison to statically typed ones? A controlled experiment. In: 2016 IEEE 24th International Conference on Program Comprehension (ICPC), pp. 1–10, May 2016

  30. Parnin, C., Bird, C., Murphy-Hill, E.R.: Java generics adoption: How new features are introduced, championed, or ignored. In: Proceedings of the 8th International Working Conference on Mining Software Repositories, MSR 2011 (Co-located with ICSE), Waikiki, Honolulu, HI, USA, 21–28 May 2011, pp. 3–12. IEEE (2011)

  31. Petersen, P., Hanenberg, S., Robbes, R.: An empirical comparison of static and dynamic type systems on API usage in the presence of an IDE: Java vs. Groovy with Eclipse. In: 22nd International Conference on Program Comprehension, ICPC 2014, Hyderabad, India, 2–3 June 2014, pp. 212–222 (2014)

  32. Pierce, B.C.: Types and Programming Languages. MIT Press, Cambridge (2002)

  33. Popper, K.R.: The Logic of Scientific Discovery. Routledge (2002). 1st English edition: 1959; original first edition (German): Logik der Forschung, published 1935 by Julius Springer, Vienna, Austria

  34. Prechelt, L., Tichy, W.F.: A controlled experiment to assess the benefits of procedure argument type checking. IEEE Trans. Softw. Eng. 24(4), 302–312 (1998)

  35. Seaman, C.B.: Qualitative methods in empirical studies of software engineering. IEEE Trans. Software Eng. 25(4), 557–572 (1999)

  36. Senn, S.S.: Cross-over Trials in Clinical Research. Statistics in Practice. Wiley, Chichester (1993)

  37. Sheil, B.A.: The psychological study of programming. ACM Comput. Surv. 13(1), 101–120 (1981)

  38. Shneiderman, B.: Software Psychology: Human Factors in Computer and Information Systems. Winthrop Publishers, Cambridge (1980)

  39. Souza, C., Figueiredo, E.: How do programmers use optional typing?: An empirical study. In: Proceedings of the 13th International Conference on Modularity, MODULARITY 2014, pp. 109–120. ACM, New York (2014)

  40. Spiza, S., Hanenberg, S.: Type names without static type checking already improve the usability of APIs (as long as the type names are correct): An empirical study. In: Proceedings of the 13th International Conference on Modularity, MODULARITY 2014, pp. 99–108. ACM, New York (2014)

  41. Stefik, A., Hanenberg, S.: The programming language wars: Questions and responsibilities for the programming language community. In: Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, Onward! 2014, pp. 283–299. ACM, New York (2014)

  42. Stefik, A., Siebert, S.: An empirical investigation into programming language syntax. Trans. Comput. Educ. 13(4), 19:1–19:40 (2013)

  43. Stuchlik, A., Hanenberg, S.: Static vs. dynamic type systems: An empirical study about the relationship between type casts and development time. In: Proceedings of the 7th Symposium on Dynamic Languages, DLS 2011, Portland, Oregon, pp. 97–106. ACM (2011)

  44. Tichy, W.F.: Should computer scientists experiment more? IEEE Comput. 31, 32–40 (1998)

  45. Uesbeck, P.M., Stefik, A., Hanenberg, S., Pedersen, J., Daleiden, P.: An empirical study on the impact of C++ lambdas and programmer experience. In: 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, 14–22 May 2016. To appear (2016)

  46. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Norwell (2000)

Author information

Correspondence to Stefan Hanenberg.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hanenberg, S. (2017). Empirical, Human-Centered Evaluation of Programming and Programming Language Constructs: Controlled Experiments. In: Cunha, J., Fernandes, J., Lämmel, R., Saraiva, J., Zaytsev, V. (eds) Grand Timely Topics in Software Engineering. GTTSE 2015. Lecture Notes in Computer Science, vol 10223. Springer, Cham. https://doi.org/10.1007/978-3-319-60074-1_3

  • DOI: https://doi.org/10.1007/978-3-319-60074-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-60073-4

  • Online ISBN: 978-3-319-60074-1

  • eBook Packages: Computer Science, Computer Science (R0)
