Collecting valid software engineering data takes substantial effort and is not a priority in most software production environments, so data are often missing or otherwise invalid. Most software engineering researchers tend to overlook this problem, which may bias their analyses. This chapter reviews missing data methods and applies them to a software engineering data set, both to illustrate the variety of practical contexts where such techniques are needed and to highlight the pitfalls of ignoring the missing data problem.

Copyright information

© 2008 Springer-Verlag London Limited

About this chapter

Cite this chapter

Mockus, A. (2008). Missing Data in Software Engineering. In: Shull, F., Singer, J., Sjøberg, D.I.K. (eds) Guide to Advanced Empirical Software Engineering. Springer, London. https://doi.org/10.1007/978-1-84800-044-5_7

  • DOI: https://doi.org/10.1007/978-1-84800-044-5_7

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-043-8

  • Online ISBN: 978-1-84800-044-5

  • eBook Packages: Computer Science (R0)
