An Introduction to the Statistical Evaluation of Fluency Measures with Signal Detection Theory

  • Keith Smolkowski
  • Kelli D. Cummings
  • Lisa Strycker


Fluency represents the learned ability to respond quickly, effortlessly, and accurately to a given stimulus. The fluent application of a skill, however, requires frequent and deliberate practice on all relevant subskills, not simply the repetition of subskills that are already fluent. Dancers, for example, learn best through marking, in which they practice only partial movements of a performance. Diagnosing the source of disfluency is therefore critical for educators. Judgments grounded in data, statistical models, and even informal prediction models, however, outperform those based on intuition alone. With diagnostic or classification systems, teachers can easily and accurately select the students most in need of supplemental instruction or support.

This chapter describes the basic methods recommended for the development and evaluation of classification systems using a framework called signal detection theory. We present the theoretical basis for signal detection and methods for statistically evaluating diagnostic decisions in education, which seek to balance time, clarity, and accuracy. The methods can be applied to any screener or test, continuous or ordinal, that is used to gauge the likely accomplishment of a relevant criterion, including many measures available in education. To illustrate the methods, we use Dynamic Indicators of Basic Early Literacy Skills (DIBELS; 6th Ed.) measures as an example of a screening system and the Stanford Achievement Test (10th Ed.) as the criterion measure.
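As a rough illustration of the kind of evaluation described above, the following sketch computes sensitivity, specificity, and the area under the ROC curve for a continuous screener against a binary criterion. It is not the chapter's own procedure: the function names, the cut score, and the eight data points are invented for illustration and are not DIBELS or Stanford Achievement Test data. The sketch also assumes that lower screener scores indicate greater risk.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], at_risk: Sequence[bool], cut: float
) -> Tuple[float, float]:
    """Flag students scoring below `cut` and compare with the criterion.

    `at_risk[i]` is True when student i failed to meet the criterion.
    Returns (sensitivity, specificity).
    """
    tp = fp = tn = fn = 0
    for score, truly_at_risk in zip(scores, at_risk):
        flagged = score < cut  # assumption: low score = screen positive
        if flagged and truly_at_risk:
            tp += 1
        elif flagged and not truly_at_risk:
            fp += 1
        elif not flagged and truly_at_risk:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores: Sequence[float], at_risk: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the proportion of (at-risk, not-at-risk) pairs in which the
    at-risk student has the lower score, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, at_risk) if y]
    neg = [s for s, y in zip(scores, at_risk) if not y]
    wins = sum(
        1.0 if p < n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical screener scores and criterion outcomes (invented data).
scores = [12, 18, 25, 31, 40, 44, 52, 60]
at_risk = [True, True, True, False, True, False, False, False]

sens, spec = sensitivity_specificity(scores, at_risk, cut=35)
area = auc(scores, at_risk)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={area:.4f}")
```

Moving the hypothetical cut score trades sensitivity against specificity, whereas the area under the curve summarizes accuracy across all possible cut scores.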


Signal detection theory; Screening; Diagnostic accuracy; ROC curve; Decision-making



Copyright information

© Springer Science+Business Media, LLC 2016

Authors and Affiliations

  • Keith Smolkowski (1)
  • Kelli D. Cummings (2)
  • Lisa Strycker (1)

  1. Oregon Research Institute, Eugene, USA
  2. University of Maryland, College Park, USA
