A two-phase machine learning approach for predicting student outcomes


Learning analytics has shown promising capabilities and opportunities in many aspects of academic research and higher education studies. Data-driven insights can contribute significantly to solutions for curbing costs and improving education quality. This paper adopts a two-phase machine learning approach that uses both unsupervised and supervised learning techniques to predict the outcomes of students following Higher Education programs of study. The approach was applied in a case study conducted in the context of an undergraduate Computer Science curriculum offered by the University of Thessaly in Greece. Students involved in the case study were first grouped based on the similarity of specific education-related factors and metrics. Using the K-Means algorithm, our clustering experiments revealed three coherent clusters of students. The discovered clusters were then used to train prediction models that address each cluster of students individually. To this end, two machine learning models were trained for every cluster of students to predict the time to degree completion and student enrollment in the offered educational programs. The developed models are reported to produce predictions with relatively high accuracy. Finally, the paper discusses the potential usefulness of the clustering-aided approach for learning analytics in Higher Education.
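The two-phase idea can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' implementation: the two features (a normalized GPA and a course pass rate), the synthetic data, the least-squares line for time to degree, and the nearest-class-mean classifier for a binary enrollment flag are all hypothetical stand-ins, since the abstract does not name the features or the supervised models. Only the use of K-Means with three clusters and the training of two models per cluster come from the text.

```python
import math
import random
from statistics import mean


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of tuples."""
    centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return centroids, labels


def fit_line(xs, ys):
    """Least-squares fit y = a + b*x; falls back to the mean if x is constant."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return my - b * mx, b


def fit_class_means(points, targets):
    """Nearest-class-mean classifier: one mean feature vector per class label."""
    means = {}
    for cls in set(targets):
        rows = [p for p, t in zip(points, targets) if t == cls]
        means[cls] = tuple(mean(dim) for dim in zip(*rows))
    return means


def train_two_phase(features, years, enrolled, k=3):
    """Phase 1: cluster students. Phase 2: fit two models per cluster."""
    centroids, labels = kmeans(features, k)
    models = {}
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        pts = [features[i] for i in idx]
        models[c] = {
            "centroid": centroids[c],
            # Regressor (assumed): time to degree as a line in the first feature.
            "line": fit_line([p[0] for p in pts], [years[i] for i in idx]),
            # Classifier (assumed): enrollment flag by nearest class mean.
            "class_means": fit_class_means(pts, [enrolled[i] for i in idx]),
        }
    return models


def predict(models, student):
    """Route a student to the nearest cluster and apply that cluster's models."""
    m = min(models.values(), key=lambda mod: math.dist(student, mod["centroid"]))
    a, b = m["line"]
    label = min(m["class_means"],
                key=lambda cls: math.dist(student, m["class_means"][cls]))
    return a + b * student[0], label


# Synthetic cohort: three loose groups of (normalized GPA, course pass rate).
rng = random.Random(42)
features, years, enrolled = [], [], []
for gpa_c, rate_c in [(0.3, 0.4), (0.6, 0.7), (0.85, 0.95)]:
    for _ in range(40):
        gpa, rate = rng.gauss(gpa_c, 0.04), rng.gauss(rate_c, 0.04)
        features.append((gpa, rate))
        years.append(8.0 - 4.0 * gpa + rng.gauss(0, 0.2))  # made-up relationship
        enrolled.append(1 if rate + rng.gauss(0, 0.05) > rate_c else 0)

models = train_two_phase(features, years, enrolled, k=3)
years_hat, enrolled_hat = predict(models, (0.62, 0.71))
print(len(models), round(years_hat, 2), enrolled_hat)
```

The point of the design is that each cluster gets its own pair of models, so a prediction for a new student depends first on which group the student resembles and only then on the within-group relationship. With real data, the features would normally be standardized before clustering, and stronger learners would replace the two simple models used here.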





Author information



Corresponding author

Correspondence to Omiros Iatrellis.


About this article


Cite this article

Iatrellis, O., Savvas, I.K., Fitsilis, P. et al. A two-phase machine learning approach for predicting student outcomes. Educ Inf Technol (2020). https://doi.org/10.1007/s10639-020-10260-x



Keywords

  • Learning analytics
  • Unsupervised learning
  • Supervised learning
  • Higher education