Weak Classifiers Performance Measure in Handling Noisy Clinical Trial Data

  • Ezzatul Akmal Kamaru-Zaman
  • Andrew Brass
  • James Weatherall
  • Shuzlina Abdul Rahman
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 652)


Most research has concluded that machine learning performs better on cleaned datasets than on dirty ones. In this paper, we evaluated three weak, or base, machine learning classifiers (Decision Table, Naive Bayes and k-Nearest Neighbor) to see how they perform on a real-world, noisy and messy clinical trial dataset rather than a carefully curated one. We involved clinical trial data scientists to guide the data analysis and exploration and to strengthen the evaluation of results. Classifier performance was analysed using accuracy and the Receiver Operating Characteristic (ROC) area, supported by sensitivity, specificity and precision values, and the results contradict the conclusions drawn by previous research. We applied pre-processing techniques, namely the interquartile range technique to remove outliers and mean imputation to handle missing values; with these techniques, all three classifiers performed better on the dirty dataset than on the imputed and cleaned datasets, showing the highest accuracy and ROC measures. Decision Table turned out to be the best classifier when dealing with real-world noisy clinical trial data.
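The two pre-processing steps named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function names are hypothetical, the 1.5×IQR fence is the conventional default, and missing values are represented here as `None`.

```python
# Sketch of interquartile-range outlier removal and mean imputation
# on a single numeric feature column (illustrative only).
from statistics import mean, quantiles

def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles of the sample
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

# Impute a missing value, then drop an extreme outlier.
feature = [4.0, 5.0, None, 6.0]
imputed = mean_impute(feature)            # None becomes the column mean, 5.0
cleaned = iqr_filter([4, 5, 5.5, 6, 6.5, 7, 1000])  # 1000 falls outside the fence
```

A fuller experiment would apply these column by column before training the classifiers on the "imputed" and "clean" variants of the dataset.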


Keywords: Clinical trial · Classifier · Decision Table · k-Nearest Neighbor · Machine learning · Naïve Bayes · Noisy data



This paper is part of a Master's dissertation written at the University of Manchester, UK. We would like to thank the data scientists from the Advanced Analytics Centre, AstraZeneca, Alderley Park, Cheshire, UK for their review, support and suggestions on this study.



Copyright information

© Springer Nature Singapore Pte Ltd. 2016

Authors and Affiliations

  • Ezzatul Akmal Kamaru-Zaman (1, 2)
  • Andrew Brass (2)
  • James Weatherall (3)
  • Shuzlina Abdul Rahman (1)
  1. Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam, Malaysia
  2. School of Computer Science, University of Manchester, Manchester, UK
  3. Advanced Analytics Centre, AstraZeneca R&D, Cheshire, UK