Applying Under-Sampling Techniques and Cost-Sensitive Learning Methods on Risk Assessment of Breast Cancer

  • Jia-Lien Hsu
  • Ping-Cheng Hung
  • Hung-Yen Lin
  • Chung-Ho Hsieh
Non-invasive Diagnostic Systems
Part of the following topical collections:
  1. Systems-Level Quality Improvement


Breast cancer is one of the most common causes of cancer mortality. Early detection through mammography screening could significantly reduce mortality from breast cancer. However, most screening methods consume a large amount of resources. We propose a computational model for breast cancer risk assessment that is based solely on personal health information. Our model can serve as a pre-screening program in low-cost settings. In our study, the data set consists of 3976 records collected at Taipei City Hospital from January 1, 2008 to December 31, 2008. Based on this data set, we first apply sampling techniques and a dimension reduction method to preprocess the testing data. Then, we construct various kinds of classifiers (including basic classifiers, ensemble methods, and cost-sensitive methods) to predict the risk. The cost-sensitive method with a random forest classifier achieves a recall (sensitivity) of 100 %. At a recall of 100 %, the precision (positive predictive value, PPV) and specificity of the cost-sensitive method with a random forest classifier were 2.9 % and 14.87 %, respectively. In summary, we build a breast cancer risk assessment model using data mining techniques. Our model has the potential to serve as an assisting tool in breast cancer screening.
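The ideas named in the abstract (under-sampling, cost-sensitive decisions, and the recall/precision/specificity trade-off) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cost values, and the toy scores and labels below are all hypothetical, and the cutoff rule assumes that correct predictions cost nothing, so that a case is flagged whenever its predicted probability reaches cost_fp / (cost_fp + cost_fn).

```python
import random


def undersample(records, labels, seed=0):
    """Randomly drop majority-class records until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [records[i] for i in kept], [labels[i] for i in kept]


def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability cutoff that minimises expected cost when correct calls cost 0."""
    return cost_fp / (cost_fp + cost_fn)


def screening_metrics(y_true, y_pred):
    """Recall (sensitivity), precision (PPV), and specificity from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical toy data: 2 positive and 8 negative cases with made-up scores.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.30, 0.15, 0.40, 0.20, 0.12, 0.05, 0.08, 0.02, 0.03, 0.01]

# Balance the classes, as one would before training on skewed data.
balanced_scores, balanced_labels = undersample(scores, labels)

# Assume (hypothetically) that a missed cancer costs 9 times a false alarm.
threshold = cost_sensitive_threshold(cost_fp=1, cost_fn=9)
preds = [1 if s >= threshold else 0 for s in scores]
metrics = screening_metrics(labels, preds)
print(threshold, metrics)
```

With these made-up numbers, penalising missed cases heavily lowers the cutoff far enough to catch every positive case, at the price of more false alarms. That is the same direction of trade-off as the 100 % recall with low precision and specificity reported above.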


Breast cancer, Cost-sensitive learning, Sampling



Financial support for this study was provided in part by a grant from the National Science Council, Taiwan, under Contract No. NSC-102-2218-E-030-002. The funding agreement ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Jia-Lien Hsu
    • 1
  • Ping-Cheng Hung
    • 1
  • Hung-Yen Lin
    • 1
  • Chung-Ho Hsieh
    • 2
  1. Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City, Republic of China
  2. Department of General Surgery, Shin Kong Wu Ho-Su Memorial Hospital, Taipei, Republic of China
