Finding Influential Factors for Different Types of Cancer: A Data Mining Approach

  • Munima Jahan
  • Elham Akhond Zadeh Noughabi
  • Behrouz H. Far
  • Reda Alhajj
Part of the Lecture Notes in Social Networks book series (LNSN)


Cancer is one of the leading causes of death around the world. Finding the risk factors related to different types of cancer can help researchers understand how cancer develops and find new ways of preventing the disease. Most research on cancer datasets focuses on only one type of cancer. This research aims to provide a new methodology for extracting significant influential factors affecting multiple cancer types by employing frequent pattern mining, association rule mining, and contrast set mining techniques. The datasets used cover the US general population and were collected from the National Health Interview Survey (NHIS) and the Surveillance, Epidemiology, and End Results (SEER) Program. The discovered rules contribute in two ways: some validate existing knowledge about cancer, and a few open new directions for research that could enrich expert knowledge in the cancer domain. Experimental results show that high cholesterol and high blood pressure are common among cancer patients. Demographically, women and people aged 61 to 85 are more prone to cancer. In addition, respondents whose Hispanic origin is recorded as “not Hispanic/Spanish origin” form the majority of cancer patients. This research is one of the few works that applies to multiple cancer types, and its methodology for finding dominant factors associated with cancer is unique.
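
The three techniques named in the abstract rest on a few simple measures, which can be illustrated with a minimal Python sketch. This is not the authors' implementation: the records are hypothetical toy data invented for illustration (not NHIS or SEER data), the function names are assumptions, and the frequent itemsets are found by brute-force enumeration rather than an optimized algorithm such as Apriori. The contrast measure shown is a plain difference in support between two groups, without the statistical testing a full contrast set miner would apply.

```python
from itertools import combinations

# Hypothetical toy records, invented for illustration only (not NHIS or SEER data).
cancer = [
    {"high_cholesterol", "high_bp"},
    {"high_cholesterol", "high_bp", "female"},
    {"high_bp", "female"},
    {"high_cholesterol"},
]
no_cancer = [
    {"high_bp"},
    {"female"},
    {"high_cholesterol", "female"},
    set(),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for itemsets (up to max_size) with support >= min_support."""
    items = sorted(set().union(*transactions)) if transactions else []
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                found[frozenset(combo)] = s
    return found


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base


def support_difference(itemset, group_a, group_b):
    """Simple contrast measure: support in group_a minus support in group_b."""
    return support(itemset, group_a) - support(itemset, group_b)


frequent = frequent_itemsets(cancer, min_support=0.5)
rule_conf = confidence({"high_cholesterol"}, {"high_bp"}, cancer)
contrast = support_difference({"high_bp"}, cancer, no_cancer)
```

Here `frequent` holds the itemsets that occur in at least half of the toy cancer records, `rule_conf` is the confidence of the rule "high cholesterol implies high blood pressure" within that group, and `contrast` shows how much more often high blood pressure appears in one group than in the other.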


Keywords: Data mining; Frequent pattern mining; Association rule mining; Contrast set mining; Cancer research; Risk factors for cancer



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Munima Jahan
    • 1
  • Elham Akhond Zadeh Noughabi
    • 1
  • Behrouz H. Far
    • 1
  • Reda Alhajj
    • 2
  1. Department of Electrical and Computer Engineering, University of Calgary, Calgary, Canada
  2. Department of Computer Science, University of Calgary, Calgary, Canada
