Finding Influential Factors for Different Types of Cancer: A Data Mining Approach
Cancer is one of the leading causes of death around the world. Finding the risk factors related to different types of cancer can help researchers understand the process of cancer development and find new ways of preventing the disease. Most research on cancer datasets focuses on only one type of cancer. This research aims to provide a new methodology for extracting significant influential factors affecting multiple cancer types by employing frequent pattern mining, association rule mining, and contrast set mining techniques. The datasets used cover the US general population and were collected from the National Health Interview Survey (NHIS) and the Surveillance, Epidemiology, and End Results (SEER) Program. The discovered rules contribute in two ways: some validate existing knowledge about cancer, and a few open up further research directions that could enrich expert knowledge in the cancer domain. Experimental results show that high cholesterol and high blood pressure are prevalent among cancer patients. In terms of demographics, females and people aged 61 to 85 are more prone to cancer. Also, people of “not Hispanic/Spanish origin” form the majority of cancer patients. This research is one of the few works that applies to diverse cancer types, and it is unique in its methodology for finding dominant factors associated with cancer.
Keywords: Data mining, Frequent pattern mining, Association rule mining, Contrast set mining, Cancer research, Risk factors for cancer
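The three techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the records below are hypothetical toy data (not drawn from NHIS or SEER), the item names and function names are invented for illustration, and the frequent itemsets are found by brute-force enumeration rather than by an optimized algorithm such as Apriori. Contrast set mining is reduced here to its basic quantity, the difference in an itemset's support between two groups.

```python
from itertools import combinations

# Hypothetical toy records, NOT taken from NHIS or SEER.
# Each record is the set of attributes observed for one person.
transactions = [
    frozenset({"high_cholesterol", "high_bp", "cancer"}),
    frozenset({"high_cholesterol", "high_bp", "cancer"}),
    frozenset({"high_bp", "cancer"}),
    frozenset({"high_cholesterol", "high_bp"}),
    frozenset({"high_cholesterol"}),
    frozenset({"high_bp"}),
    frozenset(),
    frozenset({"high_cholesterol", "cancer"}),
]


def support(itemset, rows):
    """Fraction of rows that contain every item in itemset."""
    itemset = frozenset(itemset)
    if not rows:
        return 0.0
    return sum(1 for row in rows if itemset <= row) / len(rows)


def frequent_itemsets(rows, min_support):
    """Frequent pattern mining by brute force (illustration only).

    Stops at the first size k with no frequent itemset, since every
    superset of an infrequent itemset is also infrequent.
    """
    items = sorted(set().union(*rows))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            s = support(combo, rows)
            if s >= min_support:
                result[frozenset(combo)] = s
                found = True
        if not found:
            break
    return result


def confidence(antecedent, consequent, rows):
    """Confidence of the rule antecedent -> consequent."""
    base = support(antecedent, rows)
    if base == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, rows) / base


def support_difference(itemset, group_a, group_b):
    """Contrast set measure: support in group_a minus support in group_b."""
    return support(itemset, group_a) - support(itemset, group_b)


# Split the toy records into two groups for the contrast set comparison.
cancer_rows = [row - {"cancer"} for row in transactions if "cancer" in row]
other_rows = [row for row in transactions if "cancer" not in row]

if __name__ == "__main__":
    for itemset, s in frequent_itemsets(transactions, 0.3).items():
        print(sorted(itemset), s)
    print(confidence({"high_cholesterol", "high_bp"}, {"cancer"}, transactions))
    print(support_difference({"high_cholesterol", "high_bp"}, cancer_rows, other_rows))
```

In this sketch a rule such as {high_cholesterol, high_bp} → {cancer} is judged by its support and confidence, while the contrast set measure asks whether the same attribute combination is more frequent among the cancer group than among the rest. Thresholds such as the minimum support of 0.3 are arbitrary choices for the toy data.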