Data Mining System Applied to Population Databases for Studies on Lung Cancer

  • J. Pérez
  • F. Henriques
  • R. Santaolaya
  • O. Fragoso
  • A. Mexicano
Part of the Springer Optimization and Its Applications book series (SOIA, volume 65)


This work addresses the problem of finding the mortality distribution of lung cancer across Mexican districts through the discovery of clustering patterns. A data mining system was developed, consisting of a pattern generator and a visualization subsystem. Such an approach may contribute to biomarker discovery by identifying risk regions for a given cancer type, and may further reduce the cost and time spent in conducting cancer studies. The k-means algorithm was used to generate the patterns, which allows them to be expressed as groups of districts with affinity in their location and mortality rate attributes. The source data were obtained from Mexican official institutions. As a result, a set of grouping patterns reflecting the mortality distribution of lung cancer in Mexico was generated. Two interesting patterns with high mortality rates were detected in northeastern and northwestern Mexico. We consider that the patterns generated by the data mining system can be useful for identifying high-risk cancer areas and for biomarker discovery.
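The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the district names, coordinates, and mortality rates below are invented for illustration, the helper names `min_max_scale` and `kmeans` are hypothetical, and the min-max scaling of the attributes is an assumption made here so that location and mortality rate contribute on comparable scales.

```python
import random


def min_max_scale(rows):
    """Rescale every column to [0, 1] so no attribute dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100, seed=0):
    """Basic k-means: alternate nearest-centroid assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids are too
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical districts: (name, longitude, latitude, deaths per 100,000).
districts = [
    ("A", -110.0, 29.0, 12.0),
    ("B", -109.5, 28.5, 11.0),
    ("C", -110.5, 29.5, 13.0),
    ("D", -92.0, 17.0, 3.0),
    ("E", -92.5, 16.5, 2.5),
    ("F", -93.0, 17.5, 3.5),
]

features = min_max_scale([d[1:] for d in districts])
labels, centroids = kmeans(features, k=2)

for (name, *_), label in zip(districts, labels):
    print(name, "-> cluster", label)
```

Each resulting cluster is a group of districts that are close both geographically and in mortality rate, which is the kind of grouping pattern the abstract refers to. On real data the choice of k and the attribute scaling would need to be justified rather than assumed.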


Keywords: Lung Cancer; Data Mining; Data Warehouse; Data Mining Technique; Lung Cancer Mortality





Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • J. Pérez (1)
  • F. Henriques (2)
  • R. Santaolaya (1)
  • O. Fragoso (1)
  • A. Mexicano (1)

  1. Centro Nacional de Investigación y Desarrollo Tecnológico, Cuernavaca, México
  2. Fundação Nacional de Saúde, Recife, Brazil
