Data Mining Paradigm in the Study of Air Quality

  • Natacha Soledad Represa
  • Alfonso Fernández-Sarría
  • Andrés Porta
  • Jesús Palomar-Vázquez
Review Article


Air pollution is a serious global problem that threatens human life and health, as well as the environment. The most important components of a successful air quality management strategy are measurement analysis, air quality forecasting, and a reporting system. Complete insight, accurate prediction, and rapid response can provide valuable information for society’s decision-making. The data mining paradigm can assist in the study of air quality by providing a structured work methodology that simplifies data analysis. This study presents a systematic review of the literature from 2014 to 2018 on the use of data mining in the analysis of air pollutant measurements. For this review, we propose a data mining approach to air quality analysis that is consistent with the 748 articles consulted. The most frequent sources of data were the measurements of monitoring networks, while other technologies such as remote sensing, low-cost sensors, and social networks have been gaining importance in recent years. The topics studied in the literature include the redundancy of the information collected by monitoring networks, the forecasting of pollutant levels or of days on which regulatory limits are exceeded, and the identification of the meteorological or land use parameters that have the most substantial impact on air quality. As methods to visualise and present results, we found graphic design, air quality index development, heat mapping, and geographic information systems. We hope that this study will anchor theoretical and practical development in the field and provide inputs for air quality planning and management.


Keywords: Air quality, Environmental management, Air pollution, Data mining



The authors gratefully acknowledge the support of the Argentinean Scientific and Technical Research Council, the Polytechnic University of Valencia, Spain, and the National University of La Plata, Argentina. The National Agency for Scientific and Technological Promotion funded this project through grant PICT 2015-0618.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Centro de Investigaciones del Medioambiente, National University of La Plata (UNLP), La Plata, Argentina
  2. Geo-Environmental Cartography and Remote Sensing Group, Polytechnic University of Valencia, Valencia, Spain
