Model Data Selection and Data Pre-processing Approaches

Hydrological Data Driven Modelling

Part of the book series: Earth Systems Data and Models (ESDM, volume 1)

Abstract

Data-based modeling relies on historical data without directly taking account of the underlying physical processes in hydrology. As a result, real-world modeling of hydrological processes commonly requires a complex input structure and very lengthy training data to represent inherently complex dynamic systems. When a large amount of input data is available and all of it is used for modeling, technical issues such as increased computational complexity and insufficient memory have been observed. These problems are much more likely in hydrological modeling, as these models possess high nonlinearity and a large number of parameters. Therefore, there is a definite need to identify proper techniques that adequately reduce the number of inputs and the required training data length in nonlinear models. The main purposes of these approaches are to remove redundant inputs from the pool of available inputs and to decide upon the optimum data length for a reliable prediction. This section of the book describes the abilities of novel techniques such as the Gamma Test (GT), entropy theory (ET), Principal Component Analysis (PCA), cluster analysis (CA), Akaike’s Information Criterion (AIC), and the Bayesian Information Criterion (BIC) in model data selection. The novelty of this work is that many of these approaches are used for the first time in hydrological modeling scenarios such as solar radiation estimation, rainfall-runoff modeling, and evapotranspiration modeling. Towards the end of this chapter, conventional data selection procedures such as the Cross-Correlation Approach (CCA), Cross-Validation Approach (CVA), and Data Splitting Approach (DSA) are explained in detail. These traditional approaches were used to check the authenticity of the newly applied methods in the later case study chapters.
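As a rough illustration of how an information criterion can screen redundant inputs, the sketch below ranks every subset of candidate inputs by AIC and BIC. It is not the chapter's own procedure. It assumes a linear least-squares model with Gaussian errors, for which AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n) up to an additive constant. The function names, the variable names (`rain`, `temp`, `wind`, `flow`), and the synthetic data are all hypothetical.

```python
import itertools

import numpy as np


def information_criteria(X, y):
    """Return (AIC, BIC) for a least-squares fit of y on X plus an intercept.

    Assumes Gaussian errors, so both criteria are written in terms of the
    residual sum of squares (additive constants are dropped).
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
    rss = float(np.sum((y - A @ coef) ** 2))
    k = A.shape[1]                                # number of fitted coefficients
    fit_term = n * np.log(rss / n)
    return fit_term + 2 * k, fit_term + k * np.log(n)


def rank_input_subsets(X, y, names):
    """Score every non-empty subset of input columns; sort by BIC (lowest first)."""
    results = []
    for r in range(1, len(names) + 1):
        for subset in itertools.combinations(range(len(names)), r):
            aic, bic = information_criteria(X[:, list(subset)], y)
            results.append((tuple(names[i] for i in subset), aic, bic))
    return sorted(results, key=lambda item: item[2])


# Hypothetical synthetic data: 'wind' carries no information about 'flow'.
rng = np.random.default_rng(0)
n = 200
rain = rng.normal(size=n)
temp = rng.normal(size=n)
wind = rng.normal(size=n)
flow = 2.0 * rain - 1.0 * temp + 0.1 * rng.normal(size=n)

X = np.column_stack([rain, temp, wind])
ranked = rank_input_subsets(X, flow, ["rain", "temp", "wind"])
best_inputs, best_aic, best_bic = ranked[0]
print(best_inputs, round(best_aic, 1), round(best_bic, 1))
```

Exhaustive enumeration is only practical for a handful of candidate inputs, because the number of subsets grows as 2^p. BIC penalizes extra inputs more heavily than AIC once n exceeds about 7, so it tends to favor smaller input sets.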

Author information

Corresponding author

Correspondence to Renji Remesan.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Remesan, R., Mathew, J. (2015). Model Data Selection and Data Pre-processing Approaches. In: Hydrological Data Driven Modelling. Earth Systems Data and Models, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-319-09235-5_3
