Abstract
Data-based modeling relies on historical data without directly accounting for the underlying physical processes in hydrology. As a result, real-world modeling of hydrological processes commonly requires a complex input structure and very long training data sets to represent inherently complex dynamic systems. When a large amount of input data is available and all of it is used for modeling, technical issues such as increased computational complexity and insufficient memory have been observed. These problems are much more likely in hydrological modeling, because the models are highly nonlinear and have a large number of parameters. Therefore, there is a definite need to identify proper techniques that adequately reduce the number of inputs and the required training data length in nonlinear models. The main purposes of these approaches are to remove redundant inputs from the pool of available inputs and to decide on the optimum data length for a reliable prediction. This section of the book describes the abilities of novel techniques such as the Gamma Test (GT), entropy theory (ET), Principal Component Analysis (PCA), cluster analysis (CA), Akaike’s Information Criterion (AIC), and the Bayesian Information Criterion (BIC) in model data selection. The novelty of this work is that many of these approaches are used for the first time in hydrological modeling scenarios such as solar radiation estimation, rainfall-runoff modeling, and evapotranspiration modeling. Towards the end of this chapter, conventional data selection procedures such as the Cross-Correlation Approach (CCA), Cross-Validation Approach (CVA), and Data Splitting Approach (DSA) are explained in detail. These traditional approaches are used to verify the newly applied methods in the later case study chapters.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Remesan, R., Mathew, J. (2015). Model Data Selection and Data Pre-processing Approaches. In: Hydrological Data Driven Modelling. Earth Systems Data and Models, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-319-09235-5_3
DOI: https://doi.org/10.1007/978-3-319-09235-5_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09234-8
Online ISBN: 978-3-319-09235-5
eBook Packages: Earth and Environmental Science, Earth and Environmental Science (R0)