Model Data Selection and Data Pre-processing Approaches

Remesan, Renji; Mathew, Jimson

doi:10.1007/978-3-319-09235-5_3

Part of the book series: Earth Systems Data and Models (ESDM, volume 1)

Abstract

Data-based modeling relies on historical data without directly taking account of the underlying physical processes in hydrology. Consequently, real-world modeling of hydrological processes commonly requires a complex input structure and very long training data sets to represent the inherently complex dynamic systems. Where a large amount of input data is available and all of it is used for modeling, technical issues such as increased computational complexity and insufficient memory have been observed. These problems are much more likely to occur in hydrological modeling, because the models are highly nonlinear and have a large number of parameters. There is therefore a definite need to identify techniques that adequately reduce the number of inputs and the required training data length in nonlinear models. The main purposes of these approaches are to remove redundant inputs from the pool of available inputs and to decide on the optimum data length for a reliable prediction. This section of the book describes the abilities of novel techniques such as the Gamma Test (GT), entropy theory (ET), Principal Component Analysis (PCA), cluster analysis (CA), Akaike’s Information Criterion (AIC), and the Bayesian Information Criterion (BIC) in model data selection. The novelty of this work is that many of these approaches are used for the first time in hydrological modeling scenarios such as solar radiation estimation, rainfall-runoff modeling, and evapotranspiration modeling. Towards the end of this chapter, conventional data selection procedures such as the Cross-Correlation Approach (CCA), Cross-Validation Approach (CVA), and Data Splitting Approach (DSA) are explained in detail. These traditional approaches were used to check the validity of the newly applied methods in the later case study chapters.
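As a rough illustration of how an information criterion can rank candidate input sets, the sketch below scores three models with AIC and BIC. It is not taken from the chapter: it assumes the common Gaussian least-squares forms AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), which hold up to an additive constant, and the candidate names, residual sums of squares, and sample size are invented purely for demonstration.

```python
import math


def aic_bic(rss, n, k):
    """Return (AIC, BIC) for a least-squares fit, up to an additive constant.

    rss: residual sum of squares on the training data
    n:   number of training samples
    k:   number of fitted parameters
    Assumes Gaussian residuals; this is one common form, not the only one.
    """
    if rss <= 0 or n <= 0:
        raise ValueError("rss and n must be positive")
    fit_term = n * math.log(rss / n)   # shared goodness-of-fit term
    aic = fit_term + 2 * k             # fixed penalty of 2 per parameter
    bic = fit_term + k * math.log(n)   # penalty grows with sample size
    return aic, bic


# Hypothetical candidates: name -> (RSS, number of parameters).
candidates = {
    "rainfall only": (52.0, 2),
    "rainfall + lagged runoff": (31.0, 3),
    "rainfall + lagged runoff + temperature": (30.8, 4),
}
n_samples = 200  # hypothetical training data length

scores = {name: aic_bic(rss, n_samples, k) for name, (rss, k) in candidates.items()}

# Lower is better for both criteria.
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (aic, bic) in scores.items():
    print(f"{name}: AIC={aic:.2f}, BIC={bic:.2f}")
print("Selected by AIC:", best_by_aic)
print("Selected by BIC:", best_by_bic)
```

With these made-up numbers, the third input lowers the residual error only slightly, so the penalty terms outweigh the gain and both criteria favour the two-input model. This is the sense in which such criteria help discard redundant inputs.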

Author information

Authors and Affiliations

Cranfield Water Science Institute, Cranfield University, Cranfield, Bedfordshire, UK
Renji Remesan
Department of Computer Science, University of Bristol, Bristol, UK
Jimson Mathew

Corresponding author

Correspondence to Renji Remesan.

About this chapter

Cite this chapter

Remesan, R., Mathew, J. (2015). Model Data Selection and Data Pre-processing Approaches. In: Hydrological Data Driven Modelling. Earth Systems Data and Models, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-319-09235-5_3

DOI: https://doi.org/10.1007/978-3-319-09235-5_3
Published: 04 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09234-8
Online ISBN: 978-3-319-09235-5
eBook Packages: Earth and Environmental Science (R0)
