
Water Resources Management, Volume 32, Issue 15, pp 5151–5168

Entropy as a Variation of Information for Testing the Goodness of Fit

  • Türkay Baran
  • Filiz Barbaros
  • Ali Gül
  • Gülay Onuşluel Gül

Abstract

Increasing population and higher levels of human and industrial activities have affected water resources in recent decades. In addition, per capita demand for water in most countries is steadily increasing as more and more people achieve higher standards of living. Researchers need more information about water resources for their efficient use and effective management. In this respect, obtaining sufficient, accurate and timely information has great significance in water resources planning and management, in parallel to the determination of the characteristics of water resources. To this end, useful and easily applicable methods have been explored to obtain optimum results, and several test techniques have been investigated to extract more information from available water resources data. In the presented study, the Informational Entropy method is introduced as an alternative method for testing the goodness of fit of probability functions. The study gives brief detail on the applicability of the concept as a goodness of fit tool on various cases from different spatial regions with varying meteorological characteristics. For this purpose, mean precipitation data for 60 stations in Turkey are investigated. Results obtained by testing the goodness of fit of probability functions through the entropy-based method show that Informational Entropy can be applied for fitting the probability function to the investigated datasets.

Keywords

Informational entropy · Probability analysis · Trend analysis · Goodness of fit · Water resources

1 Introduction

All attempts towards an efficient development of water resources within technical and economic perspectives require data and information on the quality and quantity of such resources. Accordingly, measuring and evaluating hydrologic processes that represent time-variant characteristics of water resources is highly significant. It is up to the planner to extract the maximum amount of information from these series and to determine the required parameters for the design and operation of water resources systems. The planner is not only concerned with the availability of data but also with the proper delineation of data collection networks. This is due to the fact that collection, processing and acquisition of hydrologic data require a significant amount of labor and investment. Thus, it is highly important that collected data bring new information.

Considering the limited financial resources to be allocated to monitoring of data, emphasis must be put on the optimum selection of spatial and temporal monitoring frequencies. The collected data must also be assured to meet the objectives of water resources planning and development. Consequently, it may be stated that an objective criterion is needed to determine the amount of information conveyed by observed data, since all activities towards the planning of water resources rely on data and information contained in available observations.

Hydrologic processes must be observed to achieve optimum decisions regarding the design and operation of water resources systems. On the other hand, data collection practices must also be realized on an optimum basis in delineating what, where, when and how long to measure. In relation to this need, planners have lately used the terms expected information, increase of information, or deficiency of information and have related their design parameters to the information conveyed by available data. Information has often been expressed indirectly in terms of statistical parameters such as variance, standard error or correlation coefficient instead of quantitative terms (Harmancioglu and Singh 1998).

Entropy is a measure of the degree of uncertainty of a random hydrological process. Since the reduction of uncertainty by means of observations is equal to the amount of information gained, the entropy criterion indirectly measures the information content of a given series of data. Entropy measures can also be employed to identify when a monitoring activity reaches an optimal point in time after which any new data do not produce new information, or can be used for diagnostic checking in time series modeling (Baran and Bacanli 2006, 2007a, b).

Sharifdoost et al. (2009) introduced an entropy-based statistic as an alternative goodness of fit test. The alternative method was compared with the classical Chi-squared, Kolmogorov-Smirnov and likelihood ratio tests, and it was concluded that the entropy method is more sensitive than the usual statistics. Lee et al. (2011) proposed a test of fit based on maximum entropy; the test statistics are established and a corrected form for small and medium sample sizes is obtained through Monte Carlo simulations on real time series. Abbas et al. (2012) identified the best fitting distribution to explain the annual maximum of daily rainfall for a specified time period. In their study, Gamma, Generalized Extreme Value (GEV) and Generalized Pareto (GP) distributions were fitted to annual maximum daily rainfall data from each station, and the performance of the distributions was evaluated using different goodness of fit (GOF) tests, namely the Chi-Square (CS), Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests. An empirical investigation of some tests of goodness of fit for the inverse Gaussian distribution was carried out by Best et al. (2012). Three goodness of fit tests, Chi-Square (C-S), Kolmogorov-Smirnov (K-S), and Anderson-Darling (A-D), were evaluated for frequency analysis by Zeng et al. (2015). On the other hand, an entropy-based measure of data-model fit that can be used to assess the quality of logistic regression models was introduced in a study by Weiss and Dardick (2016). As an alternative approach, Girardin and Lequesne (2017) detailed the mathematical justification of tests based on Shannon entropy (called S tests) and on relative entropy (called KL tests), showing their equivalence for testing any parametric composite null hypothesis of maximum entropy distributions. The methodology was applied to a real dataset of a DNA replication process to detect chicken cell lines.

By using the informational entropy method, the probability distribution of best fit to an observed time series can be evaluated (Baran and Barbaros 2015; Baran et al. 2017a, b). Originating from this feasibility, the presented study targets exploring the applicability of the concept as a goodness of fit tool on various cases from different spatial regions and varying meteorological characteristics. The study consists of two parts. The first part aims basically at showing the strength of informational entropy in indicating the variation of information. Originating from this intention, the mathematical formulation of Shannon's entropy concept is employed as an indication of the variation of information. The analyses to this end cover time series of meteorological variables, which were captured at meteorological monitoring stations located over a wide spatial extent of Turkey, examined through trend analysis approaches in order to get any primary indication about the occurrence of persistent trend behavior in the series investigated. The second part relates to testing the fit of probability distribution functions of precipitation data in Turkey. For this purpose, 60 gaging stations having long-term precipitation data are investigated. In the probability analysis, the best fit results of probability distribution functions achieved through the informational entropy method are compared to the Chi-square test indicator.

2 Informational Entropy

2.1 The Role of the Entropy Concept in Evaluation of Hydrologic Data

In mathematical communication theory, the concept of and the measure for information content have been derived from statistical and probabilistic principles. Within this respect, the theory has also been referred to as Statistical Communication Theory.

Mathematical communication theory analyzes the statistical structure for information of a series of numbers, signs or symbols that make up a communication signal, without considering their kind, meaning, value or any other subjective characteristics. The term information content here refers to the capability of signals to create communication, and the basic problem is the generation of correct communication by sending a sufficient amount of signals, leading neither to any loss nor to repetition of information (Cherry 1957; Pfeiffer 1965).

In introducing the informational entropy concept, Shannon (1964) considered, in a general sense, that the average information content of any data source should receive prior consideration over any other parameter derived from the data set itself. With this logic, Shannon defines the information content, H(n), of a message sent by the transmitter as in Eq. (1):
$$ H(n)=-{\sum}_{k=1}^n{p}_k\mathit{\log}{p}_k $$
(1)
with,
$$ \sum \limits_{k=1}^n{p}_k=1 $$

The general concept of information content H(n) was later named with the entropy term (Shannon and Weaver 1949) as Shannon’s definition is very much similar to the entropy function described in statistical mechanics (Cherry 1957; Pierce 1961; Pfeiffer 1965).

Shannon's entropy given in Eq. (1) is originally formulated for discrete variables and always assumes positive values. Shannon extended this expression to the continuous case by simply replacing the summation with an integral, as given in Eq. (2):
$$ H(X)=-\int_{-\infty}^{+\infty} f(x)\,\ln f(x)\,dx $$
(2)
with,
$$ \int_{-\infty}^{+\infty} f(x)\,dx=1 $$
Shannon's definition of entropy for continuous variables depends on the selection of particular Δx (discretizing interval) values and thus produces entropy measures varying within the interval (−∞, +∞). By contrast, the theoretical background requires that, for the random variable X, H(X) satisfy the condition in Eq. (3):
$$ 0\le H(X)\le lnN $$
(3)
where N is the number of events X assumes. The condition above indicates that the entropy function has upper (lnN) and lower (0 when X is deterministic) bounds, assuming positive values in between. The discrepancies encountered in practical applications of the concept essentially result from these shortcomings in the definition of entropy for continuous variables. The main difficulty associated with the applicability of the entropy concept in hydrology originates from the lack of a precise definition when dealing with continuous variables.
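As a quick numerical check of Eqs. (1) and (3), the short Python sketch below (our own illustration, not code from the study) computes the discrete Shannon entropy of a probability vector in natural-logarithm units and verifies the bounds for a uniform and a deterministic distribution.

```python
import numpy as np

def shannon_entropy(p):
    """Discrete Shannon entropy H(n) = -sum p_k ln p_k (Eq. 1), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # terms with p_k = 0 contribute nothing
    return -np.sum(p * np.log(p))

# Bounds of Eq. (3): 0 <= H(X) <= ln N
n = 8
uniform = np.full(n, 1.0 / n)          # maximum uncertainty
deterministic = np.eye(n)[0]           # one outcome is certain

print(shannon_entropy(uniform), np.log(n))   # both ~2.079 (= ln 8)
print(shannon_entropy(deterministic))        # 0 (no uncertainty)
```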

2.2 Need for Measuring the Information Content of Hydrologic Processes

The entropy concept, as defined in the Mathematical Communication Theory (or Information Theory), seems to offer such an objective criterion and appears to bring appropriate solutions to the aforementioned problems in water resources engineering. In the area of hydrology and water resources, a range of applications of entropy has been reported during the last decades. In entropy-based parameter estimation, the distribution parameters are expressed in terms of the given information (Singh 1997, 1998).

One of the most prominent uses of the entropy principle is on assessing uncertainties in varying aspects of hydrology science, ranging from hydrological variables and model parameters to water resources systems within a much general context. Indeed, the entropy concept finds places in different practices that may include specific cases, such as the derivation of frequency distributions and parameter estimation, or broader cases such as hydrometric data network design. The most distinctive yield of entropy in such applications is its capacity for measuring uncertainty or information in quantitative terms (Harmancioglu et al. 1992; Harmancioglu and Singh 1998, 2002; Singh 1997, 2003).

On the other hand, researchers have also noted that some mathematical difficulties are encountered in the computation of various informational entropy measures. The major problem is the controversy associated with the mathematical definition of entropy for continuous probability distribution functions. In this case, the lack of a precise definition of informational entropy leads to further mathematical difficulties and, thus, hinders the applicability of the concept in hydrology. This problem needs to be resolved so that the informational entropy concept can be set on an objective and reliable theoretical basis and thereby achieve widespread use in the solution of water-resources problems based on information and/or uncertainty.

Baran et al. (2017b) defined entropy as the variation of information content rather than as the reduction of uncertainty (i.e., the amount of information gained). The method is demonstrated on series of monthly observations at selected stream gaging stations. The results confirm that the new entropy definition leads to results that are more reliable and can be effectively used for the solution of water resources engineering problems related to uncertainty and information. The mathematical formulation developed does not depend on the use of discretizing intervals, so that a single value for the variation of information can be obtained. According to the authors, the new definition describes the concept not as an absolute measure of information but as a measure of the variation of information.

3 Entropy Concept as Variation of Information

3.1 Entropy Concept as Variation of Information for Continuous Variables

Informational entropy has been defined as the variation of information, which indirectly equals the amount of uncertainty reduced by making observations. To develop such a definition, two measures of probability, p and q (p and q ∈ K), are considered in the probability space (Ω, K). Here, q represents a priori probabilities (i.e., probabilities prior to making observations). When a process is defined in such a probability space, the information conveyed when the process assumes a finite value A {A ∈ K} in the same probability space is expressed by Eq. (4):
$$ I=-\mathit{\ln}\left(\frac{p(A)}{q(A)}\right) $$
(4)
The process defined in Ω can assume one of the finite and discrete events (A1, …, An) ∈ K; thus, the entropy expression for any value An can be written as in Eq. (5):
$$ H\left(p/q\right)=-\mathit{\ln}\left(\frac{p\left({A}_n\right)}{q\left({A}_n\right)}\right)\kern0.5em \left(n=1,\dots, N\right) $$
(5)
The total information content of the probability space (Ω, K) can be defined (Eq. (6)) as the expected value of the information content of its elementary events:
$$ H\left(p/q\right)=-\sum p\left({A}_n\right).\mathit{\ln}\left(\frac{p\left({A}_n\right)}{q\left({A}_n\right)}\right) $$
(6)
Similarly, the entropy as variation of information H(X/X*) of a random process X in the same probability space can be defined as in Eq. (7):
$$ H\left(X/{X}^{\ast}\right)=-\sum p\left({x}_n\right).\mathit{\ln}\left(\frac{p\left({x}_n\right)}{q\left({x}_n\right)}\right) $$
(7)
where H(X/X*) is in the form of conditional entropy, i.e., the entropy of X conditioned on X*. Here, the condition is represented by an a priori probability distribution function, which can be described as the reference level against which the variation of information in the process can be measured.
In this case, the informational entropy as variation of information in the continuous case can be written as in Eq. (8):
$$ H\left(X/{X}^{\ast}\right)=-\int f(x)\,\ln \left[\frac{f(x)}{q(x)}\right]dx $$
(8)

In Eq. (8), X* represents information available before making observations on the variable X, and X is the a posteriori information (i.e., information obtained by making observations). Similarly, q(x) is the a priori and f(x) the a posteriori probability density function for the random variable X.

Let us assume that the a priori {q(x)} and a posteriori {p(x)} probability distribution functions of the random variable X are known. If the ranges of possible values of the continuous variable X are divided into N discrete and infinitesimally small intervals of width Δx, the entropy expression for this continuous case can be given as in Eq. (9):
$$ H\left(X/{X}^{\ast}\right)=-\int p(x)\,\ln \left[\frac{p(x)}{q(x)}\right]dx $$
(9)

The above expression, as discussed by Guiasu (1977) and Jaynes (1983), describes the variation of information (or, indirectly, the uncertainty reduced by making observations) and replaces the absolute measure of information content given in Eq. (2). At this point, the most important issue is the selection of the a priori distribution. In case the process X is not observed at all, no information is available about it, so that it is completely uncertain. In probability terms, this implies the selection of the uniform distribution. In other words, when no information exists about the variable X, the alternative events it may assume can be represented by equal probabilities, or simply by the uniform probability distribution function. Previous research efforts in which the above mathematical definitions were put forward (Guiasu 1977; Jaynes 1983) did not explicitly express how the probability density functions q(x) and p(x) are to be selected. The present study develops an approach to address this fundamental need, as described and tested in the following sections.
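A minimal discrete sketch of Eq. (9), assuming a uniform a priori distribution (the function name and the example probabilities are ours, not from the study), is given below; written this way, the variation of information is simply the negative of the Kullback-Leibler divergence between the a posteriori and a priori distributions.

```python
import numpy as np

def variation_of_information(p, q):
    """Discrete form of Eq. (9): H(X/X*) = -sum p ln(p/q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                              # terms with p = 0 contribute nothing
    return -np.sum(p[mask] * np.log(p[mask] / q[mask]))

# A priori: uniform (complete uncertainty); a posteriori: an observed histogram
q = np.full(5, 0.2)
p = np.array([0.05, 0.15, 0.40, 0.30, 0.10])

print(variation_of_information(p, q))   # negative of the Kullback-Leibler divergence D(p||q)
print(variation_of_information(q, q))   # 0.0 when no information is gained
```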

If the a priori distribution {q(x)} is assumed to be uniform and the a posteriori distribution {p(x)} of X is assumed to be normal, the informational entropy H(X/X*) can be expressed as in Eq. (10):
$$ H\left(X/{X}^{\ast}\right)=\mathit{\ln}\sqrt{2\pi }+ ln\sigma +\frac{1}{2}-\ln \left(b-a\right) $$
(10)
by integrating Eq. (9). The first three terms in this equation represent the marginal entropy of X and the last term stands for the maximum entropy. Accordingly, the variation of information can be expressed simply as follows:
$$ H\left(X/{X}^{\ast}\right)=H(X)-{H}_{max} $$
(11)
If a posteriori distribution of X is assumed to be lognormal, the informational entropy H(X/X*) becomes as in Eq. 12:
$$ H\left(X/{X}^{\ast}\right)=\mathit{\ln}\sqrt{2\pi }+\mathit{\ln}{\sigma}_y+{\mu}_y+\frac{1}{2}-\ln \left(b-a\right) $$
(12)
with μy and σy being the mean and standard deviation of y = ln x.

In the approach presented herein, progress is made from the most uncertain condition prior to any observation (represented by the assumption of a uniform distribution with equal probabilities for all values) toward the probability distribution foreseen on the basis of the monitored values. The variation of information under the assumed a posteriori distribution can then be computed through Eq. (10) for the normal and Eq. (12) for the lognormal distribution.
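The closed forms in Eqs. (10) to (12) can be evaluated directly from sample statistics. The sketch below is an illustration only (the helper names and the synthetic series are assumptions), taking the observed range of the data as the support [a, b] of the uniform a priori distribution.

```python
import numpy as np

def h_normal(x):
    """Eq. (10): H(X/X*) = ln sqrt(2*pi) + ln sigma + 1/2 - ln(b - a), uniform prior on [a, b]."""
    sigma = np.std(x, ddof=1)
    rng = x.max() - x.min()                      # observed range b - a
    return np.log(np.sqrt(2 * np.pi)) + np.log(sigma) + 0.5 - np.log(rng)

def h_lognormal(x):
    """Eq. (12): same expression written for y = ln x, with the mean of y added."""
    y = np.log(x)
    mu_y, sigma_y = np.mean(y), np.std(y, ddof=1)
    rng = x.max() - x.min()
    return np.log(np.sqrt(2 * np.pi)) + np.log(sigma_y) + mu_y + 0.5 - np.log(rng)

rng_gen = np.random.default_rng(42)
x = rng_gen.normal(500.0, 80.0, size=600)        # synthetic, strictly positive sample
print(h_normal(x), h_lognormal(x))
```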

3.2 The Meaning of the Variation of Information: The Distance between Two Continuous Distributions

The Max Norm can be used to measure the distance between two functions defined in the probability space and to assess whether these two functions approach each other. According to the Max Norm, the distance between two functions p(x) and q(x) is defined as in Eq. (13):
$$ \Delta \left(p,q\right)={\sup}_{-\infty <x<+\infty}\left\Vert p(x)-q(x)\right\Vert $$
(13)
When p(x) and q(x) are used to represent the standardized normal and the standardized uniform distribution functions, respectively, the difference function {h(x)} will be as follows:
$$ h(x)=p(x)-q(x) $$
(14)
It may be observed in Fig. 1 that the critical values of the difference function are h0, h1, and h2, so that the difference between the two functions {Δ(p, q)} can be expressed as in Eq. (15):
$$ \Delta \left(p,q\right)=\mathit{\max}\left\{{h}_0,{h}_1,{h}_2\right\} $$
(15)
Fig. 1

Critical values of the difference function as defined by the Max Norm

Based on the half-range value "a" in Fig. 1, the critical values h0, h1, and h2 can be obtained as in Eqs. (16) to (18):
$$ {h}_0=\frac{1}{\sqrt{2\pi }}-\frac{1}{2a} $$
(16)
$$ {h}_1=\frac{1}{2a}-\frac{1}{\sqrt{2\pi }}{e}^{\left(\frac{-{a}^2}{2}\right)} $$
(17)
$$ {h}_2=\frac{1}{\sqrt{2\pi }}{e}^{\left(\frac{-{a}^2}{2}\right)} $$
(18)
The problem here is then to find the half-range value "a" that minimizes the distance Δ(p, q) of Eq. (15). The critical half-range value "a" that satisfies this condition is given in Eq. (19):
$$ a=\frac{3}{4}\sqrt{2\pi } $$
(19)

At the above critical half-range value “a”, which is obtained by Max Norm, it is possible to use the two functions p(x) and q(x) interchangeably with an optimum number of observations.

When the two points represented by the a posteriori and a priori distribution functions, p(x) and q(x), respectively, approach each other in the same probability space, this indicates, in information terms, an increase of information about the analyzed random process. The case when the two points coincide represents total information availability about the process.
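The Max Norm distance of Eq. (13) can also be approximated numerically on a grid. The following sketch (illustrative only; the grid resolution and the half-range values evaluated are our assumptions) compares the standard normal density with uniform densities on [-a, a], including the critical half-range value of Eq. (19).

```python
import numpy as np

def max_norm_distance(a, n=100001):
    """Approximate sup |p(x) - q(x)| (Eq. 13) for p = N(0,1) pdf and q = U(-a, a) pdf."""
    x = np.linspace(-6.0, 6.0, n)
    p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # standard normal density
    q = np.where(np.abs(x) <= a, 1.0 / (2 * a), 0.0)    # uniform density on [-a, a]
    return np.max(np.abs(p - q))

for a in (1.0, 0.75 * np.sqrt(2 * np.pi), 4.0):         # includes the critical value of Eq. (19)
    print(round(a, 4), round(max_norm_distance(a), 4))
```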

3.3 Determination of Confidence Limits for Entropy Defined by Variation of Information

The proofs above showed that (i) the variation of information expression approaches a constant value, and (ii) the two functions q(x) (the uniform distribution function representing the a priori probabilities, i.e., probabilities prior to making observations) and p(x) (the normal distribution function representing the a posteriori probabilities, i.e., information obtained by making observations) can be used interchangeably with an optimum number of observations. Hence, it is possible to determine the confidence limits once the observations have reached the optimum number for using the posterior distributions.

If the observed range [a, b] of the variable X is considered also as the population value of the range, R, of the variable, the maximum information content of the variable may be described as in Eq. (20):
$$ {H}_{max}= lnR $$
(20)
with;
$$ R=b-a\kern2.75em a<x<b $$
The confidence limits (acceptable region) of entropy can be determined by using the a posteriori probability distribution functions. If the normal probability density function is selected, the maximum entropy for the standard normal variable z ~ N(0, 1) is:
$$ {H}_{max}(z)=\mathit{\log}{R}_z $$
(21)
with the range of z being,
$$ {R}_z=2a $$
Here, the value "a" describes the half-range of the variable. Then, the maximum entropy for the variable x with N(μ, σ) is:
$$ {H}_{max}(x)=\log \left({R}_z\sigma \right) $$
(22)
If the critical half-range value is foreseen as:
$$ a=4\sigma $$
(23)
the area under the normal curve may be approximated as 1.
For this half-range value, replacing the appropriate values in Eq. (24), one obtains the acceptable entropy value for the normal probability density function as in Eq. (25):
$$ H\left(X/{X}^{\ast}\right)=H(X)-{H}_{max} $$
(24)
$$ H{\left(X/{X}^{\ast}\right)}_{cr}=0.6605 $$
(25)
using natural logarithms. When the entropy H(X/X*) of the variable, which is assumed to be normal, remains below the above value, one may decide that the normal probability density function is acceptable and that a sufficient amount of information has been collected about the process.
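The constant in Eq. (25) can be reproduced directly: with the half-range a = 4 of Eq. (23), the magnitude of H(X) − Hmax reduces to ln(2a) − ln√(2π) − 1/2, independently of σ. The following one-line check is our own illustration.

```python
import numpy as np

# Critical entropy value for the normal assumption (Eq. 25), with a = 4 as in Eq. (23):
# |H(X) - Hmax| = |ln(sigma*sqrt(2*pi)) + 1/2 - ln(2*a*sigma)| = ln(2*a) - ln(sqrt(2*pi)) - 1/2
a = 4.0
h_cr = np.log(2 * a) - np.log(np.sqrt(2 * np.pi)) - 0.5
print(round(h_cr, 4))   # 0.6605, independent of sigma
```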
If the a posteriori distribution function is selected as lognormal LN(μy, σy), the variation of information for the variable x can be determined as:
$$ H\left(X/{X}^{\ast}\right)=\mathit{\log}\left[2 Sinh\left(a{\sigma}_y\right)\right]-\mathit{\log}{\sigma}_y-1.4189 $$
(26)
Here, since lognormal values will be positive, one may consider e^(−aσy) ≈ 0, so that log[2 sinh(aσy)] ≈ aσy. Then the acceptable value of H(X/X*) for the lognormal distribution function will be as given in Eq. (27):
$$ H\left(X/{X}^{\ast}\right)=a{\sigma}_y-\mathit{\log}{\sigma}_y-1.4189 $$
(27)

One may observe here that no single constant value exists to describe the confidence limit for the lognormal distribution. Even if the critical half-range is determined, the confidence limits will vary according to the variance of the variable. However, if the variance of x is known, one can compute the confidence limits (Baran et al. 2017a, b).

If the variable is actually normally distributed and if a sufficient number of observations is obtained, the entropy of Eq. (24) will approach a value that lies within the acceptable region. This is the case where one may state that sufficient information has been collected about the process.

Consequently, as the mathematical justification above shows, the definition of the variation of information provides a proximity measure between two density functions (uniform-normal; uniform-lognormal) defined in the probability space. When the marginal entropy is defined according to the probability structure foreseen for the investigated process, it serves to determine a confidence interval for evaluating whether the probability distribution was properly selected. A decision on the convenient selection of the a posteriori distribution structure can thus be made through the measure of the variation of information by comparing it against the confidence interval.

4 Available Data

There are different climates in various parts of Turkey due to its irregular topography. The Taurus Mountains are close to the shore, and rain clouds cannot penetrate to the interior of the country; as the clouds pass over the mountains and reach Central Anatolia, they no longer have a significant ability to produce rain. The North Black Sea Mountains and the Caucasus Mountains hold the rain clouds, which is why the region behind them is affected by long and very cold winters and a continental climate. In the eastern mountains, minimum temperatures between −30 °C and −38 °C are observed, and snow can fall for 120 days a year. Winters often bring heavy snowfall, and the villagers in this area can remain isolated for several days during winter storms (DMI 2002; Bacanli et al. 2008).

In Turkey, summers are hot and dry, with temperatures above 30 °C. Although spring and autumn are usually mild, abrupt hot and cold spells are frequent in both seasons. Annual average precipitation is about 500–800 mm, with actual quantities determined by elevation. The annual average precipitation over Turkey is shown in Fig. 2.
Fig. 2

Annual average precipitation of Turkey (DMI 2016)

Western Anatolia has a mild Mediterranean climate with average temperatures of 9 °C in winter and 29 °C in summer. Similar climatic conditions are observed on the southern coast of Anatolia. The climate of the Anatolian Plateau is continental, with a large temperature difference between day and night; the average temperature is 23 °C in summer and −2 °C in winter. The climate in the Black Sea region is wet and humid (23 °C in summer, 7 °C in winter). Winters are long in Eastern Anatolia and Southeastern Anatolia, with snow from November to the end of April (average temperatures of −13 °C in winter and 17 °C in summer). The long-term average temperature of Turkey is shown in Fig. 3.
Fig. 3

Long term mean temperature of Turkey (MGM 2017)

All meteorological data that help demonstrate such regional differences in meteorological characteristics, and that constitute the information basis of the analyses conducted in the presented study, were provided by the Turkish State Meteorological Service (DMI). Precipitation data of 60 DMI gaging stations were investigated, with an observation period from January 1950 to December 1998. From the total set of stations, only those that have sufficient record lengths and that are not associated with significant trends were included in the analyses before any effort to identify convenient probability structures. Trend analyses were performed on precipitation data for climatological/meteorological observation stations (Automated Weather Observing System, AWOS), each located in one of 60 major cities of Turkey. It is noteworthy that the trend analyses indicated significant trend appearances in monthly series even though no actual trend was detected in the annual precipitation data.

5 Application to Goodness of Fit to Annual Precipitation

Baran et al. (2017b) showed, for normally distributed synthetic data sets, that the assumption of a uniform a priori distribution and a normal a posteriori distribution is accepted, whereas the assumption of a uniform a priori distribution and a lognormal a posteriori distribution is rejected. They also performed similar exercises by generating lognormally distributed synthetic series and assuming the a posteriori distribution first as lognormal and then as normal. In that case, the assumption of a uniform a priori distribution and a normal a posteriori distribution was rejected, while the assumption associated with the uniform and lognormal distribution combination was accepted.

In the presented study, the annual mean precipitation data sets captured in major cities were investigated through goodness of fit tests. The trend analyses through both parametric and non-parametric tests (including Spearman-Rho and Mann-Kendall) showed that there are no significant trends in the data series employed.
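As a rough illustration of such a screening step, the sketch below applies a rank-based (Spearman-Rho type) trend check to a synthetic annual series; the data, the significance level and the helper function are assumptions, and the study's actual trend-testing procedure may differ in detail.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_trend(series, alpha=0.05):
    """Rank-correlate values with their time index; a small p-value suggests a monotonic trend."""
    t = np.arange(len(series))
    rho, p_value = spearmanr(t, series)
    return rho, p_value, p_value < alpha

rng = np.random.default_rng(0)
annual_precip = rng.normal(650.0, 120.0, size=49)    # synthetic 1950-1998 annual totals (mm)
rho, p, significant = spearman_trend(annual_precip)
print(f"rho = {rho:.3f}, p = {p:.3f}, significant trend: {significant}")
```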

The best fit Probability Distribution Function (PDF) was explored for each precipitation gaging station. For this purpose, the basic statistics (mean, standard deviation, skewness, and excess coefficients) were calculated first, and the parameters of the best fit PDF indicated by the Chi-square tests were then computed.

In the following step, gaging stations were investigated in terms of entropy as the variation of information. Basic statistical parameters for the normal and lognormal distributions were calculated. This analysis was followed by the computation of the marginal entropy H(X), the maximum entropy Hmax and the variation of information H(X/X*) for the relevant distribution functions. The computations were carried out in a successive manner, using the first year's 12 monthly total precipitation data, then the second year's 24, and so on until the total number of data is reached. Tests of the goodness of fit for both distributions by entropy as variation of information yielded the "Suitable" indication for the entire set of cases. The results are given as an example for eight selected cities out of 60, located in different parts of Turkey. The eight indicative stations (each from a different city) were selected to exemplify the total set of 60 stations with the consideration that each represents a different climatic zone within the territory of Turkey. The climatic zones that appear in Fig. 4 indicate the wet, humid, medium-humid, medium-dry, dry and very dry zones associated with the Zonguldak, Adapazarı, Manisa, Izmir, Ankara and Adıyaman stations in the respective order. In designating the climate zones, different patterns in precipitation, evaporation, evapotranspiration and runoff characteristics were considered. In addition, the selections for the medium-humid and medium-dry zones were made in doubles due to the prominent differences within these climate zones in terms of precipitation pattern and precipitation variation.
Fig. 4

Climate zones in Turkey and selected AWOS’ locations
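The successive evaluation described above (computing H(X), Hmax and H(X/X*) from the first 12, 24, 36, ... monthly values and comparing the result with the confidence limit) can be sketched as follows for the normal case of Eqs. (10), (11) and (25). The synthetic series and the helper name are our assumptions, not the study's implementation.

```python
import numpy as np

H_CR_NORMAL = 0.6605   # critical value from Eq. (25)

def variation_of_information_normal(x):
    """|H(X) - Hmax| under a normal a posteriori and uniform a priori assumption (Eqs. 10-11)."""
    sigma = np.std(x, ddof=1)
    h_x = np.log(sigma * np.sqrt(2 * np.pi)) + 0.5      # marginal entropy H(X)
    h_max = np.log(x.max() - x.min())                    # maximum entropy ln(b - a)
    return abs(h_x - h_max)

rng = np.random.default_rng(1)
monthly = rng.normal(70.0, 20.0, size=49 * 12)           # synthetic monthly precipitation (mm)

# Use the first 12, 24, 36, ... values, as in the successive evaluation of the record
for years in (5, 10, 20, 49):
    h = variation_of_information_normal(monthly[: years * 12])
    print(years, round(h, 3), "Suitable" if h < H_CR_NORMAL else "Rejected")
```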

The goodness of fit results for selected AWOS are presented in Tables 1, 2 and 3 (the Chi-square results in Table 1 and entropy results for normal and lognormal distributions in Tables 2 and 3, respectively).
Table 1

Results of Chi-square test for the considered distributions (χ²cr = 5.99 for both distributions)

Station name    Normal    Log-normal
Adapazarı       2.89      2.89
Adıyaman        0.67      1.22
Afyon           4.36      8.44
Amasya          2.00      2.52
Ankara          5.02      3.53
Izmir           0.90      1.31
Manisa          2.32      1.10
Zonguldak       8.04      7.43

Bold figures show the best fitted distributions

Table 2

Results of informational entropy method for normal distribution

Station name    Annual mean (mm)    Range (mm)    Hmax     H(X)     H(X/X*)    Test result
Adapazarı       821.67              540.9         5.87     5.21     0.65       Suitable
Adıyaman        729.98              606.75        5.91     5.61     0.30       Suitable
Afyon           420.67              385.8         5.11     4.66     0.45       Suitable
Amasya          455.28              289.1         4.97     4.75     0.22       Suitable
Ankara          399.15              364.5         4.80     4.66     0.14       Suitable
Izmir           680.06              760.5         5.94     5.63     0.31       Suitable
Manisa          733.24              788.4         5.98     5.63     0.34       Suitable
Zonguldak       1218.69             1065.1        6.46     5.67     0.79       Rejected
Table 3

Selected summarized results of informational entropy method for lognormal distribution

Station name    Hmax     H(X)     H(X/X*)    Hcr       Test result
Adapazarı       5.87     3.73     2.14       2.1434    Suitable
Adıyaman        5.91     4.19     1.71       2.0746    Suitable
Afyon           5.11     3.22     1.90       2.1050    Suitable
Amasya          4.97     3.31     1.66       2.1002    Suitable
Ankara          4.80     3.23     1.57       2.0909    Suitable
Izmir           5.94     4.20     1.74       2.0937    Suitable
Manisa          5.98     4.21     1.76       2.0774    Suitable
Zonguldak       6.46     4.20     2.25       2.1215    Rejected

6 Results and Discussion

Figures 5, 6, 7, 8, 9 and 10 summarize, for six of the eight selected AWOS, the precipitation series and the entropy-based tests of the goodness of fit of the normal and lognormal probability distribution functions.
Fig. 5

Precipitation time series under the assumption of an a posteriori (a) normal and (b) lognormal probability distribution function for Adapazarı AWOS, where Hmax - Maximum Entropy, H(X) - Marginal Entropy, H(X/X*) - Variation of Information and CL - confidence limits for the posterior distribution

Fig. 6

Precipitation time series under the assumption of an a posteriori (a) normal and (b) lognormal probability distribution function for Adıyaman AWOS, where Hmax - Maximum Entropy, H(X) - Marginal Entropy, H(X/X*) - Variation of Information and CL - confidence limits for the posterior distribution

Fig. 7

Precipitation time series under the assumption of an a posteriori (a) normal and (b) lognormal probability distribution function for Ankara AWOS, where Hmax - Maximum Entropy, H(X) - Marginal Entropy, H(X/X*) - Variation of Information and CL - confidence limits for the posterior distribution

Fig. 8

Precipitation time series under the assumption of an a posteriori (a) normal and (b) lognormal probability distribution function for Izmir AWOS, where Hmax - Maximum Entropy, H(X) - Marginal Entropy, H(X/X*) - Variation of Information and CL - confidence limits for the posterior distribution

Fig. 9

Precipitation time series under the assumption of an a posteriori (a) normal and (b) lognormal probability distribution function for Manisa AWOS, where Hmax - Maximum Entropy, H(X) - Marginal Entropy, H(X/X*) - Variation of Information and CL - confidence limits for the posterior distribution

Fig. 10

Precipitation time series under the assumption of an a posteriori (a) normal and (b) lognormal probability distribution function for Zonguldak AWOS, where Hmax - Maximum Entropy, H(X) - Marginal Entropy, H(X/X*) - Variation of Information and CL - confidence limits for the posterior distribution

The results showed that the same Chi-square statistics were obtained for both distributions at the Adapazarı station (Table 1). The values for the variation of entropy computed from the complete series indicate that both normal and lognormal distributions can be accepted (Tables 2 and 3). However, it can be observed from Fig. 5a and b that the value for the variation of information fell outside the confidence limits for both distributions in the first 17 years of the monitoring period. With the information gained through observations in later periods, the changing values of the variation of information indicate the better fit of the normal distribution.

The Chi-square computations again display the good fit of both distributions in the case of the Adıyaman station, with a slightly better performance indicated for the normal distribution (Table 1). The values for the variation of information computed from the entire series again indicate acceptability of both normal and lognormal distributions (Tables 2 and 3). Figure 6a and b show that the values for the variation of information lie inside the confidence limits for both distributions. The improved quantity of information provided by the increased length of the observation period indicates a better fit for the normal distribution, as reflected by the associated values of the variation of information.

In the case of the Ankara station, acceptability of both distributions is indicated by the Chi-square statistics, but the lognormal distribution appears preferable (Table 1). The variation of entropy computed over the entire period, on the other hand, indicates the suitability of both distributions (Tables 2 and 3). The variation of information is again inside the confidence limits for both distributions (Fig. 7a and b). As the values for the variation of information indicate, the lognormal distribution shows better performance as the observation period extends.

As can be seen from the mean total precipitation and range values (Table 2), the average annual precipitation at the Manisa station is almost 1.6 times that at the Amasya station. The range/mean ratios were determined as 0.64 and 1.07 for the Amasya and Manisa stations, respectively. While the best fit distribution is normal for the Amasya station, the analyses indicate the suitability of the lognormal distribution for the Manisa station (Table 1 and Fig. 9). It is also observed from Table 1 and Fig. 8 that the Izmir station receives about 1.62 times the precipitation of the Afyon station, with range/mean ratios of 0.92 and 1.12 for the Afyon and Izmir stations, respectively. The normal distribution appeared to be the best fit for both of these stations.

The Chi-square results showed that both distributions can be accepted for the Izmir station, although the normal distribution is more suitable (Table 1). The values of the variation of entropy calculated from the whole series indicate the same result (Tables 2 and 3). When Fig. 8a and b are evaluated, it is seen that the variation of information remains within the confidence limits for both distributions. The information gained with the increasing number of observations shows that the developing pattern of the variation of information gives a slightly better indication for accepting the lognormal distribution, as opposed to the Chi-square result.

The Chi-square results showed that both distributions can be accepted for the Manisa station, but the lognormal distribution is more suitable than the normal distribution (Table 1). The values of the variation of entropy calculated from the whole series lead to the same inference (Tables 2 and 3). Figure 9a and b demonstrate that the variation of information remains within the confidence limits for both distributions, while the increase in data length over the monitored years, together with the associated values of the variation of information, provides clues for accepting the lognormal distribution.

As regards the data monitored at the Zonguldak station, the Chi-square results suggest rejection of both distributions (Table 1). The variation of entropy computed from the entire data series also indicates rejection of both (Tables 2 and 3). On the other hand, the variation of information statistic seems to stay inside the confidence limits for both distributions up to the end of years 6 and 7, but with the inclusion of data from later periods the changing pattern of the variation of information shows that the confidence limits are exceeded for both distributions, as given in Fig. 10a and b.

7 Conclusions

In the presented study, entropy analyses were carried out to investigate the validity of the probability distributions identified from the time series data recorded at meteorological stations. To this end, the concept introduced through the given mathematical approach was named the variation of information, and by using this new definition, the distribution fitting performances of the normal and lognormal distributions were assessed against the associated confidence limits.

The computational results were evaluated in comparison to the Chi-square statistic, which has been widely used in investigating the goodness of fit of a variety of distributions. As a result of the evaluations of the normal and lognormal distributions for the monthly total rainfall time series, the variation of information concept was found to give results in line with the Chi-square values.

The feasibility gained through the concept of the variation of information, in allowing tests for the acceptance of posterior distributions estimated from the available observations in the time series, is expected to contribute to analyses that specifically consider the distributional structure of investigated time series, and to efforts for examining the compatibility between recorded time series and the synthetic series resulting from modelling.

Consequently, the presented application of entropy as variation of information can be regarded as an effective tool for evaluating hydrological data alongside other testing methods, and as a potential tool for investigating climate change effects, drought analysis, and decision making processes.

Notes

Acknowledgments

A previous shorter version of the paper was presented at the 10th World Congress of EWRA, "Panta Rhei", Athens, Greece, 5–9 July 2017.

Compliance with Ethical Standards

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Abbas K, Alamgir, Khan SA, Khan DM, Ali A, Khalil U (2012) Modeling the distribution of annual maximum rainfall in Pakistan. Eur J Sci Res 79(3):418–429. ISSN 1450-216X
  2. Bacanli UG, Dikbas F, Baran T (2008) Drought analysis and a sample study of Aegean Region. 6th International Conference on Ethics and Environmental Policies, Italy
  3. Baran T, Bacanli UG (2006) Evaluation of suitability criteria in stochastic modeling. European Water 13/14:35–43
  4. Baran T, Bacanli UG (2007a) An entropy approach for diagnostic checking in time series analysis. Water SA 33(4):487–496
  5. Baran T, Bacanli UG (2007b) Evaluation of goodness of fit criterion in time series analysis. Digest 2006 17:1089–1102
  6. Baran T, Barbaros F (2015) Testing the goodness of fit by informational entropy. European Water Resources Association 9th World Congress, Water Resources Management in a Changing World: Challenges and Opportunities, June 10–13, 2015, Istanbul, CD of Proceedings, 7 p, Book of Abstracts, p 59
  7. Baran T, Barbaros F, Gul A, Onusluel Gul G (2017a) An informational entropy application to test the goodness of fit of probability functions. 10th World Congress of EWRA on Water Resources and Environment, 'Panta Rhei', 5–9 July 2017, Athens, Greece, Congress Proceedings, pp 403–408
  8. Baran T, Harmancioglu N, Cetinkaya CP, Barbaros F (2017b) An extension to the revised approach in the assessment of informational entropy. Entropy 19:634. https://doi.org/10.3390/e19120634
  9. Best DJ, Rayner JCW, Thas O (2012) Comparison of some tests of fit for the inverse Gaussian distribution. Advances in Decision Sciences 2012: Article ID 150303, 9 pages. https://doi.org/10.1155/2012/150303
  10. Cherry C (1957) On human communication: a review, a survey and a criticism. The Technology Press of Massachusetts Institute of Technology, Massachusetts, 333 p
  11. DMI (2002) Turkish State Meteorological Service. www.mgm.gov.tr. Accessed 15 March 2017
  12. DMI (2016) Turkish State Meteorological Service. https://www.mgm.gov.tr/FILES/arastirma/yagis-degerlendirme/2016alansal.pdf. Accessed 1 March 2017
  13. Girardin V, Lequesne J (2017) Entropy-based goodness-of-fit tests—a unifying framework: application to DNA replication. Commun Stat Theor M. https://doi.org/10.1080/03610926.2017.1401084
  14. Guiasu S (1977) Information theory with applications. McGraw-Hill, New York, 439 p. ISBN 978-0070251090
  15. Harmancioglu N, Singh VP (1998) Entropy in environmental and water resources. In: Herschy RW, Fairbridge RW (eds) Encyclopedia of hydrology and water resources, vol 5. Kluwer Academic Publishers, Dordrecht, pp 225–241. ISBN 978-1-4020-4497-7
  16. Harmancioglu N, Singh VP (2002) Data accuracy and data validation. In: Sydow A (ed) Encyclopedia of life support systems (EOLSS); knowledge for sustainable development, theme 11 on environmental and ecological sciences and resources, chapter 11.5 on environmental systems, vol 2. UNESCO Publishing-Eolss, Oxford, pp 781–798. ISBN 0-9542989-0-X
  17. Harmancioglu N, Singh VP, Alpaslan N (1992) Versatile uses of the entropy concept in water resources. In: Singh VP, Fiorentino M (eds) Entropy and energy dissipation in water resources, vol 9. Kluwer Academic Publishers, Dordrecht, pp 91–117. ISBN 978-94-011-2430-0
  18. Jaynes ET (1983) In: Rosenkrantz RD (ed) Papers on probability, statistics and statistical physics. Springer, Dordrecht, 458 p. ISBN 978-94-009-6581-2
  19. Lee S, Vontab I, Karagrigoriou A (2011) A maximum entropy type test of fit. Comput Stat Data An 55:2635–2643
  20. MGM (2017) Turkish State Meteorological Service. https://www.mgm.gov.tr/FILES/resmi-istatistikler/parametreAnalizi/Turkiye-Ortalama-Sicaklik.pdf. Accessed 10 Dec 2016
  21. Pfeiffer PE (1965) Concepts of probability theory. McGraw-Hill Book Company, New York, 399 p. ISBN 978-0486636771
  22. Pierce JR (1961) Symbols, signals and noise: the nature and process of communication. Harper and Row Publishers Inc, New York. ISBN 978-0061392320
  23. Shannon CE (1964) A mathematical theory of communication. In: Shannon, Weaver (eds) The mathematical theory of communication. The University of Illinois Press, Urbana
  24. Shannon CE, Weaver W (1949) The mathematical theory of communication. University of Illinois Press, Urbana, 144 p
  25. Sharifdoost M, Nematollahi N, Pasha E (2009) Goodness of fit test and test of independence by entropy. Journal of Mathematical Extension 3(2):43–59
  26. Singh VP (1997) The use of entropy in hydrology and water resources. Hydrol Process 11:587–626
  27. Singh VP (1998) Entropy-based parameter estimation in hydrology. Kluwer Academic Publishers, Dordrecht, 364 p
  28. Singh VP (2003) The entropy theory as a decision making tool in environmental and water resources. In: Karmeshu (ed) Entropy measures, maximum entropy principle and emerging applications. Studies in fuzziness and soft computing, vol 119. Springer, Berlin, pp 261–297. ISBN 978-3-540-36212-8
  29. Weiss BA, Dardick W (2016) An entropy-based measure for assessing fuzziness in logistic regression. Educ Psychol Meas 76(6):986–1004
  30. Zeng X, Wang D, Wu J (2015) Evaluating the three methods of goodness of fit test for frequency analysis. Journal of Risk Analysis and Crisis Response 5(3):178–187

Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  1. Faculty of Engineering, Department of Civil Engineering, Dokuz Eylul University, Izmir, Turkey
