
1 Introduction

Statistics is a discipline that models the world by collecting and analyzing data rather than by exact inference and proof. Many kinds of natural data have been fitted to statistical distributions with notable success, and once such a distribution is established scientists can exploit its off-the-shelf properties. Unfortunately, the diversity of the natural world is far greater than what these standard distributions can fully capture. Polynomial regression is an alternative approach for automatically discovering empirical laws from data [1]. It can fit a wide variety of data patterns and has demonstrated strong fitting ability, and researchers who use it are conveniently spared from having to estimate the data distribution beforehand [2]. Regrettably, the result of polynomial regression is a polynomial expression that carries fewer prior properties than a traditional statistical distribution does.

Given that the features of natural scientific data are relatively fixed and will continue to be studied in the future, it is worthwhile to determine the best fitting method for each scientific data feature; we name this choice the fitting label. As a solution to the dilemma above, we propose a discriminative fitting model that helps scientists automatically fit scientific data to the optimal expression. Once the fitting label of a feature is known, later scientists can directly update the parameters of the fitting expression as new data arrive. We carried out experiments on an astronomical data set, the ATNF Pulsar Catalogue [3]. In addition, since the reliability of the P value is still controversial [4], and mean squared error and KL divergence are trusted measures in both statistics and machine learning, we took the latter two into consideration as complementary criteria.

The rest of this paper is organized as follows: the comprehensive fitting model is proposed in Sect. 2; the corresponding experiments are presented and discussed in Sects. 3 and 4; the conclusions are drawn in the final section.

2 Methodology

2.1 Comprehensive Fitting Model

As shown in Algorithm 1, the selection model consists of three main steps. Step 1 determines, through significance testing, whether the data follow a certain natural distribution. Step 2 fits the data to the significant distributions, performs polynomial regression, and calculates the respective similarities. Step 3 classifies the data's fitting label with a distance-based classifier according to those similarities. In this way, scientists can learn as many properties of the data as possible.

[Algorithm 1. Comprehensive fitting model]
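To make the three steps concrete, the following Python sketch outlines how they could be chained together. The helper names (screen_significant, average_similarity, fitting_label) are hypothetical and correspond to the step-by-step sketches given in Sects. 2.2-2.4 below; this is an outline under those assumptions, not code from the paper.

from scipy import stats

def comprehensive_fit(data, candidate_families):
    # Step 1: keep only the candidate families that pass the significance test (Sect. 2.2).
    significant = screen_significant(data, candidate_families)
    # Step 2: average similarity distance of each surviving family, here using
    # KL divergence; MSE and a polynomial-regression fallback could be scored
    # analogously (Sect. 2.3).
    distances = {name: average_similarity(data, getattr(stats, name))[0]
                 for name in significant}
    # Step 3: 1-NN selection, i.e. the minimal-distance expression (Sect. 2.4).
    return fitting_label(distances)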

2.2 Statistical Significance Testing

As shown in Algorithm 2, we first ask human experts, astronomers for example, to provide candidate statistical distribution families for each feature. These distribution families, with empirically estimated parameters, are then tested one by one for statistical significance under the selected significance measures, usually the P value. Concretely, for a specified natural science feature, we perform significance tests to screen for distribution families that reach a statistically significant level on the feature data. The candidate families and the selected significance measures are fed into the testing process; if the resulting P value reaches the threshold, a significance level of 0.05 by convention, the distribution is recorded as a significant distribution.
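One way to realize this screening step is sketched below in Python. The Kolmogorov-Smirnov test (via SciPy) and the example candidate family names are illustrative assumptions, since the paper only states that a significance measure such as the P value is used.

from scipy import stats

def screen_significant(data, candidates, alpha=0.05):
    """Return the names of candidate families whose fitted distribution
    is not rejected by a Kolmogorov-Smirnov test at level alpha."""
    significant = []
    for name in candidates:
        dist = getattr(stats, name)          # e.g. stats.lognorm, stats.gamma
        params = dist.fit(data)              # maximum-likelihood parameter estimates
        _, p_value = stats.kstest(data, name, args=params)
        if p_value >= alpha:                 # the fit cannot be rejected at level alpha
            significant.append(name)
    return significant

# Illustrative call with expert-suggested families:
# screen_significant(p0, ["lognorm", "gamma", "expon"])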

[Algorithm 2. Statistical significance testing]
[Algorithm 3. Similarity evaluation]

2.3 Similarity Evaluation

In this step, shown in Algorithm 3, we use similarity measures to assess how well each expression fits the data. The significant statistical distributions obtained from the previous step are taken as input, together with the selected similarity measures. We perform t repetitions of a k-fold cross-validation test to obtain the average similarity between each distribution and the scientific data.
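A minimal Python sketch of such a repeated cross-validation evaluation is given below, using SciPy and scikit-learn. The histogram-based comparison and the defaults for t, k, and the number of bins are illustrative choices, not details taken from the paper.

import numpy as np
from scipy import stats
from sklearn.model_selection import KFold

def average_similarity(data, dist, t=5, k=10, bins=30):
    """Average KL divergence and MSE between a fitted distribution and
    held-out data over t repetitions of k-fold cross-validation."""
    kl_scores, mse_scores = [], []
    for rep in range(t):
        folds = KFold(n_splits=k, shuffle=True, random_state=rep)
        for train_idx, test_idx in folds.split(data):
            params = dist.fit(data[train_idx])
            # Compare the fitted density with a histogram of the held-out fold.
            edges = np.histogram_bin_edges(data[test_idx], bins=bins)
            centers = 0.5 * (edges[:-1] + edges[1:])
            empirical, _ = np.histogram(data[test_idx], bins=edges, density=True)
            model = dist.pdf(centers, *params)
            kl_scores.append(stats.entropy(empirical + 1e-12, model + 1e-12))
            mse_scores.append(np.mean((empirical - model) ** 2))
    return np.mean(kl_scores), np.mean(mse_scores)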

2.4 Distance-Based Classification

This step produces the final outcome, the fitting label for the feature data. We treat the similarity measures as distances and run a distance-based classifier to select the expression with the minimal distance as the optimal fitting label. Here we adopt the classic k-Nearest Neighbours classifier (abbreviated as kNN) with its parameter k set to 1.
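Since k = 1, the classification collapses to an argmin over the distances; a minimal sketch is shown below, with made-up numbers in the usage comment rather than the paper's results.

def fitting_label(distances):
    """kNN with k = 1 over the similarity distances reduces to choosing
    the expression whose distance to the data is smallest."""
    return min(distances, key=distances.get)

# Illustrative call (numbers are invented):
# fitting_label({"lognorm": 0.02, "gamma": 0.11, "poly2": 0.35})  ->  "lognorm"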

3 Experiments

A logical starting point is to test for normality, given how common the normal distribution is in the natural world [5]. However, the feature whose scatter diagram is shown in Fig. 1 appears instead to follow a lognormal distribution [6, 7]. We therefore took the logarithm of the data first and then tested the normality of the result, which is shown in Fig. 2. Recall that the logarithm of the scale parameter and the shape parameter of a lognormal distribution are, respectively, the mean and the standard deviation of the corresponding normal distribution.
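A minimal Python sketch of this log-transform-then-test procedure follows; the variable p0 is assumed to hold the pulsar period values, and the D'Agostino-Pearson test from SciPy is an illustrative choice, since the paper does not name the specific normality test used.

import numpy as np
from scipy import stats

log_p0 = np.log(p0)                        # the period feature is strictly positive
stat, p_value = stats.normaltest(log_p0)   # D'Agostino and Pearson's normality test
is_plausibly_normal = p_value >= 0.05      # supports a lognormal fit for p0 itself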

Fig. 1. Original plot

Fig. 2. Logarithm plot

Table 1. Significance testing of P0

The significance results are shown in Table 1. For the classic statistical testing methods, the recorded values are their P values. The results show that the lognormal distribution has the highest significance among the expert-selected distributions.

The similarity evaluation results are shown in Table 2, where the polynomial order is empirically set to 2. The lognormal distribution has the lowest relative entropy and mean squared error, so it is the optimal fitting label for the pulsar period feature. The current parameters of its corresponding normal distribution are a mean of \(-0.53455956344351729\) and a standard deviation of 0.99686283436845236.

Table 2. Similarity evaluation of P0
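Assuming SciPy's parameterization of the lognormal distribution (shape s equal to the standard deviation of the log data, scale equal to the exponential of its mean), the reported parameters translate into a reusable distribution object as sketched below; this illustrates how the fitted expression can be reused, and is not code from the paper.

import numpy as np
from scipy import stats

mu, sigma = -0.53455956344351729, 0.99686283436845236
p0_dist = stats.lognorm(s=sigma, scale=np.exp(mu))

# The fitted distribution can then be evaluated or sampled, e.g.:
# p0_dist.pdf(1.0), p0_dist.cdf(2.0), p0_dist.rvs(size=1000)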

4 Discussions

The insight behind our model is a reflection on the inductive bias of Occam's razor and on the principle of maximum entropy. Occam's razor has been the default heuristic in natural science for hundreds of years; it can be interpreted as stating that, among competing hypotheses, the one with the fewest assumptions should be selected [8, 9]. The same preference for simplicity appears in machine learning's avoidance of overfitting: a low-order fitting curve is almost always better than a high-order one. On the other hand, the second law of thermodynamics states that the total entropy of an isolated system can only increase over time, and it is regarded as an axiom of statistical thermodynamics. From this comes the principle of maximum entropy, which states that the probability distribution that best represents the current state of knowledge is the one with the largest entropy; it also highlights a natural correspondence between statistical mechanics and information theory [10]. A number of statistical distributions widely used by natural scientists to model scientific data can be derived under the principle of maximum entropy. In particular, for any given mean \(\mu \) and variance \(\sigma ^2\), the maximum entropy distribution \(p\left( x|\mu ,\sigma ^2\right) \) is the normal distribution \(N\left( \mu ,\sigma ^2\right) \) [11].
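For completeness, a brief sketch of the standard derivation behind this last statement is as follows; it is the textbook maximum-entropy argument rather than anything specific to our model. One maximizes the differential entropy \(H[p] = -\int p(x)\ln p(x)\,dx\) subject to the constraints \(\int p(x)\,dx = 1\), \(\int x\,p(x)\,dx = \mu \), and \(\int (x-\mu )^2 p(x)\,dx = \sigma ^2\). Setting the functional derivative of the corresponding Lagrangian to zero gives \(p(x) = \exp \left( \lambda _0 + \lambda _1 x + \lambda _2 (x-\mu )^2\right) \), and solving for the multipliers from the constraints yields \(p(x) = \frac{1}{\sqrt{2\pi \sigma ^2}}\exp \left( -\frac{(x-\mu )^2}{2\sigma ^2}\right) \), i.e. the normal distribution \(N\left( \mu ,\sigma ^2\right) \).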

5 Conclusions

We propose a discriminative fitting model through which diverse scientific data can be fitted to the optimal expression, so that as much sample information as possible can be used in future studies. Data that follow a common statistical distribution are fitted to that distribution, with parameters tuned by numerical optimization methods; data that do not are modeled by polynomial regression. Many features across the main fields of natural science still have no fixed fitting label, so building a fitting label database with our model would be valuable work for scientists.