To commemorate the foundation of the Japanese Journal of Statistics and Data Science (JJSD), our third special feature focuses on the relationships and collaborations between information theory and statistics. Research at the interface of these two fields continues to develop and expand, drawing a great deal of attention. This special feature comprises ten contributions, drawn mainly from three research topics: divergence-based statistical inference, high-dimensional sparse learning, and combinatorial design.

A divergence measure, an extension of the notion of distance over the set of probability distributions, is a tool that cuts across the mathematical sciences. In particular, the Kullback-Leibler divergence, also known as the relative entropy, is closely related to the maximum likelihood estimator in statistics and to code length in information theory (Kullback 1959; Cover and Thomas 2006). The information criterion (Akaike 1974) and Bayes coding (Clarke and Barron 2006) are interdisciplinary research topics built on the concept of divergence. Today, important classes of divergence measures, such as the Bregman divergences, are widely applied in data analysis and the information sciences (Bregman 1967; Basu et al. 1998; Fujisawa and Eguchi 2008). The following five articles focus mainly on theoretical analysis and practical applications of divergence measures.
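For concreteness, the two divergences mentioned above may be written as follows; the notation here is ours and is intended only as a reminder. For densities \(g\) and \(f\), and for a differentiable, strictly convex function \(\phi\),
\[
D_{\mathrm{KL}}(g \,\|\, f) = \int g(x) \log \frac{g(x)}{f(x)} \, dx,
\qquad
D_{\phi}(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle .
\]
On the probability simplex, the Bregman divergence generated by the negative Shannon entropy \(\phi(p) = \sum_i p_i \log p_i\) reduces to the Kullback-Leibler divergence.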

  • Machida and Takenouchi (2019) are concerned with non-negative matrix factorization (NMF), a typical means of feature extraction in the framework of unsupervised learning (Lee and Seung 2001). It is well known that the standard NMF algorithm is not robust against outliers. The authors propose robust NMF algorithms that combine statistical modeling of the reconstruction with the \(\gamma\)-divergence (whose definition is recalled after this list).

  • Kawashima and Fujisawa (2019) study an extension of robust and sparse linear regression. They propose robust and sparse generalized linear models (McCullagh and Nelder 1989) based on the \(\gamma\)-divergence.

  • Ihara (2019) develops a new mathematical tool to prove the optimality of a coding scheme achieving the feedback capacity of discrete-time additive Gaussian noise channels. The author also shows that the minimum decoding error probability decreases exponentially, with an exponent that grows linearly in the block length.

  • Sainudiin and Teng (2019) present a data-adaptive multivariate histogram estimator of an unknown density based on independent samples. The authors prove a universal performance guarantee in terms of the \(L_1\) distance.

  • Abe and Fujisawa (2019) study a multivariate skew distribution based on a transformation of scale. They also present further properties of the distribution, such as random number generation, non-degenerate Fisher information, and an entropy maximization property.
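Since the \(\gamma\)-divergence plays a central role in the first two contributions above, we recall its standard form (Fujisawa and Eguchi 2008) for the reader's convenience; the expression below is in our notation. For densities \(g\) and \(f\) and a tuning parameter \(\gamma > 0\),
\[
D_{\gamma}(g, f) = \frac{1}{\gamma(1+\gamma)} \log \int g(x)^{1+\gamma} \, dx
- \frac{1}{\gamma} \log \int g(x) f(x)^{\gamma} \, dx
+ \frac{1}{1+\gamma} \log \int f(x)^{1+\gamma} \, dx ,
\]
which converges to the Kullback-Leibler divergence as \(\gamma \to 0\). Larger values of \(\gamma\) downweight observations that are unlikely under the model, which is the source of the robustness exploited in these articles.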

In the past two decades, the notion of sparsity has attracted a great deal of attention in compressed sensing, high-dimensional statistical inference, and related areas (Tibshirani 1996). Today, advances in measurement technology enable us to observe extremely high-dimensional, complex data, and statisticians are thus required to develop methods that can deal with such data. Sparsity is the assumption that only a small number of elements in high-dimensional data are significant, and regularization techniques that induce a sparsity pattern are useful for finding those elements. \(L_1\)-regularization for linear regression models has been intensively studied in both theory and practice (Hastie et al. 2015). In this special feature, the following three articles concern the sparse structure of high-dimensional data.
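Before turning to these contributions, we recall the lasso formulation (Tibshirani 1996) as a point of reference; the display uses our own notation. Given a response vector \(y \in \mathbb{R}^n\), a design matrix \(X \in \mathbb{R}^{n \times p}\) with \(p\) possibly much larger than \(n\), and a regularization parameter \(\lambda > 0\),
\[
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2n} \| y - X \beta \|_2^2 + \lambda \| \beta \|_1 ,
\]
where the \(L_1\) penalty forces many coordinates of \(\hat{\beta}\) to be exactly zero, thereby selecting a small set of significant covariates.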

  • Komatsu et al. (2019) study the group lasso, in which high-dimensional covariates are assumed to be clustered into groups. The authors propose an information criterion for the group lasso under the framework of generalized linear models and illustrate that it performs comparably to, or better than, cross-validation.

  • Post-selection inference (Lee et al. 2013) is a statistical technique for assessing the significance of variables chosen by a model or variable selection procedure. Umezu and Takeuchi (2019) develop a selective inference framework for binary classification problems. They also conduct several simulation studies to confirm the statistical power of their test.

  • Sparse superposition codes are known to achieve the channel capacity (Joseph and Barron 2012). Takeishi and Takeuchi (2019) derive an improved upper bound on the block error probability under least squares decoding, which is both simpler and tighter than previous bounds.

Combinatorial structures appear in both statistics and information theory. In statistics, experimental design is a classical research topic in which combinatorics has been widely applied (Fisher 1940; Rao 1947). In coding theory, too, combinatorial concepts are important for designing computationally efficient codes. The following two articles provide an interesting connection between combinatorial structure and information processing.

  • Hirao and Sawa (2019) present a characterization theorem for a combinatorial structure called an almost tight Euclidean design. The article also includes a short review of the relationship between Euclidean designs for rotationally symmetric integrals and kernel approximation in machine learning.

  • Lu and Jimbo (2019) review the history, basic problems, and significant results on constructing arrays for combinatorial interaction testing. They also propose explicit constructions of covering arrays using information-theoretic methods.

We are grateful to all the reviewers for their help in refereeing the contributions and for sharing their time and knowledge. We also thank once again all the authors who have contributed interesting work to this special feature.