# P-splines with derivative based penalties and tensor product smoothing of unevenly distributed data

DOI: 10.1007/s11222-016-9666-x

Cite this article as: Wood, S.N. Stat Comput (2017) 27: 985. doi:10.1007/s11222-016-9666-x

## Abstract

The P-splines of Eilers and Marx (Stat Sci 11:89–121, 1996) combine a B-spline basis with a discrete quadratic penalty on the basis coefficients, to produce a reduced rank spline like smoother. P-splines have three properties that make them very popular as reduced rank smoothers: (i) the basis and the penalty are sparse, enabling efficient computation, especially for Bayesian stochastic simulation; (ii) it is possible to flexibly ‘mix-and-match’ the order of B-spline basis and penalty, rather than the order of penalty controlling the order of the basis as in spline smoothing; (iii) it is very easy to set up the B-spline basis functions and penalties. The discrete penalties are somewhat less interpretable in terms of function shape than the traditional derivative based spline penalties, but tend towards penalties proportional to traditional spline penalties in the limit of large basis size. However part of the point of P-splines is not to use a large basis size. In addition the spline basis functions arise from solving functional optimization problems involving derivative based penalties, so moving to discrete penalties for smoothing may not always be desirable. The purpose of this note is to point out that the three properties of basis-penalty sparsity, mix-and-match penalization and ease of setup are readily obtainable with B-splines subject to derivative based penalization. The penalty setup typically requires a few lines of code, rather than the two lines typically required for P-splines, but this one off disadvantage seems to be the only one associated with using derivative based penalties. As an example application, it is shown how basis-penalty sparsity enables efficient computation with tensor product smoothers of scattered data.

### Keywords

Reduced rank spline, P-spline, Smoothing spline, Derivative penalty, Tensor product smooth

## 1 Computing arbitrary derivative penalties for B-splines

Consider representing a smooth function \(f(x)\) using a rank \(k\) spline basis expansion \(f(x) = \sum _{j=1}^k \beta _j B_{m_1,j}(x)\), where \(B_{m_1,j}(x)\) is an order \(m_1\) B-spline basis function, and \(\beta _j \) is a coefficient to be estimated. In this paper order \(m_1 = 3\) will denote a cubic spline. Associated with the spline will be a derivative based penalty

\[ J = \int _a^b f^{[m_2]}(x)^2 \, dx, \]

where \(f^{[m_2]}(x)\) denotes the \(m_2\text {th}\) derivative of \(f\) with respect to \(x\), and \([a,b]\) is the interval over which the spline is to be evaluated. It is assumed that \(m_2 \le m_1\), otherwise the penalty is formulated in terms of a derivative that is not properly defined for the basis functions, which makes no sense. It is possible to write \(J={\varvec{\beta }}^\mathsf{T}\mathbf{S}{\varvec{\beta }}\) where \(\mathbf{S}\) is a band diagonal matrix of known coefficients. Computation of \(\mathbf{S}\) is the only part of setting up the smoother that presents any difficulty, since standard routines for evaluating B-spline basis functions (and their derivatives) are readily and widely available, and in any case the recursion for basis function evaluation is straightforward.
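To make the basis expansion concrete, the following is a minimal Python sketch of the standard Cox-de Boor recursion for evaluating B-spline basis functions. It is an illustration only, not the routine used in the paper: the function names, the evenly spaced knot vector and the choice \(k = 6\) are assumptions made for the example, and the knots are assumed to be strictly increasing.

```python
def bspline_basis(x, knots, j, degree):
    """Value at x of the j-th (0-based) B-spline of the given degree.

    Cox-de Boor recursion; assumes strictly increasing knots.
    """
    if degree == 0:
        return 1.0 if knots[j] <= x < knots[j + 1] else 0.0
    left = (x - knots[j]) / (knots[j + degree] - knots[j])
    right = (knots[j + degree + 1] - x) / (knots[j + degree + 1] - knots[j + 1])
    return (left * bspline_basis(x, knots, j, degree - 1)
            + right * bspline_basis(x, knots, j + 1, degree - 1))


def spline_value(x, knots, beta, degree):
    """f(x) = sum_j beta_j B_j(x), with one coefficient per basis function."""
    return sum(b * bspline_basis(x, knots, j, degree) for j, b in enumerate(beta))


# Hypothetical setup: k = 6 cubic basis functions (m_1 = 3) on [a, b] = [0, 1].
m1, k = 3, 6
# k + m1 + 1 evenly spaced knots, with m1 of them on either side of [0, 1].
knots = [(i - m1) / 3 for i in range(k + m1 + 1)]
# One row of the basis matrix: all k basis functions evaluated at x = 0.5.
row = [bspline_basis(0.5, knots, j, m1) for j in range(k)]
```

Only \(m_1 + 1\) entries of `row` are non-zero, which is the basis sparsity referred to in the abstract.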

… \(i\) and \(j\) start at 1). In terms of the evaluated gradient vectors, …

1. For each interval \([x_{j},x_{j+1}]\), generate \(p+1\) evenly spaced points within the interval. For \(p=0\) the point should be at the interval centre, otherwise the points always include the end points \(x_j\) and \(x_{j+1}\). Let \(\mathbf{x}^\prime \) contain the unique \(x\) values so generated, in ascending order.
2. Obtain the matrix \(\mathbf{G}\) mapping the spline coefficients to the \(m_2\text {th}\) derivative of the spline at the points \(\mathbf{x}^\prime \).
3. If \(p=0\), \(\mathbf{W} = \text {diag}(\mathbf{h})\).
4. If \(p>0\), let the \((p+1) \times (p+1)\) matrices \(\mathbf{P}\) and \(\mathbf{H}\) have elements \(P_{ij} = (-1 + 2(i-1)/p)^{j-1}\) and \(H_{ij} = (1+(-1)^{i+j-2})/(i+j-1)\) (\(i\) and \(j\) start at 1, so that the powers in \(\mathbf{P}\) run from 0 to \(p\), consistent with \(\mathbf{H}\)). Then compute matrix \(\tilde{\mathbf{W}} = \mathbf{P}^{-\mathsf{T}}\mathbf{HP}^{-1}\). Now compute \(\mathbf{W} = \sum _{q=1}^{k-m} \mathbf{W}^q\) where each \(\mathbf{W}^q\) is zero everywhere except at \(W^q_{i + pq - p, j + pq - p} = h_q \tilde{W}_{ij}/2\), for \(i=1,\ldots ,p+1\), \(j=1,\ldots ,p+1\). \(\mathbf{W}\) is banded with \(2p + 1\) non-zero diagonals.
5. The diagonally banded penalty coefficient matrix is \(\mathbf{S} = \mathbf{G}^\mathsf{T}\mathbf{WG}\).
6. Optionally, compute the diagonally banded Cholesky decomposition \(\mathbf{R}^\mathsf{T}\mathbf{R} = \mathbf{W}\), and form diagonally banded matrix \(\mathbf{D} = \mathbf{RG}\), such that \(\mathbf{S} = \mathbf{D}^\mathsf{T}\mathbf{D}\).
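The six steps can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the mgcv implementation. It assumes that \(p = m_1 - m_2\) is the degree of the polynomial pieces of the \(m_2\text {th}\) derivative, that the \(x_j\) are the knots spanning \([a,b]\) with \(h_j = x_{j+1} - x_j\), that there are \(k - m_1\) such intervals, and that the full knot vector `t` has distinct knots with \(m_1\) extra knots on each side of \([a,b]\). The powers in \(\mathbf{P}\) are taken to run over \(0,\ldots ,p\), which is what makes \(\mathbf{P}\) consistent with \(\mathbf{H}\). The names `bspline_deriv` and `derivative_penalty` are hypothetical, and dense arrays are used for clarity even though every matrix involved is banded.

```python
import numpy as np


def bspline_deriv(x, t, j, d, nu=0):
    """nu-th derivative at x of the j-th (0-based) degree-d B-spline on knots t."""
    if nu > d:
        return 0.0
    if d == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    a, b = t[j + d] - t[j], t[j + d + 1] - t[j + 1]
    if nu == 0:  # Cox-de Boor recursion for the value
        return ((x - t[j]) / a * bspline_deriv(x, t, j, d - 1)
                + (t[j + d + 1] - x) / b * bspline_deriv(x, t, j + 1, d - 1))
    # standard B-spline derivative recursion
    return d * (bspline_deriv(x, t, j, d - 1, nu - 1) / a
                - bspline_deriv(x, t, j + 1, d - 1, nu - 1) / b)


def derivative_penalty(t, m1, m2):
    """Return (S, D) with S = D^T D for the integrated squared m2-th derivative."""
    k = len(t) - m1 - 1                       # number of basis functions
    xk = np.asarray(t[m1:k + 1], dtype=float)  # knots spanning [a, b]
    h = np.diff(xk)                           # inter-knot distances
    p = m1 - m2                               # assumed meaning of p
    nq = len(h)                               # number of intervals, k - m1
    # Step 1: p + 1 evenly spaced points per interval, duplicates removed.
    if p == 0:
        xp = xk[:-1] + h / 2
    else:
        inner = [xk[q] + h[q] * np.arange(p) / p for q in range(nq)]
        xp = np.concatenate(inner + [xk[-1:]])
    # Step 2: G maps coefficients to the m2-th derivative at the points xp.
    G = np.array([[bspline_deriv(x, t, j, m1, m2) for j in range(k)] for x in xp])
    # Steps 3 and 4: the weight matrix W.
    if p == 0:
        W = np.diag(h)
    else:
        pts = np.linspace(-1.0, 1.0, p + 1)
        P = pts[:, None] ** np.arange(p + 1)[None, :]            # powers 0..p
        ij = np.add.outer(np.arange(p + 1), np.arange(p + 1))    # (i-1) + (j-1)
        H = (1 + (-1.0) ** ij) / (ij + 1)
        Pinv = np.linalg.inv(P)
        Wt = Pinv.T @ H @ Pinv
        W = np.zeros((len(xp), len(xp)))
        for q in range(nq):                   # overlapping diagonal blocks
            W[q * p:q * p + p + 1, q * p:q * p + p + 1] += h[q] * Wt / 2
    S = G.T @ W @ G                           # step 5
    R = np.linalg.cholesky(W).T               # step 6: R^T R = W (dense here)
    return S, R @ G
```

For example, with `t = (np.arange(12) - 3) / 5.0`, the call `derivative_penalty(t, 3, 2)` gives the second derivative penalty for a rank 8 cubic spline on \([0,1]\).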

It is hard to imagine an application in which \(p\) would even be as high as 10, and for \(p\le 10\), \(\mathbf{P}\)’s condition number is \(< 2 \times 10^4\). Of course \(\mathbf W\) is formed without explicitly forming the \(\mathbf{W}^q\) matrices. Step 6 can be accomplished by a banded Cholesky decomposition such as dpbtrf from LAPACK (accessible via routine mgcv::bandchol in R, for example). Alternatively see the appendix. However, for applications with \(k\) less than 1000 or so, a dense Cholesky decomposition might be deemed efficient enough. Note that step 6 is preferable to construction of \(\mathbf{D}\) by decomposition of \(\mathbf{S}\), since \(\mathbf{W}\) is positive definite by construction, while, for \(m_2>0\), \(\mathbf{S}\) is only positive semi-definite. As in the case of a discrete P-spline penalty, the leading order computational cost of evaluating \(\mathbf{S}\) (or \(\mathbf{D}\)) is \(O(bk)\) where \(b\) is the number of bands in \(\mathbf{S}\): the constant of proportionality is lower for a discrete penalty of course, but in either case the cost is trivial relative to that of model fitting (which is \(O(nk^2)\) using dense matrix methods).
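As a rough illustration of the banded Cholesky step, the sketch below factorizes a small \(\mathbf{W}\) while only touching entries inside the band. It is not the appendix algorithm or LAPACK's dpbtrf, just a plain row-by-row factorization. The function name `banded_cholesky`, the uneven spacings `h` and the use of dense storage are assumptions made for the example. The block \(\tilde{\mathbf{W}}\) used is the \(p=1\) case, for which \(\mathbf{P}^{-\mathsf{T}}\mathbf{HP}^{-1}\) has diagonal entries 2/3 and off-diagonal entries 1/3.

```python
import numpy as np


def banded_cholesky(W, bw):
    """Upper triangular R with R^T R = W, for positive definite W whose
    non-zero entries lie within bw diagonals of the main diagonal."""
    n = W.shape[0]
    R = np.zeros_like(W, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - bw), min(n, i + bw + 1)
        # Subtract contributions of earlier rows; only rows lo..i-1 can be non-zero.
        s = W[i, i:hi] - R[lo:i, i] @ R[lo:i, i:hi]
        R[i, i] = np.sqrt(s[0])
        R[i, i + 1:hi] = s[1:] / R[i, i]
    return R


# Hypothetical example: W for p = 1 and four intervals of uneven width.
h = np.array([0.5, 1.0, 0.25, 2.0])
Wt = np.array([[2.0, 1.0], [1.0, 2.0]]) / 3.0   # P^{-T} H P^{-1} when p = 1
W = np.zeros((len(h) + 1, len(h) + 1))
for q, hq in enumerate(h):
    W[q:q + 2, q:q + 2] += hq * Wt / 2          # overlapping 2 x 2 blocks
R = banded_cholesky(W, bw=1)
```

Each row costs work proportional to the square of the bandwidth, so the whole factorization is linear in the dimension of \(\mathbf{W}\), and the factor `R` has the same bandwidth as \(\mathbf{W}\).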

The simplicity of the algorithm rests on the ease with which \(\mathbf{G}\) and \(\mathbf{W}\) can be computed. Note that the construction is more general than that of Wand and Ormerod (2008), in allowing \(m_1\) and \(m_2\) to be chosen freely (rather than \(m_1\) determining \(m_2\)), and treating even \(m_1\) as well as odd.

## 2 Tensor product smoothing of unevenly distributed data

By construction the domain of the tensor product smooth is a rectangle, cuboid or hypercuboid, but it is often the case that the covariates to be smoothed over occupy only part of this domain. In this case it is possible for some basis functions to evaluate to zero at every covariate observation, and there is often little point in retaining these basis functions and their associated coefficients. Let \(\iota \) denote the index of a coefficient to be dropped from \({\varvec{\beta }}\) (along with its corresponding basis function). The naïve approach of dropping row and column \(\iota \) of each \(\mathbf{S}_j\) is equivalent to setting \(\beta _\iota \) to zero when evaluating \({\varvec{\beta }}^\mathsf{T}\mathbf{S}_j{\varvec{\beta }}\), which is not usually desirable. Rather than setting \(\beta _\iota =0\) in the penalty, we would like to omit those components of the penalty dependent on \(\beta _\iota \). This is easily achieved by dropping every row \(\kappa \) from \(\tilde{\mathbf{D}}_j\) for which \(\tilde{D}_{j,\kappa \iota } \ne 0\). Notice (i) this construction applies equally well to P-splines, and (ii) that without \(\mathbf{D}\) being diagonally banded this would be a rather drastic reduction of the penalty.
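The difference between the naïve reduction and the row-dropping reduction can be seen in a small NumPy sketch. For brevity it uses a single second order difference matrix as a stand-in for one \(\tilde{\mathbf{D}}_j\) (the construction applies equally to P-splines, as noted above). The function name `drop_coefficients` and the choice of which coefficient to drop are assumptions made for the illustration.

```python
import numpy as np


def drop_coefficients(D, drop):
    """Drop the coefficients indexed by `drop`, together with every penalty
    row kappa for which D[kappa, iota] != 0 for some dropped iota."""
    drop = np.asarray(drop)
    keep_cols = np.setdiff1d(np.arange(D.shape[1]), drop)
    touches = np.any(D[:, drop] != 0, axis=1)
    return D[~touches][:, keep_cols]


# Stand-in banded penalty square root: second differences of 6 coefficients.
D = np.diff(np.eye(6), n=2, axis=0)
D_reduced = drop_coefficients(D, [5])     # drop the last coefficient

# Naive alternative: delete row and column 5 of S = D^T D.
S_naive = (D.T @ D)[:5, :5]

beta = np.arange(5.0)                      # coefficients of a straight line
penalty_reduced = beta @ D_reduced.T @ D_reduced @ beta
penalty_naive = beta @ S_naive @ beta
```

With the reduced \(\mathbf{D}\) a straight line remains unpenalized, whereas the naïve version penalizes it because it behaves as if the dropped coefficient were zero. Because \(\mathbf{D}\) is banded, only the few rows that touch the dropped coefficient are lost.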

… \(x\), \(z\) locations shown as black dots in Fig. 1. The figure shows the reconstruction of the test function using a tensor product smoother, based on cubic spline marginals with second derivative penalties. The left figure is for the full smoother, which had 625 coefficients, while the right figure is for the reduced version which had 358 coefficients. Since the \(\mathbf{S}\) matrix of a smoother can be viewed as the prior precision matrix of a Gaussian random field, the smoothing parameters can be estimated by marginal likelihood maximization (e.g. Wahba 1985), and the computational method of Wood (2011) was used for this purpose. Including smoothing parameter selection the reduced rank fit took around 1/8 of the computation time of the full rank fit, as a result of the reduction in basis size. The correlation between the fitted values for the two fits is 0.999. In the example the reduced rank fit has marginally smaller mean square reconstruction error than the full rank version, a feature that seems to be robust under repeated replication of the experiment.

In this example penalty sparsity was important in order to be able to reduce the penalty appropriately when eliminating irrelevant basis functions, but the sparsity was not further exploited in the computations. In contrast, Bayesian stochastic simulation with such models is much more efficient if the basis matrix, \(\mathbf{X}\), is sparse, so that the likelihood is efficiently computable at each step, and the penalty is sparse, so that computations involving \((\mathbf{X}^\mathsf{T}\mathbf{X}+ \lambda \mathbf{S})^{-1}\) can be used to make efficient proposals. Similarly for direct estimation in big data settings, sparse evaluation of the REML expressions given in Wood (2011), for example, rests on evaluation of terms like \(\log |\mathbf{X}^\mathsf{T}\mathbf{X}+ \lambda \mathbf{S}|\), which can be made more efficient if both \(\mathbf{X}\) and \(\mathbf{S}\) are sparse. See Davis (2006) for more on the computational exploitation of sparsity.
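A small NumPy sketch may help to show why joint sparsity matters for terms like \(\log |\mathbf{X}^\mathsf{T}\mathbf{X}+ \lambda \mathbf{S}|\). When \(\mathbf{X}\) and \(\mathbf{S}\) are both banded, so is \(\mathbf{X}^\mathsf{T}\mathbf{X}+ \lambda \mathbf{S}\), and its Cholesky factor has no non-zero entries outside the band. The basis matrix, the second difference stand-in penalty and the value of `lam` below are hypothetical, and a dense factorization is used purely to display the structure. A sparse or banded solver would be needed to realize the savings.

```python
import numpy as np

k, lam = 10, 0.7
# Hypothetical sparse basis matrix: each row holds the four non-zero values of
# a uniform cubic B-spline basis at an interval midpoint, in adjacent columns.
X = np.zeros((21, k))
for i in range(X.shape[0]):
    c = i % (k - 3)
    X[i, c:c + 4] = np.array([1.0, 23.0, 23.0, 1.0]) / 48.0

D = np.diff(np.eye(k), n=2, axis=0)   # banded square root of a stand-in penalty
S = D.T @ D
A = X.T @ X + lam * S                 # non-zero only within 3 diagonals of the main one

L = np.linalg.cholesky(A)             # lower triangular, L L^T = A
log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log|A| from the factor's diagonal
```

The factor `L` is zero below its third sub-diagonal, so a banded routine only needs to store and update a few diagonals, and the log determinant follows directly from the diagonal of the factor.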

## 3 Conclusions

Given that the theoretical justification for using spline bases for smoothing is that they arise as the solutions to variational problems with derivative based penalties (see e.g. Wahba 1990; Duchon 1977), it is sometimes appealing to be able to use derivative based penalties for reduced rank smoothing also. Since the derivative based penalty is not reliant on even knot spacing there may also be practical advantages when uneven spacing of knots is desirable (e.g. Whitehorn et al. 2013). However if a sparse smoothing basis and penalty were required alongside the ability to mix-and-match penalty order and basis order, then the apparent complexity of obtaining the penalty matrix for derivative based penalties has hitherto presented an obstacle to their use. This note removes this obstacle, allowing the statistician an essentially free choice whether to use derivative based penalties or discrete penalties. Notice that there is nothing to prevent computation of several different orders of penalty for the same smoother, thereby facilitating the use of more general differential operators as penalties (e.g. Ramsay et al. 2007).

The splines described here are available in R package mgcv from version 1.8-12. They could be referred to as ‘D-splines’, but a new name is probably unnecessary. This work was supported by EPSRC grant EP/K005251/1, and I am grateful to two anonymous referees for several helpful suggestions.

## Acknowledgments

Open access funding provided by University of Bristol.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.