1 Introduction

The paper develops new methods of nonparametric estimation of the distribution of compound Poisson data. A compound Poisson process \((W_{t})_{t\ge 0}\) is a Markov jump process with \(W_0=0\) characterised by a finite compounding measure \(\varLambda \) defined on the real line \(\mathbb {R}=(-\infty ,+\infty )\) such that

$$\begin{aligned} \varLambda (\{0\})=0,\quad \Vert \varLambda \Vert :=\varLambda (\mathbb {R})\in (0,\infty ). \end{aligned}$$
(1)

The jumps of this process occur at the constant rate \(\Vert \varLambda \Vert \), and the jump sizes are independent random variables with a common distribution \(\varLambda (\mathrm {d}x) /\Vert \varLambda \Vert \). In a more general context, the compound Poisson process is a particular case of a Lévy process with \(\varLambda \) being the corresponding integrable Lévy measure. Inference problems for such processes naturally arise in financial mathematics (Cont and Tankov 2003), queueing theory (Asmussen 2008), insurance (Mikosch 2009) and in many other situations modelled by compound Poisson and Lévy processes.

Suppose the compound Poisson process is observed at regularly spaced times, yielding the values \((W_h,W_{2 h},\ldots ,W_{n h})\) for some time step \(h>0\). The consecutive increments \(X_i=W_{i h}-W_{(i-1) h}\) then form a vector \((X_{1}, \ldots, X_{n})\) of independent random variables having a common compound Poisson distribution with the characteristic function

$$\begin{aligned} \varphi (\theta )=Ee^{i\theta W_h}=e^{h\psi (\theta )},\quad \psi (\theta )= \int (e^{i\theta x}-1) \varLambda (\mathrm {d}x). \end{aligned}$$
(2)

Here and below, integrals are taken over the whole of \(\mathbb {R}\) unless specified otherwise. Estimation of the measure \(\varLambda \) from a sample \((X_{1}, \ldots, X_{n})\) is usually called decompounding, and it is the main object of study in this paper.
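Before proceeding, it is useful to keep in mind how such data arise. The following minimal R sketch (R being the language of our implementation) simulates n increments for a discrete compounding measure; the particular atoms and masses are illustrative assumptions only, chosen to match the example of Sect. 5.2.

# Sketch: simulate n compound Poisson increments X_i over time step h for a
# discrete compounding measure Lambda = sum_j masses[j] * delta_{atoms[j]}.
simulate_increments <- function(n, h, atoms, masses) {
  rate <- sum(masses)                               # ||Lambda||, the jump rate
  replicate(n, {
    m <- rpois(1, h * rate)                         # number of jumps in an interval of length h
    if (m == 0) 0 else sum(sample(atoms, m, replace = TRUE, prob = masses / rate))
  })
}
set.seed(1)
X <- simulate_increments(n = 1000, h = 1, atoms = c(-1, 1, 2), masses = c(0.2, 0.2, 0.6))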

We propose a combination of two nonparametric methods which we call characteristic function fitting (ChF) and convolution fitting (CoF). ChF can deal with a more general class of Lévy processes, while CoF explicitly targets compound Poisson processes.

The ChF estimator for the jump measure \(\varLambda \) is obtained by minimisation of the loss functional

$$\begin{aligned} L_{\mathrm {ChF}}(\varLambda ) = \int |e^{ h\psi (\theta )} - \hat{\varphi }_{n}(\theta ) |^{2} \omega (\theta )\mathrm {d}\theta , \end{aligned}$$
(3)

where \(\psi (\theta )\equiv \psi (\theta ,\varLambda ) \) is given by (2),

$$\begin{aligned} \hat{\varphi }_{n}(\theta ) = \frac{1}{n} \sum _{k = 1}^{n} e^{i \theta X_{k}} \end{aligned}$$

is the empirical characteristic function and \(\omega (\theta )\) is a weight function. It was shown in Neumann and Reiss (2009), in a more general Lévy process setting, that minimising (3) leads to a consistent estimator of the Lévy triplet. Typically, \(\omega (\theta )\) is a positive constant for \(\theta \in [\theta _1,\theta _2]\) and zero otherwise, but it can also be chosen to grow as \(\theta \rightarrow 0\); this boosts the agreement between the moments of the fitted jump distribution and the empirical moments.
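For a measure supported on a finite grid, as used throughout the paper, the loss (3) is straightforward to evaluate numerically. The following R sketch does this with a uniform weight on an assumed interval \([\theta _1,\theta _2]\) and a simple Riemann sum; the \(\theta \) grid is an assumption made only for the example.

# Sketch: ChF loss (3) for a discrete measure Lambda = sum_j masses[j] * delta_{atoms[j]},
# with omega equal to 1 on [th1, th2] and 0 otherwise (an assumed choice of weight).
chf_loss <- function(masses, atoms, X, h, th1 = 0, th2 = 10, n_theta = 200) {
  theta <- seq(th1, th2, length.out = n_theta)
  psi <- sapply(theta, function(t) sum((exp(1i * t * atoms) - 1) * masses))  # psi(theta) from (2)
  phi <- exp(h * psi)                                                        # model characteristic function
  phi_hat <- sapply(theta, function(t) mean(exp(1i * t * X)))                # empirical characteristic function
  sum(Mod(phi - phi_hat)^2) * (th2 - th1) / n_theta                          # Riemann sum over theta
}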

We compute explicitly the derivative of the loss functional (3) with respect to the measure \(\varLambda \), formula (18) in the Appendix, and perform the steepest descent directly on the cone of non-negative measures to a local minimiser, further developing the approach of Molchanov and Zuyev (2002). It must be noted that, as a simple example reveals, functionals based on the empirical characteristic function usually have a very irregular structure, see Fig. 1. As a result, the steepest descent often fails to attain the globally optimal solution unless the starting point of the optimisation procedure is carefully chosen.

Fig. 1 Illustration of intrinsic difficulties faced by any characteristic function fitting procedure. Plotted is the integrated squared modulus of the difference between two characteristic functions with measures \(\varLambda =\delta _1\) and \(\varLambda '=\lambda \delta _x\), \(x\in [-5,5],\ \lambda \in (0,5)\). Clearly, any algorithm based on closeness of characteristic functions, like (3), would have difficulties converging to the global minimum attained at the point \(x=1,\ \lambda =1\), even in this simple two-parameter model

The CoF estimation method uses the fact that the convolution of \(F(x)=\mathbf {P}(W_h\le x)\) with itself,

$$\begin{aligned} F^{*2}(x) = \int F(x- y)\, \mathrm {d}F(y), \end{aligned}$$

as a functional of \(\varLambda \) has an explicit form of an infinite Taylor series involving direct products of the measure \(\varLambda \), see Theorem 2 in Sect. 4. After truncating it to the first k terms, we build a loss function \(L_{\mathrm {CoF}}^{(k)}\) by comparing two estimates of \(F^{*2}\): one based on the truncated series and the other being the empirical convolution \(\hat{F}_{n}^{*2}\). CoF is able to produce nearly optimal estimates \(\hat{\varLambda } _{k}\) when large values of k are taken, but at the expense of a drastically increased computation time.

A practical combination of these methods recommended by this paper is to find \(\hat{\varLambda } _{k}\) using CoF with a low value of k and then apply ChF with \(\hat{\varLambda } _{k}\) as the starting value. The estimate for such a two-step procedure will be denoted by \(\tilde{\varLambda }_{k}\) in the sequel.

To give an early impression of our approach, let us demonstrate the performance of our methods on the famous data of Ladislaus Bortkiewicz, who collected the numbers of Prussian soldiers killed by a horse kick in 10 cavalry corps over a 20-year period (Bortkiewicz 1898). The counts 0, 1, 2, 3 and 4 were observed 109, 65, 22, 3 and 1 times, respectively, giving 0.6100 deaths per year per cavalry unit. The author argues that the data are Poisson distributed, which corresponds to the measure \(\varLambda =\lambda \delta _1\) concentrated at the point \(\{1\}\) (only jumps of size 1) with the mass \(\lambda \) being the parameter of the Poisson distribution, estimated by the sample mean to be 0.61. The top panel of Fig. 2 presents the estimated Lévy measures for the cut-off values \(k=1,2,3\) when using the CoF method. For the values \(k=1,2\), the result is a measure having many atoms. This is explained by the fact that the accuracy of the convolution approximation is not sufficient for these data, but \(k=3\) already results in a measure \(\hat{\varLambda }_3\) essentially concentrated at \(\{1\}\), thus supporting the Poisson model with parameter \(\Vert \hat{\varLambda }_3\Vert =0.6098\). In Sect. 4, we return to this example and explain why the choice of \(k=3\) is reasonable here. Owing to the possibly very irregular behaviour of the score function \(L_{\mathrm {ChF}}\) demonstrated above, we observed in practice that the convergence of the ChF method depends critically on the choice of the initial measure, especially on its total mass. However, as the bottom plot demonstrates, the proposed two-step (and faster) combination of CoF followed by ChF results in the estimate \(\tilde{\varLambda }_{1}\), which is as good as the more computationally demanding \(\hat{\varLambda }_3\).
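The quoted rate is obtained directly from the published counts; a one-line R check using nothing beyond the frequencies listed above:

# Bortkiewicz horse kick data: counts 0,...,4 observed 109, 65, 22, 3 and 1 times
counts <- 0:4
freq   <- c(109, 65, 22, 3, 1)
sum(counts * freq) / sum(freq)    # sample mean: 0.61 deaths per corps per year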

Fig. 2 The analysis of Bortkiewicz horse kick data. Top panel: comparison of CoF estimates for \(k=1,2,3\). Bottom panel: comparison of the estimate by CoF with \(k=3\) and a combination of CoF with \(k=1\) followed by ChF

Previously developed methods include the discrete decompounding approach based on the inversion of Panjer recursions proposed in Buchmann and Grübel (2003). van Es et al. (2007) and, more recently, Duval (2013) and Comte et al. (2014) studied the continuous decompounding problem in which the measure \(\varLambda \) is assumed to have a density. They apply Fourier inversion in combination with kernel smoothing techniques for estimating the unknown density of the measure \(\varLambda \). In contrast, we do not distinguish between discrete and continuous \(\varLambda \) in that our algorithms, based on direct optimisation of functionals of a measure, work for both situations on a discretised phase space of \(\varLambda \). However, if one sees many small atoms appearing in the solution, filling a fine grid, this may indicate that the true measure is absolutely continuous and that some kind of smoothing should yield its density.

In this paper, we do not address estimation of more general Lévy processes allowing for \(\varLambda ((-1,1))=\infty \). In the Lévy process setting, the most straightforward approach to estimating the distribution \(F(x)=\mathbf {P}(W_h\le x)\) is moments fitting, see Feuerverger and McDunnough (1981b) and Carrasco and Florens (2000). Estimates of \(\varLambda \) can be obtained by maximising the likelihood ratio (see e.g. Qin and Lawless 1994) or by minimising some measure of proximity between F and the empirical distribution function \(\hat{F}_{n}(x) = \frac{1}{n} \sum _{k = 1}^{n} {{\mathrm{{1I}}}}_ {\{X_{k} \le x\}},\) where the dependence on \(\varLambda \) comes through F via the inversion formula of the characteristic function:

$$\begin{aligned} F(x) - F(x - 0) = \frac{1}{2 \pi } \lim _{y \rightarrow \infty } \int _{-y}^{y} \exp \{h\psi (\theta )-i \theta x\} \mathrm {d}\theta . \end{aligned}$$

For the estimation, the characteristic function in the integral above is replaced by the empirical characteristic function.

Parametric inference procedures based on the empirical characteristic function have been known for some time, see Feuerverger and McDunnough (1981a) and Sueishi and Nishiyama (2005), and the references therein. Algorithms based on the inversion of the empirical characteristic function and on the relation between its derivatives were proposed in Watteel and Kulperger (2003). Note that the inversion of the empirical characteristic function, in contrast to the inversion of its theoretical counterpart, generally leads to a complex valued measure which needs to be dealt with.

One of the reviewers has drawn our attention to the recent preprint Coca (2015) which promises to be useful for testing the presence of discrete and/or continuous jump distribution components as well as for obtaining approximation accuracy bounds based on the central limit theorem. Practical implementations of these theoretical results are yet to be explored.

The rest of the paper has the following structure. Section 2 introduces the theoretical basis of our approach, a constrained optimisation technique in the space of measures. Section 3 provides an algorithmic implementation of the corresponding steepest descent method in the R language. Section 4 develops the necessary ingredients for the CoF method based on the main analytical result of the paper, Theorem 2. Section 5 contains a broad range of simulation results illustrating the performance of our algorithms on simulated data with various compounding measures, both discrete and continuous. Section 6 presents an application of our approach to real currency exchange data. Section 7 summarises our approach and gives some practical recommendations. We conclude with an Appendix containing the proofs and explicit formulas for the gradients of the two loss functions used in our steepest descent algorithm.

2 Optimisation of functionals of a measure

In this section, we briefly present the main ingredients of the constrained optimisation of functionals of a measure. Theorem 1 gives necessary conditions for a local minimum of a strongly differentiable functional. This theorem justifies a major step in our optimisation algorithm described in Sect. 3. Further details of the underlying theory can be found in Molchanov and Zuyev (2000a, b).

Denote by \(\mathbb {M}\) and \(\mathbb {M}_+\) the class of signed and, respectively, non-negative measures with a finite total variation. The set \(\mathbb {M}\) then becomes a Banach space with sum and multiplication by real numbers defined set-wise: \((\eta _1+\eta _2)(B):=\eta _1(B)+\eta _2(B)\) and \((t\eta )(B):=t\eta (B)\) for any Borel set B and any real t. The set \(\mathbb {M}_+\) is a pointed cone in \(\mathbb {M}\) meaning that the zero measure is in \(\mathbb {M}_+\), \(\mu _1+\mu _2\in \mathbb {M}_+\), and \(t\mu \in \mathbb {M}_+\) as long as \(\mu _1,\mu _2,\mu \in \mathbb {M}_+\) and \(t\ge 0\).

A functional \(G:\mathbb {M}\mapsto \mathbb {R}\) is called Fréchet (or strongly) differentiable at \(\eta \in \mathbb {M}\) if there exists a bounded linear operator (a differential) \(DG(\eta )[\cdot ]:\ \mathbb {M}\mapsto \mathbb {R}\) such that

$$\begin{aligned} G(\eta +\nu )-G(\eta )=DG(\eta )[\nu ]+o(\Vert \nu \Vert ),\quad \Vert \nu \Vert \rightarrow 0, \end{aligned}$$
(4)

where \(\Vert \nu \Vert \) is the total variation of a signed measure \(\nu \in \mathbb {M}\). If for a given \(\eta \in \mathbb {M}\) there exists a bounded function \(\nabla G(\,\cdot \,;\eta )\,:\,\mathbb {R}\rightarrow \mathbb {R}\) such that

$$\begin{aligned} DG(\eta )[\nu ]=\int \nabla G(x;\eta )\,\nu (\mathrm {d}x)\ \text {for all}\ \nu \in \mathbb {M}, \end{aligned}$$

then \(\nabla G(\,\cdot \,;\eta )\) is called the gradient function for G at \(\eta \). Typically in applications, and this is indeed the case for the functionals of a measure considered in this paper, the gradient function exists, so that the differential has an integral form.

As a simple illustration, consider an integral of a bounded function \(G(\eta )=\int f(x)\eta (dx)\). Since this is already a bounded linear functional of \(\eta \), we get \(\nabla G(x;\eta )=f(x)\) for any \(\eta \). More generally, for a composition \(G(\eta )=u(\int f(x)\eta (\mathrm {d}x))\), the gradient function can be obtained by the Chain rule:

$$\begin{aligned} \nabla G(x;\eta )=u'\Bigl (\int f(y)\,\eta (\mathrm {d}y)\Bigr )\,f(x). \end{aligned}$$
(5)

The functional G in this example is strongly differentiable if both functions \(u'\) and f are bounded.
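The formula (5) is easy to verify numerically for a discrete measure by comparing the gradient function with a difference quotient over a small point-mass perturbation. In the R sketch below, the functions u and f and the test measure are arbitrary choices made only for this check.

# Numerical check of the Chain rule (5) for G(eta) = u( integral of f d eta ).
u  <- function(s) s^2;  du <- function(s) 2 * s     # an assumed smooth u and its derivative
f  <- function(x) exp(-x^2)                         # an assumed bounded f
atoms  <- c(-1, 0.5, 2, 1)                          # the last atom is the perturbation location x = 1
masses <- c(0.3, 0.4, 0.3, 0)                       # eta places no mass at x = 1 initially
G <- function(m) u(sum(f(atoms) * m))
t <- 1e-6
(G(masses + c(0, 0, 0, t)) - G(masses)) / t         # difference quotient (G(eta + t*delta_x) - G(eta)) / t
du(sum(f(atoms) * masses)) * f(1)                   # gradient function (5) evaluated at x = 1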

Taking into account condition (1), we aim to find a solution to the following constrained minimisation problem:

$$\begin{aligned} \varLambda =\mathrm{arg\,min} \{L(\eta ): \eta \in \mathbb {M}_+, \ \eta (\{0\})=0\}, \end{aligned}$$
(6)

where \(L:\ \mathbb {M}_+\mapsto \mathbb {R}\) is a strongly differentiable functional of a measure. The following necessary condition for a minimum is proved in the Appendix.

Theorem 1

Suppose that \(\varLambda \) solves (6), and the functional L possesses a gradient function \(\nabla L(x;\varLambda )\) at this \(\varLambda \). Then

$$\begin{aligned} {\left\{ \begin{array}{ll} \nabla L (x;\varLambda ) \ge 0 \quad \text {for all}\ x \in \mathbb {R}\setminus \{0\}, \\ \nabla L (x;\varLambda ) = 0 \quad \varLambda -\mathrm {almost\ everywhere}. \\ \end{array}\right. } \end{aligned}$$
(7)

Remark 1

It can be shown similarly that the necessary condition (7) also holds for optimisation over the class of Lévy measures satisfying, in addition to (1), the integrability condition \(\int \min \{1,x^2\}\,\varLambda (\mathrm {d}x) < \infty \).

3 Steepest descent algorithm on the cone of positive measures

There are numerous algorithms for parametric optimisation over a finite number of continuous variables, but optimisation algorithms over the cone of measures have been proposed only recently, in Molchanov and Zuyev (2002), for the case of measures with a fixed total mass. The variational analysis of functionals of a measure outlined in the previous section allows us to develop a steepest descent type algorithm for minimisation of functionals of a compounding measure, which we describe next. This algorithm has been used to obtain the simulation results presented in Sect. 5.

Recall that the principal optimisation problem has the form (6), where the functional \(L(\varLambda )\) is minimised over the measures \(\varLambda \) subject to the constraint (1). For computational purposes, a measure \(\varLambda \in \mathbb {M}_{+}\) is replaced by its discrete approximation, which has the form of a linear combination \(\pmb {\varLambda }=\Sigma _{i=1}^{l} \lambda _{i} \delta _{x_{i}}\) of Dirac measures on a finite regular grid \(x_{1}, \ldots , x_{l}\in \mathbb {R}\), \(x_{i + 1} =x_{i}+2\varDelta \). Specifically, for a given measure \(\varLambda \), the atoms of \(\pmb {\varLambda }\) are given by

$$\begin{aligned} \lambda _{1}&:= \varLambda ((-\infty , x_{1} + \varDelta )), \nonumber \\ \lambda _{i}&:= \varLambda ( [x_{i} - \varDelta , x_{i} + \varDelta )), \quad \text {for } i = 2, \ldots , l - 1, \\ \lambda _{l}&:= \varLambda ([x_{l} - \varDelta , \infty )). \nonumber \end{aligned}$$
(8)

Clearly, the larger l and the finer the grid \(\{x_{1}, \ldots , x_{l}\}\), the better the approximation, albeit at a higher computational cost.
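The binning (8) amounts to evaluating the measure of consecutive half-open intervals; a short R sketch follows, in which the exponential compounding measure and the grid are assumptions matching the setting of Sect. 5.4.

# Sketch of the discretisation (8): bin a measure Lambda, given by its c.d.f., onto a
# grid x_1, ..., x_l with spacing 2*Delta (for a continuous Lambda the half-open
# convention of (8) and the c.d.f. differences below coincide).
discretise <- function(x_grid, Delta, Lambda_cdf) {
  l <- length(x_grid)
  breaks <- c(-Inf, x_grid[-l] + Delta, Inf)        # bin edges (-Inf, x_1+Delta), [x_i-Delta, x_i+Delta), ..., [x_l-Delta, Inf)
  diff(Lambda_cdf(breaks))                          # masses lambda_1, ..., lambda_l
}
x_grid <- seq(0.25, 5, by = 0.5)                    # grid with Delta = 0.25, as in Sect. 5.4
lam <- discretise(x_grid, 0.25, pexp)               # exponential(1) jump measure with total mass 1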

Correspondingly, the discretised version of the gradient function \(\nabla L(x;\varLambda )\) is the vector

$$\begin{aligned} \pmb {g}=(g_1,\ldots , g_l),\quad g_{i} := \nabla L (x_i;\pmb {\varLambda }), \quad i= 1, \ldots , l. \end{aligned}$$
(9)

Our main optimisation algorithm has the following structure:

[Algorithm a: the master steepest descent cycle]

In the master algorithm above, line 3 uses the necessary condition (7) as the test condition for the main cycle. In computer realisations, we usually want to discard atoms of negligible size: for this purpose, we use a zero-value threshold parameter \(\tau _1\). Another threshold parameter \(\tau _2\) decides when the coordinates of the gradient vector are sufficiently small to be discarded. For the examples considered in the next section, we typically used the following values: \(\omega \equiv 1\), \(\tau _1=10^{-2}\) and \(\tau _2=10^{-6}\). The key MakeStep subroutine, called on line 6, is described below. It calculates the admissible steepest direction \(\pmb {\nu }^{*}\) of size \(\Vert \pmb {\nu }^{*}\Vert \le \varepsilon \) and returns an updated vector \(\pmb {\varLambda } \leftarrow \pmb {\varLambda } + \pmb {\nu }^{*}\).
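The following R sketch conveys the structure of the master cycle under the conventions just described; the gradient routine grad_L is a placeholder for the discretised gradient (9) of the chosen loss, the fixed step size and the numerical stopping rule are assumptions, and make_step is the subroutine sketched after the MakeStep description below.

# Sketch of the master cycle: steepest descent over the atom masses lambda on a fixed grid.
descend <- function(lambda, grad_L, eps = 0.1, tau1 = 1e-2, tau2 = 1e-6, max_iter = 500) {
  for (iter in 1:max_iter) {
    g <- grad_L(lambda)
    active <- lambda > tau1
    # numerical test of the necessary condition (7): gradient non-negative everywhere
    # and (nearly) zero on the atoms carrying non-negligible mass
    if (all(g > -tau2) && all(abs(g[active]) < tau2)) break
    lambda <- make_step(lambda, g, eps)             # see the MakeStep sketch below
    lambda[lambda < tau1] <- 0                      # discard atoms of negligible size
  }
  lambda
}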

[Algorithm b: the MakeStep subroutine]

The MakeStep subroutine looks for a vector \(\pmb {\nu }^*\) which minimises the linear form \(\sum _{i=1}^{l} g_{i} \nu _{i}\) appearing in the Taylor expansion

$$\begin{aligned} L(\pmb {\varLambda }+\pmb {\nu })-L(\pmb {\varLambda })=\sum _{i=1}^{l} g_{i} \nu _{i}+o(|\pmb {\nu }|). \end{aligned}$$

This minimisation is subject to the following linear constraints

$$\begin{aligned}&\sum _{i = 1}^{l} |\nu _{i}| \le \varepsilon ,\quad \nu _i\ge -\lambda _i,\quad i=1,\ldots , l. \end{aligned}$$

This linear programming problem has a straightforward solution, given below.

For simplicity, we assume that \(g_{1} \ge \cdots \ge g_{l}\). Note that this ordering can always be achieved by a permutation of the components of the vector \(\pmb {g}\) and, respectively, of \(\pmb {\varLambda }\). Assume also that the total mass of \(\pmb {\varLambda }\) is bigger than the given positive step size \(\varepsilon \). Define two indices

$$\begin{aligned} i_g = \max \{i :g_{i} \ge |g_{l}|\},\ i_{\varepsilon } = \max \{i :\sum _{j = 1}^{i - 1} \lambda _{j} < \varepsilon \}. \end{aligned}$$

If \(i_{\varepsilon } \le i_g\), then the coordinates of \(\pmb {\nu }^{*}\) are given by

$$\begin{aligned} \nu _{i}^{*} := \left\{ \begin{array}{ll} -\lambda _{i} \quad &{} \text {for }\quad i < i_{\varepsilon }, \\ \sum _{j = 1}^{i_{\varepsilon } - 1} \lambda _{j} - \varepsilon \quad &{} \text {for }\quad i = i_{\varepsilon }, \\ 0 \quad &{} \text {for } \quad i > i_{\varepsilon }, \end{array} \right. \end{aligned}$$

and if \(i_{\varepsilon } > i_g\), then

$$\begin{aligned} \nu _{i}^{*} := \left\{ \begin{array}{ll} -\lambda _{i}, \quad &{} \text {for }\quad i \le i_{g}, \\ 0 \quad &{} \text {for } \quad i_g< i < l, \\ \varepsilon - \sum _{j = 1}^{i_g} \lambda _{j} \quad &{} \text {for } \quad i = l. \end{array} \right. \end{aligned}$$

The presented algorithm is implemented in the statistical computing environment R (R Core Team 2015) in the form of the library mesop, which is freely downloadable from one of the authors’ webpages.

4 Description of the CoF method

As alluded to in the Introduction, the CoF method uses a representation of the convolution as a function of the compounding measure \(\varLambda \). We now formulate the main theoretical result of the paper on which the CoF method is based. The proof is given in the Appendix.

We will need the following notation. For a function F, denote \(U_{x}F(y)=F(y-x)-F(y)\) and

$$\begin{aligned} U_{x_{1}, \ldots , x_{n}}F(y):= & {} U_{x_{n}}(U_{x_1, \ldots , x_{n-1}}F(y)) \\= & {} \sum _{J \subseteq \{1, 2, \ldots , n\}} (-1)^{n-|J|}F(y - \Sigma _{j \in J}x_{j}), \end{aligned}$$

where the sum is taken over all the subsets J of \(\{1, 2, \ldots , n\}\) including the empty set. Denote \(\Gamma _0(F,\varLambda ,y)=F(y)\), and

$$\begin{aligned} \Gamma _i(F,\varLambda ,y)= & {} {1\over i!} \int _{\mathbb {R}^i} U_{x_{1}, \ldots , x_{i}}F(y) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{i}),\\ i\ge & {} 1. \end{aligned}$$

Theorem 2

Let \((W_t)_{t\ge 0}\) be a compound Poisson process characterised by (2), and \(F(y)=F_h(y)\) be the cumulative distribution function of \(W_h\) for a given positive h. Then for each real y, one has

$$\begin{aligned} F^{*2}(y)&= \sum _{i=0}^{\infty } h^i\Gamma _i(F,\varLambda ,y). \end{aligned}$$
(10)

Recall that the empirical convolution of a sample \((X_1,\ldots ,X_n)\),

$$\begin{aligned} \hat{F}_{n}^{*2}(y) := \frac{1}{\left( {\begin{array}{c}n\\ 2\end{array}}\right) } \sum _{1\le i<j\le n} {{\mathrm{{1I}}}}\{X_i + X_j \le y \} \end{aligned}$$
(11)

is an unbiased and consistent estimator of \(F^{*2}(y)\), see Frees (1986). The CoF method looks for a finite measure \(\varLambda \) that minimises the following loss function

$$\begin{aligned} L_{\mathrm {CoF}}^{(k)}(\varLambda ) = \int \Big \{\sum _{i=0}^{k} h^i\Gamma _i(\hat{F}_n,\varLambda ,y)-\hat{F}_{n}^{*2}(y)\Big \}^{2} \omega (y)\mathrm {d}y.\nonumber \\ \end{aligned}$$
(12)

The infinite sum in (10) is truncated to k terms in (12) for computational reasons. The error introduced by the truncation can be accurately estimated by bounding the remainder term in the finite expansion formula (16) in the proof. Alternatively, turning to (10) and using \(0\le F(y)\le 1\), we obtain \(\sup _{y\in \mathbb {R}}|U_{x_{1}, \ldots , x_{i}}F(y)|\le 2^{i-1}\), yielding a uniform bound

$$\begin{aligned}&\sup _{y\in \mathbb {R}}\sum _{i=k+1}^\infty h^i \big |\Gamma _i(F,\varLambda ,y) \big | \le R_k(h\Vert \varLambda \Vert ),\ \text {where} \nonumber \\&R_k(x)= \frac{1}{2}\sum _{n=k+1}^\infty \frac{(2x)^{n}}{n!}. \end{aligned}$$
(13)

Thus, to have a good estimate with this method, the upper bound \(R_k(h\Vert \varLambda \Vert )\) should be small, which can be achieved by reducing the time step h and/or increasing k. For instance, for the horse kick data considered in the Introduction, we have \(h=1\) and the estimated value of \(\Vert \varLambda \Vert \) is 0.61, giving the values \(R_k(0.61)=0.58, 0.21, 0.06\) for \(k=1,2,3\). This indicates that \(k=3\) is a rather adequate cut-off for these data.
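The bound (13) is trivial to evaluate numerically; a short R sketch reproducing the values just quoted (the infinite series is cut after 60 further terms, which suffices for moderate arguments):

# Truncation bound R_k(x) of (13), evaluated with x = h * ||Lambda||.
R_k <- function(k, x) {
  n <- (k + 1):(k + 60)
  0.5 * sum((2 * x)^n / factorial(n))
}
round(sapply(1:3, R_k, x = 0.61), 2)   # 0.58, 0.21, 0.06 for k = 1, 2, 3, as quoted above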

If the expected number of jumps, \(h\Vert \varLambda \Vert \), in the time interval [0, h] is large, the sample values \(X_i\) would, in the case of finite variance, have an approximately normal distribution. Since the normal distribution is determined by the first two moments only and not by the entire compounding distribution, an effective estimation of \(\varLambda /\Vert \varLambda \Vert \) is hardly possible, see Duval (2014) for a related discussion. Indeed, to get the upper bound close to 0.2 given \(h\Vert \varLambda \Vert =8\), one would need to take \(k=41\), which is hardly computationally feasible.

To summarise, if one has control over the choice of h, it should be taken so that the estimated value of \(h\Vert \varLambda \Vert \) is close to 1. For large values of this parameter, the central limit theorem prevents an effective estimation of \(\lambda \), while small values would result almost always in single jumps, with the optimisation procedure returning essentially the sample measure as a solution. Similarly to the choice of a kernel estimator bandwidth or the histogram’s bin width, a compromise should be sought. A practical approach would be to try various values of h, as we demonstrate below in Sect. 6 on the real FX data.

5 Simulation results

To illustrate the performance of our estimation methods, we generated samples of size \(n=1000\) for compound Poisson processes driven by various kinds of measures \(\varLambda \). In Sects. 5.1, 5.2 and 5.3, we consider examples of discrete jump size distributions. Note that lattice distributions with both positive and negative jumps are particularly challenging because of possible cancellations of jumps, a case barely considered in the literature so far.

In Sects. 5.4 and 5.5, we present simulation results for two cases of continuously distributed jumps: non-negative and general. The continuous measures are replaced in the simulations by their discretised versions given by (8). The grid size in these examples was \(\varDelta =0.25\). Note that no special account is taken of the fact that the measure is continuous, and the algorithms work in the same way as with genuinely discrete measures. However, the presence of atoms filling consecutive grid ranges should indicate that the true compounding measure is probably continuous. A separate analysis could be carried out to formally check this hypothesis, for instance by the methods proposed in Coca (2015). If confirmed, some kind of kernel smoothing could be used to produce an estimated density curve, or specific estimation methods for continuously distributed jumps could be employed, like the ones mentioned in the Introduction.

For all the considered examples, we applied three versions of CoF with \(h=1\), \(k=1,2,3\) and \(\omega \equiv 1\). We also applied ChF initialised at the estimate produced by CoF with \(k=1\). Observe that CoF with \(k=1\) can be made particularly fast because it reduces to a non-negative least squares optimisation problem. If computation time is no concern, one can also run CoF with higher values of k and use the resulting measure as a starting point for ChF. Given the complicated nature of the loss function, this may or may not lead to a better fit. In all these examples \(\Vert \varLambda \Vert =1\), which explains our particular choice of \(h = 1\), see the discussion above in connection with the error estimate (13).

5.1 Degenerate jump measure

Consider first the simplest measure \(\varLambda (dx)=\delta _1(dx)\), corresponding to a standard Poisson process with rate 1. Since all the jumps are integer valued and non-negative, it would be logical to take the non-negative integer grid for the possible atom positions of the discretised \(\varLambda \). This is how we proceeded for the horse kick data analysis in the Introduction. However, to test the robustness of our methods, we took the grid \(\{0,\pm 1/4,\pm 2/4,\ldots \}\). As a result, the estimated measures might place some mass on non-integer points or even on negative values of x to compensate for inaccurately fitted positive jumps. We have chosen to show in the graphs the discrepancies between the estimated and the true measures. An important indicator of the effectiveness of an estimation is the closeness of the total masses \(\Vert \hat{\varLambda }\Vert \) and \(\Vert \varLambda \Vert \). For \(\varLambda =\delta _1\), the probability of having more than 3 jumps is approximately 0.02; therefore, with the CoF method we expect that \(k=3\) would give an adequate estimate for these data. Indeed, the top panel of Fig. 3 demonstrates that CoF with \(k=3\) is much more effective in detecting the jumps of the Poisson process compared to \(k=2\) and, especially, to \(k=1\). The latter methods generate large discrepancies both in atom sizes and in the total mass of the obtained measure. Observe also the presence of artifactual small atoms at large x and even at some non-integer locations.

The bottom panel shows that a good alternative to the rather computationally demanding CoF method with \(k=3\) is the much faster combined CoF–ChF method, in which the measure \(\hat{\varLambda }_1\) is used as the initial measure in the ChF algorithm. The resulting measure \(\tilde{\varLambda }_1\) is almost identical to \(\hat{\varLambda }_3\), but also has a total mass closer to the target value 1. The total variation distances between the estimated measures \(\hat{\varLambda }_k\) and the theoretical measure \(\varLambda \) are 0.435, 0.084 and 0.053 for \(k=1,2,3\), respectively. The best fit is provided by the combined CoF–ChF method, which produces a measure \(\tilde{\varLambda }_1\) within distance 0.043 of \(\varLambda \).

Fig. 3 Simulation results for \(\varLambda =\delta _1\). Top panel: the differences between \(\varLambda (\{x\})\) and their estimates \(\hat{\varLambda }_k(\{x\})\) obtained by CoF with \(k=1, 2, 3\). Zero values of the differences are not plotted. Bottom panel: comparison of \(\hat{\varLambda }_3\) with \(\tilde{\varLambda }_1\) obtained by ChF initiated at \(\hat{\varLambda }_1\). Notice a drastic change in the vertical axis scale as we go from the top to the bottom panel

Fig. 4 Simulation results for \(\varLambda =0.2\delta _{-1}+0.2\delta _1+0.6\delta _2\). Top panel: the differences between \(\varLambda (\{x\})\) and their estimates \(\hat{\varLambda }_k(\{x\})\) obtained by CoF with \(k=1, 2, 3\). Bottom panel: comparison of \(\hat{\varLambda }_3\) with \(\tilde{\varLambda }_1\)

Fig. 5 Simulation results for a shifted Poisson distribution \(\varLambda (\{x\})={e^{-1}/ (x-1)!}\) for \(x=1, 2, \ldots \). Top panel: the differences between \(\varLambda (\{x\})\) and their estimates \(\hat{\varLambda }_k(\{x\})\) obtained by CoF with \(k=1, 2, 3\). Bottom panel: comparison of \(\hat{\varLambda }_3\) with \(\tilde{\varLambda }_1\) obtained by ChF initiated at \(\hat{\varLambda }_1\)

5.2 Discrete positive and negative jumps

Consider now the jump measure \(\varLambda =0.2\delta _{-1}+0.2\delta _1+0.6\delta _2\). This gives rise to a compound Poisson process with rate \(\Vert \varLambda \Vert =1\) and jumps of sizes \(-1,1,2\) having respective probabilities 0.2, 0.2 and 0.6. Figure 4 presents the results of our simulations. The presence of negative jumps cancelling positive jumps creates an additional difficulty for the estimation task. This phenomenon explains why the approximation obtained with \(k=2\) is worse than with \(k=1\) and \(k=3\): two jumps of sizes \(+\)1 and −1 sometimes cancel each other, which is indistinguishable from the no-jump case, see the top panel of Fig. 4. Moreover, −1 and 2 added together are the same as a single jump of size 1. The phenomenon persisted when we increased the sample size: \(k=1\) and \(k=3\) still perform better. Notice that going all the way from \(k=1\) up to \(k=3\) improves the performance of CoF, although the computing time increases dramatically. The corresponding total variation distances of \(\hat{\varLambda }_k\) to the theoretical distribution are 0.3669, 0.6268 and 0.1558. The combined method gives the distance 0.0975 and, according to the bottom plot, it is again the clear winner in this case. It is also much faster.

5.3 Unbounded compounding distribution

As an example of a measure \(\varLambda \) with unbounded support, we take a shifted Poisson distribution with parameter 1. Figure 5 presents our simulation results for this case; for computational purposes, we took the interval \(x\in [-2,5]\) as the support range for the estimated measure. In practice, the support range should be enlarged if atoms start appearing on the boundaries of the chosen interval, indicating a wider support of the estimated measure, see also Buchmann and Grübel (2003) for a related discussion. As the top panel reveals, also in this case the CoF method with \(k=3\) gives a better approximation than those with \(k=1\) or \(k=2\) (the total variation distance to the theoretical distribution is 0.1150 compared to 0.3256 and 0.9235, respectively), and the combined (faster) method gives an even better estimate with \(d_{\mathrm {TV}}(\tilde{\varLambda }_1,\varLambda )=0.0386\). Interestingly, the case \(k=2\) was the worst in terms of the total variation distance to the original measure. We suspect that the ‘pairing effect’ may be responsible: the jumps are better fitted with a single integer-valued variable rather than with the sum of two. The algorithm may also have got stuck in a local minimum, producing small atoms at non-integer positions.

5.4 Continuous non-negative compounding distribution

Consider a compound Poisson process of rate 1 with the compounding distribution being exponential with parameter 1. The top plot of Fig. 6 shows that, as expected, the approximation accuracy increases with k. Observe that the total variation distance \(d_{\mathrm {TV}}(\hat{\varLambda }_3,\pmb {\varLambda })= 0.0985\) is comparable with the discretisation error \(d_{\mathrm {TV}}(\varLambda ,\pmb {\varLambda })=0.075\). A Gaussian kernel smoothed version of \(\hat{\varLambda }_3\) is presented in the bottom plot of Fig. 6. The visible discrepancy for small values of x is explained by the fact that there were not sufficiently many small jumps in the simulated sample for the algorithm to put more mass around 0.

Optimisation in the space of measures usually tends to produce atomic measures, since these are boundary points of typical constraint sets in \(\mathbb {M}\). Indeed, \(\tilde{\varLambda }_1\) has fewer atoms than \(\pmb {\varLambda }\) does and still approximates the empirical characteristic function of the sample better.

Fig. 6 Simulation results for a compound Poisson process with jump intensity 1 and jump sizes having an exponential distribution with parameter 1. Top plot: measures obtained by the various algorithms. Bottom plot: the theoretical exponential density and the smoothed version of the \(\tilde{\varLambda }_1\) measure with a Gaussian kernel with standard deviation 0.4

5.5 Continuous compounding distribution

Finally, Fig. 7 takes up the important example of compound Poisson processes with normally distributed jumps taking both positive and negative values. Once again, the estimates \(\hat{\varLambda }_k\) improve as k increases, and the combined CoF–ChF method gives an estimate similar to \(\hat{\varLambda }_3\). Notice an inflection around 0 caused by the constraint on the estimated measure, which forces the origin to have zero mass. This shows as a dip in the curve produced by the kernel smoother.

In the presented examples with continuous compounding distributions, when choosing the kernel smoother width, we were guided by the visual smoothness of the resulting curve. As with any smoothing procedure, optimisation of the kernel width requires additional criteria to be employed, such as information criteria. It is also possible to add to the score function of the optimisation procedure a specific term, depending on the kernel function, which is responsible for the goodness of fit of the smoothed curve to the empirical data. We do not address these issues here, however, considering them a problem separate from nonparametric measure fitting, see also the Discussion section below.

Fig. 7 Top plot: estimated jump measure for a simulated sample with jump sizes having a normal distribution with mean 0.5 and variance 0.25. Bottom plot: the theoretical Gaussian density and the smoothed version of the \(\hat{\varLambda }_3\) measure with a Gaussian kernel with standard deviation 0.2

Fig. 8 Top plot: consecutive increments of the log-returns of the GBP rate against EUR from 2014-01-02 to 2014-10-10. Bottom plot: the stable distribution fitted to the increment data

6 Currency exchange data application

Lévy processes are widely used in financial mathematics to model the dynamics of log-returns, which for a commodity with price \(S_t\) at time t are defined as \(W_t=\log (S_t/S_0)\). For this model, the increments \(W_h-W_0\), \(W_{2h}-W_h,\ldots \) are independent and have a common infinitely divisible distribution. For example, many authors argue that the log-returns of currency exchange rates in a stable market indeed have i.i.d. increments, see e.g. Cont (2001). We took FX data of the Great Britain Pound (GBP) against a few popular currencies and chose to work with GBP to EUR exchange rates over a period of 200 consecutive days of a relatively stable market, from 2014-01-02 to 2014-10-10, see the top plot of Fig. 8. We fitted various distributions popular among financial analysts to the daily increments of the log-returns: Gaussian, GEV, Weibull and stable distributions. The best fit was obtained by the stable distribution. In order to have a consistent comparison with our methods, we used the loss function (3) to estimate the parameters of the stable distribution; such an estimation method goes back at least to Paulson et al. (1975). The fitted stable \(S(1.882,-1,0.002,0;0)\) distribution (in the S0 parametrisation) is presented in the bottom plot of Fig. 8. A formal Chi-square test, however, rejected the null hypothesis that the data come from the fitted stable distribution, due to the large discrepancies in the tails. The distance between the empirical characteristic function and the fitted stable distribution’s characteristic function, measured in terms of the score function \(L_{\mathrm {ChF}}\), was \(6.12\times 10^{-3}\). We then ran our CoF algorithm with \(k=1\) and obtained a distribution within the distance \(5.53\times 10^{-6}\) from the empirical characteristic function. Taking the resulting jump measure as the starting point for our ChF method, we arrived at a distribution within the distance \(8.71\times 10^{-7}\). The observed improvement is due to more accurate estimates of the large jumps of the exchange rates (which are not related to global economic or political events).
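For completeness, the increments fed to our algorithms are obtained from the price series in the obvious way. A short R sketch follows, in which the price vector S is a placeholder for the GBP/EUR series; the non-overlapping subsampling is the one used for the stability check with longer lags described below.

# Sketch: log-return increments from a price series S (a placeholder vector of daily rates),
# with non-overlapping subsampling at a lag of m days.
increments <- function(S, m = 1) {
  W <- log(S / S[1])                    # log-returns W_t = log(S_t / S_0)
  diff(W[seq(1, length(W), by = m)])    # increments over m-day intervals
}
# X1 <- increments(S, 1); X8 <- increments(S, 8)   # daily and 8-day increments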

Fig. 9 Top plot: estimated Lévy measure for the GBP/EUR log-return increments. Bottom plot: estimated compounding measures for FX data recorded with various lags: 1 (circle), 2 (plus), 4 (cross) and 8 (triangle) day intervals

It may be expected that, as in the case of linear regression, the agreement of the estimated model with the data could be “too good”. To verify the stability of our estimates, we ran our algorithms on the data with different time lags: every 2, 4 and 8 days. It is interesting to note that even at the 8-day lag our algorithms attained a distribution at the distance \(2.19\times 10^{-4}\), an order of magnitude closer to the empirical characteristic function than the fitted stable distribution, despite the fact that 8 times fewer data were used, see Fig. 9.

The estimates of the measure \(\varLambda \) obtained for the various lags are not that different, apart from the 8-day lag, when only 25 observations are available, which reassures us that our estimation methods give consistent results. These findings are illustrated in the bottom panel of Fig. 9.

7 Discussion

This paper deals with nonparametric inference for compound Poisson processes. We proposed and analysed new algorithms based on characteristic function fitting (ChF) and convolution fitting of the cumulative distribution function (CoF). The algorithms are based on the recently developed variational analysis of functionals of measures and the corresponding steepest descent methods for constrained optimisation on the cone of measures. CoF methods are capable of producing very accurate estimates, but at the expense of growing computational complexity. The ChF method depends critically on the initial approximation measure due to the highly irregular behaviour of the objective function. We have observed that the convergence problems of the ChF algorithm can often be effectively overcome by choosing the sample measure (discretised to the grid) as the initial approximation measure. However, a better alternative, as we demonstrated in the paper, is to use the measure obtained by the simplest (\(k=1\)) CoF algorithm. This combined CoF–ChF algorithm is fast and, in the majority of cases, produces a measure which is closest in total variation to the measure under estimation; it is thus our method of choice.

The practical experience we gained during various tests allows us to conclude that the suggested methods are especially well suited for the estimation of discrete jump size distributions. They work well even with jumps that take both positive and negative values, not necessarily belonging to a regular lattice, demonstrating a clear advantage over the existing methods, see Buchmann and Grübel (2003) and Buchmann and Grübel (2004). The use of our algorithms for continuous compounding distributions requires more trial and error in choosing the right discretisation grid and smoothing procedures. In order to properly take into account the continuity of the compounding measure, one may apply the direct density estimation methods proposed by van Es et al. (2007) and Watteel and Kulperger (2003). Alternatively, one can try to develop an optimisation algorithm for the class of absolutely continuous measures by characterising their tangent cones. Additional conditions on the density may also be imposed, such as Lipschitz-type conditions, to make the feasible set closed in the corresponding measure topology.

8 Appendix

8.1 Proof of Theorem 1

First-order necessary criteria for constrained optimisation in a Banach space can be derived in terms of tangent cones. Let \(\mathbb {A}\) be a subset of \(\mathbb {M}\) and \(\eta \in \mathbb {A}\). The tangent cone to \(\mathbb {A}\) at \(\eta \) is the following subset of \(\mathbb {M}\):

$$\begin{aligned} \mathbb {T}_{\mathbb {A}}(\eta ) = \liminf _{t \downarrow 0}\ t^{-1}(\mathbb {A}-\eta ). \end{aligned}$$

Recall that the \(\liminf _n A_n\) for a family of subsets \((A_n)\) in a normed space is the set of the limits of all converging sequences \(\{a_n\}\) such that \(a_n\in A_n\) for all n. Equivalently, \(\mathbb {T}_{\mathbb {A}}(\eta )\) is the closure of the set of such \(\nu \in \mathbb {M}\) for which there exists an \(\varepsilon =\varepsilon (\nu )>0\) such that \(\eta + t\nu \in \mathbb {A}\) for all \(0\le t\le \varepsilon \).

By the definition of the tangent cone, if \(\eta \) is a point of minimum of a strongly differentiable function L over a set \(\mathbb {A}\), then one must have

$$\begin{aligned} DL(\eta )[\nu ]\ge 0\quad \text {for all}\ \nu \in \mathbb {T}_{\mathbb {A}}(\eta ). \end{aligned}$$
(14)

Indeed, assume that there exists \(\nu \in \mathbb {T}_{\mathbb {A}}(\eta )\) such that \(DL(\eta )[\nu ]:=-\varepsilon <0\). Then, there is a sequence of positive numbers \(t_n\downarrow 0\) and a sequence \(\eta _n\in \mathbb {A}\) such that \(\nu =\lim _n t_n^{-1}(\eta _n-\eta )\). Because \(\Vert \eta -\eta _n\Vert =t_n(1+o(1))\Vert \nu \Vert \rightarrow 0\), we obtain that \(\eta _n\rightarrow \eta \). Since any bounded linear operator is continuous, we also have

$$\begin{aligned} DL(\eta )[\nu ]&=DL(\eta )[\lim _n t_n^{-1}(\eta _n-\eta )]\\&=\lim _n t_n^{-1} DL(\eta )[\eta _n-\eta ]=-\varepsilon . \end{aligned}$$

Furthermore, by (4),

$$\begin{aligned} DL(\eta )[\eta _n-\eta ]&=L(\eta _n)-L(\eta )+o(\Vert \eta -\eta _n\Vert )\\&=L(\eta _n)-L(\eta )+o(t_n), \end{aligned}$$

implying

$$\begin{aligned} L(\eta _n)-L(\eta )= -t_n\varepsilon (1+o(1))<-t_n\varepsilon /2 \end{aligned}$$

for all sufficiently small \(t_n\). Thus, in any ball around \(\eta \) there exists an \(\eta _n\in \mathbb {A}\) such that \(L(\eta _n)<L(\eta )\), so that \(\eta \) is not a point of a local minimum of L over \(\mathbb {A}\). This finishes the proof of (14).

In our case, the constraint set \(\mathbb {A}\) is the set \(\mathbb {L}=\{\eta \in \mathbb {M}_+:\ \eta (\{0\})=0\}\). The next step is to find a sufficiently rich class of measures belonging to the tangent cone \(\mathbb {T}_{\mathbb {L}}(\varLambda )\) for a given \(\varLambda \in \mathbb {L}\). For this, notice that for any such \(\varLambda \), the Dirac measure \(\delta _x\) belongs to \(\mathbb {T}_{\mathbb {L}}(\varLambda )\), since \(\varLambda +t\delta _x\in \mathbb {L}\) for any \(t\ge 0\) as soon as \(x\ne 0\). Similarly, given any Borel \(B\subset \mathbb {R}\), the negative measure \(-\varLambda |_B:=-\varLambda (\,\cdot \,\cap B)\), which is the restriction of \(-\varLambda \) onto B, is also in the tangent cone \(\mathbb {T}_\mathbb {L}(\varLambda )\), because for any \(0\le t\le 1\) we have \(\varLambda -t\varLambda |_B\in \mathbb {L}\).

Since, under the assumptions of the theorem, \(\nabla L(x;\varLambda )\) is a gradient function, the necessary condition (14) becomes

$$\begin{aligned} \int \nabla L(x;\varLambda )\,\nu (\mathrm {d}x)\ge 0\quad \text {for all}\ \nu \in \mathbb {T}_{\mathbb {L}}(\varLambda ). \end{aligned}$$

Substituting \(\nu =\delta _x\) above we immediately obtain the inequality in (7). Finally, taking \(\nu =-\varLambda |_B\) yields

$$\begin{aligned} \int _B \nabla L(x;\varLambda )\,\varLambda (\mathrm {d}x)\le 0. \end{aligned}$$

Since this is true for any Borel B, we conclude that \(\nabla L(x;\varLambda )\le 0\) \(\varLambda \)-almost everywhere which, combined with the previous inequality, gives the second relation in (7).

8.2 Proof of Theorem 2

Let \(\pmb {N}\) be the space of locally finite counting measures \(\varphi \) on \(\mathbb {R}\). Let \(\mathcal {N}\) be the smallest \(\sigma \)-algebra which makes measurable all the mappings \(\varphi \mapsto \varphi (B)\in \mathbb {Z}_+\) for \(\varphi \in \pmb {N}\) and compact sets B. A Poisson point process with the intensity measure \(\mu \) is a measurable mapping \(\varPi \) from some probability space into \([\pmb {N},\mathcal {N}]\) such that for any finite family of disjoint compact sets \(B_1,\cdots ,B_k\), the random variables \(\varPi (B_1),\cdots ,\varPi (B_k)\) are independent and each \(\varPi (B_i)\) has a Poisson distribution with parameter \(\mu (B_i)\). Clearly \(\mu (B)=\mathbf{E}\varPi (B)\) for any B. To emphasise the dependence of the distribution on \(\mu \), we write the expectation as \(\mathbf{E}_\mu \) in the sequel.

Consider a measurable function \(G :\pmb {N}\rightarrow \mathbb {R}\), and for a given \(z\in \mathbb {R}\) define the difference operator

$$\begin{aligned} D_{z}G(\varphi ) := G(\varphi + \delta _{z}) - G(\varphi ),\ \varphi \in \pmb {N}. \end{aligned}$$

For the iterations of such difference operators,

$$\begin{aligned} D_{z_{1}, \ldots , z_{n}}G = D_{z_{n}}(D_{z_{1}, \ldots , z_{n-1}}G),\quad (z_{1}, \ldots , z_{n})\in \mathbb {R}^n, \end{aligned}$$

it can be checked that

$$\begin{aligned} D_{z_{1}, \ldots , z_{n}}G(\nu ) = \sum _{J \subseteq \{1, 2, \ldots , n\}} (-1)^{n - |J|} G \big (\nu + \Sigma _{j \in J} \delta _{z_{j}} \big ), \end{aligned}$$

where |J| stands for the cardinality of J, so that if J is an empty set, then \(|J| = 0\). Define

$$\begin{aligned} T_\mu G(z_{1}, \ldots , z_{n}) := \mathbf{E}_\mu D_{z_{1}, \ldots , z_{n}}G(\varPi ). \end{aligned}$$

Suppose that the functional G is such that there exists a constant \(c > 0\) satisfying

$$\begin{aligned} |G \big ({\Sigma }_{j= 1}^{n} \delta _{z_{j}} \big )| \le c^{n}\ \text {for all}\ n \ge 1\ \text {and all}\ (z_{1}, \dots z_{n}). \end{aligned}$$

It was proved in Molchanov and Zuyev (2000a, Theorem 2.1) that if \(\mu , \mu '\) are finite measures, the expectation \(\mathbf{E}_{\mu +\mu '} G(\varPi )\) exists and

$$\begin{aligned}&\mathbf{E}_{\mu +\mu '} G(\varPi ) =\mathbf{E}_{\mu } G(\varPi ) \nonumber \\&\quad + \sum _{i= 1}^{\infty } \frac{1}{i!} \int _{\mathbb {R}^i}T_{\mu }G(z_{1},\dots , z_{i}) \mu '(\mathrm {d}z_{1}) \ldots \mu '(\mathrm {d}z_{i}). \end{aligned}$$
(15)

Generalisations of this formula to infinite and signed measures for square integrable functionals can be found in Last (2014). A finite-order expansion formula can be obtained by representing the expectation above in the form

$$\begin{aligned} \mathbf{E}_{\mu +\mu '} G(\varPi ) = \mathbf{E}\, G(\varPi +\varPi '), \end{aligned}$$

where \(\varPi \) and \(\varPi '\) are independent Poisson processes with intensity measures \(\mu \) and \(\mu '\), respectively, and then applying the moment expansion formula by Błaszczyszyn et al. (1997, Theorem 3.1) to \(G(\varPi +\varPi ')\) viewed as a functional of \(\varPi '\) with a given \(\varPi \). This gives us

$$\begin{aligned}&\mathbf{E}_{\mu +\mu '} G(\varPi ) =\mathbf{E}_{\mu } G(\varPi ) \nonumber \\&\quad + \sum _{i= 1}^{k} \frac{1}{i!} \int _{\mathbb {R}^i}T_{\mu }G(z_{1},\dots , z_{i}) \mu '(\mathrm {d}z_{1}) \ldots \mu '(\mathrm {d}z_{i})\nonumber \\&\quad + \,\frac{1}{(k+1)!} \int _{\mathbb {R}^{k+1}}T_{\mu +\mu '}G(z_{1},\dots , z_{k+1}) \mu '(\mathrm {d}z_{1}) \ldots \nonumber \\&\qquad \mu '(\mathrm {d}z_{k+1}). \end{aligned}$$
(16)

To prove Theorem 2, we use a coupling of the compound Poisson process \((W_t)_{t\ge 0}\) with a Poisson process \(\varPi \) on \(\mathbb {R}_+\times \mathbb {R}\) driven by the intensity measure \(\mu =\ell \times \varLambda \), where \(\ell \) is the Lebesgue measure on \([0,+\infty )\). Clearly,

$$\begin{aligned} W_{t} = \Sigma _{(t_j,x_{j} )\in \varPi _t} x_{j}=\int _0^t\int _{\mathbb {R}} x \varPi (\mathrm {d}s\, \mathrm {d}x), \end{aligned}$$

where for each realisation \(\Sigma _j \delta _{z_j}\) of \(\varPi \) with \(z_j=(t_{j}, x_{j})\), we denote by \(\varPi _t\) the restriction of \(\varPi \) onto \([0, t] \times \mathbb {R}\). For a fixed arbitrary \(y\in \mathbb {R}\) and a point configuration \(\varphi =\Sigma _j \delta _{(t_j,x_j)}\), consider the functional \(G_y\) defined by

$$\begin{aligned} G_y(\varphi ) = {{\mathrm{{1I}}}}\Bigl \{\sum _{(t_j,x_{j} )\in \varphi } x_j \le y\Bigr \} \end{aligned}$$

and notice that for any \(z=(t,x)\),

$$\begin{aligned} G_y(\varphi +\delta _z)={{\mathrm{{1I}}}}\Bigl \{\sum _{(t_j,x_{j} )\in \varphi } x_j \le y-x\Bigr \}=G_{y-x}(\varphi ). \end{aligned}$$
(17)

Expressing the cumulative distribution function \(F(y)=\mathbf {P}\{W_{h}\le y\}\) as an expectation

$$\begin{aligned} F(y) = \mathbf {P}_\mu \Big \{\sum _{(t_j,x_{j} )\in \varPi _h} x_{j}\le y\Big \}=\mathbf{E}_\mu G_y(\varPi _h), \end{aligned}$$

and putting \(\mu '=[0,h]\times \varLambda \), \(\mu ''=[h,2h]\times \varLambda \), we find

$$\begin{aligned} \mathbf{E}_{\mu '+\mu ''} G_y(\varPi )= \mathbf {P}\{W_{2h} \le y\} = \mathbf {P}\{W_{h} + W_{h}'' \le y\} = F^{*2}(y), \end{aligned}$$

where \(W_h''=W_{2h}-W_h\). Observe also that by iteration of (17),

$$\begin{aligned}&T_{\mu '}G_y(z_{1}, \ldots , z_{n})\\&\quad = \mathbf{E}_{\mu '} D_{z_{1}, \ldots , z_{n}}G_y(\varPi )\\&\quad = \sum _{J \subseteq \{1, 2, \ldots , n\}} (-1)^{n - |J|} \mathbf{E}_{\mu '} G_y \big (\varPi + \Sigma _{j \in J} \delta _{z_{j}} \big )\\&\quad = \sum _{J \subseteq \{1, 2, \ldots , n\}} (-1)^{n-|J|}F(y - \Sigma _{j \in J}x_{j})=U_{x_{1}, \ldots , x_{n}}F(y). \end{aligned}$$

To finish the proof, it now remains to apply expansion (15):

$$\begin{aligned}&F^{*2}(y)= \mathbf{E}_{\mu '+\mu ''} G_y(\varPi ) = F(y)\\&\qquad +\sum _{i=1}^{\infty } \frac{1}{i!} \int _{(\mathbb {R}_+\times \mathbb {R})^i} U_{x_{1}, \ldots , x_{i}}F(y)\, \mu ''(\mathrm {d}t_1\,\mathrm {d}x_{1}) \ldots \mu ''(\mathrm {d}t_i\,\mathrm {d}x_{i})\\&\quad = \sum _{i=0}^{\infty } h^i\Gamma _i(F,\varLambda ,y). \end{aligned}$$

8.3 Gradient of ChF loss function

The ChF method is based on the loss function \(L_{\mathrm {ChF}}\) given by (3), which is everywhere differentiable in Fréchet sense with respect to the measure \(\varLambda \). Aiming at the steepest descent gradient method described in Sect. 3 for obtaining the minimum of the loss function, we compute here the gradient of \(L_{\mathrm {ChF}}\) in terms of the following functions

$$\begin{aligned}&q_{1}(\theta , x) := \cos (\theta x) - 1, \quad q_{2}(\theta , x) := \sin (\theta x) - \theta x {{\mathrm{{1I}}}}_{\{|x|<\varepsilon \}},\\&Q_{i}(\theta , \varLambda ) := \int q_{i}(\theta , x) \varLambda (\mathrm {d}x),\quad i=1,2. \end{aligned}$$

Using this notation, the real and imaginary parts of an infinitely divisible distribution characteristic function \(\varphi =\varphi _1+i\varphi _2\) can be written down as

$$\begin{aligned} \varphi _1(\theta ,\varLambda )&= e^{h Q_{1}(\theta , \varLambda ) } \cos \{hQ_{2}(\theta ,\varLambda ) \}, \\ \varphi _2(\theta , \varLambda )&= e^{h Q_{1}(\theta , \varLambda ) } \sin \{ hQ_{2}(\theta , \varLambda ) \}. \end{aligned}$$

After noticing that \(\hat{\varphi }_{n}=\hat{\varphi }_{n,1}+i\hat{\varphi }_{n,2}\), with

$$\begin{aligned} \hat{\varphi }_{n,1}(\theta )= \frac{1}{n} \sum _{j=1}^{n} \cos (\theta X_{j}), \quad \hat{\varphi }_{n,2}(\theta ) = \frac{1}{n} \sum _{j=1}^{n} \sin (\theta X_{j}), \end{aligned}$$

the loss functional \(L_{\mathrm {ChF}}\) can be written as

$$\begin{aligned} L_{\mathrm {ChF}}(\varLambda )= & {} \int \big \{\varphi _1(\theta , \varLambda ) - \hat{\varphi }_{n,1}(\theta ) \big \}^{2} \omega (\theta )\mathrm {d}\theta \\&+ \int \big \{\varphi _2(\theta , \varLambda ) - \hat{\varphi }_{n,2}(\theta ) \big \}^{2} \omega (\theta )\mathrm {d}\theta . \end{aligned}$$

From this representation, the gradient function corresponding to the Fréchet derivative with respect to the measure \(\varLambda \) is obtained using the Chain rule (5):

$$\begin{aligned}&\nabla L_{\mathrm {ChF}}(x; \varLambda ) \nonumber \\&\quad = 2 \int \{\varphi _1(\theta , \varLambda ) - \hat{\varphi }_{n,1}(\theta )\}\nabla \varphi _1(\theta )(x; \varLambda ) \omega (\theta )\mathrm {d}\theta \nonumber \\&\qquad + 2 \int \{\varphi _2(\theta , \varLambda ) - \hat{\varphi }_{n,2}(\theta )\} \nabla \varphi _2(\theta )(x; \varLambda ) \omega (\theta ) \mathrm {d}\theta , \end{aligned}$$
(18)

where the gradients of \(\varphi _i(\theta ):=\varphi _i(\theta , \varLambda )\), \(i=1,2\), with respect to the measure \(\varLambda \), are given by

$$\begin{aligned} \nabla \varphi _1(\theta )(x; \varLambda )&= he^{h Q_{1}(\theta , \varLambda )}\\&\quad \times \big \{ \cos \big (hQ_{2}(\theta , \varLambda )\big ) q_{1}(\theta , x)\\&\quad - \sin \big (hQ_{2}(\theta ,\varLambda )\big ) q_{2}(\theta , x) \big \}, \\ \nabla \varphi _2(\theta )(x; \varLambda )&= he^{h Q_{1}(\theta , \varLambda )}\\&\quad \times \big \{ \sin \big (hQ_{2}(\theta , \varLambda )\big ) q_{1}(\theta , x) \\&\quad + \cos \big (hQ_{2}(\theta ,\varLambda )\big ) q_{2}(\theta , x) \big \}. \end{aligned}$$
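For a measure on a finite grid, the gradient (18) can be assembled directly from the formulas above. The R sketch below uses a uniform weight on an assumed \(\theta \) grid and sets the centering threshold to zero, so that the characteristic exponent reduces to the compound Poisson form (2).

# Sketch of the gradient (18) evaluated at the grid atoms of a discrete measure
# Lambda = sum_j masses[j] * delta_{atoms[j]}; omega = 1 on the assumed theta grid.
grad_chf <- function(masses, atoms, X, h, theta = seq(0.05, 10, length.out = 200), eps_c = 0) {
  q1 <- function(t, x) cos(t * x) - 1
  q2 <- function(t, x) sin(t * x) - t * x * (abs(x) < eps_c)   # eps_c = 0 recovers (2)
  d_theta <- diff(range(theta)) / length(theta)
  g <- numeric(length(atoms))
  for (t in theta) {
    Q1 <- sum(q1(t, atoms) * masses);   Q2 <- sum(q2(t, atoms) * masses)
    phi1 <- exp(h * Q1) * cos(h * Q2);  phi2 <- exp(h * Q1) * sin(h * Q2)
    e1 <- phi1 - mean(cos(t * X));      e2 <- phi2 - mean(sin(t * X))
    dphi1 <- h * exp(h * Q1) * (cos(h * Q2) * q1(t, atoms) - sin(h * Q2) * q2(t, atoms))
    dphi2 <- h * exp(h * Q1) * (sin(h * Q2) * q1(t, atoms) + cos(h * Q2) * q2(t, atoms))
    g <- g + 2 * (e1 * dphi1 + e2 * dphi2) * d_theta             # Riemann sum over theta
  }
  g   # components g_i = nabla L_ChF(x_i; Lambda), as in (9)
}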

8.4 Gradient of CoF loss function

As with the ChF method, the CoF algorithm relies on the steepest descent approach. The needed gradient function has the form

$$\begin{aligned} \nabla L_{\mathrm {CoF}}^{(k)} (x; \varLambda )&= 2h \int \Big \{ \sum _{i=0}^{k} h^i\Gamma _i(\hat{F}_n,\varLambda ,y)-\hat{F}_{n}^{*2}(y)\Big \} \\&\quad \times \sum _{j= 0}^{k-1}h^j \varXi _j(\hat{F}_n,\varLambda ,y,x) \omega (y)\mathrm {d}y, \end{aligned}$$

where

$$\begin{aligned} \varXi _j(F,\varLambda ,y,x)= \frac{1}{j!} \int _{\mathbb {R}^{j}} U_{x, x_{1}, \ldots , x_{j}}F(y) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{j}). \end{aligned}$$

This formula follows from the Chain rule (5) and the equality

$$\begin{aligned}&\nabla \Big (\sum _{j= 1}^{k} \frac{h^{j}}{j!} \int _{\mathbb {R}^{j}} U_{x_{1}, \ldots , x_{j}} F(y) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{j}) \Big ) (x; \varLambda )\\&\quad = h\sum _{j= 0}^{k-1} \frac{h^{j}}{j!} \int _{\mathbb {R}^{j}} U_{x, x_{1}, \ldots , x_{j}} F(y) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{j}). \end{aligned}$$

To justify the last identity, it suffices to see that for any integrable symmetric function \(u(x_{1}, \ldots ,x_{j}) \) of \(j\ge 1\) variables,

$$\begin{aligned}&\nabla \Big ( \int _{\mathbb {R}^{j}} u(x_{1}, \ldots ,x_{j}) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{j}) \Big ) (x; \varLambda )\\&\quad = j\int _{\mathbb {R}^{j-1}}u(x,x_{1},\ldots ,x_{j-1}) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{j- 1}). \end{aligned}$$

This is due to

$$\begin{aligned}&\int _{\mathbb {R}^{j}} u(x_{1}, \ldots ,x_{j}) (\varLambda +\nu )(\mathrm {d}x_{1}) \ldots (\varLambda +\nu )(\mathrm {d}x_{j})\\&\qquad - \int _{\mathbb {R}^{j}} u(x_{1}, \ldots ,x_{j}) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{j}) \\&\quad = \sum _{k=1}^j\int _{\mathbb {R}^{j}}u(x_{1},\ldots ,x_{j}) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{k- 1})\\&\qquad \times \nu (\mathrm {d}x_k)\varLambda (\mathrm {d}x_{k+1})\ldots \varLambda (\mathrm {d}x_{j})+o(\Vert \nu \Vert ), \end{aligned}$$

where the last sum equals

$$\begin{aligned} j\int _{\mathbb {R}^{j}}u(x,x_{1},\ldots ,x_{j-1})\nu (\mathrm {d}x) \varLambda (\mathrm {d}x_{1}) \ldots \varLambda (\mathrm {d}x_{j- 1}). \end{aligned}$$

For example, the cost function (12) with \(k=1\) and \(\omega (y)\equiv 1\) has the gradient

$$\begin{aligned}&\nabla L_{\mathrm {CoF}}^{(1)} (x; \varLambda ) \\&\quad = 2h \int \Big \{ \hat{F}_n(y) -\hat{F}_{n}^{*2}(y)\\&\qquad + \int \hat{F}_{n}(y-z) \varLambda (\mathrm {d}z) \Big \} \hat{F}_{n}(y-x)\,\mathrm {d}y. \end{aligned}$$

Respectively, the discretised gradient (9) used in the steepest descent algorithm is the vector \(\pmb {g}\) with the components

$$\begin{aligned} g_i= & {} 2h \int \Big \{ \hat{F}_n(y) -\hat{F}_{n}^{*2}(y)\nonumber \\&+ \sum _{j=1}^l \hat{F}_{n}(y-x_j) \lambda _j \Big \} \hat{F}_{n}(y-x_i)\,\mathrm {d}y,\nonumber \\ i= & {} 1,\cdots ,l. \end{aligned}$$
(19)
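A short R sketch of the discretised gradient (19), assembled from empirical distribution functions and a Riemann sum over an assumed y grid; the grid range and resolution are assumptions made only for the example.

# Sketch of the discretised gradient (19) for the CoF loss with k = 1 and omega = 1.
grad_cof1 <- function(lambda, x_grid, X, h,
                      y = seq(2 * min(0, min(X)) - 1, 2 * max(X) + 1, length.out = 400)) {
  Fn  <- ecdf(X)                                   # empirical c.d.f. \hat F_n
  S   <- outer(X, X, "+"); S <- S[upper.tri(S)]    # pairwise sums X_i + X_j, i < j
  Fn2 <- ecdf(S)                                   # empirical convolution (11)
  dy  <- diff(range(y)) / length(y)
  # inner factor of (19): \hat F_n(y) - \hat F_n^{*2}(y) + sum_j \hat F_n(y - x_j) * lambda_j
  inner <- Fn(y) - Fn2(y) + sapply(y, function(v) sum(Fn(v - x_grid) * lambda))
  sapply(x_grid, function(xi) 2 * h * sum(inner * Fn(y - xi)) * dy)
}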