Background

Continued exploration of the cross-validation-based approach proposed in [1] has revealed a number of issues with the original formulation of the optimization equation. That formulation was ad hoc in its combination of two statistical approaches (cross-validation and information criteria), and the result was a metric without a clear basis in statistical theory. As such, we strongly recommend that users rely upon the method described here rather than the one set forth in the original publication. The shortcomings can be summarized as follows:

  1. Both cross-validation and information criterion approaches aim to avoid overfitting. In the case of cross-validation, one attempts to estimate out-of-sample prediction error, so the score should measure the prediction errors for the held-out points. If the model uses a k that is too small or an s that is too large, it is likely to overfit the training data and predict the testing data poorly. On the other hand, if the model uses a k that is too large or an s that is too small, it will underfit the training data by missing the real variation in space use. Cross-validation thus naturally penalizes model complexity, because excessive complexity (small k) results in poor predictions. Information criterion approaches, by contrast, include a penalty term that increases with model complexity as measured by the number of parameters. Using such an information criterion as a cross-validation score is unnecessary, since cross-validation already penalizes excessive model complexity.

  2. The formulation of the information criterion score did not follow the rules of probability: probabilities of out-of-sample predictions were not properly normalized, and multiple probabilities were combined by summation. In this sense, it lacked a firm connection to the statistical theory underlying information criterion approaches.

Here we propose an alternative formulation in which we interpret a normalized version of the LoCoH hulls as an estimated probability surface and recast the cross-validation score as the total log probability of out-of-sample predictions, a common choice in cross-validation schemes. The approach, explained in detail below, not only behaves more appropriately but also substantially alters the optimal parameter values selected by the algorithm. Thus, in addition to presenting the new cross-validation equation, we include tables and figures with the newly selected parameter values and the derived metrics recalculated from them (home range area, mean duration, and mean visitation rates). Finally, we offer an alternative R script that searches a much broader parameter space more efficiently (Additional file 1).

Updated Cross-Validation Approach

Using the training/testing split described in the original presentation of the algorithm, we conducted a grid-based exploration of parameter space (Fig. 1), whereby each of the training/testing datasets (\(i = 1, \ldots, n\)) was analyzed at every combination of k and s values on the grid. This analysis entailed the creation of local convex hulls with k nearest neighbors and a scaling factor of s. In all subsequent analyses, we assume that the scaling of time follows a linear formulation; however, when movement patterns more closely exemplify diffusion dynamics, an alternative equation for the TSD may be more appropriate [2]. The test points (\(j = 1, \ldots, m\)) were then overlaid on the resulting hulls.
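To make the grid search concrete, the following R sketch outlines the procedure under stated assumptions: `train` and `test` are lists holding the n training/testing datasets, and `fit_hulls()` and `score_fold()` are hypothetical stand-ins for the hull construction (e.g., via the tlocoh package) and the scoring step formalized in the next section. The grid increments are illustrative, not the values used in Additional file 1.

```r
# Hypothetical grid search over k and s; train/test, fit_hulls(), and
# score_fold() are assumed helpers, not part of the published script.
k_grid <- seq(10, 400, by = 10)     # candidate nearest-neighbor counts
s_grid <- seq(0, 0.05, by = 0.001)  # candidate TSD scaling factors

cv_surface <- matrix(NA_real_,
                     nrow = length(k_grid), ncol = length(s_grid),
                     dimnames = list(k = k_grid, s = s_grid))

for (ki in seq_along(k_grid)) {
  for (si in seq_along(s_grid)) {
    # Sum the out-of-sample log probabilities across all n folds
    cv_surface[ki, si] <- sum(vapply(seq_along(train), function(i) {
      hulls <- fit_hulls(train[[i]], k = k_grid[ki], s = s_grid[si])
      score_fold(hulls, test[[i]])
    }, numeric(1)))
  }
}

# The maximum of the surface (the white boxes in Fig. 1) identifies
# the optimal parameter set
opt   <- which(cv_surface == max(cv_surface), arr.ind = TRUE)
k_hat <- k_grid[opt[1, "row"]]
s_hat <- s_grid[opt[1, "col"]]
```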

Fig. 1

Conceptual Figure of Grid-based Search. A cross-validation surface is generated as the algorithm searches over a grid of alternative s and k values for each individual movement path. The increments of the grid can be chosen by the user. The peak in the surface corresponds to the parameter set whose home range assigns the highest probability to the test points. Here, the white boxes denote the maximum probability value and, thereby, the optimal parameter set

We formulate the probabilities for out-of-sample points by normalizing the LoCoH surface so that the probability of an observation occurring at a particular location can be calculated. This value is obtained by dividing the number of training hulls containing the test point location (\(g_{i,j}\)) by the summed area of all training hulls (\(A_{i}\)). The log probability is then calculated for each test point per training hullset. To avoid log probabilities of \(-\infty\), test points not contained within any hull are assigned a probability equal to the inverse of \(A_{i}^{2}\), resulting in a substantially lower log probability than that of a test point contained in a single hull. Finally, a single value (\(P_{k,s}\)) is assigned to each combination of k and s values by summing across all of the test points in all of the training/testing datasets:

$${P_{k,s}} = \sum_{i=1}^{n} \sum_{j=1}^{m} \log\left(\frac{g_{i,j}}{A_{i}}\right) $$

Because the probability of each test point is normalized by the total area contained within all of the training hulls, there exists a natural penalty for high k values. For example, a k value equal to the number of training points (\(k_{max}\)), regardless of the s value, will result in all hulls being identical and each test point overlapping all of the hulls. However, the large total area of the hullset when \(k = k_{max}\) will result in relatively small probability values for each test point (i.e., probability values equal to the inverse of the area of a single hull), effectively penalizing the parameter set containing \(k_{max}\). Because the score scales with the total area of the hullset, the underlying cross-validation procedure could easily be extended to optimize the adaptive parameter in the a-method (as opposed to the k-method).
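A minimal sketch of this scoring step, assuming the hulls and test points are held as sf objects in R; `score_fold()` is the hypothetical helper referenced in the grid-search sketch above, and sf is simply a convenient substitute for whatever spatial machinery the actual script employs.

```r
library(sf)

# Log probability of one fold's held-out points under the normalized
# hull surface; `hulls` is an sf polygon layer of training hulls and
# `test_pts` is an sf point layer of the m test points.
score_fold <- function(hulls, test_pts) {
  A <- sum(as.numeric(st_area(hulls)))          # summed hull area, A_i
  g <- lengths(st_intersects(test_pts, hulls))  # hulls covering each point, g_ij
  p <- ifelse(g > 0, g / A, 1 / A^2)            # floor of 1/A_i^2 for uncovered points
  sum(log(p))                                   # fold's contribution to P_k,s
}
```

Because both g/A and the 1/A² floor shrink as the total hull area grows, large-k hullsets are penalized exactly as described above.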

Results

The optimal parameter values selected using the corrected cross-validation method are substantially different from those selected in the original publication (Table 1). However, because the original formulation was not supported by cohesive statistical theory, we discuss these new results only in reference to the guideline-based parameter values rather than comparing them to the results emerging from the published algorithm. The mean s value selected using the algorithm was 0.02 (SE = 0.008) for springbok and 0.0012 (SE = 0.0005) for zebra, whereas the mean s value selected using the guidelines was 0.005 (SE = 0.002) for springbok and 0.017 (SE = 0.002) for zebra. Thus, the s values selected by the algorithm and the guidelines did not differ significantly for springbok (p=0.10), but did for zebra (p<0.001). In the case of the k values, the optimal values selected using the algorithm were significantly higher than those resulting from the guidelines. The mean k value selected using the algorithm for springbok was 225.5 (SE = 66.83), whereas the mean using the guidelines was 22.5 (SE = 1.71; p=0.003). The same trend was observed in zebra, where the mean k value based on the algorithm was 347.2 (SE = 54.36), whereas the mean from the guidelines was 20 (SE = 1.58; p=0.004).

Table 1 Parameter values for analysis

The significantly higher k values emerging from the algorithm gave rise to significantly larger home ranges in both species (Table 2). In springbok, the mean home range size was 265.41 km² (SE = 76.23 km²) using the high end of the guideline-based range and 401.64 km² (SE = 127.56 km²) using the algorithm (p=0.05). In zebra, the mean home range was 694.43 km² (SE = 80.81 km²) using the guidelines and 1081.29 km² (SE = 162.17 km²) when the algorithm was applied (p=0.01). When the derived metrics were considered, however, the substantial differences in k values did not always result in significantly different duration values (Table 3) and visitation rates (Table 4). Though the duration values in zebra derived from the algorithm were significantly higher than those derived using the high value from the guideline-based range (p=0.05), this was not the case for springbok (p=0.08). Similarly, the visitation rates emerging from the parameter sets selected by the algorithm were not significantly different from those derived based on the guidelines in either species (p=0.33 in springbok and p=0.15 in zebra).

Table 2 Home range areas (in square kilometers)
Table 3 Mean duration (MNLV) values. The derived metrics were obtained using the parameter sets recommended by the algorithm and by the guidelines set forth in the T-LoCoH documentation
Table 4 Mean visitation (NSV) values

Conclusion

The results presented here indicate that the effect of selecting parameters using the algorithm rather than the guidelines will be highly contingent upon the focus of the research question. Where home range delineation is the goal, the results are likely to differ significantly (Fig. 2). In the case of epidemiological questions, however, the effects will be somewhat less predictable, and in certain cases, similar conclusions might be drawn irrespective of the approach used for selecting optimal parameters. If an element of the analysis involves comparisons across individuals or species, however, the cross-validation-based approach provides a unifying framework governed by statistical properties of the home ranges rather than subjective selections by the user.

Fig. 2

Comparison of Resulting Home Ranges. An illustration of two sets of home ranges resulting from the parameter sets chosen by the algorithm (red), the low end of the guideline-based range (blue), and the high end of the guideline-based range (black). The home range set on the left is based on the sample points from the springbok AG207, and the largest home range covers 429.81 km². The home range set on the right is based on the GPS fixes from zebra AG256, and the largest home range covers 1363.21 km²

Fig. 3

High Resolution Cross-Validation Surface. A high-resolution depiction of a portion of the optimal parameter space traversed during the final stage of the efficient search algorithm. All parameter sets with log probability values above -10090 are shown, with darker shading indicating higher probability. In this particular application, the search is performed over finer increments of s (0.0001 rather than 0.001), and the optimal parameter set (k=171 and s=0.0133) is similar to the parameter set selected at the coarser scale
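The coarse-to-fine strategy behind this final stage can be sketched as follows; the window widths and step sizes are illustrative assumptions rather than the values hard-coded in Additional file 1, and `fit_hulls()`, `score_fold()`, `k_hat`, and `s_hat` carry over from the earlier sketches.

```r
# Hypothetical second-stage search on a finer grid centered on the
# coarse optimum (k_hat, s_hat) found by the initial grid search.
s_fine <- seq(max(0, s_hat - 0.005), s_hat + 0.005, by = 0.0001)
k_fine <- seq(max(3, k_hat - 25), k_hat + 25, by = 1)

fine_surface <- outer(k_fine, s_fine, Vectorize(function(k, s) {
  sum(vapply(seq_along(train), function(i) {
    score_fold(fit_hulls(train[[i]], k = k, s = s), test[[i]])
  }, numeric(1)))
}))

# Refined optimum, analogous to k=171 and s=0.0133 in Fig. 3
opt_fine <- which(fine_surface == max(fine_surface), arr.ind = TRUE)
```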