Spatio-Temporal Scale Selection in Video Data
Abstract
This work presents a theory and methodology for simultaneous detection of local spatial and temporal scales in video data. The underlying idea is that if we process video data by spatiotemporal receptive fields at multiple spatial and temporal scales, we would like to generate hypotheses about the spatial extent and the temporal duration of the underlying spatiotemporal image structures that gave rise to the feature responses. For two types of spatiotemporal scale-space representations, (i) a noncausal Gaussian spatiotemporal scale space for offline analysis of prerecorded video sequences and (ii) a time-causal and time-recursive spatiotemporal scale space for online analysis of real-time video streams, we express sufficient conditions for spatiotemporal feature detectors in terms of spatiotemporal receptive fields to deliver scale-covariant and scale-invariant feature responses. We present an in-depth theoretical analysis of the scale selection properties of eight types of spatiotemporal interest point detectors in terms of either: (i)–(ii) the spatial Laplacian applied to the first- and second-order temporal derivatives, (iii)–(iv) the determinant of the spatial Hessian applied to the first- and second-order temporal derivatives, (v) the determinant of the spatiotemporal Hessian matrix, (vi) the spatiotemporal Laplacian and (vii)–(viii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix. It is shown that seven of these spatiotemporal feature detectors allow for provable scale covariance and scale invariance. Then, we describe a time-causal and time-recursive algorithm for detecting sparse spatiotemporal interest points from video streams and show that it leads to intuitively reasonable results.
An experimental quantification of the accuracy of the spatiotemporal scale estimates and the amount of temporal delay obtained from these spatiotemporal interest point detectors is given, showing that: (i) the spatial and temporal scale selection properties predicted by the continuous theory are well preserved in the discrete implementation and (ii) the spatial Laplacian or the determinant of the spatial Hessian applied to the first- and second-order temporal derivatives leads to much shorter temporal delays in a time-causal implementation compared to the determinant of the spatiotemporal Hessian or the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix.
Keywords
Scale, Scale space, Scale selection, Spatial, Temporal, Spatiotemporal, Scale invariance, Scale covariance, Feature detection, Differential invariant, Interest point, Video analysis, Computer vision
1 Introduction
A basic paradigm for video analysis consists of performing the first layers of visual processing based on successive layers of spatiotemporal receptive fields.
From a mathematical viewpoint, such an approach can be motivated from the fact that the measurement of the image intensity at a single point in space–time does in general not carry any meaningful information, since such a measurement is strongly dependent on external factors, such as the usually unknown illumination of the scene. The relevant information is instead carried by the relative relations between the measurements of image intensities at different points over space and time, which implies that it is natural to perform visual processing of video data based on local neighbourhoods over space and time.
From a biological viewpoint, such an approach can also be motivated from the fact that the first layers of mammalian vision can be modelled in terms of spatiotemporal receptive fields over multiple spatial and temporal scales. Cell recordings from neurones in the primary visual cortex have shown that there are spatiotemporal receptive fields tuned to different sizes and orientations in the image domain, to different integration times over the temporal domain as well as to different image velocities in space–time [12, 13, 32, 33]. Interestingly, the shapes of the spatiotemporal receptive field families that have been measured in biological vision can furthermore be explained by normative theories of visual receptive fields [69, 71, 75, 78], whose axiomatic derivation is based on structural properties of the environment in combination with assumptions about the internal structure of an idealized vision system to ensure a consistent treatment of image representations over multiple spatiotemporal scales.
A general problem when applying the notion of receptive fields in practice, however, is that the types of responses that are obtained in a specific situation can be strongly dependent on the scale levels at which they are computed. Figures 1, 2, 3 and 4 illustrate this problem by showing snapshots of spatiotemporal receptive field responses over multiple spatial and temporal scales for a video sequence and for different types of spatiotemporal features computed from it. Note how qualitatively different types of responses are obtained at different spatiotemporal scales. At some spatiotemporal scales, we get strong responses due to the movements of the paddle or the motion of the paddler in the kayak. At other spatiotemporal scales, we get relatively larger responses because of the movements of the here unstabilized camera. The spatiotemporal texture due to the wave patterns on the water surface also leads to different types of responses at different spatiotemporal scales. A computer vision system intended to process the visual input from general spatiotemporal scenes therefore needs to decide on which responses within the family of spatiotemporal receptive fields over different spatial and temporal scales it should base its analysis, as well as how the information from different subsets of spatiotemporal scales should be combined.
For purely spatial data, the problem of performing spatial scale selection is nowadays rather well understood. Given the spatial Gaussian scale-space concept [24, 34, 44, 46, 47, 59, 60, 67, 70, 106, 111, 120, 123], a general methodology for spatial scale selection has been developed based on local extrema over spatial scales of scale-normalized differential entities [62, 64, 65, 72, 73]. This general methodology has in turn been successfully applied to develop robust methods for image-based matching and recognition [5, 41, 52, 68, 74, 84, 86, 87, 89, 90, 112, 113, 114] that are able to handle large variations of the size of the objects in the image domain, with numerous applications regarding object recognition, object categorization, multi-view geometry, construction of 3D models from visual input, human–computer interaction, biometrics and robotics. Alternative approaches for spatial scale selection in other problem domains have also been proposed [7, 8, 10, 19, 28, 29, 31, 38, 39, 40, 54, 55, 66, 82, 83, 85, 91, 92, 105, 109, 115].
Much less research has, however, been performed on developing methods for choosing locally appropriate temporal scales for spatiotemporal analysis of video data. While some methods for temporal scale selection have been developed [49, 63, 122], the earliest methods suffer from either theoretical or practical limitations. The initial work on time-causal temporal scale selection in Lindeberg [63] is primarily developed over the discrete temporal Poisson scale space, which possesses a semigroup property over temporal scales and therefore leads to unnecessarily long temporal delays for reasons explained in Lindeberg [77, Appendix A]. The spatiotemporal scale selection method in Laptev and Lindeberg [49] is based on a spatiotemporal Laplacian operator that is not scale covariant under independent relative scaling transformations of the spatial versus the temporal domains (see Sect. 4.8). This implies that the spatial and temporal scale estimates will not be robust under independent variations of the spatial and temporal scales in video data, as arise, for example, when viewing the same scene with two cameras having different sensor characteristics in terms of spatial resolution or temporal frame rate. The spatiotemporal scale selection method for the determinant of the spatiotemporal Hessian in Willems et al. [122] does not make use of the full flexibility of the notion of \(\gamma \)-normalized derivative operators (see Sect. 4.5) and has not previously been developed over a time-causal and time-recursive spatiotemporal domain, as is necessary for processing real-time image streams with requirements of short temporal latencies of the feature responses for time-critical applications and complementary requirements of only small, compact buffers of past information.
The subject of this article is to develop an extended theory for performing spatiotemporal scale selection in video data, to generate hypotheses about local characteristic spatial and temporal scales in the video data before recognizing the objects or the spatiotemporal events in the scene that the camera is observing. For this domain, we can consider two basic use cases. For offline analysis of prerecorded video, one may take the liberty of accessing the virtual future in relation to any prerecorded time moment and make use of symmetric filtering over the temporal domain based on the noncausal Gaussian spatiotemporal scale-space theory [61, 67, 70]. For online analysis of real-time video streams, on the other hand, the future cannot be accessed, and we will base the analysis on a fully time-causal and time-recursive spatiotemporal scale-space concept for real-time image streams that only requires access to information from the present moment and a very compact buffer of what has occurred in the past [75], and which constitutes an extension of previous temporal scale-space and multi-scale models [23, 27, 45, 81, 110]. Specifically, for performing spatiotemporal feature detection in the latter time-causal scenario, we will build upon a recently developed theory for temporal scale selection in a time-causal scale-space representation [77] and extend that theory to spatiotemporal scale selection for features that are computed based on a time-causal spatiotemporal scale-space representation. The resulting theory can be seen as an extension of the previously developed spatial scale selection methodology [64, 65, 73] from spatial images to spatiotemporal video and real-time image streams.
We will start developing our theory for spatiotemporal scale selection with respect to the problem of detecting sparse spatiotemporal interest points [6, 9, 11, 14, 18, 20, 21, 30, 49, 88, 94, 97, 99, 100, 107, 122, 124, 126, 127], which may be regarded as the conceptually simplest problem domain because of the sparsity of spatiotemporal interest points and the close connection between this problem domain and the detection of spatial interest points, for which there exists a theoretically well-founded and empirically tested framework regarding scale selection over the spatial domain [1, 4, 5, 15, 17, 25, 42, 65, 72, 74, 84, 89, 90, 112]. Specifically, using a noncausal Gaussian spatiotemporal scale-space model, we will perform a theoretical analysis of the spatiotemporal scale selection properties of eight different types of spatiotemporal interest point detectors and show that seven of them: (i) the spatial Laplacian of the first-order temporal derivative, (ii) the spatial Laplacian of the second-order temporal derivative, (iii) the determinant of the spatial Hessian of the first-order temporal derivative, (iv) the determinant of the spatial Hessian of the second-order temporal derivative, (v) the determinant of the spatiotemporal Hessian matrix, (vi) the first-order temporal derivative of the determinant of the spatial Hessian matrix and (vii) the second-order temporal derivative of the determinant of the spatial Hessian matrix, all lead to fully scale-covariant spatiotemporal scale estimates and scale-invariant feature responses under independent scaling transformations of the spatial and the temporal domains.
For (viii) the spatiotemporal Laplacian, it is on the other hand not possible to achieve scale covariance or scale invariance, which explains the poor robustness of the spatiotemporal interest points computed from the spatiotemporal Harris operator with scale selection based on the spatiotemporal Laplacian [49] on video data in which there are large independent variations in the spatial and temporal scales of the underlying spatiotemporal image structures.
Then, we will show how this theory can be transferred to an implementation based on fully time-causal spatiotemporal receptive fields, to enable the detection of spatiotemporal features from real-time image streams in which the future cannot be accessed. Since any time-causal image measurement at a nonzero temporal scale is associated with a nonzero temporal delay, we will introduce an additional parameter q that calibrates the spatiotemporal interest point detectors to deliver temporal scale estimates \(\hat{\sigma }_{\tau } = q \, \hat{\sigma }_{\tau ,0}\) for \(q \le 1\), as opposed to the choice \(\hat{\sigma }_{s} = \hat{\sigma }_{s,0}\) that is more common over the spatial domain. This enables shorter temporal delays and therefore faster responses in time-critical real-time scenarios, motivated by the general observation that the temporal delay can be expected to be proportional to the temporal scale level when expressed in units of the temporal standard deviation of the temporal scale-space kernel.
Whereas the explicit algorithms and experiments in this paper are restricted to spatiotemporal scale selection at sparse interest points over space and time, in a companion paper [76] we develop complementary methods for computing dense maps of spatial and temporal scale estimates in video data based on a structurally similar theory.
1.1 Structure of this Article
As conceptual background to the work, Sect. 2 starts by describing the theoretical model for spatiotemporal receptive fields and the resulting scale-space concepts that we build upon for computing image and video representations over multiple spatial and temporal scales.
When developing a theory for spatiotemporal scale selection, the main foundational questions concern what properties the scale selection method should possess and how the scale estimates should be computed. In Sect. 3, we show how it is possible to construct a well-founded theory for simultaneous selection of spatial and temporal scales in video data, by detecting local extrema over spatial and temporal scales of appropriately scale-normalized spatiotemporal derivative responses. This theory is generally valid for a large class of homogeneous spatiotemporal differential invariants, beyond the more explicit examples of spatiotemporal feature detectors considered in more detail in later sections. The theory specifically includes a general statement about scale-covariant properties of the resulting spatiotemporal scale estimates, which implies that the scale estimates are guaranteed to adaptively follow variabilities in spatial and temporal scale levels in the data. It also comprises scale-invariant properties of the resulting spatiotemporal features and their magnitude strength measures, which imply that similar types of spatiotemporal image features, although at different scales, will be computed if the data in the video sequence are subject to independent scaling transformations of the spatial and the temporal domains. In these respects, the proposed theory obeys the desirable properties of a spatiotemporal scale selection methodology.
The theory presented so far does, however, comprise two free parameters: a spatial scale normalization power \(\gamma _\mathrm{s}\) and a temporal scale normalization power \(\gamma _{\tau }\). To understand the behaviour of spatiotemporal feature detectors over multiple scales in more specific situations, Sect. 4 then shows how the scale selection properties of spatiotemporal feature detectors can be analysed by calculating their feature responses at multiple spatiotemporal scales in closed form to determine the scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\).
Specifically, we present an in-depth analysis of the theoretical scale selection properties of eight spatiotemporal derivative expressions that may be considered as candidates for defining spatiotemporal interest point detectors, when applied to idealized model patterns in the form of Gaussian blinks or Gaussian onset blobs of different spatial extent and of different temporal duration. By requiring that the selected spatial and temporal scales should reflect the spatial extent and the temporal duration of the input pattern, we show that seven of these spatiotemporal derivative expressions: (i)–(ii) the spatial Laplacian of the first- and second-order temporal derivatives, (iii)–(iv) the determinant of the spatial Hessian of the first- and second-order temporal derivatives, (v) the determinant of the spatiotemporal Hessian matrix and (vi)–(vii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix, can be scale calibrated to reflect the spatial extent and the temporal duration of the underlying spatiotemporal image structures that gave rise to the filter responses. For the remaining expression, an attempt to define a spatiotemporal Laplacian operator, the lack of scale covariance under independent scaling transformations of the spatial and temporal domains implies, however, that a corresponding scale-invariant scale calibration cannot be done for that operator. That in turn implies that applying the spatiotemporal Laplacian to video data in which there are unknown spatiotemporal scale variations can be expected to lead to undesirable artefacts.
In Sect. 5, we then present a general algorithm for detecting spatiotemporal interest points from spatiotemporal scale-space extrema of scale-normalized spatiotemporal expressions. Specifically, we present a detailed algorithm for detecting such image features based on a time-causal and time-recursive spatiotemporal scale-space representation. Compared to a corresponding algorithm expressed over a noncausal spatiotemporal scale space, as for the case of using a Gaussian spatiotemporal scale space for analysing prerecorded video sequences, our time-causal algorithm never accesses information from the future and can therefore be applied in real-time settings on video streams. Additionally, by the time-recursive formulation, the requirements on temporal buffering of past information are much lower and fewer computations are needed, which improves the computational efficiency also when the algorithm is applied in a noncausal setting for analysing prerecorded video sequences.
As a verification of whether the proposed theory and methods do what they are supposed to do, Sect. 6 presents an experimental quantification of the numerical accuracy of the spatiotemporal scale estimates as well as the amount of temporal delay for the different types of spatiotemporal interest point detectors considered in this work, when applied to idealized spatiotemporal model patterns with ground truth and in the context of a time-causal spatiotemporal scale-space representation. The results first of all show that the theoretical properties of spatiotemporal feature detectors responding at spatial and temporal scales corresponding to the spatial extent and the temporal duration of the underlying image structures transfer, to a very good approximation, to the proposed discrete implementation. Secondly, it is shown that the interest point detectors defined from applying either the spatial Laplacian or the determinant of the spatial Hessian to the first- or second-order temporal derivatives lead to significantly shorter temporal delays compared to the interest point detectors defined from the determinant of the spatiotemporal Hessian or the first- and second-order temporal derivatives of the determinant of the spatial Hessian. For time-critical applications, this implies that the first set of spatiotemporal feature detectors will respond faster than the second set, which in turn allows an autonomous agent to react faster. Finally, Sect. 7 concludes with a summary and discussion.
1.2 Relations to Previous Contributions

the motivations underlying the developments of this theory and the relations to previous work (Sect. 1),

more details concerning the underlying spatiotemporal receptive field model (Sect. 2),

a more extensive description of the proposed general methodology for spatiotemporal scale selection including: (i) its formulation based on temporal scale normalization by \(L_p\)-normalization of the temporal derivative operators, (ii) the theory for scale-invariant and scale-covariant properties of the resulting spatiotemporal features with their spatiotemporal scale estimates as well as (iii) spatiotemporal scale selection based on spatiotemporal differential invariants expressed in terms of local gauge coordinates that guarantee rotational invariance and which could not be included in the conference paper because of lack of space (Sect. 3),

the treatment of two additional spatiotemporal differential invariants, the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix,

the detailed theoretical analysis of the scale selection properties of the eight different spatiotemporal differential invariants treated in this paper, showing the explicit derivations of how the spatial and temporal scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) should be determined by scale calibration for each feature detector (Sect. 4),

more complete details about the composed algorithm for detecting spatiotemporal interest points with spatiotemporal scale selection based on time-causal and time-recursive spatiotemporal receptive fields, including a change of order between the spatial and the temporal smoothing operations that substantially reduces the amount of computations (Sect. 5),

an experimental quantification of the accuracy of the scale estimates and the temporal delays for the different types of spatiotemporal feature detectors when applied to idealized spatiotemporal model patterns (Sect. 6) and

a detailed description of the corresponding spatial scale-space extrema algorithm on which the spatiotemporal scale-space extrema algorithm is based (“Appendix A”).
2 Spatio-Temporal Receptive Field Model

\(x = (x_1, x_2)^\mathrm{T}\) denotes the image coordinates,

t denotes time,

s denotes the spatial scale,

\(\tau \) denotes the temporal scale,

\(v = (v_1, v_2)^\mathrm{T}\) denotes a local image velocity,
\(\varSigma \) denotes a spatial covariance matrix determining the spatial shape of a spatial affine Gaussian kernel $$\begin{aligned} g(x;\; s, \varSigma ) = \frac{1}{2 \pi s \sqrt{\det \varSigma }} \mathrm{e}^{-x^\mathrm{T} \varSigma ^{-1} x/2s}, \end{aligned}$$ (2)

\(g(x_1 - v_1 t, x_2 - v_2 t;\; s, \varSigma )\) denotes a spatial affine Gaussian kernel that moves with image velocity \(v = (v_1, v_2)^\mathrm{T}\) in space–time and

\(h(t;\; \tau )\) is a temporal smoothing kernel over time.
For simplicity, we shall in this treatment henceforth restrict ourselves to space–time separable receptive fields obtained by setting the image velocity to zero \(v = (v_1, v_2)^\mathrm{T} = (0, 0)^\mathrm{T}\) and to receptive fields that are based on rotationally symmetric Gaussian kernels over the spatial domain by setting the spatial covariance matrix to a unit matrix \(\varSigma = I\).
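As a minimal numerical sketch of the spatial kernel (2), the following Python function evaluates the affine Gaussian on a grid, with the default \(\varSigma = I\) corresponding to the rotationally symmetric case used in the remainder of this treatment. The function name `affine_gaussian` and the grid parameters are illustrative assumptions, and the matrix \(\varSigma \) is assumed to be symmetric and positive definite.

```python
import numpy as np

def affine_gaussian(x1, x2, s, Sigma=None):
    """Evaluate the spatial affine Gaussian kernel g(x; s, Sigma) of Eq. (2).

    x1, x2 : arrays of image coordinates (same shape)
    s      : spatial scale parameter (variance units)
    Sigma  : symmetric positive definite 2x2 matrix; defaults to the
             unit matrix, i.e. a rotationally symmetric Gaussian.
    """
    if Sigma is None:
        Sigma = np.eye(2)
    Sinv = np.linalg.inv(Sigma)
    # Quadratic form x^T Sigma^{-1} x, written out for a symmetric Sigma.
    quad = Sinv[0, 0] * x1**2 + 2.0 * Sinv[0, 1] * x1 * x2 + Sinv[1, 1] * x2**2
    norm = 2.0 * np.pi * s * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-quad / (2.0 * s)) / norm

# Sample the kernel on a grid and check that it integrates to about one.
step = 0.25
xs = np.arange(-30.0, 30.0, step)
X1, X2 = np.meshgrid(xs, xs, indexing="ij")
kernel = affine_gaussian(X1, X2, s=4.0)
print("approximate integral:", kernel.sum() * step**2)
```

The negative sign in the exponent and the inverse \(\varSigma ^{-1}\) are what make the kernel integrable, so a quick normalization check of this kind is a cheap way to catch sign errors in an implementation.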
An alternative model for time-causal temporal smoothing could be to instead use Koenderink’s scale-time kernels [45], which correspond to Gaussian smoothing on a logarithmically transformed temporal domain. For reasons described in detail in Lindeberg [77, Section 2.2], in particular the lack of a known time-recursive formulation for Koenderink’s scale-time kernels, which in turn implies a need for larger temporal buffers and more computational work for the temporal smoothing operation compared to using a time-recursive implementation of the time-causal limit kernel based on a set of recursive filters coupled in cascade [75, Section 6], we use the time-causal limit kernel for modelling the time-causal temporal smoothing operation in this work. As described in Lindeberg [75, Appendix 2], it is also possible to establish an approximate mapping between the parameters of the time-causal limit kernel and Koenderink’s scale-time kernel based on the requirement that the zero-, first- and second-order temporal moments of the kernels in the two families should be equal [75, Equation (161)], leading to qualitatively similar, although not identical, temporal receptive fields based on temporal derivatives of the time-causal scale-space kernels from the two families [75, Figure 11].
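To give a rough idea of what a time-recursive cascade of recursive filters can look like, the following Python sketch smooths a one-dimensional signal with K first-order recursive filters coupled in cascade. Several details are assumptions made for illustration and are not confirmed by the text above: the temporal scale levels are taken to be geometrically distributed, \(\tau _k = c^{2(k-K)} \tau _{\max }\), each filter is parameterized by a time constant \(\mu _k\) chosen so that its variance \(\mu _k^2 + \mu _k\) equals the scale increment \(\tau _k - \tau _{k-1}\), and a finite K only approximates a limit kernel. The function names are hypothetical.

```python
import numpy as np

def time_constants(tau_max, c=2.0, K=4):
    """Time constants mu_k for K cascaded first-order recursive filters.

    Assumes geometrically distributed scale levels tau_k = c**(2(k-K)) * tau_max
    and that filter k contributes variance mu_k**2 + mu_k = tau_k - tau_{k-1}.
    """
    k = np.arange(1, K + 1)
    tau = tau_max * c ** (2.0 * (k - K))
    dtau = np.diff(np.concatenate(([0.0], tau)))
    return (np.sqrt(1.0 + 4.0 * dtau) - 1.0) / 2.0

def causal_smooth(signal, mu):
    """Smooth a 1-D signal causally; only len(mu) state values are buffered."""
    state = np.zeros(len(mu))
    out = np.empty(len(signal))
    for t, x in enumerate(signal):
        inp = x
        for k in range(len(mu)):
            # First-order recursive update: uses the current input and the
            # previous output of this stage, never any future sample.
            state[k] += (inp - state[k]) / (1.0 + mu[k])
            inp = state[k]
        out[t] = inp
    return out

mu = time_constants(tau_max=16.0)
impulse = np.zeros(400)
impulse[50] = 1.0
response = causal_smooth(impulse, mu)
n = np.arange(400) - 50
mean_delay = (n * response).sum()
variance = ((n - mean_delay) ** 2 * response).sum()
print("mean delay:", mean_delay, "variance:", variance)
```

In this sketch the memory requirement is one state value per filter stage, regardless of the temporal scale, which is the practical meaning of the small temporal buffer referred to above. The response to the impulse is zero before the impulse arrives, and its mean offset gives a simple measure of the temporal delay at the chosen scale.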
While yet a third type of ad hoc model for time-causal smoothing could possibly also be formulated based on truncated and time-delayed Gaussian kernels, with the temporal delay determined such that the truncation effects in some sense could be regarded as sufficiently small, we will not develop such an approach here because such a model could be expected to: (i) lead to significantly longer temporal delays and (ii) require significantly larger temporal buffers and more computational work compared to our family of time-causal and time-recursive scale-space kernels. For time-critical applications, where the temporal response properties of the vision system need to be as fast as possible, it should in general be much better to base the temporal processing on an inherently time-causal temporal scale-space concept.
2.1 Scale-Normalized Spatio-Temporal Derivatives
2.2 Temporal Delays
3 General Spatio-Temporal Scale Selection Methodology
In this section, we will describe a general spatiotemporal scale selection methodology for simultaneous computation of local characteristic spatial and temporal scale estimates from video data, which for appropriate choices of spatiotemporal derivative expressions for feature detection may reflect the spatial extent and the temporal duration of the underlying spatiotemporal image structures that gave rise to the feature responses.
3.1 Homogeneous Spatio-Temporal Differential Expressions
An essential property of the definition of scale-normalized spatiotemporal derivative operators according to (13) is that they will lead to scale-covariant spatiotemporal image features, if the spatial smoothing is performed by a spatial Gaussian kernel (2) and if the temporal smoothing is performed with either a noncausal temporal Gaussian kernel (3) or the time-causal limit kernel (4), provided that the underlying spatiotemporal expression \(\mathcal{D}_{\mathrm{norm}} L\) used for defining the spatiotemporal features is covariant under independent scaling transformations of the spatial and temporal domains.
3.2 Transformation Property Under Independent Scaling Transformations of the Spatial and the Temporal Domains
The scaling property (27) of homogeneous polynomial spatiotemporal differential invariants also extends to homogeneous rational expressions of spatiotemporal derivatives, i.e., rational expressions formed by ratios of two homogeneous polynomials of the form (18).
3.3 General Scale-Covariant Property of the Spatio-Temporal Scale Estimates
3.4 General Scale-Covariant and Scale-Invariant Properties of Feature Responses at Local Extrema Over Spatio-Temporal Scales
For reasons that will be explained later in Sect. 4, there are, however, situations where it can be well motivated to use scale normalization powers not equal to one. Then, the important message is that the magnitude estimates are transformed by a power law and can be compensated for by post-normalization of the magnitude responses that also takes the actual spatiotemporal scale levels into account.
3.5 Spatio-Temporal Scale Selection for Homogeneous Spatio-Temporal Differential Invariants in Terms of Gauge Coordinates
As a consequence, the scale estimates will be guaranteed to be rotationally invariant in the sense that if the spatial domain is globally rotated in image space, then both the spatial and the temporal scale estimates will be rotated in the same way as the spatial image positions. A corresponding rotational invariance property of the spatiotemporal scale estimates does also hold for other types of spatiotemporal differential expressions of the form (18) that are additionally rotationally invariant.
What remains in this theory is to choose appropriate scale-normalized spatiotemporal derivative expressions \(\mathcal{D}_{\mathrm{norm}} L\) for different visual tasks and to tune the scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) to additional complementary requirements. In the next section, we will perform a detailed study of this for eight different spatiotemporal differential invariants with respect to the task of detecting spatiotemporal interest points.
4 Spatio-Temporal Scale Selection in Non-Causal Gaussian Spatio-Temporal Scale Space
In this section, we will perform a closed-form theoretical analysis of the spatial and the temporal scale selection properties that are obtained by detecting simultaneous local extrema over both spatial and temporal scales of different scale-normalized spatiotemporal differential expressions. We will specifically analyse: (i) how the spatial and temporal scale estimates \(\hat{s}\) and \(\hat{\tau }\) are related to the spatial extent \(s_0\) and the temporal duration \(\tau _0\) for different types of spatiotemporal model signals for which closed-form theoretical analysis is possible and (ii) how the resulting scale-normalized magnitude responses of the different differential entities at the selected spatiotemporal scales depend upon the spatial extent \(s_0\) and the temporal duration \(\tau _0\) of the underlying image structures as well as upon a complementary parameter q introduced to enable detection of spatiotemporal image features at finer temporal scales than the temporal scales at which they occur, to in turn enable shorter temporal delays when computing image features based on a time-causal spatiotemporal scale-space concept.
A main goal is to perform scale calibration, to determine suitable values of the spatial and temporal scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) for different types of spatiotemporal feature detectors, in such a way that the selected spatial and temporal scale levels reflect the spatial extent and the temporal duration of the original spatiotemporal image structures that gave rise to the feature response. The methodology we shall follow is to calculate scale-space representations in closed form for Gaussian-based spatiotemporal image patterns for which the noncausal spatiotemporal scale-space representation can be obtained from the semigroup property of the Gaussian kernel. Then, given that explicit expressions can be calculated for the scale-normalized spatiotemporal derivatives, we will solve for the local extrema of the spatiotemporal differential invariant \(\mathcal{D}_{\mathrm{norm}} L\) over spatiotemporal scales, to define equations that determine the scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) from the constraints that the spatiotemporal scale estimates should obey \(\hat{s} = s_0\) and \(\hat{\tau } = q^2 \, \tau _0\).
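The calibration procedure can be illustrated numerically with a small Python sketch for one case, the spatial Laplacian of the second-order temporal derivative applied to a Gaussian blink. The sketch rests on assumptions that should be checked against the explicit derivations: by the semigroup property, the scale-space representation of a blink with spatial extent \(s_0\) and temporal duration \(\tau _0\) is taken as \(g(x;\; s_0 + s) \, g(t;\; \tau _0 + \tau )\); the scale normalization is taken as multiplication by \(s^{\gamma _\mathrm{s}} \tau ^{\gamma _{\tau }}\); and constant factors are dropped, which gives a magnitude at the blink centre proportional to \(s^{\gamma _\mathrm{s}} \tau ^{\gamma _{\tau }} / ((s_0 + s)^2 (\tau _0 + \tau )^{3/2})\). Under these assumptions, setting the derivatives over scale to zero gives \(\gamma _\mathrm{s} = 1\) and \(\gamma _{\tau } = 3 q^2/(2 (1 + q^2))\). The function names are hypothetical.

```python
import numpy as np

def blink_response(s, tau, s0, tau0, gamma_s, gamma_tau):
    """Magnitude (up to a constant factor) of the scale-normalized spatial
    Laplacian of L_tt at the centre of a Gaussian blink, under the
    assumptions stated in the accompanying text."""
    return (s ** gamma_s) * (tau ** gamma_tau) / ((s0 + s) ** 2 * (tau0 + tau) ** 1.5)

def select_scales(s0, tau0, q=1.0, n=400):
    """Grid search for the spatiotemporal scale maximizing the response."""
    gamma_s = 1.0
    gamma_tau = 3.0 * q**2 / (2.0 * (1.0 + q**2))
    # Logarithmically spaced scale levels around the true values.
    s_grid = s0 * np.logspace(-2, 2, n)
    tau_grid = tau0 * np.logspace(-2, 2, n)
    S, T = np.meshgrid(s_grid, tau_grid, indexing="ij")
    R = blink_response(S, T, s0, tau0, gamma_s, gamma_tau)
    i, j = np.unravel_index(np.argmax(R), R.shape)
    return s_grid[i], tau_grid[j]

s_hat, tau_hat = select_scales(s0=4.0, tau0=9.0, q=0.5)
print("selected spatial scale:", s_hat, "target:", 4.0)
print("selected temporal scale:", tau_hat, "target:", 0.5**2 * 9.0)
```

The grid search recovers the targets only up to the grid resolution, so the selected scales should be compared with \(s_0\) and \(q^2 \tau _0\) within a few per cent rather than exactly.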
The spatial assumption \(\hat{s} = s_0\) is similar to the method for scale calibration in the spatial scale selection methodology [64, 65, 72] and corresponds to detecting the image structures at the same scale as that at which they appear, which should be optimal with regard to signal detection theory. Regarding the temporal assumption \(\hat{\tau } = q^2 \, \tau _0\), we do, however, also introduce a parameter \(q < 1\) to enforce temporal scale selection at finer temporal scales, to enable shorter temporal delays of the feature responses. As previously described in Sect. 2.2, for the time-causal scale-space representation the temporal delay can be expected to be proportional to the temporal scale in units of the standard deviation of the temporal smoothing kernel \(\delta \sim \sigma _{\tau } = \sqrt{\tau }\). A first-order prediction is therefore that a value of \(q < 1\) can be expected to reduce the temporal delay by the order of a corresponding factor, to enable an autonomous agent using these features as input to respond faster in a time-critical real-time situation.
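As a small worked illustration of this first-order prediction, write \(\delta (q)\) for the temporal delay obtained with calibration parameter q (a notation introduced here only for this example) and assume that the delay is exactly proportional to \(\sqrt{\tau }\) with a proportionality constant that does not depend on q. Then
$$\begin{aligned} \frac{\delta (q)}{\delta (1)} \approx \frac{\sqrt{q^2 \, \tau _0}}{\sqrt{\tau _0}} = q, \end{aligned}$$
so that, for example, \(q = 1/2\) would be predicted to roughly halve the temporal delay relative to \(q = 1\). This is an idealization; the delays actually obtained depend on the specific feature detector and on the discrete implementation.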
4.1 The Spatial Laplacian of the Second-Order Temporal Derivative
4.2 The Spatial Laplacian of the First-Order Temporal Derivative
4.3 The Determinant of the Spatial Hessian Matrix Applied to the Second-Order Temporal Derivative
4.4 The Determinant of the Spatial Hessian Matrix Applied to the First-Order Temporal Derivative
4.5 The Determinant of the Spatio-Temporal Hessian Matrix
4.6 The Second-Order Temporal Derivative of the Determinant of the Spatial Hessian Matrix
4.7 The First-Order Temporal Derivative of the Determinant of the Spatial Hessian Matrix
4.8 The Spatio-Temporal Laplacian
The underlying theoretical reason for this lack of spatial and temporal scale invariance is that the attempt to define a spatiotemporal Laplacian operator according to (132) is not covariant under independent rescaling transformations of the spatial and temporal domains. The spatial Laplacian of the first and secondorder temporal derivatives, the determinant of the Hessian of the first and secondorder temporal derivatives and the determinant of the spatiotemporal Hessian are on the other hand truly covariant under such independent relative scaling transformations of the spatial and temporal domains.
To illustrate the practical consequence of the lack of spatiotemporal scale covariance for a differential entity used for spatiotemporal scale selection, let us consider two different video cameras that are observing the same scene. Let us for simplicity assume that the sensors in the two video cameras have the same spatial resolution, whereas the temporal resolutions differ by say a factor of two. If we define a spatiotemporal Laplacian operator for each video domain based on the native coordinate system of each respective individual camera, then the spatiotemporal Laplacian operator in the first video domain will correspond to a spatiotemporal Laplacian operator in the second video domain that differs by a factor of two in the value of \(\varkappa \). Thus, if we perform spatiotemporal scale selection by detecting local extrema over spatiotemporal scales of the spatiotemporal Laplacian, we will detect extrema in effective spatiotemporal differential expressions that differ between the two video domains. Specifically, this implies that we cannot exactly interrelate the spatiotemporal Laplacian responses between the two domains in the way necessary to carry out a proof of scale invariance for general classes of spatiotemporal image structures. Although the scale estimates could for another form of scale normalization be computed for the specific spatiotemporal image model of a Gaussian blink [49], corresponding scale selection properties are then not guaranteed to generalize to more general spatiotemporal image structures beyond the specific subfamily of image structures for which the scale calibration was performed. 
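The camera example can be made concrete with a short calculation. This is a sketch under an assumption: since (132) is not reproduced in this section, the unnormalized spatiotemporal Laplacian is taken to have the form \(L_{xx} + L_{yy} + \varkappa ^2 L_{tt}\), and the scaling factors \(S_\mathrm{s}\) and \(S_{\tau }\) are notation introduced for this illustration. Under independent rescalings \(x' = S_\mathrm{s} \, x\), \(y' = S_\mathrm{s} \, y\) and \(t' = S_{\tau } \, t\) with \(L'(x', y', t') = L(x, y, t)\), the chain rule gives
$$\begin{aligned} L_{xx} + L_{yy} + \varkappa ^2 L_{tt} = S_\mathrm{s}^2 \left( L'_{x'x'} + L'_{y'y'} + \left( \frac{\varkappa \, S_{\tau }}{S_\mathrm{s}} \right) ^2 L'_{t't'} \right) , \end{aligned}$$
so that the same operator corresponds to \(\varkappa ' = \varkappa \, S_{\tau }/S_\mathrm{s}\) in the second domain. For \(S_\mathrm{s} = 1\) and \(S_{\tau } = 2\), this gives \(\varkappa ' = 2 \varkappa \), which is the factor of two referred to above. The relative weighting between the spatial and the temporal terms is thus not preserved unless \(S_{\tau } = S_\mathrm{s}\).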
Because of the covariance properties of the spatiotemporal differential invariants \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\), such interrelations can, however, be carried out for those differential operators between two video domains with undetermined relative scaling factors between the spatial and temporal domains. Consequently, these differential entities are much better suited for spatiotemporal scale selection than the attempt to define a spatiotemporal Laplacian operator.
Additionally, if one were to rank image features based on the corresponding scale-normalized magnitude measure \(\nabla _{(x, y, t),\mathrm{norm}}^2 L\), then the relative ranking of the image features could also differ between the domains of the two video cameras, whereas the relative ranking of image features is preserved for spatiotemporal scale selection based on the differential invariants \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\).
If a scale-normalized spatiotemporal Laplacian operator is to be used for spatiotemporal feature detection anyway, the scale normalization according to (133) should, however, lead to better experimental results than the scale normalization according to (141). The reason is that the partial derivatives with respect to the different dimensions of space and time in the scale-normalized differential expression (141) are not added in terms of dimensionless scale-normalized differential entities for the given values of a, b, c and d, whereas the partial derivatives with respect to space versus time are added in a dimensionless manner in the scale-normalized differential expression (133) if \(\gamma _\mathrm{s} = 1\) and \(\gamma _{\tau } = 1\) (corresponding to \(a = 1, b = 0, c = 0\) and \(d = 1\) in (141) for the specific choice of \(\varkappa = 1\)).
4.9 Scale Normalization Powers of Spatio-Temporal Interest Point Detectors
Table 1 Scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) as determined from scale calibration of the seven spatiotemporal interest point detectors \(\nabla _{(x,y) ,\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y) ,\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) that are guaranteed to lead to scale-covariant spatiotemporal scale estimates
\(\mathcal{D} L\)  \(\gamma _\mathrm{s}\)  \(\gamma _{\tau }\) 

\(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\)  1  \(\tfrac{q^2}{q^2 + 1}\) 
\(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt ,\mathrm{norm}}\)  1  \(\tfrac{3 q^2}{2(q^2 + 1)}\) 
\(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}\)  1  \(\tfrac{q^2}{q^2 + 1}\) 
\(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt ,\mathrm{norm}}\)  1  \(\tfrac{3 q^2}{2(q^2 + 1)}\) 
\(\det \mathcal{H}_{(x,y,t) ,\mathrm{norm}} L\)  \(\tfrac{5}{4}\)  \(\tfrac{5 q^2}{2(q^2 + 1)}\) 
\(\partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\)  1  \(\tfrac{q^2}{q^2 + 1}\) 
\(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\)  1  \(\tfrac{2 q^2}{q^2 + 1}\) 
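The entries of the table above can be evaluated for any value of q with a few lines of code. The following is a minimal sketch that only transcribes the tabulated expressions; the dictionary keys (`laplace_Lt`, `detH_xyt` and so on) and the function name are informal labels chosen for this illustration, not notation from the paper. Exact rational arithmetic is used so that the values can be compared directly with the fractions in the table.

```python
from fractions import Fraction

# Temporal scale normalization powers gamma_tau(q), transcribed from the table.
# The keys are informal labels for the seven detectors (assumed names).
GAMMA_TAU = {
    "laplace_Lt":  lambda q: q**2 / (q**2 + 1),
    "laplace_Ltt": lambda q: 3 * q**2 / (2 * (q**2 + 1)),
    "detH_Lt":     lambda q: q**2 / (q**2 + 1),
    "detH_Ltt":    lambda q: 3 * q**2 / (2 * (q**2 + 1)),
    "detH_xyt":    lambda q: 5 * q**2 / (2 * (q**2 + 1)),
    "dt_detH":     lambda q: q**2 / (q**2 + 1),
    "dtt_detH":    lambda q: 2 * q**2 / (q**2 + 1),
}

# Spatial scale normalization powers gamma_s: 1 for all detectors except the
# determinant of the spatiotemporal Hessian, for which the table lists 5/4.
GAMMA_S = {name: Fraction(1) for name in GAMMA_TAU}
GAMMA_S["detH_xyt"] = Fraction(5, 4)


def normalization_powers(q):
    """Return {detector: (gamma_s, gamma_tau)} for a given value of q."""
    q = Fraction(q)
    return {name: (GAMMA_S[name], f(q)) for name, f in GAMMA_TAU.items()}
```

Calling `normalization_powers(Fraction(3, 4))` gives the powers for the choice \(q = 3/4\) used in the experiments later in the paper; for \(q = 1\) the temporal powers reduce to 1/2, 3/4, 1/2, 3/4, 5/4, 1/2 and 1 in the order of the table rows.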
Table 2 Relations between magnitude thresholds for seven of the spatiotemporal interest point detectors studied in this paper in terms of a common local contrast parameter C
Magnitude thresholds for spatiotemporal interest operators  

\(\mathcal{D} L\)  For \(q = 1\)  For general q 
\(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\)  \(\frac{C}{4 \sqrt{\pi }}\)  \(\frac{C \, q^{\frac{q^2}{q^2+1}}}{2 \sqrt{2 \pi } \sqrt{q^2+1}}\) 
\(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\)  \(\frac{C}{4 \sqrt{2}}\)  \(\frac{C q^{\frac{3 q^2}{q^2+1}}}{2 \left( q^2+1\right) ^{3/2}}\) 
\(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}\)  \(\frac{C^2}{64 \pi }\)  \(\frac{C^2 q^{\frac{2 q^2}{q^2+1}}}{32 \pi q^2 +32 \pi }\) 
\(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\)  \(\frac{C^2}{128}\)  \(\frac{C^2 q^{\frac{6 q^2}{q^2+1}}}{16 \left( q^2+1\right) ^3}\) 
\(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\)  \(\frac{C^3}{128 \sqrt{2}}\)  \(\frac{C^3 \; q^{\frac{5 q^2}{q^2+1}}}{32 \left( q^2+1\right) ^{5/2}}\) 
\(\partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\)  \(\frac{C^2}{32 \sqrt{\pi }}\)  \(\frac{C^2 \, q^{\frac{q^2}{q^2+1}}}{16 \sqrt{2 \pi } \sqrt{1+q^2}}\) 
\(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\)  \(\frac{C^2}{32}\)  \(\frac{C^2 \, q^{\frac{4 q^2}{q^2+1}}}{8 \left( q^2+1\right) ^2}\) 
4.10 Relating Magnitude Thresholds Between Different Spatio-Temporal Feature Detectors
By considering the scale-normalized magnitude responses (55), (71), (82), (93), (104), (117) and (128) of the above scale-covariant spatiotemporal feature detectors and applying post-normalization of these entities to make the feature responses fully scale-invariant, we can express relations between their magnitude responses in terms of the contrast C of the spatiotemporal image pattern that gave rise to the feature response according to Table 2. These relations can in turn be used for expressing coarse relations between magnitude thresholds for the different types of spatiotemporal interest operators.
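To show how such relations could be used in practice, the following sketch evaluates the general-q expressions of Table 2 for a common contrast value C. It is only a transcription of the tabulated formulas; the function name and the dictionary keys are hypothetical labels introduced for this example.

```python
import math


def magnitude_thresholds(C, q=1.0):
    """Evaluate the general-q column of Table 2 for contrast C.

    The keys are informal labels for the seven detectors (assumed names).
    """
    a = q**2 / (q**2 + 1)  # recurring exponent q^2 / (q^2 + 1)
    b = q**2 + 1           # recurring factor q^2 + 1
    root_2pi = math.sqrt(2 * math.pi)
    return {
        "laplace_Lt":  C * q**a / (2 * root_2pi * math.sqrt(b)),
        "laplace_Ltt": C * q**(3 * a) / (2 * b**1.5),
        "detH_Lt":     C**2 * q**(2 * a) / (32 * math.pi * b),
        "detH_Ltt":    C**2 * q**(6 * a) / (16 * b**3),
        "detH_xyt":    C**3 * q**(5 * a) / (32 * b**2.5),
        "dt_detH":     C**2 * q**a / (16 * root_2pi * math.sqrt(b)),
        "dtt_detH":    C**2 * q**(4 * a) / (8 * b**2),
    }
```

For \(q = 1\) these expressions reduce to the closed forms in the middle column of Table 2, for example \(C/(4 \sqrt{\pi })\) for the spatial Laplacian of the first-order temporal derivative. Since the relations are described above as coarse, thresholds obtained in this way are best seen as starting values that may need adjustment for a specific data set.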
5 Spatio-Temporal Interest Points Detected as Spatio-Temporal Scale-Space Extrema Over Space–Time
In this section, we shall use the scalenormalized differential entities \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L), \partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\nabla _{(x, y, t),\mathrm{norm}}^2 L\) according to (42), (58), (74), (85), (96), (109), (120) and (133) for detecting spatiotemporal interest points. The overall idea of the most basic form of such an algorithm is to simultaneously detect both spatiotemporal points \((\hat{x}, \hat{y}, \hat{t})\) and spatiotemporal scales \((\hat{s}, \hat{\tau })\) at which the scalenormalized differential entity \((\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s, \tau )\) simultaneously assumes local extrema with respect to both space–time (x, y, t) and spatiotemporal scales \((s, \tau )\).
For the use case of offline analysis of pre-recorded video using a non-causal spatiotemporal scale-space representation, such a spatiotemporal scale-space extrema algorithm could be expressed as a straightforward generalization of the corresponding spatial scale-space extrema algorithm proposed in Lindeberg [65] and summarized in more compact form in “Appendix A”. The only major conceptual differences are that: (i) the image data should be expanded over both a spatial and a temporal scale parameter instead of just a spatial scale parameter and (ii) the local comparisons for detecting local extrema should be performed over a \(3 \times 3 \times 3 \times 3 \times 3\)-neighbourhood over \((x, y, t;\; s, \tau )\) instead of over a \(3 \times 3 \times 3\)-neighbourhood over \((x, y;\; s)\).
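The neighbourhood comparison in (ii) can be sketched as follows. This is a minimal illustration and not the implementation used for the experiments: the differential entity is assumed to be stored in a dictionary `D` indexed by integer 5-tuples (x, y, t, spatial scale index, temporal scale index), and the function name is hypothetical. The rule of accepting only positive maxima and negative minima follows the description of the extremum test given later in Sect. 5.1.

```python
from itertools import product


def is_scale_space_extremum(D, idx):
    """Classify the point idx in a 3x3x3x3x3 neighbourhood.

    D maps 5-tuples (x, y, t, s_index, tau_index) to values of the
    differential entity. idx must be an interior point, so that all
    3**5 - 1 = 242 neighbours are present in D.
    Returns +1 for a positive local maximum, -1 for a negative local
    minimum and 0 otherwise.
    """
    centre = D[idx]
    neighbours = [
        D[tuple(i + d for i, d in zip(idx, offset))]
        for offset in product((-1, 0, 1), repeat=5)
        if any(offset)  # skip the centre point itself
    ]
    if centre > 0 and all(centre > v for v in neighbours):
        return 1
    if centre < 0 and all(centre < v for v in neighbours):
        return -1
    return 0
```

A practical implementation would operate on dense arrays and would first discard points below the magnitude threshold, so that the 242 comparisons are only made for a small fraction of the points.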
A computational problem when expanding a video sequence over both spatial and temporal scales, however, is that the amount of data may become very large, if expanding the video data into the 5D spatiotemporal scalespace representation over the spatial domain (x, y), the temporal domain t and the spatiotemporal scale parameters \((s, \tau )\). For this reason, we shall instead consider a timerecursive implementation that steps forward over time t and only maintains a much more compact timerecursive memory of past information, as a 4D representation over the spatial image coordinates (x, y) and the spatiotemporal scale levels \((s, \tau )\) at each time moment t. Therefore, the timerecursive implementation avoids expanding the internal memory over the temporal dimension and does also directly apply to a timecausal situation in which the future cannot be accessed.
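A rough count of the number of stored samples illustrates the difference between the two representations. The sketch below is only an order-of-magnitude estimate under stated assumptions: the frame size and the number of frames are hypothetical, the numbers of scale levels are those used in the experiments in Sect. 5.3, and the time-recursive count assumes one current buffer plus `n_past` buffers of past frames per spatiotemporal scale level, ignoring any additional buffers needed for extremum detection and postfiltering.

```python
def num_samples_5d(width, height, frames, n_spatial, n_temporal):
    """Samples in a full 5D representation over (x, y, t; s, tau)."""
    return width * height * frames * n_spatial * n_temporal


def num_samples_recursive(width, height, n_spatial, n_temporal, n_past=2):
    """Samples in a time-recursive memory over (x, y; s, tau).

    Assumes one current buffer and n_past buffers of nearest past frames
    for each spatiotemporal scale level.
    """
    return width * height * n_spatial * n_temporal * (1 + n_past)


if __name__ == "__main__":
    # Hypothetical video: 320 x 240 pixels, 250 frames; 21 spatial and
    # 7 temporal scale levels.
    full = num_samples_5d(320, 240, 250, 21, 7)
    compact = num_samples_recursive(320, 240, 21, 7, n_past=2)
    print(full, compact, full / compact)
```

Under these assumptions the ratio between the two counts equals the number of frames divided by the number of buffers per scale level, so the saving grows linearly with the length of the video.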
5.1 Time-Causal and Time-Recursive Algorithm for Spatio-Temporal Scale-Space Extrema Detection
 1. Determine a set of logarithmically distributed temporal scale levels \(\tau _k\) and spatial scale levels \(s_l\) at which the algorithm is to operate by computing spatiotemporal scale-space representations at these spatiotemporal scales.
 2. Compute time constants \(\mu _k = (\sqrt{1 + 4 \, r^2 \, (\tau _k - \tau _{k-1})} - 1)/2\) according to Lindeberg [75, Equations (58) and (55)] for approximating the time-causal limit kernel by a finite number of recursive filters, where r denotes the frame rate and the temporal scale levels \(\tau _k\) are given in units of \([\text{seconds}]^2\).
 3. Expand the first image frame f(x, y, 0) into its purely spatial scale-space representation \(L(x, y, 0;\; s, \tau _0)\) over the spatial scale levels \(s_l\) at the finest temporal scale \(\tau _0\) using the semigroup property of the discrete analogue of the Gaussian kernel
$$\begin{aligned} L(\cdot , \cdot , 0;\; s_l, \tau _0) = T(\cdot , \cdot ;\; s_l - s_{l-1}) * L(\cdot , \cdot , 0;\; s_{l-1}, \tau _0) \end{aligned}$$(146)
with initial condition \(L(x, y, 0;\; s_0, \tau _0) = f(x, y, 0)\) at the finest spatial scale \(s_0\).
 4. For each temporal scale level \(\tau _k\), initiate a temporal buffer for temporal scale-space smoothing at this temporal scale using the purely spatial scale-space representation of the first frame as initial condition \(B(x, y, k, l) = L(x, y, 0;\; s_l, \tau _0)\).
 5. For each spatial and temporal scale level, initiate a small number of temporal buffers for the nearest past frames. (This number should be equal to the maximum order of temporal differentiation.)
 6. Loop forwards over time t (in units of time steps):
 (a) Given a new image frame f(x, y, t), expand this frame into its purely spatial scale-space representation \(L(x, y, t;\; s, \tau _0)\) at the finest temporal scale \(\tau _0\)
$$\begin{aligned} L(\cdot , \cdot , t;\; s_l, \tau _0) = T(\cdot , \cdot ;\; s_l - s_{l-1}) * L(\cdot , \cdot , t;\; s_{l-1}, \tau _0) \end{aligned}$$(147)
with initial condition \(L(x, y, t;\; s_0, \tau _0) = f(x, y, t)\) at the finest spatial scale \(s_0\).
 (b) Loop over the temporal scale levels k in ascending order:
 i. For each spatiotemporal scale level (k, l), perform temporal smoothing according to (with \(B(x, y, 0, l) = L(x, y, t;\; s_l, \tau _0)\))
$$\begin{aligned} B(x, y, k, l) := B(x, y, k, l) + \frac{1}{1 + \mu _k}(B(x, y, k-1, l) - B(x, y, k, l)). \end{aligned}$$(148)
 (c) For all temporal and spatial scales, compute temporal derivatives using backward differences over the buffers from past frames.
 (d) For all temporal and spatial scales, compute the scale-normalized differential entity \((\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s_l, \tau _k)\) at that spatiotemporal scale.
 (e) For all points and spatiotemporal scales \((x, y;\; s_l, \tau _k)\) for which the magnitude of the post-normalized differential entity is above a predefined threshold
$$\begin{aligned} (\mathcal{D}_{\mathrm{postnorm}} L)(x, y, t;\; s_l, \tau _k) \ge \theta , \end{aligned}$$(149)
and optionally, if using complementary thresholding [74], the sign of a complementary differential expression \(\bar{\mathcal{D}} L\) (see footnote 3) is additionally positive
$$\begin{aligned} (\bar{\mathcal{D}} L)(x, y, t;\; s_l, \tau _k) \ge 0, \end{aligned}$$(150)
determine if the point is either a positive maximum or a negative minimum in comparison with its nearest neighbours over space (x, y), time t, spatial scales \(s_l\) and temporal scales \(\tau _k\). Because the detection of local extrema over time requires a future reference in the temporal direction, this comparison is not done at the most recent frame but at the nearest past frame.
 i. For each detected scale-space extremum, compute more accurate estimates of its spatiotemporal position \((\hat{x}, \hat{y}, \hat{t})\) and spatiotemporal scale \((\hat{s}, \hat{\tau })\) using parabolic interpolation along each dimension according to Lindeberg [77, Equation (115)]. Do also compensate the magnitude estimates by a magnitude correction factor computed for each dimension.
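The temporal part of the algorithm (steps 2 and 6(b)) can be sketched for a single pixel and a single spatial scale level as follows. This is a minimal illustration under assumptions that are not stated in the text: the list `taus` contains the finest level \(\tau _0\) as its first element and is given in seconds squared, the value \(\tau _0 = 0\) in the usage example is a hypothetical choice, and the function names are introduced here. The update itself follows (148).

```python
import math


def time_constants(taus, frame_rate):
    """Time constants mu_k for k = 1..K from taus = [tau_0, ..., tau_K].

    Implements mu_k = (sqrt(1 + 4 r^2 (tau_k - tau_{k-1})) - 1) / 2 with
    the temporal scale levels in seconds^2 and r in frames per second.
    """
    return [
        (math.sqrt(1.0 + 4.0 * frame_rate**2 * (taus[k] - taus[k - 1])) - 1.0) / 2.0
        for k in range(1, len(taus))
    ]


def recursive_step(B, new_sample, mus):
    """Advance the temporal buffers B[0..K] of one pixel by one frame.

    B[0] holds the input at the finest temporal scale; the levels are
    updated in ascending order, so that level k is driven by the already
    updated level k - 1, as in the loop of step 6(b).
    """
    B[0] = new_sample
    for k, mu in enumerate(mus, start=1):
        B[k] += (B[k - 1] - B[k]) / (1.0 + mu)
    return B


if __name__ == "__main__":
    # Temporal standard deviations of 40 ms and 80 ms at 50 frames/s.
    mus = time_constants([0.0, 0.040**2, 0.080**2], frame_rate=50.0)
    B = [0.0, 0.0, 0.0]
    for sample in [0.0, 1.0, 0.0, 0.0, 0.0]:  # a unit impulse at the second frame
        print(recursive_step(B, sample, mus))
```

Each stage is a first-order recursive filter with unit DC gain whose impulse response has variance \(\mu _k^2 + \mu _k\) in units of squared frames, which is why the time constants are obtained by solving \(\mu _k^2 + \mu _k = r^2 \, (\tau _k - \tau _{k-1})\).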
Note specifically that by performing the spatial smoothing in the outer loop over spatiotemporal scales, the computationally more demanding spatial smoothing is performed only once for each spatial scale level, whereas the computationally more efficient temporal smoothing is performed in the inner loop over all combinations of spatial and temporal scales. The algorithm is also inherently parallel over spatiotemporal scale levels and lends itself to parallel implementation over a multicore architecture.
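The refinement of each detected extremum by parabolic interpolation can be sketched along one dimension as follows. The code uses the standard three-point parabola through the extremum and its two nearest neighbours; whether this coincides in every detail with Equation (115) in Lindeberg [77] is an assumption, as is the choice of coordinate along the scale dimensions (for instance an index over logarithmically distributed scale levels). The function name is hypothetical.

```python
def parabolic_refinement(f_minus, f_centre, f_plus):
    """Refine a discrete extremum using a parabola through three samples.

    The samples are taken at positions -1, 0 and +1 along one dimension.
    Returns (offset, value): the position of the parabola's extremum in
    units of the grid spacing and the interpolated extremum value.
    """
    curvature = f_minus - 2.0 * f_centre + f_plus
    if curvature == 0.0:
        # Degenerate (locally linear) case: keep the grid estimate.
        return 0.0, f_centre
    offset = (f_minus - f_plus) / (2.0 * curvature)
    value = f_centre - (f_plus - f_minus) ** 2 / (8.0 * curvature)
    return offset, value
```

Applying this separately along x, y, t and the two scale dimensions gives sub-sample estimates of position and scale. The ratio between the interpolated value and the centre value is one possible way of defining a per-dimension magnitude correction, although the correction factor referred to in the algorithm may be defined differently.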
5.2 Postfiltering of Spatio-Temporal Scale-Space Extrema

To postfilter spatiotemporal scale-space extrema with respect to the nearest finer temporal scale, we introduce buffers for keeping a short-term memory of purely temporal extrema of the scale-normalized differential expression \((\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s, \tau )\). If a point \((x, y, t;\; s, \tau )\) is a local maximum (minimum) over time t, then keep this point in the buffer of local maxima (minima) as long as the values monotonically decrease (increase) with time towards later time moments. When a point has been detected as a candidate for a spatiotemporal scale-space maximum (minimum), check if there are active buffers of local maxima (minima) in a local \(3 \times 3\)-neighbourhood over space at the nearest finer temporal scale. If there is such a maximum (minimum) having a higher (lower) value than the current spatiotemporal scale-space maximum (minimum), then the current point is not allowed to become a scale-space extremum.

To postfilter spatiotemporal scale-space extrema with respect to the nearest coarser temporal scale, we put a record of the spatiotemporal scale-space extremum in a \(3 \times 3\)-neighbourhood over space at the nearest coarser temporal scale. If the original point was a scale-space maximum (minimum), then the short-term memory is kept active as long as the scale-normalized differential expression \((\mathcal{D}_{\mathrm{norm}} L)(x, y, t;\; s, \tau )\) continues to increase (decrease) over time. If the scale-normalized magnitude additionally increases above the scale-normalized magnitude of the original candidate scale-space extremum, then the original candidate for a scale-space extremum is disregarded.
5.3 Experimental Results
Figures 7, 8, 9, 10 and 11 show the result of detecting spatiotemporal scalespace extrema in this way for three video sequences from the UCF101 dataset [104] and one video sequence from the KITTI dataset [26]. For these experiments, we used 21 spatial scale levels between \(\sigma _\mathrm{s} = 2\) and 21 pixels and 7 temporal scale levels between \(\sigma _{\tau } = 40~\text{ ms }\) and 2.56 s with seven additional prescales and distribution parameter \(c = 2\) for the timecausal limit kernel. To obtain comparable numbers of features from the different types of feature detectors, we adapted the thresholds on the scalenormalized differential invariants such that the average number of features from each feature detector was 50 features per frame for the kayaking video, a lower number of 30 features per frame for the videos of the table tennis player and the archer where the background is static, and a larger number of 200 features per frame for the driving scene, where the camera is moving relative to a cluttered environment.
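The scale levels used in these experiments can be reproduced approximately with the following sketch. For the temporal levels, 7 levels between 40 ms and 2.56 s correspond exactly to a ratio of 2 between neighbouring levels, in agreement with the distribution parameter \(c = 2\). For the spatial levels, a geometric distribution between 2 and 21 pixels is assumed here, based on the statement in Sect. 5.1 that the scale levels are logarithmically distributed; the exact spatial levels used by the authors are not given in this section. The function name is hypothetical.

```python
def geometric_levels(first, last, n):
    """Return n levels from first to last with a constant ratio between neighbours."""
    ratio = (last / first) ** (1.0 / (n - 1))
    return [first * ratio**i for i in range(n)]


# Temporal standard deviations in milliseconds: 40, 80, ..., 2560 (ratio 2).
sigma_tau_ms = geometric_levels(40.0, 2560.0, 7)

# Spatial standard deviations in pixels (assumed geometric, ratio about 1.12).
sigma_s_px = geometric_levels(2.0, 21.0, 21)

# Scale parameters as used by the smoothing kernels are the variances.
tau_levels = [(s / 1000.0) ** 2 for s in sigma_tau_ms]  # in seconds^2
s_levels = [s**2 for s in sigma_s_px]                   # in pixels^2
```

The seven additional prescales mentioned above are not modelled in this sketch.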
Figure 7 and the first row of Fig. 10 show results computed from the same video of a kayaker as used for the illustrations in Figs. 1, 2, 3 and 4. As can be seen from the results, all eight feature detectors respond to regions in the video sequence where there are strong variations in image intensity over space and time. There are, however, also some qualitative differences between the results from the different spatiotemporal interest point detectors. The LGNinspired feature detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\) and \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) respond both to the motion patterns of the paddler and to the spatiotemporal texture corresponding to the waves on the water surface that lead to temporal flickering effects and so do the operators \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}\) and \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\). The more corner detector inspired feature detectors \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) respond more to image features where there are simultaneously rapid variations over both of the spatial dimensions and the temporal dimension.
Figure 8 and the second row of Fig. 10 show corresponding results of detecting spatiotemporal scale-space extrema from a video sequence with a table tennis player. Here, we can note that the seven spatiotemporal interest point detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) all give rise to rather rich distributions of feature responses corresponding to the motion pattern of the player. (The responses on the left part of the table tennis table are caused by shadows cast by the player from the lamp in the ceiling.) The LGN-inspired feature detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\) and \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) both specifically generate responses when the ball flies over the net, and so do the determinant of the spatiotemporal Hessian as well as the first- and second-order temporal derivatives of the spatial Laplacian. The responses due to the spatiotemporal Laplacian are, however, less selective for specific motion events and include numerous responses from the almost static background. Taking into account also the theoretical limitations of the spatiotemporal Laplacian described in Sect. 4.8 as well as other limitations that will be described below, we conclude that this operator should not be considered a suitable feature detector for processing video data.
Figure 9 and the third row of Fig. 10 show the results of detecting corresponding spatiotemporal scalespace extrema from a video sequence with an archer. Here, we can note that the five spatiotemporal interest point detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}},\) \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}},\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) and \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) do in a corresponding manner give rise to rather rich distributions of feature responses corresponding to the motion pattern of the archer. For the determinant of the spatiotemporal Hessian, which operates like a threedimensional corner detector, there are, however, many more responses along the edges of the archer than for the other four feature detectors. The four feature detectors \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) based on first or secondorder temporal derivatives do all generate multiple responses when the arrow hits the cloth on the wall. The response of the determinant of the spatiotemporal Hessian is, however, delayed and not as strong as for the other competing spatiotemporal events in the scene.
Figure 11 shows results of applying these spatiotemporal scale extrema detection algorithms to a scene with a car driving along a road. Because image feature detection based on space–time separable spatiotemporal receptive fields is here applied to a scene where the camera is moving relative to the environment, static spatial image features in the world that move relative to the motion direction will here lead to spatiotemporal receptive field responses.
For the six basic spatiotemporal interest point detectors that constitute combinations of differential entities used for spatial interest point detection with temporal derivatives: (i)–(ii) the spatial Laplacian applied to the first- and second-order temporal derivatives, (iii)–(iv) the determinant of the spatial Hessian applied to the first- and second-order temporal derivatives and (v)–(vi) the first- and second-order temporal derivatives of the determinant of the spatial Hessian matrix, we can note that all these spatiotemporal interest point detectors lead to feature responses for the parked cars that have qualitative similarities to the responses from applying spatial interest point detectors to a static scene, with the additional constraint that there should also be relative motion between the camera and the environment. For (vii) the genuine 3D determinant of the spatiotemporal Hessian, the responses are on the other hand more selective, while for (viii) the spatiotemporal Laplacian, the responses are far less selective and less informative.
An alternative way of handling spatiotemporal scenes with dominant relative motions between the camera and the environment, in contrast to this use of space–time separable receptive fields for only image velocity \(v = 0\), is by exploiting the full structure of the spatiotemporal receptive field model (1), by considering spatiotemporal receptive fields with nonzero image velocities \(v \ne 0\), which can be locally adapted to the local motion direction corresponding to velocity adaptation [50, 51, 61] or alternatively performing local, regional or global image stabilization. Then, the image operations can be made truly covariant under local, regional or global Galilean image transformations [67, 71] and allow for a more explicit separation of spatiotemporal receptive field responses that correspond to more complex spatiotemporal image structures than local Galilean motions.
5.4 Covariance and Invariance Properties
From the theoretical scale selection properties of the spatial scalenormalized derivative operators according to the spatial scale selection theory in Lindeberg [65] in combination with the temporal scale selection properties of the temporal scale selection theory in Lindeberg [77] with the scale covariance of the underlying spatiotemporal derivative expressions \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}, \nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}, \det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}, \det \mathcal{H}_{(x,y,t),\mathrm{norm}} L, \partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) and \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) described in Lindeberg [75], it follows that these spatiotemporal interest point detectors are truly scale covariant under independent scaling transformations of the spatial and the temporal domains if the temporal smoothing is performed by either a noncausal Gaussian kernel \(g(t;\; \tau )\) over the temporal domain or the timecausal limit kernel \(\varPsi (t;\; \tau , c)\). From the general proof in Sect. 3, it follows that the selected spatiotemporal scale levels transform in a scalecovariant way under independent scaling transformations of the spatial and the temporal domains. Additionally, the postnormalized magnitude estimates from these seven spatiotemporal differential invariants are truly scale invariant.
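In concrete terms, these two properties can be written out as follows. This is a sketch in notation introduced here: the scaling factors \(S_\mathrm{s}\) and \(S_{\tau }\) do not appear in the text above, and the scale parameters s and \(\tau \) are taken to have the dimensions of variances, as implied by \(\sigma _{\tau } = \sqrt{\tau }\). If a video \(f'\) is related to a video f by \(x' = S_\mathrm{s} \, x\), \(y' = S_\mathrm{s} \, y\) and \(t' = S_{\tau } \, t\), then scale covariance means that a spatiotemporal scale-space extremum at \((\hat{x}, \hat{y}, \hat{t};\; \hat{s}, \hat{\tau })\) in the first domain corresponds to an extremum in the second domain at
$$\begin{aligned} (\hat{x}', \hat{y}', \hat{t}';\; \hat{s}', \hat{\tau }') = (S_\mathrm{s} \, \hat{x}, \, S_\mathrm{s} \, \hat{y}, \, S_{\tau } \, \hat{t};\; S_\mathrm{s}^2 \, \hat{s}, \, S_{\tau }^2 \, \hat{\tau }), \end{aligned}$$
while scale invariance of the post-normalized magnitude means that
$$\begin{aligned} (\mathcal{D}_{\mathrm{postnorm}} L')(\hat{x}', \hat{y}', \hat{t}';\; \hat{s}', \hat{\tau }') = (\mathcal{D}_{\mathrm{postnorm}} L)(\hat{x}, \hat{y}, \hat{t};\; \hat{s}, \hat{\tau }). \end{aligned}$$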
6 Quantifying the Accuracy of the Scale Estimates and the Amounts of Temporal Delays
The theoretical analysis of the scale selection properties of the different types of spatiotemporal interest point detectors presented in Sect. 4 was performed for a noncausal Gaussian spatiotemporal concept and using model signals based on Gaussian or integrated Gaussian intensity profiles over time. While it was conceptually shown in Lindeberg [77] that important scale selection properties in terms of temporal scaleinvariance transfer from a noncausal Gaussian temporal scalespace concept to the timecausal temporal scalespace concept based on the timecausal limit kernel, it is of interest to also quantify the numerical properties in terms of the spatiotemporal scale estimates and the temporal delays obtained from a truly timecausal scalespace concept and a timecausal implementation.
Numerical quantification of the spatiotemporal scale selection properties of four spatiotemporal interest point detectors when applied to model signals defined as timecausal Gaussian blinks of spatial extent \(\sigma _{s,0} = 8~\text{ pixels }\) and different temporal durations \(\sigma _{\tau ,0} = 40\), 80, 160, 320 and 640 ms
\(\sigma _{s,0}\)  \(\sigma _{\tau ,0}\)  \(\nabla _{(x,y)}^2 L_{tt}\)  \(\det \mathcal{H}_{(x,y)} L_{tt}\)  \(\det \mathcal{H}_{(x,y,t)} L\)
(pixels)  (ms)  \(\hat{\sigma }_\mathrm{s}\)  \(\hat{\sigma }_{\tau }\)  \(\delta \)  \(\hat{\sigma }_\mathrm{s}\)  \(\hat{\sigma }_{\tau }\)  \(\delta \)  \(\hat{\sigma }_\mathrm{s}\)  \(\hat{\sigma }_{\tau }\)  \(\delta \)
Scale selection for a timecausal Gaussian blink using \(q = 1\)  
8  40  7.99  37  6  7.99  37  6  7.99  42  60 
8  80  7.99  71  \(-5\)  7.99  73  \(-5\)  7.99  79  107
8  160  7.99  179  \(-18\)  7.99  173  \(-18\)  7.99  157  210
8  320  7.99  334  \(-36\)  7.99  330  \(-36\)  7.99  313  426
8  640  7.99  676  \(-64\)  7.99  663  \(-64\)  7.99  626  869
Scale selection for a timecausal Gaussian blink using \(q = 3/4\)  
8  40  7.99  36  3  7.99  34  6  7.99  33  42 
8  80  7.99  36  \(-27\)  7.99  48  58  7.99  48  60
8  160  7.99  117  \(-57\)  7.99  114  \(-56\)  7.99  105  109
8  320  7.99  223  \(-123\)  7.99  220  \(-123\)  7.99  204  213
8  640  7.99  439  \(-246\)  7.99  436  \(-246\)  7.99  418  433
\(\sigma _{s,0}\)  \(\sigma _{\tau ,0}\)  \(\partial _{tt} (\det \mathcal{H}_{(x,y)} L)\) for \(q = 1\)  \(\partial _{tt} (\det \mathcal{H}_{(x,y)} L)\) for \(q = 3/4\)
(pixels)  (ms)  \(\hat{\sigma }_\mathrm{s}\)  \(\hat{\sigma }_{\tau }\)  \(\delta \)  \(\hat{\sigma }_\mathrm{s}\)  \(\hat{\sigma }_{\tau }\)  \(\delta \)
Scale selection for a time-causal Gaussian blink using \(q = 1\) (left) or \(q = 3/4\) (right)
8  40  7.99  37  67  7.99  29  48  
8  80  7.99  73  116  7.99  51  69  
8  160  7.99  152  222  7.99  95  119  
8  320  7.99  298  445  7.99  194  229  
8  640  7.99  596  901  7.99  392  460 
Numerical quantification of the spatiotemporal scale selection properties of three spatiotemporal interest point detectors when applied to model signals defined as timecausal Gaussian onset blobs of spatial extent \(\sigma _{s,0} = 8~\text{ pixels }\) and different temporal durations \(\sigma _{\tau ,0} = 40\), 80, 160, 320 and 640 ms
\(\sigma _{s,0}\)  \(\sigma _{\tau ,0}\)  \(\nabla _{(x,y)}^2 L_t\)  \(\det \mathcal{H}_{(x,y)} L_t\)  \(\partial _t (\det \mathcal{H}_{(x,y)} L)\)
(pixels)  (ms)  \(\hat{\sigma }_\mathrm{s}\)  \(\hat{\sigma }_{\tau }\)  \(\delta \)  \(\hat{\sigma }_\mathrm{s}\)  \(\hat{\sigma }_{\tau }\)  \(\delta \)  \(\hat{\sigma }_\mathrm{s}\)  \(\hat{\sigma }_{\tau }\)  \(\delta \)
Scale selection for a timecausal Gaussian onset blob using \(q = 1\)  
8  40  7.99  43  37  7.99  43  57  7.99  36  87 
8  80  7.99  74  116  7.99  75  116  7.99  72  179 
8  160  7.99  150  240  7.99  152  240  7.99  151  370 
8  320  7.99  311  498  7.99  313  498  7.99  302  762 
8  640  7.99  616  1023  7.99  620  1023  7.99  605  1557 
Scale selection for a timecausal Gaussian onset blob using \(q = 3/4\)  
8  40  7.99  32  34  7.99  30  34  7.99  35  58 
8  80  7.99  56  65  7.99  56  65  7.99  50  113 
8  160  7.99  106  130  7.99  106  130  7.99  103  228 
8  320  7.99  207  267  7.99  208  267  7.99  201  469 
8  640  7.99  421  552  7.99  422  552  7.99  406  961 
6.1 Time-Causal Gaussian Blink
To quantify the transfer of the spatiotemporal scale selection properties to a timecausal spatiotemporal domain, we first generated a set of videos with timecausal Gaussian blinks obtained by filtering a discrete delta function with a discrete Gaussian kernel over the spatial domain and a discrete approximation of the timecausal limit kernel over the temporal domain. Such videos sequences were generated with spatial extent \(\sigma _{s,0} = 8~\text{ pixels }\) and temporal durations of \(\sigma _{\tau ,0} = 40\), 80, 160, 320 and 640 ms at a frame rate of 50 frames/s and for distribution parameter \(c = 2\) of the timecausal limit kernel. The reason for not varying the spatial scale parameter in this experiment is that the properties of the spatial scale selection mechanism have already been sufficiently well established and tested.
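The temporal part of this construction can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments. It assumes that the time-causal limit kernel is approximated by a cascade of first-order recursive filters whose accumulated variances follow a geometric series with ratio \(c^2\), and that a filter with time constant \(\mu\) contributes a variance of \(\mu^2 + \mu\) in units of squared frames. The function names, the number of levels (8) and the truncation length are hypothetical choices made for the example.

```python
import math


def time_constants(tau, c=2.0, num_levels=8):
    """Time constants mu_k (in frames) for a cascade of first-order recursive filters.

    Assumes temporal scale levels tau_k = c**(2*(k - K)) * tau for k = 1..K, and
    that a filter with time constant mu adds variance mu**2 + mu (frames**2).
    """
    levels = [c ** (2 * (k - num_levels)) * tau for k in range(1, num_levels + 1)]
    increments = [levels[0]] + [b - a for a, b in zip(levels, levels[1:])]
    # Solve mu**2 + mu = increment for the positive root.
    return [(math.sqrt(1.0 + 4.0 * d) - 1.0) / 2.0 for d in increments]


def recursive_filter(signal, mu):
    """First-order recursive filter: out[t] = out[t-1] + (in[t] - out[t-1]) / (1 + mu)."""
    out, prev = [], 0.0
    for x in signal:
        prev += (x - prev) / (1.0 + mu)
        out.append(prev)
    return out


def temporal_blink_profile(sigma_ms, frame_rate=50.0, c=2.0, num_levels=8):
    """Temporal profile of a blink: a discrete delta passed through the cascade."""
    sigma_frames = sigma_ms * frame_rate / 1000.0
    length = int(40 * sigma_frames) + 50  # hypothetical truncation length
    signal = [1.0] + [0.0] * (length - 1)
    for mu in time_constants(sigma_frames ** 2, c, num_levels):
        signal = recursive_filter(signal, mu)
    return signal


def mean_and_variance(kernel):
    """Mean (frames) and variance (frames**2) of a non-negative kernel."""
    mass = sum(kernel)
    mean = sum(n * v for n, v in enumerate(kernel)) / mass
    variance = sum((n - mean) ** 2 * v for n, v in enumerate(kernel)) / mass
    return mean, variance


if __name__ == "__main__":
    for sigma_ms in (40, 80, 160, 320, 640):
        _, variance = mean_and_variance(temporal_blink_profile(sigma_ms))
        # Standard deviation converted back to milliseconds (20 ms per frame).
        print(sigma_ms, round(math.sqrt(variance) * 20.0, 2))
```

Under these assumptions the variance of the resulting temporal profile equals the requested \(\sigma _{\tau ,0}^2\) by construction, since the variance increments of the individual filters sum to the variance of the finest-to-coarsest level chain. A full blink video would additionally require spatial smoothing with the discrete analogue of the Gaussian kernel, which is omitted here.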
Then, we detected spatiotemporal scale-space extrema of: (i) the spatial Laplacian of the second-order temporal derivative \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\), (ii) the determinant of the spatial Hessian of the second-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\), (iii) the determinant of the spatiotemporal Hessian \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) and (iv) the second-order temporal derivative of the determinant of the spatial Hessian \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) for each one of these videos, and recorded (i) the selected spatial scale \(\hat{\sigma }_\mathrm{s}\) in units of pixels, (ii) the selected temporal scale \(\hat{\sigma }_{\tau }\) in units of milliseconds and (iii) a measure of the effective temporal delay \(\delta = \hat{t} - t_{\mathrm{max}}\), defined as the time difference between the time moment \(\hat{t}\) at which the spatiotemporal scale-space extremum is detected and the time moment \(t_{\mathrm{max}}\) at which the spatiotemporal maximum in the input function occurred. The motivation for the latter choice is that, because of the time-causal model, each spatiotemporal pattern is associated with an inherent temporal delay. By compensating for this delay, the intention is that the compensated delay score should better reflect the additional temporal delay caused by the time-causal feature detection method.
The results of these experiments are given in Table 3 for two different settings of the temporal scale calibration parameter q. Note that (i) the spatial scale estimates \(\hat{\sigma }_\mathrm{s}\) are highly accurate and that (ii) when using \(q = 1\) the temporal scale estimates \(\hat{\sigma }_{\tau }\) also give good estimates of the temporal duration of the underlying spatiotemporal image structures, considering the coarse sampling of the temporal scale levels induced by a distribution parameter of \(c = 2\), which means that the ratio between adjacent temporal scale levels is equal to two in units of dimension \([\text{time}]\) and which in turn limits the effective resolution of the temporal scale estimates. Additionally, the implementation differs from the presented scale selection theory in the following respects: (i) the theoretical analysis has been performed based on the noncausal Gaussian temporal scale-space model, whereas the experiments are performed using the time-causal scale-space model, (ii) the spatiotemporal scale selection theory is continuous, whereas the discrete implementation is based on the discrete analogue of the Gaussian kernel [56] over space and recursive filters over time and (iii) for shorter temporal scales, the temporal scales of the model signals are close to the inner temporal scale in the video, determined by the frame rate of 50 fps corresponding to 20 ms between adjacent frames, implying that the temporal discretization effects become stronger at shorter temporal scales.
For this family of model signals, the spatial Laplacian of the second-order temporal derivative \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) and the determinant of the Hessian of the second-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) respond very fast to the onset of a spatiotemporal Gaussian blob when using \(q = 1\). For the determinant of the spatiotemporal Hessian \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) and the second-order temporal derivative of the determinant of the spatial Hessian \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\), the temporal delays are, however, substantial when using \(q = 1\). By instead setting the temporal scale calibration parameter q to a lower value of \(q = 3/4\), the effective temporal delays can be substantially reduced, in many cases by up to nearly 50%, for the determinant of the spatiotemporal Hessian \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) and the second-order temporal derivative of the determinant of the spatial Hessian \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\), at the cost of less accurate but still not completely unreasonable estimates of the temporal duration of the underlying spatiotemporal image structures.
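As a small worked example of the reduction in temporal delay, the sketch below computes the relative reduction from the delays tabulated for the second-order temporal derivative of the determinant of the spatial Hessian in the time-causal blink experiment. It assumes that the first group of columns in that part of the table corresponds to \(q = 1\) and the second group to \(q = 3/4\); the variable and function names are hypothetical.

```python
# Effective temporal delays (ms) for the operator d_tt(det H L) on time-causal
# Gaussian blinks, read from the table (column-to-q assignment is an assumption).
durations_ms = [40, 80, 160, 320, 640]
delay_q1 = [67, 116, 222, 445, 901]    # assumed q = 1
delay_q34 = [48, 69, 119, 229, 460]    # assumed q = 3/4


def relative_reduction(before, after):
    """Fraction by which each delay decreases when going from `before` to `after`."""
    return [1.0 - a / b for b, a in zip(before, after)]


reductions = relative_reduction(delay_q1, delay_q34)

if __name__ == "__main__":
    for duration, reduction in zip(durations_ms, reductions):
        print(f"sigma_tau0 = {duration:3d} ms: delay reduced by {100 * reduction:.0f}%")
```

With these numbers the reduction grows from about 28% for the shortest blink to about 49% for the longest one.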
A general conclusion that we can draw from these experiments is that the operators \(\nabla _{(x,y),\mathrm{norm}}^2 L_{tt,\mathrm{norm}}\) and \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{tt,\mathrm{norm}}\) that operate directly on temporal derivatives respond significantly faster than the operator \(\det \mathcal{H}_{(x,y,t),\mathrm{norm}} L\) that operates on the joint space–time structure and the operator \(\partial _{tt,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) that operates on temporal derivatives of a nonlinear spatial differential invariant.
6.2 Time-Causal Gaussian Onset Blob
To quantify the transfer of the spatiotemporal scale selection properties for another class of model signals, we then generated a set of videos with time-causal Gaussian onset blobs obtained by filtering the tensor product of a discrete delta function over the spatial domain and a discrete Heaviside function over the temporal domain with a discrete Gaussian kernel over the spatial domain and a discrete approximation of the time-causal limit kernel over the temporal domain. Such video sequences were generated with spatial extent \(\sigma _{s,0} = 8~\text{pixels}\) and temporal durations of \(\sigma _{\tau ,0} = 40\), 80, 160, 320 and 640 ms at a frame rate of 50 frames/s and for distribution parameter \(c = 2\) of the time-causal limit kernel.
Then, we detected spatiotemporal scale-space extrema of: (i) the spatial Laplacian of the first-order temporal derivative \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\), (ii) the determinant of the spatial Hessian of the first-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}\) and (iii) the first-order temporal derivative of the determinant of the spatial Hessian \(\partial _{t,\mathrm{norm}} (\det \mathcal{H}_{(x,y),\mathrm{norm}} L)\) for each one of these videos, and recorded (i) the selected spatial scale \(\hat{\sigma }_\mathrm{s}\) in units of pixels, (ii) the selected temporal scale \(\hat{\sigma }_{\tau }\) in units of milliseconds and (iii) a measure of the effective temporal delay \(\delta = \hat{t} - t_{\mathrm{max}}\), defined as the time difference between the time moment \(\hat{t}\) at which the spatiotemporal scale-space extremum is detected and the time moment \(t_{\mathrm{max}}\) at which the spatiotemporal maximum of the spatiotemporal scale-space kernel at the same spatiotemporal scale \(\sigma _{\tau ,0}\) occurs.
The results of these experiments are given in Table 4 for two different settings of the temporal scale calibration parameter q. Note that (i) again the spatial scale estimates \(\hat{\sigma }_\mathrm{s}\) are highly accurate and that (ii) when using \(q = 1\) the temporal scale estimates \(\hat{\sigma }_{\tau }\) also give good estimates of the temporal duration of the underlying spatiotemporal signals, again considering the coarse sampling of the temporal scale levels resulting from the distribution parameter \(c = 2\) of the time-causal limit kernel, which in turn means that the ratio between adjacent temporal scale levels is equal to two in units of dimension \([\text{time}]\) and which again limits the effective resolution of the temporal scale estimates. For this problem of onset detection, the temporal delays are, however, longer than for the previous problem of detecting blinks. By instead setting the temporal scale calibration parameter q to a lower value of \(q = 3/4\), the effective temporal delay can be substantially reduced, in some cases by up to nearly 50%, for the spatial Laplacian of the first-order temporal derivative \(\nabla _{(x,y),\mathrm{norm}}^2 L_{t,\mathrm{norm}}\) and the determinant of the spatial Hessian of the first-order temporal derivative \(\det \mathcal{H}_{(x,y),\mathrm{norm}} L_{t,\mathrm{norm}}\), at the cost of less accurate but still not completely unreasonable estimates of the temporal duration of the underlying spatiotemporal image structures.
7 Summary and Discussion
We have presented a general theory and methodology for performing simultaneous detection of local characteristic spatial and temporal scale estimates in video data. The theory comprises both (i) feature detection performed within a noncausal spatiotemporal scale-space representation computed for offline analysis of prerecorded video data and (ii) feature detection performed from real-time image streams where the future cannot be accessed and memory requirements call for time-recursive algorithms based on only compact buffers of what has occurred in the past.
As a theoretical foundation for spatiotemporal scale selection, we have stated general sufficiency results regarding scale-covariant spatiotemporal scale estimates and complementary invariance properties of spatiotemporal features defined from video data in which there may be independent scaling transformations of the spatial and the temporal domains. For a wide class of homogeneous spatiotemporal differential expressions, the spatiotemporal scale estimates obtained from the presented theory and methodology have been shown to obey the basic property that they adaptively follow independent local spatial and temporal scaling transformations in the video data, which constitutes a basic requirement on a spatiotemporal scale selection mechanism. In other words, if the spatial size of the image structures changes by a factor \(S_\mathrm{s}\) in the spatial domain and/or the temporal duration of the spatiotemporal image structures changes by a factor \(S_{\tau }\), then the spatial scale parameter in units of \(\sigma _\mathrm{s} = \sqrt{s}\) and the temporal scale parameter in units of \(\sigma _{\tau } = \sqrt{\tau }\) of the detected spatiotemporal image features will change by corresponding factors. Additionally, we have shown that the magnitude estimates either are automatically invariant under spatiotemporal scaling transformations or can be compensated to become so by post-normalization, depending on the specific values of the scale normalization parameters \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\). These properties together imply that the presented theory and methodology obey the necessary properties to handle video data in which there may be large spatial and temporal scaling variations in the spatiotemporal image structures.
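The covariance property can be illustrated numerically in a simplified, purely temporal and noncausal setting. The sketch below assumes a Gaussian temporal blink of variance \(\tau _0\), for which the second-order temporal derivative at the blink centre has magnitude proportional to \((\tau _0 + \tau )^{-3/2}\) at temporal scale \(\tau\), and a scale normalization factor \(\tau ^{\gamma }\) with \(\gamma = 3/4\). These modelling choices, the fixed search grid and all function names are assumptions made for illustration and are not taken from this section.

```python
import math


def normalized_response(tau, tau0, gamma=0.75):
    """Magnitude of tau**gamma * L_tt at the centre of a Gaussian blink of variance tau0."""
    return tau ** gamma / (math.sqrt(2.0 * math.pi) * (tau0 + tau) ** 1.5)


# Fixed geometric grid of candidate temporal scales, independent of the signal.
SCALE_GRID = [2.0 ** (i / 200.0) for i in range(-800, 2801)]


def select_temporal_scale(tau0, gamma=0.75):
    """Scale on the grid at which the scale-normalized response is maximal."""
    return max(SCALE_GRID, key=lambda tau: normalized_response(tau, tau0, gamma))


if __name__ == "__main__":
    tau_a = select_temporal_scale(4.0)    # blink with sigma = 2 time units
    tau_b = select_temporal_scale(16.0)   # same blink stretched by S_tau = 2
    print(math.sqrt(tau_a), math.sqrt(tau_b), math.sqrt(tau_b / tau_a))
```

In this simplified model, stretching the blink by a factor \(S_{\tau } = 2\) moves the selected scale, measured in units of \(\sigma _{\tau } = \sqrt{\tau }\), by the same factor.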
For seven specific spatiotemporal differential invariants: (i)–(ii) the spatial Laplacian of the first- and second-order temporal derivatives, (iii)–(iv) the determinant of the spatial Hessian of the first- and second-order temporal derivatives, (v) the determinant of the spatiotemporal Hessian matrix and (vi)–(vii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian, we have performed an in-depth analysis of their theoretical scale selection properties and shown how scale calibration can be performed to determine the spatial and temporal scale normalization powers \(\gamma _\mathrm{s}\) and \(\gamma _{\tau }\) such that the selected spatiotemporal scale levels reflect the spatial extent and the temporal duration of the underlying spatiotemporal features that gave rise to the feature responses. These spatiotemporal differential invariants can all be used for formulating spatiotemporal interest point detectors. Theoretically and experimentally, we have described and illustrated their properties and shown that they lead to intuitively reasonable results.
For one spatiotemporal differential expression, an attempt to define a spatiotemporal Laplacian, we have on the other hand shown that this differential expression is not scale covariant under independent rescalings of the spatial and temporal domains, which explains a previously noted poor robustness of the scale selection step in the spatiotemporal interest point detector based on the spatiotemporal Harris operator [49].
Whereas the presented spatiotemporal scale selection theory is fully continuous over space and time, we have by quantitative experiments on model signals with ground truth shown that the numerical accuracy of the spatiotemporal scale estimates carries over to a carefully designed discrete implementation, based on the discrete analogue of the Gaussian over space and a cascade of first-order recursive filters over time.
To allow for different trade-offs between the temporal response properties of time-causal spatiotemporal feature detection (shorter temporal delays) and signal detection theory, which would call for detection of image structures at the same spatial and temporal scales as they occur, we have specifically introduced a parameter q to regulate the temporal scale calibration to finer temporal scales \(\hat{\tau } = q^2 \, \tau _0\), as opposed to the more common choice \(\hat{s} = s_0\) over the spatial domain. The presented theoretical analysis of scale selection properties in noncausal spatiotemporal scale space predicts that this parameter should reduce the temporal delay by a factor of q: \(\Delta t \mapsto q \, \Delta t\). Our numerical experiments with scale selection properties in time-causal spatiotemporal scale space confirm that a substantial decrease in temporal delay is obtained. The specific choice of the parameter q should be optimized with respect to the task that the spatiotemporal selection and the spatiotemporal features are to be used for and given specific requirements of the application domain.
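To relate the noncausal prediction to the tabulated time-causal measurements, the following sketch compares the ratios of selected temporal scales and of effective temporal delays between \(q = 3/4\) and \(q = 1\) with the value of q itself. It uses the values tabulated for the spatial Laplacian of the first-order temporal derivative in the onset blob experiment; the variable names are hypothetical, and the comparison is only indicative, since the prediction is derived for the noncausal model whereas the measurements come from the time-causal model and use a compensated delay measure.

```python
Q = 0.75  # temporal scale calibration parameter

# Tabulated values (ms) for the spatial Laplacian of L_t on onset blobs.
sigma_tau0 = [40, 80, 160, 320, 640]
scale_q1 = [43, 74, 150, 311, 616]     # selected temporal scale, q = 1
delay_q1 = [37, 116, 240, 498, 1023]   # effective temporal delay, q = 1
scale_q34 = [32, 56, 106, 207, 421]    # selected temporal scale, q = 3/4
delay_q34 = [34, 65, 130, 267, 552]    # effective temporal delay, q = 3/4

# Ratios between the q = 3/4 and q = 1 settings, per temporal duration.
scale_ratios = [b / a for a, b in zip(scale_q1, scale_q34)]
delay_ratios = [b / a for a, b in zip(delay_q1, delay_q34)]

if __name__ == "__main__":
    print(f"q = {Q}")
    for s0, rs, rd in zip(sigma_tau0, scale_ratios, delay_ratios):
        print(f"sigma_tau0 = {s0:3d} ms: scale ratio {rs:.2f}, delay ratio {rd:.2f}")
```

For these measurements the scale ratios lie between roughly 0.67 and 0.76, while the delay ratios for all but the shortest duration lie near 0.54, which indicates a reduction in delay that is at least as large as the noncausal prediction for the longer durations.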
We have also presented an explicit algorithm for detecting spatiotemporal interest points in a time-causal and time-recursive context, in which the future cannot be accessed and memory requirements call for only compact buffers to store partial records of what has occurred in the past, and presented experimental results of applying this algorithm to real-world video data for the different types of spatiotemporal interest point detectors that we have studied theoretically.
Experimentally, we have shown that four of the presented spatiotemporal interest operators: (i)–(ii) the spatial Laplacian of the first- and second-order temporal derivatives and (iii)–(iv) the determinant of the Hessian of the first- and second-order temporal derivatives, lead to significantly shorter temporal delays than (v) the determinant of the spatiotemporal Hessian matrix or (vi)–(vii) the first- and second-order temporal derivatives of the determinant of the spatial Hessian.
While the experimental results in this paper have been presented solely based on a time-causal and time-recursive spatiotemporal concept, the overall methodology can also be implemented based on a noncausal Gaussian spatiotemporal scale-space concept [67]. Such an implementation would, however, require more computations and larger temporal buffers compared to using the time-causal and time-recursive receptive fields based on first-order integrators coupled in cascade that constitute the temporal smoothing model underlying the implementation reported in this work. Additionally, an ad hoc use of time-delayed truncated Gaussian kernels instead would be expected to lead to less rapid temporal responses for time-critical applications compared to the truly time-causal scale-space kernels used for the experiments in this work. For offline analysis of prerecorded data on an architecture where computational and memory resources do not constitute a bottleneck, such a noncausal implementation would on the other hand have the potential of computing more accurate image features, since the method could then also make use of information from the future in relation to any prerecorded time moment, which is not permitted for these time-causal operations.
We propose that the spatiotemporal scale selection mechanism presented in this paper should be far more general than the more specific applications developed here for detecting spatiotemporal interest points. Concerning extensions of the approach, a first natural extension concerns extending the sparse spatiotemporal scale selection into dense spatiotemporal scale selection, which is addressed in a companion paper [76]. A second natural extension is to extend the current use of a space–time separable spatiotemporal scale-space representation based on spatiotemporal receptive fields (1) with image velocity zero to incorporate mechanisms for velocity-adapted spatiotemporal receptive fields with nonzero image velocities and/or image stabilization.
Footnotes
 1.
When computing estimates of the temporal delay of the time-causal spatiotemporal scale-space kernel in an actual discrete implementation, we do not, however, make use of the approximate expression (16). Instead, for each temporal scale level we compute the temporal maximum point of the discrete time-causal scale-space kernel that approximates the continuous time-causal kernel, and then add half a time step \(\Delta t/2\) for each order of temporal differentiation as implemented in terms of backward difference operators over time \(\partial _{t^n} L \approx \delta _t^n L/(\Delta t)^n\), where \(\Delta t\) denotes the temporal time step between successive frames.
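The procedure described in this footnote can be sketched as follows, assuming that the discrete time-causal kernel is available as the impulse response of first-order recursive filters coupled in cascade. The function names and the time constants used in the demonstration are hypothetical.

```python
def cascade_impulse_response(time_constants, length):
    """Impulse response of first-order recursive filters coupled in cascade."""
    signal = [1.0] + [0.0] * (length - 1)
    for mu in time_constants:
        out, prev = [], 0.0
        for x in signal:
            # out[t] = out[t-1] + (in[t] - out[t-1]) / (1 + mu)
            prev += (x - prev) / (1.0 + mu)
            out.append(prev)
        signal = out
    return signal


def temporal_delay_estimate(kernel, derivative_order, time_step):
    """Peak position of the discrete kernel plus time_step / 2 per derivative order."""
    peak_index = max(range(len(kernel)), key=kernel.__getitem__)
    return peak_index * time_step + derivative_order * time_step / 2.0


if __name__ == "__main__":
    # Hypothetical time constants (in frames) and a 20 ms frame interval.
    kernel = cascade_impulse_response([0.5, 1.0, 2.0, 4.0], length=200)
    for order in (0, 1, 2):
        print(order, temporal_delay_estimate(kernel, order, time_step=0.02))
```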
 2.
This notation is intended to reflect the fact that a set of multiple spatiotemporal scale estimates \( (\hat{s}, \hat{\tau })\) may be obtained at any point (x, y, t) in space–time, corresponding to qualitatively different types of spatiotemporal image structures at different spatiotemporal scales.
 3.
For example, if performing spatiotemporal interest point detection using the spatial Laplacian operator \(\nabla ^2 L\) applied to either of the first- or the second-order temporal derivatives \(L_t\) or \(L_{tt}\), complementary thresholding can be performed by applying the unsigned Hessian feature strength measure \(\mathcal{D}_1 L = L_{xx} L_{yy} - L_{xy}^2 - k \, (L_{xx} + L_{yy})^2\) [74] to either the first- or the second-order temporal derivatives, respectively, for \(k \in [0, 1/4[\) with preferred choice of \(k \in [0.04, 0.10]\), to suppress multiple responses along elongated image structures over the spatial domain. This implies that complementary thresholding for these Laplacian-based spatiotemporal interest operators should be performed based on \(\mathcal{D}_1 L_t = L_{xxt} L_{yyt} - L_{xyt}^2 - k \, (L_{xxt} + L_{yyt})^2 > 0\) or \(\mathcal{D}_1 L_{tt} = L_{xxtt} L_{yytt} - L_{xytt}^2 - k \, (L_{xxtt} + L_{yytt})^2 > 0\).
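A minimal sketch of this complementary thresholding is given below. It takes the three second-order spatial derivatives of the chosen temporal derivative (\(L_t\) or \(L_{tt}\)) at a candidate point and evaluates the determinant of the spatial Hessian minus k times the squared trace. The function names and the default \(k = 0.06\), chosen from within the preferred interval, are assumptions made for illustration.

```python
def hessian_feature_strength(lxx, lyy, lxy, k=0.06):
    """Unsigned Hessian feature strength: det(H) - k * trace(H)**2."""
    if not 0.0 <= k < 0.25:
        raise ValueError("k must lie in [0, 1/4)")
    return lxx * lyy - lxy ** 2 - k * (lxx + lyy) ** 2


def keep_interest_point(lxx, lyy, lxy, k=0.06):
    """Accept a Laplacian-based candidate only if the strength measure is positive."""
    return hessian_feature_strength(lxx, lyy, lxy, k) > 0.0


if __name__ == "__main__":
    print(keep_interest_point(-1.0, -1.0, 0.0))    # blob-like structure
    print(keep_interest_point(-1.0, -0.01, 0.0))   # elongated, ridge-like structure
```

The blob-like example has comparable curvature in both directions and passes the test, whereas the elongated example, with curvature essentially in one direction only, is rejected.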
References
1. Aanaes, H., Lindbjerg-Dahl, A., Pedersen, K.S.: Interesting interest points: a comparative study of interest point performance on a unique data set. Int. J. Comput. Vis. 97(1), 18–35 (2012)
2. Abramowitz, M., Stegun, I.A. (eds.): Handbook of Mathematical Functions, 55th edn. National Bureau of Standards, Applied Mathematics Series (1964)
3. Adelson, E., Bergen, J.: Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A 2, 284–299 (1985)
4. Alcantarilla, P.F., Bartoli, A., Davison, A.J.: KAZE features. In: Proceedings of European Conference on Computer Vision (ECCV 2012). Springer LNCS, vol. 7577, pp. 214–227 (2012)
5. Bay, H., Ess, A., Tuytelaars, T., van Gool, L.: Speeded up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
6. Bilinski, P., Bremond, F.: Evaluation of local descriptors for action recognition in videos. In: International Conference on Computer Vision Systems, pp. 61–70 (2011)
7. Brox, T., Weickert, J.: A TV flow based local scale measure for texture discrimination. In: Proceedings of European Conference on Computer Vision (ECCV 2004), pp. 578–590 (2004)
8. Brox, T., Weickert, J.: A TV flow based local scale estimate and its application to texture discrimination. J. Vis. Commun. Image Represent. 17(5), 1053–1073 (2006)
9. Chakraborty, B., Holte, M.B., Moeslund, T.B., Gonzàlez, J.: Selective spatiotemporal interest points. Comput. Vis. Image Underst. 116(3), 396–410 (2012)
10. Comaniciu, D., Ramesh, V., Meer, P.: The variable bandwidth mean shift and data-driven scale selection. In: Proceedings of International Conference on Computer Vision (ICCV 2001), pp. 438–445. Vancouver, Canada (2001)
11. Dawn, D.D., Shaikh, S.H.: A comprehensive survey of human action recognition with spatiotemporal interest point (STIP) detector. Vis. Comput. 32(3), 289–306 (2016)
12. DeAngelis, G.C., Anzai, A.: A modern view of the classical receptive field: linear and nonlinear spatiotemporal processing by V1 neurons. In: Chalupa, L.M., Werner, J.S. (eds.) The Visual Neurosciences, vol. 1, pp. 704–719. MIT Press (2004)
13. DeAngelis, G.C., Ohzawa, I., Freeman, R.D.: Receptive field dynamics in the central visual pathways. Trends Neurosci. 18(10), 451–457 (1995)
14. de Geest, R., Tuytelaars, T.: Dense interest features for video processing. In: Proceedings of International Conference on Image Processing (ICIP 2014), pp. 5771–5775 (2014)
15. Demirci, M.F., Platel, B., Shokoufandeh, A., Florack, L., Dickinson, S.J.: The representation and matching of images using top points. J. Math. Imaging Vis. 35(2), 103–116 (2009)
16. Derpanis, K.G., Wildes, R.P.: Spacetime texture representation and recognition based on a spatiotemporal orientation analysis. IEEE Trans. Pattern Anal. Mach. Intell. 34(6), 1193–1205 (2012)
17. Dickscheid, T., Schindler, F., Förstner, W.: Coding images with local features. Int. J. Comput. Vis. 94(2), 154–174 (2011)
18. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatiotemporal features. In: Proceedings of 2nd Joint Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72. Beijing, China (2005)
19. Elder, J., Zucker, S.: Local scale control for edge detection and blur estimation. IEEE Trans. Pattern Anal. Mach. Intell. 20(7), 699–716 (1998)
20. Everts, I., van Gemert, J.C., Gevers, T.: Evaluation of color STIPs for human action recognition. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2013), pp. 2850–2857 (2013)
21. Everts, I., van Gemert, J.C., Gevers, T.: Evaluation of color spatiotemporal interest points for human action recognition. IEEE Trans. Image Process. 23(4), 1569–1580 (2014)
22. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. arXiv preprint arXiv:1604.06573 (2016)
23. Fleet, D.J., Langley, K.: Recursive filters for optical flow. IEEE Trans. Pattern Anal. Mach. Intell. 17(1), 61–67 (1995)
24. Florack, L.M.J.: Image Structure. Series in Mathematical Imaging and Vision. Springer, Berlin (1997)
25. Förstner, W., Dickscheid, T., Schindler, F.: Detecting interpretable and accurate scale-invariant keypoints. In: Proceedings of International Conference on Computer Vision (ICCV 2009), pp. 2256–2263 (2009)
26. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)
27. Guichard, F.: A morphological, affine, and Galilean invariant scale-space for movies. IEEE Trans. Image Process. 7(3), 444–456 (1998)
28. Hassner, T., Mayzels, V., Zelnik-Manor, L.: On SIFTs and their scales. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2012), pp. 1522–1528. Providence, Rhode Island (2012)
29. Hassner, T., Filosof, S., Mayzels, V., Zelnik-Manor, L.: SIFTing through scales. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1431–1443 (2016)
30. Holte, M.B., Chakraborty, B., Gonzalez, J., Moeslund, T.B.: A local 3D motion descriptor for multiview human action recognition from 4D spatiotemporal interest points. IEEE J. Sel. Top. Signal Process. 6(5), 553–565 (2012)
31. Hong, B.W., Soatto, S., Ni, K., Chan, T.: The scale of a texture and its application to segmentation. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–8 (2008)
32. Hubel, D.H., Wiesel, T.N.: Receptive fields of single neurones in the cat’s striate cortex. J. Physiol. 147, 226–238 (1959)
33. Hubel, D.H., Wiesel, T.N.: Brain and Visual Perception: The Story of a 25-Year Collaboration. Oxford University Press, Oxford (2005)
34. Iijima, T.: Observation theory of twodimensional visual patterns. Technical Report, Papers of Technical Group on Automata and Automatic Control, IECE, Japan (1962)
35. Jacobs, N., Pless, R.: Time scales in video surveillance. IEEE Trans. Circuits Syst. Video Technol. 18(8), 1106–1113 (2008)
36. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: International Conference on Computer Vision (ICCV’07), pp. 1–8 (2007)
37. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
38. Jones, P.W., Le, T.M.: Local scales and multiscale image decompositions. Appl. Comput. Harmonic Anal. 26(3), 371–394 (2009)
39. Kadir, T., Brady, M.: Saliency, scale and image description. Int. J. Comput. Vis. 45(2), 83–105 (2001)
40. Kang, Y., Morooka, K., Nagahashi, H.: Scale invariant texture analysis using multiscale local autocorrelation features. In: Proceedings of Scale Space and PDE Methods in Computer Vision (Scale-Space’05). Springer LNCS, vol. 3459, pp. 363–373 (2005). Springer
41. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: Proceedings of Computer Vision and Pattern Recognition (CVPR’04), pp. II: 506–513. Washington, DC (2004)
42. Khan, N.Y., McCane, B., Wyvill, G.: SIFT and SURF performance evaluation against various image deformations on benchmark dataset. In: Proceedings of International Conference on Digital Image Computing Techniques and Applications (DICTA 2011), pp. 501–506 (2011)
43. Kläser, A., Marszalek, M., Schmid, C.: A spatiotemporal descriptor based on 3D-gradients. In: Proceedings of British Machine Vision Conference, Leeds, UK (2008)
44. Koenderink, J.J.: The structure of images. Biol. Cybern. 50, 363–370 (1984)
45. Koenderink, J.J.: Scale-time. Biol. Cybern. 58, 159–162 (1988)
46. Koenderink, J.J., van Doorn, A.J.: Representation of local geometry in the visual system. Biol. Cybern. 55, 367–375 (1987)
47. Koenderink, J.J., van Doorn, A.J.: Generic neighborhood operators. IEEE Trans. Pattern Anal. Mach. Intell. 14(6), 597–605 (1992)
48. Laptev, I., Lindeberg, T.: Local descriptors for spatiotemporal recognition. In: Proceedings of ECCV’04 Workshop on Spatial Coherence for Visual Motion Analysis, Springer LNCS, vol. 3667, pp. 91–103. Prague, Czech Republic (2004)
49. Laptev, I., Lindeberg, T.: Space-time interest points. In: Proceedings of International Conference on Computer Vision (ICCV 2003), pp. 432–439. Nice, France (2003)
50. Laptev, I., Lindeberg, T.: Velocity-adapted spatiotemporal receptive fields for direct recognition of activities. Image Vis. Comput. 22(2), 105–116 (2004)
51. Laptev, I., Caputo, B., Schuldt, C., Lindeberg, T.: Local velocity-adapted motion events for spatiotemporal recognition. Comput. Vis. Image Underst. 108, 207–229 (2007)
52. Larsen, A.B.L., Darkner, S., Dahl, A.L., Pedersen, K.S.: Jet-based local image descriptors. In: Proceedings of European Conference on Computer Vision (ECCV 2012), Springer LNCS, vol. 7574, pp. III: 638–650. Springer (2012)
53. Li, Z., Gavves, E., Jain, M., Snoek, C.G.M.: VideoLSTM convolves, attends and flows for action recognition. arXiv preprint arXiv:1607.01794 (2016)
54. Li, Y., Tax, D.M.J., Loog, M.: Supervised scale-invariant segmentation (and detection). In: Proceedings of Scale Space and Variational Methods in Computer Vision (SSVM 2011), Springer LNCS, vol. 6667, pp. 350–361. Springer, Ein Gedi, Israel (2012)
55. Li, Y., Tax, D.M.J., Loog, M.: Scale selection for supervised image segmentation. Image Vis. Comput. 30(12), 991–1003 (2012)
56. Lindeberg, T.: Scale-space for discrete signals. IEEE Trans. Pattern Anal. Mach. Intell. 12(3), 234–254 (1990)
57. Lindeberg, T.: Discrete derivative approximations with scale-space properties: a basis for low-level feature extraction. J. Math. Imaging Vis. 3(4), 349–376 (1993)
58. Lindeberg, T.: Effective scale: a natural unit for measuring scale-space lifetime. IEEE Trans. Pattern Anal. Mach. Intell. 15(10), 1068–1074 (1993)
59. Lindeberg, T.: Scale-Space Theory in Computer Vision. Springer, Berlin (1993)
60. Lindeberg, T.: Scale-space theory: a basic tool for analysing structures at different scales. J. Appl. Stat. 21(2), 225–270 (1994)
61. Lindeberg, T.: Linear spatiotemporal scale-space. In: ter Haar Romeny, B.M., Florack, L.M.J., Koenderink, J.J., Viergever, M.A. (eds.) Proceedings of International Conference on Scale-Space Theory in Computer Vision (Scale-Space’97), Springer LNCS, vol. 1252, pp. 113–127. Springer, Utrecht, The Netherlands (1997)
62. Lindeberg, T.: Principles for automatic scale selection. In: Handbook on Computer Vision and Applications, pp. 239–274. Academic Press, Boston, USA (1999). http://www.csc.kth.se/cvap/abstracts/cvap222.html
63. Lindeberg, T.: On automatic selection of temporal scales in time-causal scale-space. In: Sommer, G., Koenderink, J.J. (eds.) Proceedings of AFPAC’97: Algebraic Frames for the Perception-Action Cycle, Springer LNCS, vol. 1315, pp. 94–113. Kiel, Germany (1997)
64. Lindeberg, T.: Edge detection and ridge detection with automatic scale selection. Int. J. Comput. Vis. 30(2), 117–154 (1998)
65. Lindeberg, T.: Feature detection with automatic scale selection. Int. J. Comput. Vis. 30(2), 77–116 (1998)
66. Lindeberg, T.: A scale selection principle for estimating image deformations. Image Vis. Comput. 16(14), 961–977 (1998)
67. Lindeberg, T.: Generalized Gaussian scale-space axiomatics comprising linear scale-space, affine scale-space and spatiotemporal scale-space. J. Math. Imaging Vis. 40(1), 36–81 (2011)
68. Lindeberg, T.: Scale invariant feature transform. Scholarpedia 7(5), 10491 (2012)
69. Lindeberg, T.: A computational theory of visual receptive fields. Biol. Cybern. 107(6), 589–635 (2013)
70. Lindeberg, T.: Generalized axiomatic scale-space theory. In: Hawkes, P. (ed.) Advances in Imaging and Electron Physics, vol. 178, pp. 1–96. Elsevier, Amsterdam (2013)
71. Lindeberg, T.: Invariance of visual operations at the level of receptive fields. PLoS ONE 8(7), e66990 (2013)
72. Lindeberg, T.: Scale selection properties of generalized scale-space interest point detectors. J. Math. Imaging Vis. 46(2), 177–210 (2013)
73. Lindeberg, T.: Scale selection. In: Ikeuchi, K. (ed.) Computer Vision: A Reference Guide, pp. 701–713. Springer, Berlin (2014)
74. Lindeberg, T.: Image matching using generalized scale-space interest points. J. Math. Imaging Vis. 52(1), 3–36 (2015)
75. Lindeberg, T.: Time-causal and time-recursive spatiotemporal receptive fields. J. Math. Imaging Vis. 55(1), 50–88 (2016)
76. Lindeberg, T.: Dense scale selection over space, time and space-time. arXiv preprint arXiv:1709.08603 (2017)
77. Lindeberg, T.: Temporal scale selection in time-causal scale space. J. Math. Imaging Vis. 58(1), 57–101 (2017)
78. Lindeberg, T.: Normative theory of visual receptive fields. arXiv preprint arXiv:1701.06333 (2017)
79. Lindeberg, T.: Spatiotemporal scale selection in video data. In: Proceedings of Scale-Space and Variational Methods for Computer Vision (SSVM 2017), Springer LNCS, vol. 10302, pp. 3–15. Kolding, Denmark (2017)
80. Lindeberg, T., Bretzner, L.: Real-time scale selection in hybrid multiscale representations. In: Griffin, L., Lillholm, M. (eds.) Proc. Scale-Space Methods in Computer Vision (Scale-Space’03), Springer LNCS, vol. 2695, pp. 148–163. Springer, Isle of Skye, Scotland (2003)
81. Lindeberg, T., Fagerström, D.: Scale-space with causal time direction. In: Proceedings of European Conference on Computer Vision (ECCV’96), Springer LNCS, vol. 1064, pp. 229–240. Cambridge, UK (1996)
 82.Liu, X.M., Wang, C., Yao, H., Zhang, L.: The scale of edges. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2012), pp. 462–469 (2012)Google Scholar
 83.Loog, M., Li, Y., Tax, D.: Maximum membership scale selection. In: Multiple Classifier Systems, Springer LNCS, vol. 5519, pp. 468–477. Springer (2009)Google Scholar
 84.Lowe, D.G.: Distinctive image features from scaleinvariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)Google Scholar
 85.Luo, B., Aujol, J.F., Gousseau, Y.: Local scale measure from the topographic map and application to remote sensing images. Multiscale Model. Simul. 8(1), 1–29 (2009)MathSciNetzbMATHGoogle Scholar
 86.Mainali, P., Lafruit, G., Yang, Q., Geelen, B., Gool, L.V., Lauwereins, R.: SIFER: Scaleinvariant feature detector with error resilience. Int. J. Comput. Vis. 104(2), 172–197 (2013)zbMATHGoogle Scholar
 87.Mainali, P., Lafruit, G., Tack, K., van Gool, L., Lauwereins, R.: Derivativebased scale invariant image feature detector with error resilience. IEEE Trans. Image Process. 23(5), 2380–2391 (2014)MathSciNetzbMATHGoogle Scholar
 88.Maninis, K., Koutras, P., Maragos, P.: Advances on action recognition in videos using an interest point detector based on multiband spatiotemporal energies. In: International Conference on Image Processing (ICIP 2014), pp. 1490–1494 (2014)Google Scholar
 89.Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. Int. J. Comput. Vis. 60(1), 63–86 (2004)Google Scholar
 90.Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)Google Scholar
 91.Mrázek, P., Navara, M.: Selection of optimal stopping time for nonlinear diffusion filtering. Int. J. Comput. Vis. 52(2–3), 189–203 (2003)Google Scholar
 92.Ng, J., Bharath, A.A.: Steering in scale space to optimally detect image structures. In: Proceedings of European Conference on Computer Vision (ECCV 2004), Springer LNCS, vol. 3021, pp. 482–494 (2004)Google Scholar
 93.Niebles, J.C., Wang, H., FeiFei, L.: Unsupervised learning of human action categories using spatialtemporal words. Int. J. Comput. Vis. 79(3), 299–318 (2008)Google Scholar
 94.Oikonomopoulos, A., Patras, I., Pantic, M.: Spatiotemporal salient points for visual recognition of human actions. IEEE Trans. Syst. Man Cybern. Part B 36(3), 710–719 (2005)Google Scholar
 95.Poppe, R.: A survey on visionbased human action recognition. Image Vis. Comput. 28(6), 976–990 (2010)Google Scholar
 96.Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. arXiv preprint arXiv:1611.00850 (2016)Google Scholar
 97.Rapantzikos, K., Avrithis, Y., Kollias, S.: Dense saliencybased spatiotemporal feature points for action recognition. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2009), pp. 1454–1461 (2009)Google Scholar
 98.RiveroMoreno, C.J., Bres, S.: Spatiotemporal primitive extraction using Hermite and Laguerre filters for early vision video indexing. In: Image Analysis and Recognition. Springer LNCS , vol.3211, pp. 825–832 (2004)Google Scholar
 99.Scovanner, P., Ali, S., Shah, M.: A 3dimensional SIFT descriptor and its application to action recognition. In: Proceedings of ACM International Conference on Multimedia, pp. 357–360 (2007)Google Scholar
 100.Shabani, A.H., Clausi, D.A., Zelek, J.S.: Evaluation of local spatiotemporal salient feature detectors for human action recognition. In: Proceedings of Computer and Robot Vision (CRV 2012), pp. 468–475 (2012)Google Scholar
 101.Shabani, A.H., Clausi, D.A., Zelek, J.S.: Improved spatiotemporal salient feature detection for action recognition. In: British Machine Vision Conference (BMVC’11), pp. 1–12. Dundee, UK (2011)Google Scholar
 102.Shao, L., Mattivi, R.: Feature detector and descriptor evaluation in human action recognition. In: Proceedings of ACM International Conference on Image and Video Retrieval (CIVR’10), pp. 477–484. Xian, China (2010)Google Scholar
 103.Simonyan, K., Zisserman, A.: Twostream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems (NIPS 2014), pp. 568–576 (2014)Google Scholar
 104.Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human action classes from videos in the wild. Tech. Rep. CRCVTR1201, Center for Research in Computer Vision, University of Central Florida (2012). arXiv preprint arXiv:1212.0402Google Scholar
 105.Sporring, J., Colios, C.J., Trahanias, P.E.: Generalized scale selection. In: Proceedings of International Conference on Image Processing (ICIP’00), pp. 920–923. Vancouver, Canada (2000)Google Scholar
 106.Sporring, J., Nielsen, M., Florack, L., Johansen, P. (eds.): Gaussian ScaleSpace Theory: Proceedings of PhD School on ScaleSpace Theory. Series in Mathematical Imaging and Vision. Springer, Copenhagen, Denmark (1997)Google Scholar
 107.Stöttinger, J., Hanbury, A., Sebe, N., Gevers, T.: Sparse color interest points for image retrieval and object categorization. IEEE Trans. Image Process. 21(5), 2681–2692 (2012)MathSciNetzbMATHGoogle Scholar
 108.Tamrakar, A., Ali, S., Yu, Q., Liu, J., Javed, O., Divakaran, A., Cheng, H., Sawhney, H.: Evaluation of lowlevel features and their combinations for complex event detection in open source videos. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2012), pp. 3681–3688 (2012)Google Scholar
 109.Tau, M., Hassner, T.: Dense correspondences across scenes and scales. IEEE Trans. Pattern Anal. Mach. Intell. 38(5), 875–888 (2016)Google Scholar
 110.ter Haar Romeny, B., Florack, L., Nielsen, M.: Scaletime kernels and models. In: Proceedings of International Conference on ScaleSpace and Morphology in Computer Vision (ScaleSpace’01), Springer LNCS. Springer, Vancouver, Canada (2001)Google Scholar
 111.ter Haar Romeny, B.: FrontEnd Vision and Multiscale Image Analysis. Springer, Berlin (2003)Google Scholar
 112.Tuytelaars, T., Mikolajczyk, K.: A Survey on Local Invariant Features, Foundations and Trends in Computer Graphics and Vision, vol. 3(3). Now Publishers (2008)Google Scholar
 113.Tuytelaars, T., van Gool, L.: Matching widely separated views based on affine invariant regions. Int. J. Comput. Vis. 59(1), 61–85 (2004)Google Scholar
 114.van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1582–1596 (2010)Google Scholar
 115.Vanhamel, I., Mihai, C., Sahli, H., Katartzis, A., Pratikakis, I.: Scale selection for compact scalespace representation of vectorvalued images. Int. J. Comput. Vis. 84(2), 194–204 (2009)Google Scholar
 116.Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2011), pp. 3169–3176 (2011)Google Scholar
 117.Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectorypooled deepconvolutional descriptors. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 4305–4314 (2015)Google Scholar
 118.Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of International Conference on Computer Vision (ICCV 2013), pp. 3551–3558 (2013)Google Scholar
 119.Wang, H., Ullah, M.M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local spatiotemporal features for action recognition. In: Proceedings of British Machine Vision Conference (BMVC 2009). London, UK (2009)Google Scholar
 120.Weickert, J., Ishikawa, S., Imiya, A.: Linear scalespace has first been proposed in Japan. J. Math. Imaging Vis. 10(3), 237–252 (1999)MathSciNetzbMATHGoogle Scholar
 121.Weinland, D., Ronfard, R., Boyer, E.: A survey of visionbased methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 115(2), 224–241 (2011)Google Scholar
 122.Willems, G., Tuytelaars, T., van Gool, L.: An efficient dense and scaleinvariant spatiotemporal interest point detector. In: Proceedings og European Conference on Computer Vision (ECCV 2008), Springer LNCS, vol. 5303, pp. 650–663. Marseille, France (2008)Google Scholar
 123.Witkin, A.P.: Scalespace filtering. In: Proceedings of 8th International Joint Conference on Artificial Intelligence, pp. 1019–1022. Karlsruhe, Germany (1983)Google Scholar
 124.Wong, S.F., Cipolla, R.: Extracting spatiotemporal interest points using global information. In: International Conference on Computer Vision (ICCV 2007), pp. 1–8 (2007)Google Scholar
 125.ZelnikManor, L., Irani, M.: Eventbased analysis of video. In: Proceedings of Computer Vision and Pattern Recognition (CVPR’01), pp. II: 123–130 (2001)Google Scholar
 126.Zhen, X., Shao, L.: Action recognition via spatiotemporal local features: a comprehensive study. Image Vis. Comput. 50, 1–13 (2016)Google Scholar
 127.Zhu, Y., Chen, W., Guo, G.: Evaluating spatiotemporal interest point features for depthbased action recognition. Image Vis. Comput. 32(8), 453–464 (2014)Google Scholar
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.