Large-scale predictive modeling and analytics through regression queries in data management systems
Abstract
Regression analytics has been the standard approach to modeling the relationship between input and output variables, while recent trends aim to incorporate advanced regression analytics capabilities within data management systems (DMS). Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and an associated statistical learning methodology, which are efficient, scalable and accurate in discovering piecewise linear dependencies among variables by observing only regression queries and their answers issued to a DMS. We focus on in-DMS piecewise linear regression and specifically on predicting the answers to mean-value aggregate queries, identifying and delivering the piecewise linear dependencies between variables to regression queries and predicting the data dependent variables within specific data subspaces defined by analysts and data scientists. Our goal is to discover a piecewise linear data function approximation over the underlying data only through query–answer pairs that is competitive with the best piecewise linear approximation to the ground truth. Our methodology is analyzed, evaluated and compared with the exact solution and near-perfect approximations of the underlying relationships among variables, achieving orders of magnitude improvement in analytics processing.
Keywords
Predictive analytics Piecewise linear regression learning Query-driven analytics Data subspace exploration Vector regression quantization
1 Introduction
Predictive Modeling and Analytics (PMA) concerns data exploration, model fitting, and regression model learning tasks used in many real-life applications [5, 16, 22, 40, 43]. The major goal of PMA is to explore and analyze multidimensional feature vector data spaces [1]. Recently, we have seen a rapid growth of large-scale advanced regression analytics in areas like deep learning for image recognition [22], genome analysis [43] and aggregation analytics [9].
Predictive models like linear regression for prediction and logistic regression for classification are typically desired for exploring data subspaces of a d-dimensional data space of interest in the real-valued space \(\mathbb {R}^{d}\). In in-DMS exploratory analytics and exploratory computing [24], such data subspaces are identified using selection operators over the values of attributes of interest. Within such data subspaces, PMA can provide local approximation functions or models focusing mainly on identifying dependencies among features like covariance estimations and linear regression coefficients. Selection operators include radius (a.k.a. distance near neighbor (dNN) [7]) queries, which are of high importance in today’s applications: contextual data stream analytics [4], aggregate predictive analytics over DMS [6], edge computing analytics over data streams in Internet of Things environments [31], location-based predictive analytics [3], searching for statistical correlations of spatially close objects (within a radius), measuring multivariate skewness [36], spatial analytics [38] focusing on the construction of semivariograms in a specific geographical region [55, 56], earth analytics monitoring regions of interest from sensors’ acoustic signals, and environmental monitoring for chemical compounds correlation analysis given a geographical area.
A major challenge in PMA is to model and learn the very local statistical information of the data subspaces of interest to analysts, e.g., local regression coefficients and local data approximation functions, and then extrapolate such knowledge to predict such information for unexplored data subspaces [53]. Based on this abstraction of PMA, which is widely applied in the above-mentioned real-life applications, we focus on two important predictive analytics queries for in-DMS analytics: mean-value queries and linear regression queries.
Example 1
Consider the running example in Fig. 2. Seismologists issue a mean-value query Q1 over a 3-dim. space \((u,x_{1},x_{2}) \in \mathbb {R}^{3}\), which returns the mean value y of the feature u (seismic signal; P-wave speed) of those spatial points \((x_{1},x_{2}) \in \mathbb {D}(\mathbf {x}_{0},\theta ) \subset \mathbb {R}^{2}\) projections (referring to surface longitude and latitude) within a disk of center \(\mathbf {x}_{0}\) and radius \(\theta \). The query Q1 is central to PMA because the average y is always used as a linear sufficient statistic for the data subspace \(\mathbb {D}\), and it is the best linear predictor of the seismic signal output u based on the region identified around the center point \((x_{1},x_{2}) \in \mathbb {D}(\mathbf {x}_{0},\theta )\) [32].
A linear regression query Q2 calculates the coefficients of a linear regression function within a defined data subspace. For example, in Fig. 2, consider geophysicists issuing queries Q2 over a 3-dim. space \((u,x_{1},x_{2}) \in \mathbb {R}^{3}\), which return the seismic primary-wave (P-wave) velocity u-intercept (\(b_{0}\)) and the coefficients \(b_{1}\) and \(b_{2}\) for \(x_{1}\) (longitude) and \(x_{2}\) (latitude), where the \(\mathbf {x} = [x_{1},x_{2}]\) points belong to a subspace \(\mathbb {D}(\mathbf {x}_{0}, \theta ) \subset \mathbb {R}^{2}\). By estimating the linear coefficients, e.g., the parameter row vector \(\mathbf {b} = [b_{0}, b_{1}, b_{2}]\), we can then interpret the relationships among the features \(\mathbf {x}\) and u and assess the statistical significance of each feature of \(\mathbf {x}\) within \(\mathbb {D}(\mathbf {x}_{0},\theta )\). The output of the Q2 query refers to the dependency of u on \(\mathbf {x}\), which in our example is approximated by a 2-dim. plane \(u \approx b_{0} + b_{1}x_{1} + b_{2}x_{2}\), and quantifies how well the local linear model fits the data.
where SQRT(x) is the square root of real number x.
The result is the intercept \(b_{0}\) and regression coefficients \(\mathbf {b} = [b_{1}, b_{2}]\).
To evaluate queries Q1 and Q2, the system must access the data to establish the data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\), and then take the average value of u in that subspace for query Q1 (e.g., the average seismic signal speed in San Andreas, CA, region) and invoke a multivariate linear regression algorithm [32] for Q2. The Q1 and Q2 type queries are provided by all modern PMA systems like Spark analytics [47], MATLAB^{1} and DMS systems, e.g., XLeratorDB^{2} of Microsoft SQL Server^{3} and Oracle UTL_NLA.^{4}
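As a rough illustration of what exact evaluation of Q1 and Q2 involves, the following Python sketch selects the points inside \(\mathbb {D}(\mathbf {x}_{0},\theta )\) under the \(L_{2}\) norm, averages u for Q1 and runs an OLS fit for Q2. The function names, the in-memory NumPy layout and the synthetic noise-free plane are assumptions made for illustration only; they are not the interface of any of the systems named above.

```python
import numpy as np

def select_subspace(X, u, x0, theta):
    """Return the inputs/outputs whose inputs lie within L2 distance theta of x0."""
    mask = np.linalg.norm(X - x0, axis=1) <= theta
    return X[mask], u[mask]

def mean_value_query(X, u, x0, theta):
    """Q1: average output u over the data subspace D(x0, theta)."""
    _, u_sub = select_subspace(X, u, x0, theta)
    return float(u_sub.mean())

def linear_regression_query(X, u, x0, theta):
    """Q2: OLS intercept b0 and slope vector b over D(x0, theta)."""
    X_sub, u_sub = select_subspace(X, u, x0, theta)
    A = np.column_stack([np.ones(len(X_sub)), X_sub])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, u_sub, rcond=None)
    return float(coef[0]), coef[1:]

# Synthetic data (assumption): u is an exact plane over (x1, x2).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5000, 2))
u = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

x0, theta = np.array([0.2, -0.1]), 0.5
y = mean_value_query(X, u, x0, theta)
b0, b = linear_regression_query(X, u, x0, theta)
```

Both functions scan all n points to establish the subspace before doing any aggregation, which is the per-query data-access cost that the rest of the paper aims to avoid.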
Remark 1
Please refer to Table 2 in “Appendix” for a table of notations and symbols used in this paper.
1.1 Desiderata

D1 Are there linear dependencies among dimensions in unexplored data subspaces, and which are such subspaces?

D2 If there are data subspaces, where linear approximations fit well with high confidence, can the system provide these yet unknown linear regression models efficiently and scalably to the analysts?

D3 If in some subspaces linear approximations do not fit well w.r.t. analysts’ needs, can the system provide fitting models through piecewise local linear approximations?

D4 A solution must meet scalability, efficiency, and accuracy desiderata as well.
Let us now move a step further to provide more information about the modeled relationship function g in (1). The derivation of several local linear approximations, as opposed to a single linear approximation over the whole data space, can provide more accurate and significant insights. The key issue to note here is that a global (single) linear approximation of g interpolating among all items of the whole data space \(\mathbb {D}\) leaves, in general, much to be desired: The analysts presented with a single global linear approximation might have an inaccurate view due to missing ‘local’ statistical dependencies within unknown local data subspaces that comprise \(\mathbb {D}\). This will surely lead to prediction errors and approximation inaccuracies when issuing queries Q1 and Q2 to the DMS.
Example 2
Consider the input–output pairs in a (u, x) 2-dim. space in Fig. 3(upper) and the actual data function \(u = g(x)\) (in red). A Q2 query issued over the data subspace \(\mathbb {D}(x_{0},\theta )\) will calculate the intercept \(b_{0}\) and slope \(b_{1}\) of the linear approximation \(u \approx \hat{g}(x) = b_{0} + b_{1}x\) (the green line l) over those \(x \in \mathbb {D}(x_{0},\theta )\). Evidently, such a line shows a very coarse and unrepresentative dependency between output u and input x, since u and x do not linearly depend on each other within the entire data subspace \(\mathbb {D}(x_{0},\theta )\). The point is that we should obtain a finer-grained and more accurate dependency between output u and input x. The principle of local linearity [26] states that linear approximations of the underlying data function in certain data subspaces fit the global nonlinearity better in the entire data subspace of interest. In Fig. 3(upper), we observe four local linear approximations \(l_{1}, \ldots , l_{4}\) in the data subspace. Therefore, it would be preferable if, as a result of query Q2, the analysts were provided with a list of the local line segments \(\mathcal {S} = \{l_{1}, \ldots , l_{4}\}\), a.k.a. piecewise linear regression. These ‘local’ segments better approximate the linearity of output u. Moreover, in Fig. 3(lower) the underlying data function \(u = g(x_{1},x_{2})\) in the 3-dim. data space does not exhibit linearity over the entire \((x_{1},x_{2})\) plane. We can observe how the global linear relationship \(\hat{g}(x_{1},x_{2})\) cannot capture the very local statistical dependencies between \(\mathbf {x} = [x_{1},x_{2}]\) and u, which are better captured in certain data subspaces by certain local line segments \(\hat{g}_{k}(x_{1},x_{2})\).
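The gap between a single global line and a handful of local segments can be reproduced numerically. The sketch below is only an illustration under stated assumptions: the data function is taken to be \(g(x) = \sin (x)\), and the four segment boundaries are assumed to be known and equally spaced, which is precisely what is not available in the setting of this paper. Function names are hypothetical.

```python
import numpy as np

def fit_line(x, u):
    """OLS fit u ~ b0 + b1 * x; returns (b0, b1)."""
    b1, b0 = np.polyfit(x, u, deg=1)  # polyfit returns highest degree first
    return b0, b1

def mse_global(x, u):
    """MSE of one line fitted over the whole subspace."""
    b0, b1 = fit_line(x, u)
    return float(np.mean((u - (b0 + b1 * x)) ** 2))

def mse_piecewise(x, u, boundaries):
    """Fit one line per interval between consecutive boundaries; pool the errors."""
    sq_err = np.empty_like(u)
    idx = np.digitize(x, boundaries[1:-1])  # interval index of each x
    for k in range(len(boundaries) - 1):
        m = idx == k
        b0, b1 = fit_line(x[m], u[m])
        sq_err[m] = (u[m] - (b0 + b1 * x[m])) ** 2
    return float(sq_err.mean())

x = np.linspace(0.0, 2.0 * np.pi, 400)
u = np.sin(x)                                   # assumed nonlinear data function
boundaries = np.linspace(0.0, 2.0 * np.pi, 5)   # four segments, assumed known
global_err = mse_global(x, u)
local_err = mse_piecewise(x, u, boundaries)
```

On this toy function the pooled error of the four local lines is far smaller than that of the single global line, mirroring the contrast between l and \(l_{1}, \ldots , l_{4}\) in the example.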
Definition 1
The data function \(g: \mathbb {R}^{d} \rightarrow \mathbb {R}\) is a K-piecewise linear function if there exists a partition of the input data space \(\mathbb {D} \subset \mathbb {R}^{d}\) into K disjoint subspaces \(\mathbb {D}_{1}, \ldots , \mathbb {D}_{K}\) with corresponding linear regression parameters \(\mathbf {b}_{X,1}, \ldots , \mathbf {b}_{X,K} \in \mathbb {R}^{d}\) such that for all \(\mathbf {x} = [x_{1}, \ldots , x_{d}] \in \mathbb {R}^{d}\) we have that \(u(\mathbf {x}) = \mathbf {b}_{X,k} \cdot \mathbf {x}^{\top }\), if \(\mathbf {x} \in \mathbb {D}_{k}\).
The case where K is fixed (given) has received considerable attention in the research community [54]. The special case of piecewise polynomial functions (splines) has been also used in the context of inference including density estimation, histograms, and regression [39].
Let us now denote with \(\mathcal {L}_{K}\) the space of K-piecewise linear functions. While the ground truth may be close to a piecewise linear function, even in certain subspaces, generally we do not assume that it exactly follows a piecewise linear function (yet unknown). In this case, our goal is to recover a piecewise linear function that is competitive with the best piecewise linear approximation to the ground truth.
Remark 2
If the segments of the data function g were known a priori, the segmented regression problem could have been immediately reduced to K independent linear regression problems. In the general case, where the location of the segment boundaries and their corresponding coefficients are unknown, one needs to discover them using information provided only by the observations of input–output pairs \((\mathbf {x}_{i},u_{i})\). Previous approaches to this problem [13, 54], while statistically efficient, are computationally slow and prohibitive for large-scale data sets: their running time for a given data subspace scales at least quadratically with the number n of data points in the queried data subspace, which makes them impractical for large data subspaces, let alone the entire data space.
In our context, however, the analysts explore the data space only by issuing queries over specific data subspaces, thus, observing only the answers of the analytics queries. Specifically, the analysts do not know before issuing a query how the data function behaves within an ad hoc defined data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\). When a query Q2 is issued, it is not known whether the data function g behaves with the same linearity throughout the entire \(\mathbb {D}(\mathbf {x}_{0},\theta )\) or not, and within which subspaces, if any, g changes its trend and u and \(\mathbf {x}\) exhibit linear dependencies. Thus, the desiderata D2 and D3 focus on learning the boundaries of these local subspaces within \(\mathbb {D}(\mathbf {x}_{0},\theta )\) and within each local subspace, discovering the linear dependency (segment) between output u and input \(\mathbf {x}\). This would arm analysts and data scientists with much more accurate knowledge on how the data function \(g(\mathbf {x})\) behaves within a given data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\). Hence, decisions on further data exploration w.r.t. complex model selection and/or validation can be taken by the analysts.

approximate the underlying data function g over the analysts’ queried data subspaces by estimating the unknown K segments and their unknown local model coefficients/PLR segments. This has to be achieved based only on the issued queries and their answers, where no data access is provided to analysts by the DMS;

predict the list \(\mathcal {S}\) of the linear models (segments) that model the PLR estimator data function \(\hat{g}\) minimizing the MSE in (3). Such models best explain (fit) the underlying data function g over a given data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\);

predict the answer y of any unseen mean-value query over a data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\);

predict the output data value \(\hat{u}\) given an unseen input \(\mathbf {x}\) based on the approximate data function \(\hat{g}\).
Remark 3
In the prediction phase, that is, after training, no access to the underlying data systems is required, thus, ensuring desideratum D4.
1.2 Challenges and organization
In Fig. 4, we show the system context within which our rationale and contributions unfold. A DMS serves analytics queries from a large user community. Over time, all users (data scientists, statisticians, analysts, applications) will have issued a large number of queries (\(\mathcal {Q} = \{\)Q\(_{1}\), Q\(_{2}\), ...,Q\(_{n}\}\)), and the system will have produced responses (e.g., \(y_{1}, y_{2}, \ldots , y_{n}\) for Q1 queries). Our key idea is to inject a novel statistical learning model and novel query processing algorithms in between users and the DMS that monitors queries and responses and learns to associate a query with its response. After training, say after the first \(m < n\) queries \(\mathcal {T} =\{\)Q\(_{1}, \ldots ,\) Q\(_{m}\}\) then, for any new/unseen query Q\(_{t}\) with \(m < t \le n\), i.e., Q\(_{t} \in \mathcal {V} = \mathcal {Q} \setminus \mathcal {T}= \{\)Q\(_{m+1}, \ldots ,\) Q\(_{n}\}\), our model approximates the data function g with an estimator function \(\hat{g}\) through a list \(\mathcal {S}\) of local linear regression coefficients (line segments) that best fits the actual and unknown function g given the query Q\(_{t}\)’s data subspace and predicts its response \(\hat{y}_{t}\) without accessing the DMS. The efficiency and scalability benefits of our approach are evident. Computing the exact answers to queries Q1 and Q2 can be very time-consuming, especially for large data subspaces. So if this model and algorithms can deliver highly accurate answers, query processing times will be dramatically reduced. Scalability is also ensured for two reasons. Firstly, in the data dimension, as query Q1 and Q2 executions (after training) do not involve data accesses, even dramatic increases in DB size do not impact query execution. Secondly, in the query-throughput dimension, avoiding DMS internal resource utilization (that would be required if all Q1 and Q2 queries were executed over the DMS data) saves resources that can be devoted to supporting larger numbers of queries at any given point in time. Viewed from another angle, our contributions aim to exploit all the work performed by the DMS engine when answering previous queries, in order to facilitate accurate answers to future queries efficiently and scalably.

Identify the number and boundaries of the data subspaces with local linearities and deliver the local linear approximations for each subspace identified, i.e., predict the list \(\mathcal {S}\) for an unseen query Q2, so that Q2 need not be executed. This challenge is compounded by the following problem: the boundaries of these data subspaces are unknown and cannot be determined even if we could scan all of the data, which in any case would be inefficient and less scalable.

Predict the average value of answer y for an unseen query Q1, so that Q1 need not be executed.
The paper is organized as follows: Sect. 2 reports on the related work and provides our major contribution of this work. In Sect. 3, we formulate our problems and provide preliminaries, while Sect. 4 provides our novel statistical learning algorithms for large-scale predictive modeling. Section 5 introduces our proposed query-driven methodologies, corresponding algorithms and analyses, while in Sect. 6, we report on the piecewise linear approximation and query–answer prediction methods. The convergence analysis and the inherent computational complexity are elaborated in Sect. 7, and we provide a comprehensive performance evaluation and comparative assessment of our methodology in Sect. 8. Finally, Sect. 9 summarizes our work and discusses our future research agenda in the direction of query-driven predictive modeling.
2 Related work and contribution
2.1 Related work
Outwith DMS environments, statistical packages like MATLAB and R^{5} support fitting regression functions. However, their algorithms for doing so are inefficient and hardly scalable. Moreover, they lack support for relational and declarative Q1 and Q2 queries. So, if data are already in a DMS, they would need to be moved back and forth between external analytics environments and the DMS, resulting in considerable inconveniences and performance overheads (if at all possible for big datasets). At any rate, modern DMSs should provide analysts with rich support for PMA.
An increasing number of major database vendors include in their products data mining and machine learning analytic tools. PostgreSQL, MySQL, MADLib (over PostgreSQL) [21] and commercial tools like Oracle Data Miner, IBM Intelligent Miner and Microsoft SQL Server Data Mining provide SQL-like interfaces for analysts to specify regression tasks. Academic efforts include MauveDB [23], which integrates regression models into a DMS, while similarly FunctionDB [50] allows analysts to directly pose regression queries against a DMS. Also, Bismarck [27] integrates and supports in-DMS analytics, while [48] integrates and supports least squares regression models over training datasets defined by arbitrary join queries on database tables. All such works that also support Q1 and Q2 queries can serve as the DMS within Fig. 4. However, in the big data era exact Q1, Q2 computations leave much to be desired in efficiency and scalability, as the system must first execute the selection, establishing the data subspaces per query, and then access all tuples in Q1, Q2.
Apart from the standard multivariate linear regression algorithm, i.e., adopting ordinary least squares (OLS) for function approximation [29], related literature contains more elaborate piecewise regression algorithms, e.g., [10, 12, 18, 28], which can actually detect the nonlinearity of a data function g and provide multiple local linear approximations. Given an ad hoc exploration query over n points in a d-dim. space, the standard OLS regression algorithm has asymptotic computational complexity of \(O(n^{2}d^{5})\) and \(O(nd^{2}+d^{3})\), respectively [32]. Therefore, OLS algorithms suffer from poor scalability and efficiency, especially as n is getting larger and/or in high-dimensional spaces, as will be quantified in Sect. 8. Such methodologies incur these overheads for every query issued, which is highly undesirable. To address this, one may think to perform data-space analyses only once, seeking to derive all local linear regression models for the whole of the data space, and use the derived models for all queries. Indeed, a literature survey reveals several methods like [12, 42], which identify the nonlinearity of data function g and provide multiple local linear approximations. Unfortunately, these methods are very computationally expensive and thus do not scale with the number n of data points. All these methods execute queries like query Q2, going through a series of stages: partitioning the entire data space into clusters, assigning each data point to one of these clusters, and fitting a linear regression function to each of the clusters. However, data clustering cannot automatically guarantee that the within-cluster nonlinearity of the data function g is captured by a local linear fit. Hence, all these methods are iterative, repeating the above stages until convergence to minimize the residual estimation error of all approximated local linear regression functions. For instance, the method in [10] clusters and regresses the entire data space against K clusters with a complexity of \(O(K(n^{2}d + nd^{2}))\). Similarly, the incremental adaptive controller method [20] using self-organizing structures requires \(O(n^{2}dT)\) for training purposes. The same holds for the methods [12, 19, 20, 28] that combine iterative clustering and classification for piecewise regression, also requiring \(O(n^{2}dT)\). These figures indicate the high cost of computing exact answers with such regression methods. As all these methods derive regression models over the whole data space, e.g., over trillions of points, the scalability and efficiency desiderata are missed, as Sect. 8 will showcase.
This paper significantly extends our previous work [8] for scalable regression queries in the dimensions of mathematical analyses, fundamental theorems and proofs for vector quantization and piecewise multivariate linear regression (Sects. 5 and 6), theoretical analyses and proofs of the PLR data approximation and prediction error bounds (Sect. 5), analysis of the model convergence, variants of partial and global convergence of PLR data approximation, query answer prediction (Sects. 7 and 7.2), comprehensive sensitivity analysis, and comparative assessment of the proposed methodology (Sect. 8).
Our approach accurately supports predicting the result of mean-value Q1 queries, approximating the underlying data function g based on (multiple) local linear models of regression Q2 queries, and predicting the output data values given unseen inputs by estimating the underlying data function. It does so while achieving high prediction accuracy and goodness of fit, after training without executing Q1 and Q2, thus, without accessing data. This ensures a highly efficient and scalable solution, which is independent of the data sizes, as Sect. 8 will show.
2.2 Contribution

deliver a PLR-based data approximation of the data function g over different unseen subspaces that best explains the underlying function g within \(\mathbb {D}(\mathbf {x}_{0},\theta )\),

predict the data output value \(\hat{u} \approx g(\mathbf {x})\) for each unseen data input \(\mathbf {x} \in \mathbb {D}(\mathbf {x}_{0},\theta )\),

predict the average value y of the data output \(u=g(\mathbf {x})\) with \(\mathbf {x} \in \mathbb {D}(\mathbf {x}_{0},\theta )\).

A statistical learning methodology for query-driven PLR approximations of data functions over multidimensional data spaces. This methodology indirectly extracts information about the unknown data function g only by observing and learning the mapping between aggregation queries and their answers.

A joint optimization algorithm for minimizing the PLR data approximation error and answer prediction error in light of quantizing the query space.

Convergence analyses of the methodology including variants for supporting partial and global convergence.

Mathematical analyses of the query-driven PLR approximation and prediction error bounds.

Mean-value and data-value prediction algorithms for unseen Q1 and Q2 queries.

A PLR data approximation algorithm over data subspaces defined by unseen Q2 queries.

Sensitivity analysis and comparative performance assessment with PLR and multivariate linear regression algorithms found in the literature in terms of scalability, prediction accuracy, data value prediction error and goodness of fit of PLR data approximation.
3 Problem analysis
3.1 Definitions
Let \(\mathbf {x}= [x_{1}, \ldots , x_{d}] \in \mathbb {R}^{d}\) denote a multivariate random data input row vector, and \(u \in \mathbb {R}\) a univariate random output variable, with (unknown) joint probability distribution \(P(u,\mathbf {x})\). We notate \(g: \mathbb {R}^{d} \rightarrow \mathbb {R}\) with \(\mathbf {x} \mapsto u\) the unknown underlying data function from input \(\mathbf {x}\) to output \(u = g(\mathbf {x})\).
Definition 2
The linear regression function of input \(\mathbf {x} \in \mathbb {R}^{d}\) onto output \(u \in \mathbb {R}\) is: \(u = b_{0} + \sum _{i=1}^{d}b_{i}x_{i} + \epsilon = b_{0} + \mathbf {b}\mathbf {x}^{\top } + \epsilon \), where: \(\epsilon \) is a random error with mean \(\mathbb {E}[\epsilon ] = 0\) and variance \(Var(\epsilon )=\sigma ^{2} >0\), \(\mathbf {b} = [b_{1}, \ldots , b_{d}]\) is the slope row vector of real coefficients and \(b_{0}\) is the intercept.
Definition 3
The p-norm (\(L_{p}\)) distance between two input vectors \(\mathbf {x}\) and \(\mathbf {x}'\) from \(\mathbb {R}^{d}\) for \(1 \le p < \infty \), is \(\Vert \mathbf {x} - \mathbf {x}' \Vert _{p} = (\sum _{i=1}^{d} |x_{i} - x_{i}'|^{p})^{\frac{1}{p}}\) and for \(p = \infty \), is \(\Vert \mathbf {x} - \mathbf {x}' \Vert _{\infty } = \max _{i=1, \ldots , d} \{ |x_{i} - x_{i}'| \}\).
Consider a scalar \(\theta >0\), hereinafter referred to as radius, and a dataset \(\mathcal {B}\) consisting of n input–output pairs \((\mathbf {x}_{i},u_{i}) \in \mathcal {B}\).
Definition 4
Given input \(\mathbf {x} \in \mathbb {R}^{d}\) and radius \(\theta \), a data subspace \(\mathbb {D}(\mathbf {x},\theta )\) is the convex data subspace of \(\mathbb {R}^{d}\), which includes input vectors \(\mathbf {x}_{i}: \Vert \mathbf {x}_{i} - \mathbf {x} \Vert _{p} \le \theta \) with \((\mathbf {x}_{i},u_{i}) \in \mathcal {B}\).
Definition 5
Definition 6
The \(L_{2}^{2}\) distance or similarity measure between queries \(\mathbf {q}, \mathbf {q}' \in \mathbb {Q}\) is \(\Vert \mathbf {q} - \mathbf {q}' \Vert _{2}^{2} = \Vert \mathbf {x} - \mathbf {x}' \Vert _{2}^{2} + (\theta - \theta ')^{2}\).
Definition 7
The queries \(\mathbf {q}\), \(\mathbf {q}'\), which define the subspaces \(\mathbb {D}(\mathbf {x},\theta )\) and \(\mathbb {D}(\mathbf {x}',\theta ')\), respectively, overlap if the boolean indicator \(\mathcal {A}(\mathbf {q},\mathbf {q}') \in \{{\texttt {TRUE}, \texttt {FALSE}}\}\) satisfies: \(\mathcal {A}(\mathbf {q},\mathbf {q}') = (\Vert \mathbf {x} - \mathbf {x}' \Vert _{2} \le \theta + \theta ')\) = TRUE.
A query \(\mathbf {q} = [\mathbf {x},\theta ]\) defines a data subspace \(\mathbb {D}(\mathbf {x},\theta )\) w.r.t. dataset \(\mathcal {B}\).
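The distance and overlap definitions above translate directly into a few lines of code. The sketch below is a minimal Python rendering, assuming a query is stored as a flat array \([x_{1}, \ldots , x_{d}, \theta ]\); the function names are hypothetical and introduced only for illustration.

```python
import numpy as np

def lp_distance(x, x_prime, p=2.0):
    """L_p distance between two input vectors (Definition 3); p may be np.inf."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

def query_sq_distance(q, q_prime):
    """Squared L2 distance between queries q = [x, theta] (Definition 6)."""
    q = np.asarray(q, dtype=float)
    q_prime = np.asarray(q_prime, dtype=float)
    # Summing over all d+1 components gives ||x - x'||^2 + (theta - theta')^2.
    return float(((q - q_prime) ** 2).sum())

def overlap(q, q_prime):
    """Overlap indicator A(q, q') (Definition 7): centers closer than theta + theta'."""
    q = np.asarray(q, dtype=float)
    q_prime = np.asarray(q_prime, dtype=float)
    center_dist = np.linalg.norm(q[:-1] - q_prime[:-1])
    return bool(center_dist <= q[-1] + q_prime[-1])
```

For instance, the queries \([0, 0, 1]\) and \([3, 4, 2]\) have centers at \(L_{2}\) distance 5 and radii summing to 3, so they do not overlap, while their squared query distance is \(25 + 1 = 26\).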
3.2 Problem formulation
CH1: predict the aggregate output outcome \(\hat{y}\) of a random query \(\mathbf {q} = [\mathbf {x}, \theta ]\). Given an unknown query function \(f: \mathbb {Q} \subset \mathbb {R}^{d+1} \rightarrow \mathbb {R}\), which maps a query \(\mathbf {q} = [\mathbf {x},\theta ] \mapsto y\), we seek a query-PLR estimate \(\hat{f} \in \mathcal {L}_{K}\) to predict the actual answer \(y = f(\mathbf {q}) = f(\mathbf {x},\theta )\)^{6} for an unseen query \(\mathbf {q}\), i.e., \(\hat{y} = \hat{f}(\mathbf {q}) = \hat{f}(\mathbf {x},\theta )\). The challenge is:$$\begin{aligned} \hat{f} = \arg \min _{f \in \mathcal {L}_{K}}\text{ MSE }(f). \end{aligned}$$
CH2: identify the local linear approximations of the unknown data function \(u=g(\mathbf {x})\) over the data subspaces \(\mathbb {D}(\mathbf {x},\theta )\) defined by unseen queries \(\mathbf {q} = [\mathbf {x},\theta ]\). Based on the query-PLR estimate \(\hat{f}\) we seek a statistical learning methodology \(\mathcal {F}\) to extract a data-PLR estimate \(\hat{g} \in \mathcal {L}_{K}\) from the query-PLR estimate \(\hat{f}\), notated by \(\hat{g} = \mathcal {F}(\hat{f})\) to fit the data function g. The challenge is:$$\begin{aligned} \hat{g} = \arg \min _{g' \in \mathcal {L}_{K}}\{ \text{ MSE }(g') \mid g' = \mathcal {F}(\hat{f})\}. \end{aligned}$$

CH3: predict the data output \(\hat{u}\) of a random input data vector \(\mathbf {x}\) based on the data-PLR estimate \(\hat{g}\), i.e., \(\hat{u} = \hat{g}(\mathbf {x})\).
Theorem 1
(Decomposition). \(y = \mathbb {E}[y \mid \mathbf {x},\theta ] + \epsilon \), where \(\epsilon \) is mean-independent of \(\mathbf {x}\) and \(\theta \), i.e., \(\mathbb {E}[\epsilon \mid \mathbf {x},\theta ] = 0\) and therefore \(\epsilon \) is uncorrelated with any function of \(\mathbf {x}\) and \(\theta \).
For proof of Theorem 1, refer to [32]. According to Theorem 1, the aggregate output y can be decomposed into a conditional expectation function \(\mathbb {E}[y \mid \mathbf {x},\theta ]\), hereinafter referred to as query regression function, which is explained by \(\mathbf {x}\) and \(\theta \), and a leftover (noisy component) which is orthogonal to (i.e., uncorrelated with) any function of \(\mathbf {x}\) and \(\theta \).
In our context, the query regression function is a good candidate for minimizing the EPE in (5) envisaged as a local representative value for answer y over the data subspace \(\mathbb {D}(\mathbf {x},\theta )\). Therefore, the conditional expectation function is the best predictor of answer y given \(\mathbb {D}(\mathbf {x},\theta )\):
Theorem 2
(Conditional Expectation Function). Let \(f(\mathbf {x},\theta )\) be any function of \(\mathbf {x}\) and \(\theta \). The conditional expectation function \(\mathbb {E}[y \mid \mathbf {x},\theta ]\) solves the optimization problem: \(\mathbb {E}[y \mid \mathbf {x},\theta ] = \arg \min _{f(\mathbf {x},\theta )}\mathbb {E}[(y - f(\mathbf {x},\theta ))^{2}]\), i.e., it is the minimum mean squared error predictor of y given \(\mathbf {x}, \theta \).
For proof of Theorem 2, refer to [32].
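The argument behind Theorem 2 can be sketched in a few lines. What follows is the standard textbook derivation, reproduced as a reading aid rather than as the proof given in [32]. Write \(m = \mathbb {E}[y \mid \mathbf {x},\theta ]\) (a shorthand introduced only for this sketch), let \(\epsilon = y - m\) as in Theorem 1, and let \(f = f(\mathbf {x},\theta )\) be any candidate predictor. Then:$$\begin{aligned} \mathbb {E}[(y - f)^{2}]&= \mathbb {E}[(\epsilon + m - f)^{2}] \\&= \mathbb {E}[\epsilon ^{2}] + 2\,\mathbb {E}[\epsilon (m - f)] + \mathbb {E}[(m - f)^{2}] \\&= \mathbb {E}[\epsilon ^{2}] + \mathbb {E}[(m - f)^{2}]. \end{aligned}$$The cross term vanishes because \(m - f\) is a function of \(\mathbf {x}\) and \(\theta \), and by Theorem 1 \(\epsilon \) is uncorrelated with any such function. The first remaining term does not depend on f, and the second is nonnegative and equals zero when \(f = m\), so the conditional expectation attains the minimum.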
Remark 4
We rely on Theorems 1 and 2 to build our statistical learning methodology \(\mathcal {F}\) for estimating a query-PLR \(\hat{f}\) and then, based on Theorem 1, we will be estimating the data-PLR \(\hat{g}\) only through \(\hat{f}\) and the answer–query pairs \((\mathbf {q},y) = ([\mathbf {x},\theta ],y)\), without accessing the actual data pairs \((\mathbf {x},u)\).
The solution to (5) is \(f(\mathbf {x},\theta ) = \mathbb {E}[y \mid \mathbf {x},\theta ]\), i.e., the conditional expectation of answer y over \(\mathbb {D}(\mathbf {x},\theta )\). However, the number of data points \(n_{\theta }(\mathbf {x})\) in \(\mathbb {D}(\mathbf {x},\theta )\) is finite; thus, such conditional expectation is approximated by averaging all data outputs \(u_{i}\) conditioned on \(\mathbf {x}_{i} \in \mathbb {D}(\mathbf {x},\theta )\). Moreover, the answer y of a query \(\mathbf {q}\) refers to the best regression estimator over \(\mathbb {D}(\mathbf {x},\theta )\). Each query center \(\mathbf {x} \in \mathbb {D}(\mathbf {x},\theta )\) and corresponding answer y provides information to locally learn the dependency between output u and input \(\mathbf {x}\), i.e., the data function g. In this context, similar queries w.r.t. \(L_{2}\) distance provide insight for data function g over overlapped data subspaces.
The query-PLR estimate function \(\hat{f}(\mathbf {x},\theta )\) from challenge CH1’s outcome is used for estimating the multiple local line segments (i.e., local linear regression coefficients intercept and slope) of the data-PLR estimate function \(\hat{g}\). This is achieved by a novel statistical learning methodology \(\mathcal {F}\), which learns from a continuous query–answer stream \(\{(\mathbf {q}_{1},y_{1}), \ldots , (\mathbf {q}_{t},y_{t})\}\) through the interactions between the users and the system. We can then formulate our problems as follows:
Problem 1
Given a finite number of query–answer pairs, approximate the queryPLR function \(\hat{f}(\mathbf {x},\theta )\) and predict the aggregate answer \(\hat{y}\) of an unseen query \(\mathbf {q} = [\mathbf {x},\theta ]\).
Problem 2
Given only the queryPLR function \(\hat{f}(\mathbf {x},\theta )\) from Problem 1, approximate the dataPLR function \(\hat{g}(\mathbf {x})\) and predict the data output \(\hat{u}\) of an unseen data input \(\mathbf {x}\).
3.3 Preliminaries
3.3.1 Incremental learning and stochastic gradient descent
3.3.2 Adaptive vector quantization
4 Solution fundamentals
4.1 Methodology overview
We first proceed with a solution of Problem 1 to approximate the query function f through a queryPLR function \(\hat{f}\). Then, we use the approximate \(\hat{f}\) to address Problem 2 to approximate the data function g by a dataPLR function \(\hat{g}\).
In Problem 1, we approximate the query function \(f(\mathbf {x},\theta )\) with a set of queryspace LLMs (or queryLLMs), each of which is constrained to a local region of the query space \(\mathbb {Q}\), defined by similar queries w.r.t. \(L_{2}\) distance. Similar queries are those queries with similar centers \(\mathbf {x}\) and similar radii \(\theta \). Our general idea for those queryspace LLMs is the quantization of the query space \(\mathbb {Q}\) into a finite number of query subspaces \(\mathbb {Q}_{k}\) such that the query function f can be linearly approximated by a queryLLM \(f_{k}, k = 1, \ldots , K\), i.e., the kth PLR segment. Those query subspaces may be rather large in areas of the query vectorial space \(\mathbb {Q}\) where the query function f indeed behaves approximately linearly, and must be smaller where this is not the case. The total number K of such query subspaces depends on the desired approximation (goodness of fit) and the query–answer prediction accuracy, and may be limited by the number of available issued queries, since overfitting might occur.
Fundamentally, we incrementally quantize the query space \(\mathbb {Q}\) over a series of issued queries through quantization vectors, hereinafter referred to as query prototypes, in \(\mathbb {Q}\). Then, we associate each query subspace \(\mathbb {Q}_{k}\) with a queryLLM \(f_{k}\) in the query–answer space, where the query function f behaves approximately linearly.
In Problem 2, principally each query subspace \(\mathbb {Q}_{k}\) is associated with a data subspace \(\mathbb {D}_{k}\), i.e., for a query \(\mathbf {q} \in \mathbb {Q}_{k} \subset \mathbb {R}^{d+1}\), its corresponding query point \(\mathbf {x} \in \mathbb {D}_{k} \subset \mathbb {R}^{d}\). This implies that the input vector \(\mathbf {x}\) (of the query \(\mathbf {q}\)) is constrained to be drawn only from the kth data subspace \(\mathbb {D}_{k}\). Based on that association, we use the queryLLM \(f_{k}\) to estimate the dataLLM \(g_{k}\), i.e., estimate the local intercept and slope of the data function g over the kth data subspace \(\mathbb {D}_{k}\).
4.2 Query local linear mapping

The local intercept, with two components: the local expectation of answer y, i.e., \(\mathbb {E}[y] = f_{k}(\mathbb {E}[\mathbf {x}],\mathbb {E}[\theta ])\), notated by the scalar coefficient \(y_{k}\); and the local expectation query \(\mathbb {E}[\mathbf {q}] = [\mathbb {E}[\mathbf {x}],\mathbb {E}[\theta ]]\) notated by the vectorial coefficient \(\mathbf {w}_{k} = [\mathbf {x}_{k},\theta _{k}] \in \mathbb {Q}_{k}\), with \(\mathbf {x}_{k} = \mathbb {E}[\mathbf {x}]\) and \(\theta _{k} = \mathbb {E}[\theta ]\) such that \([\mathbf {x},\theta ] \in \mathbb {Q}_{k}\). Hereinafter, \(\mathbf {w}_{k}\) is referred to as the prototype of the query subspace \(\mathbb {Q}_{k}\).

The local slope \(\mathbf {b}_{k} = [\mathbf {b}_{X,k},b_{\varTheta ,k}]\) of \(f_{k}\) over \(\mathbb {Q}_{k}\), which denotes the gradient \(\nabla f_{k}(\mathbb {E}[\mathbf {x}],\mathbb {E}[\theta ])\) of \(f_{k}\) at the local expectation query \(\mathbf {w}_{k}\).
Remark 5
It is worth mentioning that the constraint \(y_{k} = f_{k}(\mathbf {x}_{k},\theta _{k}), \forall k \in [K]\) in the optimization problem (15) requires that, in each query subspace \(\mathbb {Q}_{k}\), the corresponding queryLLM \(f_{k}\) refers to a (hyper)plane that minimizes the EPE. Moreover, given a query \(\mathbf {q}\) with a query point \(\mathbf {x} = \mathbb {E}[\mathbf {x} \mid \mathbf {x} \in \mathbb {D}(\mathbf {x}_{k},\theta _{k})]\) being the centroid of the corresponding data subspace \(\mathbb {Q}_{k}\) and radius \(\theta = \mathbb {E}[\theta \mid \mathbf {x} \in \mathbb {D}(\mathbf {x}_{k},\theta _{k}), \mathbf {q} \in \mathbb {Q}_{k}]\) being the mean radius of all the queries from \(\mathbb {Q}_{k}\), it ensures that \(f_{k}\) supports OP2 and OP3.
However, we need to further optimize \(\alpha _{k}\) to satisfy also the optimization properties OP2 and OP3.
4.3 Our statistical learning methodology
Our statistical learning methodology \(\mathcal {F}\) departs from the optimization problem in (15) to additionally support the optimization properties OP2 and OP3. Our methodology is formally based on a joint optimization problem of optimal quantization and regression. This is achieved by incrementally identifying withinsubspaces linearities in the query space and then estimating therein the queryLLM coefficients such that we preserve the optimization properties OP1, OP2, and OP3.
4.3.1 Joint quantization–regression optimization for queryLLMs
Firstly, we should identify the subspaces \(\mathbb {Q}_{k}\), i.e., determine their prototypes \(\mathbf {w}_{k}\), their number K, and their coefficients \(y_{k}\) and \(\mathbf {b}_{k}\), in which the query function f can be well approximated by LLMs. We identify the prototypes \(\mathbf {w}_{k}\) (associated with \(\mathbb {Q}_{k}, k \in [K]\)) by incrementally partitioning the query space \(\mathbb {Q} = \cup _{k=1}^{K}\mathbb {Q}_{k}\). Before elaborating on our methodology, we provide an illustrative example on query space quantization.
Example 3
Figure 5(upper) shows 1,000 issued queries \(\mathbf {q}_{t}=[\mathbf {x}_{t},\theta _{t}]\) over the 2D input space \(\mathbf {x}=(x_{1},x_{2})\in [-1.5,1.5]^{2}\). Each query is represented by a disk with center \(\mathbf {x}_{t}\) and radius \(\theta _{t}\). Figure 5(lower) shows the five query prototypes \(\mathbf {w}_{k} = [\mathbf {x}_{k},\theta _{k}]\), \(k \in [5]\), projected onto the 2D input space. Note that the centers \(\mathbf {x}_{k}\) of the prototypes \(\mathbf {w}_{k}\) correspond to Voronoi sites under \(L_{2}\) on the data space.
The introduction of the query space quantization before predicting the query’s answer, i.e., regression of aggregate answer y on the query vector \(\mathbf {q}\), raises a natural fundamental question:
Answer: There is one response to that question: one can consider a VQ as part of the regression estimate function \(\hat{f}\). The overall goal is not purely regression, i.e., query–answer prediction using query function f, but also PLR fitting of the underlying data function g. The VQ yields several benefits, from constructing the query prototypes \(\{\mathbf {w}_{k} = [\mathbf {x}_{k},\theta _{k}]\}_{k=1}^{K}\) of the queryLLMs \(f_{k}\), that is, minimizing the EQE (OP2), to constructing the intercepts and slopes \(\{(y_{k},\mathbf {b}_{k})\}_{k=1}^{K}\), which are needed to minimize the EPE (OP1) and also to derive the dataLLMs \(g_{k}\) (OP3). Moreover, based on Theorem 7, the query prototypes \(\mathbf {w}_{k}\) converge to the optimal vector prototypes only when adopted by the VQ, specifically by an incrementally growing AVQ, as will be elaborated later. The inclusion of estimating the query prototypes \(\mathbf {w}_{k}\) provides a methodology not suggested by the regression/prediction goal alone, which nonetheless allows one to weight the prediction performance as being the more important criterion and which may eventually yield better regression algorithms. However, in this case our goal is to satisfy the optimization properties OP1, OP2, and OP3 simultaneously with one model; this can be viewed as finding an algorithm for jointly designing a VQ and a PLRbased predictor that yields performance close to that achievable by an optimal PLRbased predictor operating on the original answer–query pairs and input–output data pairs, as will be shown in our performance evaluation in Sect. 8.
The quantization of the query space \(\mathbb {Q}\) operates as a mechanism to project an unseen query \(\mathbf {q}\) to the closest query subspace \(\mathbb {Q}_{k}\) w.r.t. \(L_{2}\) distance from the prototype \(\mathbf {w}_{k}\), wherein we learn the dependency between the aggregate answer y with the query point \(\mathbf {x}\) and radius \(\theta \).
Example 4
Figure 6 depicts the association from the query space to the 3D data space. A query prototype \(\mathbf {w}_{j}\), a disk on the input space \((x_{1},x_{2})\), is now associated with the queryLLM \(f_{j}(\mathbf {x},\theta )\) and its corresponding regression plane \(u_{j} = f_{j}(\mathbf {x},\theta _{j})\) on the data space \((u, x_{1}, x_{2})\), which approximates the actual data function \(u = g(x_{1},x_{2}) = x_{1}(x_{2}+1)\). Note, in each local plane, we learn the local intercept \(y_{j}\) and slope \(\mathbf {b}_{j}\) where \(\mathbf {x}_{j}\) is the representative of the data subspace \(\mathbb {D}_{j}\) (see Theorems 7, 8 and 9).
4.3.2 DataLLM function derivation from queryLLM function
Concerning Problem 1, the prediction of the aggregate output \(\hat{y}\) of an unseen query \(\mathbf {q}\) is provided by neighboring queryLLM functions \(f_{k}\), as will be elaborated later. Concerning Problem 2, we derive the linear dataLLM function \(g_{k}\) (intercept and slope) between output u and input \(\mathbf {x}\) over the data subspace \(\mathbb {D}\) given the queryLLM function \(f_{k}\). Then, we approximate the PLR estimate of data function g by interpolating many dataLLMs.
Based on Theorems 1 and 2, we obtain that the data output \(u = g(\mathbf {x}) = \mathbb {E}[u \mid \mathbf {x}] + \epsilon \). In that context, we can approximate the data function \(g(\mathbf {x})\) over the data subspace \(\mathbb {D}_{k}\), i.e., the PLR segment \(g_{k}\), from the corresponding queryLLM function \(f_{k}\) conditioned on the mean radius \(\theta _{k}\).
Theorem 3
Proof
Example 5
Figure 7 visualizes the derivation of the dataLLMs from the queryLLMs. Specifically, Fig. 7 interprets the mapping methodology \(\mathcal {F}\) from the queryLLMs to the dataLLMs after obtaining the optimal values for the parameters that satisfy the optimization properties OP1, OP2, and OP3. We observe three regression planes in the query–answer space \((x,\theta ,y)\), which are approximated by the three queryLLMs \(f_{1},f_{2}\) and \(f_{3}\). This indicates the PLR approximation of the query function f. Now, focus on the regression plane \(f_{k}(x,\theta )\) along with the query prototype \(\mathbf {w}_{k} = [x_{k},\theta _{k}]\). The corresponding dataLLM function \(g_{k}(x)\) for those data inputs \(x \in \mathbb {D}(x_{k},\theta _{k})\) derives from the queryLLM \(f_{k}(x,\theta _{k})\) since, as proved in Theorem 7, the radius \(\theta _{k}\) is the expected radius of all the queries \(\mathbf {q}\) with \(\mathbf {w}_{k} = \mathbb {E}[\mathbf {q} \mid v(\mathbf {q}) = k]\), i.e., \(\theta _{k} = \mathbb {E}[\theta \mid v(\mathbf {q}) = k]\). The dataLLM is represented by the linear regression approximation \(g_{k}(x)\) lying on the regression plane defined by the queryLLM \(f_{k}\). We obtain the PLR data approximation of g over the entire data input space x by following the dataLLMs \(g_{k}\) over the planes defined by the queryLLMs \(f_{k}\), as illustrated in the inner plot in Fig. 7. As we move from one queryLLM \(f_{k-1}\) to the next one \(f_{k}\), we derive the corresponding dataLLMs \(g_{k-1}\) and \(g_{k}\) by setting \(\theta = \theta _{k-1}\) and \(\theta =\theta _{k}\) in the queryLLM definitions (linear models), where \(\theta _{k-1} = \mathbb {E}[\theta \mid v(\mathbf {q}) = k-1]\) and \(\theta _{k} = \mathbb {E}[\theta \mid v(\mathbf {q}) = k]\), respectively. Hence, based on this trajectory we derive the PLR estimate data function \(\hat{g}\) of the underlying data function g.
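The step from a queryLLM to its dataLLM in this example can be written out explicitly. The following is a sketch that assumes the linear form \(f_{k}(\mathbf {x},\theta ) = y_{k} + \mathbf {b}_{X,k}(\mathbf {x}-\mathbf {x}_{k})^{\top } + b_{\varTheta ,k}(\theta - \theta _{k})\), which is consistent with the intercept, slope and prototype notation used in this paper. Fixing the radius at the prototype radius, \(\theta = \theta _{k}\), removes the radius term:

\[
g_{k}(\mathbf {x}) = f_{k}(\mathbf {x},\theta _{k}) = y_{k} + \mathbf {b}_{X,k}(\mathbf {x}-\mathbf {x}_{k})^{\top } = \left( y_{k} - \mathbf {b}_{X,k}\mathbf {x}_{k}^{\top }\right) + \mathbf {b}_{X,k}\mathbf {x}^{\top }.
\]

Under this assumption, the dataLLM over \(\mathbb {D}(\mathbf {x}_{k},\theta _{k})\) has intercept \(y_{k} - \mathbf {b}_{X,k}\mathbf {x}_{k}^{\top }\) and slope \(\mathbf {b}_{X,k}\), and it passes through the point \((\mathbf {x}_{k}, y_{k})\).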
Remark 6
It is worth noting that the approximation of the data function g based on Theorem 3 is obtained solely from the knowledge extracted from answer–query pairs, without accessing the data points.
5 Querydriven statistical learning methodology
In this section we propose our querydriven statistical learning algorithm for our methodology through which all the queryLLM parameters \(\alpha _{k}\) minimize both (21) and (22). Then, we provide the PLR approximation error bound of the PLR estimate functions \(f_{k}\) of query function f and the impact of our VQ algorithm in this error.
Let us focus on the EQE \(\mathcal {J}\) in (21) and liaise with Example 3 (Fig. 5). We seek the best possible approximation of a random query \(\mathbf {q}\) out of the set \(\{\mathbf {w}_{k}\}_{k=1}^{K}\) of finite K query prototypes. We consider the closest neighbor projection of query \(\mathbf {q}\) to a query prototype \(\mathbf {w}_{j}\), which represents the jth query subspace \(\mathbb {Q}_{j} \subset \{ \mathbf {q} \in \mathbb {Q}: \Vert \mathbf {q} - \mathbf {w}_{j}\Vert _{2} = \min _{k} \Vert \mathbf {q} - \mathbf {w}_{k}\Vert _{2} \}\). We incrementally minimize the objective function \(\mathcal {J}\) in the presence of a random query \(\mathbf {q}\) and update the winning prototype \(\mathbf {w}_{j}\) accordingly. However, the number of query subspaces and, thus, of query prototypes \(K>0\) is completely unknown and not necessarily constant. The key problem is to decide on an appropriate K value. A variety of AVQ methods exist in the literature; however, they are not suitable for incremental implementation, because K must be supplied in advance.
We propose a conditionally growing AVQ algorithm under \(L_{2}\) distance in which the prototypes are sequentially updated with the incoming queries and their number grows adaptively, i.e., the number K increases if a criterion holds true. Given that K is not available a priori, our VQ minimizes the objective \(\mathcal {J}\) with respect to a threshold value \(\rho \). This threshold determines the current number of prototypes K. Initially, the query space has a unique (random) prototype, i.e., \(K=1\). Upon the presence of a query \(\mathbf {q}\), our algorithm first finds the winning query prototype \(\mathbf {w}_{j}\) and then updates the prototype \(\mathbf {w}_{j}\) only if the condition \(\Vert \mathbf {q} - \mathbf {w}_{j}\Vert _{2} \le \rho \) holds true. Otherwise, the query \(\mathbf {q}\) is considered as a new prototype, thus increasing the value of K by one. Through this conditional quantization, our VQ algorithm lets the random queries selfdetermine the resolution of quantization. Evidently, a high \(\rho \) value results in a coarse query space quantization (i.e., a low resolution partition), while low \(\rho \) values yield a finegrained quantization of the query space. The parameter \(\rho \) is associated with the stability–plasticity dilemma, a.k.a. vigilance in Adaptive Resonance Theory [30]. In our case, the vigilance \(\rho \) represents a threshold of similarity between queries and prototypes, thus guiding our VQ algorithm in determining whether a new query prototype should be formed.
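The conditional growth rule described above can be sketched in a few lines. This is a minimal illustration and not the paper's Algorithm 1: the class name `GrowingAVQ`, the constant learning rate `eta` (the paper refers to a hyperbolic schedule), and seeding the first prototype with the first observed query rather than a random vector are all assumptions made for brevity.

```python
import numpy as np

class GrowingAVQ:
    """Sketch of a vigilance-controlled, incrementally growing vector quantizer."""

    def __init__(self, rho, eta=0.1):
        self.rho = rho        # vigilance threshold on the L2 distance
        self.eta = eta        # assumed constant learning rate
        self.prototypes = []  # each entry is w_k = [x_k, theta_k]

    def winner(self, q):
        """Return the index of the closest prototype and its L2 distance to q."""
        dists = [np.linalg.norm(q - w) for w in self.prototypes]
        j = int(np.argmin(dists))
        return j, dists[j]

    def observe(self, q):
        """Process one query vector; return the index of its prototype."""
        q = np.asarray(q, dtype=float)
        if not self.prototypes:
            # Assumption: seed the first prototype with the first query.
            self.prototypes.append(q.copy())
            return 0
        j, dist = self.winner(q)
        if dist <= self.rho:
            # Within vigilance: move the winning prototype toward the query.
            self.prototypes[j] += self.eta * (q - self.prototypes[j])
            return j
        # Outside vigilance: the query becomes a new prototype (K grows by one).
        self.prototypes.append(q.copy())
        return len(self.prototypes) - 1

avq = GrowingAVQ(rho=0.5)
for q in ([0.0, 0.0, 0.1], [0.1, 0.0, 0.1], [2.0, 2.0, 0.3]):
    avq.observe(q)
```

In this toy run the second query falls within the vigilance of the first prototype and only nudges it, whereas the third query is far from it and spawns a second prototype. A larger `rho` would merge all three queries into one prototype, which illustrates the coarse versus finegrained trade-off.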
Remark 7
Let us now focus on the EPE \(\mathcal {H}\) in (22) and liaise with Examples 3 and 4 (Figs. 5 and 6). The objective function \(\mathcal {H}\) is conditioned on the winning queryprototype index \(j = \arg \underset{k}{\min } \Vert \mathbf {q} - \mathbf {w}_{k}\Vert _{2}\), i.e., it is guided by the VQ \(v(\mathbf {q}) = j\). Our target is to incrementally learn the queryLLM coefficients, offset \(y_{j}\) and slope \(\mathbf {b}_{j}\), of the LLM function \(f_{j}\), which are associated with the winning query prototype \(\mathbf {w}_{j} \in \mathbb {Q}_{j}\) for a random query \(\mathbf {q}\).
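As a rough illustration of what incrementally learning the offset and slope of a winning queryLLM can look like, the sketch below applies a plain stochastic gradient step on the squared prediction error of the linear model \(f_{j}(\mathbf {q}) = y_{j} + \mathbf {b}_{j}(\mathbf {q}-\mathbf {w}_{j})^{\top }\). It is a generic SGD step under stated assumptions and is not claimed to be the update rule of Theorem 4: the prototype \(\mathbf {w}_{j}\) is held fixed, the learning rate `eta` is constant, and the function name `sgd_step` and the synthetic noise-free target are hypothetical.

```python
import numpy as np

def sgd_step(y_j, b_j, w_j, q, y, eta):
    """One SGD step on 0.5 * (y - f_j(q))**2 with f_j(q) = y_j + b_j . (q - w_j)."""
    dq = q - w_j
    err = y - (y_j + b_j @ dq)   # prediction error of the winning queryLLM
    y_j = y_j + eta * err        # gradient step for the offset
    b_j = b_j + eta * err * dq   # gradient step for the slope
    return y_j, b_j

# Synthetic check: answers generated by a known local plane around w_j.
rng = np.random.default_rng(0)
w_j = np.array([0.5, 0.2])                 # prototype [x_j, theta_j], held fixed
true_y, true_b = 2.0, np.array([3.0, -1.0])
y_j, b_j = 0.0, np.zeros(2)
for _ in range(2000):
    q = w_j + rng.uniform(-0.5, 0.5, size=2)
    y = true_y + true_b @ (q - w_j)        # noise-free answer (assumption)
    y_j, b_j = sgd_step(y_j, b_j, w_j, q, y, eta=0.5)
```

On this noise-free toy stream the learned offset and slope approach the generating plane. With noisy answers, a decaying learning schedule would typically be needed for the coefficients to settle.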
Theorem 4
Given a pair of query–answer \((\mathbf {q},y)\) and its winning query prototype \(\mathbf {w}_{j}\), the optimization parameter \(\alpha \) converges to the optimal parameter \(\alpha ^{*}\), if it is updated as:
Proof
Through the incremental training of the parameters set \(\alpha = \{(y_{k},\mathbf {b}_{k}, \mathbf {w}_{k})\}_{k=1}^{K}\), each queryLLM function \(f_{k}\) has estimated its parameters. The PLR approximation error bound for the LLM function \(f_{k}\) around the query prototype \(\mathbf {w}_{k}\) depends on the dimension d and curvature (second derivative) of the function \(f_{k}\) in the query subspace \(\mathbb {Q}_{k}\) as provided in Theorem 5. The approximation depends on the resolution of quantization K. Notably, the more prototypes K, the better the approximation of the query function f is achieved by queryLLMs, as proved in Theorem 6.
Theorem 5
Proof
Theorem 6
For a random query \(\mathbf {q}\), the expected approximation error given K queryLLM functions \(f_{k}\), \(k \in [K]\) is bounded by \(\sum _{k \in [K]}\mathcal {C}_{k}O(\frac{d}{K})\), where \(\mathcal {C}_{k}\) is defined in Theorem 5.
Proof
6 Data and query functions approximation and prediction
In this section we propose an algorithm that uses the queryLLM functions to approximate the PLR data function g over a data subspace given the corresponding dataLLM functions and an algorithm to predict the aggregate answer y of an unseen query based on the queryLLM functions.
Example 6
Figure 6 shows the average value and regression query prediction: An unseen query \(\mathbf {q} = [\mathbf {x},\theta ]\) is projected onto input space \(\mathbf {x}=(x_{1},x_{2})\) to derive the neighborhood set of prototypes \(\mathcal {W}(\mathbf {q}) = \{\mathbf {w}_{i},\mathbf {w}_{k},\mathbf {w}_{l}\}\). Then, we access the queryLLM functions \(f_{i}, f_{k}, f_{l}\) to predict the aggregate output \(\hat{y}\) for query Q1 (see Algorithm 2) and retrieve the data regression planes coefficients \(\mathcal {S}\) of the dataLLM functions \(g_{i},g_{k},g_{l}\) from queryLLM functions \(f_{i}, f_{k}, f_{l}\), respectively, for query Q2 (see Algorithm 3).
6.1 Query Q1: meanvalue aggregate prediction
6.2 Query Q2: PLRbased data function approximation

(Case 1) either partially overlap with several identified convex data subspaces \(\mathbb {D}_{k}\) (corresponding to query subspaces \(\mathbb {Q}_{k}\)), or

(Case 2) be contained or contain a data subspace \(\mathbb {D}_{k}\), or

(Case 3) be outside of any data subspace \(\mathbb {D}_{k}\).
For Case 3, the PLR approximation of the data function \(g(\mathbf {x})\) derives by extrapolating the linearity trend of \(u = g(\mathbf {x}) = f_{j}(\mathbf {x},\theta _{j}): j = \arg \min _{k} \Vert \mathbf {q} - \mathbf {w}_{k}\Vert _{2}\) over the data subspace, with u intercept \(y_{j} - \mathbf {b}_{X,j}\mathbf {x}_{j}^{\top }\) and u slope \(\mathbf {b}_{X,j}\).
The PLR approximation of the data function is shown in Algorithm 3, which returns the set of the dataLLM functions \(\mathcal {S}\) defined over the data subspace \(\mathbb {D}(\mathbf {x},\theta )\) for a given unseen query \(\mathbf {q} = [\mathbf {x},\theta ]\). Note that, depending on the query radius \(\theta \) and the overlapping neighborhood set \(\mathcal {W}(\mathbf {q})\), we obtain \(1 \le |\mathcal {S}| \le K\), where \(|\mathcal {S}|\) is the cardinality of the set \(\mathcal {S}\).
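The Case 3 extrapolation can be illustrated with a short sketch that converts a queryLLM into the intercept and slope of its dataLLM and then predicts the data output from the closest prototype. The function names and the toy coefficients are hypothetical, and the sketch covers only the nearest-prototype case; it does not reproduce Algorithm 3 or the construction of the overlapping set \(\mathcal {W}(\mathbf {q})\).

```python
import numpy as np

def data_llm_from_query_llm(y_k, b_k, w_k):
    """Return (intercept, slope) of g_k(x) = f_k(x, theta_k).

    w_k = [x_k, theta_k] and b_k = [b_X, b_Theta]; the last entries are the
    radius components and drop out once theta is fixed at theta_k.
    """
    w_k = np.asarray(w_k, dtype=float)
    b_k = np.asarray(b_k, dtype=float)
    x_k, b_x = w_k[:-1], b_k[:-1]
    intercept = float(y_k - b_x @ x_k)
    return intercept, b_x.copy()

def extrapolate_output(q, prototypes, offsets, slopes):
    """Predict u at the center of q using the dataLLM of the closest prototype."""
    q = np.asarray(q, dtype=float)
    W = np.asarray(prototypes, dtype=float)
    j = int(np.argmin(np.linalg.norm(W - q, axis=1)))  # winning prototype
    intercept, slope = data_llm_from_query_llm(offsets[j], slopes[j], W[j])
    return j, float(intercept + slope @ q[:-1])

# Two prototypes over a 1D input space (illustrative values only).
prototypes = [[0.0, 0.2], [1.0, 0.2]]
offsets = [1.0, 3.0]
slopes = [[2.0, 0.5], [4.0, 0.1]]
j, u_hat = extrapolate_output([1.5, 0.2], prototypes, offsets, slopes)
```

Here the query center lies beyond the second prototype, so the second local line is extended to that center.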
Remark 8
Figure 8(upper) shows how the data function \(u = g(x)\) is accurately approximated by \(K=6\) dataLLMs (green interpolating local lines) compared with the global linear regression function (REG in red) over the data subspace \(\mathbb {D}(0.5,0.5)\). We also illustrate the K linear models derived by the actual PLR data approximation algorithm [44], i.e., the best possible PLR data approximation should we have access to that data subspace, which corresponds to OPT\(_{K}\) in (3). Unlike our model, PLR needs access to the data and is thus very expensive; specifically, it involves a forward/backward iterative approach to produce the multiple linear models [44]. Our model, instead, incrementally derives the dataLLMs based on the optimization problems in (21) and (22). Note that the derived dataLLMs are highly accurate.
7 Convergence analysis and complexity
7.1 Global convergence analysis
In this section we show that our stochastic joint optimization algorithm is asymptotically stable. Concerning the objective function \(\mathcal {J}\) in (21), the query prototypes \(\mathbf {w}_{k} = [\mathbf {x}_{k},\theta _{k}]\) converge to the centroids (mean vectors) of the query subspaces \(\mathbb {Q}_{k}\). This convergence reflects the capability of our proposed AVQ algorithm to partition the query space into the prototypes of the query subspaces. The query subspaces naturally represent the (hyper)spheres of the data subspaces that the analysts are interested in, as accessed through their query centers \(\mathbf {x}_{k}\) and radii \(\theta _{k}\), \(\forall k\).
Concerning the objective function \(\mathcal {H}\) in (22), the approximation coefficients, slope and intercept, in Theorem 3 converge, too. This convergence refers to the linear regression coefficients that would have been derived had we been able to fit a linear regression function over each data subspace \(\mathbb {D}_{k}\), given that we had access to the data.
Theorem 7 refers to the convergence of a query prototype \(\mathbf {w}_{k}\) to the local expectation query \(\mathbb {E}[\mathbf {q} \mid \mathbb {Q}_{k}] = \mathbb {E}[\mathbf {q} \mid v(\mathbf {q}) = k]\) given our AVQ algorithm.
Theorem 7
If \(\mathbb {E}[\mathbf {q} \mid \mathbb {Q}_{k}] = \mathbb {E}[\mathbf {q} \mid v(\mathbf {q}) = k]\) is the local expectation query of the query subspace \(\mathbb {Q}_{k}\) and the query prototype \(\mathbf {w}_{k}\) is the subspace representative from our AVQ algorithm, then \(P(\mathbf {w}_{k} =\mathbb {E}[\mathbf {q} \mid \mathbb {Q}_{k}]) = 1\) at equilibrium.
Proof
We provide two convergence theorems for the coefficients \(y_{k}\) and \(\mathbf {b}_{k}\) of the queryLLM \(f_{k}\). Firstly, we focus on the aggregate answer prediction \(y = y_{k} + \mathbf {b}_{k}(\mathbf {q} - \mathbf {w}_{k})^{\top }\). Given that the query prototype \(\mathbf {w}_{k}\) has converged, i.e., \(\mathbf {w}_{k} = \mathbb {E}[\mathbf {q} \mid \mathbb {Q}_{k}]\) from Theorem 7, the \(y_{k}\) coefficient of the queryLLM \(f_{k}\) converges to the expected aggregate value \(\mathbb {E}[y \mid \mathbb {Q}_{k}]\). This also reflects our assignment, through the statistical mapping \(\mathcal {F}\), of the local expectation query \(\mathbf {w}_{k}\) to the mean of the queryLLM \(f_{k}\), i.e., \(f_{k}(\mathbb {E}[\mathbf {x}_{k} \mid \mathbb {Q}_{k}],\mathbb {E}[\theta _{k} \mid \mathbb {Q}_{k}]) = \mathbb {E}[y \mid \mathbb {Q}_{k}]\). This refers to the local associative convergence of coefficient \(y_{k}\) given a query \(\mathbf {q} \in \mathbb {Q}_{k}\). In other words, the convergence of the query subspace also enforces convergence in the output domain.
Theorem 8
(Associative Convergence) If the query prototype \(\mathbf {w}_{k}\) has converged, i.e., \(\mathbf {w}_{k} = \mathbb {E}[\mathbf {q} \mid \mathbb {Q}_{k}]\), then the coefficient \(y_{k}\) of the queryLLM \(f_{k}\) converges to the expectation \(\mathbb {E}[y \mid \mathbb {Q}_{k}]\).
Proof
Finally, we provide a convergence theorem for \(\mathbf {b}_{k}\) as the slope of the linear regression of \(\mathbf {q} - \mathbf {w}_{k}\) onto \(y - y_{k}\).
Theorem 9
Proof
7.2 Partial convergence analysis
The entire statistical learning model runs in two phases: the training phase and the prediction phase. In the training phase, the queryLLM prototypes \((\mathbf {w}_{k},\mathbf {b}_{k}, y_{k})\), \(k \in [K]\), are updated upon the observation of a query–answer pair \((\mathbf {q},y)\) until their convergence w.r.t. the global stopping criterion in (24). In the prediction phase, the model proceeds with the meanvalue prediction of the aggregate answer \(\hat{y}\), the PLR data approximation of the data function g and the output data value prediction \(\hat{u}\), without execution of any incoming query after convergence at \(t^{*}\). The major requirement for the model to transit from the training to the prediction phase is the triggering of the global stopping criterion at \(t^{*}\) w.r.t. a fixed \(\gamma >0\) convergence threshold.
Let us now provide some insight into this global criterion. Model convergence means that, on average over all the trained queryLLM prototypes, their improvement w.r.t. a new incoming query–answer pair is not as significant as it was at the early stage of the training phase. The rate of updating such prototypes, which is reflected by the norms of the difference vectors between \((\mathbf {w}_{k,t},\mathbf {b}_{k,t}, y_{k,t})\) and \((\mathbf {w}_{k,t-1},\mathbf {b}_{k,t-1}, y_{k,t-1})\) at observations t and \(t-1\), respectively, decreases as the number of query–answer pairs increases, i.e., \(t \rightarrow \infty \); refer also to the convergence analysis in Sect. 7.
In a real world setting, however, we cannot obtain an infinite number of training pairs to ensure convergence. Instead, we are sequentially provided a finite number of training pairs \((\mathbf {q}_{t},y_{t})\) from a finite training set \(\mathcal {T}\). We obtain model convergence given that there are enough pairs in the set \(\mathcal {T}\) such that the criterion in (24) is satisfied. More interestingly, we have observed that some of the queryLLM prototypes, say \(L < K\), converge with fewer training query–answer pairs than all the provided pairs in \(\mathcal {T}\). Specifically, for those L prototypes, which represent certain data subspaces \(\mathbb {D}_{\ell }\) and query subspaces \(\mathbb {Q}_{\ell }\), \(\ell =1, \ldots , L\), it holds true that the convergence criterion \(\max \{\varGamma _{\ell }^{\mathcal {J}},\varGamma _{\ell }^{\mathcal {H}}\}_{t} \le \gamma \) for \(t < t^{*}\), where \(t^{*}\) corresponds to the last observed training pair, at which the entire model has globally converged, given a fixed convergence threshold \(\gamma \). In this case, we introduce the concept of partial convergence: there is at least a subset of queryLLM prototypes that have already converged w.r.t. \(\gamma \) at an earlier stage than the entire model (the entire set of parameters). Interestingly, those \(\ell \) queryLLM prototypes transit from their training phase to the prediction phase. The partial convergence on those data subspaces is due to the fact that relatively more queries were issued to those data subspaces than to other data subspaces up to the tth observation with \(t < t^{*}\). Moreover, by construction of our model, only a relatively small subset of queryLLM prototypes is required for meanvalue prediction and PLR data approximation (refer to the overlapping set \(\mathcal {W}\) in Sect. 6).
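The idea of monitoring the update magnitude of a prototype can be sketched as follows. Since the exact form of the stopping criterion in (24) is not reproduced in this section, the sketch assumes that the two monitored quantities are the \(L_{2}\) norm of the change in \(\mathbf {w}_{k}\) and the \(L_{2}\) norm of the change in \((\mathbf {b}_{k}, y_{k})\) between consecutive observations, and that a prototype counts as converged when the larger of the two is at most \(\gamma \). Both function names are hypothetical.

```python
import numpy as np

def update_magnitudes(prev, curr):
    """Norms of the parameter changes between observations t-1 and t.

    prev and curr are tuples (w_k, b_k, y_k). Assumed definitions: the first
    value tracks the prototype w_k, the second tracks the coefficients (b_k, y_k).
    """
    w0, b0, y0 = prev
    w1, b1, y1 = curr
    delta_w = float(np.linalg.norm(np.asarray(w1, float) - np.asarray(w0, float)))
    delta_coef = float(np.linalg.norm(
        np.append(np.asarray(b1, float) - np.asarray(b0, float), y1 - y0)))
    return delta_w, delta_coef

def locally_converged(prev, curr, gamma):
    """Assumed per-prototype test: the largest update magnitude is at most gamma."""
    return max(update_magnitudes(prev, curr)) <= gamma

prev = ([0.0, 0.0], [1.0, 1.0], 2.0)
curr = ([0.003, 0.004], [1.0, 1.0], 2.0)
delta_w, delta_coef = update_magnitudes(prev, curr)
```

In this toy case only the prototype moved slightly, so the test passes for a loose threshold and fails for a tight one.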
Hence, based on the flexibility of the partial convergence, we can proceed with prediction and data approximation to certain incoming queries issued onto those data subspaces, where their corresponding queryLLM prototypes have partially converged, while the entire model is still on a training phase, i.e., it has not yet globally converged.
The advantage of this methodology is that we deliver predicted answers to the analysts’ queries without imposing the execution delay of those queries. Evidently, we retain the flexibility either to proceed with the query execution after the prediction, in order to further refine the converged data subspace, or not. In both options, the analysts do not need to wait for the system to first execute the query and then deliver the answers. This motivated us to introduce a progressive predictive analytics or intermediate phase, where some parts of the model can, after their local convergence, provide predicted answers to the analysts without waiting for the entire model to converge.
Case A If the consensual ratio \(\frac{\ell }{\ell + \kappa } \ge r\), i.e., at least a fraction r of the query prototypes in \(\mathcal {W}(\mathbf {q}_{t})\) have locally converged, with \(r \in (0.5, 1)\), then two options are available:

Case A.I The model predicts the answer based only on those \(\ell \) query prototypes that have converged, delivers it to the analysts and, then, executes the query for updating the \(\kappa \) not yet converged query prototypes to align with the model convergence mode. In this case, the analysts are delivered a predicted answer whose degree of confidence is regulated through the consensual ratio r. The meanvalue prediction and PLR data approximation are achieved as described in Algorithms 2 and 3 by replacing \(\mathcal {W}(\mathbf {q})\) with the locally converged query prototypes \(\mathcal {C}(\mathbf {q})\) in (32). After the query execution, the queryLLM prototypes from the unconverged set \(\mathcal {U}(\mathbf {q})\) in (32) are updated as described in Algorithm 1. Obviously, if the consensual ratio \(\frac{\ell }{\ell + \kappa } = 1\), then there is no such intermediate phase.

Case A.II The model predicts the answer based only on those \(\ell \) query prototypes that have converged, delivers it to the analysts, and does not execute the query; thus, no update is performed for those \(\kappa \) query prototypes. The meanvalue prediction and PLR data approximation are achieved as described in Algorithms 2 and 3 by replacing \(\mathcal {W}(\mathbf {q})\) with the locally converged query prototypes \(\mathcal {C}(\mathbf {q})\) in (32). This obviously delays the global convergence and reduces the number of queries executed for convergence. This option is only preferable when most of the incoming queries focus on specific data subspaces and not on the entire data space. In other words, there is no point in requiring the entire model to globally converge before transiting from the training phase to the prediction phase if most of the queries are issued on very specific data subspaces. In the extreme case, the model could considerably delay its convergence if more than 50% of the query prototypes are involved in the overlapping sets for all the incoming queries. To alleviate this case, our model creates new prototypes (incrementally) only when there is at least some interest in a specific data subspace, as discussed in Sect. 5, adopting the principles of adaptive resonance theory [30].


Case B Otherwise, i.e., the consensual ratio \(\frac{\ell }{\ell + \kappa } < r\), the model acts as usual in the training phase, i.e., it first executes the query and delivers the actual answer to the analyst, and then based on this actual answer it updates the prototypes as discussed in Sect. 5.
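The case analysis above amounts to a small decision rule on the consensual ratio. The sketch below encodes it under the assumption that the convergence status of each prototype in \(\mathcal {W}(\mathbf {q})\) is already known; the function name, the string labels, the flag `execute_after_prediction`, and the treatment of an empty neighborhood as Case B are illustrative choices rather than part of the original description.

```python
def intermediate_phase_action(converged_flags, r, execute_after_prediction=True):
    """Choose the handling of a query in the intermediate phase.

    converged_flags: one boolean per prototype in W(q), True if locally converged.
    r: consensual threshold, expected in (0.5, 1).
    Returns "A.I", "A.II" or "B".
    """
    ell = sum(bool(flag) for flag in converged_flags)  # converged prototypes
    kappa = len(converged_flags) - ell                 # not yet converged
    if ell + kappa == 0:
        return "B"  # assumption: with no neighbors, fall back to execution
    ratio = ell / (ell + kappa)
    if ratio >= r:
        # Predict from the converged prototypes; optionally execute afterwards
        # so that the unconverged prototypes can still be updated.
        return "A.I" if execute_after_prediction else "A.II"
    # Too few converged prototypes: execute the query and keep training.
    return "B"

action_mostly_converged = intermediate_phase_action([True, True, False], r=0.6)
action_mostly_training = intermediate_phase_action([True, False, False], r=0.6)
```

With two of three neighboring prototypes converged and a threshold of 0.6, the query is answered by prediction; with one of three, it is executed as in the training phase.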
Remark 9
The prediction performance of the model in the intermediate phase is at most that of the model in the prediction phase. This is attributed to the fact that the predicted answers are based on the partial convergence w.r.t. the consensual threshold r, where only a fraction r of the queryLLM prototypes from the overlapping set \(\mathcal {W}(\mathbf {q})\) is used for prediction given an unseen query \(\mathbf {q}\). The prediction performance is a nondecreasing function of the number of observations t with \(t_{\top } \le t \le t^{*}\), as will be shown in our performance evaluation in Sect. 8.
7.3 Computational complexity
In this section we report on the computational complexity of our model during the training and prediction phases. In the global convergence mode, the model ‘waits’ for the criterion in (24) to be triggered before moving from the training to the prediction phase. Under SGD over the objective minimization functions \(\mathcal {J}\) and \(\mathcal {H}\), with the hyperbolic learning schedule in (7), our model requires \(O(1/\gamma )\) training pairs [15] to reach the convergence threshold \(\gamma \). This means that the residual difference between the objective function value \(\mathcal {J}^{t^{*}}\) after \(t^{*}\) pairs and the optimal value \(\mathcal {J}^{*}\), i.e., with the optimal queryLLM parameters, asymptotically decreases exponentially, also known as linear convergence [25]. In this mode, there is a clear separation between the training and prediction phases, while the upper bound of the expected excess difference \(\mathbb {E}[\mathcal {J}^{t} - \mathcal {J}^{*}]\) after t training pairs is \(O\left( \sqrt{\frac{\log t}{t}}\right) \) [52], given the hyperbolic learning schedule in (7).
In the prediction phase, which is the operational mode of our model, given a meanvalue query Q1 and a linear regression query Q2, we require O(dK) time to calculate the neighborhood set \(\mathcal {W}\) and to deliver the queryLLM functions, respectively. This cost is independent of the data size, which is what makes the model scalable. We also require O(dK) space to store the query prototypes and the queryLLM coefficients. The derivation of the dataLLMs is then O(1), given that we have identified the queryLLMs for a given linear regression query.
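To make the O(dK) cost concrete, the sketch below scans K prototypes once and spends O(d) per prototype on a distance computation. It assumes that a prototype belongs to the neighborhood \(\mathcal {W}(\mathbf {q})\) when its ball overlaps the query ball, i.e., when the distance between centers is at most the sum of the two radii. That overlap test and the function name are assumptions made for illustration; the precise definition is given earlier in the paper.

```python
import math


def overlapping_set(query_center, query_radius, prototypes):
    """Return the indices of prototypes whose ball overlaps the query ball.

    prototypes is a list of (center, radius) tuples. The loop touches each of
    the K prototypes once and each distance costs O(d), so the scan is O(dK)
    and never reads the underlying dataset.
    """
    hits = []
    for k, (center, radius) in enumerate(prototypes):
        distance = math.dist(query_center, center)  # Euclidean distance, O(d)
        if distance <= query_radius + radius:       # assumed overlap test
            hits.append(k)
    return hits
```

The running time depends only on d and K, which is consistent with the flat execution-time curves reported against dataset size.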
8 Performance evaluation
8.1 Performance metrics
The proposed methodology deals with two major statistical learning components: prediction of the aggregate answer and data output, and data function approximation over data subspaces. For evaluating the performance of our model in light of these components, we should assess the model predictability and goodness of fit, respectively.
Predictability refers to the capability of a model to predict an output given an unseen input, i.e., an input–output pair that was not provided during the model’s training phase. Measures of prediction focus on the differences between values predicted and values actually observed. Goodness of fit describes how well a model fits a set of observations, which were provided in the model’s training phase. It provides an understanding of how well the selected independent variables (input) explain the variability in the dependent (output) variable. Measures of goodness of fit summarize the discrepancy between actual/observed values during training and the values approximated under the model in question.
We compare our statistical methodology against its ground truth counterparts: the multivariate linear regression model over data subspaces, hereafter referred to as REG, and the piecewise linear model (PLR) over data subspaces, both of which have full access to the data. Note that the PLR data approximation is the optimal multiple linear modeling over data subspaces we can obtain because it is constructed by accessing the data. Hence, we demonstrate how effectively our dataLLMs approximate the ground truth data function g and the optimal PLR data approximation. Specifically, we compare against the REG model using the PostgreSQL DMS and MATLAB, and against the PLR model using the ARESLab (MATLAB) toolbox^{7} for building PLR models based on the multivariate adaptive regression splines method in [44]. We show that our model is scalable and efficient, that it is as accurate as (or even more accurate than) the REG model w.r.t. predictability and goodness of fit, and that it is close to the accuracy obtained by the optimal PLR model. Our model is dramatically more scalable and efficient because, unlike the REG and PLR models, it does not need access to data, yielding up to six orders of magnitude faster query execution.
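The metrics used in the remainder of this section (RMSE, FVU and CoD) can be computed as in the sketch below. It uses the standard textbook definitions, which are assumed here to agree with the paper's own formulas: FVU is the sum of squared residuals divided by the total sum of squares, and CoD is \(R^{2} = 1 - \text {FVU}\). The function names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def fvu(actual, predicted):
    """Fraction of variance unexplained: SSR / TSS."""
    mean = sum(actual) / len(actual)
    ssr = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residuals
    tss = sum((a - mean) ** 2 for a in actual)                  # total variation
    return ssr / tss


def cod(actual, predicted):
    """Coefficient of determination R^2 = 1 - FVU."""
    return 1.0 - fvu(actual, predicted)
```

Under these definitions a model that always predicts the mean has FVU = 1 and \(R^{2} = 0\), and a model that does worse than the mean has FVU > 1 and a negative \(R^{2}\).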
8.1.1 Predictability
8.1.2 Goodness of fit
8.2 Experimental setup
8.2.1 Real and synthetic datasets
Real datasets The real dataset R1 consists of 6-dim. feature vectors corresponding to the concentration levels of 6 gases, namely, Ethanol (E1), Ethylene (E2), Ammonia (A1), Acetaldehyde (A2), Acetone (A3), and Toluene (T), derived from chemical sensors. The sensor measurements of the dataset R1 were gathered over 36 months in a gas delivery platform facility situated at the ChemoSignals Laboratory in the BioCircuits Institute (BCI^{8}), University of California. We expand the R1 size by adding extra 6-dim. vectors with Gaussian noise; in total, the R1 dataset thus contains \(15 \cdot 10^{6}\) multidimensional data vectors of gas concentration levels. We chose the R1 dataset to study accuracy issues because its data exhibits nonlinear relationships among features. All d-dim. real-valued vectors are scaled and normalized in [0,1] (\(d \in \{1, \ldots , 6\}\)), with significant nonlinear dependencies among the features, evidenced by a high FVU = 4.68. This indicates that a single linear approximation of the entire data space fits poorly, which makes R1 a challenging dataset for our approach. Figure 10 shows the R1 scatter plot matrix for all gas concentrations (before scaling and normalization), depicting the dependencies between gases and the corresponding histograms of each dimension. We obtain significant correlations among many gases, indicatively E1 with A1, A2 and A3 with Pearson correlation coefficients 0.41, 0.23, and 0.98 (\(p < 0.05\)), respectively, and E2 with A2 with correlation 0.36 (\(p < 0.05\)). By further analyzing the R1 dataset, we find that the first three Principal Components (PCs) explain 99.73% of the total variance (73.57%, 23.94%, and 2.22%, respectively); these are used for the Q1 and Q2 analytics queries (prediction of the meanvalue and model fitting).
Synthetic dataset To further evaluate scalability and efficiency along with accuracy, we now use a big synthetic dataset derived from a benchmark function that also ensures significant nonlinearity. The R2 synthetic dataset of input–output pairs \((u,\mathbf {x})\) contains \(10^{10}\) d-dim. real data vectors generated by the Rosenbrock function [46] \(u = g(\mathbf {x})\) with \(d \in \{1, \ldots , 6\}\). This is a popular benchmark function for testing nonlinear, gradient-based optimization algorithms. It has a global minimum inside a long, narrow, parabolic-shaped flat valley, where convergence to the global minimum is, however, extremely non-trivial [46]. The Rosenbrock function is \(u = g(\mathbf {x}) = \sum _{i=1}^{d-1}\left[ 100(x_{i+1}-x_{i}^{2})^{2} + (1-x_{i})^{2}\right] \), \(\mathbf {x} = [x_{1}, \ldots , x_{d}]\), with attribute domain \(|x_{i}| \le 10\) and global minimum 0 at \(x_{i}=1, \forall i\). Obviously, there is no linear dependency among features in the data space, as evidenced by FVU = 12.45. In addition, we generate \(10^{10}\) vectors adding noise \(\epsilon \sim \mathcal {N}(0,1)\) to each dimension. For illustration purposes, Fig. 12 shows the R2 dataset of the Rosenbrock function \(u = g(x_{1}, x_{2})\) with two \((d=2)\) variables \(x_{1}\) and \(x_{2}\) and its corresponding PLR approximation through \(K=23\) LLMs.
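A small-scale sketch of how R2-like data could be produced is given below. The Rosenbrock function follows its standard definition. The sampling routine is only an assumption about the generation procedure: it draws inputs uniformly in \([-10,10]^{d}\), evaluates u on the clean input, and then perturbs each input dimension with \(\mathcal {N}(0,1)\) noise. The name `sample_r2_like` and its parameters are hypothetical.

```python
import random


def rosenbrock(x):
    """Rosenbrock function: sum over i of 100*(x[i+1]-x[i]^2)^2 + (1-x[i])^2."""
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
        for i in range(len(x) - 1)
    )


def sample_r2_like(n, d, low=-10.0, high=10.0, noise_std=1.0, seed=0):
    """Generate n pairs (u, x) with x in [low, high]^d (illustrative only)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = [rng.uniform(low, high) for _ in range(d)]
        u = rosenbrock(x)
        # Assumption: noise is added to each input dimension after computing u.
        noisy_x = [xi + rng.gauss(0.0, noise_std) for xi in x]
        pairs.append((u, noisy_x))
    return pairs
```

Evaluating `rosenbrock` at the all-ones vector returns 0, the global minimum stated above.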
8.2.2 Query workloads, training and testing sets
Query workload Firstly, we generate certain query workloads to train and test our model. The random queries \(\mathbf {q} = [\mathbf {x},\theta ]\) with centers \(\mathbf {x}\) and radii \(\theta \) over the data subspaces are generated with uniformly distributed centers \(\mathbf {x} \in [0,1]^{d}\) for the R1 and R3 datasets and in \([-10,10]^{d}\) for the R2 dataset (recall that the data vectors in R1 and R3 are scaled and normalized in [0,1]). That is, the query centers can appear uniformly at random over the entire data space defined by the domains of the dataset dimensions \(d \in \{1,\ldots ,6\}\). The query radius \(\theta \) affects the training time and the prediction quality (both in predictability and goodness of fit). In brief, a larger (smaller) \(\theta \) implies shorter (longer) training times, as will be elaborated later. For each query, the radius \(\theta \sim \mathcal {N}(\mu _{\theta },\sigma ^{2}_{\theta })\) is generated from a Gaussian distribution with mean \(\mu _{\theta }\) and variance \(\sigma ^{2}_{\theta }\). We set random radius \(\theta \sim \mathcal {N}(0.1, 0.01)\) for the R1 and R3 datasets and \(\theta \sim \mathcal {N}(1, 0.25)\) for the R2 dataset, covering \(\sim \)20% of each feature data range; the justification for this setting is discussed later. Section 8.7 provides an extensive experimental and theoretical analysis of the impact of \(\theta \) on the model performance. Based on this setup, we generated random queries \(\mathbf {q}\) that are issued over the data spaces of the R1, R2 and R3 datasets. We use these queries for training and testing our models as follows.
Training and testing sets We describe how we generate the training and testing query–answer sets from the above-mentioned query workload methodology. To train our model, we generate training files \(\mathcal {T}\) consisting of random queries \(\mathbf {q}\) as described above along with their actual aggregate answers y after executing them. To test the performance of our models, we generate different testing files \(\mathcal {V}\) dedicated only to predictions. The training and testing files have various sizes: \(|\mathcal {T}| \in \{10^{3}, \ldots , 10^{4}\}\) and \(M = |\mathcal {V}| \in \{10^{3}, \ldots , 2\cdot 10^{4}\}\), respectively. Specifically, the training sets \(\mathcal {T}\) and testing sets \(\mathcal {V}\) contain pairs of queries and answers, i.e., \((\mathbf {q},y)\), where the queries were executed over the R1, R2, and R3 datasets (see also Fig. 4). We adopted the cross-validation technique [51] to evaluate all the predictive models by partitioning the original query–answer sets into a training set to train the models, and a test set to evaluate them. We use 10-fold cross-validation, where the original query–answer set is randomly partitioned into 10 equal-size subsets. Of the 10 subsets, a single subset is retained as the validation dataset for testing the models, and the remaining 9 subsets are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsets used exactly once as the validation data. The 10 results from the folds are then averaged to produce a single estimate of the above-mentioned performance metrics.
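The workload generation described above can be sketched as follows. Centers are drawn uniformly over the data domain and radii from a Gaussian with the stated mean and variance. Because a Gaussian can yield non-positive values, this sketch redraws such radii; that choice, like the function name and signature, is an assumption and is not stated in the text.

```python
import random


def generate_queries(n, d, low, high, mu_theta, var_theta, seed=0):
    """Generate n random queries q = [x_1, ..., x_d, theta].

    Centers are uniform in [low, high]^d; theta ~ N(mu_theta, var_theta),
    where var_theta is a variance (its square root is the standard deviation).
    """
    rng = random.Random(seed)
    sigma = var_theta ** 0.5
    queries = []
    for _ in range(n):
        center = [rng.uniform(low, high) for _ in range(d)]
        theta = rng.gauss(mu_theta, sigma)
        while theta <= 0.0:
            # Assumption: a radius must be positive, so redraw otherwise.
            theta = rng.gauss(mu_theta, sigma)
        queries.append(center + [theta])
    return queries
```

For example, `generate_queries(1000, 2, 0.0, 1.0, 0.1, 0.01)` would mimic the R1 setting and `generate_queries(1000, 2, -10.0, 10.0, 1.0, 0.25)` the R2 setting.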
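The 10-fold procedure can be sketched as below. The splitting logic is generic k-fold cross-validation; `fit` and `score` are hypothetical callables standing in for model training and for a metric such as RMSE, and are not part of the paper.

```python
import random


def k_fold_splits(pairs, k=10, seed=0):
    """Yield (train, test) partitions; each pair is used for testing exactly once."""
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal disjoint subsets
    for i in range(k):
        test = [pairs[j] for j in folds[i]]
        train = [pairs[j] for f in range(k) if f != i for j in folds[f]]
        yield train, test


def cross_validate(pairs, fit, score, k=10, seed=0):
    """Average the score obtained over the k folds."""
    scores = []
    for train, test in k_fold_splits(pairs, k, seed):
        model = fit(train)                 # train on the other k-1 subsets
        scores.append(score(model, test))  # evaluate on the held-out subset
    return sum(scores) / len(scores)
```

Here `pairs` would hold the query–answer pairs \((\mathbf {q},y)\), and the returned average corresponds to the single estimate reported per metric.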
8.3 Model training and convergence
Experimental Parameters
Parameters  Range/value 

Data dimensionality d  \(\{2, 3, 5, 6\}\) 
Real dataset R1 [45]  \(15\cdot 10^{6}\) vectors in \([0,1]^{d}\) 
Synthetic dataset R2  \(10^{10}\) Rosenbrock in \([-10, 10]^{d}\) 
Real dataset R3 [16]  \(10\cdot 10^{6}\) vectors in \([0,1]^{d}\) 
Vigilance coefficient a  [0.05, 1] 
Consensual threshold r  0.7 
Convergence threshold \(\gamma \)  0.01 
Training dataset size \(\mathcal {T}\)  \([10^{3}, \ldots , 10^{4}]\) 
Testing dataset size \(\mathcal {V}\)  \([10^{3}, \ldots , 2 \cdot 10^{4}]\) 
Initial learning rate \(\eta _{0}\)  0.5 [14] 
Query center/point  Uniform vectors in \([0,1]^{d}\) 
Query radius \(\theta \)  Gaussian values \(\mathcal {N}(\mu _{\theta },\sigma _{\theta }^{2})\) 
Query mean radius \(\mu _{\theta }\)  [0.01, 0.99] 
Query radius dev. \(\sigma _{\theta }\)  0.01 
Figure 14 shows the relation between the percentage of training pairs \(\mathcal {T}\)% and the percentage of query prototypes that have partially converged in the intermediate phase, for \(d \in \{2,5\}\) over the R1 dataset with quantization coefficient \(a = 0.25\). Specifically, we observe that with only 35% of the training pairs, i.e., with almost 1800 query–answer pairs (landmark \(t_{\top } \approx 1800\)), 70–80% of the queryLLM prototypes have converged. This indicates that a large portion of the model, in terms of queryLLM prototypes, has converged after a relatively small number of training pairs. In this case, the intermediate phase is of high importance for delivering predictions to the analysts while the model is still in a ‘quasi-training’ mode. After 70% of the queryLLM prototypes have converged, the model converges at a high rate as more training pairs from the training set \(\mathcal {T}\) are observed. This suggests setting the consensual threshold for the intermediate phase to \(r = 0.7\). However, the predictions delivered to the analysts during this phase have to be assessed w.r.t. prediction accuracy, as will be discussed in Sect. 8.4.
8.4 Evaluation of Q1 query: predictability and scalability
Figures 17 and 18(lower) show the robustness of our model w.r.t. predictability with various testing file sizes \(\mathcal {V}\) for the R1, R2, and R3 datasets, respectively. Once the LLM model has converged, it provides a low and constant prediction error in terms of RMSE for different data dimensions d, indicating the robustness of the training and convergence phase of the proposed model. This means that, after moving into the prediction phase, the model can accurately predict the aggregate answer y via the identified and optimized queryLLM functions; thus, no query processing or data access is needed in that phase.
We now examine the impact of the model's partial convergence on predictability, i.e., when the model is in the intermediate phase between the training and the prediction phases. Figure 19 (upper) shows the partial RMSE \(\tilde{e}\) of the predicted aggregate answer y (A1 metric) during the intermediate phase of the model and the RMSE e achieved during the prediction phase against the percentage of training pairs for consensual threshold \(r = 0.7\) over dimension \(d \in \{2, 5\}\) for the dataset R1. Similar results are obtained from the R2 and R3 datasets. Specifically, the partial RMSE \(\tilde{e}\) is obtained only from the converged query prototypes during the intermediate phase, as described in Case A.I in Sect. 7.2 for \(r=0.7\). That is, it is obtained from those query prototypes that an additional training pair \((\mathbf {q},y)\) no longer moves significantly in the query space. With almost 35% of the training pairs for \(d=2\) (and 45% for \(d=5\)), the model achieves an RMSE value close to the RMSE value obtained in the full prediction phase, i.e., after observing 100% of the training pairs from \(\mathcal {T}\). This indicates the flexibility of our model to proceed with accurate predictions even in the intermediate phase, where some of the query prototypes are still in a training mode until the model entirely converges.
More interestingly, Fig. 19(lower) shows the efficiency of our model in achieving high prediction accuracy even during the intermediate phase described above. In the intermediate phase, the model can provide RMSE values close to those at the end of the training phase, with 70% of the prototypes converged after observing 37% of the training pairs from the training set \(\mathcal {T}\). This demonstrates the fast convergence of the model and its immediate applicability for delivering predictions to analysts and to real-time predictive analytics applications before it has fully converged.
Remark 10
The RMSE in Fig. 19(lower) is normalized in [0,1] for comparison reasons with the percentages of converged prototypes and training pairs.
8.5 Evaluation of Q2 query: PLR data approximation and scalability
We evaluate the Q2 queries by using our query/dataLLMs model against the REG and PLR models and show the statistical significance of the derived accuracy metrics. The explanation over the linear/nonlinear behaviors of data function g is interpreted by the variance explanation and model fitting metrics fraction of the variance unexplained FVU and coefficient of determination CoD against the quantization resolution coefficient a and the model prototypes K. Figures 20 and 21(upper) show the sum of squared residuals SSR between the actual answers and the predicted answers for the dataLLMs and REG model with \(d \in \{2,5, 6\}\) over the datasets R1, R2, and R3 with \(p < 0.05\).
In Fig. 22(lower) and in Fig. 21(lower) the data function g in the R1 and R3 datasets does not behave linearly in all the random data subspaces. This is evidenced by the FVU metric of the REG model, which is relatively close to or over 1 for \(d=2\), \(d=5\), and \(d=6\) with \(p<0.05\). This information is unknown a priori to analysts, hence the results using the REG model would be fraught with approximation errors indicating ‘bad’ model fitting. It is worth noting that the average number of dataLLM functions that are returned to the analysts for all the issued testing queries in the testing set \(\mathcal {V}\) is \(|\mathcal {S}| = 4.62\) per query with variance 3.88. This reflects the nonlinear behavior of the data function g and the fine-grained and accurate explanation of the function g within a specific data subspace \(\mathbb {D}(\mathbf {x},\theta )\) per query \(\mathbf {q} = [\mathbf {x},\theta ]\). Here, the PLR model achieves the lowest FVU value, i.e., the best model fitting, as expected (\(p<0.05\)), but note that this is also achieved by our dataLLM functions with a quantization coefficient \(a < 0.1\).
Figure 23(upper) shows the coefficient of determination CoD \(R^{2}\) for the LLM, REG, and PLR models over the R1 dataset (similar results are obtained for datasets R2 and R3) at a significance level of 5%. A positive value of \(R^{2}\) close to 1 indicates that a linear approximation is a good fit for the unknown data function g. In contrast, a value of \(R^{2}\) close to 0, and especially a negative value of \(R^{2}\), indicates a significantly bad fit, signaling inaccuracies in function approximation. In our case, with \(K>60\) query prototypes, our model achieves high and positive \(R^{2}\), indicating that our model better explains the random queried data subspaces \(\mathbb {D}(\mathbf {x},\theta )\) than the current inDMS REG model does over exactly the same data subspaces (\(p<0.05\)). The REG model achieves low \(R^{2}\) values, including negative ones, thus it is inappropriate for predictions and function approximation. This indicates that the underlying data function g is highly nonlinear, which is not known to the analysts a priori. By adopting our model, the analysts progressively learn the underlying data function g and also, via the derived dataLLM functions, capture the hidden nonlinearities over the queried data subspaces. Such subspaces could never be known to the analysts unless exhaustive data access and exploration takes place. This capability is only provided by our model. Notably, as the quantization coefficient \(a \rightarrow 1\), our model significantly increases the coefficient of determination \(R^{2}\) value, indicating a better capture of the specificities of the underlying data function g, thus providing more accurate linear models. Again, the data-access exhaustive PLR model achieves the highest CoD values, however at the cost of high inefficiency; see Fig. 24(lower).
Regardless, note that our model can match the PLR’s CoD value by simply increasing K, i.e., the granularity of query space quantization.
Figures 24(lower) and 26(lower) show the Q2 execution time over the datasets R2 and R3, respectively, for dataLLM (\(a=0.25\), i.e., \(K=(92,450)\) for \(d=(2,5)\)) through Algorithm 3, the REG model from PostgreSQL (\(d=2\)) (REGDBMS), the REG model from MATLAB (\(d=5\)) (REGMATLAB), and the optimal PLR against dataset size. The derived results are statistically significant with \(p<0.05\). Our model is highly scalable (note the flat curves) in both datasets and highly efficient, achieving 0.56 ms/query and 0.78 ms/query (even for massive datasets), up to six orders of magnitude better than the REG and PLR models for the R2 and R3 datasets, respectively. The full picture is then that our model provides ultimate scalability (by being independent of the size of the dataset) and many orders of magnitude higher efficiency, while it ensures great goodness of fit (CoD, FVU), similar to that of PLR.
8.6 Data output predictability
The LLM model can successfully predict the data output u, being statistically robust in terms of the number of testing pairs \(\mathcal {V}\) (\(p<0.05\)), and achieves a prediction error comparable to or even lower than that of the REG model. This shows that our model, by fusing different dataLLM functions which better capture the characteristics of the underlying data function g, provides better data output u prediction than a ‘global’ REG model over random queried data subspaces \(\mathbb {D}\). Evidently, the PLR model achieves the lowest RMSE value by actually accessing the data and captures the actual nonlinearity of the data function g through linear models. However, this is achieved with relatively high computational complexity, higher than that of the REG model, as it involves a data-access process of polynomial cost [44]. Note that the data output prediction times for the LLM, REG, and PLR models in this experiment are the same as those presented in Fig. 24: the LLM model executes our Algorithm 2 by replacing \(\theta =\theta _{k}\) in (30), \(\forall \mathbf {w}_{k} \in \mathcal {W}(\mathbf {q})\), the REG model creates the linear approximation over the data space \(\mathbb {D}\), and the PLR adaptively finds the best linear models for data fitting in each prediction request.
Overall, the proposed LLM model, through the training, intermediate and prediction phases, achieves statistically significant scalability and accuracy performance compared with the in-RDBMS REG model and the data-access intensive PLR model (\(p<0.05\)). The scalability of the proposed model in the predictive analytics era is achieved by predicting the query answers and delivering to analysts the statistical behavior of the underlying data function without accessing the raw data and without processing/executing the analytics queries, as opposed to the data-driven REG and PLR models in the literature.
8.7 Impact of radius \(\theta \)
We experiment with different mean values \(\mu _{\theta }\) of the radius \(\theta \sim \mathcal {N}(\mu _{\theta },\sigma ^{2}_{\theta })\) having a fixed variance \(\sigma ^{2}_{\theta }\) to examine the impact on the model training, quality of aggregate answer prediction, and PLR approximation of the underlying data function g. We examine the number of training pairs, \(\mathcal {T}\), that our method requires to reach the convergence threshold \(\gamma = 0.01\). We also examine the impact of radius \(\theta \) on the RMSE and CoD metrics. Hence, three factors (\(\mathcal {T}\), RMSE, and CoD) are influenced by the radius \(\theta \). We experiment with mean radius \(\mu _{\theta } \in \{0.01, \ldots , 0.99\}\) over the R1 dataset (similar results are obtained for the R2 and R3 datasets). Consider the queries with high radius \(\theta \) drawn from Gaussian \(\mathcal {N}(\mu _{\theta },\sigma ^{2}_{\theta })\) with high mean radius \(\mu _{\theta }\). Then, radius \(\theta \) nearly covers the entire input data range and the aggregate answer y is close to the average value of output u for all queries, i.e., \(n_{\theta }\) contains all \(\mathbf {x}\) input data points. In this case, all query prototypes \(\mathbf {w}_{k}\) correspond to constant queryLLM functions \(f_{k}(\mathbf {x},\theta ) \approx y_{k} = y\), where the aggregate answer \(y = \mathbb {E}[u]\) does not depend on \(\mathbf {x}\) or on radius \(\theta \). Hence, the training and convergence of all LLMs is trivial since there is no specificity to be extracted from each queryLLM function \(f_{k}\). Our method converges with a low number of training pairs \(\mathcal {T}\), as shown in Fig. 27(lower). On the other hand, a small \(\theta \) value corresponds to learning ‘meticulously’ all the specificities for all LLMs. In this case, our method requires a relatively high number of training query–answer pairs \(\mathcal {T}\) to converge; see Fig. 27(lower).
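The limiting behavior for large radii can be checked on a toy example. The sketch below implements an exact mean-value query as the average output over the points inside the query ball. This is an assumed reading of Q1 for illustration, and the function name is hypothetical. Once \(\theta \) covers the whole domain, the answer no longer depends on the center.

```python
import math


def mean_value_query(points, outputs, center, theta):
    """Average of the outputs u whose inputs x lie within distance theta of center.

    Returns None when the query ball contains no data point.
    """
    selected = [
        u for x, u in zip(points, outputs) if math.dist(x, center) <= theta
    ]
    if not selected:
        return None
    return sum(selected) / len(selected)
```

For data in \([0,1]^{2}\), any \(\theta \ge \sqrt{2}\) makes every query return the global mean of u, whatever its center, which is why all queryLLM functions collapse to the same constant in that regime. A small \(\theta \) selects only local points, so answers vary with the center and more training pairs are needed.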
9 Conclusions and future plans
We focused on the inferential task of piecewise linear regression and predictive modeling which are central to inDMS predictive analytics. We introduced an investigation route, whereby answers from previously executed aggregate and regression queries are exploited to train novel statistical learning models which discover and approximate the unknown underlying data function with piecewise linear regression planes, predict future meanvalue query answers, and predict the data output. We contribute with a statistical learning methodology, which yields highly accurate answers and data function approximation based only on the query–answer pairs and avoiding data access after the model training phase. The performance evaluation and comparative assessment revealed very promising results.
Our methodology is shown to be highly accurate, extremely efficient in computing query results (with submillisecond latencies even for massive datasets, yielding up to six orders of magnitude improvement compared to computing exact answers, produced by piecewise linear regression and global linear approximation models), and scalable, as predictions during query processing do not require access to the DMS engine, thus being insensitive to dataset sizes.
Our plans for future work focus on developing a framework that can dynamically and optimally switch between the training/intermediate phases and the query prediction phase as analysts' interests shift between data subspaces. Moreover, this framework is expected to cope with nonlinear approximations by evolving and expanding the fundamental representatives of both data and query subspaces, in order to support robust query subspace adaptation and to deal with data spaces with online data updates.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. This work is funded by the EU H2020 GNFUV Project RAWFIE–OC2–EXP–SCI (Grant#645220), under the EC FIRE+ initiative.
References
 1.Abbott, D.: Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst, 1st edn. Wiley, Hoboken (2014)Google Scholar
 2.Adjeroh, D.A., Lee, M.C., King, I.: A distance measure for video sequence similarity matching. In: Proceedings International Workshop on MultiMedia Database Management Systems (Cat. No.98TB100249), pp. 72–79 (1998)Google Scholar
 3.Amirian, P., Basiri, A., Morley, J.: Predictive analytics for enhancing travel time estimation in navigation apps of apple, google, and microsoft. In: Proceedings of the 9th ACM SIGSPATIAL International Workshop on Computational Transportation Science, IWCTS ’16, pp. 31–36. ACM, New York (2016)Google Scholar
 4.Anagnostopoulos, C.: Qualityoptimized predictive analytics. Appl. Intell. 45(4), 1034–1046 (2016)CrossRefGoogle Scholar
 5.Anagnostopoulos, C., Kolomvatsos, K.: Predictive intelligence to the edge through approximate collaborative context reasoning. Appl. Intell. 48(4), 966–991 (2018)CrossRefGoogle Scholar
 6.Anagnostopoulos, C., Savva, F., Triantafillou, P.: Scalable aggregation predictive analytics: a querydriven machine learning approach. Appl. Intell. 48, 2546 (2018). https://doi.org/10.1007/s104890171093y CrossRefGoogle Scholar
 7.Anagnostopoulos, C., Triantafillou, P.: Learning set cardinality in distance nearest neighbours. In: 2015 IEEE International Conference on Data Mining, pp. 691–696 (2015)Google Scholar
 8.Anagnostopoulos, C., Triantafillou, P.: Efficient scalable accurate regression queries in indbms analytics. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 559–570 (2017). https://doi.org/10.1109/ICDE.2017.111
 9.Anagnostopoulos, C., Triantafillou, P.: Querydriven learning for predictive analytics of data subspace cardinality. ACM Trans. Knowl. Discov. Data 11(4), 47 (2017). https://doi.org/10.1145/3059177 CrossRefGoogle Scholar
 10.Ari, B., Gvenir, H.A.: Clustered linear regression. Knowl. Based Syst. 15(3), 169–175 (2002)CrossRefGoogle Scholar
 11.Avron, H., Sindhwani, V., Woodruff, D.P.: Sketching structured matrices for faster nonlinear regression. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pp. 2994–3002. Curran Associates Inc. (2013)Google Scholar
 12.Bagirov, A., Clausen, C., Kohler, M.: An algorithm for the estimation of a regression function by continuous piecewise linear functions. Comput. Optim. Appl. 45(1), 159–179 (2010)MathSciNetCrossRefGoogle Scholar
 13.Bai, J., Perron, P.: Estimating and testing linear models with multiple structural changes. Econometrica 66(1), 47–78 (1998)MathSciNetCrossRefGoogle Scholar
 14.Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Mller, K.R. (eds.) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700, 2nd edn, pp. 421–436. Springer, Berlin (2012)CrossRefGoogle Scholar
 15.Bousquet, O., Bottou, L.: The tradeoffs of large scale learning. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds.) Advances in Neural Information Processing Systems, vol. 20, pp. 161–168. Curran Associates Inc, Red Hook (2008)Google Scholar
 16.Candanedo, L.M., Feldheim, V., Deramaix, D.: Data driven prediction models of energy use of appliances in a lowenergy house. Energy Build. 140, 81–97 (2017)CrossRefGoogle Scholar
 17.Chatterjee, S., Guntuboyina, A., Sen, B.: On risk bounds in isotonic and other shape restricted regression problems. Ann. Stat. 43(4), 1774–1800 (2015)MathSciNetCrossRefGoogle Scholar
 18.Cherkassky, V., LariNajafi, H.: Constrained topological mapping for nonparametric regression analysis. Neural Netw. 4(1), 27–40 (1991)CrossRefGoogle Scholar
 19.Choi, C.H., Choi, J.Y.: Constructive neural networks with piecewise interpolation capabilities for function approximations. IEEE Trans. Neural Netw. 5(6), 936–944 (1994)CrossRefGoogle Scholar
 20.Choi, J.Y., Farrell, J.A.: Nonlinear adaptive control using networks of piecewise linear approximators. IEEE Trans. Neural Netw. 11(2), 390–401 (2000)CrossRefGoogle Scholar
 21.Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: Mad skills: new analysis practices for big data. Proc. VLDB Endow. 2(2), 1481–1492 (2009)CrossRefGoogle Scholar
 22.Dean, J., Corrado, G.S., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., Ng, A.Y.: Large scale distributed deep networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS’12, pp. 1223–1231. Curran Associates Inc. (2012)Google Scholar
 23.Deshpande, A., Madden, S.: Mauvedb: Supporting modelbased user views in database systems. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD ’06, pp. 73–84. ACM, New York (2006)Google Scholar
 24.Di Blas, N., Mazuran, M., Paolini, P., Quintarelli, E., Tanca, L.: Exploratory computing: a comprehensive approach to data sensemaking. Int. J. Data Sci. Anal. 3(1), 61–77 (2017)CrossRefGoogle Scholar
 25.Dennis Jr., J.E., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall Series in Computational Mathematics. Prentice Hall, Upper Saddle River (1983)zbMATHGoogle Scholar
 26.Fan, J., Gijbels, I.: Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability Series, vol. 66. Chapman & Hall, London (1996)zbMATHGoogle Scholar
 27.Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for inrdbms analytics. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pp. 325–336. ACM, New York (2012)Google Scholar
 28.FerrariTrecate, G., Muselli, M.: A new learning method for piecewise linear regression. In: Artificial Neural Networks—ICANN 2002, International Conference, Madrid, 28–30 Aug 2002, Proceedings, pp. 444–449 (2002)Google Scholar
 29.Freedman, D.: Statistical Models : Theory and Practice. Cambridge University Press, Cambridge (2005)CrossRefGoogle Scholar
 30.Grossberg, S.: Adaptive resonance theory: how a brain learns to consciously attend, learn, and recognize a changing world. Neural Netw. 37, 1–47 (2013)CrossRefGoogle Scholar
 31.Harth, N., Anagnostopoulos, C.: Qualityaware aggregation predictive analytics at the edge. In: 2017 IEEE International Conference on Big Data (Big Data), pp. 17–26 (2017)Google Scholar
 32.Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2001)CrossRefGoogle Scholar
 33.Jeffreys, H., Jeffreys, B.S.: ‘Taylor’s Theorem’ Paragraph. Methods of Mathematical Physics, vol. 1.133, 3rd edn, pp. 50–51. Cambridge University Press, Cambridge (1988)Google Scholar
 34.Jordan, M.I.: On statistics, computation and scalability. Bernoulli 19(4), 1378–1390 (2013)MathSciNetCrossRefGoogle Scholar
 35.Jordan, M.I.: Computational thinking, inferential thinking and “big data”. In: Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’15, pp. 1–1. ACM, New York (2015)
 36.Khattree, R., Bahuguna, M.: An alternative data analytic approach to measure the univariate and multivariate skewness. Int. J. Data Sci. Anal. (2018). https://doi.org/10.1007/s4106001801061
 37.Kyng, R., Rao, A., Sachdeva, S.: Fast, provable algorithms for isotonic regression in all p-norms. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pp. 2719–2727. MIT Press, Cambridge (2015)
 38.Li, X., Anselin, L., Koschinsky, J.: Geoda web: enhancing web-based mapping with spatial analytics. In: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’15, pp. 94:1–94:4. ACM, New York (2015)
 39.Meyer, M.C.: Inference using shape-restricted regression splines. Ann. Appl. Stat. 2(3), 1013–1033 (2008)
 40.Moustra, M., Avraamides, M., Christodoulou, C.: Artificial neural networks for earthquake prediction using time series magnitude data or seismic electric signals. Expert Syst. Appl. 38(12), 15032–15039 (2011)
 41.Mukherji, A., Lin, X., Toto, E., Botaish, C.R., Whitehouse, J., Rundensteiner, E.A., Ward, M.O.: Fire: a two-level interactive visualization for deep exploration of association rules. Int. J. Data Sci. Anal. 2018, 1–26 (2018)
 42.Nakayama, K., Hirano, A., Kanbe, A.: A structure trainable neural network with embedded gating units and its learning algorithm. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3, pp. 253–258 (2000)
 43.Nothaft, F.A., Massie, M., Danford, T., Zhang, Z., Laserson, U., Yeksigian, C., Kottalam, J., Ahuja, A., Hammerbacher, J., Linderman, M., Franklin, M.J., Joseph, A.D., Patterson, D.A.: Rethinking data-intensive science using scalable analytics systems. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pp. 631–646. ACM, New York (2015)
 44.O’Sullivan, F.: Discussion: multivariate adaptive regression splines. Ann. Stat. 19(1), 99–102 (1991)
 45.Rodriguez-Lujan, I., Fonollosa, J., Vergara, A., Homer, M., Huerta, R.: On the calibration of sensor arrays for pattern recognition using the minimal number of experiments. Chemom. Intell. Lab. Syst. 130, 123–134 (2014)
 46.Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a function. Comput. J. 3(3), 175 (1960)
 47.Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on apache spark. Int. J. Data Sci. Anal. 1(3), 145–164 (2016)
 48.Schleich, M., Olteanu, D., Ciucanu, R.: Learning linear regression models over factorized joins. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pp. 3–18. ACM, New York (2016)
 49.Schneider, P., Biehl, M., Hammer, B.: Adaptive relevance matrices in learning vector quantization. Neural Comput. 21(12), 3532–3561 (2009)
 50.Thiagarajan, A., Madden, S.: Querying continuous functions in a database system. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pp. 791–804. ACM, New York (2008)
 51.Trippa, L., Waldron, L., Huttenhower, C., Parmigiani, G.: Bayesian nonparametric cross-study validation of prediction methods. Ann. Appl. Stat. 9(1), 402–428 (2015)
 52.Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16(2), 264–280 (1971)
 53.Venkataraman, S., Yang, Z., Franklin, M., Recht, B., Stoica, I.: Ernest: efficient performance prediction for large-scale advanced analytics. In: Proceedings of the 13th Usenix Conference on Networked Systems Design and Implementation, NSDI’16, pp. 363–378. USENIX Association, Berkeley (2016)
 54.Yamamoto, Y., Perron, P.: Estimating and testing multiple structural changes in linear models using band spectral regressions. Econom. J. 16(3), 400–429 (2013)
 55.Yeh, E., Niekrasz, J., Freitag, D.: Unsupervised discovery and extraction of semi-structured regions in text via self-information. In: Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, AKBC ’13, pp. 103–108. ACM, New York (2013)
 56.Zheng, L., Wang, S., Liu, Y., Lee, C.H.: Information theoretic regularization for semi-supervised boosting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 1017–1026. ACM, New York (2009)
 57.Zhou, X., Zhou, X., Chen, L., Shu, Y., Bouguettaya, A., Taylor, J.A.: Adaptive subspace symbolization for content-based video detection. IEEE Trans. Knowl. Data Eng. 22(10), 1372–1387 (2010)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.