
1 Introduction

With the growth of stream data sources, such as sensors, GPS, microblogs, e-business, etc., the need to aggregate and analyse stream data is increasing. Many applications require instant decisions that exploit the latest information from data streams. For instance, timely analysis of business data is required for improving profit, network packets need to be monitored in real time to identify network attacks, etc. Online analytical processing (OLAP) is a well-known and useful approach to analyse data in a multi-dimensional fashion, originally developed for disk-based static data (we call it traditional OLAP). For effective OLAP analysis, the data is converted into a multi-dimensional schema, also known as a star schema. The data in a star schema is represented as a data cube, where each cube cell contains a measure across multiple dimensions. A user may be interested in analysing data across different combinations of dimensions or examining different views of it. These are often termed OLAP operations and, to support them, data is organized as lattice nodes.

A number of solutions have been proposed for OLAP analysis on data streams in the recent past [1,2,3]. Real-time OLAP analysis on fast, evolving data streams is not possible unless the data to be analysed resides in primary memory. However, primary memory is limited in size and volatile. We therefore need an in-memory compact data structure that can quickly answer user queries, in addition to a non-volatile backup of the data streams. Hence we propose a novel architecture, AOLAP (Approximate Stream OLAP), which, in addition to storing raw data streams on secondary storage, maintains data stream summaries in a compact memory-based data structure. This work proposes the use of piece-wise linear approximation (PLA) for storing such data summaries for each materialized node in the OLAP cube. PLA can store long data stream summaries in comparatively little primary memory and can give approximate answers to OLAP queries. It provides an impressive data compression ratio, answers user queries with maximum error guarantees, and has been studied by many researchers [4,5,6,7].

When performing OLAP analysis, users usually request different lattice nodes. Generally only a few nodes are materialized while the other requested nodes are computed on an ad-hoc basis, because materializing all the nodes is memory expensive. On the other hand, materializing too few nodes requires a lot of ad-hoc computation, which is computationally (time) expensive. Optimizing this space-time trade-off in traditional OLAP, i.e., choosing the lattice nodes to materialize, is an NP-hard problem [8]. Selecting the lattice nodes to materialize for PLA-based stream OLAP involves a space-error trade-off in addition to the space-time trade-off. Since PLA can maintain long data stream summaries in memory and in most cases can answer queries from them, where the computation is extremely fast, the space-error trade-off is the more significant of the two in the context of PLA-based stream OLAP. Hence this work also proposes an optimization scheme which selects the \(\eta \) (a user-defined parameter) lattice nodes to materialize such that the overall querying error is minimized. We motivate the contributions of this work with the following real-world example.

Example 1

A big retail chain collects the sales quantities of its stores at the granularity of individual product, store location and promotion (under which the product is sold) dimensions, which arrive every minute as an infinite time series data stream. The top management is interested in analysing the sales in real time to avoid shortage of product supply in any city or state. In addition, the top management is interested in reviewing recent advertisement strategies and/or promotional campaigns for specific products and brands to decide the future advertisement budget/strategy and to plan promotional campaigns.

It is not possible to perform such analysis in real time if all the data to be analysed must be fetched from secondary storage. Keeping in view the importance and demand of real-time analysis, a small degree of error in query results may be tolerated as a trade-off for timely analysis.     \(\blacksquare \)

Our contributions in this work can be summarized as follows:

  • A novel architecture, AOLAP, which in addition to storing raw data streams on secondary storage, maintains data stream summaries in a compact memory-based data structure.

  • Use of the PLA structure to compactly maintain the stream OLAP cubes.

  • An optimization scheme to select the lattice nodes to materialize which can minimize the querying error.

  • Detailed experimental evaluation to prove the effectiveness of the use of PLA structure for the materialized nodes and the optimization scheme.

The rest of the paper is organized as follows: Sect. 2 reviews essential concepts. Section 3 discusses the related work. In Sect. 4, PLA-based sustained storage is presented. In Sect. 5, the proposed AOLAP architecture and query processing over PLA-based storage are presented. The estimation of the querying error is presented in Sect. 6, and the proposed optimization scheme in Sect. 7. The effectiveness of our contributions is experimentally evaluated in Sect. 8. Section 9 concludes this paper and discusses future directions.

2 Essential Concepts

2.1 Piecewise Linear Approximation (PLA)

PLA is a method of constructing a function that approximates a single-valued function of one variable by a sequence of linear segments [9]. Precisely, let S be a time series of discrete data points \((t_i,x_i)\), where \(i \in [1,n]\), \(t_i\) is the i-th timestamp and \(x_i\) is the i-th value; we wish to approximate \(x_i\) with a piece-wise linear function \(f(t_i)\), using a small number of segments, such that the error \(|f(t_i)-x_i| \le \epsilon \), where \(\epsilon \) is a user-defined error parameter. The goal is to record only the successive line segments, rather than the individual data points, to reduce the overhead of recording the entire time series.

The authors of [9] proposed an online algorithm to construct such an f with the minimum number of line segments. For completeness, it is described in Algorithm 1. The algorithm takes a data point \(p = (t_i,x_i)\) and an error parameter \(\epsilon \). Let P be the set of points processed so far; the algorithm maintains the invariant that all points in P can be approximated by a single line segment within \(\epsilon \). If \(P \cup \{p\}\) can be approximated by a line segment then p is added to P; otherwise the points in P are output as a line segment and a new segment is started with the point p.
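To make the construction concrete, the following is a minimal C++ sketch of online PLA segmentation (all names, such as `PlaBuilder` and `Segment`, are illustrative assumptions, not the paper's code). It simplifies Algorithm 1: each segment is anchored at its first point and a feasible slope interval is maintained, so it may emit more segments than O'Rourke's optimal convex-hull construction [9], while giving the same \(\epsilon \) error guarantee.

```cpp
#include <algorithm>
#include <optional>

struct Segment {        // one PLA line: x ~ slope*t + intercept over [tStart, tEnd]
    double tStart, tEnd, slope, intercept;
};

// Simplified online PLA builder: each segment is anchored at its first point
// (t0, x0); [sLow, sHigh] is the interval of slopes that keep every later
// point within eps. Assumes timestamps strictly increase per stream.
class PlaBuilder {
public:
    explicit PlaBuilder(double eps) : eps_(eps) {}

    // Feed one point; returns a closed segment whenever the current one breaks.
    std::optional<Segment> append(double t, double x) {
        if (!active_) { start(t, x); return std::nullopt; }
        double lo = (x - eps_ - x0_) / (t - t0_);  // slope bounds imposed by (t, x)
        double hi = (x + eps_ - x0_) / (t - t0_);
        if (std::max(lo, sLow_) <= std::min(hi, sHigh_)) {  // point still fits
            sLow_ = std::max(sLow_, lo);
            sHigh_ = std::min(sHigh_, hi);
            tEnd_ = t;
            return std::nullopt;
        }
        Segment done = finish();   // infeasible: close current segment,
        start(t, x);               // start a new one at (t, x)
        return done;
    }

private:
    void start(double t, double x) {
        t0_ = tEnd_ = t; x0_ = x;
        sLow_ = -1e300; sHigh_ = 1e300;
        active_ = true;
    }
    Segment finish() const {
        double s = (sLow_ + sHigh_) / 2;   // midpoint of the feasible slopes
        return {t0_, tEnd_, s, x0_ - s * t0_};
    }
    double eps_, t0_{}, tEnd_{}, x0_{}, sLow_{}, sHigh_{};
    bool active_ = false;
};
```

Only the closed segments (slope, intercept, time range) need to be retained; the raw points can be discarded as they are consumed.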

Example 2

Consider a retail chain time series with the dimensions and business fact of Example 1. It is a series of 5-tuples \(<t,p,s,m,x>\): the timestamp (minute) (t), product (p), store (s), promotion (m) and sales quantity (x).

(\(1, p_1, s_1, m_1, 48\)), (\(1, p_2, s_1, m_1, 48\)), (\(2, p_1, s_1, m_1, 43\)), (\(2, p_2, s_1, m_1, 64\)), (\(3, p_1, s_1, m_1, 60\)), (\(3, p_2, s_1, m_1, 73\)), (\(4, p_1, s_1, m_1, 75\)), (\(4, p_2, s_1, m_1, 58\)), (\(5, p_1, s_1, m_1, 35\)), (\(5, p_2, s_1, m_1, 87\)), (\(6, p_1, s_1, m_1, 52\)), (\(6, p_2, s_1, m_1, 7\)), (\(7, p_1, s_1, m_1, 95\)), (\(7, p_2, s_1, m_1, 2\)), ...

Algorithm 1. Online PLA construction
Fig. 1. Data points approximated by PLA segments

Assuming \(\epsilon \) = 10, the tuples for the dimension keys \(p_1\), \(s_1\) and \(m_1\) in the above time series can be approximated by the following piecewise function.

$$\begin{aligned} f_{p_1,s_1,m_1}(t) = {\left\{ \begin{array}{ll} 9.8t + 32 & 1 \le t \le 4 \\ 30t - 119.33 & 5 \le t \le 7 \\ ... \end{array}\right. } \end{aligned}$$

Figure 1 shows the PLA segments of \(f_{p_1,s_1,m_1}(t)\). \(f_1(t)\) and \(f_2(t)\) are the PLA segments formed by the tuples for timestamps \(1 \le t \le 4\) and \(5 \le t \le 7\), respectively. The accurate sales quantities are shown in the figure for illustration only; the approximate sales quantities can be obtained from the PLA segments. Note that when using PLA, we only maintain the PLA segments (slopes and intercepts) in memory and not the actual data points, resulting in a reduction in data size.     \(\blacksquare \)

2.2 Online Analytical Processing (OLAP)

OLAP is a technique for interactive analysis over multidimensional data. For efficient OLAP analysis, the underlying database schema is usually converted into a partially-normalized star schema, consisting of a fact table and several dimension tables, and the data is represented as a data cube. Dimension tables contain descriptive attributes, while fact tables contain business facts, called measures, and foreign keys referring to primary keys in the dimension tables. Some of the dimension attributes are hierarchically connected. The dimension hierarchies compose a cube lattice, where each node corresponds to a different combination of attributes at different hierarchy levels and an edge between two nodes represents a subsumption relation between them. Hence nodes in a lattice are combinations of dimension attributes and represent OLAP queries.

Fig. 2. Star schema benchmark

For instance, consider the star schema benchmark [10] shown in Fig. 2a, with a fact table LINEORDER and four dimension tables, PART, CUSTOMER, SUPPLIER and DATE. The attributes Quantity, ExtendedPrice, OrdTotalPrice, Discount, Revenue, etc. of LINEORDER are the business facts, while CustKey, PartKey and SuppKey are foreign keys of the CUSTOMER, PART and SUPPLIER dimensions, respectively. Additionally, each dimension table contains hierarchical relationships among some of its attributes. For example, the SUPPLIER dimension contains the hierarchy City \(\rightarrow \) Nation \(\rightarrow \) Region. If we consider the interaction of the PART, CUSTOMER and SUPPLIER dimensions only (without considering their internal hierarchies), the corresponding lattice is given in Fig. 2b. In the figure, the nodes with a border are materialized and the associated tables show their tuples. Once an OLAP lattice has been generated, users can register queries and apply OLAP operations to it. Queries registered to non-materialized nodes are computed from the materialized nodes on an ad-hoc basis.
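As a small illustration (not code from the paper), lattice nodes over d dimensions can be encoded as bitmasks of the dimensions they group by, with subsumption reducing to a subset test; the sketches in Sects. 6 and 7 below assume this encoding.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// A lattice node is the subset of dimensions it groups by, encoded as a bitmask.
// Node a is subsumed by node b (a can be answered from b) when a's dimensions
// form a subset of b's, so b can be further aggregated down to a.
bool subsumedBy(uint32_t a, uint32_t b) { return (a & b) == a; }

int main() {
    const std::vector<std::string> dims = {"Supplier", "Part", "Customer"};
    const uint32_t d = static_cast<uint32_t>(dims.size());
    const uint32_t finest = (1u << d) - 1;               // (Supplier, Part, Customer)
    for (uint32_t node = 0; node < (1u << d); ++node) {  // all 2^d lattice nodes
        std::cout << "(";
        for (uint32_t i = 0; i < d; ++i)
            if (node & (1u << i)) std::cout << dims[i] << ' ';
        std::cout << ") subsumed by finest: " << subsumedBy(node, finest) << '\n';
    }
}
```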

3 Related Work

3.1 Compact Data Structures and Approximate Querying

Compact data structures have long been utilized to summarize voluminous, high-velocity data streams and answer queries over them approximately. H. Elmeleegy et al. [4] proposed two PLA-based stream compression algorithms, swing filters and slide filters, to represent a time-varying numerical signal within some preset error bound. The PLA line segments in the swing filter are connected, whereas they are mostly disconnected in the slide filter. The slide filter proposed in their work is similar to the one proposed by O'Rourke in [9].

Zhewei et al. [7] proposed sketching techniques that support historical and window queries over summarized data. The data summary is maintained using the count-min sketch and the AMS sketch, and persistence is achieved by utilizing PLAs. Their work provides persistence for counters only and supports point, heavy hitter and join size queries. [6] presented an online algorithm to optimize the representation size of the PLA for streaming time-series data. A PLA function f can be constructed using either only continuous (joint) line segments or only disjoint line segments. To optimize the size of f, the authors gave an adaptive solution that uses a mixture of joint and disjoint PLA segments, which they named mixed-type PLA.

Wavelets are another well-known technique, often used for hierarchical data decomposition and summarisation. The technique proposed in [11] can effectively perform wavelet decomposition under maximum error metrics. However, since the technique uses dynamic programming, it is computationally expensive; therefore it cannot be used effectively for data streams, which require a one-pass methodology in linear time. [12] proposed a method for one-pass wavelet synopses with the maximum error metric and showed that, by using a number of intuitive thresholding techniques, it is possible to approximate the technique discussed in [11]. However, wavelet summarization can have a number of disadvantages in many situations, as many parts of the time series may be approximated very poorly [12]. [13] used a sampling approach to answer OLAP queries approximately; however, they did not consider the lattice node materialization issue as we do. [2] compared different summarization methods on data streams and showed that PLA is the best data summarization technique as far as querying error is concerned. Hence we propose the use of PLAs to summarize the data streams in this work. None of the above works considered the use of a compact data structure for OLAP as we do in this work.

3.2 Stream OLAP and View Maintenance

OLAP has been studied intensively by database researchers. [14] presented a systematic study of the OLAP node and index-selection problem. The authors of [8] investigated the issue of node materialization when it is expensive to materialize all nodes. They presented a greedy algorithm that determines a good set of nodes to materialize. However, these works can only deal with static data.

One of the earliest works on stream OLAP was by J. Han et al. [1]. They proposed an architecture called StreamCube to facilitate OLAP for streams. In order to reduce the query response time and the storage cost, StreamCube keeps distant data at coarse granularity and very recent data at fine granularity, and pre-computes some OLAP queries at coarser, intermediate, and finer granularity levels. However, their work does not use compact data structures and therefore cannot maintain long data histories. Furthermore, older data in their work is only available at coarser granularity, limiting the range of queries.

Phantoms are intermediate queries used to accelerate user queries. Zhang et al. [15] proposed the use of phantoms to reduce the overall cost (processing and data transfer cost) within the very limited memory of a network interface card. Although their work can reduce aggregation query cost, it is not capable of answering ad-hoc OLAP queries. M. Sadoghi et al. [3] presented a lineage-based data store that combines real-time transactional and analytical processing within one engine with the help of lineage-based storage. However, their focus is the storage architecture and not the core OLAP. Ahmad et al. [16] presented viewlet transforms, which materialize a query and a set of its higher-order deltas as views, trading space for a reduced overall view maintenance cost.

In contrast to the above works, this work proposes a compact-data-structure-based stream OLAP, capable of maintaining sustained data stream summaries and answering user queries approximately with maximum error guarantees.

4 PLA-Based Sustained Storage

The PLA is a compact data structure and can be used for sustained in-memory data summaries. The term sustained in this work refers to the long data summaries that PLA can accommodate by approximating several data points with a single segment. Thus the main idea of our proposal is to store time series data points as PLA line segments for all the OLAP lattice nodes that need to be materialized, to reduce the overhead of recording the complete time series. This paper assumes that only the business facts arrive as a time-series stream, while the dimensions are not treated as streams since they are updated less frequently.

Let S be a time-series data stream consisting of tuples \((t_i, k_{1i}, k_{2i},..., k_{di}, m_i)\), where \(t_i\) is a timestamp, \(i \in [1,n]\) and \(t_i \le t_{i+1}\), \(k_{1i}, k_{2i}, ..., k_{di}\) constitute a d-dimensional key and \(m_i\) is a business fact or measure. To keep the discussion simple, this work assumes that one tuple per key combination arrives at each timestamp; however, the approach is easily extensible to the general case. Recall that PLA approximates data points using a piece-wise linear function such that the error between the approximated and the actual data point is within the user-defined error parameter \(\epsilon \). For the data points in S, we wish to approximate \(m_i\) using a piece-wise linear function \(f_{k_{1i}, ..., k_{di}}(t_i)\) such that \(|f_{k_{1i}, ..., k_{di}}(t_i)-m_i| \le \epsilon \). A PLA needs to be maintained for each d-dimensional key. This paper, like most of the previous work that discussed this problem in an online setting [4, 6, 17, 18], assumes the \(L_\infty \)-metric for the error computation. This is because other error metrics are not suitable for online algorithms, as they require the sum of errors over the entire time series.

The above approach results in a sustained PLA-based storage for each key. The number of line segments required for each PLA and the cost of computing a PLA line segment depend on the choice of the error parameter \(\epsilon \). A larger \(\epsilon \) results in fewer line segments but a larger per-segment computation cost and approximation error, and vice versa. Also note that for multiple measures, multiple PLA structures need to be maintained per key.
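A possible shape of this per-key storage, as a hedged sketch reusing `Segment` and `PlaBuilder` from the Sect. 2.1 sketch (the `PlaStore` name and layout are assumptions, not the paper's implementation): one PLA per d-dimensional key, with closed segments accumulated per key. A second measure would simply get a second `PlaStore`.

```cpp
#include <map>
#include <vector>
// Reuses Segment and PlaBuilder from the Sect. 2.1 sketch.

// d-dimensional key, e.g. {partKey, storeKey, promoKey} for Example 1.
using Key = std::vector<long>;

// Sustained PLA-based storage for one materialized node: one PLA per key.
class PlaStore {
public:
    explicit PlaStore(double eps) : eps_(eps) {}

    // Ingest one stream tuple (t, k_1..k_d, m) routed to this node.
    void append(const Key& k, double t, double m) {
        auto it = streams_.find(k);
        if (it == streams_.end())
            it = streams_.emplace(k, PerKey{PlaBuilder(eps_), {}}).first;
        if (auto seg = it->second.builder.append(t, m))
            it->second.segments.push_back(*seg);   // segment closed, retain it
    }

    const std::vector<Segment>& segments(const Key& k) const {
        return streams_.at(k).segments;
    }

private:
    struct PerKey { PlaBuilder builder; std::vector<Segment> segments; };
    double eps_;
    std::map<Key, PerKey> streams_;   // one PLA state per d-dimensional key
};
```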

5 Architecture and Query Processing

5.1 AOLAP Architecture

This section presents the proposed Approximate Stream OLAP (AOLAP) architecture, shown in Fig. 3, which enables users to obtain approximate answers to their OLAP queries. Given the dimension information and the number of lattice nodes to materialize (\(\eta \)), and utilizing the proposed optimization scheme (Sect. 7), the AOLAP system selects the \(\eta \) nodes to materialize. For each materialized node, the AOLAP system maintains a PLA structure as discussed in Sect. 4.

Fig. 3. AOLAP architecture

As the time series data arrives, the Lattice Manager calls the PLA algorithm (Algorithm 1) for each materialized node and updates the corresponding PLA structures (hereafter a materialized node is called a PLAV). In Fig. 3, lattice nodes within rectangular boundaries represent PLAVs. The node at the lowest granularity, i.e., the node (Supplier, Part, Customer) in the figure, is always materialized to enable the AOLAP system to answer all possible user queries. The raw data stream is also stored in non-volatile storage to avoid permanent data loss in case of system failure and to enable users to obtain accurate answers to their queries if needed.

In contrast to the unbounded data stream, primary memory is finite. Since users analyse recent data more frequently than old or historical data, the Storage Manager flushes old PLA segments to secondary storage once the memory limit is reached or as specified by the end user. These segments may be used to answer historical queries, avoiding the computationally expensive recomputation of results from the raw data stream. Since the data is compact, this flushing may be done periodically rather than continuously, or when the system is not overloaded by user queries. This also makes the system durable: in case of a system crash, the old segments are not permanently lost, while the very new segments, not yet flushed to secondary storage, can be reconstructed from the raw data stream available in the non-volatile storage.

5.2 Query Processing

The Query Manager in the AOLAP architecture is responsible for accepting user queries, computing the results from the PLAVs and sending the results to the end user. Since a user can query any lattice node, the results are generated using the nearest PLAV to keep the querying error small. The Lattice Manager, on request from the Query Manager, generates the query results and sends them to the Query Manager. For example, in Fig. 3, the user query (Supplier), represented by an oval boundary, can be answered using the PLAV (Supplier, Part).

OLAP queries over data streams generally involve aggregation operations over current, historical or window data. Typical OLAP aggregation operations include SUM, MAX, MIN, AVG, etc. Users may also be interested in analysing raw facts across multiple dimensions. To answer a historical window query for a key k, or any combination of keys from the d-dimensional key, over a time range \((t',t]\), we find the recorded measures \(\hat{m_i}\) for all \(t_i \in (t',t]\) as approximations of \(m_i\) and perform the requested aggregate operations to obtain \(\widehat{m_{t',t}}\). Let the average length of a PLA line segment in terms of timestamps be l; then the cost of finding a measure \(\hat{m_i}\) is \(\frac{n}{l}\), where n is the length of the stream.
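As an illustrative sketch (not the paper's implementation), a window aggregate over one key's PLA segments can be computed by scanning the segments overlapping \((t',t]\) and reconstructing \(\hat{m_i}\) at each timestamp; the linear scan matches the \(\frac{n}{l}\) cost above. Integral timestamps (minutes, as in Example 1) and the `Segment` type from Sect. 2.1 are assumed.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>
// Reuses Segment from the Sect. 2.1 sketch.

struct WindowAggregate {
    double sum = 0, avg = 0;
    double mn = std::numeric_limits<double>::max();
    double mx = std::numeric_limits<double>::lowest();
    long count = 0;
};

// Approximate SUM/AVG/MIN/MAX over (tFrom, tTo] for one key: scan segments
// (cost ~ n/l) and reconstruct m_hat = slope*t + intercept at each timestamp.
WindowAggregate aggregate(const std::vector<Segment>& segs,
                          double tFrom, double tTo) {
    WindowAggregate agg;
    for (const Segment& s : segs) {
        if (s.tEnd <= tFrom || s.tStart > tTo) continue;   // no window overlap
        double lo = std::max(s.tStart, std::floor(tFrom) + 1);  // first t > tFrom
        double hi = std::min(s.tEnd, tTo);
        for (double t = lo; t <= hi; t += 1.0) {           // one point per minute
            double m = s.slope * t + s.intercept;          // approximate measure
            agg.sum += m; ++agg.count;
            agg.mn = std::min(agg.mn, m);
            agg.mx = std::max(agg.mx, m);
        }
    }
    if (agg.count) agg.avg = agg.sum / agg.count;
    return agg;
}
```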

Table 1. Querying PLA

\(t_i\): 1, 2, 3, 4, 5, 6, 7
\(m_i\): 48, 43, 60, 75, 35, 52, 95
\(\hat{m_i}\): 41.8, 51.6, 61.4, 71.2, 30.67, 60.67, 90.67

Table 2. OLAP operations on Table 1 data

MAX: accurate 95, approximate 90.67
MIN: accurate 35, approximate 30.67
SUM: accurate 408, approximate 408.01
AVG: accurate 58.29, approximate 58.29

Example 3

Once again consider the time series and piecewise function \(f_{p_1,s_1,m_1}(t)\) of Example 2. Now we would like to query \(f_{p_1,s_1,m_1}(t)\) segments for the following OLAP aggregation operations: MAX, MIN, SUM, AVG, where \(1 \le t \le 7\).

Table 1 shows the accurate (\(m_i\)) and approximate (\(\hat{m_i}\)) measures of the Example 2 time series for the keys \(p_1\), \(s_1\) and \(m_1\). The approximate measures are obtained from the PLA-based storage. Table 2 lists the OLAP aggregation operations performed on \(\hat{m_i}\). It is interesting to note that the exact and approximate results for the OLAP operations SUM and AVG are quite similar, even though they are computed from several approximate measures. This is due to the mutual cancellation of positive and negative errors in the individual approximate values.     \(\blacksquare \)

6 Querying Error

In order to select the optimal PLAVs (nodes to materialize), an estimation of the overall querying error is needed, i.e., the aggregated querying error of all the lattice nodes; this forms the basis of our optimization problem presented in Sect. 7. In a d-dimensional lattice, there exist \(2^d\) nodes [19]. Let \(V=\{v_1,...,v_{2^d}\}\) denote the set of all the lattice nodes. The overall querying error is computed by taking into consideration the set of nodes chosen for materialization (\(V_m\)) and the number of rows in each node (\(|v_i|\)). Since the number of rows in a node is not known beforehand, it is estimated using the domain sizes of the dimension attributes.

Consider two nodes \(v_i \in V\) and \(v_j \in V_m\); then \(v_i \preceq v_j\) denotes the dependence relationship between the queried node (\(v_i\)) and the materialized node (\(v_j\)), that is, a query can be answered from a materialized node if the queried node depends on it. Since a query may be answerable from more than one materialized node, we choose the nearest node, the one minimizing the fraction \(\frac{|v_j|}{|v_i|}\), as a larger \(\frac{|v_j|}{|v_i|}\) results in amplification of the overall querying error. The overall querying error can be expressed as:

$$\begin{aligned} \epsilon \cdot \sum _{v_i \in V} \min _{v_j \in V_m \mid v_i \preceq v_j} \left( \frac{|v_j|}{|v_i|} \right) \end{aligned}$$
(1)

Note that in Eq. 1, the fraction \(\frac{|v_j|}{|v_i|}\) depends on the number of rows in the materialized and the queried nodes. By choosing the smaller fraction we choose the node \(v_j\) with the smaller number of rows; that is, we need to aggregate fewer rows to answer a query, resulting in a smaller processing time and querying error, as each row may contribute to the querying error.

7 Optimization Scheme

The PLA-based sustained storage discussed in Sect. 4 materializes a single lattice node. Since an OLAP lattice contains several nodes and, during analysis, a user may request any of them, a baseline approach is to materialize all the lattice nodes. However, the baseline approach may result in a prohibitively large number of materialized nodes (\(2^d\)), especially when the number of dimensions is high, which is extremely memory costly. In the following we propose an optimization algorithm to solve this issue. Additionally, we consider the reference frequency (\(f_i\)) of each lattice node, the frequency with which the node is queried by end users, in the computation of the querying error. Nodes or queries with low reference frequencies contribute less to the overall querying error and vice versa. Hence the overall querying error considering the reference frequencies can be expressed as:

$$\begin{aligned} \epsilon \cdot \sum _{v_i \in V} \min _{v_j \in V_m \mid v_i \preceq v_j} \left( \frac{|v_j|}{|v_i|} \cdot f_i \right) \end{aligned}$$
(2)
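Equation 2 reads directly as code. The sketch below assumes the bitmask node encoding of Sect. 2.2 and hypothetical inputs `rows` (the estimated \(|v|\) values) and `freq` (the reference frequencies); the constant factor \(\epsilon \) is dropped since it scales every candidate set equally.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Overall querying error of Eq. 2 (without the constant factor eps):
// each node v_i is answered from the materialized ancestor v_j (v_i <= v_j)
// minimizing (|v_j| / |v_i|) * f_i.
double overallError(const std::vector<double>& rows,        // estimated |v| per node
                    const std::vector<double>& freq,        // reference frequency f_i
                    const std::vector<uint32_t>& materialized) {
    double total = 0;
    for (uint32_t vi = 0; vi < rows.size(); ++vi) {
        double best = std::numeric_limits<double>::max();
        for (uint32_t vj : materialized)
            if ((vi & vj) == vi)                            // v_i is subsumed by v_j
                best = std::min(best, rows[vj] / rows[vi] * freq[vi]);
        total += best;   // always finite: the finest node is in V_m and covers all
    }
    return total;
}
```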

7.1 Optimization Problem

Choosing which lattice nodes to materialize for PLA-based stream OLAP involves a space-error trade-off in addition to the space-time trade-off of traditional OLAP. However, the focus of this work is only the space-error trade-off, which is more significant in the context of PLA-based stream OLAP and is an NP-hard problem. Hence we propose a greedy optimization algorithm to find a near-optimal solution. Here we assume that the number of nodes to materialize (\(\eta \)) is provided and the reference frequencies of the lattice nodes are known.

Optimization Problem: Given the number of nodes to materialize, \(\eta \), and the reference frequency of each lattice node, \(f =\{f_1,f_2,...,f_{2^d}\}\), materialize the nodes that minimize the overall querying error.

7.2 Greedy Optimization Algorithm

Having introduced the optimization scheme, we are ready to present the proposed optimization algorithm (Algorithm 2). The algorithm takes as input the set of lattice nodes (V), the finest node (\(v_f\)), the number of nodes to materialize (\(\eta \)), the PLA error parameter (\(\epsilon \)) and the reference frequencies (f). The algorithm outputs the set of nodes to materialize (\(V_m\)). Note that the node at the finest granularity, \(v_f\), is always materialized because it contains data at the most granular level and can therefore answer all queries; however, answering coarser-level queries from \(v_f\) amplifies the querying error. Therefore, the proposed greedy algorithm finds \(\eta \) nodes to materialize, besides \(v_f\), such that the overall querying error (Eq. 2) is minimized.

Algorithm 2. Greedy selection of the nodes to materialize

In the algorithm, the inner for loop (Lines 5–13) computes the overall querying error for each candidate node \(v_j\) in \(V \setminus V_m\) using Eq. 2 (Line 7). Lines 8–11 keep track of the current best candidate node. At the end of the inner for loop, the best candidate node is added to the set of materialized nodes (Line 14). The outer for loop executes the inner loop \(\eta \) times to select the \(\eta \) best nodes to materialize. Finally, the set of nodes to materialize, \(V_m\), is returned (Line 17).
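The following hedged C++ sketch mirrors the same greedy loop as Algorithm 2, built on `overallError` from the Sect. 6 sketch: \(\eta \) times, tentatively add each remaining candidate, evaluate Eq. 2, and keep the best.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>
// Reuses overallError from the Sect. 6 sketch.

// Greedy node selection (sketch of Algorithm 2). Assumes eta is smaller than
// the number of non-finest nodes.
std::vector<uint32_t> selectNodes(const std::vector<double>& rows,
                                  const std::vector<double>& freq,
                                  uint32_t finest, int eta) {
    std::vector<uint32_t> vm = {finest};                // v_f is always materialized
    for (int pick = 0; pick < eta; ++pick) {
        uint32_t bestNode = finest;
        double bestErr = std::numeric_limits<double>::max();
        for (uint32_t vj = 0; vj < rows.size(); ++vj) { // candidates in V \ V_m
            if (std::find(vm.begin(), vm.end(), vj) != vm.end()) continue;
            vm.push_back(vj);                           // tentatively materialize vj
            double err = overallError(rows, freq, vm);  // Eq. 2 with vj added
            vm.pop_back();
            if (err < bestErr) { bestErr = err; bestNode = vj; }
        }
        vm.push_back(bestNode);                         // commit the best candidate
    }
    return vm;
}
```

With the 32-node lattice used in Sect. 8 and small \(\eta \), this exhaustive re-evaluation is cheap; for high-dimensional lattices the inner Eq. 2 evaluation dominates the cost.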

8 Experiments

8.1 Experimental Setup

Environment: For the experiments, a prototype system implementing the AOLAP architecture was developed in C++. The experiments are performed on one node of an HP BladeSystem c7000 with an Intel Xeon (E5-2650 v3 @ 2.3 GHz) processor and 6 GB RAM running Ubuntu 14.10.

Data: We used the TPC-H benchmark, well known for OLAP analysis, for the experiments. However, its schema is modified according to the Star Schema Benchmark (SSB) [10], as shown in Fig. 2a. The LINEORDER fact table contains 6,000,000 tuples and the dimension tables PART, CUSTOMER and SUPPLIER contain 200,000, 30,000 and 2,000 tuples, respectively. We considered the following dimension hierarchies. CUSTOMER: Custkey \(\rightarrow \) Nation \(\rightarrow \) Region; SUPPLIER: Suppkey \(\rightarrow \) Nation \(\rightarrow \) Region; PART: PARTKEY, where NATION and REGION contain 25 and 5 unique tuples, respectively. The hierarchical lattice of the dimensions contains 32 nodes.

The time series is generated by identifying 10 K unique dimension key combinations in the LINEORDER fact table and feeding them repeatedly to the system. In order to avoid the repetition of fact values, only the dimension keys are fed repeatedly. The fact values repeat after every 6,000,000 tuples (the size of the fact table) and are quite non-uniform. We selected this business fact to show the usability of PLA on non-uniform data, as PLA achieves a low compression ratio on non-uniform data and a high compression ratio on uniform data. The system time is used as the time series timestamp.

Comparative Methods: To evaluate the effectiveness of the proposed optimization scheme, we compared it with the following methods: (1) Random: The lattice nodes to materialize are chosen randomly. (2) Frequency: The lattice nodes with high reference frequencies are chosen for materialization.

We used the following five ways to assign reference frequencies to lattice nodes to cover different types of use cases in various applications.

  • Rand: Frequencies are assigned randomly within [0, 1] range.

  • AllHigh: High frequencies are assigned randomly within [0.8, 1] range.

  • AllLow: Low frequencies are assigned randomly within [0, 0.2] range.

  • CoarseHigh: Higher frequencies are assigned to coarser aggregation levels.

  • FineHigh: Higher frequencies are assigned to finer aggregation levels.

8.2 Experimental Evaluation

The experimental evaluation is subdivided into measuring memory space utilization and querying error percentage. The evaluation is done for the worst-case SUM operation, i.e., we aggregated the absolute querying error values. Unless otherwise stated, the following default parameter values are used in the experiments: \(\eta \) = 6, \(\epsilon \) = 3% (the value of \(\epsilon \) is set as a percentage of the maximum value in the fact table) and frequency method = Rand. Each experiment is performed 5 times and the average values are reported in the graphs.

Fig. 4. Average memory usage for different \(\eta \)

Memory Space Utilization. To evaluate the effectiveness of the PLA, we compare the memory space consumed by the PLA-based storage to that of ordinary storage (which stores the actual data points) in Figs. 4 and 5. The storage space is measured in terms of the number of PLA segments for the PLA-based storage and the number of data points for the ordinary storage. Since a PLA segment requires twice the memory space of a data point, we divided the total number of data points by a factor of 2 to keep the comparison fair.

Fig. 5. Effect of varying \(\eta \) on space (Freq. method = Rand, \(\epsilon \) = 3%)

The average amount of memory consumed by the PLA-based storage decreases as the PLA error parameter (\(\epsilon \)) increases, as can be observed from Fig. 4. This is because, as \(\epsilon \) increases, a PLA segment can approximate a larger number of data points, reducing the number of line segments required by the PLA-based storage and hence the memory consumption. In most of the cases in Fig. 4, the memory space used by the PLA-based storage is up to 3 times less than that of the ordinary storage for \(\epsilon \) = 4% and higher. This shows that the use of PLA for the materialization of lattice nodes can significantly reduce memory consumption.

We also measured the memory space consumption by varying the number of materialized lattice nodes (\(\eta \)), as shown in Fig. 5. As \(\eta \) increases, the memory consumption of both the PLA-based storage and the ordinary storage increases, because data must be stored at an increased number of aggregation levels. However, the memory consumption of the PLA-based storage is lower than that of the ordinary storage for all \(\eta \) values. Note that we used highly non-uniform data values for the experiments, where it is difficult for the PLA algorithm to approximate a large number of data points with one line segment. For uniform time series data, for instance hourly temperature values or stock price data, the PLA-based storage is expected to be far more advantageous.

Fig. 6. Querying error percentage for different frequency allocation methods

Querying Error. This section compares the querying error of the proposed optimization scheme to that of the frequency and random methods.

First, experiments are performed for the different frequency allocation methods, as shown in Fig. 6. It is evident from Figs. 6a, b, c and d that the greedy optimization scheme selects the best \(\eta \) nodes to materialize, resulting in the least querying error; note the logarithmic scale on the y-axis. The frequency-based method gives priority to the nodes with high frequencies, while the random method chooses \(\eta \) nodes at random; both comparative methods result in higher querying error. Furthermore, in Fig. 6 the proposed optimization scheme yields similar querying error for all the reference frequency allocations, because it always chooses the nodes that minimize the querying error. The frequency method performs best for the frequency allocation approach FineHigh, because that allocation causes most of the materialized nodes to be at the middle or finer levels of the lattice. When nodes from these levels are materialized, coarser-level queries can be answered from them, avoiding the large querying error that otherwise results when coarser-level queries must be answered from the finest-level node. The random method, in contrast, behaves erratically across all frequency allocations for the reasons discussed above.

Next we perform experiments varying \(\eta \). Increasing \(\eta \) reduces the querying error, as can be observed from Fig. 7. Here again the optimization scheme performs best. Furthermore, when using the proposed optimization scheme, we do not need to materialize many nodes to obtain results with acceptable querying error. For instance, in Fig. 7, out of the total 32 nodes, materializing 9 or 12 nodes already reduces the querying error significantly, hence saving a lot of memory.

Fig. 7. Varying # materialized nodes (Freq. method = Rand, \(\epsilon \) = 2%)

Fig. 8. Varying PLA error parameter \(\epsilon \) (Freq. method = Rand, \(\eta \) = 6)

Finally, experiments are performed varying \(\epsilon \) (Fig. 8). Increasing \(\epsilon \) slightly increases the querying error, which is mainly observable in the bars of the optimization scheme and the frequency method. The random method again yields erratic querying errors for each \(\epsilon \) value due to the random selection of the lattice nodes to materialize. Moreover, the querying error increases significantly for \(\epsilon \) = 5%, because for a higher \(\epsilon \) the PLA algorithm approximates a larger number of data points with a single PLA segment, possibly with higher absolute error values, resulting in a higher querying error.

9 Conclusion and Future Work

In this work we proposed a novel architecture, Approximate Stream OLAP (AOLAP), which maintains time series data stream summaries, corresponding to each materialized lattice node, in a compact memory-based data structure, in addition to storing the raw data streams on secondary storage. We used piece-wise linear approximation as the in-memory compact data structure, which can answer user queries approximately. In addition, we proposed an optimization scheme to select the \(\eta \) lattice nodes to materialize such that the overall querying error caused by the approximation is minimized. Experiments show that the PLA-based storage can significantly reduce memory consumption at a small cost in querying error, and that the nodes selected for materialization by the optimization algorithm minimize the overall querying error. In the future we plan to extend this work to exploit the dependence relations between lattice nodes so that the number of PLA structures that need to be maintained can be further reduced.