Monitoring distributed fragmented skylines

Distributed and Parallel Databases

Abstract

Distributed skyline computation is important for a wide range of domains, from distributed and web-based systems to ISP-network monitoring and distributed databases. The problem is particularly challenging in dynamic distributed settings, where the goal is to efficiently monitor a continuous skyline query over a collection of distributed streams. All existing work relies on the assumption of a single point of reference for object attributes/dimensions: objects may be vertically or horizontally partitioned, but the accurate value of each dimension for each object is always maintained by a single site. This assumption is unrealistic for several distributed applications, where object information is fragmented over a set of distributed streams (each monitored by a different site) and needs to be aggregated (e.g., averaged) across several sites. Furthermore, it is frequently useful to define skyline dimensions through complex functions over the aggregated objects, which raises further challenges for dealing with distribution and object fragmentation. We present the first known distributed algorithms for continuous monitoring of skylines over complex functions of fragmented multi-dimensional objects. Our algorithms rely on decomposition of the skyline monitoring problem to a select set of distributed threshold-crossing queries, which can be monitored locally at each site. We propose several optimizations, including: (a) a technique for adaptively determining the most efficient monitoring strategy for each object, (b) an approximate monitoring technique, and (c) a strategy that reduces communication overhead by grouping together threshold-crossing queries. Furthermore, we discuss how our proposed algorithms can be used to address other continuous query types. 
A thorough experimental study with synthetic and real-life data sets verifies the effectiveness of our schemes and demonstrates order-of-magnitude improvements in communication costs compared to the only alternative centralized solution.

References

  1. Babcock, B., Olston, C.: Distributed top-k monitoring. In: SIGMOD, pp. 28–39 (2003)

  2. Balke, W.T., Güntzer, U., Zheng, J.X.: Efficient distributed skylining for web information systems. In: EDBT (2004)

  3. Börzsönyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: ICDE (2001)

  4. Burdakis, S., Deligiannakis, A.: Detecting outliers in sensor networks using the geometric approach. In: ICDE (2012)

  5. Cheema, M.A., Lin, X., Zhang, W., Zhang, Y.: A safe zone based approach for monitoring moving skyline queries. In: EDBT (2013)

  6. Cheng, R., Kalashnikov, D.V., Prabhakar, S.: Evaluation of probabilistic queries over imprecise data in constantly-evolving environments. Inf. Syst. 32(1), 104–130 (2007)

  7. Cormode, G., Garofalakis, M.: Approximate continuous querying over distributed streams. TODS 33(2), 1–42 (2008)

  8. Cormode, G., Garofalakis, M., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: distributed tracking of approximate quantiles. In: SIGMOD (2005)

  9. Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: Gigascope: A stream database for network applications. In: SIGMOD (2003)

  10. Cui, B., Lu, H., Xu, Q., Chen, L., Dai, Y., Zhou, Y.: Parallel distributed processing of constrained skyline queries by filtering. In: ICDE (2008)

  11. Das, A., Ganguly, S., Garofalakis, M., Rastogi, R.: Distributed set-expression cardinality estimation. In: VLDB, pp. 312–323 (2004)

  12. Graham, R., Knuth, D., Patashnik, O.: Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Upper Saddle River (1989)

  13. HadjAli, A., Pivert, O., Prade, H.: On different types of fuzzy skylines. ISMIS 2011, 581–591 (2011)

  14. Hose, K., Vlachou, A.: A survey of skyline processing in highly distributed environments. VLDB J. 21(3), 359–384 (2011)

  15. Huang, Z., Lu, H., Ooi, B.C., Tung, A.K.H.: Continuous skyline queries for moving objects. TKDE 18(12), 1645–1658 (2006)

  16. Keren, D., Sharfman, I., Schuster, A., Livne, A.: Shape sensitive geometric monitoring. TKDE 24(8), 1520–1535 (2012)

  17. Koltun, V., Papadimitriou, C.: Approximately dominating representatives. Theor. Comput. Sci. 371(3), 148–154 (2007)

  18. Lazerson, A., Sharfman, I., Keren, D., Schuster, A., Garofalakis, M.N., Samoladas, V.: Monitoring distributed streams using convex decompositions. PVLDB 8(5), 545–556 (2015)

  19. Lee, J., Hwang, S.: Scalable skyline computation using a balanced pivot selection technique. Inf. Syst. 39, 1–21 (2014)

  20. Madden, S., Franklin, M., Hellerstein, J., Hong, W.: The design of an acquisitional query processor for sensor networks. In: SIGMOD (2003)

  21. Olston, C., Jiang, J., Widom, J.: Adaptive filters for continuous queries over distributed data streams. In: SIGMOD (2003)

  22. Papadias, D., Fu, G., Chase, M., Seeger, B.: Progressive skyline computation in database systems. TODS 30(1), 41–82 (2005)

  23. Papapetrou, O., Garofalakis, M.N.: Continuous fragmented skylines over distributed streams. In: ICDE (2014)

  24. Sharfman, I., Schuster, A., Keren, D.: A geometric approach to monitoring threshold functions over distributed data streams. In: SIGMOD (2006)

  25. Tao, Y., Papadias, D.: Maintaining sliding window skylines on data streams. TKDE 18(2), 377–391 (2006)

  26. Tao, Y., Xiao, X., Pei, J.: SUBSKY: efficient computation of skylines in subspaces. In: ICDE (2006)

  27. Trimponias, G., Bartolini, I., Papadias, D., Yang, Y.: Skyline processing on distributed vertical decompositions. TKDE 25(4), 850–862 (2013). https://doi.org/10.1109/TKDE.2011.266

  28. Vlachou, A., Doulkeridis, C., Kotidis, Y., Vazirgiannis, M.: Efficient routing of subspace skyline queries over highly distributed data. TKDE 22(12), 1694–1708 (2010)

  29. Wu, P., Agrawal, D., Egecioglu, Ö., El Abbadi, A.: DeltaSky: Optimal maintenance of skyline deletions without exclusive dominance region generation. In: ICDE (2007)

  30. Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J.X., Zhang, Q.: Efficient computation of the skyline cube. In: VLDB (2005)

  31. Zhang, S., Mamoulis, N., Cheung, D.W.: Scalable skyline computation using object-based space partitioning. In: SIGMOD (2009)

  32. Zhang, Z., Cheng, R., Papadias, D., Tung, A.: Minimizing the communication cost for continuous skyline maintenance. In: SIGMOD (2009)

Author information

Corresponding author

Correspondence to Odysseas Papapetrou.

Appendix: Proofs

Theorem 1

Monitoring the Direct threshold-crossing query \(Q_{t_0}(\varvec{g}, \mathbf {v}(o_i)|\mathbf {v}(o_j), \mathbf {0})\) for object \(o_i\) at sites \({{\mathcal {S}}}={\mathcal {P}}(o_i)\cup {\mathcal {P}}(o_j)\) is provably less communication-efficient than monitoring the corresponding Pivot threshold query \(Q_{t_0}(\varvec{f}, \mathbf {v}(o_i), \overrightarrow{pp}_{i,j})\), when all functions in \(\varvec{f}\) are linear, and \(r=\frac{|{\mathcal {S}}|}{|{\mathcal {P}}(o_i)|}>2\).

Proof

We will use \(Q^p\) to denote the threshold-crossing query between objects \(o_i\) and \(o_j\) monitored by Pivot, and \(Q^d\) the query monitored by Direct. We will show that when both queries are instantiated with the same data, i.e., with identical object values at time \(t_0\), the minimum required update \(\mathbf {u}_d\) of \(o_i\) that will cause a threshold crossing on \(Q^d\) is smaller than the corresponding minimum required update \(\mathbf {u}_p\) for \(Q^p\). Therefore, \(Q^d\) will be violated more frequently, causing more synchronizations. For simplicity, we examine only the case for a function vector \(\varvec{f}\) where all constituting functions are linear, and we focus only on object \(o_i\), i.e., we consider \(o_j\) to be stationary on the node receiving the update of \(o_i\). This can happen, e.g., when the node p monitoring \(o_i\) does not monitor \(o_j\), or when it did not receive any update for \(o_j\) since the last synchronization.

Consider any node \(p\in {\mathcal {P}}(o_i)\) receiving an update \(\mathbf {u}\) for \(o_i\) at time t. This update will cause a threshold crossing for \(Q^p\) only if \(\mathop {\mathrm {sgn}}(\varvec{f}(\mathbf {v}(o_i,t)) - \tau ) \ne \mathop {\mathrm {sgn}}(\varvec{f}(\mathbf {v}(o_i,t_0)) - \tau )\), with \(\tau =(\varvec{f}(\mathbf {v}(o_i,t_0))+\varvec{f}(\mathbf {v}(o_j,t_0)))/2\). Since \(\varvec{f}\) is linear, \(\varvec{f}(\mathbf {v}(o_i,t))=\varvec{f}(\mathbf {v}(o_i,t_0)+\mathbf {u})= \varvec{f}(\mathbf {v}(o_i,t_0))+\varvec{f}(\mathbf {u})\).

Recall that \(\varvec{f}\) is a function vector. We need to consider each dimension k of \(\varvec{f}\) separately. A threshold crossing due to dimension k will occur when \(\mathop {\mathrm {sgn}}(\varvec{f}(\mathbf {v}(o_i,t_0))[k]+\varvec{f}(\mathbf {u})[k] - \tau [k]) \ne \mathop {\mathrm {sgn}}(\varvec{f}(\mathbf {v}(o_i,t_0))[k] - \tau [k] )\). Without loss of generality, assume that \(\varvec{f}(\mathbf {v}(o_i,t_0))[k]<\varvec{f}(\mathbf {v}(o_j,t_0))[k]\) (the other case is symmetric). Then, \(\tau [k] > \varvec{f}(\mathbf {v}(o_i,t_0))[k]\), and threshold crossing on \(Q^p\) can occur only when \(\varvec{f}(\mathbf {u})[k]\) surpasses \(\tau [k] - \varvec{f}(\mathbf {v}(o_i,t_0))[k]\), i.e., \(\varvec{f}(\mathbf {u}_p)[k]> \frac{\varvec{f}(\mathbf {v}(o_j,t_0))[k]-\varvec{f}(\mathbf {v}(o_i,t_0))[k]}{2}\).
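The per-dimension sign test for the Pivot query can be illustrated with a minimal sketch. It assumes that the values \(\varvec{f}(\mathbf {v}(o_i,t_0))\), \(\varvec{f}(\mathbf {v}(o_j,t_0))\) and \(\varvec{f}(\mathbf {u})\) are already available as plain lists of numbers, and it does not model how updates are distributed or balanced across sites. The function names `sgn`, `pivot_threshold` and `pivot_crossed_dims` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def sgn(x: float) -> int:
    """Sign function: -1, 0 or +1."""
    return (x > 0) - (x < 0)


def pivot_threshold(f_i0: Sequence[float], f_j0: Sequence[float]) -> List[float]:
    """tau = (f(v(o_i, t0)) + f(v(o_j, t0))) / 2, computed per dimension."""
    return [(a + b) / 2 for a, b in zip(f_i0, f_j0)]


def pivot_crossed_dims(
    f_i0: Sequence[float], f_j0: Sequence[float], f_u: Sequence[float]
) -> List[int]:
    """Dimensions k where adding f(u)[k] flips the sign of f(v(o_i))[k] - tau[k].

    Relies on linearity of f: f(v + u) = f(v) + f(u).
    """
    tau = pivot_threshold(f_i0, f_j0)
    crossed = []
    for k, (a, t, du) in enumerate(zip(f_i0, tau, f_u)):
        if sgn(a + du - t) != sgn(a - t):
            crossed.append(k)
    return crossed


# Illustrative values only: o_i lies below o_j in both dimensions.
f_i0, f_j0 = [1.0, 4.0], [3.0, 10.0]
print(pivot_threshold(f_i0, f_j0))                 # [2.0, 7.0]
print(pivot_crossed_dims(f_i0, f_j0, [0.5, 3.5]))  # [1]
```

In this example the update stays below half the gap in dimension 0 (0.5 versus 1.0) but exceeds it in dimension 1 (3.5 versus 3.0), so only dimension 1 reports a crossing.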

Now consider the case of Direct. \(Q^d\) will be violated in dimension k when

$$\begin{aligned} &\mathop {\mathrm {sgn}}(\varvec{f}(\mathbf {v}(o_i,t))[k]-\varvec{f}(\mathbf {v}(o_j,t))[k]) \\ &\quad \ne \mathop {\mathrm {sgn}}(\varvec{f}(\mathbf {v}(o_i,t_0))[k]-\varvec{f}(\mathbf {v}(o_j,t_0))[k]) \end{aligned}$$
(3)

By our assumption that \(\varvec{f}(\mathbf {v}(o_i,t_0))[k]<\varvec{f}(\mathbf {v}(o_j,t_0))[k]\), we know that \(\mathop {\mathrm {sgn}}(\varvec{f}(\mathbf {v}(o_i,t_0))[k]-\varvec{f}(\mathbf {v}(o_j,t_0))[k])=-1\). Therefore, a threshold crossing will be caused only when the LHS of Eq. 3 becomes positive:

$$\begin{aligned} &\mathop {\mathrm {sgn}}(\varvec{f}(\mathbf {v}(o_i,t))[k]-\varvec{f}(\mathbf {v}(o_j,t))[k])=+1 \\ &\quad \Rightarrow \varvec{f}(\mathbf {v}(o_i,t))[k]-\varvec{f}(\mathbf {v}(o_j,t))[k]>0 \end{aligned}$$
(4)

As discussed in the paper, to account for the fact that \(o_i\) is not monitored by all nodes, we need to scale the local statistics drift vector for \(o_i\) by \(r=|{\mathcal {S}}|/|{\mathcal {P}}(o_i)|\). Since \(\varvec{f}\) is linear, \(\varvec{f}(\mathbf {v}(o_i,t))=\varvec{f}(\mathbf {v}(o_i,t_0)+r\mathbf {u})=\varvec{f}(\mathbf {v}(o_i,t_0))+r\varvec{f}(\mathbf {u})\). Substituting \(\varvec{f}\) in Eq. 4, and since \(\mathbf {v}(o_j,t)=\mathbf {v}(o_j,t_0)\), we get \(\varvec{f}(\mathbf {v}(o_i,t_0))[k]+r\varvec{f}(\mathbf {u})[k] -\varvec{f}(\mathbf {v}(o_j,t_0))[k] > 0\). Therefore, the condition for threshold crossing becomes \(\varvec{f}(\mathbf {u}_d)[k] > \frac{\varvec{f}(\mathbf {v}(o_j,t_0))[k] -\varvec{f}(\mathbf {v}(o_i,t_0))[k]}{r}\). Thus, if \(r>2\), for all dimensions \(k\), the minimum \(\varvec{f}(\mathbf {u}_d)[k]\) will be smaller than the minimum \(\varvec{f}(\mathbf {u}_p)[k]\), since the same positive gap is divided by \(r\) rather than by 2. This directly implies that \(Q^d\) will be violated by a smaller-magnitude update. \(\square \)
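The comparison at the heart of the proof can be checked numerically with a small sketch. It assumes linear functions, a stationary \(o_j\), and function values supplied as plain lists; the absolute value covers the symmetric case mentioned in the proof. The names `pivot_min_drift` and `direct_min_drift` are hypothetical, introduced only for this illustration.

```python
from typing import List, Sequence


def pivot_min_drift(f_i0: Sequence[float], f_j0: Sequence[float]) -> List[float]:
    """Per-dimension bound that f(u)[k] must exceed to violate the Pivot query:
    half of the gap between the two objects."""
    return [abs(b - a) / 2 for a, b in zip(f_i0, f_j0)]


def direct_min_drift(
    f_i0: Sequence[float], f_j0: Sequence[float], r: float
) -> List[float]:
    """Per-dimension bound that f(u)[k] must exceed to violate the Direct query,
    where the local drift is scaled by r = |S| / |P(o_i)|."""
    if r <= 0:
        raise ValueError("r must be positive")
    return [abs(b - a) / r for a, b in zip(f_i0, f_j0)]


# Illustrative values only: |S| = 8 sites, |P(o_i)| = 2 sites, hence r = 4.
f_i0, f_j0, r = [1.0, 4.0], [3.0, 10.0], 8 / 2
pivot = pivot_min_drift(f_i0, f_j0)        # [1.0, 3.0]
direct = direct_min_drift(f_i0, f_j0, r)   # [0.5, 1.5]
print(all(d < p for d, p in zip(direct, pivot)))  # True
```

With \(r=2\) the two bounds coincide, and with \(r<2\) the Direct bound becomes the larger one, which is why the theorem requires \(r>2\).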

Theorem 2

The extracted threshold queries are sufficient for accurate fragmented skyline monitoring, i.e., as long as no threshold violation occurs, the skyline is guaranteed to stay the same. They are also minimal, in the sense that omitting any of the queries breaks the correctness guarantees.

Proof

We will prove that the threshold queries are sufficient for detecting whenever an object changes status, i.e., enters or leaves the skyline. The proof is valid for both Pivot and Direct. First, we consider the simpler case of an object \(o_i\) not belonging in the skyline at time \(t_0\), to show that it cannot enter the skyline without first causing a threshold violation. For \(o_i\), the algorithm constructs a threshold crossing query between \(o_i\) and an object \(o_j\) that dominates \(o_i\). As long as the threshold query is not violated by an update of either \(o_i\) or \(o_j\), \(o_j\) continues to dominate \(o_i\), which guarantees that \(o_i\) does not enter the skyline.

Second, we consider an object \(o_i\) that belongs in the skyline at time \(t_0\). We will prove that \(o_i\) cannot be removed from the skyline without first causing a threshold violation, which will enable the coordinator to detect the change in the skyline. \(o_i\) can be removed from the skyline only due to an update of \(o_i\) itself or an update of some other object \(o_j\), after which \(o_j\) dominates \(o_i\). We have the following cases:

  • Object \(o_j\), which did not belong in the skyline at time \(t_0\), is updated and now dominates \(o_i\). Since \(o_j\) dominates an object that was previously a skyline object, \(o_j\) must first become part of the skyline itself. This corresponds to the case addressed earlier: it causes a violation of the threshold query that monitors \(o_j\), enabling the coordinator to detect the skyline update.

  • Object \(o_j\), which belonged in the skyline at time \(t_0\), is updated and now dominates \(o_i\). Since at time \(t_0\) object \(o_j\) did not dominate \(o_i\), there existed at least one dimension k for which \(\varvec{f}(\mathbf {v}(o_i,t_0))[k]<\varvec{f}(\mathbf {v}(o_j,t_0))[k]\). Also, since \(o_j\) also belonged in the skyline, our monitoring algorithm constructed a threshold query between \(o_j\) and its immediate skyline neighbor \(o_h\) at dimension k that satisfies \(\varvec{f}(\mathbf {v}(o_h,t_0))[k]<\varvec{f}(\mathbf {v}(o_j,t_0))[k]\). There are two cases: (a) \(o_h\) is the object \(o_i\), in which case the corresponding threshold query will be violated, or (b) \(o_h\) is not \(o_i\), in which case \(\varvec{f}(\mathbf {v}(o_i,t_0))[k]<\varvec{f}(\mathbf {v}(o_h,t_0))[k]\) (by the definition of \(o_h\)), and the threshold query of \(o_j\) corresponding to \(o_h\) will be violated. In both cases, the violation will cause synchronization, which will enable the coordinator to detect the change in the skyline. (Note that \(o_h\) will also be monitoring its nearest dominating neighbor in the skyline (say, \(o_l\)) so that, if \(o_l\) at some point takes the position of the nearest neighbor of \(o_j\), then \(o_h\) would fire; in general, it is not difficult to see that some monitoring rule will fire if the nearest skyline neighbor of \(o_j\) changes, so we can assume that the monitored nearest skyline neighbor is always current.)

We also need to prove that all constructed threshold queries are required for correctly monitoring the skyline. Again, we need to consider the two types of queries separately.

  • Queries monitoring domination of a non-skyline object: Recall that only one query is constructed. If this query is removed for any non-skyline object \(o_i\), then the algorithm will not be able to track the location of \(o_i\), possibly masking skyline updates.

  • Queries monitoring dominance of a skyline object: Two queries are constructed per dimension, with the two immediate skyline neighbors. If any of these queries is removed for a skyline object \(o_i\), we will no longer be able to track the location of the object in the corresponding dimension, possibly masking skyline updates.

\(\square \)
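The two families of queries discussed in the proof can be enumerated with a rough, centralized sketch. It rests on several assumptions that are not stated in this appendix: smaller values are preferred in every dimension, ties are broken by object identifier, the dominating object chosen for a non-skyline object is simply the first one found, and each neighbor pair is stored once rather than once per endpoint. The names `dominates`, `skyline` and `extract_queries` are hypothetical, and the actual Pivot or Direct thresholds attached to each pair are not modeled.

```python
from typing import Dict, Optional, Sequence, Set, Tuple

Query = Tuple[str, str, Optional[int]]  # (object, reference object, dimension or None)


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse in every dimension, strictly better in one
    (assumption: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(objs: Dict[str, Sequence[float]]) -> Set[str]:
    """Identifiers of objects not dominated by any other object."""
    return {
        i for i, p in objs.items()
        if not any(dominates(q, p) for j, q in objs.items() if j != i)
    }


def extract_queries(objs: Dict[str, Sequence[float]]) -> Tuple[Set[str], Set[Query]]:
    """One query per non-skyline object (against a dominating object), plus one
    query per pair of adjacent skyline objects in each dimension."""
    sky = skyline(objs)
    queries: Set[Query] = set()
    if not objs:
        return sky, queries
    for i in sorted(objs):
        if i not in sky:
            dominator = next(j for j in sorted(objs)
                             if j != i and dominates(objs[j], objs[i]))
            queries.add((i, dominator, None))
    dims = len(next(iter(objs.values())))
    for k in range(dims):
        order = sorted(sky, key=lambda i: (objs[i][k], i))
        for a, b in zip(order, order[1:]):
            queries.add((a, b, k))  # immediate skyline neighbors in dimension k
    return sky, queries


# Illustrative data only: d is dominated by b; a, b and c form the skyline.
objs = {"a": (1, 5), "b": (2, 3), "c": (4, 1), "d": (3, 4)}
sky, queries = extract_queries(objs)
print(sorted(sky))   # ['a', 'b', 'c']
print(len(queries))  # 5
```

Here the non-skyline object `d` contributes a single query against `b`, while the three skyline objects contribute two neighbor pairs in each of the two dimensions.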

Cite this article

Papapetrou, O., Garofalakis, M. Monitoring distributed fragmented skylines. Distrib Parallel Databases 36, 675–715 (2018). https://doi.org/10.1007/s10619-018-7223-7

  • DOI: https://doi.org/10.1007/s10619-018-7223-7
