A probabilistic data model and algebra for location-based data warehouses and their implementation

Abstract

This paper proposes a novel, probabilistic data model and algebra that improves the modeling and querying of uncertain data in spatial OLAP (SOLAP) to support location-based services. Data warehouses that support location-based services need to combine complex hierarchies, such as road networks or transportation infrastructures, with static and dynamic content, e.g., speed limits and vehicle positions, respectively. Both the hierarchies and the content are often uncertain in real-world applications. Our model supports the use of probability distributions within both facts and dimensions. We give an algebra that correctly aggregates uncertain data over uncertain hierarchies. This paper also describes an implementation of the model and algebra, gives a complexity analysis of the algebra, and reports on an empirical, experimental evaluation of the implementation. The work is motivated by a real-world case study, based on our collaboration with a leading Danish vendor of location-based services.

Notes

  1. A note on terminology: This paper uses the term “(un)certainty” for an inherent property of the data. The term “probability(-ies)” refers to the mathematical construct used to model the (un)certainty of the given data. Finally, the term “confidence” refers to the level of trust that the user attributes to a query result.

  2. Since the User and Time dimensions are certain, using \(\mathit{GROUP}_d(a, v_{ms}, t, [1; 1], [0; 1], [1; 1])\) instead of \(\mathit{GROUP}_l\) will yield the same result.

References

  1. Kothuri R, Godfrind A, Beinat E (2004) Pro Oracle Spatial. Apress

  2. Harinath S, Quinn SR (2005) Professional SQL Server Analysis Services 2005 with MDX. Wrox

  3. Kimball R, Reeves L, Ross M, Thornthwaite W (1998) The data warehouse lifecycle toolkit. Wiley

  4. Han J (2008) OLAP, spatial. In: Encyclopedia of GIS, pp 809–812

  5. Pedersen TB, Jensen CS, Dyreson CE (2001) A foundation for capturing and querying complex multidimensional data. Inf Syst 26(5):383–423

  6. Dyreson CE, Pedersen TB, Jensen CS (2003) Incomplete information in multidimensional databases. In: Multidimensional databases, pp 282–309

  7. Parker A, Subrahmanian VS, Grant J (2007) A logical formulation of probabilistic spatial databases. IEEE Trans Knowl Data Eng 19(11):1541–1556

  8. Euman a/s (2008). http://www.euman.com

  9. Dyreson CE (1996) A bibliography on uncertainty management in information systems. In: Uncertainty management in information systems, pp 415–458

  10. Date CJ (1986) Null values in database management, ch. 15. Addison-Wesley, Reading, MA, pp 313–334

  11. Grant J (1980) Incomplete information in a relational database. Fundam Math III(3):363–378

  12. Bosc P, Galibourg M, Hamon G (1988) Fuzzy querying with SQL: extensions and implementation aspects. Fuzzy Set Syst 28:333–349

  13. Barbará D, García-Molina H, Porter D (1992) The management of probabilistic data. IEEE Trans Knowl Data Eng 4(5):487–502

  14. Dyreson CE, Pedersen TB, Jensen CS (2003) Incomplete information in multidimensional databases. In: Multidimensional databases, pp 282–309

  15. Benjelloun O, Sarma AD, Halevy AY, Theobald M, Widom J (2008) Databases with uncertainty and lineage. VLDB J 17(2):243–264

  16. Cheng R, Singh S, Prabhakar S, Shah R, Vitter JS, Xia Y (2006) Efficient join processing over uncertain data. In: CIKM, pp 738–747

  17. Cavallo R, Pittarelli M (1987) The theory of probabilistic databases. In: VLDB, pp 71–81

  18. Gelenbe E, Hébrail G (1986) A probability model of uncertainty in data bases. In: ICDE, pp 328–333

  19. Dalvi NN, Suciu D (2007) Efficient query evaluation on probabilistic databases. VLDB J 16(4):523–544

  20. Papadias D, Zhang J, Mamoulis N, Tao Y (2003) Query processing in spatial network databases. In: VLDB, pp 802–813

  21. Saltenis S, Jensen CS, Leutenegger ST, Lopez MA (2000) Indexing the positions of continuously moving objects. In: SIGMOD conference, pp 331–342

  22. Sun J, Papadias D, Tao Y, Liu B (2004) Querying about the past, the present, and the future in spatio-temporal databases. In: ICDE, pp 202–213

  23. Trajcevski G, Wolfson O, Zhang F, Chamberlain S (2002) The geometry of uncertainty in moving objects databases. In: EDBT, pp 233–250

  24. Trajcevski G (2003) Probabilistic range queries in moving objects databases with uncertainty. In: MobiDE, pp 39–45

  25. Cheng R, Kalashnikov DV, Prabhakar S (2004) Querying imprecise data in moving object environments. IEEE Trans Knowl Data Eng 16(9):1112–1127

  26. Cheng R, Kalashnikov DV, Prabhakar S (2003) Evaluating probabilistic queries over imprecise data. In: SIGMOD Conference, pp 551–562

  27. Pedersen TB, Tryfona N (2001) Pre-aggregation in spatial data warehouses. In: SSTD, pp 460–480

  28. Tao Y, Kollios G, Considine J, Li F, Papadias D (2004) Spatio-temporal aggregation using sketches. In: ICDE, pp 214–226

  29. Zhang D, Tsotras VJ, Gunopulos D (2002) Efficient aggregation over objects with extent. In: PODS, pp 121–132

  30. Zhang D, Gunopulos D, Tsotras VJ, Seeger B (2003) Temporal and spatio-temporal aggregations over data streams using multiple time granularities. Inf Syst 28(1–2):61–84

  31. Güting RH, Böhlen MH, Erwig M, Jensen CS, Lorentzos NA, Schneider M, Vazirgiannis M (2000) A foundation for representing and querying moving objects. ACM Trans Database Syst 25(1):1–42

  32. NCHRP (1997) A generic data model for linear referencing systems. Transportation Research Board, Washington, DC

  33. Speicys L, Jensen CS (2008) Enabling location-based services—multi-graph representation of transportation networks. Geoinformatica 12(2):219–253

  34. Hage C, Jensen CS, Pedersen TB, Speicys L, Timko I (2003) Integrated data management for mobile services in the real world. In: VLDB, pp 1019–1030

  35. Jensen CS, Kligys A, Pedersen TB, Timko I (2004) Multidimensional data modeling for location-based services. VLDB J 13(1):1–21

  36. Timko I, Pedersen TB (2004) Capturing complex multidimensional data in location-based data warehouses. In: GIS, pp 147–156

  37. Burdick D, Deshpande PM, Jayram TS, Ramakrishnan R, Vaithyanathan S (2007) OLAP over uncertain and imprecise data. VLDB J 16(1):123–144

  38. Jarke M, Lenzerini M, Vassiliou Y, Vassiliadis P (2003) Fundamentals of data warehouses. Springer, Heidelberg

  39. Vaisman AA, Zimányi E (2009) What is spatio-temporal data warehousing? In: DaWaK, pp 9–23

  40. Malinowski E, Zimányi E (2008) Advanced data warehouse design: from conventional to spatial and temporal applications. Springer, Heidelberg

  41. Escribano A, Gómez LI, Kuijpers B, Vaisman AA (2007) Piet: a GIS-OLAP implementation. In: DOLAP, pp 73–80

  42. Gómez LI, Haesevoets S, Kuijpers B, Vaisman AA (2009) Spatial aggregation: data model and implementation. Inf Syst 34(6):551–576

  43. Gómez LI, Kuijpers B, Vaisman AA (2011) A data model and query language for spatio-temporal decision support. Geoinformatica 15(3):455–496

  44. Gómez LI, Gómez SA, Vaisman AA (2012) A generic data model and query language for spatiotemporal OLAP cube analysis. In: EDBT, pp 300–311

  45. Malinowski E, Zimányi E (2007) Logical representation of a conceptual model for spatial data warehouses. Geoinformatica 11(4):431–457

  46. Pourabbas E (2003) Cooperation with geographic databases. In: Multidimensional databases, pp 393–432

  47. da Silva J, Vera ASC, de Oliveira AG, do Nascimento Fidalgo R, Salgado AC, Times VC (2007) Querying geographical data warehouses with GeoMDQL. In: SBBD, pp 223–237

  48. da Silva J, Times VC, Salgado AC (2006) An open source and web based framework for geographic and multidimensional processing. In: SAC, pp 63–67

  49. Bimonte S, Miquel M (2010) When spatial analysis meets OLAP: multidimensional model and operators. IJDWM 6(4):33–60

  50. Bimonte S, Bertolotto M, Gensel J, Boussaid O (2012) Spatial OLAP and map generalization: model and algebra. IJDWM 8(1):24–51

  51. Viswanathan G, Schneider M (2011) OLAP formulations for supporting complex spatial objects in data warehouses. In: DaWaK, pp 39–50

  52. Xu J, Güting RH (2013) A generic data model for moving objects. Geoinformatica 17(1):125–172

  53. Timko I, Dyreson CE, Pedersen TB (2005) Probabilistic data modeling and querying for location-based data warehouses. In: SSDBM, pp 273–282

  54. DeGroot MH, Schervish MJ (2002) Probability and statistics. Addison-Wesley, Reading

  55. Klug AC (1982) Equivalence of relational algebra and relational calculus query languages having aggregate functions. J ACM 29(3):699–717

  56. Abiteboul S, Kanellakis PC, Grahne G (1987) On the representation and querying of sets of possible worlds. In: SIGMOD Conference, pp 34–48

  57. Oracle Berkeley DB (2008). http://www.oracle.com/technology/products/berkeley-db/index.html

  58. Dyreson CE (1996) Information retrieval from an incomplete data cube. In: VLDB, pp 532–543

  59. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn. The MIT Press, Cambridge

  60. Brinkhoff T (2002) A framework for generating network-based moving objects. Geoinformatica 6(2):153–180

  61. Burdick D, Doan A, Ramakrishnan R, Vaithyanathan S (2007) OLAP over imprecise data with domain constraints. In: VLDB, pp 39–50

  62. Martino SD, Bimonte S, Bertolotto M, Ferrucci F (2009) Integrating Google Earth within OLAP tools for multidimensional exploration and analysis of spatial data. In: ICEIS, pp 940–951

  63. Silva R, Moura-Pires J, Santos MY (2011) Spatial clustering to uncluttering map visualization in SOLAP. In: ICCSA, vol 1, pp 253–268

  64. Dyreson CE, Florez OU (2011) Building a display of missing information in a data sieve. In: DOLAP, pp 53–60

Acknowledgement

This work is supported in part by the BagTrack project funded by the Danish National Advanced Technology Foundation under grant no. 010-2011-1.

Author information

Corresponding author

Correspondence to Curtis Dyreson.

A Electronic Appendix

This is an electronic appendix that will accompany the paper in the journal’s digital library, but is not a part of the paper.

1.1 A.1 Expected degrees of containment

In this appendix, we introduce a new interpretation for degrees of containment. The motivation for the new interpretation is as follows. Assume that we are given a dimension D with its set of categories and the containment relation \(\sqsubset\) on its dimension values. As mentioned in Section 3, with the safe degrees of containment, the notation \(v_1 \sqsubset_{d} v_2\), where \(v_1 \in \widehat{D}\), \(v_2 \in \widehat{D}\), and d ∈ [0; 1], means that the value \(v_2\) contains at least d · 100 % of the value \(v_1\). The disadvantage of this approach is that inferred, transitive relationships between dimension values are very likely to receive a degree of containment equal to 0, because we infer only those degrees that we can guarantee. This makes the data too uncertain for practical use.

In order to make the transitive relationships more useful, we introduce the expected degrees of containment. Our approach is based on probability theory [54]. We consider each dimension value as an infinite set of points. We deal with the probabilistic events of the form “any point in \(v_1\) is contained in \(v_2\)”.

Definition A.1

(Expected degree of containment) Given two dimension values, \(v_1\) and \(v_2\), and a number, d ∈ [0; 1], the notation \(v_1 \sqsubset v_2 \wedge \mathit{Deg}_{\rm exp}(v_1, v_2) = d\) (or \(v_{1} \sqsubset_{d} v_{2}\), for short) means that “\(v_2\) is expected to contain d · 100 % of \(v_1\)”, or, more formally, “any point in \(v_1\) is contained in \(v_2\) with a probability of d”. We term d the expected degree of containment.

Next, we define a rule for inferring non-immediate relationships between dimension values with expected degrees.

Definition A.2

(Transitivity of partial containment with expected degrees) The rule of transitivity of partial containment with expected degrees is defined as follows:

$$ \begin{array}{rll} &&\forall(v, v_1, \ldots, v_n, v^{\prime}) \in \widehat{D} \times \ldots \times \widehat{D}\\[3pt] &&{\kern12pt}\left(\bigwedge\limits_{i=1}^n \left(v \sqsubset_{d_{i}} v_i \wedge v_i \sqsubset_{d_i^{\prime}} v^{\prime}\right) \Rightarrow v \sqsubset v^{\prime} \wedge \mathit{Deg}_{\rm exp}\left(v, v^{\prime}\right) = \sum\limits_{i=1}^n d_i \cdot d_i^{\prime}\right) \end{array} $$

The idea behind the rule from Definition A.2 is explained next. We will use the notation P(e) for the probability of the event e. Let us first consider a special case of the rule, when n = 1 (i.e., when there is a single, unique path between the values v and v′). Then, the rule takes the following form:

$$ \forall \left(v, v_1, v^{\prime}\right) \in \widehat{D} \times \widehat{D} \times \widehat{D} \left(v \sqsubset_{d_{1}} v_1 \wedge v_1 \sqsubset_{d_1^{\prime}} v^{\prime} \Rightarrow v \sqsubset_{d_1 \cdot d_1^{\prime}} v^{\prime}\right) $$

First, \(v \sqsubset_{d_1} v_1\) means that \(P(e) = d_1\), where e is “any point in v is contained in \(v_1\)”. Second, \(v_1 \sqsubset_{d_1^{\prime}} v^{\prime}\) means that \(P(e^{\prime}) = d_1^{\prime}\), where e′ is “any point in \(v_1\) is contained in v′”. The conjunction of these two events, e ∧ e′ (i.e., “any point in v is contained in v′”), is equivalent to \(v \sqsubset v^{\prime}\). Next, having assumed that the events e and e′ are independent, \(P(e \wedge e^{\prime})=d_1 \cdot d_1^{\prime}\). This means that we have inferred the relationship \(v \sqsubset_{d_1 \cdot d_1^{\prime}} v^{\prime}\).

The general case of the rule allows n paths between v and v′, where the ith path goes through a value \(v_i\). Then, the event e (i.e., “any point in v is contained in v′”) is a disjunction of n disjoint events, \(\bigvee_{i=1}^n e_i\), where \(e_i\) is “any point in v is contained in v′, given the ith path”. The events \(e_1, e_2, \ldots, e_n\) are disjoint, because (1) we assume that values from the same category, in particular the values \(v_1, v_2, \ldots, v_n\), do not overlap, and (2) consequently, the n events “any point in v is contained in \(v_i\)” are disjoint. Thus, the general case of the rule is n applications of the rule’s special case. The ith application concerns the ith path and infers the probability \(d_i \cdot d_i^{\prime}\) of the event \(e_i\). This means that the event e has the probability \(d = \mathit{Deg}_{\rm exp}(v, v^{\prime}) = \sum_{i=1}^n d_i \cdot d_i^{\prime}\) (i.e., there is a relationship \(v \sqsubset_d v^{\prime}\)).

The aggregation process must perform correct aggregation, which intuitively means that the aggregate results should be correct with respect to the underlying facts, i.e., include the correct fraction (as specified by the hierarchy) of a value of a particular fact in a particular part of an aggregate result. Thus, no over- or under-counting of facts should occur in the aggregation process. To ensure this, a warehouse must consider all relevant aggregation paths between the source and destination categories. Since no aggregation path is ignored during inferences of transitive partial containment relationships with expected degrees, the rule offers support for correct aggregation, which is missing from the analogous rule with safe degrees. Further support is offered by the rules for inferring fact characterizations (see Section 4). The example below illustrates the need to consider all aggregation paths in order to achieve correct aggregation.

Example A.1

We show how to infer transitive partial containment relationships with expected degrees using the hierarchy of Fig. 10.

Fig. 10 Dimension type \(\mathcal{T}_{r}\) from Fig. 3 (part) and its instance

First, we demonstrate the support for correct aggregation. In the subdimension \(D_p\), we have values \(e_1 \in C_{\mathrm{Interval}}\), \(e_2 \in C_{\mathrm{Interval}}\), and \(e \in C_{\mathrm{Scope}}\) such that \(e_1 \sqsubset_1 e\) and \(e_2 \sqsubset_1 e\). Then, in the subdimension \(D_g\), we have a value \(p_1 \in C_{\mathrm{Poly\_3}}\) such that \(p_1 \sqsubset_{0.3} e_1\) and \(p_1 \sqsubset_{0.7} e_2\). In other words, we have two aggregation paths between the values \(p_1\) and e. Consequently, by “summing up” the paths, we infer that \(p_1 \sqsubset_{0.3\cdot 1 + 0.7 \cdot 1 = 1} e\).

Second, we demonstrate the improvement in certainty of transitive relationships, compared to those obtained by the rule with safe degrees. In the subdimension \(D_l\), we have a value \(a_1 \in C_{\mathrm{Link}}\) such that \(e \sqsubset_{0.8} a_1\). Consequently, we infer that \(p_1 \sqsubset_{1 \cdot 0.8 = 0.8} a_1\). Note that the last relationship would have received a (much lower) safe degree of 0, by p-to-p transitivity.

The set of fact-dimension relationships is stored in the LBDW and the probabilistic fact characterizations are inferred when needed. For the inference, the warehouse uses the rules described next, in Sections A.1.1 and A.1.2. In essence, the rules provide a recursive definition of the notion of probabilistic fact characterization. The rules are valid with expected degrees of containment.

1.1.1 A.1.1 Basic rules

The three basic rules for inferring fact characterizations for all \((f,v) \in F \times \widehat{D}\) are given below.

  1. If a fact, f, is attached to and covers a dimension value, v, with probability \([p_{\rm min}; p_{\rm max}]\), then we can infer that f covers v with probability \([p_{\rm min}; p_{\rm max}]\).

     $$(f,v, p_{\rm min}, p_{\rm max}) \in R^{c,p} \Rightarrow f \rightsquigarrow^{c}_{[p_{\rm min}; p_{\rm max}]} v$$
  2. If a fact, f, is attached to and inside a dimension value, v, with probability \([p_{\rm min}; p_{\rm max}]\), then we can infer that f is inside v with probability \([p_{\rm min}; p_{\rm max}]\).

     $$(f,v, p_{\rm min}, p_{\rm max}) \in R^{i,p} \Rightarrow f \rightsquigarrow^{i}_{[p_{\rm min}; p_{\rm max}]} v$$
  3. If a fact, f, covers a dimension value, v, with probability \([p_{\rm min}; p_{\rm max}]\), then f is also inside v with the same probability. The idea behind the rule is as follows: if a piece of content covers a dimension value with some probability, then it is also inside it with at least that probability.

     $$f \rightsquigarrow^{c}_{[p_{\rm min}; p_{\rm max}]} v \Rightarrow f \rightsquigarrow^{i}_{[p_{\rm min}; p_{\rm max}]} v$$

1.1.2 A.1.2 The characterization sum rule

In the following, we present the most important rule for inferring a fact characterization, called the characterization sum rule. Among other things, the rule provides support for correct aggregation. We first give the rule and then explain what it does.

Definition A.3

(Characterization sum rule) For any fact, f, and dimension values \(v_1, v_2, \ldots, v_n\) and v, the following holds:

$$ \bigwedge\limits_{i = 1}^n \left(v_i \sqsubset_{d_i} v \wedge f \rightsquigarrow^{i}_{[p_{\rm min}^i; p_{\rm max}^i]} v_i\right) \Rightarrow f \rightsquigarrow^{i}_{[p_{\rm min}; p_{\rm max}]} v, $$

where

$$ p_{\rm min} = \sum\limits_{i=1}^n d_i \cdot p_{\rm min}^i \quad \textnormal{ and } \quad p_{\rm max} = \min\left( \sum\limits_{i = 1}^n d_i \cdot p_{\rm max}^i, 1\right) $$

The basic idea behind the rule is that we obtain the probability for a fact characterization by summing up the probabilities for that fact characterization obtained through n different aggregation paths. Let a fact f be inside values \(v_1, v_2, \ldots, v_n\) with probabilities at least \(p_{\rm min}^1, p_{\rm min}^2, \ldots, p_{\rm min}^n\) and at most \(p_{\rm max}^1, p_{\rm max}^2, \ldots, p_{\rm max}^n\), respectively. Then, if any point from \(v_1, v_2, \ldots, v_n\) is contained in the dimension value v with probabilities \(d_1, d_2, \ldots, d_n\), respectively, then f is also inside v with probability at least \(p_{\rm min}^1 \cdot d_1 + p_{\rm min}^2 \cdot d_2 + \ldots + p_{\rm min}^n \cdot d_n\) and at most \(p_{\rm max}^1 \cdot d_1 + p_{\rm max}^2 \cdot d_2 + \ldots + p_{\rm max}^n \cdot d_n\) (if this sum is lower than 1) or 1 (if the sum is equal to or greater than 1).

A more formal way to explain the rule is as follows. Let P(e) be the probability of event e and P(e ∧ e′) the probability of the conjunction of events e and e′. First, let the event \(e_1\) be “a piece of content is inside a segment v” (i.e., “\(f \rightsquigarrow^i v\)”). We need to compute \(p_{\rm min}\), which is a lower bound on \(P(e_1)\). Clearly, \(P(e_1) = \sum_{i=1}^n P(e_2^i \wedge e_3^i)\), where \(e_2^i\) is “\(f \rightsquigarrow^i v_i\)” and \(e_3^i\) is “\(v_i \sqsubset v\)”. Since the events \(e_2^i\) and \(e_3^i\) are independent, \(P(e_2^i \wedge e_3^i) = P(e_2^i) \cdot P(e_3^i)\). Next, \(P(e_2^i) \geq p_{\rm min}^i\), and \(P(e_3^i) = d_i\) if \(d_i\) is an expected degree (or \(P(e_3^i) \geq d_i\) if \(d_i\) is a safe degree). This means that \(P(e_1) \geq \sum_{i=1}^n d_i \cdot p_{\rm min}^i\). So, \(p_{\rm min} = \sum_{i=1}^n d_i \cdot p_{\rm min}^i\). The case of \(p_{\rm max}\) (i.e., the maximum probability that a piece of content is inside a segment v) is analogous. However, since in this case we sum upper bounds on probabilities, the resulting upper bound, \(p_{\rm max}\), may be higher than 1. Since, according to probability theory, the maximum probability of any event is 1, we “cut” \(p_{\rm max}\) down to 1. We note that \(p_{\rm min}\) will always be at most 1, which follows from Eq. 1 (Section 4) and the fact that \(d_i \leq 1\) for all i.

Since no aggregation path is ignored in the process of inferring fact characterizations, the characterization sum rule offers significant support for correct aggregation. Furthermore, if expected degrees are used for constructing an MO, the rule for inferring transitive relationships between dimension values (see Section A.1) provides additional support. In particular, the combined effect of these two rules is that a query engine may perform inferences on an MO in any order without losing any information (i.e., transitive relationships between values first, then fact characterizations, or in the reverse order).

Example A.2

Given a dimension hierarchy from Fig. 10, we exemplify the use of the characterization sum rule. Suppose our data warehouse has data on (uncertain) positions of a user in the kilometer-post representation, which are stored as \((f_1, p_1, 0, 0.1) \in R^{i,p}\) and \((f_1, p_2, 0.9, 1) \in R^{i,p}\). Then, the positions of the user in the link-node representation are deduced as follows. First, assuming that the degrees from Fig. 10 are expected degrees, we infer the following relationships: \(p_1 \sqsubset_{0.8} a_1\), \(p_1 \sqsubset_{0.2} a_2\), \(p_2 \sqsubset_{0.8} a_1\), and \(p_2 \sqsubset_{0.2} a_2\). Second, by basic rule 2, we obtain the fact characterizations \(f_1 \rightsquigarrow^i_{[0;0.1]} p_1\) and \(f_1 \rightsquigarrow^i_{[0.9; 1]} p_2\). Finally, by the characterization sum rule, we infer the characterizations \(f_1 \rightsquigarrow_{[p_{\rm min}^1; p_{\rm max}^1]}^i a_1\) and \(f_1 \rightsquigarrow_{[p_{\rm min}^2; p_{\rm max}^2]}^i a_2\), where \(p_{\rm min}^1 = 0.8 \cdot 0 + 0.8 \cdot 0.9 = 0.72\), \(p_{\rm max}^1 = 0.8 \cdot 0.1 + 0.8 \cdot 1 = 0.88\), \(p_{\rm min}^2 = 0.2 \cdot 0 + 0.2 \cdot 0.9 = 0.18\), and \(p_{\rm max}^2 = 0.2 \cdot 0.1 + 0.2 \cdot 1 = 0.22\).

1.2 A.2 The rest of the algebra

For unary operators, we assume one n-dimensional MO,

$$ M = (\mathcal{S}, F, D_M, R_M) $$

where

$$ D_M = \{D_i, i=1, \ldots, n\}, $$

and

$$ R_M = \{R_i, i = 1, \ldots, n\}. $$

For binary operators, we assume two n-dimensional MO’s

$$ M_1 = (\mathcal{S}_1, F_1, D_{M_1}, R_{M_1}) $$

and

$$ M_2 = (\mathcal{S}_2, F_2, D_{M_2}, R_{M_2}) $$

where

$$ \begin{array}{rll} &&D_{M_1} = \{D_i^1, i = 1, \ldots, n\}, \qquad D_{M_2} = \{D_i^2, i = 1, \ldots, n\}, \\ &&R_{M_1} = \{R_i^1, i = 1, \ldots, n\}, \quad \textnormal{and} \qquad R_{M_2} = \{R_i^2, i = 1, \ldots, n\}. \end{array} $$

In addition, we use the notation \(\widehat{D}_i\) for the union of all the dimension values from \(D_i\).

1.2.1 A.2.1 Selection

The selection operator is used to select a subset of the facts in an MO based on a predicate.

Definition A.4

(Selection operator) Let K = {i, c}, where the symbols i and c stand for “inside” and “covering”, respectively. The selection operator, σ, uses a predicate

$$ q:\widehat{D}_1 \times \ldots \times \widehat{D}_n \times ([0;1] \times [0;1])^n \times K^n \rightarrow \{\mathit{true}, \mathit{false}\} $$

The parameters of q are n dimension values, each from a different dimension, n intervals of probability values, and n “inside” or “covering” symbols. Applying the operator to M yields the following set of facts.

$$ \begin{array}{ll} \{ f \in F \mid & \exists (v_1, \ldots, v_n) \in \widehat{D}_1 \times \ldots \times \widehat{D}_n~ \exists \left(\left[p_{\rm min}^1; p_{\rm max}^1\right], \ldots, \left[p_{\rm min}^n; p_{\rm max}^n\right]\right) \in ([0;1] \times [0;1])^n \\ & \exists (k_1, \ldots, k_n) \in K^n \left(q\left(v_1, \left[p_{\rm min}^1; p_{\rm max}^1\right], k_1, \ldots, v_n, \left[p_{\rm min}^n; p_{\rm max}^n\right], k_n\right) \wedge \bigwedge\limits_{j=1}^n f \rightsquigarrow_{\left[p_{\rm min}^j; p_{\rm max}^j\right]}^{k_j} v_j\right)\} \end{array} $$

Selection chooses the set of facts that are characterized by dimension values for which q evaluates to true. The operator supports probabilistic covering/inside fact characterizations. Specifically, it allows us to formulate queries that select facts that are characterized (1) with a given interval of uncertainty (i.e., \([p_{\rm min}^i; p_{\rm max}^i]\)) for the characterization by dimension \(D_i\) and (2) with a given kind of characterization (i.e., inside, covering, or both), specified by \(k_i\), for the characterization by dimension \(D_i\). In addition, we restrict the fact-dimension relations accordingly, while the dimensions and the fact schema stay the same.

Example A.3

(Selection operator) Continuing Example A.2, suppose that we would like to select reliable data on male users, \(m \in C_{\mathrm{Sex}}\), on a link, \(a_1 \in C_{\mathrm{L\_L}}\), at a future time, \(t \in C_{\mathrm{Second}}\). For this, the predicate, q, is defined as follows.

$$ \begin{array}{rll} && q \left(v_1, \left[p_{\rm min}^1; p_{\rm max}^1\right], k_1, v_2, \left[p_{\rm min}^2; p_{\rm max}^2\right], k_2, v_3, \left[p_{\rm min}^3; p_{\rm max}^3\right], k_3\right) = \mathit{true} \Leftrightarrow \\ &&{\kern6pt}\left(v_1 = m~\wedge~p_{\rm min}^1 = p_{\rm max}^1 = 1~\wedge~k_1 = i\right)~\wedge \\ &&{\kern6pt}\left(v_2 = a_1~\wedge~\left[p_{\rm min}^2; p_{\rm max}^2\right] \subseteq [0.5;1]~\wedge~k_2 = i\right)~\wedge \\ &&{\kern6pt}\left(v_3 = t~\wedge~p_{\rm min}^3 = p_{\rm max}^3 = 1~\wedge~k_3 = i\right) \end{array} $$

The predicate defines the reliable data as the fact characterizations such that: (1) in the USER and TIME dimensions, the minimum and maximum probabilities equal 1, and (2) in the LOCATION dimension, the minimum probability is at least 0.5 and the maximum probability is arbitrary (i.e., up to 1).

Suppose we have the two characterizations \(f_1 \rightsquigarrow^i_{[1;1]} m\) and \(f_1 \rightsquigarrow^i_{[1;1]} t\) in the USER and TIME dimensions, respectively. This means that the value of the predicate q depends on the characterizations in the LOCATION dimension. Since we have inferred the characterization \(f_1 \rightsquigarrow_{[0.72; 0.88]}^i a_1\), the fact \(f_1\) would contribute to the result (i.e., \(f_1 \in F^{\prime}\)). However, if we replace \(a_1\) with \(a_2\) in the query, then the fact \(f_1\) would be outside the result, because of the characterization \(f_1 \rightsquigarrow_{[0.18; 0.22]}^i a_2\). As another example, we could select all data that is unreliable with respect to positioning, for instance, to remove it from a subsequent computation, as follows:

$$ \begin{array}{rll} &&q \left(v_1, \left[p_{\rm min}^1; p_{\rm max}^1\right], k_1, v_2, \left[p_{\rm min}^2; p_{\rm max}^2\right], k_2,v_3, \left[p_{\rm min}^3; p_{\rm max}^3\right], k_3\right)\\ &&{\kern6pt} = true~\Leftrightarrow~\left[p_{\rm min}^2; p_{\rm max}^2\right] \subseteq [0; 0.5) \end{array} $$

1.2.2 A.2.2 Other aggregate functions

The SUM function is a generalization of COUNT. Intuitively, the former sums arbitrary values of a measure, while the latter sums unit values. Suppose that, in an MO, the nth dimension supplies data for the function. We assume that this dimension is regular, i.e., (1) there are only full containment relationships in the dimension hierarchy and (2) facts are mapped to this dimension only deterministically. Then, we define the minimum expected sum by modifying the definition of the minimum expected count as follows.

Definition A.5

(SUM function) Given the group G, the minimum expected sum is:

$$ \mathit{SUM}_{\rm min}(G) = \sum\limits_{j=1}^N \left(p^{1,j}_{\rm min} \cdot \ldots \cdot p^{n-1,j}_{\rm min} \cdot g\left(v_n, f_j\right)\right) $$

where \(g(v_n, f_j)\) is a numerical value assigned to a dimension value v such that \(v \sqsubset v_n\) and \((f_j, v, 1, 1) \in R_n\).

Note that only the most precise data is summed. Definitions of maximum or average expected and possible or definite sums can be obtained by modifying the definitions of the corresponding counts analogously.

Example A.4

(SUM function) For example, suppose that in our case study we added a dimension for vehicle weights. Suppose further that in the new dimension the semantics of containment relationships is as follows: \(w_i \sqsubset w_j\) means that the weight represented by \(w_i\) is lower than that represented by \(w_j\). In addition, suppose that (1) the values \(w_{5} \in \widehat{D}_4\), \(w_{2.5} \in \widehat{D}_4\), and \(w_{1.75} \in \widehat{D}_4\) stand for 5, 2.5, and 1.75 tons, respectively, (2) \(w_{2.5} \sqsubset w_{5}\) and \(w_{1.75} \sqsubset w_{5}\), and (3) \((f_1, w_{1.75}, 1, 1) \in R_4\) and \((f_2, w_{2.5}, 1, 1) \in R_4\). Thus, \(g(w_5, f_1) = 1.75\) and \(g(w_5, f_2) = 2.5\). Then, we could find the minimum expected sum of vehicle weights for the group \(G_l^{\prime} = \mathit{Group}(m, a_1, t, w_{5})\) (see Example 5.2) as follows: \(\mathit{SUM}_{\rm min}(G_l^{\prime}) = 1 \cdot 0.72 \cdot 1 \cdot 1 \cdot 1.75 + 1 \cdot 1 \cdot 1 \cdot 1 \cdot 2.5 = 1.26 + 2.5 = 3.76\).

Next, we consider the AVG function.

Definition A.6

(AVG function) Given the group G, we define (different kinds of) the function as follows:

$$ \mathit{AVG}_{\rm mod}(G) = \frac{\mathit{SUM}_{\rm mod}(G)}{\mathit{COUNT}_{\rm mod}(G)} $$

where mod is one of the following: min, max, avg, def, and pos.

Finally, we consider the MIN and MAX functions. We give a formal definition for the MIN function only. The MAX function may be defined analogously.

Definition A.7

(MIN function) Given the group G, we define the possible and definite minimum as follows:

$$ \mathit{MIN}_{\rm mod}(G) = \min (\{ g(v_n, f_j), j=1, \ldots, N\}) $$

where mod is either pos or def and min is a function that returns the minimum number from a set of numbers. As with the COUNT function, \(\mathit{MIN}_{\rm pos}\) (\(\mathit{MIN}_{\rm def}\)) is defined if G is a liberal (conservative) group.

1.2.3 A.2.3 Union operator

The union operator is used to take the union of two MOs. Prior to defining the operator itself, we define two helper union operators, union on dimensions and union on fact-dimension relations.

In Definition A.8, we assume two dimensions of the same type \(\mathcal{T}\):

  1. \(D_1\) with its set of categories \(C_{D_1}\) and relation \(\sqsubset^{D_1}\), and

  2. \(D_2\) with its set of categories \(C_{D_2}\) and relation \(\sqsubset^{D_2}\).

Let us also assume that \(C_{D_1} = \{C_j^1, j=1, \ldots, m\}\) and \(C_{D_2} = \{C_j^2, j=1, \ldots, m\}\).

Definition A.8

(Dimension union) The union operator on dimensions, ∪ D, is defined as follows:

$$ D^{\prime} = D_{1} \bigcup^{D} D_{2} = \left(C_{D^{\prime}}, \sqsubset^{D^{\prime}}\right) $$

where \(C_{D^{\prime}} = \{C_{j}^{1} \bigcup C_{j}^{2}, j=1, \ldots, m\}\). The relation \(\sqsubset^{D^{\prime}}\) is defined as follows:

$$ \forall (v_1, v_2) \in (D_1 \cup D_2) \times (D_1 \cup D_2) \left(v_1 \in \mathit{Desc}(v_2) \wedge v_1 \sqsubset_d^{D^{\prime}} v_2 \Leftrightarrow v_1 \sqsubset_{d_1}^{D_1} v_2 \vee v_1 \sqsubset_{d_2}^{D_2} v_2\right) $$

where \(\mathit{Desc}(v_2)\) is the set of immediate predecessors of \(v_2\) and d depends on \(d_1\) and \(d_2\).

Stated less formally, given two dimensions of the same type, the union operator on dimensions performs a set union on corresponding categories and builds a new relation on dimension values: there is a direct relationship between two dimension values in the result if there is a direct relationship between them in the first dimension, in the second dimension, or in both. The degree of containment for a resulting relationship may be determined in different ways; we discuss this issue later in this section. Note that these rules determine only the degrees of containment for the direct relationships. The indirect relationships between values in the resulting dimension are inferred using our transitivity rules from Section 4.

In Definition A.9, we assume two fact-dimension relations:

  1. \(R_1\), which relates facts from a set \(F_1\) with dimension values from a dimension \(D_1\), and

  2. \(R_2\), which relates facts from a set \(F_2\) with dimension values from a dimension \(D_2\).

The sets of facts are of the same fact type and the dimensions are of the same dimension type.

Definition A.9

(Fact-dimension union) The union operator on fact-dimension relations, ∪ R, is defined as follows:

$$ R_{1} \bigcup^R R_{2} = \left\{\left(f, v, p_{\rm min}^{\prime}, p_{\rm max}^{\prime}\right)~|~ \left(f, v, p_{\rm min}^1, p_{\rm max}^1\right) \in R_1~\vee \left(f, v, p_{\rm min}^2, p_{\rm max}^2\right) \in R_2\right\} $$

where \(p_{\rm min}^{\prime}\) and \(p_{\rm max}^{\prime}\) depend on \(p_{\rm min}^1\), \(p_{\rm max}^1\), \(p_{\rm min}^2\), and \(p_{\rm max}^2\).

Stated less formally, given two fact-dimension relations that relate facts and dimensions of the same types, the union operator builds a new fact-dimension relation: the new relation relates a fact and a dimension value if the first relation, the second relation, or both relate the fact to the value. The probabilities for a resulting relationship may be determined in different ways; we discuss this issue later in this section. Note that only the fact-dimension relations are found using these rules. The fact characterizations are inferred using the rules from Section 4.

Definition A.10

(Multidimensional union) Consider two n-dimensional MO’s with the same fact schema (i.e., \(\mathcal{S}_{1} = \mathcal{S}_{2}\)). The union operator on MOs, ∪, is defined as:

$$ M_{1} \bigcup M_{2} = \left(\mathcal{S}^{\prime},F^{\prime},D^{\prime}_{M^{\prime}}, R^{\prime}_{M^{\prime}}\right) $$

where

  1. \(\mathcal{S}^{\prime} = \mathcal{S}_{1}\),

  2. \(F^{\prime} = F_{1} \bigcup F_{2}\),

  3. \(D^{\prime}_{M^{\prime}} = \{D_{i}^{1} \bigcup^D D_{i}^{2}, i=1, \ldots, n\}\),

  4. \(R^{\prime}_{M^{\prime}} = \{R_{i}^{1} \bigcup^R R_{i}^{2}, i=1, \ldots, n\}\).

Stated less formally, given two MO’s with a common fact schema, the union operator combines their dimensions and fact-dimension relations with the help of the ∪ D and ∪ R operators, respectively.

1.2.4 A.2.4 Other operators

Other operators, such as projection and identity-based join, are like their deterministic counterparts. These operators do not transform the probabilities of the fact characterizations or the degrees of containment in the dimensions; they only need to preserve the probabilities. Therefore, we define the probabilistic versions of these operators as their deterministic counterparts in [5], except that the probabilistic operators take probabilistic MOs as arguments and produce probabilistic MOs as results.

1.3 A.3 Additional implementation details

1.3.1 A.3.1 Computing fact characterizations

The readChars function stores inferred relationships between dimension values for later use. However, this solution requires memory for storing the relationships. If there is a need to reduce memory usage, a version of the function that does not store inferred relationships, readCharsNoSave, may be used. Algorithm 4 presents the pseudocode for readCharsNoSave. In the following, we discuss how to modify Algorithm 3 to obtain an algorithm for readCharsNoSave. Naturally, the parameter \(paths_i^{{\rm cat}_i}\), which is used to store inferred relationships, is removed from the list of input parameters. Then, we remove lines 9–14, which use the dynamically computed relationships for computing fact characterizations. Finally, we remove lines 22–23, which store the newly inferred relationships.

1.3.2 A.3.2 Generalizations of computing aggregate values

Now, we discuss how to modify Algorithms 6.2 and 6.3 in order to implement several generalizations of the aggregation algorithm. Algorithm 2 uses liberal groups, denoted \(\mathit{Group}_l(v_1, v_2, \ldots, v_n)\), and outputs minimum counts, denoted \(\mathit{COUNT}_{\rm min}(v_1, v_2, \ldots, v_n)\). We discuss how to also implement conservative and degree-of-confidence grouping, and how to produce maximum and average counts (see Definitions 5.2 and 5.3).

The generalized aggregation algorithm is presented in Algorithm A.2. In the following, we compare Algorithm A.2 with Algorithm 2. There are four extra input parameters. First, the collection of maximum-probability fact-dimension hashtables, \(R^{ht}_{M, {\rm max}} = \{R^{ht}_{i, {\rm max}}, i = 1, 2, \ldots, n\}\), is now also required for computing maximum or average counts. Second, gr ∈ {l, c, d} indicates which grouping to use: liberal, l, conservative, c, or degree-of-confidence, d. Third, the set of probability intervals, \(\{[p^i_{{\rm min}\prime}; p^i_{{\rm max}\prime}], i = 1, 2, \ldots, n\}\), is used for filtering facts out in the case of degree-of-confidence grouping; if gr is set to l or c, this parameter is ignored. Finally, mod ∈ {min, max, avg} indicates which kind of count to compute. The output parameter, \(\mathit{COUNT}_{\rm mod}\), now depends on mod.

Next, the initialization part (lines 1–5) does not change at all.

However, the body of the main, “foreach” loop in line 6 changes significantly, though it has the same structure. First, in lines 7–8, we compute fact characterizations, which corresponds to lines 7–8 of Algorithm 2. Second, in lines 9–29, we construct and fill the groups according to these characterizations, which corresponds to lines 9–15 of Algorithm 2.

As for computing fact characterizations, in line 8 we now call the function readPairChars instead of readChars. The differences between the functions are as follows. The readPairChars function returns two lists of fact characterizations, \(f\_chars_{i,{\rm min}}\) and \(f\_chars_{i,{\rm max}}\), instead of one, \(f\_chars_i\), because both minimum and maximum probabilities of facts are used for filtering with degree-of-confidence grouping and for computing average counts (i.e., with gr = d and mod = avg). Accordingly, the readPairChars function takes as input two fact-dimension hashtables, \(R^{ht}_{i, {\rm min}}\) and \(R^{ht}_{i, {\rm max}}\), instead of one, \(R^{ht}_{i}\). The body of this function is very similar to that of readChars from Algorithm 3. The only difference is that two sets of fact-dimension relationships are read (see line 2 in Algorithm 3) and used later for computations.

As for constructing and filling groups, several important modifications are made. For each group, a “while” loop now iterates over the dimensions instead of the “foreach” loop (see line 12 in both algorithms), because an early stop is now performed if a fact, f, is filtered out. The early stop is requested by setting the boolean flag contrib to false. The flag is initialized in line 11. The “while” loop’s variable, i, is initialized in line 11 and updated in line 26. The body of the “while” loop consists of three parts. First, in lines 13–14, we read the minimum and maximum probabilities, \(p^i_{\rm min}\) and \(p^i_{\rm max}\), of the fact characterization considered, \(f \rightsquigarrow v_i\), which corresponds to line 13 in Algorithm 2. Second, the switch block in lines 15–20 checks whether the fact must be filtered out; if so, contrib is set to false. Third, if the fact, f, has not been filtered out, the switch block in lines 21–25 updates the running total of the fact’s contribution to the group, p.

Now we discuss how to check whether the fact must be filtered out (lines 15–20). We do the check according to the definition of the grouping operators from Definition 5.2, and we do it differently for conservative (case gr = c, lines 16–17), degree-of-confidence (case gr = d, lines 18–19), and liberal (case gr = l, line 20) grouping. Specifically, with conservative grouping, it is enough to check whether the maximum probability, \(p_{\rm max}^i\), is lower than 1. With degree-of-confidence grouping, we must check whether both the minimum and maximum probabilities, \(p_{\rm min}^i\) and \(p_{\rm max}^i\), fall within the given interval, \([p_{{\rm min}\prime}^i;p_{{\rm max}\prime}^i]\). With liberal grouping, every fact contributes to the result, regardless of its probabilities, so no check is performed. As for updating the running total of the fact’s contribution to the group, p, in lines 21–25, we do it according to the definition of the COUNT aggregation functions from Definition 5.3, and differently for the minimum count (case mod = min, line 23), the maximum count (case mod = max, line 24), and the average count (case mod = avg, line 25). Specifically, with the minimum or maximum count, only the minimum or maximum probability, \(p^i_{\rm min}\) or \(p^i_{\rm max}\), contributes to the result, respectively. With the average count, the average of the two probabilities, \(\frac{p^i_{\rm min} + p^i_{\rm max}}{2}\), contributes to the result.

Finally, we update the count of the current group, denoted \(\mathit{COUNT}_{\rm mod}(v_1, v_2, \ldots, v_n)\), in lines 27–28, which correspond to line 14 in Algorithm 2. Now the update is done only if the fact, f, has not been filtered out (i.e., if contrib has not been set to false).

In the following, we discuss the complexity of the functions readChars and readCharsNoSave presented in Algorithms 6.3 and A.1, respectively. The function readChars runs in O(mk) time if all the non-immediate relations between dimension values have been computed beforehand. In this case, the loop in lines 11–13 runs k times per iteration of the loop in lines 3–30, which performs m iterations. The function readCharsNoSave is much slower: it runs in \(O(mck^2)\) time, where c is the number of categories per dimension, because it always performs a breadth-first search to compute the non-immediate relations.

Cite this article

Timko, I., Dyreson, C. & Pedersen, T.B. A probabilistic data model and algebra for location-based data warehouses and their implementation. Geoinformatica 18, 357–403 (2014). https://doi.org/10.1007/s10707-013-0180-4