Abstract
It is common practice to spend considerable time refining source data to address issues of data quality before beginning any data analysis. For example, an analyst might impute missing values or detect and fuse duplicate records representing the same real-world entity. However, there are many situations where there are multiple possible candidate resolutions for a data quality issue, but there is not sufficient evidence for determining which of the resolutions is the most appropriate. In this case, the only way forward is to make assumptions to restrict the space of solutions and/or to heuristically choose a resolution based on characteristics that are deemed predictive of “good” resolutions. Although it is important for the analyst to understand the impact of these assumptions and heuristic choices on her results, evaluating this impact can be highly nontrivial and time-consuming. For several decades now, the fields of probabilistic, incomplete, and fuzzy databases have developed strategies for analyzing the impact of uncertainty on the outcome of analyses. This general family of uncertainty-aware databases aims to model ambiguity in the results of analyses expressed in standard languages like SQL, SPARQL, R, or Spark. An uncertainty-aware database uses descriptions of potential errors and ambiguities in source data to derive a corresponding description of potential errors or ambiguities in the result of an analysis accessing this source data. Depending on the technique, these descriptions of uncertainty may be either quantitative (bounds, probabilities) or qualitative (certain outcomes, unknown values, explanations of uncertainty). In this chapter, we explore the types of problems that techniques from uncertainty-aware databases address, survey solutions to these problems, and highlight their application to fixing data quality issues.
Notes
- 1.
In database terminology, we would say that a functional dependency id → name holds for the dataset, i.e., the “id” value of a record determines its name value. Or put differently, there are no two records that have the same SSN, but a different name. The intricacies of functional dependencies are beyond the scope of this paper. The interested reader is referred to database textbooks (e.g., [1]). Furthermore, see, e.g., [12] for how constraints like functional dependencies are used to repair data errors.
- 2.
The reader may wonder whether it is possible to encode a certain record \(r\) as multiple x-tuples that all have \(r\) as an instantiation and where for each such x-tuple \(\mathbf{r}\), we have \(p_{\mathbf{r}}(r) < 1\). However, recall that x-tuples are assumed to be independent of each other. Thus, there would exist a possible world with a nonzero probability that does not contain \(r\), constructed by choosing an instantiation \(r' \neq r\) or no instantiation for every x-tuple \(\mathbf{r}\) with \(r \in \mathbf{r}\).
- 3.
Note that global conditions are not strictly necessary for expressive power, but they may allow for a more compact/convenient representation of a probabilistic database.
- 4.
Consider an incomplete database \(\mathcal D\) with \(2^n\) possible worlds \(D_1, \ldots, D_{2^n}\) (the construction has to be modified slightly if the number of possible worlds is not a power of 2). Then we use \(n\) variables: \(v_1, \ldots, v_n\). An assignment to these variables is interpreted as a number \(i\) in binary identifying one possible world \(D_i\). For example, if there are \(4 = 2^2\) possible worlds, then we would use two variables \(v_1\) and \(v_2\), and the assignment \(v_1 \mapsto \mathbf T\) and \(v_2 \mapsto \mathbf F\) represents the possible world \(1 \cdot 2^1 + 0 \cdot 2^0 = 2\). The database constructed contains all records that are possible in \(\mathcal D\). For an assignment \(\alpha\), let \(n(\alpha)\) denote the number encoded by \(\alpha\). Then the local condition for record \(r\) is \(\bigvee \limits _{\alpha : r \in D_{n(\alpha )}} \bigwedge \limits _{j: \alpha (v_j) = \mathbf T} v_j\).
- 5.
Note that [16] used per-variable distributions, which is less general.
- 6.
Note that more than two options can be modeled by multiple boolean variables. For example, four alternatives can be modeled with annotations \(v_1 \wedge v_2\), \(\neg v_1 \wedge v_2\), \(v_1 \wedge \neg v_2\), and \(\neg v_1 \wedge \neg v_2\), respectively.
- 7.
For a more thorough introduction, we refer the interested reader to a textbook by Garcia-Molina et al. [14].
- 8.
Observe that a binary version of this problem can be applied in the case of incomplete databases. A tuple is certain if its local condition is implied by the global condition.
- 9.
This prevents repeated identifiers if a record appears on both sides.
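The functional dependency in note 1 can be illustrated with a small check. The following is a minimal sketch, assuming records are stored as Python dictionaries; the function name `fd_violations` and the sample data are hypothetical and not taken from the chapter.

```python
from collections import defaultdict

def fd_violations(records, lhs, rhs):
    """Return {lhs_value: set_of_rhs_values} for each lhs value that
    maps to more than one rhs value, i.e., each violation of lhs -> rhs."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[lhs]].add(rec[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

# Hypothetical sample data: the third record reuses id 1 with another name.
people = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 1, "name": "Alicia"},
]

print(fd_violations(people, "id", "name"))
```

An empty result means the dependency holds on the given records; a non-empty result lists the identifiers for which a repair would have to choose among several candidate names.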
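The argument in note 2 can be checked numerically. The sketch below assumes a hypothetical encoding in which a record is an alternative of several independent x-tuples, each of which selects it with probability strictly below 1; the function name and the probabilities are illustrative only.

```python
from math import prod

def prob_record_absent(inclusion_probs):
    """Probability that no x-tuple instantiates to the record.

    inclusion_probs[i] is the probability that the i-th x-tuple selects
    the record; x-tuples are assumed to be independent of each other.
    """
    return prod(1.0 - p for p in inclusion_probs)

# Hypothetical example: three x-tuples, each selecting the record with 0.9.
absent = prob_record_absent([0.9, 0.9, 0.9])
print(absent)  # strictly positive, so the record is not certain
```

Because every factor is strictly positive whenever every inclusion probability is below 1, the product cannot reach zero, which is why such an encoding cannot represent a certain record.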
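The binary encoding of possible worlds in note 4 can be sketched as follows. This is an illustration under stated assumptions: worlds are indexed from 0 so that indices coincide with the binary numbers, and the local condition is represented by the set of assignments under which the record is present (one entry per disjunct) rather than as a formula. All names and the sample worlds are hypothetical.

```python
from itertools import product

def world_index(assignment):
    """Read a tuple of booleans (v_1, ..., v_n) as a binary number,
    with v_1 as the most significant bit."""
    n = len(assignment)
    return sum(int(bit) << (n - 1 - j) for j, bit in enumerate(assignment))

def local_condition(record, worlds, n):
    """Return the set of assignments whose encoded world contains record."""
    return {
        a for a in product([False, True], repeat=n)
        if record in worlds[world_index(a)]
    }

# Hypothetical incomplete database with 4 = 2^2 possible worlds.
worlds = {0: {"r1"}, 1: {"r1", "r2"}, 2: {"r2"}, 3: set()}

print(world_index((True, False)))  # v_1 = T, v_2 = F encodes 1*2^1 + 0*2^0
print(local_condition("r2", worlds, 2))
```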
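The annotations in note 6 can be generated mechanically. The sketch below enumerates one conjunction of literals per truth assignment and provides a helper to evaluate an annotation; the representation of an annotation as a tuple of (variable name, polarity) pairs is an assumption made for illustration.

```python
from itertools import product

def alternative_annotations(num_vars):
    """Yield one conjunction of literals per truth assignment.

    Each annotation is a tuple of (variable_name, polarity) pairs,
    where polarity False stands for a negated variable.
    """
    names = [f"v{i + 1}" for i in range(num_vars)]
    for polarity in product([True, False], repeat=num_vars):
        yield tuple(zip(names, polarity))

def holds(annotation, assignment):
    """True if the assignment (dict name -> bool) satisfies every literal."""
    return all(assignment[name] == pol for name, pol in annotation)

annotations = list(alternative_annotations(2))
for a in annotations:
    print(" ∧ ".join(name if pol else f"¬{name}" for name, pol in a))
```

Under any assignment to the variables exactly one annotation holds, so the alternatives are mutually exclusive and exhaustive. With k variables this yields 2^k alternatives.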
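The certainty test in note 8 amounts to an implication check. The following brute-force sketch enumerates all assignments, which is exponential in the number of variables and intended only to make the definition concrete; conditions are modeled as Python functions over an assignment dictionary, which is an assumption of this sketch.

```python
from itertools import product

def is_certain(local_cond, global_cond, variables):
    """Check that global_cond implies local_cond over all assignments.

    Both conditions map an assignment (dict name -> bool) to a bool.
    """
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if global_cond(assignment) and not local_cond(assignment):
            return False  # some possible world lacks the tuple
    return True

# Hypothetical conditions over the variables x and y.
global_cond = lambda a: a["x"] or a["y"]
certain_tuple = lambda a: a["x"] or a["y"]   # present in every allowed world
possible_tuple = lambda a: a["x"]            # absent when only y holds

print(is_certain(certain_tuple, global_cond, ["x", "y"]))   # True
print(is_certain(possible_tuple, global_cond, ["x", "y"]))  # False
```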
References
S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases (Addison-Wesley, Reading, 1995)
L. Antova, C. Koch, On APIs for probabilistic databases, in QDB/MUD, Auckland (2008), pp. 41–56
L. Antova, C. Koch, D. Olteanu, MayBMS: managing incomplete information with probabilistic world-set decompositions, in ICDE, Istanbul (IEEE Computer Society, 2007), pp. 1479–1480
J. Boulos, N.N. Dalvi, B. Mandhani, S. Mathur, C. Ré, D. Suciu, MYSTIQ: a system for finding more answers by using probabilities, in SIGMOD Conference (ACM, New York, 2005), pp. 891–893
D. Crankshaw, P. Bailis, J.E. Gonzalez, H. Li, Z. Zhang, M.J. Franklin, A. Ghodsi, M.I. Jordan, The missing piece in complex analytics: low latency, scalable model management and serving with Velox, in CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, 4–7 Jan 2015, Online Proceedings (2015). www.cidrdb.org
P. Dagum, R.M. Karp, M. Luby, S.M. Ross, An optimal algorithm for Monte Carlo estimation. SIAM J. Comput. 29(5), 1484–1496 (2000)
N. Dalvi, D. Suciu, The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM 59(6), 30:1–30:87 (2013)
G.V. den Broeck, D. Suciu, Query processing on probabilistic data: a survey. Found. Trends Databases 7(3–4), 197–341 (2017)
A. Deshpande, S. Madden, MauveDB: supporting model-based user views in database systems, in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD’06, New York (ACM, 2006), pp. 73–84
L. Detwiler, W. Gatterbauer, B. Louie, D. Suciu, P. Tarczy-Hornoch, Integrating and ranking uncertain scientific data, in ICDE, Shanghai (2009), pp. 1235–1238
R. Fagin, P.G. Kolaitis, R.J. Miller, L. Popa, Data exchange: semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005). Database Theory
W. Fan, Dependencies revisited for improving data quality, in Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, Vancouver, 9–11 June 2008, ed. by M. Lenzerini, D. Lembo (ACM, 2008), pp. 159–170
R. Fink, J. Huang, D. Olteanu, Anytime approximation in probabilistic databases. VLDB J. 22(6), 823–848 (2013)
H. Garcia-Molina, J.D. Ullman, J. Widom, Database Systems: The Complete Book (Prentice Hall, Upper Saddle River, 2002)
W. Gatterbauer, D. Suciu, Dissociation and propagation for approximate lifted inference with standard relational database management systems. VLDB J. 26(1), 5–30 (2017)
T. Green, V. Tannen, Models for incomplete and probabilistic information, in Current Trends in Database Technology – EDBT 2006, ed. by T. Grust, H. Höpfner, A. Illarramendi, S. Jablonski, M. Mesiti, S. Müller, P.-L. Patranjan, K.-U. Sattler, M. Spiliopoulou, J. Wijsen. Lecture Notes in Computer Science, vol. 4254 (Springer, Berlin/Heidelberg, 2006), pp. 278–296
T.J. Green, G. Karvounarakis, V. Tannen, Provenance semirings, in PODS, Beijing (2007), pp. 31–40
T.J. Green, G. Karvounarakis, N.E. Taylor, O. Biton, Z.G. Ives, V. Tannen, ORCHESTRA: facilitating collaborative data sharing, in SIGMOD, Beijing (2007), pp. 1131–1133
P. Guagliardo, L. Libkin, Making SQL queries correct on incomplete databases: a feasibility study, in PODS (ACM, New York, 2016), pp. 211–223
J. Huang, L. Antova, C. Koch, D. Olteanu, MayBMS: a probabilistic database management system, in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD’09, New York (ACM, 2009), pp. 1071–1074
T. Imieliński, W. Lipski, Jr., Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984)
Z.G. Ives, T.J. Green, G. Karvounarakis, N.E. Taylor, V. Tannen, P.P. Talukdar, M. Jacob, F. Pereira, The ORCHESTRA collaborative data sharing system. SIGMOD Rec. 37(3), 26–32 (2008)
R. Jampani, F. Xu, M. Wu, L.L. Perez, C. Jermaine, P.J. Haas, MCDB: a Monte Carlo approach to managing uncertain data, in SIGMOD, Vancouver (2008), pp. 687–700
R.M. Karp, M. Luby, Monte-Carlo algorithms for enumeration and reliability problems, in 24th Annual Symposium on Foundations of Computer Science (1983), pp. 56–64
O. Kennedy, The PIP MayBMS plugin. http://maybms.sourceforge.net
O. Kennedy, C. Koch, PIP: a database system for great and small expectations, in ICDE (IEEE Computer Society, Piscataway, 2010), pp. 157–168
O.A. Kennedy, S. Nath, Jigsaw: efficient optimization over uncertain enterprise data, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD’11, New York (ACM, 2011), pp. 829–840
D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques (MIT Press, Cambridge, 2009)
P. Kumari, S. Achmiz, O. Kennedy, Communicating data quality in on-demand curation, in QDB (2016)
J. Li, B. Saha, A. Deshpande, A unified approach to ranking in probabilistic databases. Proc. VLDB Endow. 2(1), 502–513 (2009)
A. Nandi, Y. Yang, O. Kennedy, B. Glavic, R. Fehling, Z.H. Liu, D. Gawlick, Mimir: bringing ctables into practice. Technical report, ArXiv (2016)
D. Olteanu, J. Huang, C. Koch, Approximate confidence computation in probabilistic databases, in ICDE (IEEE Computer Society, Piscataway, 2010), pp. 145–156
C.H. Papadimitriou, Computational Complexity (Wiley, Reading, 2003)
S. Parsons, Probabilistic Graphical Models: Principles and Techniques by D. Koller, N. Friedman (MIT Press), 1231pp. $95.00, ISBN 0-262-01319-3. Knowl. Eng. Rev. 26(2), 237–238 (2011)
C. Ré, N.N. Dalvi, D. Suciu, Efficient top-k query evaluation on probabilistic data, in ICDE (IEEE Computer Society, Los Alamitos, 2007), pp. 886–895
S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. Hambrusch, R. Shah, Orion 2.0: native support for uncertain data, in SIGMOD, Vancouver (2008), pp. 1239–1242
M.A. Soliman, I.F. Ilyas, K.C. Chang, Top-k query processing in uncertain databases, in ICDE (IEEE Computer Society, Los Alamitos, 2007), pp. 896–905
M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E.J. O’Neil, P.E. O’Neil, A. Rasin, N. Tran, S.B. Zdonik, C-store: a column-oriented DBMS, in VLDB (ACM, New York, 2005), pp. 553–564
D. Suciu, D. Olteanu, C. Ré, C. Koch, Probabilistic Databases. Synthesis Lectures on Data Management (Morgan & Claypool Publishers, San Rafael, 2011)
J. Widom, Trio: a system for integrated management of data, accuracy, and lineage. Technical Report (2004)
Y. Yang, N. Meneghetti, R. Fehling, Z.H. Liu, O. Kennedy, Lenses: an on-demand approach to ETL. Proc. VLDB Endow. 8(12), 1578–1589 (2015)
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kennedy, O., Glavic, B. (2019). Analyzing Uncertain Tabular Data. In: Bossé, É., Rogova, G. (eds) Information Quality in Information Fusion and Decision Making. Information Fusion and Data Science. Springer, Cham. https://doi.org/10.1007/978-3-030-03643-0_12
DOI: https://doi.org/10.1007/978-3-030-03643-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03642-3
Online ISBN: 978-3-030-03643-0
eBook Packages: Computer Science, Computer Science (R0)