Analyzing Uncertain Tabular Data

Kennedy, Oliver; Glavic, Boris

doi:10.1007/978-3-030-03643-0_12

Part of the book series: Information Fusion and Data Science (IFDS)

Abstract

It is common practice to spend considerable time refining source data to address issues of data quality before beginning any data analysis. For example, an analyst might impute missing values or detect and fuse duplicate records representing the same real-world entity. However, there are many situations where there are multiple possible candidate resolutions for a data quality issue, but there is not sufficient evidence for determining which of the resolutions is the most appropriate. In this case, the only way forward is to make assumptions to restrict the space of solutions and/or to heuristically choose a resolution based on characteristics that are deemed predictive of “good” resolutions. Although it is important for the analyst to understand the impact of these assumptions and heuristic choices on her results, evaluating this impact can be highly nontrivial and time-consuming. For several decades now, the fields of probabilistic, incomplete, and fuzzy databases have developed strategies for analyzing the impact of uncertainty on the outcome of analyses. This general family of uncertainty-aware databases aims to model ambiguity in the results of analyses expressed in standard languages like SQL, SPARQL, R, or Spark. An uncertainty-aware database uses descriptions of potential errors and ambiguities in source data to derive a corresponding description of potential errors or ambiguities in the result of an analysis accessing this source data. Depending on the technique, these descriptions of uncertainty may be either quantitative (bounds, probabilities) or qualitative (certain outcomes, unknown values, explanations of uncertainty). In this chapter, we explore the types of problems that techniques from uncertainty-aware databases address, survey solutions to these problems, and highlight their application to fixing data quality issues.

Notes

1.
In database terminology, we would say that a functional dependency id → name holds for the dataset, i.e., the “id” value of a record determines its name value. Put differently, no two records have the same id but different names. The intricacies of functional dependencies are beyond the scope of this chapter. The interested reader is referred to database textbooks (e.g., [1]). Furthermore, see, e.g., [12] for how constraints like functional dependencies are used to repair data errors.
2.
The reader may wonder whether it is possible to encode a certain record $r$ as multiple x-tuples that all have $r$ as an instantiation and where, for each such x-tuple $\mathbf{r}$, we have $p_{\mathbf{r}}(r) < 1$. However, recall that x-tuples are assumed to be independent of each other. Thus, there would exist a possible world with nonzero probability that does not contain $r$, constructed by choosing an instantiation $r' \neq r$ or no instantiation for every x-tuple $\mathbf{r}$ with $r \in \mathbf{r}$.
3.
Note that global conditions are not strictly necessary for expressive power, but they may allow for a more compact/convenient representation of a probabilistic database.
4.
Consider an incomplete database $\mathcal D$ with $2^n$ possible worlds $D_1, \ldots, D_{2^n}$ (the construction has to be modified slightly if the number of possible worlds is not a power of 2). Then we use $n$ variables: $v_1, \ldots, v_n$. An assignment to these variables is interpreted as a number $i$ in binary identifying one possible world $D_i$. For example, if there are $4 = 2^2$ possible worlds, then we would use two variables $v_1$ and $v_2$, and the assignment $v_1 \mapsto \mathbf T$ and $v_2 \mapsto \mathbf F$ represents the possible world $1 \cdot 2^1 + 0 \cdot 2^0 = 2$. The database constructed contains all records that are possible in $\mathcal D$. For an assignment $\alpha$, let $n(\alpha)$ denote the number encoded by $\alpha$. Then the local condition for record $r$ is $\bigvee_{\alpha : r \in D_{n(\alpha)}} \bigwedge_{j : \alpha(v_j) = \mathbf T} v_j$.
5.
Note that [16] used per-variable distributions, which is less general.
6.
Note that more than two options can be modeled by multiple Boolean variables. For example, four alternatives can be modeled with annotations $v_1 \wedge v_2$, $\neg v_1 \wedge v_2$, $v_1 \wedge \neg v_2$, and $\neg v_1 \wedge \neg v_2$, respectively.
7.
For a more thorough introduction, we refer the interested reader to a textbook by Garcia-Molina et al. [14].
8.
Observe that a binary version of this problem can be applied in the case of incomplete databases. A tuple is certain if its local condition is implied by the global condition.
9.
This prevents repeated identifiers if a record appears on both sides.
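
The construction described in note 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`world_index`, `local_condition`, `render`, `instantiate`), the 0-based numbering of worlds, and the example records are hypothetical and are not taken from the chapter. The sketch also departs from the condition as written in note 4 in one respect: each disjunct conjoins the negation of every variable assigned false in addition to the variables assigned true, so that an assignment satisfies exactly the disjunct of the world it encodes.

```python
from itertools import product


def world_index(assignment):
    """Read (v_1, ..., v_n) as a binary number, with v_1 the most significant bit."""
    index = 0
    for value in assignment:
        index = 2 * index + int(value)
    return index


def local_condition(record, worlds, n):
    """Local condition of `record`, represented as a set of minterms.

    Each minterm is a full assignment (v_1, ..., v_n) whose world contains
    the record. Assumption: worlds are numbered 0 .. 2^n - 1.
    """
    return {
        alpha
        for alpha in product([False, True], repeat=n)
        if record in worlds[world_index(alpha)]
    }


def render(condition):
    """Pretty-print a set of minterms as a formula in disjunctive normal form."""
    def literal(j, value):
        return f"v{j}" if value else f"¬v{j}"

    disjuncts = [
        "(" + " ∧ ".join(literal(j, v) for j, v in enumerate(alpha, start=1)) + ")"
        for alpha in sorted(condition)
    ]
    return " ∨ ".join(disjuncts) if disjuncts else "false"


def instantiate(conditions, assignment):
    """Return the records whose local condition is satisfied by `assignment`."""
    return {r for r, cond in conditions.items() if tuple(assignment) in cond}


# Hypothetical incomplete database with 4 = 2^2 possible worlds.
worlds = [{"a"}, {"a", "b"}, {"b", "c"}, set()]
n = 2

# The constructed database contains every record that is possible in some world.
records = set().union(*worlds)
conditions = {r: local_condition(r, worlds, n) for r in records}

print(world_index((True, False)))  # 2, matching the example in note 4
print(render(conditions["a"]))     # (¬v1 ∧ ¬v2) ∨ (¬v1 ∧ v2)
```

Enumerating all assignments and calling `instantiate` on each one recovers the corresponding world exactly, which is the property the encoding is meant to provide.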

References

[1] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases (Addison-Wesley, Reading, 1995)
[2] L. Antova, C. Koch, On APIs for probabilistic databases, in QDB/MUD, Auckland (2008), pp. 41–56
[3] L. Antova, C. Koch, D. Olteanu, MayBMS: managing incomplete information with probabilistic world-set decompositions, in ICDE, Istanbul (IEEE Computer Society, 2007), pp. 1479–1480
[4] J. Boulos, N.N. Dalvi, B. Mandhani, S. Mathur, C. Ré, D. Suciu, MYSTIQ: a system for finding more answers by using probabilities, in SIGMOD Conference (ACM, New York, 2005), pp. 891–893
[5] D. Crankshaw, P. Bailis, J.E. Gonzalez, H. Li, Z. Zhang, M.J. Franklin, A. Ghodsi, M.I. Jordan, The missing piece in complex analytics: low latency, scalable model management and serving with Velox, in CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, 4–7 Jan 2015, Online Proceedings (2015). www.cidrdb.org
[6] P. Dagum, R.M. Karp, M. Luby, S.M. Ross, An optimal algorithm for Monte Carlo estimation. SIAM J. Comput. 29(5), 1484–1496 (2000)
[7] N. Dalvi, D. Suciu, The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM 59(6), 30:1–30:87 (2013)
[8] G.V. den Broeck, D. Suciu, Query processing on probabilistic data: a survey. Found. Trends Databases 7(3–4), 197–341 (2017)
[9] A. Deshpande, S. Madden, MauveDB: supporting model-based user views in database systems, in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD’06, New York (ACM, 2006), pp. 73–84
[10] L. Detwiler, W. Gatterbauer, B. Louie, D. Suciu, P. Tarczy-Hornoch, Integrating and ranking uncertain scientific data, in ICDE, Shanghai (2009), pp. 1235–1238
[11] R. Fagin, P.G. Kolaitis, R.J. Miller, L. Popa, Data exchange: semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005). Database Theory
[12] W. Fan, Dependencies revisited for improving data quality, in Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, Vancouver, 9–11 June 2008, ed. by M. Lenzerini, D. Lembo (ACM, 2008), pp. 159–170
[13] R. Fink, J. Huang, D. Olteanu, Anytime approximation in probabilistic databases. VLDB J. 22(6), 823–848 (2013)
[14] H. Garcia-Molina, J.D. Ullman, J. Widom, Database Systems: The Complete Book (Prentice Hall, Upper Saddle River, 2002)
[15] W. Gatterbauer, D. Suciu, Dissociation and propagation for approximate lifted inference with standard relational database management systems. VLDB J. 26(1), 5–30 (2017)
[16] T. Green, V. Tannen, Models for incomplete and probabilistic information, in Current Trends in Database Technology – EDBT 2006, ed. by T. Grust, H. Höpfner, A. Illarramendi, S. Jablonski, M. Mesiti, S. Müller, P.-L. Patranjan, K.-U. Sattler, M. Spiliopoulou, J. Wijsen. Lecture Notes in Computer Science, vol. 4254 (Springer, Berlin/Heidelberg, 2006), pp. 278–296
[17] T.J. Green, G. Karvounarakis, V. Tannen, Provenance semirings, in PODS, Beijing (2007), pp. 31–40
[18] T.J. Green, G. Karvounarakis, N.E. Taylor, O. Biton, Z.G. Ives, V. Tannen, ORCHESTRA: facilitating collaborative data sharing, in SIGMOD, Beijing (2007), pp. 1131–1133
[19] P. Guagliardo, L. Libkin, Making SQL queries correct on incomplete databases: a feasibility study, in PODS (ACM, New York, 2016), pp. 211–223
[20] J. Huang, L. Antova, C. Koch, D. Olteanu, MayBMS: a probabilistic database management system, in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD’09, New York (ACM, 2009), pp. 1071–1074
[21] T. Imieliński, W. Lipski, Jr., Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984)
[22] Z.G. Ives, T.J. Green, G. Karvounarakis, N.E. Taylor, V. Tannen, P.P. Talukdar, M. Jacob, F. Pereira, The ORCHESTRA collaborative data sharing system. SIGMOD Rec. 37(3), 26–32 (2008)
[23] R. Jampani, F. Xu, M. Wu, L.L. Perez, C. Jermaine, P.J. Haas, MCDB: a Monte Carlo approach to managing uncertain data, in SIGMOD, Vancouver (2008), pp. 687–700
[24] R.M. Karp, M. Luby, Monte-Carlo algorithms for enumeration and reliability problems, in 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, Berkeley, vol. 0 (1983), pp. 56–64
[25] O. Kennedy, The PIP MayBMS plugin. http://maybms.sourceforge.net
[26] O. Kennedy, C. Koch, PIP: a database system for great and small expectations, in ICDE (IEEE Computer Society, Piscataway, 2010), pp. 157–168
[27] O.A. Kennedy, S. Nath, Jigsaw: efficient optimization over uncertain enterprise data, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD’11, New York (ACM, 2011), pp. 829–840
[28] D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques (MIT Press, Cambridge, 2009)
[29] P. Kumari, S. Achmiz, O. Kennedy, Communicating data quality in on-demand curation, in QDB (2016)
[30] J. Li, B. Saha, A. Deshpande, A unified approach to ranking in probabilistic databases. Proc. VLDB Endow. 2(1), 502–513 (2009)
[31] A. Nandi, Y. Yang, O. Kennedy, B. Glavic, R. Fehling, Z.H. Liu, D. Gawlick, Mimir: bringing ctables into practice. Technical report, ArXiv (2016)
[32] D. Olteanu, J. Huang, C. Koch, Approximate confidence computation in probabilistic databases, in ICDE (IEEE Computer Society, Piscataway, 2010), pp. 145–156
[33] C.H. Papadimitriou, Computational Complexity (Wiley, Reading, 2003)
[34] S. Parsons, Probabilistic Graphical Models: Principles and Techniques by D. Koller, N. Friedman (MIT Press), 1231pp. $95.00, ISBN 0-262-01319-3. Knowl. Eng. Rev. 26(2), 237–238 (2011)
[35] C. Ré, N.N. Dalvi, D. Suciu, Efficient top-k query evaluation on probabilistic data, in ICDE (IEEE Computer Society, Los Alamitos, 2007), pp. 886–895
[36] S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. Hambrusch, R. Shah, Orion 2.0: native support for uncertain data, in SIGMOD, Vancouver (2008), pp. 1239–1242
[37] M.A. Soliman, I.F. Ilyas, K.C. Chang, Top-k query processing in uncertain databases, in ICDE (IEEE Computer Society, Los Alamitos, 2007), pp. 896–905
[38] M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E.J. O’Neil, P.E. O’Neil, A. Rasin, N. Tran, S.B. Zdonik, C-Store: a column-oriented DBMS, in VLDB (ACM, New York, 2005), pp. 553–564
[39] D. Suciu, D. Olteanu, C. Ré, C. Koch, Probabilistic Databases. Synthesis Lectures on Data Management (Morgan & Claypool Publishers, San Rafael, 2011)
[40] J. Widom, Trio: a system for integrated management of data, accuracy, and lineage. Technical Report (2004)
[41] Y. Yang, N. Meneghetti, R. Fehling, Z.H. Liu, O. Kennedy, Lenses: an on-demand approach to ETL. Proc. VLDB Endow. 8(12), 1578–1589 (2015)

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University at Buffalo, SUNY, Buffalo, NY, USA
Oliver Kennedy
Computer Science, Illinois Institute of Technology, Chicago, IL, USA
Boris Glavic

Corresponding author

Correspondence to Oliver Kennedy.

Editor information

Editors and Affiliations

IMT-Atlantique, Brest, France
Éloi Bossé
The State University of New York at Buffalo, Buffalo, NY, USA
Galina L. Rogova

About this chapter

Cite this chapter

Kennedy, O., Glavic, B. (2019). Analyzing Uncertain Tabular Data. In: Bossé, É., Rogova, G. (eds) Information Quality in Information Fusion and Decision Making. Information Fusion and Data Science. Springer, Cham. https://doi.org/10.1007/978-3-030-03643-0_12

DOI: https://doi.org/10.1007/978-3-030-03643-0_12
Published: 02 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03642-3
Online ISBN: 978-3-030-03643-0
eBook Packages: Computer Science; Computer Science (R0)
