Part of the book series: Information Fusion and Data Science (IFDS)

Abstract

It is common practice to spend considerable time refining source data to address issues of data quality before beginning any data analysis. For example, an analyst might impute missing values or detect and fuse duplicate records representing the same real-world entity. However, there are many situations where there are multiple possible candidate resolutions for a data quality issue, but there is not sufficient evidence for determining which of the resolutions is the most appropriate. In this case, the only way forward is to make assumptions to restrict the space of solutions and/or to heuristically choose a resolution based on characteristics that are deemed predictive of “good” resolutions. Although it is important for the analyst to understand the impact of these assumptions and heuristic choices on her results, evaluating this impact can be highly nontrivial and time-consuming. For several decades now, the fields of probabilistic, incomplete, and fuzzy databases have developed strategies for analyzing the impact of uncertainty on the outcome of analyses. This general family of uncertainty-aware databases aims to model ambiguity in the results of analyses expressed in standard languages like SQL, SPARQL, R, or Spark. An uncertainty-aware database uses descriptions of potential errors and ambiguities in source data to derive a corresponding description of potential errors or ambiguities in the result of an analysis accessing this source data. Depending on the technique, these descriptions of uncertainty may be either quantitative (bounds, probabilities) or qualitative (certain outcomes, unknown values, explanations of uncertainty). In this chapter, we explore the types of problems that techniques from uncertainty-aware databases address, survey solutions to these problems, and highlight their application to fixing data quality issues.

Notes

  1.

    In database terminology, we would say that a functional dependency id → name holds for the dataset, i.e., the “id” value of a record determines its name value. Or put differently, there are no two records that have the same SSN, but a different name. The intricacies of functional dependencies are beyond the scope of this paper. The interested reader is referred to database textbooks (e.g., [1]). Furthermore, see, e.g., [12] for how constraints like functional dependencies are used to repair data errors.

  2.

    The reader may wonder whether it is possible to encode a certain record \(r\) as multiple x-tuples that all have \(r\) as an instantiation, where for each such x-tuple \(\mathbf{r}\) we have \(p_{\mathbf{r}}(r) < 1\). However, recall that x-tuples are assumed to be independent of each other. Thus, there would exist a possible world with nonzero probability that does not contain \(r\), constructed by choosing an instantiation \(r' \neq r\) or no instantiation for every x-tuple \(\mathbf{r}\) with \(r \in \mathbf{r}\).

  3.

    Note that global conditions are not strictly necessary for expressive power, but they may allow for a more compact/convenient representation of a probabilistic database.

  4.

    Consider an incomplete database \(\mathcal D\) with \(2^n\) possible worlds \(D_1, \ldots, D_{2^n}\) (the construction has to be modified slightly if the number of possible worlds is not a power of 2). Then we use \(n\) variables \(v_1, \ldots, v_n\). An assignment to these variables is interpreted as a number \(i\) in binary identifying one possible world \(D_i\). For example, if there are \(4 = 2^2\) possible worlds, then we would use two variables \(v_1\) and \(v_2\), and the assignment \(v_1 = \mathbf T\) and \(v_2 = \mathbf F\) represents the possible world \(1 \cdot 2^1 + 0 \cdot 2^0 = 2\). The database constructed contains all records that are possible in \(\mathcal D\). For an assignment \(\alpha\), let \(n(\alpha)\) denote the number encoded by \(\alpha\). Then the local condition for record \(r\) is \(\bigvee \limits _{\alpha : r \in D_{n(\alpha )}} \bigwedge \limits _{j: \alpha (v_j) = \mathbf T} v_j\).

  5.

    Note that [16] used per-variable distributions, which is less general.

  6.

    Note that more than two options can be modeled by multiple Boolean variables. For example, four alternatives can be modeled with annotations \(v_1 \wedge v_2\), \(\neg v_1 \wedge v_2\), \(v_1 \wedge \neg v_2\), and \(\neg v_1 \wedge \neg v_2\), respectively.

  7.

    For a more thorough introduction, we refer the interested reader to a textbook by Garcia-Molina et al. [14].

  8.

    Observe that a binary version of this problem can be applied in the case of incomplete databases. A tuple is certain if its local condition is implied by the global condition.

  9.

    This prevents repeated identifiers if a record appears on both sides.
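
The independence argument in Note 2 can be checked numerically. The following is a minimal sketch, not the chapter's own code: the function name and the example probabilities are hypothetical, and it assumes only what the note states, namely that x-tuples choose their instantiations independently. Under that assumption, the probability that no x-tuple instantiates to the record is the product of the individual "not chosen" probabilities, which is nonzero whenever every individual probability is below 1.

```python
from math import prod

def prob_record_absent(instantiation_probs):
    """Probability that none of the independent x-tuples yields the record.

    instantiation_probs holds, for each x-tuple that has the record as an
    alternative, the probability that this x-tuple instantiates to it.
    """
    return prod(1.0 - p for p in instantiation_probs)

# Hypothetical numbers: three x-tuples, each choosing the record with p < 1.
probs = [0.9, 0.8, 0.5]
absent = prob_record_absent(probs)   # product of (1 - p) over the x-tuples
present = 1.0 - absent               # strictly below 1, so not certain
```

Only when some x-tuple has probability exactly 1 for the record does the product collapse to zero, which is why a certain record cannot be split across several uncertain x-tuples.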
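
To illustrate Note 6, the sketch below enumerates all assignments to two Boolean variables and confirms that each assignment satisfies exactly one of the four annotations, so the annotations behave like a four-way choice. The dictionary keys and the helper name are illustrative assumptions, not notation from the chapter.

```python
from itertools import product

# The four annotations from Note 6, written as predicates over (v1, v2).
annotations = {
    "v1 AND v2": lambda v1, v2: v1 and v2,
    "NOT v1 AND v2": lambda v1, v2: (not v1) and v2,
    "v1 AND NOT v2": lambda v1, v2: v1 and (not v2),
    "NOT v1 AND NOT v2": lambda v1, v2: (not v1) and (not v2),
}

def selected_alternatives(v1, v2):
    """Return the annotations that hold under the assignment (v1, v2)."""
    return [name for name, cond in annotations.items() if cond(v1, v2)]

# Map every assignment of the two variables to the alternatives it selects.
selection = {
    (v1, v2): selected_alternatives(v1, v2)
    for v1, v2 in product([True, False], repeat=2)
}
```

Each entry of `selection` contains a single annotation, and the four entries are pairwise different, which is the mutual exclusivity and exhaustiveness the encoding relies on.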
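
The certainty test in Note 8 (a tuple is certain if its local condition is implied by the global condition) can be sketched as a brute-force implication check. This is an assumption-laden illustration: conditions are modeled as Python callables over a variable assignment, the variable names and example conditions are made up, and enumerating all assignments takes time exponential in the number of variables, so it is only practical for tiny examples.

```python
from itertools import product

def is_certain(local_cond, global_cond, variables):
    """Check whether global_cond implies local_cond.

    Returns True if every assignment that satisfies the global condition
    also satisfies the tuple's local condition.
    """
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if global_cond(assignment) and not local_cond(assignment):
            return False  # a possible world exists that lacks the tuple
    return True

# Hypothetical example: the global condition says at least one of x, y holds.
variables = ["x", "y"]
global_cond = lambda a: a["x"] or a["y"]

tuple_a = lambda a: a["x"] or a["y"]   # local condition implied by global
tuple_b = lambda a: a["x"]             # fails when x is false and y is true

certain_a = is_certain(tuple_a, global_cond, variables)
certain_b = is_certain(tuple_b, global_cond, variables)
```

Here `tuple_a` is certain, while `tuple_b` is not, because the assignment with `x` false and `y` true satisfies the global condition but not the local one.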

References

  1. S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases (Addison-Wesley, Reading, 1995)

  2. L. Antova, C. Koch, On APIs for probabilistic databases, in QDB/MUD, Auckland (2008), pp. 41–56

  3. L. Antova, C. Koch, D. Olteanu, MayBMS: managing incomplete information with probabilistic world-set decompositions, in ICDE, Istanbul (IEEE Computer Society, 2007), pp. 1479–1480

  4. J. Boulos, N.N. Dalvi, B. Mandhani, S. Mathur, C. Ré, D. Suciu, MYSTIQ: a system for finding more answers by using probabilities, in SIGMOD Conference (ACM, New York, 2005), pp. 891–893

  5. D. Crankshaw, P. Bailis, J.E. Gonzalez, H. Li, Z. Zhang, M.J. Franklin, A. Ghodsi, M.I. Jordan, The missing piece in complex analytics: low latency, scalable model management and serving with Velox, in CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, 4–7 Jan 2015, Online Proceedings (2015). www.cidrdb.org

  6. P. Dagum, R.M. Karp, M. Luby, S.M. Ross, An optimal algorithm for Monte Carlo estimation. SIAM J. Comput. 29(5), 1484–1496 (2000)

  7. N. Dalvi, D. Suciu, The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM 59(6), 30:1–30:87 (2013)

  8. G.V. den Broeck, D. Suciu, Query processing on probabilistic data: a survey. Found. Trends Databases 7(3–4), 197–341 (2017)

  9. A. Deshpande, S. Madden, MauveDB: supporting model-based user views in database systems, in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD’06, New York (ACM, 2006), pp. 73–84

  10. L. Detwiler, W. Gatterbauer, B. Louie, D. Suciu, P. Tarczy-Hornoch, Integrating and ranking uncertain scientific data, in ICDE, Shanghai (2009), pp. 1235–1238

  11. R. Fagin, P.G. Kolaitis, R.J. Miller, L. Popa, Data exchange: semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005). Database Theory

  12. W. Fan, Dependencies revisited for improving data quality, in Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, Vancouver, 9–11 June 2008, ed. by M. Lenzerini, D. Lembo (ACM, 2008), pp. 159–170

  13. R. Fink, J. Huang, D. Olteanu, Anytime approximation in probabilistic databases. VLDB J. 22(6), 823–848 (2013)

  14. H. Garcia-Molina, J.D. Ullman, J. Widom, Database Systems: The Complete Book (Prentice Hall, Upper Saddle River, 2002)

  15. W. Gatterbauer, D. Suciu, Dissociation and propagation for approximate lifted inference with standard relational database management systems. VLDB J. 26(1), 5–30 (2017)

  16. T. Green, V. Tannen, Models for incomplete and probabilistic information, in Current Trends in Database Technology – EDBT 2006, ed. by T. Grust, H. Höpfner, A. Illarramendi, S. Jablonski, M. Mesiti, S. Müller, P.-L. Patranjan, K.-U. Sattler, M. Spiliopoulou, J. Wijsen. Lecture Notes in Computer Science, vol. 4254 (Springer, Berlin/Heidelberg, 2006), pp. 278–296

  17. T.J. Green, G. Karvounarakis, V. Tannen, Provenance semirings, in PODS, Beijing (2007), pp. 31–40

  18. T.J. Green, G. Karvounarakis, N.E. Taylor, O. Biton, Z.G. Ives, V. Tannen, ORCHESTRA: facilitating collaborative data sharing, in SIGMOD, Beijing (2007), pp. 1131–1133

  19. P. Guagliardo, L. Libkin, Making SQL queries correct on incomplete databases: a feasibility study, in PODS (ACM, New York, 2016), pp. 211–223

  20. J. Huang, L. Antova, C. Koch, D. Olteanu, MayBMS: a probabilistic database management system, in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD’09, New York (ACM, 2009), pp. 1071–1074

  21. T. Imieliński, W. Lipski, Jr., Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984)

  22. Z.G. Ives, T.J. Green, G. Karvounarakis, N.E. Taylor, V. Tannen, P.P. Talukdar, M. Jacob, F. Pereira, The ORCHESTRA collaborative data sharing system. SIGMOD Rec. 37(3), 26–32 (2008)

  23. R. Jampani, F. Xu, M. Wu, L.L. Perez, C. Jermaine, P.J. Haas, MCDB: a Monte Carlo approach to managing uncertain data, in SIGMOD, Vancouver (2008), pp. 687–700

  24. R.M. Karp, M. Luby, Monte-Carlo algorithms for enumeration and reliability problems, in 24th Annual Symposium on Foundations of Computer Science (1983), pp. 56–64

  25. O. Kennedy, The PIP MayBMS plugin. http://maybms.sourceforge.net

  26. O. Kennedy, C. Koch, PIP: a database system for great and small expectations, in ICDE (IEEE Computer Society, Piscataway, 2010), pp. 157–168

  27. O.A. Kennedy, S. Nath, Jigsaw: efficient optimization over uncertain enterprise data, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD’11, New York (ACM, 2011), pp. 829–840

  28. D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques (MIT Press, Cambridge, 2009)

  29. P. Kumari, S. Achmiz, O. Kennedy, Communicating data quality in on-demand curation, in QDB (2016)

  30. J. Li, B. Saha, A. Deshpande, A unified approach to ranking in probabilistic databases. Proc. VLDB Endow. 2(1), 502–513 (2009)

  31. A. Nandi, Y. Yang, O. Kennedy, B. Glavic, R. Fehling, Z.H. Liu, D. Gawlick, Mimir: bringing ctables into practice. Technical report, ArXiv (2016)

  32. D. Olteanu, J. Huang, C. Koch, Approximate confidence computation in probabilistic databases, in ICDE (IEEE Computer Society, Piscataway, 2010), pp. 145–156

  33. C.H. Papadimitriou, Computational Complexity (Wiley, Reading, 2003)

  34. S. Parsons, Probabilistic Graphical Models: Principles and Techniques by D. Koller, N. Friedman (MIT Press), 1231pp. $95.00, ISBN 0-262-01319-3. Knowl. Eng. Rev. 26(2), 237–238 (2011)

  35. C. Ré, N.N. Dalvi, D. Suciu, Efficient top-k query evaluation on probabilistic data, in ICDE (IEEE Computer Society, Los Alamitos, 2007), pp. 886–895

  36. S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. Hambrusch, R. Shah, Orion 2.0: native support for uncertain data, in SIGMOD, Vancouver (2008), pp. 1239–1242

  37. M.A. Soliman, I.F. Ilyas, K.C. Chang, Top-k query processing in uncertain databases, in ICDE (IEEE Computer Society, Los Alamitos, 2007), pp. 896–905

  38. M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E.J. O’Neil, P.E. O’Neil, A. Rasin, N. Tran, S.B. Zdonik, C-store: a column-oriented DBMS, in VLDB (ACM, New York, 2005), pp. 553–564

  39. D. Suciu, D. Olteanu, C. Ré, C. Koch, Probabilistic Databases. Synthesis Lectures on Data Management (Morgan & Claypool Publishers, San Rafael, 2011)

  40. J. Widom, Trio: a system for integrated management of data, accuracy, and lineage. Technical Report (2004)

  41. Y. Yang, N. Meneghetti, R. Fehling, Z.H. Liu, O. Kennedy, Lenses: an on-demand approach to ETL. Proc. VLDB Endow. 8(12), 1578–1589 (2015)

Correspondence to Oliver Kennedy.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kennedy, O., Glavic, B. (2019). Analyzing Uncertain Tabular Data. In: Bossé, É., Rogova, G. (eds) Information Quality in Information Fusion and Decision Making. Information Fusion and Data Science. Springer, Cham. https://doi.org/10.1007/978-3-030-03643-0_12

  • DOI: https://doi.org/10.1007/978-3-030-03643-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-03642-3

  • Online ISBN: 978-3-030-03643-0

  • eBook Packages: Computer Science, Computer Science (R0)
