Analyzing Uncertain Tabular Data

  • Oliver Kennedy
  • Boris Glavic
Chapter
Part of the Information Fusion and Data Science book series (IFDS)

Abstract

It is common practice to spend considerable time refining source data to address data quality issues before beginning any data analysis. For example, an analyst might impute missing values or detect and fuse duplicate records that represent the same real-world entity. However, in many situations there are multiple candidate resolutions for a data quality issue, but not enough evidence to determine which resolution is the most appropriate. In this case, the only way forward is to make assumptions that restrict the space of solutions and/or to heuristically choose a resolution based on characteristics deemed predictive of “good” resolutions. Although it is important for the analyst to understand the impact of these assumptions and heuristic choices on her results, evaluating this impact can be difficult and time-consuming. For several decades, the fields of probabilistic, incomplete, and fuzzy databases have developed strategies for analyzing the impact of uncertainty on the outcome of analyses. This general family of uncertainty-aware databases aims to model ambiguity in the results of analyses expressed in standard languages like SQL, SPARQL, R, or Spark. An uncertainty-aware database uses descriptions of potential errors and ambiguities in source data to derive a corresponding description of potential errors or ambiguities in the result of an analysis that accesses this source data. Depending on the technique, these descriptions of uncertainty may be either quantitative (bounds, probabilities) or qualitative (certain outcomes, unknown values, explanations of uncertainty). In this chapter, we explore the types of problems that techniques from uncertainty-aware databases address, survey solutions to these problems, and highlight their application to fixing data quality issues.

Keywords

Databases; Probabilistic databases; Incomplete databases; Incomplete information; Uncertain data

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science and Engineering, University at Buffalo, SUNY, Buffalo, USA
  2. Computer Science, Illinois Institute of Technology, Chicago, USA