Determining the Real Data Completeness of a Relational Dataset

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Low data quality is a serious problem in the era of big data: it can severely reduce the usability of data, mislead or bias querying, analysis and mining, and lead to huge losses. Incomplete data is common in low-quality data, so it is necessary to determine the completeness of a dataset to guide follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work treats all missing values as unknown. In this paper, we study how to determine the real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values from other tuples and capture the attribute cells that are really missing. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound on the time complexity of this problem. We propose two optimal algorithms that determine the data completeness of a dataset in different cases. We empirically show the effectiveness and scalability of our algorithms on both real-world and synthetic data.
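To illustrate the idea of recovering missing cells through functional dependencies, the following is a minimal sketch and not the paper's model or algorithms, which are not reproduced on this page. It assumes a relation stored as a list of dictionaries with `None` marking a missing cell, assumes the data is consistent with the given dependencies, and measures completeness simply as the fraction of non-missing cells. The names `fill_by_fds` and `completeness` and the sample rows are hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[Hashable]]
FD = Tuple[Sequence[str], str]  # (left-hand attributes X, right-hand attribute A)


def fill_by_fds(rows: List[Row], fds: List[FD]) -> List[Row]:
    """Return a copy of rows in which a missing cell t[A] is filled whenever
    an FD X -> A and another tuple with the same known X-values determine it.
    Assumption: the data does not violate the given FDs."""
    filled = [dict(r) for r in rows]  # leave the input untouched
    changed = True
    while changed:  # a newly filled cell may enable further inferences
        changed = False
        for lhs, rhs in fds:
            # Map each fully known X-value to a known A-value.
            known = {}
            for r in filled:
                key = tuple(r[a] for a in lhs)
                if None not in key and r[rhs] is not None:
                    known.setdefault(key, r[rhs])
            # Fill missing A-cells whose X-value has a known A-value.
            for r in filled:
                key = tuple(r[a] for a in lhs)
                if r[rhs] is None and None not in key and key in known:
                    r[rhs] = known[key]
                    changed = True
    return filled


def completeness(rows: List[Row]) -> float:
    """Fraction of cells that are not missing (a simplified measure)."""
    total = sum(len(r) for r in rows)
    present = sum(v is not None for r in rows for v in r.values())
    return present / total if total else 1.0


# Hypothetical sample data: zip -> city is the only given dependency.
rows = [
    {"zip": "10001", "city": "New York", "phone": "555-0100"},
    {"zip": "10001", "city": None, "phone": None},
    {"zip": None, "city": "Boston", "phone": "555-0199"},
]
fds = [(["zip"], "city")]

naive = completeness(rows)                    # counts every None as missing
real = completeness(fill_by_fds(rows, fds))   # counts only cells no FD can recover
```

In this sample, the second tuple's `city` is recoverable from the first tuple via `zip -> city`, so only the remaining two empty cells count as really missing. The loop repeats until no cell changes because one inferred value can complete the left-hand side of another dependency.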



Author information

Corresponding author

Correspondence to Yong-Nan Liu.

About this article

Cite this article

Liu, YN., Li, JZ. & Zou, ZN. Determining the Real Data Completeness of a Relational Dataset. J. Comput. Sci. Technol. 31, 720–740 (2016). https://doi.org/10.1007/s11390-016-1659-x

  • DOI: https://doi.org/10.1007/s11390-016-1659-x
