Abstract
Data profiling is valuable for data governance and data quality control because it lets practitioners verify and review the quality of structured, semi-structured, and unstructured data. In this paper, we first review related work and discuss its definitions of data profiling. Second, we offer a new definition and propose new classifications for data profiling tasks. Third, we present several free and commercial profiling tools. Fourth, we offer new data quality metrics and a data quality score calculation. Finally, we discuss a data profiling tool framework for big data.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Dai, W., Wardlaw, I., Cui, Y., Mehdi, K., Li, Y., Long, J. (2016). Data Profiling Technology of Data Governance Regarding Big Data: Review and Rethinking. In: Latifi, S. (eds) Information Technology: New Generations. Advances in Intelligent Systems and Computing, vol 448. Springer, Cham. https://doi.org/10.1007/978-3-319-32467-8_39
DOI: https://doi.org/10.1007/978-3-319-32467-8_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32466-1
Online ISBN: 978-3-319-32467-8
eBook Packages: Engineering, Engineering (R0)