Data Profiling

Synonyms

Database profiling

Definition

Data profiling is the activity of creating small but informative summaries of a database [1]. These summaries range from simple statistics, such as the number of records in a table and the number of distinct values of a field, to more complex statistics, such as the distribution of n-grams in the field text, to structural properties, such as keys and functional dependencies. Database profiles are useful for database exploration, for detecting data quality problems [2], and for schema matching in data integration [1]. Database exploration helps a user identify important database properties, whether these are data of interest or data quality problems. Schema matching addresses the critical question, “do two fields or sets of fields or tables represent the same information?” Answers to this question are very useful for designing data integration scripts.
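The kinds of summaries named above can be illustrated with a minimal Python sketch. It is not taken from the entry or from any profiling tool: the function names, the list-of-dictionaries table layout, and the sample rows are all assumptions made for illustration, and the value-set overlap at the end is only one simple way to approach the schema matching question.

```python
from collections import Counter


def column_summary(rows, field):
    """Simple statistics: record count, null count, distinct non-null values."""
    values = [row[field] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "records": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }


def ngram_distribution(rows, field, n=2):
    """Count the character n-grams that occur in a text field."""
    counts = Counter()
    for row in rows:
        text = row[field]
        if text is None:
            continue
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts


def is_key(rows, fields):
    """True if no two rows share the same values on the given fields."""
    seen = set()
    for row in rows:
        combo = tuple(row[f] for f in fields)
        if combo in seen:
            return False
        seen.add(combo)
    return True


def holds_fd(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds exactly in rows."""
    mapping = {}
    for row in rows:
        left = tuple(row[f] for f in lhs)
        right = tuple(row[f] for f in rhs)
        # setdefault stores the first right-hand side seen for each left-hand side.
        if mapping.setdefault(left, right) != right:
            return False
    return True


def value_overlap(rows_a, field_a, rows_b, field_b):
    """Jaccard overlap of two fields' non-null value sets (a crude matching hint)."""
    a = {r[field_a] for r in rows_a if r[field_a] is not None}
    b = {r[field_b] for r in rows_b if r[field_b] is not None}
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Hypothetical sample data, invented for this sketch.
customers = [
    {"id": 1, "zip": "07932", "city": "Florham Park"},
    {"id": 2, "zip": "07932", "city": "Florham Park"},
    {"id": 3, "zip": "10001", "city": "New York"},
    {"id": 4, "zip": None, "city": "New York"},
]
shipments = [
    {"postcode": "07932"},
    {"postcode": "10001"},
    {"postcode": "94105"},
]

zip_profile = column_summary(customers, "zip")
zip_bigrams = ngram_distribution(customers, "zip", n=2)
```

On this sample, `is_key(customers, ["id"])` holds while `is_key(customers, ["zip"])` does not, and `holds_fd(customers, ["zip"], ["city"])` holds while the reverse dependency fails because the missing zip code is treated as an ordinary value. The sketch scans every row in memory; profiling a large database would typically rely on sampling or sketch-based summaries instead.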

Historical Background

Databases which support a complex organization tend to be quite complex...

Recommended Reading

  1. Evoke Software. Data profiling and mapping, the essential first step in data migration and integration projects. Available at: http://www.evokesoftware.com/pdf/wtpprDPM.pdf (2000).

  2. Dasu T, Johnson T, Muthukrishnan S, Shkapenyuk V. Mining database structure; or, how to build a data quality browser. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2002. p. 240–51.

  3. Dasu T, Johnson T. Exploratory data mining and data cleaning. New York: Wiley Interscience; 2003.

  4. Kang J, Naughton JF. On schema matching with opaque column names and data values. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2003. p. 205–16.

  5. Broder A. On the resemblance and containment of documents. In: Proceedings of the IEEE Conference on Compression and Comparison of Sequences; 1997. p. 21–9.

  6. Dasu T, Johnson T, Marathe A. Database exploration using database dynamics. IEEE Data Eng Bull. 2006;29(2):43–59.

  7. Gravano L, Ipeirotis PG, Jagadish HV, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 491–500.

  8. Huhtala Y, Karkkainen J, Porkka P, Toivonen H. TANE: an efficient algorithm for discovering functional and approximate dependencies. Comp J. 1999;42(2):100–11.

  9. Shen W, DeRose P, Vu L, Doan AH, Ramakrishnan R. Source-aware entity matching: a compositional approach. In: Proceedings of the 23rd International Conference on Data Engineering. p. 196–205.

  10. IBM Websphere Information Integration. Available at: http://ibm.ascential.com

  11. Informatica Data Explorer. Available at: http://www.informatica.com/products_services/data_explorer

Author information

Corresponding author

Correspondence to Theodore Johnson.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

About this entry

Cite this entry

Johnson, T. (2018). Data Profiling. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_601
