Encyclopedia of Big Data Technologies

2019 Edition | Editors: Sherif Sakr, Albert Y. Zomaya

Data Wrangling

  • Jeffrey Heer
  • Joseph M. Hellerstein
  • Sean Kandel
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_9

Synonyms

Definitions

Data wrangling is the process of profiling and transforming datasets to ensure they are actionable for a set of analysis tasks. One central goal is to make data usable: to put data in a form that can be parsed and manipulated by analysis tools. Another goal is to ensure that data is responsive to the intended analyses: that the data contains the necessary information, at an acceptable level of description and correctness, to support successful modeling and decision-making.

Overview

Despite significant advances in technologies for data management and analysis, it remains time-consuming to inspect a dataset and mold it to a form that allows meaningful analysis to begin. Analysts must regularly restructure data to make it palatable to databases, statistics packages, and visualization tools. To improve data quality, analysts must also identify and address issues such as misspellings, missing data, unresolved duplicates, and outliers.
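As a minimal sketch of the kind of quality checks described above, the following Python fragment normalizes a handful of records and then flags missing values, duplicates that only surface after normalization, and numeric outliers. The records, field names, the `normalize` and `profile` helpers, and the cutoff of 5 median absolute deviations are all assumptions made for illustration, not part of the original entry; misspellings are not handled here and would need fuzzy matching.

```python
from statistics import median

# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"city": "Seattle", "sales": "120"},
    {"city": "seattle ", "sales": "120"},   # same record with different formatting
    {"city": "Berkeley", "sales": ""},      # missing value
    {"city": "San Francisco", "sales": "135"},
    {"city": "Oakland", "sales": "128"},
    {"city": "Portland", "sales": "9000"},  # suspiciously large value
]

def normalize(record):
    """Trim whitespace, unify case, and parse numbers; empty strings become None."""
    city = record["city"].strip().title()
    raw = record["sales"].strip()
    return {"city": city, "sales": float(raw) if raw else None}

def profile(rows):
    """Return indices of rows that are missing a value, duplicated, or outlying."""
    cleaned = [normalize(r) for r in rows]

    missing = [i for i, r in enumerate(cleaned) if r["sales"] is None]

    # A row is a duplicate if an earlier row has the same normalized content.
    seen, duplicates = set(), []
    for i, r in enumerate(cleaned):
        key = (r["city"], r["sales"])
        if key in seen:
            duplicates.append(i)
        seen.add(key)

    # Flag values far from the median, measured in median absolute deviations.
    # The cutoff of 5 is an arbitrary assumption for this sketch.
    values = [r["sales"] for r in cleaned if r["sales"] is not None]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    outliers = [
        i for i, r in enumerate(cleaned)
        if r["sales"] is not None and mad > 0 and abs(r["sales"] - med) > 5 * mad
    ]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}

report = profile(records)
print(report)
```

In practice an analyst would review the flagged rows rather than delete them automatically, since an apparent outlier or duplicate may be legitimate data.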

Data wrangling is...


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Jeffrey Heer (1)
  • Joseph M. Hellerstein (2)
  • Sean Kandel (3)

  1. University of Washington, Seattle, USA
  2. University of California, Berkeley, Berkeley, USA
  3. Trifacta Inc., San Francisco, USA