Data Wrangling

Heer, Jeffrey; Hellerstein, Joseph M.; Kandel, Sean

doi:10.1007/978-3-319-77525-8_9

Reference work entry
First Online: 01 January 2019

Synonyms

Data preparation

Definitions

Data wrangling is the process of profiling and transforming datasets to ensure they are actionable for a set of analysis tasks. One central goal is to make data usable: to put data in a form that can be parsed and manipulated by analysis tools. Another goal is to ensure that data is responsive to the intended analyses: that the data contains the necessary information, at an acceptable level of description and correctness, to support successful modeling and decision-making.

Overview

Despite significant advances in technologies for data management and analysis, it remains time-consuming to inspect a dataset and mold it to a form that allows meaningful analysis to begin. Analysts must regularly restructure data to make it palatable to databases, statistics packages, and visualization tools. To improve data quality, analysts must also identify and address issues such as misspellings, missing data, unresolved duplicates, and outliers.
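
The quality issues listed above can be illustrated with a minimal profiling sketch in Python. Everything in it is an assumption made for illustration rather than a method from this entry: the sample records, the field names `city` and `temp_f`, the 1.5 × IQR rule for flagging outliers, and the 0.85 string-similarity cutoff for flagging possible misspellings.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
from statistics import quantiles

# Hypothetical records; None marks a missing value.
rows = [
    {"city": "Seattle", "temp_f": 54.0},
    {"city": "Seattle", "temp_f": 54.0},   # exact duplicate
    {"city": "Berkeley", "temp_f": None},  # missing measurement
    {"city": "Berkely", "temp_f": 61.0},   # likely misspelling
    {"city": "Tartu", "temp_f": 48.0},
    {"city": "Sydney", "temp_f": 640.0},   # suspicious value
    {"city": "Berkeley", "temp_f": 59.0},
]

def profile(rows, field):
    """Count missing values and exact duplicate rows, and list outliers."""
    values = [r[field] for r in rows if r[field] is not None]
    missing = len(rows) - len(values)

    # A row is a duplicate if an identical row appeared earlier.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(c - 1 for c in counts.values())

    # Assumed rule: flag values more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = 1.5 * (q3 - q1)
    outliers = [v for v in values if v < q1 - spread or v > q3 + spread]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}

def similar_labels(rows, field, cutoff=0.85):
    """Return pairs of distinct labels that are nearly identical strings."""
    labels = sorted({r[field] for r in rows if r[field] is not None})
    return [
        (a, b)
        for a, b in combinations(labels, 2)
        if SequenceMatcher(None, a, b).ratio() >= cutoff
    ]

report = profile(rows, "temp_f")
suspects = similar_labels(rows, "city")
```

A check like this only surfaces candidates. Whether a flagged value is an error or a legitimate extreme, and which of two similar spellings is correct, remains a judgment for the analyst.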

Data wrangling is...

Notes

1. Normal forms beyond first normal form (second normal form, etc.) are often less desirable for analysis purposes: one might wish to denormalize data (e.g., by joining relations with primary-foreign key relationships) in order to more conveniently perform analysis over a single table.
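
As a minimal sketch of the denormalization described in this note, the Python snippet below joins two relations on a primary-foreign key relationship and then runs a simple aggregate over the resulting single table. The relations `customers` and `orders`, their fields, and the choice to drop rows with a dangling foreign key are hypothetical and chosen only for illustration.

```python
# Hypothetical normalized relations: customer_id is the primary key of
# customers and a foreign key in orders.
customers = {
    1: {"name": "Ada", "region": "West"},
    2: {"name": "Grace", "region": "East"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.0},
]

def denormalize(orders, customers):
    """Inner-join each order with the attributes of its customer."""
    wide = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            continue  # assumed policy: skip orders with a dangling key
        wide.append({**order, **customer})
    return wide

flat = denormalize(orders, customers)

# Analysis over the single table: total order amount per region.
totals = {}
for row in flat:
    totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
```

The joined table repeats each customer's attributes on every matching order. That redundancy is what normalization removes, and it is also what lets a per-region total be computed in a single pass.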

Author information

Authors and Affiliations

University of Washington, Seattle, WA, USA
Jeffrey Heer
University of California, Berkeley, Berkeley, CA, USA
Joseph M. Hellerstein
Trifacta Inc., San Francisco, CA, USA
Sean Kandel

Corresponding author

Correspondence to Jeffrey Heer.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
School of Information Technologies, Sydney University, Sydney, Australia
Albert Y. Zomaya

Cite this entry

Heer, J., Hellerstein, J.M., Kandel, S. (2019). Data Wrangling. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_9

DOI: https://doi.org/10.1007/978-3-319-77525-8_9
Published: 20 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
