Synonyms
Definitions
Data wrangling is the process of profiling and transforming datasets to ensure they are actionable for a set of analysis tasks. One central goal is to make data usable: to put data in a form that can be parsed and manipulated by analysis tools. Another goal is to ensure that data is responsive to the intended analyses: that it contains the necessary information, at an acceptable level of description and correctness, to support successful modeling and decision-making.
Overview
Despite significant advances in technologies for data management and analysis, it remains time-consuming to inspect a dataset and mold it to a form that allows meaningful analysis to begin. Analysts must regularly restructure data to make it palatable to databases, statistics packages, and visualization tools. To improve data quality, analysts must also identify and address issues such as misspellings, missing data, unresolved duplicates, and outliers.
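As a rough illustration of the quality issues listed above, the following sketch profiles a tiny table for missing values, unresolved duplicates, and outliers. The records, the field names, the `profile` helper, and the 3.5 cutoff on a median-based score are all assumptions made for this example; they are not taken from the entry, and real tools use richer checks.

```python
from statistics import median

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"city": "Boston", "temp_f": "71"},
    {"city": "boston ", "temp_f": "71"},   # duplicate once case and whitespace are normalized
    {"city": "Chicago", "temp_f": ""},      # missing value
    {"city": "Denver", "temp_f": "68"},
    {"city": "Austin", "temp_f": "70"},
    {"city": "Miami", "temp_f": "710"},     # likely a data-entry error
]


def profile(rows, text_field, numeric_field, threshold=3.5):
    """Count missing values and duplicate rows, and list outlying numbers."""
    missing = 0
    duplicates = 0
    seen = set()
    values = []
    for row in rows:
        raw = row[numeric_field].strip()
        # Normalize the text field so trivially different spellings collide.
        key = (row[text_field].strip().lower(), raw)
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        if raw == "":
            missing += 1
        else:
            values.append(float(raw))
    # Assumed outlier rule: a score based on the median absolute deviation,
    # which is less sensitive to extreme values than mean and standard deviation.
    med = median(values)
    mad = median(abs(v - med) for v in values)
    outliers = [
        v for v in values
        if mad > 0 and 0.6745 * abs(v - med) / mad > threshold
    ]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


report = profile(records, "city", "temp_f")
print(report)
```

On this toy input the report flags one missing temperature, one duplicate row, and the value 710 as an outlier. Whether each flagged item is truly an error remains a judgment for the analyst.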
Data wrangling is...
Notes
1. Normal forms beyond first normal form (second normal form, etc.) are often less desirable for analysis purposes: one might wish to denormalize data (e.g., by joining relations with primary-foreign key relationships) in order to more conveniently perform analysis over a single table.
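The denormalization described in this note can be sketched as a join of two relations into one wide table. The `customers` and `orders` relations, their fields, and the `denormalize` helper below are hypothetical and chosen only to show the idea; orders whose foreign key has no match keep their row with empty customer fields, as in a left outer join.

```python
# Hypothetical normalized relations: customers keyed by primary key,
# orders referring to customers through the foreign key customer_id.
customers = {
    1: {"name": "Ada", "region": "East"},
    2: {"name": "Lin", "region": "West"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 2, "amount": 40.0},
    {"order_id": 102, "customer_id": 3, "amount": 15.0},  # no matching customer
]


def denormalize(order_rows, customer_table, customer_fields=("name", "region")):
    """Join each order with its customer, producing one wide row per order."""
    wide = []
    for order in order_rows:
        customer = customer_table.get(order["customer_id"])
        row = dict(order)
        for field in customer_fields:
            # Unmatched foreign keys yield None rather than dropping the order.
            row[f"customer_{field}"] = customer[field] if customer else None
        wide.append(row)
    return wide


wide_table = denormalize(orders, customers)
for row in wide_table:
    print(row)
```

The resulting single table repeats customer attributes on every order, which is redundant for storage but convenient for grouping or modeling by region.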
Copyright information
© 2019 Springer Nature Switzerland AG
About this entry
Cite this entry
Heer, J., Hellerstein, J.M., Kandel, S. (2019). Data Wrangling. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_9
DOI: https://doi.org/10.1007/978-3-319-77525-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering