Abstract
In today’s information age, processing customer information in a standardized and accurate manner is a difficult task. Data collection methods vary from source to source in format, volume, and media type. It is therefore advantageous to deploy customized data hygiene techniques that standardize the data so that it is meaningful and useful to the organization.
Notes
1. NameCheck is an algorithm currently used by Acxiom.
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Varol, C., Bayrak, C., Wagner, R., Goff, D. (2009). Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data. In: Chan, Y., Talburt, J., Talley, T. (eds) Data Engineering. International Series in Operations Research & Management Science, vol 132. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-0176-7_5
DOI: https://doi.org/10.1007/978-1-4419-0176-7_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-0175-0
Online ISBN: 978-1-4419-0176-7
eBook Packages: Computer Science (R0)