Ontology-Driven Approximate Duplicate Elimination of Postal Addresses
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5027)
In many real-life uses of postal address databases, duplicate elimination is an important problem that often has to be solved. Duplicates typically arise when one address database is merged with another, for instance during a joint venture or a merger between two companies, so that two or more records refer to the same address.
Although a trivial approach based on exact identity of records can be used in principle, it fails in practically every concrete case, and for postal addresses in particular, because the same address can be written in several different ways. An approximate approach is therefore needed, provided that it still performs the duplicate elimination correctly. We present an ontology-driven approach for postal addresses that solves the problem in an approximate fashion. The algorithm is based on a modification of the Levenshtein distance, obtained by introducing the notion of admissible abbreviation, and has three possible outcomes: eliminate duplicates, do not eliminate duplicates, or undecided.
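The general idea can be illustrated with a minimal sketch. It is not the algorithm of the paper, whose exact modification of the Levenshtein distance is not given in this abstract. The sketch assumes that admissible abbreviations can be modelled as a small lookup table that expands tokens before a standard Levenshtein distance is computed, and that the threefold outcome comes from two thresholds on the length-normalised distance. The table `ABBREVIATIONS`, the function names, and the threshold values `merge_at` and `keep_at` are all hypothetical.

```python
# Hypothetical table of admissible abbreviations (illustrative only).
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "apt": "apartment"}


def normalize(address):
    """Lower-case, drop commas and periods, expand admissible abbreviations."""
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def levenshtein(a, b):
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def compare(addr_a, addr_b, merge_at=0.1, keep_at=0.4):
    """Return 'eliminate', 'keep', or 'undecided' (thresholds are assumptions)."""
    a, b = normalize(addr_a), normalize(addr_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return "undecided"
    score = levenshtein(a, b) / longest
    if score <= merge_at:
        return "eliminate"
    if score >= keep_at:
        return "keep"
    return "undecided"


print(compare("12 Main St.", "12 Main Street"))
print(compare("12 Main Street", "12 Maine Str"))
print(compare("12 Main Street", "98 Oak Avenue"))
```

In this sketch the first pair becomes identical after abbreviation expansion, the second pair falls between the two thresholds, and the third pair is far apart. A purely character-based score would also merge addresses that differ only in the house number, which is one reason a real system needs field-level knowledge of the kind an ontology can provide.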
Keywords: String Match, Postal Address, Levenshtein Distance, Graph Edit Distance, Fuzzy Ontology
© Springer-Verlag Berlin Heidelberg 2008