
Ontology-Driven Approximate Duplicate Elimination of Postal Addresses

  • Matteo Cristani
  • Alessio Gugole
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5027)

Abstract

In many real-world uses of postal address databases, duplicate elimination is an important problem. Duplicates typically arise when one address database is merged with another, for instance during a joint venture or a merger between two companies, so that two or more records refer to the same address.

Although a trivial approach based on exact identity could be used in principle, it fails in practice, particularly for postal addresses, because the same address can be written in several different ways. An approximate approach can succeed, provided that the duplicate elimination is performed correctly. We present an ontology-driven approach for postal addresses that solves the problem approximately. The algorithm is based on a modification of the Levenshtein distance, obtained by introducing the notion of admissible abbreviation, and has three possible outcomes: eliminate the duplicates, do not eliminate them, or undecided.
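
The general idea can be illustrated with a minimal Python sketch. It is not the authors' algorithm, which is not reproduced on this page. It assumes that admissible abbreviations can be approximated by a small lookup table and that the three outcomes can be approximated by two distance thresholds. The names `ABBREVIATIONS`, `normalize`, `classify`, `low` and `high`, and the threshold values, are hypothetical.

```python
# Hypothetical abbreviation table; the paper's ontology is not reproduced here.
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
}


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalize(address: str) -> str:
    """Lowercase, drop commas, and expand abbreviations found in the table."""
    tokens = address.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(t.rstrip("."), t) for t in tokens)


def classify(a: str, b: str, low: int = 2, high: int = 6) -> str:
    """Three-way decision from the distance between normalized addresses.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    distance = levenshtein(normalize(a), normalize(b))
    if distance <= low:
        return "eliminate"   # treat as duplicates
    if distance >= high:
        return "keep"        # treat as distinct addresses
    return "undecided"       # leave for manual review


print(classify("12 Main St.", "12 Main Street"))
print(classify("12 Main St", "12 Main Street 4B"))
print(classify("12 Main St", "98 Oak Avenue"))
```

Expanding abbreviations before measuring the distance means that "St." and "Street" cost nothing, while genuine differences still count. The middle band between the two thresholds is what produces the undecided outcome.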

Keywords

String Match; Postal Address; Levenshtein Distance; Graph Edit Distance; Fuzzy Ontology
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Matteo Cristani (1)
  • Alessio Gugole (1)
  1. University of Verona, 37134 Verona
