Abstract
The Web has dramatically increased the need for efficient and flexible mechanisms that provide integrated views over multiple heterogeneous information sources. When multiple sources are integrated, each source may represent data differently. A common problem is inconsistency in the data: the very same term may appear with different values because of misspellings, permuted word order, spelling variants, and so on. In this paper, we present an improvement on our previous work for reducing the inconsistency found in existing databases. The objective of our method is to integrate and standardize the different values that refer to the same term. All values that refer to the same term are clustered by measuring their degree of similarity. Each cluster can then be assigned a common value that is substituted for the original values. The paper describes and compares five similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice, but it is time-consuming.
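The clustering idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not name the five similarity measures or the clustering procedure, so everything below is an assumption made for illustration: the measure is a length-normalized Levenshtein edit distance, the clustering is a simple greedy pass that compares each value with the first member of each existing cluster, and the common value is the most frequent member. The function names, the sample values, and the 0.8 threshold are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cluster_values(values, threshold=0.8):
    """Greedy clustering (an assumed, simplified procedure).

    Each value joins the first cluster whose first member is at least
    `threshold` similar (case-insensitive); otherwise it starts a new cluster.
    """
    clusters = []
    for value in values:
        for cluster in clusters:
            if similarity(value.lower(), cluster[0].lower()) >= threshold:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters


def canonical_value(cluster):
    """Pick the most frequent member; ties go to the earliest one."""
    return Counter(cluster).most_common(1)[0][0]


# Hypothetical sample data with misspellings.
values = [
    "University of Alicante",
    "Universty of Alicante",
    "University of Alicante",
    "Stanford University",
    "Stanford Univeristy",
]
clusters = cluster_values(values)
standardized = {v: canonical_value(c) for c in clusters for v in c}
```

In this sketch, `standardized` maps every original value to the common value chosen for its cluster. Plain edit distance handles misspellings but not permuted word order, which is one reason a comparison of several measures is useful. The pairwise comparisons also hint at why such a method can be time-consuming on large value sets.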
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luján-Mora, S., Palomar, M. (2001). Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources. In: Wang, X.S., Yu, G., Lu, H. (eds) Advances in Web-Age Information Management. WAIM 2001. Lecture Notes in Computer Science, vol 2118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47714-4_18
DOI: https://doi.org/10.1007/3-540-47714-4_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42298-3
Online ISBN: 978-3-540-47714-3
eBook Packages: Springer Book Archive