An Ontology-Based Method for Duplicate Detection in Web Data Tables

Buche, Patrice; Dibie-Barthélemy, Juliette; Khefifi, Rania; Saïs, Fatiha

doi:10.1007/978-3-642-23088-2_38

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6860)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

We present a method for detecting duplicates in semantically annotated Web data tables, driven by a domain Termino-Ontological Resource (TOR). The method relies on fuzzy semantic annotations, one of which is automatically associated with each row of a Web data table. Each annotation instantiates a composed concept of the domain TOR, which represents the semantic n-ary relationship between the columns of the Web data table, and contains fuzzy values expressed as fuzzy sets. The method automatically detects pairs of duplicate fuzzy semantic annotations, relying on (i) knowledge declared in the domain TOR and (ii) similarity measures between fuzzy sets. Two new similarity measures are defined to compare symbolic fuzzy values and numerical fuzzy values. The method has been tested on a real application in the domain of chemical risk in food.
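The abstract does not give the two similarity measures, so the following Python sketch only illustrates the general idea of comparing fuzzy values and deciding whether two annotations are duplicates. It is not the authors' method. Everything in it is an assumption made for illustration: the fuzzy Jaccard ratio used for symbolic values, the sampled overlap ratio used for trapezoidal numerical values, the mean aggregation, the 0.8 threshold, and all function and attribute names.

```python
from typing import Dict, Tuple, Union

# Assumed representation: a symbolic fuzzy set maps symbols to degrees in [0, 1].
SymbolicFuzzySet = Dict[str, float]
# Assumed representation: a numerical fuzzy set is a trapezoid (a, b, c, d)
# with support [a, d] and kernel [b, c].
Trapezoid = Tuple[float, float, float, float]
FuzzyValue = Union[SymbolicFuzzySet, Trapezoid]


def symbolic_similarity(x: SymbolicFuzzySet, y: SymbolicFuzzySet) -> float:
    """Fuzzy Jaccard ratio: sum of min degrees over sum of max degrees."""
    symbols = set(x) | set(y)
    inter = sum(min(x.get(s, 0.0), y.get(s, 0.0)) for s in symbols)
    union = sum(max(x.get(s, 0.0), y.get(s, 0.0)) for s in symbols)
    # Two empty sets are treated as identical (illustrative choice).
    return inter / union if union > 0 else 1.0


def membership(t: Trapezoid, v: float) -> float:
    """Membership degree of v in the trapezoidal fuzzy set t."""
    a, b, c, d = t
    if b <= v <= c:
        return 1.0
    if a < v < b:
        return (v - a) / (b - a)
    if c < v < d:
        return (d - v) / (d - c)
    return 0.0


def numerical_similarity(x: Trapezoid, y: Trapezoid, steps: int = 1000) -> float:
    """Approximate area(min) / area(max) by sampling the joint support."""
    lo, hi = min(x[0], y[0]), max(x[3], y[3])
    if hi == lo:
        return 1.0  # both values are the same single point
    inter = union = 0.0
    for i in range(steps + 1):
        v = lo + (hi - lo) * i / steps
        mx, my = membership(x, v), membership(y, v)
        inter += min(mx, my)
        union += max(mx, my)
    return inter / union if union > 0 else 0.0


def is_duplicate(a: Dict[str, FuzzyValue], b: Dict[str, FuzzyValue],
                 threshold: float = 0.8) -> bool:
    """Mean similarity over shared attributes, compared to a threshold."""
    shared = [k for k in a if k in b]
    if not shared:
        return False
    sims = []
    for k in shared:
        if isinstance(a[k], dict) and isinstance(b[k], dict):
            sims.append(symbolic_similarity(a[k], b[k]))
        elif isinstance(a[k], tuple) and isinstance(b[k], tuple):
            sims.append(numerical_similarity(a[k], b[k]))
        else:
            sims.append(0.0)  # values of different kinds never match
    return sum(sims) / len(sims) >= threshold
```

In this sketch an annotation is a plain dictionary from a hypothetical attribute name to a fuzzy value, for example `{"food": {"apple": 1.0}, "ph": (3.0, 3.5, 4.0, 4.5)}`. The knowledge declared in the domain TOR, which the abstract names as the other input to the method, is not modelled here.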

Author information

Authors and Affiliations

INRA - UMR IATE, 2, place Pierre Viala, F-34060, Montpellier Cedex 2, France
Patrice Buche
LIRMM, CNRS-UM2, F-34392, Montpellier, France
Patrice Buche
INRA - Mét@risk & AgroParisTech, 16 rue Claude Bernard, F-75231, Paris Cedex 5, France
Juliette Dibie-Barthélemy
LRI (CNRS & Paris-Sud 11 University)/INRIA Saclay, 4 rue Jacques Monod, bât. G, F-91893, Orsay Cedex, France
Rania Khefifi & Fatiha Saïs

Editor information

Editors and Affiliations

Institut de Recherche en Informatique de Toulouse (IRIT), Paul Sabatier University, 118, route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Brigham Young University, 784 TNRB, 84602, Provo, UT, USA
Stephen W. Liddle
Software Competence Center Hagenberg and Johannes Kepler University Linz, Softwarepark 21, 4232, Hagenberg, Austria
Klaus-Dieter Schewe
School of Information Technology and Electrical Engineering, University of Queensland, 4072, Brisbane, QLD, Australia
Xiaofang Zhou

About this paper

Cite this paper

Buche, P., Dibie-Barthélemy, J., Khefifi, R., Saïs, F. (2011). An Ontology-Based Method for Duplicate Detection in Web Data Tables. In: Hameurlain, A., Liddle, S.W., Schewe, KD., Zhou, X. (eds) Database and Expert Systems Applications. DEXA 2011. Lecture Notes in Computer Science, vol 6860. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23088-2_38

DOI: https://doi.org/10.1007/978-3-642-23088-2_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23087-5
Online ISBN: 978-3-642-23088-2
eBook Packages: Computer Science, Computer Science (R0)
