SemGen—Towards a Semantic Data Generator for Benchmarking Duplicate Detectors

Gottesheim, Wolfgang; Mitsch, Stefan; Retschitzegger, Werner; Schwinger, Wieland; Baumgartner, Norbert

doi:10.1007/978-3-642-20244-5_47

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6637)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. Duplicate semantics are therefore represented only implicitly, and the variability of the generated data is insufficiently configurable.

In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we show how duplicate semantics can be defined for the domain of road traffic management. A discussion of lessons learned concludes the paper.
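
The two-step idea can be illustrated with a minimal sketch. It is not SemGen's implementation: the relation names (borrowed from Allen's interval algebra), the function names, and the integer time units are all assumptions made for illustration. A qualitative relation between a duplicate and its original is chosen first, and only then are concrete interval bounds drawn that satisfy it.

```python
import random

# Hypothetical qualitative relations of a duplicate's time interval
# relative to the original's interval (an assumed, illustrative subset).
RELATIONS = ("equals", "overlaps", "during", "starts")


def pick_relation(rng):
    """Step 1 (qualitative): choose how the duplicate relates to the original."""
    return rng.choice(RELATIONS)


def realize(relation, start, end, rng):
    """Step 2 (quantitative): draw integer bounds that satisfy the relation."""
    length = end - start
    if length < 3:
        raise ValueError("original interval must span at least 3 time units")
    if relation == "equals":
        return start, end
    if relation == "overlaps":
        # Duplicate begins before the original and ends inside it.
        return start - rng.randint(1, length), rng.randint(start + 1, end - 1)
    if relation == "during":
        # Duplicate lies strictly inside the original.
        new_start = rng.randint(start + 1, end - 2)
        return new_start, rng.randint(new_start + 1, end - 1)
    if relation == "starts":
        # Same start, but the duplicate ends earlier.
        return start, rng.randint(start + 1, end - 1)
    raise ValueError(f"unknown relation: {relation}")


def generate_duplicate(start, end, rng):
    """Return the chosen relation together with the generated interval."""
    relation = pick_relation(rng)
    return relation, realize(relation, start, end, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    # Original object: an event reported from minute 100 to minute 160.
    print(generate_duplicate(100, 160, rng))
```

Because the relation is returned with the generated values, a benchmark built this way would know for every duplicate pair which semantics it was meant to exhibit, and the mix of relations could be configured directly.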

This work has been funded by the Austrian Federal Ministry of Transport, Innovation and Technology (BMVIT) under grant FIT-IT 819577.

Author information

Authors and Affiliations

Johannes Kepler University Linz, Altenbergerstr. 69, 4040, Linz, Austria
Wolfgang Gottesheim, Stefan Mitsch, Werner Retschitzegger & Wieland Schwinger
team Communication Technology Mgt. Ltd., Goethegasse 3, 1010, Vienna, Austria
Norbert Baumgartner

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, KLN, Hong Kong, China
Jianliang Xu
School of Information Science and Engineering, Northeastern University, Shenyang, 110004, Liaoning, China
Ge Yu
School of Computer Science, Fudan University, 220 Handan Road, 200433, Shanghai, China
Shuigeng Zhou
Institute for Computer Science and Business Information Systems (ICB), University of Duisburg-Essen, Schützenbahn 70, 45117, Essen, Germany
Rainer Unland

About this paper

Cite this paper

Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger, W., Baumgartner, N. (2011). SemGen—Towards a Semantic Data Generator for Benchmarking Duplicate Detectors. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20244-5_47

DOI: https://doi.org/10.1007/978-3-642-20244-5_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20243-8
Online ISBN: 978-3-642-20244-5
eBook Packages: Computer Science, Computer Science (R0)
