Abstract
Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and the variability of the generated data cannot be configured sufficiently.
In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we show how duplicate semantics can be defined for the domain of road traffic management. A discussion of lessons learned concludes the paper.
This work has been funded by the Austrian Federal Ministry of Transport, Innovation and Technology (BMVIT) under grant FIT-IT 819577.
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger, W., Baumgartner, N. (2011). SemGen—Towards a Semantic Data Generator for Benchmarking Duplicate Detectors. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20244-5_47
DOI: https://doi.org/10.1007/978-3-642-20244-5_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20243-8
Online ISBN: 978-3-642-20244-5
eBook Packages: Computer Science, Computer Science (R0)