SemGen—Towards a Semantic Data Generator for Benchmarking Duplicate Detectors

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2011)

Abstract

Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and variability cannot be configured sufficiently.

In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we show how duplicate semantics can be defined for the domain of road traffic management. A discussion of lessons learned concludes the paper.
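The two-step idea (qualitative diversification first, quantitative values second) can be illustrated with a minimal sketch. This is not SemGen's implementation: the three interval relations, the `NEIGHBORS` table, the function names, and all numeric ranges are hypothetical choices made only for illustration. The sketch assumes that an object is described by a qualitative relation to a reference time interval, that diversification means keeping that relation or moving to a neighboring one, and that concrete interval bounds are sampled only afterwards.

```python
import random

# Hypothetical neighborhood over three interval relations
# (a tiny illustrative subset, not the paper's configuration).
NEIGHBORS = {
    "before": ["meets"],
    "meets": ["before", "overlaps"],
    "overlaps": ["meets"],
}


def diversify(relation, rng):
    """Step 1 (qualitative): keep the relation or move to a neighboring one."""
    return rng.choice([relation] + NEIGHBORS[relation])


def instantiate(relation, ref, rng):
    """Step 2 (quantitative): sample an interval (start, end) that stands
    in `relation` to the reference interval `ref`."""
    ref_start, ref_end = ref
    if relation == "before":
        # Ends strictly before the reference starts.
        end = ref_start - rng.uniform(0.1, 5.0)
        start = end - rng.uniform(1.0, 5.0)
    elif relation == "meets":
        # Ends exactly where the reference starts.
        end = ref_start
        start = end - rng.uniform(1.0, 5.0)
    elif relation == "overlaps":
        # Starts before the reference and ends strictly inside it.
        end = ref_start + rng.uniform(0.1, 0.9) * (ref_end - ref_start)
        start = ref_start - rng.uniform(0.1, 5.0)
    else:
        raise ValueError(f"unsupported relation: {relation}")
    return (start, end)


def classify(interval, ref):
    """Recover the qualitative relation from concrete values (for checking)."""
    start, end = interval
    if end < ref[0]:
        return "before"
    if end == ref[0]:
        return "meets"
    if start < ref[0] < end < ref[1]:
        return "overlaps"
    return "other"


if __name__ == "__main__":
    rng = random.Random(42)
    reference = (10.0, 20.0)
    original = "meets"
    variant = diversify(original, rng)
    values = instantiate(variant, reference, rng)
    print(original, "->", variant, values)
```

Because the qualitative relation of each generated variant is chosen explicitly before any numbers are drawn, a generator built this way knows by construction how each variant relates to its original, which is the kind of explicit duplicate knowledge the abstract asks for.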

This work has been funded by the Austrian Federal Ministry of Transport, Innovation and Technology (BMVIT) under grant FIT-IT 819577.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger, W., Baumgartner, N. (2011). SemGen—Towards a Semantic Data Generator for Benchmarking Duplicate Detectors. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20244-5_47

  • DOI: https://doi.org/10.1007/978-3-642-20244-5_47

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20243-8

  • Online ISBN: 978-3-642-20244-5

  • eBook Packages: Computer Science (R0)
