“Same, Same but Different” A Survey on Duplicate Detection Methods for Situation Awareness

  • Conference paper
On the Move to Meaningful Internet Systems: OTM 2009 (OTM 2009)

Abstract

Systems supporting situation awareness typically deal with a vast stream of information, provided by multiple sources, about a large number of real-world objects anchored in time and space. These sources are often characterized by frequent updates, heterogeneous formats and, most crucially, identical, incomplete and often even contradictory information. Duplicate detection methods are therefore of paramount importance, since they make it possible to determine whether pieces of information that have, e.g., different origins or different observation times concern one and the same real-world object. Many such duplicate detection methods have been proposed in the literature, each having different origins, pursuing different goals and often, by nature, being heavily domain-specific. The unique characteristics of situation awareness and their implications for the applicability of these methods, however, have not been the focus so far. This paper examines existing duplicate detection methods that appear suitable for situation awareness and identifies their strengths and shortcomings. As a prerequisite, an evaluation framework is proposed, based on a motivating case study in the domain of road traffic management, that categorizes the major requirements that situation awareness places on duplicate detection methods.

This work has been funded by the Austrian Federal Ministry of Transport, Innovation and Technology (BMVIT) under grant FIT-IT 819577.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baumgartner, N., Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger, W. (2009). “Same, Same but Different” A Survey on Duplicate Detection Methods for Situation Awareness. In: Meersman, R., Dillon, T., Herrero, P. (eds) On the Move to Meaningful Internet Systems: OTM 2009. OTM 2009. Lecture Notes in Computer Science, vol 5871. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-05151-7_22

  • DOI: https://doi.org/10.1007/978-3-642-05151-7_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-05150-0

  • Online ISBN: 978-3-642-05151-7

  • eBook Packages: Computer Science, Computer Science (R0)