Fast Approximate Duplicate Detection for 2D-NMR Spectra

  • Björn Egert
  • Steffen Neumann
  • Alexander Hinneburg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4544)

Abstract

Two-dimensional nuclear magnetic resonance (2D-NMR) spectroscopy is a powerful analytical method for elucidating the chemical structure of molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of ¹H and ¹³C simultaneously. To curate or merge large spectral libraries, robust and fast duplicate detection is needed. We propose a definition of duplicates with the robustness properties that 2D-NMR experiments require. A major gain in runtime performance with respect to previously proposed heuristics is achieved by mapping the spectra to simple discrete objects. We propose several appropriate data transformations for this task. To compensate for slight variations in the mapped spectra, we use hash functions that follow the locality-sensitive hashing scheme and identify duplicates by hash collisions.
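
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the data transformations or the hash functions, so the grid discretization, the min-hash family, the banding parameters, and all names below (`discretize`, `minhash_signature`, `candidate_duplicates`, `h_step`, `c_step`) are assumptions chosen for illustration. Each spectrum is taken to be a list of (¹H ppm, ¹³C ppm) peak positions, mapped to a set of integer grid cells, and two spectra become duplicate candidates when any band of their min-hash signatures collides.

```python
import hashlib
from collections import defaultdict


def discretize(peaks, h_step=0.05, c_step=0.5):
    """Map (1H ppm, 13C ppm) peaks to a set of integer grid cells.

    The step sizes are hypothetical tolerances, not values from the paper.
    """
    return frozenset((int(h // h_step), int(c // c_step)) for h, c in peaks)


def _cell_hash(seed, cell):
    """Deterministic 64-bit hash of one grid cell under a given seed."""
    data = f"{seed}:{cell[0]}:{cell[1]}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(cells, num_hashes):
    """Min-hash signature: for each seed, the smallest cell hash in the set."""
    return tuple(min(_cell_hash(seed, c) for c in cells) for seed in range(num_hashes))


def candidate_duplicates(spectra, bands=6, rows=2):
    """Return pairs of spectrum names whose signatures collide in any band."""
    buckets = defaultdict(set)
    for name, peaks in spectra.items():
        cells = discretize(peaks)
        if not cells:  # a spectrum without peaks has no signature
            continue
        sig = minhash_signature(cells, bands * rows)
        for b in range(bands):
            # One bucket key per band; similar sets tend to share some band.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)

    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    spectra = {
        "a": [(7.26, 128.40), (3.71, 55.20), (1.22, 14.10)],
        "b": [(7.27, 128.45), (3.72, 55.25), (1.23, 14.15)],  # slightly shifted copy of "a"
        "c": [(2.12, 30.20), (5.53, 100.20)],                 # unrelated spectrum
    }
    print(candidate_duplicates(spectra))
```

In this sketch, more bands raise the chance that similar spectra collide, while more rows per band reduce accidental collisions between dissimilar ones. Candidate pairs would normally still be verified with an exact comparison, since peaks close to a grid boundary can fall into different cells.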

Keywords

Nuclear Magnetic Resonance; Nuclear Magnetic Resonance Spectrum; Hash Function; Nuclear Magnetic Resonance Spectroscopy; Integer Vector
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Björn Egert (1)
  • Steffen Neumann (1)
  • Alexander Hinneburg (2)
  1. Leibniz Institute of Plant Biochemistry, Department of Stress and Developmental Biology, Germany
  2. Institute of Computer Science, Martin-Luther-University of Halle-Wittenberg, Germany
