What’s New? What’s Certain? – Scoring Search Results in the Presence of Overlapping Data Sources

  • Philipp Hussels
  • Silke Trißl
  • Ulf Leser
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4544)


Data integration projects in the life sciences often gather data on a particular subject from multiple sources. Some of these sources overlap to a certain degree. Therefore, integrated search results may be supported by one, a few, or all data sources. To reflect these differences, results should be ranked according to the number of data sources that support them. What such a ranking should look like is not obvious: results supported by only a few sources could be ranked high because the information is potentially new, or ranked low because the evidence supporting them is limited.

We present two scoring schemes to rank search results in the integrated protein annotation database Columba. We define a surprisingness score, which prefers results supported by few sources, and a confidence score, which prefers frequently encountered information. Unlike many other scoring schemes, our proposal is purely data-driven and does not require users to specify preferences among sources. Both scores take the actual overlaps between data sources into account and do not presume statistical independence. We show how our schemes have been implemented efficiently using SQL.
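To make the tension between the two rankings concrete, the following is a minimal sketch in Python. It is not the scoring scheme of the paper: the formulas are a naive count-based stand-in, and they ignore the overlap-aware weighting that the abstract describes. The data, the source names, and the functions `confidence` and `surprisingness` are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy data: which sources support each search result.
# Result and source names are illustrative, not taken from Columba.
support = {
    "result_a": {"src1", "src2", "src3"},
    "result_b": {"src1"},
    "result_c": {"src2", "src3"},
}

# All sources that appear anywhere in the toy data.
all_sources = set().union(*support.values())


def confidence(sources):
    """Naive stand-in: fraction of all sources that support the result."""
    return len(sources) / len(all_sources)


def surprisingness(sources):
    """Naive stand-in: the fewer supporting sources, the higher the score."""
    return 1.0 - confidence(sources)


# Rank the same results under both (assumed) scores, best first.
by_confidence = sorted(
    support, key=lambda r: confidence(support[r]), reverse=True
)
by_surprisingness = sorted(
    support, key=lambda r: surprisingness(support[r]), reverse=True
)

print(by_confidence)
print(by_surprisingness)
```

With these naive scores, the two orderings are exact reverses of each other. An overlap-aware scheme such as the one the abstract describes would additionally account for how much the supporting sources overlap, which this sketch does not attempt.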





Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Philipp Hussels (1)
  • Silke Trißl (1)
  • Ulf Leser (1)
  1. Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany
