Uncertain Groupings: Probabilistic Combination of Grouping Data

  • Conference paper
Database and Expert Systems Applications (Globe 2015, DEXA 2015)

Abstract

Probabilistic approaches for data integration have much potential [7]. We view data integration as an iterative process where data understanding gradually increases as the data scientist continuously refines his view on how to deal with learned intricacies like data conflicts. This paper presents a probabilistic approach for integrating data on groupings. We focus on a bio-informatics use case concerning homology. A bio-informatician has a large number of homology data sources to choose from. To enable querying combined knowledge contained in these sources, they need to be integrated. We validate our approach by integrating three real-world biological databases on homology in three iterations.

Notes

  1. This second condition ‘not equal’ is theoretically not necessary (see Sect. 2.4).

  2. Actually, this is a simplification as both can be incorrect (see Sect. 4).

  3. \(\mathbb {P}\) denotes a power set.

References

  1. Altenhoff, A., Dessimoz, C.: Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput. Biol. 5, e1000262 (2009)

  2. Antova, L., Koch, C., Olteanu, D.: \({10^{(10^{6})}}\) worlds and beyond: efficient representation and processing of incomplete information. VLDB J. 18(5), 1021–1040 (2009)

  3. Koonin, E.: Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39, 309–338 (2005)

  4. Kuzniar, A., Lin, K., He, Y., Nijveen, H., Pongor, S., Leunissen, J.A.M.: ProGMap: an integrated annotation resource for protein orthology. Nucleic Acids Res. 37(suppl. 2), W428–W434 (2009)

  5. Kuzniar, A., van Ham, R., Pongor, S., Leunissen, J.: The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24, 539–551 (2008)

  6. Louie, B., Detwiler, L., Dalvi, N., Shaker, R., Tarczy-Hornoch, P., Suciu, D.: Incorporating uncertainty metrics into a general-purpose data integration system. In: 19th International Conference on Scientific and Statistical Database Management. SSDBM 2007, p. 19 (2007)

  7. Magnani, M., Montesi, D.: A survey on uncertainty management in data integration. J. Data Inf. Qual. 2(1), 5:1–5:33 (2010)

  8. NCBI Resource Coordinators: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41(D1), D8–D20 (2013)

  9. Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., Doerks, T., et al.: eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 40, D284–D289 (2011)

  10. van Keulen, M.: Managing uncertainty: the road towards better data interoperability. IT - Inf. Technol. 54(3), 138–146 (2012)

  11. Wanders, B., van Keulen, M., van der Vet, P.E.: Uncertain groupings: probabilistic combination of grouping data. Technical report TR-CTIT-14-12, Centre for Telematics and Information Technology, University of Twente, Enschede (2014)

  12. Widom, J.: Trio: A system for integrated management of data, accuracy, and lineage. Technical report 2004–40, Stanford InfoLab (2004)

  13. Wu, C.H., Nikolskaya, A., Huang, H., Yeh, L.-S.L., Natale, D.A., Vinayaka, C.R., Hu, Z.-Z., Mazumder, R., Kumar, S., Kourtesis, P., Ledley, R.S., Suzek, B.E., Arminski, L., Chen, Y., Zhang, J., Cardenas, J.L., Chung, S., Castro-Alvear, J., Dinkov, G., Barker, W.C.: PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res. 32(suppl. 1), D112–D114 (2004)


Acknowledgements

We would like to thank the late Tjeerd Boerman for his work on the use case and his initial concept of groupings. We would also like to thank Arnold Kuzniar for his insights and feedback on our use of biological databases and Ivor Wanders for his reviewing and editing assistance.

Corresponding author

Correspondence to Brend Wanders.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wanders, B., van Keulen, M., van der Vet, P. (2015). Uncertain Groupings: Probabilistic Combination of Grouping Data. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe 2015, DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_17

  • DOI: https://doi.org/10.1007/978-3-319-22849-5_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22848-8

  • Online ISBN: 978-3-319-22849-5

  • eBook Packages: Computer Science, Computer Science (R0)
