An Entity Resolution Framework for Deduplicating Proteins

  • Conference paper
Data Integration in the Life Sciences (DILS 2008)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5109)

Abstract

An important prerequisite to successfully integrating protein data is detecting duplicate records spread across different databases. In this paper, we describe a new framework for protein entity resolution, called PERF, which deduplicates protein mentions using a wide range of protein attributes. A mention is any recorded information about a protein, whether derived from a database, a high-throughput study, literature text mining, or another source. PERF can also be easily extended to deduplicate protein-protein interactions (PPIs). The framework translates mentions into instances of a reference schema to facilitate mention comparisons. PERF also uses “virtual attribute dependencies” to “enhance” mentions with additional attribute values. PERF computes a likelihood measure based upon the textual value similarity of mention attributes. Tests of a prototype implementation indicate that PERF can clearly separate duplicate mentions from non-duplicate mentions.
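The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not PERF's implementation: the abstract does not give the reference schema, the similarity function, or the way attribute similarities are combined. The sketch assumes a three-attribute schema (name, organism, sequence), a normalized Levenshtein edit distance as the textual similarity, and a weighted average as the score; the function names, the weights, and the sample records are all hypothetical. Enhancement through virtual attribute dependencies is left out because the abstract does not specify how it works.

```python
from typing import Dict, Optional

# A mention translated into the (hypothetical) reference schema.
Mention = Dict[str, Optional[str]]

# Assumed reference schema attributes; the paper's actual schema is not given here.
REFERENCE_ATTRIBUTES = ("name", "organism", "sequence")


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Textual similarity in [0, 1]; 1.0 means equal after normalization."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def to_reference(record: Dict[str, str], mapping: Dict[str, str]) -> Mention:
    """Translate a source record into the reference schema.

    `mapping` maps each reference attribute to the source's field name;
    attributes with no mapped field are left as None.
    """
    return {attr: record.get(mapping.get(attr, "")) for attr in REFERENCE_ATTRIBUTES}


def duplicate_score(m1: Mention, m2: Mention, weights: Dict[str, float]) -> float:
    """Weighted average similarity over attributes present in both mentions."""
    total = 0.0
    weight_sum = 0.0
    for attr, weight in weights.items():
        v1, v2 = m1.get(attr), m2.get(attr)
        if not v1 or not v2:
            continue  # a missing value gives no evidence either way
        total += weight * similarity(v1, v2)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Illustrative records from two made-up sources with different field names.
source_a = {"protein_name": "Cellular tumor antigen p53", "species": "Homo sapiens"}
source_b = {"title": "cellular tumour antigen p53", "taxon": "Homo sapiens"}

mention_a = to_reference(source_a, {"name": "protein_name", "organism": "species"})
mention_b = to_reference(source_b, {"name": "title", "organism": "taxon"})

WEIGHTS = {"name": 0.5, "organism": 0.2, "sequence": 0.3}  # arbitrary example weights
print(round(duplicate_score(mention_a, mention_b, WEIGHTS), 3))
```

In this sketch a pair would be flagged as a duplicate when its score exceeds a chosen threshold. Skipping attributes that are missing on either side keeps sparse mentions comparable, but it also means a score resting on a single shared attribute is weak evidence.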

Editor information

Amos Bairoch, Sarah Cohen-Boulakia, Christine Froidevaux

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lochovsky, L., Topaloglou, T. (2008). An Entity Resolution Framework for Deduplicating Proteins. In: Bairoch, A., Cohen-Boulakia, S., Froidevaux, C. (eds) Data Integration in the Life Sciences. DILS 2008. Lecture Notes in Computer Science, vol 5109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69828-9_9

  • DOI: https://doi.org/10.1007/978-3-540-69828-9_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69827-2

  • Online ISBN: 978-3-540-69828-9

  • eBook Packages: Computer Science, Computer Science (R0)
