An Entity Resolution Framework for Deduplicating Proteins

Lochovsky, Lucas; Topaloglou, Thodoros

doi:10.1007/978-3-540-69828-9_9

Lucas Lochovsky¹ &
Thodoros Topaloglou¹

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 5109))

Included in the following conference series:

International Workshop on Data Integration in the Life Sciences

858 Accesses

Abstract

An important prerequisite to successfully integrating protein data is detecting duplicate records spread across different databases. In this paper, we describe a new framework for protein entity resolution, called PERF, which deduplicates protein mentions using a wide range of protein attributes. A mention refers to any recorded information about a protein, whether it is derived from a database, a high-throughput study, or literature text mining, among others. PERF can be easily extended to deduplicate protein-protein interactions (PPIs) as well. This framework translates mentions into instances of a reference schema to facilitate mention comparisons. PERF also uses “virtual attribute dependencies” to “enhance” mentions with additional attribute values. PERF computes a likelihood measure based upon the textual value similarity of mention attributes. A prototype implementation of the framework was tested, and these tests indicate that PERF can clearly separate duplicate mentions from non-duplicate mentions.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 69.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., Apweiler, R.: The International Protein Index: An integrated database for proteomics experiments. Proteomics 4, 1985–1988 (2004)
Article Google Scholar
Birkland, A., Yona, G.: BIOZON: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics 7(70) (2006)
Google Scholar
Berg, J.M., Tymoczko, J.L., Stryer, L.: Biochemistry, 5th edn. W.H. Freeman, New York (2006)
Google Scholar
Prieto, C., Rivas, J.D.L.: APID: Agile Protein Interaction DataAnalyzer. Nucleic Acids Research 34, W298–W302 (2006)
Google Scholar
National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov
Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T., Down, T., Durbin, R., Fernandez-Suarez, X.M., Gilbert, J., Hammond, M., Herrero, J., Hotz, H., Howe, K., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Kokocinsci, F., London, D., Longden, I., McVicker, G., Melsopp, C., Meidl, P., Potter, S., Proctor, G., Rae, M., Rios, D., Schuster, M., Searle, S., Severin, J., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Trevanion, S., Ureta-Vidal, A., Vogel, J., White, S., Woodwark, C., Birney, E.: Ensembl 2005. Nucleic Acids Research 33(Database issue), D447–D453 (2005)
Google Scholar
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Research 35, 193–197 (2007)
Google Scholar
Bader, G.D., Betel, D., Hogue, C.V.W.: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Research 31(1), 248–250 (2003)
Article Google Scholar
Bader, G.D., Donaldson, I., Wolting, C., Ouellette, B.F., Pawson, T., Hogue, C.W.: BINDThe Biomolecular Interaction Network Database. Nucleic Acids Research 29(1), 242–245 (2001)
Article Google Scholar
Bader, G.D., Hogue, C.V.W.: BINDa data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16(5), 465–477 (2000)
Article Google Scholar
Peri, S., Navarro, J.D., Amanchy, R., Kristiansen, T.Z., Jonnalagadda, C.K., Surendranath, V., Niranjan, V., Muthusamy, B., Gandhi, T.K., Gronborg, M., Ibarrola, N., Deshpande, N., Shanker, K., Shivashankar, H.N., Rashmi, B.P., Ramya, M.A., Zhao, Z., Chandrika, K.N., Padma, N., Harsha, H.C., Yatish, A.J., Kavitha, M.P., Menezes, M., Choudhury, D.R., Suresh, S., Ghosh, N., Saravana, R., Chandran, S., Krishna, S., Joy, M., Anand, S.K., Madavan, V., Joseph, A., Wong, G.W., Schiemann, W.P., Constantinescu, S.N., Huang, L., Khosravi-Far, R., Steen, H., Tewari, M., Ghaffari, S., Blobe, G.C., Dang, C.V., Garcia, J.G., Pevsner, J., Jensen, O.N., Roepstorff, P., Deshpande, K.S., Chinnaiyan, A.M., Hamosh, A., Chakravarti, A., Pandey, A.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research 13, 2363–2371 (2003)
Article Google Scholar
Mishra, G., Suresh, M., Kumaran, K., Kannabiran, N., Suresh, S., Bala, P., Shivkumar, K., Anuradha, N., Reddy, R., Raghavan, T.M., Menon, S., Hanumanthu, G., Gupta, M., Upendran, S., Gupta, S., Mahesh, M., Jacob, B., Matthew, P., Chatterjee, P., Arun, K.S., Sharma, S., Chandrika, K.N., Deshpande, N., Palvankar, K., Raghavnath, R., Krishnakanth, K., Karathia, H., Rekha, B., Rashmi, N.S., Vishnupriya, G., Kumar, H.G.M., Nagini, M., Kumar, G.S.S., Jose, R., Deepthi, P., Mohan, S.S., Gandhi, T.K.B., Harsha, H.C., Deshpande, K.S., Sarker, M., Prasad, T.S.K., Pandey, A.: Human Protein Reference Database - 2006 Update. Nucleic Acids Research 34, D411–D414 (2006)
Google Scholar
Chatr-aryamontri, A., Ceol, A., Palazzi, L.M., Nardelli, G., Schneider, M.V., Castagnoli, L., Cesareni, G.: MINT: the Molecular INTeraction database. Nucleic Acids Research 35(Database issue), D572–D574 (2007)
Google Scholar
Munich Information Center for Protein Sequences (MIPS), http://mips.gsf.de
Kerrien, S., Alam-Faruque, Y., Aranda, B., Bancarz, I., Bridge, A., Derow, C., Dimmer, E., Feuermann, M., Friedrichsen, A., Huntley, R., Kohler, C., Khadake, J., Leroy, C., Liban, A., Lieftink, C., Montecchi-Palazzi, L., Orchard, S., Risse, J., Robbe, K., Roechert, B., Thorneycroft, D., Zhang, Y., Apweiler, R., Hermjakob, H.: IntAct Open Source Resource for Molecular Interaction Data. Nucleic Acids Research 35(Database issue), D561–D565 (2007)
Google Scholar
Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., Margalit, H., Armstrong, J., Bairoch, A., Cesareni, G., Sherman, D., Apweiler, R.: IntAct: an open source molecular interaction database. Nucleic Acids Research 32(Database issue), D452–D455 (2004)
Google Scholar
Salwinski, L., Miller, C.S., Smith, A.J., Pettit, F.K., Bowie, J.U., Eisenberg, D.: The Database of Interacting Proteins: 2004 update. NAR 32(Database issue), 449–451 (2004)
Article Google Scholar
Apweiler, R., Bairoch, A., Wu, C.H.: Protein sequence databases. Current Opinion in Chemical Biology 8, 76–80 (2004)
Article Google Scholar
INSDC: International Nucleotide Sequence Database Collaboration, http://www.insdc.org
Mrowka, R., Patzak, A., Herzel, H.: Is There a Bias in Proteome Research? Genome Research 11, 1971–1973 (2001)
Article Google Scholar
The Cancer Cell Map, http://www.cellmap.org
Lochovsky, L.: An Entity Resolution Framework for Deduplicating Proteins. MSc thesis. University of Toronto (2008)
Google Scholar
Lee, M.L., Ling, T.W., Low, W.L.: Designing Functional Dependencies for XML. In: Jensen, C.S., Jeffery, K.G., Pokorný, J., Šaltenis, S., Bertino, E., Böhm, K., Jarke, M. (eds.) EDBT 2002. LNCS, vol. 2287, pp. 124–141. Springer, Heidelberg (2002)
Chapter Google Scholar
Tatusova, T.A., Madden, T.L.: Blast 2 sequences - a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett. 174, 247–250 (1999)
Article Google Scholar
Damerau, F.J.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)
Article Google Scholar
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707 (1966)
MathSciNet Google Scholar
Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufman, Burlington (2001)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Toronto,
Lucas Lochovsky & Thodoros Topaloglou

Authors

Lucas Lochovsky
View author publications
You can also search for this author in PubMed Google Scholar
Thodoros Topaloglou
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Amos Bairoch Sarah Cohen-Boulakia Christine Froidevaux

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Lochovsky, L., Topaloglou, T. (2008). An Entity Resolution Framework for Deduplicating Proteins. In: Bairoch, A., Cohen-Boulakia, S., Froidevaux, C. (eds) Data Integration in the Life Sciences. DILS 2008. Lecture Notes in Computer Science(), vol 5109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69828-9_9

Download citation

DOI: https://doi.org/10.1007/978-3-540-69828-9_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69827-2
Online ISBN: 978-3-540-69828-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics