Finding Transport Proteins in a General Protein Database

Das, Sanmay; Saier, Milton H.; Elkan, Charles

doi:10.1007/978-3-540-74976-9_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4702)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

The number of specialized databases in molecular biology is growing fast, as is the availability of molecular data. These trends necessitate the development of automatic methods for finding relevant information to include in specialized databases. We show how to use a comprehensive database (SwissProt) as a source of new entries for a specialized database (TCDB, the Transport Classification Database). Even carefully constructed keyword-based queries perform poorly in determining which SwissProt records are relevant to TCDB; we show that a machine learning approach performs well. We describe a maximum-entropy classifier, trained on SwissProt records, that achieves high precision and recall in cross-validation experiments. This classifier has been deployed as part of a pipeline for updating TCDB that allows a human expert to examine only about 2% of SwissProt records for potential inclusion in TCDB. The methods we describe are flexible and general, so they can be applied easily to other specialized databases.
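
As a rough illustration of the kind of classifier the abstract describes, the following is a minimal sketch of a two-class maximum-entropy model (equivalent to logistic regression) over bag-of-words features, together with precision and recall. It is not the authors' implementation: the toy records, the labels, the hyperparameters, and all function names (`featurize`, `train`, `flag_for_review`, and so on) are assumptions made up for this example, and the evaluation is on the training records for brevity rather than by cross-validation.

```python
import math
from collections import Counter

def featurize(text):
    """Bag-of-words counts for one record, plus a constant bias feature."""
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    feats = Counter(cleaned.split())
    feats["__bias__"] = 1
    return feats

def predict_proba(weights, feats):
    """P(relevant | record) under a two-class maximum-entropy model."""
    score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    score = max(-30.0, min(30.0, score))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-score))

def train(records, labels, epochs=200, lr=0.5, l2=0.01):
    """Batch gradient ascent on the L2-penalized log-likelihood."""
    data = [featurize(r) for r in records]
    weights = {}
    for _ in range(epochs):
        grad = Counter()
        for feats, y in zip(data, labels):
            err = y - predict_proba(weights, feats)
            for f, v in feats.items():
                grad[f] += err * v
        for f, g in grad.items():
            w = weights.get(f, 0.0)
            weights[f] = w + lr * (g / len(data) - l2 * w)
    return weights

def precision_recall(predicted, actual):
    """Precision and recall of binary predictions against binary labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def flag_for_review(weights, records, threshold=0.5):
    """Return the records whose predicted relevance exceeds the threshold."""
    return [r for r in records
            if predict_proba(weights, featurize(r)) > threshold]

# Hypothetical annotation snippets; these are NOT real SwissProt records.
records = [
    "sodium proton antiporter membrane transport",
    "ABC transporter permease membrane protein",
    "potassium channel ion transport membrane",
    "sugar transporter major facilitator superfamily",
    "DNA polymerase replication nucleus",
    "ribosomal protein translation cytoplasm",
    "kinase phosphorylation signal transduction",
    "transcription factor DNA binding nucleus",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = relevant to the specialized database

weights = train(records, labels)
predictions = [predict_proba(weights, featurize(r)) > 0.5 for r in records]
precision, recall = precision_recall(predictions, labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(flag_for_review(weights, records))
```

In a curation pipeline of the sort the abstract mentions, a function like `flag_for_review` would be run over the full general-purpose database, and only the flagged records would be passed to a human expert. Raising the threshold trades recall for a smaller review workload.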

Author information

Authors and Affiliations

University of California, San Diego, La Jolla, CA 92093, USA
Sanmay Das, Milton H. Saier Jr. & Charles Elkan

Editor information

Joost N. Kok, Jacek Koronacki, Ramon Lopez de Mantaras, Stan Matwin, Dunja Mladenič, Andrzej Skowron

About this paper

Cite this paper

Das, S., Saier, M.H., Elkan, C. (2007). Finding Transport Proteins in a General Protein Database. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds) Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science, vol 4702. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74976-9_9

DOI: https://doi.org/10.1007/978-3-540-74976-9_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74975-2
Online ISBN: 978-3-540-74976-9
eBook Packages: Computer Science, Computer Science (R0)
