
Toward a Computational Theory of Data Acquisition and Truthing

  • Conference paper
Computational Learning Theory (COLT 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2111)

Abstract

The creation of a pattern classifier requires choosing or creating a model, collecting training data and verifying or “truthing” this data, and then training and testing the classifier. In practice, individual steps in this sequence must be repeated a number of times before the classifier achieves acceptable performance. The majority of the research in computational learning theory addresses the issues associated with training the classifier (learnability, convergence times, generalization bounds, etc.). While there has been modest research effort on topics such as cost-based collection of data in the context of a particular classifier model, there remain numerous unsolved problems of practical importance associated with the collection and truthing of data. Many of these can be addressed with the formal methods of computational learning theory. A number of these issues, as well as new ones — such as the identification of “hostile” contributors and their data — are brought to light by the Open Mind Initiative, where data is openly contributed over the World Wide Web by non-experts of varying reliabilities. This paper states generalizations of formal results on the relative value of labeled and unlabeled data to the realistic case where a labeler is not a foolproof oracle but is instead somewhat unreliable and error-prone. It also summarizes formal results on strategies for presenting data to labelers of known reliability in order to obtain best estimates of model parameters. It concludes with a call for a rich, powerful and practical computational theory of data acquisition and truthing, built upon the concepts and techniques developed for studying general learning systems.
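The abstract's central scenario, labelers whose reliabilities are known but imperfect, can be made concrete with a small sketch (not from the paper; the function name and reliability values are illustrative). It combines binary labels by reliability-weighted log-odds voting, the Bayes-optimal aggregation rule when labeler errors are independent:

```python
import math

def weighted_vote(labels, reliabilities):
    """Combine binary labels (+1/-1) from labelers whose reliabilities
    p_i = Pr(label is correct) are known, weighting each vote by the
    log-odds log(p_i / (1 - p_i)); with independent errors this is the
    Bayes-optimal aggregation rule."""
    score = sum(math.log(p / (1.0 - p)) * y
                for y, p in zip(labels, reliabilities))
    return 1 if score >= 0 else -1

# Hypothetical contributors: one fairly reliable, two mediocre.
reliabilities = [0.8, 0.7, 0.65]

# Two mediocre labelers agreeing can outweigh the reliable one,
# because log(0.7/0.3) + log(0.65/0.35) > log(0.8/0.2).
print(weighted_vote([1, 1, 1], reliabilities))
print(weighted_vote([-1, 1, 1], reliabilities))
```

Note that a contributor with reliability below 0.5 receives a negative weight, so a labeler identified as consistently wrong (or "hostile") has each vote automatically inverted rather than merely discounted.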



Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Stork, D.G. (2001). Toward a Computational Theory of Data Acquisition and Truthing. In: Helmbold, D., Williamson, B. (eds) Computational Learning Theory. COLT 2001. Lecture Notes in Computer Science, vol 2111. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44581-1_13


  • DOI: https://doi.org/10.1007/3-540-44581-1_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42343-0

  • Online ISBN: 978-3-540-44581-4

  • eBook Packages: Springer Book Archive
