Representativeness in Corpus Design
The present paper addresses a number of issues related to achieving ‘representativeness’ in linguistic corpus design, including what it means to ‘represent’ a language, the definition of the target population, stratified versus proportional sampling of a language, sampling within texts, and the required sample size (number of texts) of a corpus. The paper distinguishes among various ways that linguistic features can be distributed within and across texts; it analyzes the distributions of several particular features, and it discusses the implications of these distributions for corpus design.
The paper argues that theoretical research should come first in corpus design, both to identify the situational parameters that distinguish among texts in a speech community and to identify the types of linguistic features that will be analyzed in the corpus. These theoretical considerations should be complemented by empirical investigations of linguistic variation in a pilot corpus of texts, as a basis for specific sampling decisions. The actual construction of a corpus would then proceed in cycles: an original design based on theoretical and pilot-study analyses, followed by collection of texts, followed by further empirical investigations of linguistic variation and revision of the design.
Keywords: Relative Clause, Require Sample Size, Tolerable Error, Word Type, Text Type