Corpora; Document repositories; Text databases
A document database is a collection of stored texts managed by a system that provides query and update facilities. Usually the database includes many documents related by their subject matter, origin, or applicability to an enterprise. The content of each document may be free text, semi-structured text including a few well-identified fields (e.g., title, author, date), or highly structured tagged text such as might be encoded using XML. Occasionally documents may also contain multimedia components.
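The three kinds of content described above, together with basic query and update facilities, can be illustrated with a minimal in-memory sketch. It is not the design of any particular system: the names `Document`, `DocumentStore`, `put`, `search_text`, `search_field`, and `search_element` are hypothetical and chosen only for illustration, and the sample documents are invented.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record: free text, optional fields, or XML markup."""
    doc_id: str
    body: str                                   # free text, or XML when is_xml is True
    fields: dict = field(default_factory=dict)  # e.g., title, author, date
    is_xml: bool = False


class DocumentStore:
    """Illustrative store offering simple update and query operations."""

    def __init__(self):
        self._docs = {}

    def put(self, doc):
        # Update facility: insert a new document or replace an existing one.
        self._docs[doc.doc_id] = doc

    @staticmethod
    def _text(doc):
        # For XML documents, search the character data only, not the tags.
        if doc.is_xml:
            return "".join(ET.fromstring(doc.body).itertext())
        return doc.body

    def search_text(self, term):
        # Free-text query: case-insensitive substring match on the content.
        term = term.lower()
        return sorted(d.doc_id for d in self._docs.values()
                      if term in self._text(d).lower())

    def search_field(self, name, value):
        # Semi-structured query: exact match on a well-identified field.
        return sorted(d.doc_id for d in self._docs.values()
                      if d.fields.get(name) == value)

    def search_element(self, path):
        # Structured query: text of elements matching an ElementTree path.
        hits = {}
        for d in self._docs.values():
            if d.is_xml:
                texts = [e.text for e in ET.fromstring(d.body).findall(path)]
                if texts:
                    hits[d.doc_id] = texts
        return hits


store = DocumentStore()
store.put(Document("memo-1", "Quarterly results were strong."))
store.put(Document("report-7", "Storage costs fell.",
                   fields={"title": "Storage review", "author": "A. Author"}))
store.put(Document(
    "manual-3",
    "<manual><title>Pump manual</title>"
    "<section><title>Safety</title><p>Wear gloves.</p></section></manual>",
    is_xml=True))

print(store.search_text("gloves"))
print(store.search_field("author", "A. Author"))
print(store.search_element(".//section/title"))
```

A production system would add persistent storage, indexing, and a richer query language; the sketch only shows how one collection can hold free, semi-structured, and tagged text side by side.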
In contrast, the term corpus (plural corpora) typically refers to a static collection of texts that have been assembled by experts to study linguistic phenomena (e.g., the Brown Corpus, created in 1964 to study American English, and the Swedish Language Bank) or to provide a rich source of text for lexicographic needs (e.g., the Dictionary of Old English Corpus, including all extant texts written in Old English in the period...