Abstract
Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have assessed the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study in which we explore the effectiveness of different measures (namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness) to gauge domainhood. Our findings indicate that burstiness is the most suitable measure for singling out domain-specific words from a specialized corpus and for quantifying domainhood.
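To make three of the measures named above more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes textbook definitions that this abstract does not spell out: burstiness as the mean number of occurrences of a term in the documents that contain it, log-likelihood in its two-term frequency-profiling form, and Kullback–Leibler divergence over add-one smoothed unigram distributions. The function names, the smoothing parameter `alpha` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def burstiness(term, documents):
    """Mean occurrences of `term` per document containing it (cf / df).

    Assumed definition; returns 0.0 if the term never occurs.
    """
    counts = [doc.count(term) for doc in documents]
    df = sum(1 for c in counts if c > 0)
    return sum(counts) / df if df else 0.0


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood for a word seen `a` times in a corpus of
    `c` tokens and `b` times in a corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in corpus 1
    e2 = d * (a + b) / (c + d)  # expected count in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) over the joint vocabulary, with additive smoothing
    (alpha is an illustrative choice, not taken from the paper)."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + alpha) / p_total
        q = (q_counts.get(w, 0) + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


# Toy tokenized documents (invented for illustration only).
docs = [
    ["insulin", "the", "insulin", "insulin", "dose"],
    ["the", "patient", "rests"],
    ["the", "weather", "is", "mild"],
]
domain_counts = Counter(docs[0])
general_counts = Counter(docs[1] + docs[2])

print(burstiness("insulin", docs), burstiness("the", docs))
print(log_likelihood(20, 5, 100, 100))
print(kl_divergence(domain_counts, general_counts))
```

In this toy data "insulin" and "the" have the same total frequency, but "insulin" is concentrated in a single document, which is the kind of clustering behaviour that a burstiness score is meant to expose.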
Notes
- 1. The SNOMED CT browser is available at http://browser.ihtsdotools.org/.
- 2. The lists of the selected 155 SNOMED CT terms and the tokenized gold standard (165 entries) are available here: http://santini.se/eCareCorpus/home.htm.
Acknowledgement
This research was supported by E-care@home, a “SIDUS - Strong Distributed Research Environment” project funded by the Swedish Knowledge Foundation. Project website: http://ecareathome.se/.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Santini, M., Strandqvist, W., Nyström, M., Alirezai, M., Jönsson, A. (2018). Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_17
DOI: https://doi.org/10.1007/978-3-319-99133-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99132-0
Online ISBN: 978-3-319-99133-7
eBook Packages: Computer Science, Computer Science (R0)