Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora

  • Conference paper
Database and Expert Systems Applications (DEXA 2018)

Abstract

Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have assessed the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures (namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness) to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
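As a rough illustration of the burstiness measure named in the abstract, the sketch below uses one common formulation: the total number of occurrences of a word divided by the number of documents that contain it. This is an assumption made for illustration; the paper may define burstiness differently. The function name `burstiness` and the toy documents are hypothetical.

```python
from collections import Counter


def burstiness(docs):
    """Map each word to (total occurrences) / (number of documents containing it).

    Assumed formulation for illustration only; `docs` is a list of token lists.
    """
    total = Counter()     # occurrences of each word across all documents
    doc_freq = Counter()  # number of documents in which each word appears
    for tokens in docs:
        counts = Counter(tokens)
        total.update(counts)
        doc_freq.update(counts.keys())
    return {word: total[word] / doc_freq[word] for word in total}


# Toy, made-up documents: "asthma" clusters in one document, "the" is spread out.
docs = [
    "the patient has asthma and asthma needs asthma care".split(),
    "the weather is nice and the sun is out".split(),
    "the doctor saw the patient".split(),
]
scores = burstiness(docs)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
```

Under this formulation, a word that recurs heavily within the few documents where it appears scores high, whereas a function word spread evenly over many documents stays close to 1.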

Notes

  1. SNOMED CT browser is available at http://browser.ihtsdotools.org/.

  2. The lists of the selected 155 SNOMED CT terms and the tokenized gold standard (165 entries) are available here: http://santini.se/eCareCorpus/home.htm.

Acknowledgement

This research was supported by E-care@home, a “SIDUS - Strong Distributed Research Environment” project funded by the Swedish Knowledge Foundation. Project website: http://ecareathome.se/.

Author information

Corresponding author

Correspondence to Marina Santini.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Santini, M., Strandqvist, W., Nyström, M., Alirezai, M., Jönsson, A. (2018). Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_17

  • DOI: https://doi.org/10.1007/978-3-319-99133-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99132-0

  • Online ISBN: 978-3-319-99133-7

  • eBook Packages: Computer Science, Computer Science (R0)
