
Introducing the Notion of ‘Contrast’ Features for Language Technology

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1062)

Abstract

In this paper, we explore whether there exist ‘contrast’ features that help recognize whether a text variety is a genre or a domain. We carry out our experiments on the text varieties included in the Swedish national corpus, the Stockholm-Umeå Corpus (SUC), and build several text classification models based on text complexity features, grammatical features, bag-of-words features and word embeddings. Results show that text complexity features and grammatical features systematically perform better on genres than on domains. This indicates that these features can be used as ‘contrast’ features: when the nature of a text category is in doubt, they help bring it to light.


Notes

  1. See https://spraakbanken.gu.se/eng/resources/corpus.

  2. See http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html.

  3. See https://deeplearning.cms.waikato.ac.nz/.

  4. Weighted Averaged F-Measure is the sum of the per-class F-measures, each weighted by the proportion of instances that carry that class label. It is a more reliable metric than the harmonic F-measure (F1).
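The definition in note 4 can be illustrated with a minimal sketch in plain Python. It assumes the weight of each class is its share of the gold-standard instances; the function names and the toy labels (`press`, `fiction`) are hypothetical and serve only as an illustration, not as the implementation used in the paper.

```python
from collections import Counter


def per_class_f1(gold, predicted, label):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f_measure(gold, predicted):
    """Sum of per-class F-measures, each weighted by its class share."""
    counts = Counter(gold)  # instances per gold class label
    total = len(gold)
    return sum(
        per_class_f1(gold, predicted, label) * n / total
        for label, n in counts.items()
    )


# Toy data (hypothetical labels): four 'press' texts and two 'fiction' texts.
gold = ["press", "press", "press", "press", "fiction", "fiction"]
predicted = ["press", "press", "press", "fiction", "fiction", "press"]
score = weighted_f_measure(gold, predicted)
```

In this toy case the `press` class has an F-measure of 0.75 and `fiction` 0.5, so the weighted average is 0.75 × 4/6 + 0.5 × 2/6 = 2/3. The larger class contributes more to the final score, which is the effect the weighting is meant to have.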


Acknowledgements

This research was supported by E-care@home, a “SIDUS – Strong Distributed Research Environment” project, funded by the Swedish Knowledge Foundation [KK-stiftelsen, reference no. (Diarienr) 20140217]. Project website: http://ecareathome.se/

Author information

Correspondence to Marina Santini.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Santini, M., Danielsson, B., Jönsson, A. (2019). Introducing the Notion of ‘Contrast’ Features for Language Technology. In: Anderst-Kotsis, G., et al. Database and Expert Systems Applications. DEXA 2019. Communications in Computer and Information Science, vol 1062. Springer, Cham. https://doi.org/10.1007/978-3-030-27684-3_24


  • DOI: https://doi.org/10.1007/978-3-030-27684-3_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27683-6

  • Online ISBN: 978-3-030-27684-3

  • eBook Packages: Computer Science, Computer Science (R0)
