Saturation Tests in Application to Validation of Opinion Corpora: A Tool for Corpora Processing

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10930)

Abstract

Opinion processing has recently gained much interest among computational linguists, public relations experts, marketing companies, and politicians. Studies of the natural language expression of opinions, desires, emotions, and related phenomena require appropriate tools and methodologies. We propose tools for collecting empirical data in the form of a corpus, limiting our research field to customers’ written opinions about widely used online booking services in the area of hotel reservations (via Booking.com). In this paper, we present the corpus acquisition procedure and our data acquisition tool, and discuss our decisions about the selection of the source data. We also present some limitations of our proposal and propose a validation methodology for the resulting corpora.

Notes

  1. Only a few opinion corpora exist. One of the best known is the MPQA Opinion Corpus of English texts (University of Pittsburgh, PA, USA), http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/ [3]. See also the five-billion-word corpus of Japanese blogs annotated for affective features [4].

  2. Booking.com guests’ comments are not copyright-protected elements of the content but merely publicly presented records of opinions.

  3. Such as political, religious, or custom-related opinions.

  4. Information about Booking.com presented in this paper was collected in November 2015.

  5. For a more precise idea of the nature of these limitations, the reader can consult the Booking.com Guest Review Guidelines. To find them, open Booking.com and select any hotel, click Our guests’ experiences on the bar at the top of the page, and then click read more (last checked on July 30, 2017).

  6. OCAS was designed and implemented by a team composed of visiting Erasmus students of computer science (Süleyman Menken, Emre Çelikörs, and Veysi Ozan Dağlayan from Turkey and Arcaeli Martinez and Adrian Barreiro Vilalustre from Spain) and Polish students of linguistics (Marta Witkowska and Urszula Morzyk), under the supervision of Zygmunt Vetulani (AMU).

  7. In fact, OCASSC may easily be generalized to a system that generates subcorpora of a desired size for various XML formats.

  8. We say that a corpus is representative of a given language phenomenon, or class of phenomena, if it contains examples of all relevant aspects of this phenomenon.

  9. To measure the length of a segment, we may use various units, such as characters, words, or sentences. In this paper we use text words or opinions as the units of measurement.

  10. A data gathering procedure is considered sound with respect to a given objective if it guarantees acquisition of all the data necessary to reach this objective.

  11. The choice of measurement units will of course affect the value of the 10% ratio.

  12. The value is to be fixed depending on what one needs the corpus for.

  13. According to Muller [12], in addition to the ΔV/ΔN ratio, it is also useful to consider the number (V1) of hapax legomena observed in the initial segment of the corpus of length N. For a fixed segment length, the ratio ΔV/ΔN was shown to converge to V1/N as the corpus length N increases [11].

  14. Note, however, that the stopping criterion considered here does not apply when a huge amount of text data is needed to support statistical or neural-network-based methods of text analysis.

  15. Julia Hartwig, a famous Polish poet known for her preference for adjectives, used to say that the adjective is “the most important part of speech” [13].

  16. 6,340 hotels in the 28 most visited cities.

  17. In OCASSC this list is called a “dictionary” and is loaded by the user (see the function “use my own adjective dictionary”).
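
The notes on segment length, the ΔV/ΔN ratio, and hapax legomena describe the quantities behind a saturation test. The following Python sketch shows one possible way to compute them. It assumes that ΔV counts the distinct word forms in a new segment of ΔN tokens that did not occur earlier in the corpus, and that V1 counts the words occurring exactly once in the preceding N tokens. The function names and the threshold value are illustrative assumptions and are not taken from OCASSC.

```python
from collections import Counter
from typing import List, Optional, Tuple


def growth_ratios(tokens: List[str], segment_len: int) -> List[Tuple[int, float, float]]:
    """Return (N, dV/dN, V1/N) for every full segment after the first.

    N is the corpus length before the segment, dV the number of distinct
    word forms in the segment not seen in the first N tokens, and V1 the
    number of hapax legomena among the first N tokens.
    """
    counts: Counter = Counter()
    rows = []
    for start in range(0, len(tokens) - segment_len + 1, segment_len):
        segment = tokens[start:start + segment_len]
        if start > 0:
            new_words = sum(1 for w in set(segment) if w not in counts)
            hapaxes = sum(1 for c in counts.values() if c == 1)
            rows.append((start, new_words / segment_len, hapaxes / start))
        counts.update(segment)
    return rows


def first_saturation_point(rows: List[Tuple[int, float, float]],
                           threshold: float) -> Optional[int]:
    """Return the first N at which dV/dN drops below the threshold, else None.

    The threshold is a hypothetical parameter; its value depends on
    what the corpus is needed for.
    """
    for n, ratio, _ in rows:
        if ratio < threshold:
            return n
    return None


# Toy corpus of 12 word tokens, measured in segments of 4 tokens.
tokens = "a b c d a b e f a b c d".split()
rows = growth_ratios(tokens, segment_len=4)
for n, dv_dn, v1_n in rows:
    print(f"N={n}  dV/dN={dv_dn:.2f}  V1/N={v1_n:.2f}")
print("saturated at N =", first_saturation_point(rows, threshold=0.10))
```

On real data the tokens would come from the opinion texts, and the segment length could equally be measured in opinions rather than words, which changes the resulting ratios.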

References

  1. Collins English Dictionary—Complete & Unabridged 2012 Digital Edition; © William Collins Sons & Co. Ltd. 1979, 1986 © HarperCollins Publishers (1998, 2000, 2003, 2005, 2006, 2007, 2009, 2012)

  2. Charaudeau, P., Maingueneau, D.: Dictionnaire d’Analyse du Discours. Seuil, Paris (2002)

  3. Stoyanov, V., Cardie, C., Litman, D., Wiebe, J.: Evaluating an opinion annotation scheme using a new multi-perspective question and answer corpus. In: Shanahan, J.G., Qu, Y., Wiebe, J. (eds.) Computing Attitude and Affect in Text: Theory and Applications. The Information Retrieval Series, vol. 20, pp. 77–91. Springer, Dordrecht (2006)

  4. Ptaszynski, M., Rzepka, R., Araki, K., Momouchi, Y.: Automatically annotating a five-billion-word corpus of Japanese blogs for affect and sentiment analysis. In: Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, Jeju, Republic of Korea, pp. 89–98. Association for Computational Linguistics, Stroudsburg (2012)

  5. Esuli, A., Sebastiani, F.: SentiWordNet: a publicly available lexical resource for opinion mining. In: Proceedings of the 5th Conference on Language Resources and Evaluation, LREC 2006, pp. 417–422. European Language Resources Association, Genoa (2006)

  6. Vetulani, Z., Vetulani, G., Kochanowski, B.: Recent advances in development of a lexicon-grammar of Polish: PolNet 3.0. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, pp. 2851–2854. European Language Resources Association, Paris (2016)

  7. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17–23 May 2010, Valletta, Malta, pp. 1320–1326. European Language Resources Association, Genoa (2010)

  8. Read, J.: Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In: Knight, K., Ng, H.T., Oflazer, K. (eds.) 43rd Annual Meeting of the Association for Computational Linguistics 2005, Proceedings of the Conference, University of Michigan. Association for Computational Linguistics, New Brunswick (2005)

  9. McEnery, T., Hardie, A.: Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, Cambridge (2012)

  10. Kittredge, R.: Semantic processing of texts in restricted sublanguage. Comput. Math. Appl. 9(1), 45–58 (1983)

  11. Vetulani, Z.: Linguistic problems in the theory of man-machine communication in natural language. Universitätsverlag Dr. N. Brockmeyer, Bochum (1989)

  12. Muller, Ch.: Peut-on estimer l’étendue d’un lexique? Cah. Lexicol. 27, 3–29 (1975)

  13. Legieżyńska, A.: Julia Hartwig. Wdzięczność. Wydawnictwo Uniwersytetu Łódzkiego, Łódź (2017) (in Polish)

Author information

Corresponding author

Correspondence to Zygmunt Vetulani.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Vetulani, Z., Witkowska, M., Menken, S., Canbolat, U. (2018). Saturation Tests in Application to Validation of Opinion Corpora: A Tool for Corpora Processing. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science, vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_27

  • DOI: https://doi.org/10.1007/978-3-319-93782-3_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93781-6

  • Online ISBN: 978-3-319-93782-3

  • eBook Packages: Computer Science, Computer Science (R0)
