Saturation Tests in Application to Validation of Opinion Corpora: A Tool for Corpora Processing

Vetulani, Zygmunt; Witkowska, Marta; Menken, Suleyman; Canbolat, Umut

doi:10.1007/978-3-319-93782-3_27

Conference paper
First Online: 16 June 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10930)

Abstract

Opinion processing has recently gained much interest among computational linguists, public relations experts, marketing companies, and politicians. Studies of the natural language expression of opinions, desires, emotions, and related phenomena require appropriate tools and methodologies. We propose tools for the collection of empirical data in the form of a corpus, limiting our research field to customers’ written opinions about widely used on-line booking services in the area of hotel reservations (via Booking.com). In this paper, we present the corpus acquisition procedure and our data acquisition tool, and discuss our decisions about the selection of the source data. We also present some limitations of our proposal and propose a validation methodology for the resulting corpora.

Notes

1.
Only a few opinion corpora exist. One of the best known is the MPQA Opinion Corpus of English texts (University of Pittsburgh, PA, USA), http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/ [3]. See also the five-billion-word corpus of Japanese blogs annotated for affective features [4].
2.
Booking.com guests’ comments are not copyright-protected elements of the content but merely publicly presented records of opinions.
3.
Such as political, religious, or custom-related opinions.
4.
The information about Booking.com presented in this paper was collected in November 2015.
5.
To get a more precise idea of the nature of these limitations, the reader can consult the Booking.com Guest Review Guidelines. To find them, open Booking.com and select any hotel. Find and click Our guests’ experiences on the bar at the top of the page and then click read more (last checked on July 30, 2017).
6.
OCAS was designed and implemented by a team composed of visiting Erasmus students of computer science (Süleyman Menken, Emre Çelikörs, and Veysi Ozan Dağlayan from Turkey and Arcaeli Martinez and Adrian Barreiro Vilalustre from Spain) and Polish students of linguistics (Marta Witkowska and Urszula Morzyk), under the supervision of Zygmunt Vetulani (AMU).
7.
In fact, OCASSC may be easily generalized to a system allowing generation of subcorpora of desired size for various XML formats.
8.
We say that a corpus is representative of a given language phenomenon, or a class of phenomena, if it contains examples of all relevant aspects of this phenomenon.
9.
To measure the length of a segment, we may use various units, such as characters, words, or sentences. In this paper we will use text words or opinions as the measurement units.
10.
A data gathering procedure is considered sound with respect to the given objective if it guarantees acquisition of all data necessary to reach this objective.
11.
The choice of measurement units will of course affect the value of the 10% ratio.
12.
The value is to be fixed depending on what one needs the corpus for.
13.
According to Muller [12], in addition to the ΔV/ΔN ratio, it is also useful to consider the number (V1) of hapax legomena observed in the initial segment of the corpus of length N. For a fixed length of segments, the ratio ΔV/ΔN was shown to converge to V1/N with an increase in corpus length N [11].
14.
Note, however, that the stopping criterion considered here does not apply when a huge amount of text data is necessary to support statistical or neural-networks-based methods used to analyze texts.
15.
Julia Hartwig, a famous Polish poet known for her preference for adjectives, used to say that the adjective is “the most important part of speech” [13].
16.
6,340 hotels in the 28 most visited cities.
17.
In OCASSC this list is called “dictionary” and is loaded by the user (see the function “use my own adjective dictionary”).
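
The quantities mentioned in notes 9, 11, 13, and 14 can be illustrated with a minimal sketch. It assumes the corpus is already tokenized into text words and cut into segments of fixed length; for each segment it reports the vocabulary growth ratio ΔV/ΔN and the hapax ratio V1/N. The function names, the segment length, and the threshold are hypothetical choices made for illustration only; they are not taken from OCASSC, and the stopping rule shown (growth ratio at or below a user-chosen threshold) is just one plausible reading of the criterion referred to in the notes.

```python
from collections import Counter


def saturation_profile(tokens, segment_len):
    """Return a list of (N, delta_V / delta_N, V1 / N), one entry per full segment.

    N is the corpus length (in tokens) after the segment, delta_V is the number
    of previously unseen word forms in the segment, delta_N is the segment
    length, and V1 is the number of hapax legomena in the first N tokens.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    counts = Counter()
    profile = []
    # Only full segments are considered; a trailing partial segment is ignored.
    for start in range(0, len(tokens) - segment_len + 1, segment_len):
        vocab_before = len(counts)
        counts.update(tokens[start:start + segment_len])
        n = start + segment_len
        growth = (len(counts) - vocab_before) / segment_len
        hapax = sum(1 for c in counts.values() if c == 1)
        profile.append((n, growth, hapax / n))
    return profile


def first_saturated_length(profile, threshold):
    """Return the first N at which delta_V / delta_N <= threshold, or None."""
    for n, growth, _ in profile:
        if growth <= threshold:
            return n
    return None


if __name__ == "__main__":
    # Toy data: 12 tokens, segments of 4 tokens each (hypothetical values).
    toy = "a b c a b d a b c a b c".split()
    for n, growth, hapax_ratio in saturation_profile(toy, 4):
        print(n, round(growth, 3), round(hapax_ratio, 3))
    print(first_saturated_length(saturation_profile(toy, 4), 0.1))
```

Counting in opinions rather than text words, as note 9 allows, would only change what is passed as the segment; as note 11 points out, the resulting ratio values would differ accordingly.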

References

1. Collins English Dictionary – Complete & Unabridged 2012 Digital Edition; © William Collins Sons & Co. Ltd. 1979, 1986 © HarperCollins Publishers (1998, 2000, 2003, 2005, 2006, 2007, 2009, 2012)
2. Charaudeau, P., Maingueneau, D.: Dictionnaire d’Analyse du Discours. Seuil, Paris (2002)
3. Stoyanov, V., Cardie, C., Litman, D., Wiebe, J.: Evaluating an opinion annotation scheme using a new multi-perspective question and answer corpus. In: Shanahan, J.G., Qu, Y., Wiebe, J. (eds.) Computing Attitude and Affect in Text: Theory and Applications. The Information Retrieval Series, vol. 20, pp. 77–91. Springer, Dordrecht (2006)
4. Ptaszynski, M., Rzepka, R., Araki, K., Momouchi, Y.: Automatically annotating a five-billion-word corpus of Japanese blogs for affect and sentiment analysis. In: Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, Jeju, Republic of Korea, pp. 89–98. Association for Computational Linguistics, Stroudsburg (2012)
5. Esuli, A., Sebastiani, F.: SentiWordNet: a publicly available lexical resource for opinion mining. In: Proceedings of the 5th Conference on Language Resources and Evaluation, LREC 2006, pp. 417–422. European Language Resources Association, Genoa (2006)
6. Vetulani, Z., Vetulani, G., Kochanowski, B.: Recent advances in development of a lexicon-grammar of Polish: PolNet 3.0. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, pp. 2851–2854. European Language Resources Association, Paris (2016)
7. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17–23 May 2010, Valletta, Malta, pp. 1320–1326. European Language Resources Association, Genoa (2010)
8. Read, J.: Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In: Knight, K., Ng, H.T., Oflazer, K. (eds.) 43rd Annual Meeting of the Association for Computational Linguistics 2005, Proceedings of the Conference, University of Michigan. The Association for Computer Linguistics, New Brunswick (2005)
9. McEnery, T., Hardie, A.: Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, Cambridge (2012)
10. Kittredge, R.: Semantic processing of texts in restricted sublanguage. Comput. Math Appl. 9(1), 45–58 (1983)
11. Vetulani, Z.: Linguistic problems in the theory of man-machine communication in natural language. Universitätsverlag Dr. N. Brockmeyer, Bochum (1989)
12. Muller, Ch.: Peut-on estimer l’étendue d’un lexique? Cah. Lexicol. 27, 3–29 (1975)
13. Legieżyńska, A.: Julia Hartwig. Wdzięczność. Wydawnictwo Uniwersytetu Łódzkiego, Łódź (2017) (in Polish)

Author information

Authors and Affiliations

Adam Mickiewicz University in Poznań, Poznań, Poland
Zygmunt Vetulani & Marta Witkowska
University of Kocaeli, İzmit, Kocaeli, Turkey
Suleyman Menken & Umut Canbolat

Corresponding author

Correspondence to Zygmunt Vetulani.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
LIMSI-CNRS, Orsay Cedex, France
Joseph Mariani
Adam Mickiewicz University, Poznań, Poland
Marek Kubis

Cite this paper

Vetulani, Z., Witkowska, M., Menken, S., Canbolat, U. (2018). Saturation Tests in Application to Validation of Opinion Corpora: A Tool for Corpora Processing. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science, vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_27

DOI: https://doi.org/10.1007/978-3-319-93782-3_27
Published: 16 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93781-6
Online ISBN: 978-3-319-93782-3
eBook Packages: Computer Science, Computer Science (R0)