1 Introduction

1.1 Motivation

Twitter data have become a source of insights for studying consumer behavior in different contexts and domains. For food consumer behavior, the need to use Twitter arises from limitations of common research practices: contextual variables strongly influence responses, and consumers may give subjective answers that are socially desirable or over-rationalized [1].

Twitter offers an alternative that addresses those limitations, since it is a natural consumption setting that gives spontaneous access to consumer information. According to Vidal et al. [1], Twitter is used to share daily information, including consumption routines and comments; since eating and drinking are among the most common human activities, tweets can be a data source for food-related consumer behavior insights.

According to Nielsen [2], Twitter is well positioned to study eating situations since it provides researchers the opportunity to retrieve spontaneous data, generated in real-life settings. In this sense, it is possible to collect data in any situation, considering that consumers are increasingly using smartphones to access social media [2].

1.2 The Potential of Twitter in Marketing Research

The evolution of digital trends such as social networks, mobile technologies, cloud computing and Big Data provides a huge source of information about consumer behavior, needs and requirements [3]. As a result, companies can offer completely new services, interact with customers and provide a completely different work environment, which is why digital technology plays a critical role in consumer research strategies.

Uhl et al. [4] argue that adopting a Customer Centralization strategy allows organizations to position themselves and achieve economic success. From the perspective of the customer’s life cycle, the Customer Centralization strategy focuses on four phases: information, supply, purchase and after-sales. The customer experience cuts across every phase. Customer Centralization is defined as a managerial philosophy that involves a complete alignment of the company with its existing and potential customer relationships; it makes the customer the focus of all commercial considerations. The central principle is to increase the value of the company and of the customer through the systematic management of these relationships.

In order to improve companies’ customer knowledge, and considering Twitter an invaluable source of information about customer profiles, interests and behaviors, this paper proposes an algorithm that analyzes food mention behavior in social networks from different points of view. The algorithm identifies the context in which a customer mentions food-related words and characterizes the situation in which the tweet was posted. In addition, it considers not only narrative text but also hashtags and user mentions. The proposed algorithm shows improvement over current approaches employed in food-related studies in social media.

This paper is organized as follows. Section 2 presents recent studies that use Twitter information to perform analyses in specific areas such as food consumption, tobacco consumption and healthcare. Section 3 describes the proposed algorithm: first the food extraction processes used to construct the knowledge base that supports the algorithm, and then the proposed algorithm for food detection. Section 4 presents the main results obtained in the case study, and Sect. 5 concludes with our main contributions and future work.

2 Related Works

Recent food-related studies have focused on problems, topics and consumer behavior research in public health. Vidal et al. [1] found that “people tended to mainly tweet about eating situations when they enjoyed them, due to the foods they consumed, the place in which they were and/or the people they shared the meal with.”

Abbar et al. [5] found that the foods mentioned in users’ daily tweets are predictive of national obesity and diabetes statistics, showing how the calories tweeted are linked to user interests and demographic indicators, and that users sharing more friends are more likely to display a similar interest in food. This work correlates demographic indicators with food-related information. Abbar et al. enriched their data using a variety of sources, which allowed them to consider the nutritional value of the foods mentioned in tweets, the demographic characteristics of the users who tweet them, their interests, and the social network they belong to.

In a recent study, Prier et al. [6] used LDA to find topics related to tobacco consumption, such as addiction recovery, other drug use, and anti-smoking campaigns. Finally, Dredze et al. [7] applied a Food Topic Aspect Model to tweets to find mentions of various aliments. The results suggest that chronic health behaviors, such as tobacco use, can be identified and measured; however, this does not apply to short-term health events, such as disease outbreaks. It was also found that the demographics of Twitter users can affect this type of study, leaving the debate open.

Users of online social networks (OSN) reveal a lot about themselves; however, depending on their privacy concerns, they also choose not to share details that seem sensitive to them, reconfiguring access to their information in the OSN [11]. Many well-known Facebook applications request a large amount of information from the user before they can be used [12]. In contrast, the proposal presented in this article is based exclusively on the publicly available information of social network users and, in that sense, does not violate any agreement on the use of data applicable in America and Europe. In addition, the demographic data derived from OSN users and employed for the food-consumption analysis are the result of a previous project that demonstrates and validates the potential of public Twitter posts to infer valuable information about their authors [13, 14].

3 Consumer Food Choice Identification

In order to explore Twitter data, we used a bag-of-words approach [8, 9, 13] to understand tweet content related to food consumption. This method uses an initial food knowledge base of 1128 words, generated by an automatic domain constructor [16]. In a first pass of the analysis, we found that a large portion of the tweets that include food words do not refer to actual food consumption; this is one of the most important challenges for the algorithm. Most of them are popular sayings that include food words, such as:

“amigo el ratón del queso (friend the mouse of the cheese)”.

“cuentas claras y el chocolate espeso (bills clear and chocolate thick)”.

“sartén por el mango (taking the frying pan by the handle)”.

“al pan, pan y al vino, vino (the bread is bread and the wine is wine)”.

Other tweets contained expressions that include food words but are widely used in other contexts. For example, in the Colombian political context the word mermelada (jam) is associated with corruption issues and is widely used in social networks:

“Desastroso es un gobierno lleno de mermelada y clientelismo (a government full of jam and clientelism is disastrous)”.

Additionally, this analysis shows that many users refer to specific products through popular brands, without using the food word, such as “Pony Malta (soda)”, “CocaCola (soda)” and “Galletas Oreo (cookies)”. Similarly, other users use hashtags such as “#almuerzo (#lunch)” and “#aguacate (#avocado)”, or mentions (usernames) such as “@baileysoriginal” and “@BogotaBeerCo”, to refer to products or places of consumption. Finally, we also found that users refer to food consumption with emojis, as shown in Fig. 1.

Fig. 1. Emoji use referring to food consumption.

Taking these insights into account, we faced two main challenges: (i) creating a knowledge base that can be used to analyze food mention behavior in narrative texts from social networks, and (ii) proposing a new food mention identification algorithm that recognizes the context of food-related words, using different aspects of the publication to disambiguate them, such as hashtags, user mentions, emojis and food n-grams with n ≥ 1, as well as non-food n-grams.

3.1 Knowledge Base Generation

In this section we present the result of the knowledge base generation, which is composed of 11 lists, namely:

  • Emoji list: this list was constructed using as its primary source version 11 of the Unicode emoji characters and sequences from the Unicode standard. This list has a total of 2620 emojis.

  • Food emoji list: Felbo et al. [10] used emoji prediction to find topics related to feelings in different domains. In our proposal we use a subset of 95 emojis from the emoji list, named the food emoji list. An extract of this list is presented in Fig. 2, which shows each emoji and its meaning in English and Spanish.

    Fig. 2. Food emoji list sample.

  • What list: to construct this list, the initial automatically generated food list (see Sect. 3) was manually reviewed, producing a new list that contains only unigrams used in the food consumption context. It contains 776 words, including their stems.

  • Where list: this list identifies places, locations or spatial situations associated with food consumption. It contains 128 words.

  • Who list: this list identifies the people with whom food is shared, using relationships and professions. It contains 112 words.

  • When list: this list contains moments, occasions and temporal situations in which people consume food. It contains 27 words.

  • Food stop word list: it contains 178 popular sayings or expressions (non-food n-grams) frequently used on Twitter with food words, which do not correspond to the food consumption context. This list aims to be a filter to discard tweets.

  • Food list: this list is composed of 441 n-grams with n > 1.

  • Food user mention list: this list details 95 Twitter usernames associated with products, brands and places.

  • Food entity list: food brand list with 812 elements.

  • Food hashtag list: includes 450 hashtags related to food, that are typically used to refer to specific products or places.

Based on the lists described above, the following section presents the proposed algorithm to identify food mentions in Twitter text.
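As an illustration only, the knowledge base could be held in memory as a set per list. The class name, field names and the `sizes` helper below are hypothetical and are not taken from the original implementation; the entries are a handful of examples quoted earlier in this section, not the full lists.

```python
from dataclasses import dataclass, field


@dataclass
class FoodKnowledgeBase:
    """Hypothetical container with one set per list of Sect. 3.1."""
    emojis: set[str] = field(default_factory=set)
    food_emojis: set[str] = field(default_factory=set)
    what: set[str] = field(default_factory=set)              # unigrams and stems
    where: set[str] = field(default_factory=set)
    who: set[str] = field(default_factory=set)
    when: set[str] = field(default_factory=set)
    food_stop_ngrams: set[str] = field(default_factory=set)  # sayings used as a filter
    food_ngrams: set[str] = field(default_factory=set)       # n-grams with n > 1
    food_mentions: set[str] = field(default_factory=set)
    food_entities: set[str] = field(default_factory=set)     # brands
    food_hashtags: set[str] = field(default_factory=set)

    def sizes(self) -> dict[str, int]:
        """Number of entries per list, keyed by field name."""
        return {name: len(values) for name, values in vars(self).items()}


# Toy instance populated with examples quoted in the text above.
kb = FoodKnowledgeBase(
    food_stop_ngrams={"sartén por el mango", "cuentas claras y el chocolate espeso"},
    food_mentions={"@baileysoriginal", "@BogotaBeerCo"},
    food_entities={"Pony Malta", "CocaCola", "Galletas Oreo"},
    food_hashtags={"#almuerzo", "#aguacate"},
)
print(kb.sizes())
```

Using sets keeps membership checks cheap and makes it straightforward to adjust individual lists independently.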

3.2 Modelling

The proposed algorithm focuses on determining whether a tweet can be related to the food context by text or by entity. To accomplish this, the main input of the algorithm is the preprocessed tweet text, which contains tokens, their stems and recognized entities. The algorithm is shown in Fig. 3 and explained next:

Fig. 3. Proposed food mention identification algorithm.

First, if the tweet only contains non-food n-grams, it is discarded and the algorithm finishes. Otherwise, the algorithm tries to determine a food context relationship by text or by entity:

  • By Text: for each token in the tweet, the algorithm checks whether its stem belongs to the what list and, if so, validates the token only if it is a noun. Once there are no more tokens to check, the algorithm determines a food context relationship by text if at least one token was validated.

  • By Entity: the algorithm determines a food context relationship by entity if the tweet contains food context n-grams, brands, hashtags, mentions or emojis, using the recognized entities from the preprocessed text.

Consequently, the algorithm asserts a food context relationship only if at least one relationship or food mention was determined in the previous steps. In that case, the algorithm tries to identify context characteristics such as places, people or moments, using the where, who and when lists. Otherwise, the tweet is discarded. As a result, the algorithm stores five types of elements: (i) food n-grams, products or brands; (ii) place, identified through words, hashtags or mentions; (iii) people with whom the food is consumed; (iv) moment of consumption (time of day, consumption time, day of the week, among others); and finally, (v) tweet publication time.
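The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the entity label names, the `Token` and `Tweet` types and the toy knowledge base are hypothetical. The sketch reads the first step as "discard when a non-food n-gram was recognized", and it looks up places, people and moments through word stems only, whereas the text also mentions hashtags and mentions for places.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical entity labels, assumed to be assigned by an earlier NER step.
NON_FOOD = "FOOD_STOP"
FOOD_LABELS = {"FOOD_NGRAM", "FOOD_BRAND", "FOOD_HASHTAG", "FOOD_MENTION", "FOOD_EMOJI"}


@dataclass
class Token:
    text: str
    stem: str
    pos: str  # coarse part-of-speech tag, e.g. "NOUN"


@dataclass
class Tweet:
    tokens: list[Token]
    entities: list[tuple[str, str]]  # (label, surface text)
    published_at: str


def identify_food_mention(tweet: Tweet, kb: dict[str, set[str]]) -> Optional[dict]:
    """Return the stored elements for a food-related tweet, or None if discarded."""
    # Assumption: any recognized non-food n-gram (popular saying) discards the tweet.
    if any(label == NON_FOOD for label, _ in tweet.entities):
        return None

    # By text: noun tokens whose stem belongs to the what list.
    by_text = [t.text for t in tweet.tokens if t.stem in kb["what"] and t.pos == "NOUN"]
    # By entity: recognized food n-grams, brands, hashtags, mentions or emojis.
    by_entity = [text for label, text in tweet.entities if label in FOOD_LABELS]

    if not by_text and not by_entity:
        return None  # no food context relationship was determined

    stems = {t.stem for t in tweet.tokens}
    return {
        "food": by_text + by_entity,
        "place": sorted(stems & kb["where"]),
        "people": sorted(stems & kb["who"]),
        "moment": sorted(stems & kb["when"]),
        "published_at": tweet.published_at,
    }


# Toy knowledge base; stems are simply lowercase words here for readability.
kb = {"what": {"arroz", "pollo", "mango"}, "where": {"restaurante"},
      "who": {"mamá"}, "when": {"almuerzo"}}

tweet = Tweet(
    tokens=[Token("almuerzo", "almuerzo", "NOUN"), Token("con", "con", "ADP"),
            Token("mamá", "mamá", "NOUN"), Token("arroz", "arroz", "NOUN"),
            Token("y", "y", "CCONJ"), Token("pollo", "pollo", "NOUN")],
    entities=[("FOOD_HASHTAG", "#almuerzo")],
    published_at="12:30",
)
saying = Tweet(
    tokens=[Token("mango", "mango", "NOUN")],
    entities=[(NON_FOOD, "sartén por el mango")],
    published_at="09:00",
)

result = identify_food_mention(tweet, kb)
discarded = identify_food_mention(saying, kb)
```

In this toy run the first tweet is kept with its food words, companion and moment, while the saying is discarded even though "mango" is a noun in the what list.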

3.3 Evaluation

To evaluate the proposed algorithm, an ETL (Extract, Transform, Load) system was designed and implemented using Big Data technologies. As shown in Fig. 4, Twitter is used as the data source; data are extracted through its public API, with an implementation in Python.

Fig. 4. ETL implementation design.

The extracted data are stored in a MongoDB database, in the Staging area. Then, in the transformation stage, four steps take place:

  1. Data cleaning: in this step, data are selected from the Staging area for preparation only if their geographic location corresponds to Colombia and their language is Spanish.

  2. Text preprocessing: in this step, the tweet text is tokenized, tagged, parsed, stemmed and lemmatized using spaCy’s Spanish processing pipeline. Additionally, the structured text (user mentions, hashtags and emojis, the latter using the emoji list described in Sect. 3.1) is extracted and labeled accordingly.

  3. Named Entity Recognition (NER): here, the n-grams from the following lists are recognized and labeled as entities within the text: food list, food stop word list, food user mention list, food hashtag list and food entity list.

  4. Food mention identification: in this last step, our algorithm (see Sect. 3.2) is used to identify whether or not a tweet contains a food mention.
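The data cleaning filter and the extraction of structured text can be sketched with the standard library alone. The record keys `country_code` and `lang`, the function names and the regular expressions are assumptions made for illustration; the original pipeline relies on spaCy and may differ. Emoji extraction is omitted because it depends on the full emoji list.

```python
import re

# Simple patterns for hashtags and user mentions (an approximation).
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def keep_tweet(tweet: dict) -> bool:
    """Data cleaning: keep only tweets located in Colombia and written in Spanish."""
    return tweet.get("country_code") == "CO" and tweet.get("lang") == "es"


def extract_structured(text: str) -> dict[str, list[str]]:
    """Text preprocessing (partial): pull hashtags and user mentions out of the text."""
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
    }


# Hypothetical staging records.
staging = [
    {"text": "Hoy #almuerzo con @BogotaBeerCo", "country_code": "CO", "lang": "es"},
    {"text": "Lunch time #avocado", "country_code": "US", "lang": "en"},
]

cleaned = [t for t in staging if keep_tweet(t)]
structured = [extract_structured(t["text"]) for t in cleaned]
```

Only the first record passes the filter, and its hashtag and mention are extracted for the later NER step.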

Finally, if the tweet contains a food mention, it is uploaded to MongoDB in the load stage. It is worth mentioning that this exercise is part of a project named Digital Segmentation System from CAOBA [13], which, based on information from Twitter, generates an approximation of the characteristics of the users who create the publications. These characteristics range from sociodemographic aspects to sociographic attributes related to emotions, interests and polarity. These variables will be considered in the next section.

Using the ETL system, a case study was constructed with 1.3 million tweets extracted over fifteen days within the same month. In this period, our proposed algorithm identified 11,691 tweets that mentioned food, corresponding to 2% of the extracted tweets. A sample of the results obtained from the algorithm was manually evaluated to determine whether each original tweet is actually related to the food context; as a result, a precision of 70% was obtained.
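For reference, precision over a manually reviewed sample is the share of tweets flagged by the algorithm that reviewers confirmed as food-related. The sketch below uses a made-up sample of ten labels purely to show the computation; the function name and the sample are hypothetical and the sample is not the study data.

```python
def precision(manual_labels: list[bool]) -> float:
    """Share of algorithm-flagged tweets confirmed as food-related by reviewers."""
    if not manual_labels:
        raise ValueError("the evaluated sample is empty")
    return sum(manual_labels) / len(manual_labels)


# Hypothetical review outcome: 7 confirmed tweets out of 10 flagged ones.
sample = [True] * 7 + [False] * 3
sample_precision = precision(sample)
```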

4 Results

The loaded tweets were classified according to the type of the mentioned words. Our method identified 1,310 different words, of which 59 are mentioned 100 or more times. Table 1 shows the 20 most frequent words according to the time of day (breakfast: 5 am–9 am, lunch: 11 am–2 pm, snack: 10 am/3 pm–5 pm and dinner: 6 pm–9 pm).
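The time-of-day slots listed above can be expressed as a small lookup. This sketch assumes that the publication hour is available as an integer from 0 to 23 and that each boundary hour is inclusive (for example, 9:45 counts as breakfast); the text does not state how boundaries were handled, and the function name is hypothetical.

```python
from typing import Optional


def meal_slot(hour: int) -> Optional[str]:
    """Map an hour of day (0-23) to a meal slot, or None outside all slots."""
    if 5 <= hour <= 9:
        return "breakfast"
    if 11 <= hour <= 14:
        return "lunch"
    if hour == 10 or 15 <= hour <= 17:
        return "snack"
    if 18 <= hour <= 21:
        return "dinner"
    return None


slots = [meal_slot(h) for h in (7, 10, 12, 16, 20, 23)]
```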

Table 1. Number of tweets according to the type of content entity. All words marked with * cannot be translated to English.

In general, the word “cerveza” (beer) is the most frequent at almost any time of the day; however, there is a group of words that, consistent with a Colombian dietary routine, are mentioned at a specific time of day. Such is the case of bebida caliente (hot drink), pan (bread), queso (cheese), arepa (white corn cake) and chocolate (chocolate) at breakfast, or arroz (rice), pollo (chicken), pizza (pizza) and carne (meat) at lunchtime.

According to the previously established classification, a word was identified as food, product or brand 33,108 times, as a place 1,324 times, as a companion 1,426 times and as a consumption occasion 2,726 times. For the last three classifications, the most frequently mentioned words are presented in Fig. 5.

Fig. 5. Word clouds for where, who and when.

Additionally, it is possible to observe the behavior of users according to the time of day at which they publish. Figure 6 shows a tendency to publish more tweets around noon and between 18–21 h.

Fig. 6. Frequency distribution of tweets according to publication time.

Regarding the sociographic variables, emotion is detected for 40% of the tweets and polarity for 86.1%. In the individual analysis of the most frequent words and their relationship with the emotion of the tweet (see Table 2), words like cake, cocktail, hot drink, wine and avocado appear in a high share of tweets related to joy. On the other hand, for words like dinner, pizza and chicken, sadness accounts for more than 20% of the tweets.

Table 2. Most frequent words according to emotions. All words marked with * cannot be translated to English.

4.1 Characterizing Users by Age and Gender

Table 3 shows the grouping by type of food and by age group of the users who mention them, considering about 50% of the most mentioned words. The population aged 35 and over tends to mention healthier foods, such as meat, cheese, chicken or the so-called “Natural Food”, which in this text refers to usual or homemade food: rice, pasta, potato. The tendency to mention alcoholic beverages is strong in Colombian tweets; however, it is much more pronounced in the population under 35 years.

Table 3. Food group versus age range.

When differences by gender are observed, men tend to mention alcoholic beverages, such as wine, drink and beer, whereas among women terms such as desserts, sweets, milkshakes and chocolates are more frequent. That is, a pronounced tendency towards mentioning sweets was found among women; nevertheless, their mention of fruits is also evident. These results are presented in the word clouds of Fig. 7.

Fig. 7. Food-related words by gender (men and women).

Differences in the mentions of alcoholic beverages are also perceived at the socioeconomic level, where they represent the greatest differences: beer is favored at lower socioeconomic levels, and the opposite holds for cocktail, a more expensive drink. Coffee and hot drinks also predominate in high strata. Table 4 presents the words showing the greatest differences between strata.

Table 4. Word distribution by socioeconomic level. All words marked with * cannot be translated to English.

5 Discussion and Future Work

Food preferences expressed in social networks such as Twitter are a valuable source of information for making decisions about customer centralization strategies. Therefore, knowing publishing tendencies, interests and behaviors based on comments published by users makes it possible to assess the pertinence and strength of marketing strategies.

Through a case study, it is shown that Twitter information provides elements that allow a global analysis of preferences in foods, products or brands, based on the mentions made by users in the network. Although only 2% of the messages were found to be related to food, they involve a significant number of users, which would significantly exceed the reach of other methodologies, such as specialized surveys. Continuous information collection also enables a significant increase in the volume of users that can be identified over time, as well as the identification of patterns or changes in behavior, which is a very relevant aspect.

One of the most valuable elements of the exercise is the generation of the knowledge base, which must be adjusted to ensure that the products of interest (including those of competitors) are included; in turn, more specific relationships can be established regarding users’ opinions or emotions about them. An advantage of the way in which the algorithm was implemented is that these adjustments can be made without major difficulties. This would create new sources of unstructured open data, allowing other systems, such as health or marketing systems, to feed from these knowledge bases [15].

Despite these advantages, it is important to note that the algorithm’s precision is 70%, a value associated mainly with the attempt to build a knowledge base for such a broad domain. This affects the results and can lead to erroneous interpretations; however, if the algorithm were applied to a more specific domain, its performance would be expected to increase.

To estimate the magnitude of interpretation errors, a deeper content analysis will be necessary, in which intentionality is verified directly in the tweet texts. This kind of analysis requires a large amount of time, which exceeds the initial objectives and scope of this research. It is therefore proposed as future work.