1 Introduction

Sentiment Analysis (SA) is a leading approach for analysing people’s opinions about a product or an event and for identifying breakpoints in public opinion [1] towards a specific target or subject. In particular, patients and health consumers deliberately share their experiences and opinions on social media. Given the massive volume of patient experiences shared every day, the time dimension is essential for detecting the fine-grained sentiment expressed towards a set of drugs or events. For example, a negative event such as chemotherapy or radiation therapy may occur several times with a given ADE (Adverse Drug Event) as the duration of medication or the patient changes, and each occurrence may carry an irregular sentiment polarity.

In the medical and pharmaceutical industry, traditional clinical notes such as CRFs (Case Report Forms), which summarize the physical examination and the medical history of patients’ experiences with specific drugs or events, are not reliable and are inefficient at capturing the changing emotional state of patients over the course of medication. Moreover, existing SA methods are of limited help in quantifying the actual amount of emotion, because computing sentiment on medical text is complex and requires several transformations. The major issue is the inability of general-purpose SA tools to accurately recognize and define the sentiment expressed towards medical components, e.g. drugs, side effects or ADEs, and towards newly emerged medical entities, diseases, treatments or scientific studies at large [2]. Indeed, an irregular sentiment is expressed towards these items when they appear together at different times of medication.

In this paper, we present a hybrid system that combines unsupervised biomedical concept extraction in a given context with autoregressive time series modelling. The aim is to build a daily model by mining and personalizing the changing emotional states of users towards a specific target (medical entity) or subject (event).

The remainder of the paper is organized as follows. Section 2 reviews sentiment analysis approaches and methods in the same research context. Section 3.1 introduces the aspects of analysis considered in this study, and Section 3.2 explains the proposed system and its tasks. Section 4 presents the experiments with the baseline method and the results of the proposed system applied to Twitter microblogs. Section 5 summarizes the contribution of this paper and outlines research directions towards further advances in this area.

2 Sentiment Analysis

There has been increased interest in analysing health-related social media content. In 2016, Rodrigues et al. [3] developed a sentiment analysis tool, “SentiHealthCancer (SHC-pt)”, that improves the detection of the emotional state of patients in Brazilian online cancer communities. In other studies, Liu et al. [4] developed a framework that includes medical entity extraction for recognizing patient discussions of drugs and events. Leaman et al. [5] explored the value of patient intelligence for pharmacovigilance on social media. Rill et al. [6] proposed early detection of Twitter trends, which was faster than Google Trends. They considered temporal changes in the number of tweets to identify emerging topics. The polarity of tweets was determined using sentiment lexica such as SenticNet3, SWN, etc., and the polarity of novel words was determined by plotting a relational graph of emerging political tweets at different time periods.

There are also many general-purpose sentiment analysis tools [7], such as VADER (Valence Aware Dictionary and Sentiment Reasoner), a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. However, such tools misidentify several domain-specific components and, because they do not incorporate domain-specific dictionaries, cannot establish the accurate meaning of sentiments expressed towards a specific pharmaceutical or medical object or subject. We have tracked these gaps in existing tools by analysing pharmaceutical reviews with existing methods [8]. In addition, previous studies are less effective because of data sparseness and low accuracy caused by ignoring context, the type of text (such as culture-specific text modifiers) and the presence of domain-specific words, all of which may result in inaccurate classification of users’ opinions.

3 Proposed System and Aspect of Analysis

Twitter is one of the best-known online social networks and has enjoyed extreme popularity in recent years. Millions of patient messages and reports are posted each day. The proposed system aims at detecting real-world sentiment expressed towards medical components on Twitter. It analyses patient self-reports to obtain the latest credible survey by identifying the embedded relations between them. In brief, it is a sentiment analysis model based on three aspects.

3.1 Aspects of Analysis

SA is the task of extracting sentiments and detecting attitudes concerning different topics and entities as expressed in textual input. Classical methods of identifying sentiment polarity compute an overall sentiment sum for a text, treated as a set of individual words, based on sentiment dictionaries. The degree of positiveness or negativeness is quantified in a binary fashion, assigning a high score to positive polarity and a low score to negative polarity [9].

Text Aspect

SA is increasingly concerned not just with identifying sentiment but also with performing multi-task analysis based on several aspects [2], in particular when analysing the sentiment of patient self-reports. These reports contain highly unstructured data combining text, drug names, emoticons, ADEs and unknown issues related to an event or drug, and they are used to make the public aware of various issues [10]. Our system considers three main aspects. The text aspect is an important stage in identifying the relevant sentiment of patient messages or reports on social media [11]; it remains complicated because the text may be short, informal and ambiguous. In this case, a processed word-distribution space is of great benefit for determining opinion words [12]. We therefore explore in depth the semantic relations between text items to detect the relevant opinion words as features while preserving the correlation between them [13].

Entity Aspect

Medical features: A medical text or patient self-report may contain another type of entity, as a combination of specific medical components, e.g. medical events, side effects or ADEs, or drug names. To improve the accuracy of sentiment classification, it is necessary to analyse sentiment with respect to the entity aspect.

In this stage, we attempt to recognize and define irregular medical features. Table 1 presents some tweets in the medical context that are annotated based on the Unified Medical Language System (UMLS) knowledge sources. It clearly shows how a tweet can contain dependent medical components and how the context and target of analysis may differ each time.

Table 1. Annotated tweets based on UMLS knowledge sources.

Real-world annotations are performed in an explicit way so that the system can recognize and label the topic and target of analysis.

Twitter features: Twitter also has specific features, e.g. retweet count, hashtags, mentions and the form of the text. Twitter is responsible for popularizing the use of the hashtag as a way to group conversations, follow a particular topic or highlight a specific entity. We slice and dice the data to analyse hashtags and the other related Twitter entities, mentions and retweet counts, which help identify the factors that can affect the daily sentiment expressed towards a specific target. The hashtag aspect can directly reveal the hidden subjectivity of a related topic or entity. Table 2 shows the most informative hashtags among the 18,992 hashtags that appeared, many of them several times, in 249,581 tweets.

Table 2. The most informative hashtags among the 18,992 hashtags that appeared in 249,581 tweets.

The plot below shows how some hashtags appear frequently in patient reports, which may indicate the importance of an object or subject in real-time discussion (Fig. 1).

Fig. 1. The 50 most informative hashtags among the 18,992 hashtags that appeared in 249,581 tweets.

Along the time axis, hashtag frequency varies across time intervals, as shown in Fig. 2.

Fig. 2. Example of hashtag frequency changing over time.
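As an illustration of how hashtag frequency can be tracked per time interval, the following minimal sketch counts hashtags in hourly buckets. The input shape (pairs of timestamp and text), the hourly granularity, the function name `hashtag_counts_by_hour` and the sample tweets are assumptions made for this example; they are not taken from the system described here.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

HASHTAG_RE = re.compile(r"#(\w+)")

def hashtag_counts_by_hour(tweets):
    """Count hashtag occurrences per hour bucket.

    `tweets` is an iterable of (timestamp, text) pairs; this input
    shape is an assumption made for illustration.
    """
    buckets = defaultdict(Counter)
    for timestamp, text in tweets:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        for tag in HASHTAG_RE.findall(text):
            buckets[hour][tag.lower()] += 1
    return dict(buckets)

# Hypothetical example tweets (not taken from the study's data).
sample = [
    (datetime(2017, 11, 2, 22, 5), "New #asthma study on #steroids"),
    (datetime(2017, 11, 2, 22, 40), "Side effects again #Asthma"),
    (datetime(2017, 11, 2, 23, 10), "#steroids and adrenal glands"),
]
counts = hashtag_counts_by_hour(sample)
```

Each bucket is a `Counter`, so the most frequent hashtags in an interval can be read with `counts[hour].most_common(n)`.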

Time Aspect

Finding the “signal in the noisy patient report” in real time throughout the medication period is valuable in terms of responsiveness in detecting issues related to drugs or treatments; e.g. it allows a time map of inter-infection components (disease, drug or event) to be examined.

We are mainly concerned with time series modelling, which naturally suggests a time-aspect-based analysis. The time series is generated from uncorrelated reports/tweets, and the approach is motivated by the presumption that correlation between adjacent points in time is best explained in terms of a dependence of the current value on past values. The time domain approach focuses on modelling some future value of a time series as a parametric function of the current and past values.
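For reference, the standard autoregressive model of order p is one common way to write this dependence of the current value on past values. The notation is introduced here for illustration only: x_t stands for the sentiment value at time t, c is a constant, the φ_i are the autoregressive coefficients, and ε_t is a noise term assumed to be Gaussian white noise.

```latex
x_t = c + \sum_{i=1}^{p} \phi_i \, x_{t-i} + \varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)
```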

3.2 Proposed System

Our proposed system performs a conditional sentiment analysis to detect new irregular features based on past features. At the first level, we prepare the feature space to serve as input for filtering and selecting relevant features. Feature selection is performed based on the three aspects of analysis, which are combined each time to form the input of the autoregressive learning method. The daily update aims at detecting irregular features based on time series, so that the system is able to label emerging entities based on historical data of past irregular features, as shown in Fig. 3.

Fig. 3. View of the proposed system.

Data Streaming:

Streaming data is data that is generated continuously. When interacting with Twitter via a REST API, we can search for existing tweets, that is, tweets that have already been published and made available for search. These APIs often limit the number of tweets that can be retrieved, not just in terms of rate limits but also in terms of time span. It is usually possible to go back in time only up to approximately one week, meaning that older tweets are not retrievable. A second aspect to consider is that REST API results are usually best effort: they are not guaranteed to provide all the tweets published on Twitter, as some tweets could be unavailable to search or simply indexed with a small delay. The Streaming API, on the other hand, looks into the future. Once we open a connection, we can keep it open and go forward in time. By keeping the HTTP connection open, we can retrieve all the tweets that match our filter criteria as they are published.

The Streaming API is, generally speaking, the preferred way of downloading a large number of tweets, as the interaction with the platform is limited to keeping one connection open. On the downside, collecting tweets in this way can be more time-consuming, as we need to wait for the tweets to be published before we can collect them. To summarize, our system uses the Streaming API to download and search the massive number of tweets in the medical and pharmaceutical context.

Preprocessing Data:

This step prepares the collected unstructured data, that is, the raw text of the tweets. We use text preprocessing and normalization methods, and we perform some statistical analysis on the tweets.

Data preprocessing is a crucial step in sentiment analysis [11], since selecting the appropriate preprocessing methods can increase the number of correctly classified instances [12]. In view of the above, our system uses a combination of methods:

  • Tokenization.

  • Emoticons replacement.

  • Punctuation marks.

  • Word normalization.

  • Regular expressions operations.

  • Removing other Twitter components: URLs and Twitter abbreviations such as RT (an abbreviation for ReTweet, i.e. a repeated post).
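The steps listed above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the implementation used in the study: the emoticon map, the regular expressions, the order of the steps and the function name `preprocess` are all hypothetical choices.

```python
import re

# Hypothetical emoticon map; a real system would use a fuller lexicon.
EMOTICONS = {":)": " happy ", ":(": " sad ", ":D": " laugh "}

URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"\bRT\b")            # Twitter abbreviation for ReTweet
PUNCT_RE = re.compile(r"[^\w\s#@]")      # keep hashtag and mention markers
REPEAT_RE = re.compile(r"(.)\1{2,}")     # a character repeated 3+ times

def preprocess(tweet):
    """Return a list of normalized tokens for one raw tweet."""
    text = URL_RE.sub(" ", tweet)                 # remove URLs
    text = RT_RE.sub("", text)                    # remove the RT marker
    for emoticon, word in EMOTICONS.items():      # emoticon replacement
        text = text.replace(emoticon, word)
    text = PUNCT_RE.sub(" ", text)                # strip punctuation marks
    text = REPEAT_RE.sub(r"\1\1", text)           # "sooooo" -> "soo"
    return text.lower().split()                   # lowercase and tokenize

tokens = preprocess(
    "RT @user: This drug is sooooo bad :( http://t.co/xyz #asthma"
)
# tokens == ['@user', 'this', 'drug', 'is', 'soo', 'bad', 'sad', '#asthma']
```

URLs are removed before emoticon replacement so that characters inside a link are not mistaken for emoticons, and the `#` and `@` markers are kept so that hashtags and mentions remain identifiable in later steps.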

Features Extraction:

The opinion-word model is created at the first level, using the n-gram method to develop opinion-word features. As mentioned above, a number of additional features can help our system calculate more precisely the real-world sentiment expressed towards a drug name, medical event or company name, e.g. hashtags, retweet counts, mentions and medical entities. Medical entities can be diseases, drugs, symptoms, etc. Previously, researchers in the field have used hand-crafted features to identify medical entities in medical literature. It has been found that, in contrast with semantic approaches, which require rich domain knowledge for rule or pattern construction, statistical approaches are more scalable. Medical entity recognition is a crucial step towards efficient analysis of medical texts [14]. In recent years, tools such as MetaMap and cTAKES have been widely used for medical concept extraction on medical literature and clinical notes. One example is QuickUMLS [15], a fast, unsupervised, approximate dictionary matching algorithm for medical concept extraction.

In this study, we extend unsupervised biomedical concept extraction to medical entity recognition for tweets based on the UMLS. The UMLS, or Unified Medical Language System, is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems. A second model, for medical features, was then created based on QuickUMLS. Moreover, we have developed additional functions specific to the Twitter setting to retrieve related features: hashtags and mentions. As shown in the previous section, Twitter features such as hashtags that occur frequently in several patient reports are strongly correlated with the sentiment expressed behind them.
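The n-gram and Twitter-specific features described in this section can be sketched as follows. The function names `ngrams` and `twitter_features`, the whitespace tokenization and the sample tweet are hypothetical and serve only as an illustration; the medical entity step based on QuickUMLS is not shown.

```python
import re

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def twitter_features(text):
    """Extract hashtags and mentions from the raw tweet text."""
    return {
        "hashtags": [h.lower() for h in re.findall(r"#(\w+)", text)],
        "mentions": [m.lower() for m in re.findall(r"@(\w+)", text)],
    }

# Hypothetical tweet used only for illustration.
tweet = "@DrSmith asthma steroids linked to side effects #asthma #ADE"

# Opinion-word candidates: plain words, without hashtags and mentions.
tokens = [t for t in tweet.lower().split() if not t.startswith(("#", "@"))]
bigrams = ngrams(tokens, 2)
features = twitter_features(tweet)
```

Here `bigrams` holds word pairs such as `("asthma", "steroids")` and `("side", "effects")`, while `features` keeps the hashtags and mentions as separate feature lists.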

Analytics Engine:

The stream learning approach makes incremental changes to the model: in essence, it retrains as new records from a new set of tweets come in, and the updating process is applied to the whole feature space. Our system is based on this stochastic calculation, in which future values are estimated from past values, each time over the combined set of features and sentiment.

The main new functions supported by our system are:

  • Quantifying exactly what changes over time and the sentiment behind it.

  • Tracking the relationship between irregular features from a time perspective.

Machine learning tasks often operate on a dataset that covers a single slice of time or do not consider the time aspect at all. Our time-dependent system relies on classified tweets to link each extracted feature with a given sentiment score. Each time an irregular feature appears, its sentiment score is recorded and fed into the autoregressive algorithm to generate a new model. The primary objective of time series analysis is to provide a statistical setting for describing the character of the data. In our system, the data are a collection of random sentiment-labelled features indexed according to the order in which they are obtained in time. To yield the correct class and a confident prediction for each new input, we attempt to discretize our extraction model; that is, we make a number of statistical properties repeat constantly over time. In this step we use a Kalman filter to fit a modified form of the continuous-time autoregressive model, which can be particularly useful with uncorrelated Twitter data treated as a sampled time series. Briefly, the Kalman filter uses a series of measurements observed over time to produce estimates of unknown variables. This suits our system, in which an embedded patient report has many unknown components. Over a fixed interval, a set of irregular features is created, and the estimation of the unobserved components is conditional on the information made available after time t, as shown in Fig. 4.

Fig. 4. How the values at adjacent time periods are connected.
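To make the role of the Kalman filter concrete, the following minimal sketch filters a scalar series of sentiment scores under a first-order autoregressive state model. It is an illustration under stated assumptions, not the model fitted in this study: the function name `kalman_ar1`, the parameters `phi`, `q`, `r`, `x0` and `p0`, their default values and the sample scores are all hypothetical.

```python
def kalman_ar1(observations, phi=1.0, q=0.01, r=0.1, x0=0.5, p0=1.0):
    """Filter a scalar series under the state-space model
        x_t = phi * x_{t-1} + w_t,  w_t ~ N(0, q)   (hidden sentiment)
        z_t = x_t + v_t,            v_t ~ N(0, r)   (observed score)
    A None observation is treated as missing (prediction step only).
    All parameter values are assumptions chosen for illustration.
    """
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Prediction: propagate the state and its variance one step.
        x = phi * x
        p = phi * phi * p + q
        if z is not None:
            # Update: blend prediction and measurement via the gain k.
            k = p / (p + r)
            x = x + k * (z - x)
            p = (1.0 - k) * p
        estimates.append(x)
    return estimates

# Hypothetical sentiment scores in [0, 1]; None marks a time step
# at which no tweet about the entity was observed.
scores = [0.1, 0.2, None, 0.9, 0.8, 0.85]
filtered = kalman_ar1(scores)
```

Skipping the update step for missing observations is one simple way to handle irregularly sampled data: the estimate is carried forward by the state model and its uncertainty grows until the next observation arrives.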

4 Experimentation and Discussion

4.1 Dataset

We use the Sentiment140 dataset (sentiment140, n.d.) in this step. It contains 1,600,000 tweets extracted using the Twitter API that were hand-classified, with the polarity of each tweet labelled as 0 = negative, 2 = neutral and 4 = positive. In addition, we rescaled the sentiment value to the range from 0 (completely negative) to 1 (completely positive) and assume that values from 0.35 to 0.50 lie somewhere in the middle and are neutral.
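The rescaling and the neutral band can be sketched as follows. The linear mapping (dividing the label by 4) is an assumption, since the exact rescaling is not specified here, and the function names `rescale_polarity` and `to_class` are hypothetical; only the thresholds 0.35 and 0.50 come from the text.

```python
def rescale_polarity(label):
    """Map a Sentiment140 label in {0, 2, 4} to the range [0, 1].

    A linear mapping is assumed here for illustration.
    """
    return label / 4.0

def to_class(score, neutral_low=0.35, neutral_high=0.50):
    """Assign a class to a score in [0, 1] using the neutral band."""
    if score < neutral_low:
        return "negative"
    if score <= neutral_high:
        return "neutral"
    return "positive"

labels = [0, 2, 4]
classes = [to_class(rescale_polarity(label)) for label in labels]
```

Under this assumed mapping, the original neutral label 2 becomes 0.5, which falls on the upper edge of the neutral band.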

4.2 Baseline Model

The baseline is a well-known model that hybridizes a lexicon-based approach with a machine learning algorithm. It is a static quantification based on TF-IDF vectorization and uses a regression learning method. We used the REST API to collect tweets in the context of the pharmaceutical industry. Table 3 gives, for each tweet, its text and the sentiment value assigned by the baseline tool.

Table 3. Results of baseline on over 1 M tweets.

In the above results, the classified tweets are poorly estimated: the baseline model is not able to capture the existing relationship between a medical entity (asthma steroids), a disease (asthma attacks) and ADEs (Common asthma steroids linked to side effects in adrenal glands), which can strongly change the quantified sentiment each time they appear together, specifically when the speaker’s experiences are linked to specific healthcare or pharmaceutical firms.

4.3 Results and Output Data from the Study

In the previous section, we highlighted how our system can define more precisely the hidden sentiment information that correlates temporally with many modalities regarding irregular components.

In what follows, we present the results obtained by applying Kalman filter parameters generated from uncorrelated features as the base model for continuous time series fitting, as described in the previous section. The proposed method is a statistical measure that summarizes sentiment information by looking at the historical data on dependent entities from the time perspective. It deals specifically with newly emerged diseases, events or drug ADEs. This model minimizes the annotation cost and provides dynamic identification of real-world entities while maximizing the performance of our machine-learning-based sentiment classification. The result is a public sentiment indicator extracted from daily Twitter data. To this end, we streamed a large number of tweets from the present onwards. Table 4 shows some results obtained on different slices of time. As shown in Fig. 5, many sentiments are expressed differently over time, which is clearly described by the time series modelling in Fig. 6. On the same day, we observe a changing emotional state ranging from 0.1 (extremely negative) to 0.9 (extremely positive) towards several medical components (drug names, medical events), scientific studies or pharmaceutical company names at large. Indeed, medical components may frequently capture varied elements ranging from medical issues and product accessibility issues to potential side effects.

Table 4. Results of experiments on different time intervals.

Fig. 5. How the values at adjacent time periods are connected.

Fig. 6. Sentiment expressed over time for 100 tweets in 60 s on Thu Nov 02 22:55:57:01–60.

When observations are irregularly spaced in time, modifications to the estimators need to be made. Time series problems usually look at four main components:

  • Seasonality: seasonal behaviour, under the assumption that a series is generated as the sum of a trend effect, a seasonal effect and an error.

  • Trends: a gradually increasing underlying trend together with a rather regular variation.

  • Cycles: repetitive behaviour, with regularly repeating cycles that are easily visible. This periodic behaviour is of interest because the underlying processes may be regular, and the rate or frequency of oscillation characterizing the behaviour of the underlying series would help to identify them.

  • Irregular components: along the time axis of a patient’s life during the medication process, several drug issues are possible, and newly emerged diseases can appear with unknown ADEs.

To demonstrate the efficiency of our system and its ability to detect and handle changing problems, we should measure the frequency of detected irregular components relative to seasonality and the other components. In this context of analysis, we can also look for other settings and specific irregular features, such as several drugs or events that depend on seasonal and periodic factors. In the presence of uncertainty caused by noise and the emotional state of the speaker, the main objective of proposing the Kalman filter as the learning algorithm is to combine sentiment information and take advantage of the correlations that appear between previously defined components and newly emerged ones, based on continuously changing stream information.

5 Conclusion

SA offers real opportunities to obtain fine-grained data on patients and their environment, which provides a competitive edge in better understanding patients’ experiences shared on social media. Turning patients’ opinions into actionable information is having a profound impact on healthcare and pharmaceutical firms. Patient self-reports on social media frequently capture varied elements ranging from medical issues and product accessibility issues to potential side effects. Our system aims at identifying the embedded relationships between those components over time.

Our future work aims at enhancing the system’s capabilities to better observe patients’ physiological signals, to help provide situational awareness at the bedside, and to give an approximate idea of the general mood. Furthermore, from the time perspective, we can visually reconstruct some original hypotheses that may support patients and save lives.