Text mining and semantics: a systematic mapping study
As text semantics plays an important role in text meaning, the term semantics appears in a wide variety of text mining studies. However, there is a lack of studies that integrate the different research branches and summarize the work developed so far. This paper reports a systematic mapping of semantics-concerned text mining studies. The mapping followed a well-defined protocol, and its results were based on 1693 studies, selected among 3984 studies identified in five digital libraries. The produced mapping gives a general summary of the subject, points out areas that lack primary or secondary studies, and can serve as a guide for researchers working with semantics-concerned text mining. It demonstrates that, although several studies have been developed, the processing of semantic aspects in text mining remains an open research problem.
Keywords: Systematic review · Text mining · Text semantics
Text mining techniques have become essential for supporting knowledge discovery as the volume and variety of digital text documents have increased, whether in social networks, on the Web, or inside organizations. Text sources, as well as text mining applications, are varied. Although there is no consensus definition established among the different research communities, text mining can be seen as a set of methods used to analyze unstructured data and discover patterns that were unknown beforehand.
The pre-processing step is about preparing data for pattern extraction. In this step, raw text is transformed into some data representation format that can be used as input for the knowledge extraction algorithms. The activities performed in the pre-processing step are crucial for the success of the whole text mining process. The data representation must preserve the patterns hidden in the documents in a way that they can be discovered in the next step. In the pattern extraction step, the analyst applies a suitable algorithm to extract the hidden patterns. The algorithm is chosen based on the data available and the type of pattern that is expected. The extracted knowledge is evaluated in the post-processing step. If this knowledge meets the process objectives, it can be made available to the users, starting the final step of the process, the knowledge usage. Otherwise, another cycle must be performed, making changes in the data preparation activities and/or in the pattern extraction parameters. If any changes in the stated objectives or selected text collection must be made, the text mining process should be restarted at the problem identification step.
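As a purely illustrative sketch of this cycle (the tokenization scheme and the notion of "pattern" below are toy assumptions, not techniques prescribed by the process), the pre-processing, pattern extraction, and post-processing steps might look like:

```python
from collections import Counter

def preprocess(docs):
    # Pre-processing: transform raw text into a bag-of-words representation
    # (lowercasing and whitespace tokenization only, for illustration).
    return [Counter(doc.lower().split()) for doc in docs]

def extract_patterns(vectors, min_count=2):
    # Toy "pattern extraction": terms occurring in at least min_count documents.
    doc_freq = Counter()
    for vec in vectors:
        doc_freq.update(vec.keys())
    return {term for term, count in doc_freq.items() if count >= min_count}

def postprocess(patterns, objectives):
    # Post-processing: check whether the patterns meet the stated objectives;
    # if not, another cycle would run with adjusted parameters.
    return objectives.issubset(patterns)

docs = ["semantic text mining", "text mining methods", "semantic web"]
patterns = extract_patterns(preprocess(docs))
if not postprocess(patterns, {"mining", "semantic"}):
    patterns = extract_patterns(preprocess(docs), min_count=1)  # adjusted cycle
print(patterns)  # {'semantic', 'text', 'mining'} (set order may vary)
```

The loop at the end mirrors the iterative nature of the process: when post-processing rejects the result, the cycle is repeated with changed parameters.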
Text data are not naturally in a format that is suitable for pattern extraction, which brings additional challenges to an automatic knowledge discovery process. The meaning of natural language texts depends basically on lexical, syntactic, and semantic levels of linguistic knowledge. Each level is more complex and requires more sophisticated processing than the previous one. This is a common trade-off when dealing with natural language processing: expressiveness versus processing cost. Thus, the lexical and syntactic components have been explored more broadly in text mining than the semantic component. Recently, text mining researchers have become more interested in text semantics, looking for improvements in text mining results. This increasing interest can be attributed both to advances in computing capacity, which constantly reduce processing time, and to developments in the natural language processing field, which allow deeper processing of raw texts.
Consider, for example, the following sentences:
Company A acquired Company B.
Company B acquired Company A.
Company B was acquired by Company A.
Company A purchased Company B.
Besides, going even deeper into the interpretation of the sentences, we can understand their meaning—they are related to some takeover—and we can, for example, infer that there will be some impact on the business environment.
Traditionally, text mining techniques are based on both a bag-of-words representation and the application of data mining techniques. In this approach, only the lexical component of the texts is considered. In order to obtain a more complete analysis of text collections and better text mining results, several researchers have directed their attention to text semantics.
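A minimal sketch, using only Python's standard library, of why the bag-of-words representation captures only the lexical component (the sentences are the examples given above):

```python
from collections import Counter

def bag_of_words(sentence):
    # Lexical-only representation: word order and syntax are discarded.
    return Counter(sentence.lower().strip(".").split())

s1 = bag_of_words("Company A acquired Company B.")
s2 = bag_of_words("Company B acquired Company A.")
s3 = bag_of_words("Company A purchased Company B.")

print(s1 == s2)  # True: opposite meanings, identical lexical representation
print(s1 == s3)  # False: near-identical meaning, different representation
```

The two failure modes are exactly the ones the example sentences illustrate: sentences with opposite meanings collapse into the same vector, while paraphrases with the same meaning do not.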
Text semantics can be considered in the three main steps of the text mining process: pre-processing, pattern extraction, and post-processing. In the pre-processing step, the data representation can be based on some semantic aspect of the text collection. In pattern extraction, semantic information can be used to guide model generation or to refine it. In the post-processing step, the extracted patterns can be evaluated based on semantic aspects. Either way, text mining based on text semantics can go further than text mining based only on lexicon or syntax. A proper treatment of text semantics can lead to more appropriate results for certain applications. For example, semantic information has an important impact on document content and can be crucial to differentiate documents which, despite using the same vocabulary, present different ideas about the same subject.
The term semantics appears in a wide variety of text mining studies. However, there is a lack of studies that integrate the different branches of research performed to incorporate text semantics into the text mining process. Secondary studies, such as surveys and reviews, can integrate and organize the studies that have already been developed and guide future work.
Thus, this paper reports a systematic mapping study to overview the development of semantics-concerned studies and fill a literature review gap in this broad research field through a well-defined review process. Semantics can be related to a vast number of subjects, most of which are studied in the natural language processing field. As examples of semantics-related subjects, we can mention representation of meaning, semantic parsing and interpretation, word sense disambiguation, and coreference resolution. Nevertheless, the focus of this paper is not on semantics itself but on semantics-concerned text mining studies. As the term semantics appears in text mining studies in different contexts, this systematic mapping aims to present a general overview and point out areas that lack the development of primary studies, as well as areas in which secondary studies would be of great help. This paper aims to point out some directions for the reader who is interested in semantics-concerned text mining research.
As it covers a wide research field, this systematic mapping study started with a space of 3984 studies, identified in five digital libraries. Due to time and resource limitations, except for survey papers, the mapping was done primarily through information found in paper abstracts. Therefore, our intention is to present an overview of semantics-concerned text mining, presenting a map of the studies that have been developed by the research community, and not to present deep details of those studies. The papers were analyzed in relation to their application domains, performed tasks, applied methods and resources, and level of user’s interaction. The contribution of this paper is threefold: (i) it presents an overview of semantics-concerned text mining studies from a text mining viewpoint, organizing the studies according to seven aspects (application domains, languages, external knowledge sources, tasks, methods and algorithms, representation models, and user’s interaction); (ii) it quantifies and confirms some prior intuitions we had about the study subject; and (iii) it provides a starting point for researchers or practitioners who are initiating work on semantics-concerned text mining.
The remainder of this paper is organized as follows. The “Method applied for systematic mapping” section presents an overview of the systematic mapping method, since this is the type of literature review selected to develop this study and it is not widespread in the text mining community. In that section, we also present the protocol applied to conduct the systematic mapping study, including the research questions that guided this study and how it was conducted. The results of the systematic mapping, as well as identified future trends, are presented in the “Results and discussion” section. The “Conclusion” section concludes this work.
Method applied for systematic mapping
The review reported in this paper is the result of a systematic mapping study, which is a particular type of systematic literature review [3, 4]. A systematic literature review is a formal literature review adopted to identify, evaluate, and synthesize evidence of empirical results in order to answer a research question. It is extensively applied in medicine, as part of evidence-based medicine. This type of literature review is not as disseminated in the computer science field as it is in the medicine and health care fields, although computer science research can also take advantage of this type of review. We can find important reports on the use of systematic reviews especially in the software engineering community [3, 4, 6, 7]. Other, sparser initiatives can also be found in other computer science areas, such as cloud-based environments, image pattern recognition, biometric authentication, recommender systems, and opinion mining.
A systematic review is performed in order to answer a research question and must follow a defined protocol. The protocol is developed when planning the systematic review, and it is mainly composed of the research questions and the strategies and criteria for searching for primary studies, study selection, and data extraction. The protocol is a documentation of the review process and must contain all the information needed to perform the literature review in a systematic way. The analysis of selected studies, which is performed in the data extraction phase, provides the answers to the research questions that motivated the literature review. Kitchenham and Charters present a very useful guideline for planning and conducting systematic literature reviews. As systematic reviews follow a formal, well-defined, and documented protocol, they tend to be less biased and more reproducible than a regular literature review.
When the field of interest is broad and the objective is to have an overview of what is being developed in the research field, it is recommended to apply a particular type of systematic review named systematic mapping study [3, 4]. Systematic mapping studies follow a well-defined protocol, as in any systematic review. The main differences between a traditional systematic review and a systematic mapping are their breadth and depth. While a systematic review deeply analyzes a small number of primary studies, a systematic mapping analyzes a wider number of studies, but in less detail. Thus, the search terms of a systematic mapping are broader, and the results are usually presented through graphs. Systematic mapping studies can be used to get a mapping of the publications about some subject or field and to identify areas that require the development of more primary studies, as well as areas in which a narrower systematic literature review would be of great help to the research community.
This paper reports a systematic mapping study conducted to get a general overview of how text semantics is being treated in text mining studies. It fills a literature review gap in this broad research field through a well-defined review process. As a systematic mapping, our study follows the principles of a systematic mapping/review. However, as our goal was to develop a general mapping of a broad field, our study differs from the procedure suggested by Kitchenham and Charters in two ways. Firstly, Kitchenham and Charters state that the systematic review should be performed by two or more researchers. Although our mapping study was planned by two researchers, the study selection and information extraction phases were conducted by only one researcher due to resource constraints. In this process, the other researcher reviewed the execution of each systematic mapping phase and its results. Secondly, systematic reviews are usually based on primary studies only; nevertheless, we also accepted secondary studies (reviews or surveys), as we wanted an overview of all publications related to the theme.
In the following subsections, we describe our systematic mapping protocol and how this study was conducted.
Systematic mapping planning
- Research question: the main research question that guided this study was “How is semantics considered in text mining studies?” The main question was detailed in seven secondary questions, all of them related to text mining studies that consider text semantics in some way:
What are the application domains that focus on text semantics?
What are the natural languages being considered when working with text semantics?
Which external sources are frequently used in text mining studies when text semantics is considered?
In what text mining tasks is the text semantics most considered?
What methods and algorithms are commonly used?
How can texts be represented?
Do users or domain experts take part in the text mining process?
- Study identification: the study identification was performed through searches for studies conducted in five digital libraries: ACM Digital Library, IEEE Xplore, Science Direct, Web of Science, and Scopus. The following general search expression was applied in both Title and Keywords fields, when allowed by the digital library search engine: semantic* AND text* AND (mining OR representation OR clustering OR classification OR association rules).
- Study selection: every study returned in the search phase went to the selection phase. Studies were selected based on title, abstract, and paper information (such as number of pages). Through this analysis, duplicated studies (most of them found in more than one database) were identified. Besides, studies that matched at least one of the following exclusion criteria were rejected: (i) one-page papers, posters, presentations, abstracts, and editorials; (ii) papers hosted in services with restricted access and not accessible; (iii) papers written in languages other than English or Portuguese; and (iv) studies that do not deal with text mining and text semantics.
- Information extraction: the information extraction phase was performed with papers accepted in the selection phase (papers that were not identified as duplicated or rejected). The abstracts were read in order to extract the information presented in Fig. 2.
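For illustration, the general search expression used in study identification can be mimicked by a simple filter over paper titles (a sketch only; the actual searches were executed by the digital library search engines, whose matching rules may differ):

```python
import re

def matches_search(text):
    # Mirrors the boolean search expression:
    # semantic* AND text* AND (mining OR representation OR clustering
    #                          OR classification OR association rules)
    t = text.lower()
    has_semantic = re.search(r"\bsemantic", t) is not None   # semantic*
    has_text = re.search(r"\btext", t) is not None           # text*
    has_topic = re.search(
        r"\b(mining|representation|clustering|classification|association rules)\b",
        t,
    ) is not None
    return has_semantic and has_text and has_topic

print(matches_search("Semantic text clustering with ontologies"))  # True
print(matches_search("Image pattern recognition"))                 # False
```

The unanchored `\bsemantic` and `\btext` patterns reproduce the trailing-wildcard behavior (matching "semantics", "textual", and so on).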
Like any literature review, this study has some biases. The advantage of a systematic literature review is that the protocol clearly specifies its biases, since the review process is well-defined. There are biases related to (i) study identification, i.e., only papers matching the search expression and returned by the searched digital libraries were selected; (ii) selection criteria, i.e., papers that matched the exclusion criteria were rejected; and (iii) information extraction, i.e., the information was mainly extracted considering only titles and abstracts. It is not feasible to conduct a literature review free of bias. However, it is possible to conduct it in a controlled and well-defined way through a systematic process.
Systematic mapping conduction
This paper reports the results obtained after the execution of two cycles of the systematic mapping phases. The first cycle was executed based on searches performed in January 2014. The second cycle was an update of the first, with searches performed in February 2016. A total of 3984 papers were found using the search expression in the five digital libraries. In the selection phase, 725 duplicated studies were identified and 1566 papers were rejected according to the exclusion criteria, mainly based on their titles and abstracts. Most of the rejected papers matched the last exclusion criterion (studies that do not deal with text mining and text semantics). Among them, we can find studies that deal with multimedia data (images, videos, or audio) and with the construction, description, or annotation of corpora.
After the selection phase, 1693 studies were accepted for the information extraction phase. In this phase, information about each study was extracted mainly based on the abstracts, although some information was extracted from the full text. The results of the accepted paper mapping are presented in the next section.
Results and discussion
The results of the systematic mapping study are presented in the following subsections. We start our report by presenting, in the “Surveys” section, a discussion of the eighteen secondary studies (surveys and reviews) that were identified in the systematic mapping. Then, each following section, from “Application domains” to “User’s interaction”, is related to a secondary research question that guided our study, i.e., application domains, languages, external knowledge sources, text mining tasks, methods and algorithms, representation model, and user’s interaction. In the “Systematic mapping summary and future trends” section, we present a consolidation of our results and point out some gaps in both primary and secondary studies.
Some studies accepted in this systematic mapping are cited throughout the presentation of our mapping. We do not cite every accepted paper, in order to keep the reporting of the results clear.
In this systematic mapping, we identified 18 survey papers associated with the theme of text mining and semantics [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Each paper explores some particularity of this broad theme. In the following, we present a short overview of these papers, based on their full texts.
Grobelnik presents, briefly but in a very clear form, an interesting discussion of text processing in his three-page paper. The author organizes the field in three main dimensions, which can be used to classify text processing approaches: representation, technique, and task. The task dimension concerns the kinds of problems we solve through text processing. Document search, clustering, classification, summarization, trend detection, and monitoring are examples of tasks. Considering how text representations are manipulated (the technique dimension), we have the methods and algorithms that can be used, including machine learning algorithms, statistical analysis, part-of-speech tagging, semantic annotation, and semantic disambiguation. In the representation dimension, we find different options for text representation, such as words, phrases, bag-of-words, part-of-speech, subject-predicate-object triples, and semantically annotated triples.
Grobelnik also presents the levels of text representation, which differ from each other in processing complexity and expressiveness. The simplest level is the lexical level, which includes the common bag-of-words and n-gram representations. The next level is the syntactic level, which includes representations based on word co-location or part-of-speech tags. The most complete representation level is the semantic level, which includes representations based on word relationships, such as ontologies. Several different research fields deal with text, such as text mining, computational linguistics, machine learning, information retrieval, semantic web, and crowdsourcing. Grobelnik states the importance of integrating these research areas in order to reach a complete solution to the problem of text understanding.
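As an illustration of the lexical level, n-grams can be extracted in a few lines of Python (a generic sketch, not code from the surveyed work):

```python
def ngrams(tokens, n):
    # Lexical-level representation: contiguous token sequences of length n.
    # No syntactic or semantic knowledge is involved.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Note that n = 1 degenerates to the plain bag-of-words vocabulary, which is why both sit at the same (lexical) level of the hierarchy.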
Stavrianou et al. present a survey of semantic issues in text mining, which originate from natural language particularities. It is a good survey focused on a linguistic point of view, rather than only on statistics. The authors discuss a series of questions concerning natural language issues that should be considered when applying the text mining process. Most of the questions are related to text pre-processing, and the authors present the impacts of performing or not performing some pre-processing activities, such as stopword removal, stemming, word sense disambiguation, and tagging. The authors also discuss some existing text representation approaches in terms of features, representation model, and application task. The set of different approaches to measuring similarity between documents is also presented, categorizing the similarity measures by type (statistical or semantic) and by unit (words, phrases, vectors, or hierarchies).
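A statistical similarity measure of the kind categorized by the authors can be sketched as cosine similarity between term-frequency vectors (a generic illustration, not the authors' code):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    # Statistical (lexical) similarity: cosine of the angle between
    # the term-frequency vectors of the two documents.
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(round(cosine_similarity("text mining methods",
                              "semantic text mining"), 3))  # 0.667
```

Because the vectors are built from surface forms only, two documents about the same topic that use disjoint vocabulary score zero, which is exactly the gap semantic measures aim to close.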
Stavrianou et al.  also present the relation between ontologies and text mining. Ontologies can be used as background knowledge in a text mining process, and the text mining techniques can be used to generate and update ontologies. The authors conclude the survey stating that text mining is an open research area and that the objectives of the text mining process must be clarified before starting the data analysis, since the approaches must be chosen according to the requirements of the task being performed.
Methods that deal with latent semantics are reviewed in the study of Daud et al. The authors present a chronological analysis, from 1999 to 2009, of directed probabilistic topic models, such as probabilistic latent semantic analysis, latent Dirichlet allocation, and their extensions. The models are classified according to their main functionality, and the authors describe their advantages, disadvantages, and applications.
Wimalasuriya and Dou, Bharathi and Venkatesan, and Reshadat and Feizi-Derakhshi consider the use of external knowledge sources (e.g., ontologies or thesauri) in the text mining process, each dealing with a specific task. Wimalasuriya and Dou present a detailed literature review of ontology-based information extraction. The authors define this recent information extraction subfield, named ontology-based information extraction (OBIE), identifying key characteristics of OBIE systems that differentiate them from general information extraction systems. Besides, they identify a common architecture of OBIE systems and classify existing systems along different dimensions, such as the information extraction method applied, whether the system constructs and updates the ontology, the components of the ontology, and the type of documents the system deals with. Bharathi and Venkatesan present a brief description of several studies that use external knowledge sources as background knowledge for document clustering. Reshadat and Feizi-Derakhshi present several semantic similarity measures based on external knowledge sources (especially WordNet and MeSH) and a review of comparison results from previous studies.
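To illustrate the idea behind such knowledge-based similarity measures, a WordNet-style path similarity can be computed over a toy is-a hierarchy (the hierarchy below is hypothetical; real systems query WordNet or MeSH):

```python
# Toy is-a hierarchy mapping each concept to its parent (None = root).
PARENT = {
    "dog": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal",
    "mammal": "animal", "animal": None,
}

def path_to_root(concept):
    path = [concept]
    while PARENT.get(concept):
        concept = PARENT[concept]
        path.append(concept)
    return path

def path_similarity(a, b):
    # WordNet-style path similarity: 1 / (shortest path length + 1),
    # where the path goes through the lowest common ancestor.
    pa, pb = path_to_root(a), path_to_root(b)
    for steps_up_a, node in enumerate(pa):
        if node in pb:
            return 1.0 / (steps_up_a + pb.index(node) + 1)
    return 0.0

print(path_similarity("dog", "cat"))  # 0.2: dog->canine->mammal<-feline<-cat
```

Identical concepts score 1.0 and the score decays with taxonomic distance, so "dog" and "cat" are judged similar despite sharing no surface form, something a purely statistical measure cannot do.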
Schiessl and Bräscher and Cimiano et al. review the automatic construction of ontologies. Schiessl and Bräscher, in the only identified review written in Portuguese, formally define the term ontology and discuss the automatic building of ontologies from texts. The authors state that the automatic building of ontologies from texts is the way to achieve timely production of ontologies for current applications and that many questions remain open in this field. Also on the theme of automatically building ontologies from texts, Cimiano et al. argue that automatically learned ontologies might not meet the demands of many possible applications, although they can already benefit several text mining tasks. The authors divide the ontology learning problem into seven tasks and discuss their developments. They state that the ontology population task seems to be easier than the ontology schema learning tasks.
Jovanovic et al. discuss the task of semantic tagging in their paper directed at IT practitioners. Semantic tagging can be seen as an expansion of the named entity recognition task, in which entities are identified, disambiguated, and linked to a real-world entity, normally using an ontology or knowledge base. The authors compare 12 semantic tagging tools and present some characteristics that should be considered when choosing such tools.
Specifically for the task of irony detection, Wallace presents both philosophical formalisms and machine learning approaches. The author argues that a model of the speaker is necessary to improve current machine learning methods and enable their application to general, domain-independent problems. He discusses the gaps in current methods and proposes a pragmatic context model for irony detection.
The application of text mining methods in information extraction from biomedical literature is reviewed by Winnenburg et al. The paper describes state-of-the-art text mining approaches for supporting manual text annotation, such as ontology learning and named entity and concept identification. They also describe and compare biomedical search engines, in the context of information retrieval, literature retrieval, result processing, knowledge retrieval, semantic processing, and integration of external tools. The authors argue that search engines must also be able to find results that are indirectly related to the user’s keywords, considering the semantics and relationships between possible search results. They point out that a good source of synonyms is WordNet.
Leser and Hakenberg present a survey of biomedical named entity recognition. The authors present the difficulties of both identifying entities (such as genes, proteins, and diseases) and evaluating named entity recognition systems. They describe some annotated corpora and named entity recognition tools and state that the lack of corpora is an important bottleneck in the field.
Dagan et al. introduce a special issue of the Journal of Natural Language Engineering on textual entailment recognition, a natural language task that aims to identify whether a piece of text can be inferred from another. The authors present an overview of relevant aspects of textual entailment, discussing four PASCAL Recognising Textual Entailment (RTE) Challenges. They report that the systems submitted to those challenges use cross-pair similarity measures, machine learning, and logical inference. The authors also describe tools, resources, and approaches commonly used in textual entailment tasks and conclude with the perspective that, in the future, the constructed entailment “engines” will be used as basic modules by text-understanding applications.
Irfan et al. present a survey on the application of text mining methods to social network data. They give an overview of pre-processing, classification, and clustering techniques for discovering patterns from social networking sites, pointing out that the application of text mining techniques can reveal patterns related to people’s interaction behaviors. The authors present two basic pre-processing activities, feature extraction and feature selection, and review classification and clustering approaches. They present different machine learning algorithms and discuss the importance of ontology usage to introduce explicit concepts, descriptions, and semantic relationships among concepts. Irfan et al. identify the main challenges related to the manipulation of social network texts (such as large data volumes, data with impurities, dynamic data, emotion interpretation, privacy, and data confidence) and to the text mining infrastructure (such as the usage of cloud computing and improving the usability of text mining methods).
In the context of the semantic web, Sheth et al. define three types of semantics: implicit semantics, formal semantics, and powerful (or soft) semantics. Implicit semantics are those implicitly present in data patterns and not explicitly represented in any machine-processable syntax; machine learning methods exploit this type of semantics. Formal semantics are those represented in some well-formed syntactic form and are machine-processable. Powerful semantics are the sort of semantics that allow uncertainty (that is, the representation of degrees of membership and of certainty) and therefore allow abductive or inductive reasoning. The authors also correlate the types of semantics with some core capabilities required by a practical semantic web application. They conclude their review by asserting the importance of focusing research efforts on representation mechanisms for powerful semantics in order to move toward the development of semantic applications.
The formal semantics defined by Sheth et al.  is commonly represented by description logics, a formalism for knowledge representation. The application of description logics in natural language processing is the theme of the brief review presented by Cheng et al. .
The broad field of computational linguistics is presented by Martinez and Martinez. Considering areas of computational linguistics that can be interesting to statisticians, the authors describe three main aspects of the field: formal language, information retrieval, and machine learning. The authors present common models for knowledge representation, addressing their statistical characteristics and providing an overview of information retrieval and machine learning methods related to computational linguistics. They describe some of the major statistical contributions to the areas of machine learning and computational linguistics, from the point of view of classification and clustering algorithms. Martinez and Martinez emphasize that machine translation, part-of-speech tagging, word sense disambiguation, and text summarization are some of the identified applications to which statisticians can contribute.
Bos  presents an extensive survey of computational semantics, a research area focused on computationally understanding human language in written or spoken form. He discusses how to represent semantics in order to capture the meaning of human language, how to construct these representations from natural language expressions, and how to draw inferences from the semantic representations. The author also discusses the generation of background knowledge, which can support reasoning tasks. Bos  indicates machine learning, knowledge resources, and scaling inference as topics that can have a big impact on computational semantics in the future.
As presented in this section, the reviewed secondary studies explore specific issues of semantics-concerned text mining research. In contrast, this paper reviews a broader range of text mining studies that deal with semantic aspects. To the best of our knowledge, this is the first reported mapping of this field. We present the results of our systematic mapping study in the following sections, organized into seven dimensions of the text mining studies derived from our secondary research questions: application domains, languages, external knowledge usage, tasks, methods and algorithms, representation model, and user’s interaction.
What are the application domains that focus on text semantics?
The second most frequently identified application domain is the mining of web texts, comprising web pages, blogs, reviews, web forums, social media, and email filtering [41, 42, 43, 44, 45, 46]. The high interest in extracting knowledge from web texts can be justified by the large amount and diversity of text available and by the difficulty of manual analysis. Nowadays, anyone can create content on the web, either to share his/her opinion about some product or service or to report something taking place in his/her neighborhood. Companies, organizations, and researchers are aware of this fact, so they are increasingly interested in using this information in their favor. Some competitive advantages that businesses can gain from the analysis of social media texts are presented in [47, 48, 49]. The authors developed case studies demonstrating how text mining can be applied in social media intelligence. From our systematic mapping data, we found that Twitter is the most popular source of web texts and that its posts are commonly used for sentiment analysis or event extraction.
Besides the top two application domains, the other domains that show up in our mapping refer to the mining of specific types of texts. We found studies on mining news, scientific paper corpora, patents, and texts with economic and financial content.
What are the natural languages being considered when working with text semantics?
Whether using machine learning or statistical techniques, text mining approaches are usually language independent. However, especially in the natural language processing field, annotated corpora are often required to train models for a given task in each specific language (the semantic role labeling problem is an example). Besides, linguistic resources such as semantic networks or lexical databases, which are language-specific, can be used to enrich textual data. Most of the available resources are English resources. Thus, the scarcity of annotated data or linguistic resources can be a bottleneck when working with another language. There are important initiatives fostering research on other languages; as an example, we have the ACM Transactions on Asian and Low-Resource Language Information Processing, an ACM journal dedicated to that subject.
Chinese is the second most mentioned language (26.4% of the studies reference the Chinese language). Wu et al. point out two differences between English and Chinese: in Chinese, there are no white spaces between words in a sentence, and Chinese has a higher number of frequent words (more than twice the number of frequent words in English). These characteristics motivate the development of methods and experimental evaluations specifically for Chinese.
This mapping shows that there is a lack of studies considering languages other than English or Chinese. The low number of studies considering other languages suggests that there is a need for construction or expansion of language-specific resources (as discussed in “External knowledge sources” section). These resources can be used for enrichment of texts and for the development of language specific methods, based on natural language processing.
External knowledge sources
Which external sources are frequently used in text mining studies when text semantics is considered?
Text mining initiatives can take advantage of external sources of knowledge. Thesauruses, taxonomies, ontologies, and semantic networks are knowledge sources commonly used by the text mining community. A semantic network is a network whose nodes are concepts linked by semantic relations. The most popular example is WordNet, an electronic lexical database developed at Princeton University. Depending on its usage, WordNet can also be seen as a thesaurus or a dictionary.
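To make the idea concrete, a semantic network can be sketched as a small graph of concepts and labeled relations. The sketch below is only illustrative (the concepts, relation names, and the `SemanticNetwork` class are invented for this example, not taken from WordNet or the surveyed studies); following "is-a" links transitively mimics how hypernym chains are traversed in such resources.

```python
# Minimal illustration of a semantic network: concepts are nodes,
# labeled semantic relations are directed edges (all names are invented).
from collections import defaultdict

class SemanticNetwork:
    def __init__(self):
        self.edges = defaultdict(list)  # concept -> [(relation, concept)]

    def add(self, source, relation, target):
        self.edges[source].append((relation, target))

    def hypernyms(self, concept):
        """Follow 'is-a' links transitively, as hypernym chains are read off WordNet."""
        result, stack = [], [concept]
        while stack:
            node = stack.pop()
            for relation, target in self.edges[node]:
                if relation == "is-a" and target not in result:
                    result.append(target)
                    stack.append(target)
        return result

net = SemanticNetwork()
net.add("dog", "is-a", "canine")
net.add("canine", "is-a", "mammal")
net.add("mammal", "is-a", "animal")
net.add("dog", "part-of", "pack")

print(net.hypernyms("dog"))  # ['canine', 'mammal', 'animal']
```

Only the "is-a" edges contribute to the hypernym chain; other relation types (here, "part-of") are stored but ignored by this particular traversal.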
There is no complete definition of the terms thesaurus, taxonomy, and ontology that is unanimously accepted across all research areas. Weller presents an interesting discussion of the term ontology, including its origin and proposed definitions. She concludes the discussion by stating that: “Ontologies should unambiguously represent shared background knowledge that helps people within a community of interest to understand each other. And they should make computer-readable indexing of information possible on the Web”. The same can be said about thesauruses and taxonomies. In general, thesauruses, taxonomies, and ontologies are normally specialized in a specific domain, and they usually differ from each other in their degree of expressiveness and in the complexity of their relational constructions. Ontologies would be the most expressive type of knowledge representation, having the most complex relations and a formalized construction.
The second most used source is Wikipedia, which covers a wide range of subjects and has the advantage of presenting the same concept in different languages. Wikipedia concepts, as well as their links and categories, are also useful for enriching text representation [74, 75, 76, 77] or classifying documents [78, 79, 80]. Medelyan et al. present the value of Wikipedia and discuss how the community of researchers is making use of it in natural language processing tasks (especially word sense disambiguation), information retrieval, information extraction, and ontology building.
The use of Wikipedia is followed by the use of HowNet, a Chinese-English knowledge database. Finding HowNet among the most used external knowledge sources is not surprising, since Chinese is one of the most cited languages in the studies selected in this mapping (see the “Languages” section). Like WordNet, HowNet is usually used for feature expansion [83, 84, 85] and for computing semantic similarity [86, 87, 88].
Web pages are also used as external sources [89, 90, 91]. Normally, web search results are used to measure similarity between terms. We also found some studies that use SentiWordNet, a lexical resource for sentiment analysis and opinion mining [93, 94]. Among other external sources, we can find knowledge sources related to medicine, like the UMLS Metathesaurus [95, 96, 97, 98], the MeSH thesaurus [99, 100, 101, 102], and the Gene Ontology [103, 104, 105].
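One well-known way to turn web search results into a term similarity is the normalized Google distance of Cilibrasi and Vitányi, which compares the hit counts of two terms queried separately and together against the total number of indexed pages. The counts below are illustrative, not live search results:

```python
# Normalized Google distance (Cilibrasi & Vitanyi): smaller values mean
# the two terms co-occur on the web more often than chance would suggest.
from math import log

def ngd(fx, fy, fxy, n):
    """fx, fy: hit counts for each term alone; fxy: joint hit count;
    n: total number of pages indexed by the search engine."""
    return (max(log(fx), log(fy)) - log(fxy)) / (log(n) - min(log(fx), log(fy)))

# Illustrative counts for two related terms that co-occur frequently.
d = ngd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_058_044_651)
print(round(d, 3))  # 0.443
```

Identical terms give a distance of 0, and terms that never co-occur give a large distance, so 1 − NGD is sometimes used directly as a similarity score.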
Text mining tasks
In what text mining tasks is the text semantics most considered?
Besides classification and clustering, we can note that semantic concerns are present in tasks such as information extraction [106, 107, 108], information retrieval [109, 110, 111], sentiment analysis [112, 113, 114, 115], and automatic ontology building [116, 117], as well as in the pre-processing step itself [118, 119].
Methods and algorithms
What methods and algorithms are commonly used?
Beyond latent semantics, the use of concepts or topics found in the documents is also a common approach. Concept-based semantic exploitation is normally based on external knowledge sources (as discussed in the “External knowledge sources” section) [74, 124, 125, 126, 127, 128]. As an example, explicit semantic analysis relies on Wikipedia to represent documents by concept vectors. In a similar way, Spanakis et al. improved hierarchical clustering quality by using a text representation based on concepts and other Wikipedia features, such as links and categories.
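The core idea of explicit semantic analysis can be shown on a toy scale: a text is mapped to a vector of weights over named concepts, where each concept is backed by an article from a knowledge source such as Wikipedia. The three one-line "articles" and the simple overlap weighting below are invented stand-ins (the real method uses tf-idf weights over full Wikipedia articles):

```python
# Toy sketch of concept-vector representation in the style of explicit
# semantic analysis; concepts and article texts are invented for illustration.
concepts = {
    "Jaguar (animal)": "jaguar big cat feline predator jungle",
    "Jaguar Cars":     "jaguar british car manufacturer vehicle",
    "Apple Inc.":      "apple technology company iphone computer",
}

def concept_vector(text):
    words = text.lower().split()
    # Weight of each concept = how many of the text's words its article mentions.
    return {name: sum(article.split().count(w) for w in words)
            for name, article in concepts.items()}

vec = concept_vector("the jaguar is a feline predator")
print(vec)  # highest weight on "Jaguar (animal)"
```

Even this crude weighting separates the two senses of "jaguar", which is the point of representing documents in concept space rather than word space.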
The issue of text ambiguity has also been the focus of studies. Word sense disambiguation can contribute to a better document representation. It is normally based on external knowledge sources and can also be based on machine learning methods [36, 130, 131, 132, 133].
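A classic example of knowledge-based word sense disambiguation is the Lesk algorithm: choose the sense whose dictionary gloss shares the most words with the ambiguous word's context. The glosses below are simplified paraphrases written for this sketch, not taken from any real dictionary:

```python
# A minimal version of the Lesk word sense disambiguation algorithm:
# pick the sense whose gloss overlaps most with the context words.
def lesk(context, glosses):
    context_words = set(context.lower().split())
    def overlap(gloss):
        return len(context_words & set(gloss.lower().split()))
    return max(glosses, key=lambda sense: overlap(glosses[sense]))

glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river":   "the sloping land beside a body of water",
}
print(lesk("he sat on the sloping land by the water", glosses))  # bank/river
```

Real implementations enrich the glosses with related synsets from a resource such as WordNet and remove stopwords before counting overlaps, but the selection principle is the same.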
Other approaches include analysis of verbs in order to identify relations on textual data [134, 135, 136, 137, 138]. However, the proposed solutions are normally developed for a specific domain or are language dependent.
In Fig. 9, we can observe the predominance of traditional machine learning algorithms, such as support vector machines (SVM), naive Bayes, k-means, and k-nearest neighbors (KNN), in addition to artificial neural networks and genetic algorithms. The application of natural language processing (NLP) methods is also frequent. Among these methods, we can find named entity recognition (NER) and semantic role labeling. This shows that there is a concern with developing richer text representations to be used as input to traditional machine learning algorithms, as we can see in the studies of [55, 139, 140, 141, 142].
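This pattern of feeding document vectors to a traditional algorithm can be illustrated with a deliberately tiny nearest-neighbor classifier (all documents, labels, and function names below are invented for the sketch; real studies would use far richer, semantically enriched features):

```python
# Sketch of the common pipeline: documents -> feature vectors -> a classic
# algorithm, here a 1-nearest-neighbour classifier with cosine similarity.
from math import sqrt

def vectorize(doc, vocab):
    words = doc.split()
    return [words.count(t) for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

train = [("win the game tonight", "sports"),
         ("the election vote count", "politics")]
vocab = sorted({w for d, _ in train for w in d.split()})

def knn_1(query):
    qv = vectorize(query, vocab)
    # Return the label of the single most similar training document.
    return max(train, key=lambda pair: cosine(qv, vectorize(pair[0], vocab)))[1]

print(knn_1("they count the vote"))  # politics
```

Replacing the word-count features with concept- or knowledge-enriched features, while keeping the classifier unchanged, is exactly the enhancement strategy described above.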
Text representation models
How can texts be represented?
The most popular text representation model is the vector space model. In this model, each document is represented by a vector whose dimensions correspond to features found in the corpus. When features are single words, the text representation is called bag-of-words. Despite the good results achieved with a bag-of-words, this representation, based on independent words, cannot express word relationships, text syntax, or semantics. Therefore, it is not a proper representation for all possible text mining applications.
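The construction described above can be sketched in a few lines: each document becomes a term-frequency vector over the corpus vocabulary, and word order is discarded (the two example sentences are invented):

```python
# Minimal bag-of-words construction: each document becomes a term-frequency
# vector over the vocabulary; word order and syntax are thrown away, which
# is exactly the limitation of this representation discussed in the text.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})

def bow(doc):
    words = doc.split()
    return [words.count(term) for term in vocab]

print(vocab)            # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print([bow(d) for d in docs])
```

Note that "the cat sat on the mat" and any permutation of its words produce the same vector, which is why the bag-of-words model cannot express syntax or word relationships.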
The use of richer text representations is the focus of several studies [62, 79, 97, 143, 144, 145, 146, 147, 148]. Most of the studies concentrate on proposing more elaborate features to represent documents in the vector space model, including the use of topic model techniques, such as LSI and LDA, to obtain latent semantic features. Deep learning is currently applied to represent independent terms through their associated concepts, in an attempt to capture the relationships between the terms [150, 151]. The use of distributed word representations (word embeddings) can be seen in several works in this area, in tasks such as classification [88, 152, 153], summarization, and information retrieval.
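The latent-semantic route can be sketched with LSI: a truncated singular value decomposition of the term-document matrix projects documents into a low-dimensional latent space in which documents sharing related terms end up close together. The tiny matrix below is invented (rows are terms, columns are documents), chosen so that two documents are "automotive" and two are "biological":

```python
# Sketch of latent semantic indexing via truncated SVD of an invented
# term-document matrix; rows = terms, columns = documents.
import numpy as np

A = np.array([[2., 1., 0., 0.],   # "car"
              [1., 2., 0., 0.],   # "engine"
              [0., 0., 1., 1.],   # "gene"
              [0., 0., 1., 1.]])  # "protein"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                       # keep the 2 strongest latent dimensions
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T   # one 2-d vector per document

# Documents 0 and 1 (automotive) land on one latent axis,
# documents 2 and 3 (biological) on the other.
print(np.round(docs_latent, 2))
```

In the latent space, the cosine similarity between the two automotive documents is essentially 1 even though they do not share identical term vectors, which is the effect LSI is used for.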
Besides the vector space model, there are text representations based on networks (or graphs), which can make use of some text semantic features. Network-based representations, such as bipartite networks and co-occurrence networks, can represent relationships between terms or between documents, which is not possible through the vector space model [147, 156, 157, 158].
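A co-occurrence network of the kind mentioned above can be built by sliding a window over the token stream and counting, for each word pair, how often the two words fall in the same window. The sentence and window size below are arbitrary choices for illustration:

```python
# Sketch of a word co-occurrence network: words are nodes and each edge
# weight counts how often two words appear within `window` positions.
from collections import Counter

def cooccurrence_edges(tokens, window=2):
    edges = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                # Sort the pair so the edge is undirected.
                edges[tuple(sorted((tokens[i], tokens[j])))] += 1
    return edges

tokens = "text mining extracts patterns from text documents".split()
edges = cooccurrence_edges(tokens)
print(edges[("mining", "text")])  # 1
```

Unlike a document vector, this edge list records which terms relate to which, so graph measures (degree, shortest paths, communities) become available for mining.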
In addition to the text representation model, text semantics can also be incorporated into the text mining process through the use of external knowledge sources, like semantic networks and ontologies, as discussed in the “External knowledge sources” section.
Do users or domain experts take part in the text mining process?
Text mining is a process to automatically discover knowledge from unstructured data. Nevertheless, it is also an interactive process, and there are some points where a user, normally a domain expert, can contribute to the process by providing his/her previous knowledge and interests. As an example, in the pre-processing step, the user can provide additional information to define a stoplist and support feature selection. In the pattern extraction step, user’s participation can be required when applying a semi-supervised approach. In the post-processing step, the user can evaluate the results according to the expected knowledge usage.
Despite the fact that the user would have an important role in a real application of text mining methods, there has not been much investment in user interaction in text mining research studies. A probable reason is the difficulty inherent in an evaluation based on the user’s needs. In empirical research, researchers usually execute several experiments in order to evaluate proposed methods and algorithms, which would require the involvement of several users, making such evaluations infeasible in practice.
Systematic mapping summary and future trends
How is semantics considered in text mining studies?
Semantics is an important component of natural language texts. Consequently, in order to improve text mining results, many text mining studies claim that their solutions treat or consider text semantics in some way. However, text mining is a wide research field, and there is a lack of secondary studies that summarize and integrate the different approaches. How is semantics considered in text mining studies? Looking for the answer to this question, we conducted this systematic mapping based on 1693 studies, accepted from among the 3984 studies identified in five digital libraries. In the previous subsections, we presented the mapping with regard to each secondary research question. In this subsection, we present a consolidation of our results and point out some future trends of semantics-concerned text mining.
As previously stated, the objective of this systematic mapping is to provide a general overview of semantics-concerned text mining studies. The papers considered in this systematic mapping study, as well as the mapping results, are limited by the applied search expression and the research questions. It is not feasible to cover all published papers in this broad field. Therefore, the reader may miss some previously known studies in this systematic mapping report. It is not our objective to present a detailed survey of every specific topic, method, or text mining task. This systematic mapping is a starting point, and surveys with a narrower focus should be conducted for reviewing the literature of specific subjects, according to one’s interests.
The quantitative analysis of the scientific production along each text mining dimension (presented from the “Application domains” section to the “User’s interaction” section) confirmed some prior expectations we had about our study subject and highlighted other interesting characteristics of the field. Text semantics is closely related to ontologies and other similar types of knowledge representation. We also know that health care and the life sciences are traditionally concerned with the standardization of their concepts and concept relationships. Thus, as we expected, health care and life sciences was the most cited application domain among the accepted studies. This application domain is followed by the web domain, which can be explained by the constant growth, in both quantity and coverage, of web content.
It was surprising to find the high presence of the Chinese language among the studies. Chinese is the second most cited language, and HowNet, a Chinese-English knowledge database, is the third most applied external source in semantics-concerned text mining studies. Looking at the languages addressed in the studies, we found that there is a lack of studies specific to languages other than English or Chinese. We also found extensive use of WordNet as an external knowledge source, followed by Wikipedia, HowNet, web pages, SentiWordNet, and other knowledge sources related to medicine.
Text classification and text clustering, as basic text mining tasks, are frequently applied in semantics-concerned text mining research. Among the more specific tasks, sentiment analysis, a recent research field, is almost as frequently applied as information retrieval and information extraction, which are more consolidated research areas. SentiWordNet, a lexical resource for sentiment analysis and opinion mining, is already among the most used external knowledge sources.
The treatment of latent semantics through the application of LSI stands out when looking at methods and algorithms. Besides that, traditional text mining methods and algorithms, like SVM, KNN, and k-means, are frequently applied, and researchers tend to enhance the text representation by applying NLP methods or using external knowledge sources. Thus, text semantics can be incorporated into the text mining process mainly through two approaches: the construction of richer terms in the vector space representation model, or the use of networks or graphs to represent semantic relations between terms or documents.
In real applications of the text mining process, the participation of domain experts can be crucial to its success. However, the participation of users (domain experts) is seldom explored in scientific papers. The difficulty inherent in evaluating a method based on user interaction is a probable reason for the lack of studies considering this approach.
The mapping indicates that there is room for secondary studies in areas that have a high number of primary studies, such as feature enrichment for a better text representation in the vector space model, the use of classification methods, the use of clustering methods, and the use of latent semantics in text mining. A detailed literature review, such as the review of Wimalasuriya and Dou (described in the “Surveys” section), would be worthwhile for the organization and summarization of these specific research subjects.
Considering the development of primary studies, we identified three main future trends: user interaction, non-English text processing, and graph-based representation. We expect an increase in the number of studies that include some level of user interaction, bringing the user’s needs and interests into the process. This is particularly valuable for the clustering task, because what is considered a good clustering solution can vary from user to user. We also expect a rise in the number of resources (linguistic resources and annotated corpora) for non-English languages. These resources are very important to the development of semantics-concerned text mining techniques. Higher availability of non-English resources will allow a higher number of studies dealing with these languages. Another future trend is the development and use of graph-based text representations. There is already important research in this direction, and we expect it to increase, as graph-based representations are more expressive than traditional vector space model representations.
Text semantics is frequently addressed in text mining studies, since it has an important influence on text meaning. However, there is a lack of secondary studies that consolidate this research. This paper reported a systematic mapping study conducted to provide an overview of the semantics-concerned text mining literature. The scope of this mapping is wide (3984 papers matched the search expression). Thus, due to limitations of time and resources, the mapping was mainly performed based on the abstracts of papers. Nevertheless, we believe that these limitations do not have a crucial impact on the results, since our study has broad coverage.
The main contributions of this work are: (i) it presents a quantitative analysis of the research field; (ii) its conduction followed a well-defined literature review protocol; (iii) it discusses the area along seven important text mining dimensions: application domain, language, external knowledge source, text mining task, method and algorithm, representation model, and user’s interaction; and (iv) the produced mapping gives a general summary of the subject and can be of great help for researchers working with semantics and text mining. Thus, this work fills a gap in the literature since, to the best of our knowledge, this is the first general literature review of this wide subject.
Although much research has been developed in the text mining field, the processing of text semantics remains an open research problem. The field lacks secondary studies in areas that have a high number of primary studies, such as feature enrichment for a better text representation in the vector space model. Another highlight concerns a language-related issue. We found considerable differences in the number of studies among languages, since 71.4% of the identified studies deal with English or Chinese. Thus, there is a lack of studies dealing with texts written in other languages. When considering semantics-concerned text mining, we believe that this gap can be filled with the development of good knowledge bases and natural language processing methods specific to these languages. Besides, the analysis of the impact of languages on semantics-concerned text mining is also an interesting open research question. A comparison among the semantic aspects of different languages and their impact on the results of text mining techniques would also be interesting.
1 A simple search for “systematic review” on the Scopus database in June 2016 returned, by subject area, 130,546 Health Sciences documents (125,254 of them for Medicine) and only 5,539 Physical Sciences documents (1,328 of them for Computer Science). The coverage of Scopus publications is balanced between Health Sciences (32% of total Scopus publications) and Physical Sciences (29% of total Scopus publications).
2 It was not possible to perform the second cycle of searches in the ACM Digital Library because of a change in the interface of this search engine. However, it must be noted that only eight studies found exclusively in this database were accepted in the first cycle. All other studies were also retrieved by other search engines (especially Scopus, which retrieved more than 89% of the accepted studies).
3 Word cloud created with the support of Wordle.
The authors would like to thank the financial support of grant #132666/2016-2, National Council for Scientific and Technological Development (CNPq); grants #2013/14757-6, #2014/08996-0, and #2016/07620-2, São Paulo Research Foundation (FAPESP); and Coordination for the Improvement of Higher Education Personnel (CAPES).
RAS and SOR planned this systematic mapping study. RAS conducted its first cycle (searches performed in January 2014). JA and RAS conducted its second cycle (searches performed in February 2016). RAS and SOR analyzed the results and drafted the manuscript after the first cycle and updated it after the second cycle. JA was involved in updating the manuscript with the second cycle results. All authors revised and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1. Miner G, Elder J, Hill T, Nisbet R, Delen D, Fast A (2012) Practical text mining and statistical analysis for non-structured text data applications, 1st edn. Academic Press, Boston.
- 2. Aggarwal CC, Zhai C (eds) (2012) Mining text data. Springer, Durham.
- 3. Kitchenham B, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering. EBSE Technical Report EBSE-2007-01. Keele University and Durham University Joint Report, Durham, UK.
- 4. Petersen K, Feldt R, Mujtaba S, Mattsson M (2008) Systematic mapping studies in software engineering. In: EASE’08: Proceedings of the 12th International Conference on Evaluation and Assessment in Software Engineering, 68–77. British Computer Society, Swinton, UK.
- 7. Felizardo KR, Nakagawa EY, MacDonell SG, Maldonado JC (2014) A visual analysis approach to update systematic reviews. In: EASE’14: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, 4:1–4:10. ACM, New York.
- 12. Khan K, Baharudin BB, Khan A, et al (2009) Mining opinion from text documents: a survey. In: DEST’09: Proceedings of the 3rd IEEE International Conference on Digital Ecosystems and Technologies, 217–222. IEEE.
- 13. Laboratory of Research on Software Engineering (LaPES) - StArt Tool. http://lapes.dc.ufscar.br/tools/start_tool. Accessed 8 June 2016.
- 14. Grobelnik M (2011) Many faces of text processing. In: WIMS’11: Proceedings of the International Conference on Web Intelligence, Mining and Semantics, 5. ACM.
- 18. Bharathi G, Venkatesan D (2012) Study of ontology or thesaurus based document clustering and information retrieval. J Eng Appl Sci 7(4): 342–347.
- 19. Reshadat V, Feizi-Derakhshi MR (2012) Studying of semantic similarity methods in ontology. Res J Appl Sci Eng Technol 4(12): 1815–1821.
- 20. Schiessl M, Bräscher M (2012) Do texto às ontologias: uma perspectiva para a ciência da informação [From text to ontologies: an information science perspective]. Ciência da Informação 40(2): 301–311.
- 21. Cimiano P, Völker J, Studer R (2006) Ontologies on demand? – a description of the state-of-the-art, applications, challenges and trends for ontology learning from text. Inf Wiss Prax 57(6-7): 315–320.
- 32. W3C - Semantic Web Health Care and Life Sciences Interest Group. https://www.w3.org/blog/hcls/. Accessed 8 June 2016.
- 33. National Center for Biotechnology Information - PubMed. http://www.ncbi.nlm.nih.gov/pubmed/. Accessed 8 June 2016.
- 39. Abacha AB, Zweigenbaum P (2011) A hybrid approach for the extraction of semantic relations from MEDLINE abstracts. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 6609 LNCS(PART 2): 139–150.
- 44. Torunoglu D, Telseren G, Sagturk O, Ganiz MC (2013) Wikipedia based semantic smoothing for Twitter sentiment classification. In: INISTA 2013: Proceedings of the IEEE International Symposium on Innovations in Intelligent Systems and Applications, 1–5. IEEE, Albena.
- 48. He W, Tian X, Chen Y, Chong D (2016) Actionable social media competitive analytics for understanding customer experiences. J Comput Inf Syst 56(2): 145–155.
- 49. Tian X, He W, Tao R, Akula V (2016) Mining online hotel reviews: a case study from hotels in China. In: AMCIS 2016: Proceedings of the 22nd Americas Conference on Information Systems, 1–8.
- 50. ACM - Asian and Low-Resource Language Information Processing (TALLIP). http://tallip.acm.org/. Accessed 8 June 2016.
- 52. Chen J, Liu J, Yu W, Wu P (2009) Combining lexical stability and improved lexical chain for unsupervised word sense disambiguation. In: KAM’09: Proceedings of the Second International Symposium on Knowledge Acquisition and Modeling, 430–433. IEEE, Wuhan. http://ieeexplore.ieee.org/document/5362135/.
- 53. Rusu D, Fortuna B, Grobelnik M, Mladenic D (2009) Semantic graphs derived from triplets with application in document summarization. Informatica (Slovenia) 33(3): 357–362.
- 55. Mansuy T, Hilderman RJ (2006) A characterization of WordNet features in Boolean models for text classification. In: AusDM 2006: Proceedings of the Fifth Australasian Conference on Data Mining and Analytics, 103–109. Australian Computer Society, Inc, Darlinghurst.
- 56. Ciaramita M, Gangemi A, Ratsch E, Šaric J, Rojas I (2005) Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In: IJCAI’05: Proceedings of the 19th International Joint Conference on Artificial Intelligence, 659–664. Morgan Kaufmann Publishers Inc., San Francisco, CA.
- 58. Gujraniya D, Murty MN (2012) Efficient classification using phrases generated by topic models. In: ICPR 2012: Proceedings of the 21st International Conference on Pattern Recognition, 2331–2334. IEEE, Tsukuba.
- 60. Wu Q, Zhang C, Deng X, Jiang C (2011) LDA-based model for topic evolution mining on text. In: ICCSE 2011: Proceedings of the 6th International Conference on Computer Science & Education, 946–949. IEEE, Singapore.
- 62. Wu J, Dang Y, Pan D, Xuan Z, Liu Q (2010) Textual knowledge representation through the semantic-based graph structure in clustering applications. In: HICSS 2010: Proceedings of the 43rd Hawaii International Conference on System Sciences, 1–8. IEEE, Washington.
- 63. Princeton University - WordNet. http://wordnet.princeton.edu/. Accessed 8 June 2016.
- 65. Weller K (2010) Knowledge representation in the social semantic web. Walter de Gruyter.
- 66. Weller K, et al (2007) Folksonomies and ontologies: two new players in indexing and knowledge representation. In: Proceedings of the Online Information Conference, 108–115.
- 68. Li J, Zhao Y, Liu B (2009) Fully automatic text categorization by exploiting WordNet. In: Information Retrieval Technology, 1–12. Springer, Berlin.
- 69. Mansuy TN, Hilderman RJ (2006) Evaluating WordNet features in text classification models. In: FLAIRS 2006: Proceedings of the Nineteenth International Florida Artificial Intelligence Research Society Conference, 568–573. AAAI Press, Florida.
- 70. Shin Y, Ahn Y, Kim H, Lee SG (2015) Exploiting synonymy to measure semantic similarity of sentences. In: IMCOM’15: Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, 40:1–40:4. ACM, New York.
- 73. Wikipedia. https://www.wikipedia.org/. Accessed 8 June 2016.
- 74. Kim HJ, Hong KJ, Chang JY (2015) Semantically enriching text representation model for document clustering. In: Proceedings of the ACM Symposium on Applied Computing, 922–925. ACM, New York.
- 78. Mizzaro S, Pavan M, Scagnetto I, Valenti M (2014) Short text categorization exploiting contextual enrichment and external knowledge. In: Proceedings of the First International Workshop on Social Media Retrieval and Analysis, 57–62. ACM, New York.
- 80. Chang MW, Ratinov LA, Roth D, Srikumar V (2008) Importance of semantic representation: dataless classification. In: AAAI-08: Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 830–835.
- 82. HowNet Knowledge Database. http://www.keenage.com/. Accessed 8 June 2016.
- 84. Liu Z, Yu W, Chen W, Wang S, Wu F (2010) Short text feature selection for micro-blog mining. In: CiSE 2010: Proceedings of the International Conference on Computational Intelligence and Software Engineering, 1–4. IEEE, Wuhan.
- 85. Hu P, He T, Ji D, Wang M (2004) A study of Chinese text summarization using adaptive clustering of paragraphs. In: CIT’04: Proceedings of the Fourth International Conference on Computer and Information Technology, 1159–1164. IEEE, Wuhan.
- 88. Wang R (2010) Cognitive-based emotion classifier of Chinese vocabulary design. In: ISISE 2010: Proceedings of the International Symposium on Information Science and Engineering, 582–585. IEEE.
- 91. Zelikovitz S, Kogan M (2006) Using Web searches on important words to create background sets for LSI classification. In: FLAIRS 2006: Proceedings of the Nineteenth International Florida Artificial Intelligence Research Society Conference, 298–603.
- 92. SentiWordNet. http://sentiwordnet.isti.cnr.it/. Accessed 8 June 2016.
- 94.Kumar V, Minz S (2013) Mood classifiaction of lyrics using SentiWordNet In: ICCCI 2013: Proceedings of the International Conference on Computer Communication and Informatics, 1–5. IEEE, Coimbatore,Google Scholar
- 95.Unified Medical Language System (UMLS) Metathesaurus. https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/. Accessed 8 June 2016.
- 99.Medical Subject Headings (MeSH). https://www.nlm.nih.gov/mesh/. Accessed 8 June 2016.
- 100.Logeswari S, Premalatha K (2013) Biomedical document clustering using ontology based concept weight In: ICCCI 2013: Proceedings of the International Conference on Computer Communication and Informatics, 1–4. IEEE, Coimbatore,Google Scholar
- 103.Kanavos A, Makris C, Theodoridis E (2012) On topic categorization of PubMed query results In: Artificial Intelligence Applications and Innovations, 556–565. Springer.Google Scholar
- 106.Mannai M, Ben Abdessalem Karaa W (2013) Bayesian information extraction network for Medline abstract. In: 2013 World Congress on Computer and Information Technology (WCCIT), 1–3. IEEE, Sousse,Google Scholar
- 107.Jiana B, Tingyu L, Tianfang Y (2012) Event information extraction approach based on complex Chinese texts In: IALP 2012: Proceedings of the International Conference on Asian Language Processing, 61–64.Google Scholar
- 108.Hengliang W, Weiwei Z (2012) A web information extraction method based on ontology. Adv Inf Sci Serv Sci4(8): 199–206.Google Scholar
- 110.Bharathi G, Venkatesan D (2012) Improving information retrieval using document clusters and semantic synonym extraction. J Theor Appl Inf Technol36(2): 167–173.Google Scholar
- 114.Veselovská K (2012) Sentence-level sentiment analysis in Czech In: WIMS’12:Proceedings of the 2Nd International Conference on Web Intelligence, Mining and Semantics, 65:1–65:4. ACM, New York,Google Scholar
- 116.Domínguez García R, Schmidt S, Rensing C, Steinmetz R (2012) Automatic taxonomy extraction in different languages using wikipedia and minimal language-specific information. Lect Notes Comp Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma)7181 LNCS(PART 1): 42–53.Google Scholar
- 118.Stenetorp P, Soyer H, Pyysalo S, Ananiadou S, Chikayama T (2012) Size (and domain) matters: evaluating semantic word space representations for biomedical text In: SMBM 2012: Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine,42–49.Google Scholar
- 122.Zrigui M, Ayadi R, Mars M, Maraoui M (2012) Arabic text classification framework based on latent dirichlet allocation. J Comput Inf Technol20(2): 125–140.Google Scholar
- 124.Xiang W, Yan J, Ruhua C, Hua F (2013) Improving text categorization with semantic knowledge in Wikipedia. IEICE Trans Inf Syst96(12): 2786–2794.Google Scholar
- 126. Andreasen T, Bulskov H, Jensen PA, Lassen T (2011) Extracting conceptual feature structures from text In: ISMIS 2011: Proceedings of the 19th International Symposium on Methodologies for Intelligent Systems, 396–406. Springer, Berlin.
- 127. Goossen F, IJntema W, Frasincar F, Hogenboom F, Kaymak U (2011) News personalization using the CF-IDF semantic recommender In: WIMS’11: Proceedings of the International Conference on Web Intelligence, Mining and Semantics, 10. ACM, New York.
- 129. Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis In: IJCAI-07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, 1606–1611. Morgan Kaufmann Publishers Inc, San Francisco. http://dl.acm.org/citation.cfm?id=1625535.
- 130. Navigli R, Faralli S, Soroa A, de Lacalle O, Agirre E (2011) Two birds with one stone: learning semantic models for text categorization and word sense disambiguation In: CIKM’11: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 2317–2320. ACM, Glasgow.
- 131. Mostafa MS, Haggag MH, Gomaa WH (2008) Document clustering using word sense disambiguation In: SEDE 2008: Proceedings of the 17th International Conference on Software Engineering and Data Engineering, 19–24.
- 135. Wang W, Zhao D, Zou L, Wang D, Zheng W (2010) Extracting 5W1H event semantic elements from Chinese online news In: WAIM 2010: Proceedings of the Workshops of the 11th International Conference on Web-Age Information Management, 644–655. Springer, Berlin.
- 137. Van Der Horn P, Bakker B, Geleijnse G, Korst J, Kurkin S (2008) Classifying verbs in biomedical text using subject-verb-object relationships In: SMBM 2008: Proceedings of the 3rd International Symposium on Semantic Mining in Biomedicine, 137–140.
- 138. Kontos J, Malagardi I, Alexandris C, Bouligaraki M (2000) Greek verb semantic processing for stock market text mining In: NLP’00: Proceedings of the Second International Conference on Natural Language Processing, 395–405. Springer-Verlag, London.
- 139. Stankov I, Todorov D, Setchi R (2013) Enhanced cross-domain document clustering with a semantically enhanced text stemmer (SETS). Int J Knowl-Based Intell Eng Syst 17(2): 113–126.
- 140. Huang CH, Yin J, Hou F (2011) A text similarity measurement combining word semantic information with TF-IDF method. Jisuanji Xuebao (Chin J Comput) 34(5): 856–864.
- 144. Fathy I, Fadl D, Aref M (2012) Rich semantic representation based approach for text generation In: INFOS 2012: Proceedings of the 8th International Conference on Informatics and Systems, NLP–20. IEEE, Cairo.
- 145. Wu J, Xuan Z, Pan D (2011) Enhancing text representation for classification tasks with semantic graph structures. Int J Innov Comput Inf Control (ICIC) 7(5): 2689–2698.
- 146. Alencar ROD, Davis Jr CA, Gonçalves MA (2010) Geographical classification of documents using evidence from Wikipedia In: GIR ’10: Proceedings of the 6th Workshop on Geographic Information Retrieval, 12. ACM, New York.
- 147. Smirnov I, Tikhomirov I (2009) Heterogeneous semantic networks for text representation in intelligent search engine EXACTUS In: SENSE’09: Proceedings of the Workshop on Conceptual Structures for Extracting Natural Language Semantics, 1–9.
- 148. Chau R, Tsoi AC, Hagenbuchner M, Lee V (2009) A conceptlink graph for text structure mining In: ACSC’09: Proceedings of the Thirty-Second Australasian Conference on Computer Science - Volume 91, 141–150. Australian Computer Society, Inc., Darlinghurst.
- 150. Lebret R, Collobert R (2015) Rehabilitation of count-based models for word vector representations. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 9041: 417–429.
- 151. Li R, Shindo H (2015) Distributed document representation for document classification. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 9077: 212–225.
- 152. Sohrab MG, Miwa M, Sasaki Y (2015) Centroid-means-embedding: an approach to infusing word embeddings into features for text classification. Lect Notes Comput Sci (Incl Subseries Lect Notes Artif Intell Lect Notes Bioinforma) 9077: 289–300.
- 154. Zhang C, Zhang L, Wang CJ, Xie JY (2014) Text summarization based on sentence selection with semantic representation In: Proceedings of the International Conference on Tools with Artificial Intelligence, Vol. 2014-December, 584–590. IEEE, Limassol.
- 156. Kamal A, Abulaish M, Anwar T (2012) Mining feature-opinion pairs and their reliability scores from web opinion sources In: WIMS’12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, 15. ACM, New York.
- 163. Volkova S, Caragea D, Hsu WH, Drouhard J, Fowles L (2010) Boosting biomedical entity extraction by using syntactic patterns for semantic relation discovery In: WI-IAT 2010: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 272–278. IEEE, Toronto.
- 164. Waltinger U, Mehler A (2009) Social semantics and its evaluation by means of semantic relatedness and open topic models In: WI-IAT’09: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, 42–49. IEEE Computer Society, Milan.
- 165. Kass A, Cowell-Shah C (2006) Using lightweight NLP and semantic modeling to realize the internet’s potential as a corporate radar In: AAAI Fall Symposium. AAAI Press.
- 169. Ahmed ST, Nair R, Patel C, Davulcu H (2009) BioEve: bio-molecular event extraction from text using semantic classification and dependency parsing In: BioNLP’09: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, 99–102. Association for Computational Linguistics.
- 171. Wordle. http://www.wordle.net/. Accessed 15 June 2016.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.