The ultimate purpose of text analysis pipelines is to infer new information from unknown input texts. To this end, the algorithms employed in pipelines are usually developed on known training texts from the anticipated domains of application (cf. Sect. 2.1). In many applications, however, the unknown texts significantly differ from the known texts, because a consideration of all possible domains within the development is practically infeasible (Blitzer et al. 2007). As a consequence, algorithms often fail to infer information effectively, especially when they rely on features of texts that are specific to the training domain. Such missing domain robustness constitutes a fundamental problem of text analysis (Turmo et al. 2006; Daumé and Marcu 2006). The missing robustness of an algorithm directly reduces the robustness of a pipeline it is employed in. This in turn limits the benefit of pipelines in all search engines and big data analytics applications, where the domains of texts cannot be anticipated. In this chapter, we present first substantial results of an approach that improves robustness by relying on novel structure-based features that are invariant across domains.
Section 5.1 discusses how to achieve ideal domain independence in theory. Since the domain robustness problem is very diverse, we then focus on a specific type of text analysis task (unlike in Chaps. 3 and 4). In particular, we consider tasks that deal with the classification of argumentative texts, like sentiment analysis, stance recognition, or automatic essay grading (cf. Sect. 2.1). In Sect. 5.2, we introduce a shallow model of such tasks, which captures the sequential overall structure of argumentative texts on the pragmatic level while abstracting from their content. For instance, we observe that review argumentation can be represented by the flow of local sentiment. Given the model, we demonstrate that common flow patterns exist in argumentative texts (Sect. 5.3). Our hypothesis is that such patterns generalize well across domains. In Sect. 5.4, we learn common flow patterns with a supervised variant of clustering. Then, we use each pattern as a single feature for classifying argumentative texts from different domains. Our results for sentiment analysis indicate the robustness of modeling overall structure (other tasks are left for future work). In addition, we can visually make results more intelligible based on the model (Sect. 5.5). Altogether, this chapter realizes the overall analysis within the approach of this book, highlighted in Fig. 5.1. Both robustness and intelligibility benefit the use of pipelines in ad-hoc large-scale text mining.
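To make the idea of a sentiment flow and of pattern-based features more concrete, the following Python fragment gives a minimal sketch. It is not the method developed in Sects. 5.2 to 5.4. The numeric encoding of local sentiment, the linear interpolation used for length normalization, the Manhattan-distance similarity, and the two hand-written patterns are all assumptions made for illustration, as are the function names `normalize_flow` and `pattern_features`. In particular, the patterns here are given by hand rather than learned through clustering.

```python
from typing import List, Sequence

# Assumed encoding of local sentiment (illustrative only):
# 1.0 = positive, 0.5 = objective, 0.0 = negative.


def normalize_flow(flow: Sequence[float], length: int = 5) -> List[float]:
    """Resample a sentiment flow to a fixed length by linear interpolation."""
    if not flow:
        raise ValueError("flow must contain at least one value")
    if length < 2:
        raise ValueError("length must be at least 2")
    if len(flow) == 1:
        return [float(flow[0])] * length
    result = []
    for i in range(length):
        # Position of the i-th output value on the original flow's index scale.
        pos = i * (len(flow) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(flow) - 1)
        frac = pos - lo
        result.append(flow[lo] * (1 - frac) + flow[hi] * frac)
    return result


def pattern_features(
    flow: Sequence[float],
    patterns: Sequence[Sequence[float]],
    length: int = 5,
) -> List[float]:
    """Return one feature per pattern: similarity of the flow to that pattern.

    Similarity is 1 minus the mean absolute difference, so it lies in [0, 1]
    as long as all values lie in [0, 1]. Each pattern must have `length` values.
    """
    norm = normalize_flow(flow, length)
    features = []
    for pattern in patterns:
        if len(pattern) != length:
            raise ValueError("every pattern must have exactly `length` values")
        dist = sum(abs(a - b) for a, b in zip(norm, pattern)) / length
        features.append(1.0 - dist)
    return features


# Hand-written example patterns (hypothetical, not learned from data).
FALLING = [1.0, 0.75, 0.5, 0.25, 0.0]   # starts positive, ends negative
RISING = [0.0, 0.25, 0.5, 0.75, 1.0]    # starts negative, ends positive

# A review whose local sentiment moves from positive to negative.
review_flow = [1.0, 1.0, 0.5, 0.0, 0.0, 0.0]
features = pattern_features(review_flow, [FALLING, RISING])
```

The resulting feature vector has one entry per pattern and contains no words from the text itself, which is why features of this kind might transfer across domains more readily than content-based ones.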