The importance of run-time efficiency is still often disregarded in approaches to text analysis tasks, limiting their use for industrial-size text mining applications (Chiticariu et al. 2010b). Search engines avoid efficiency problems by analyzing input texts at indexing time (Cafarella et al. 2005). However, this is impossible for ad-hoc text analysis tasks. To both manage and benefit from the ever increasing amounts of text in the world, we not only need to scale existing approaches to large amounts of data (Agichtein 2005), but we also need to develop novel approaches at large scale (Glorot et al. 2011). Standard text analysis pipelines execute computationally expensive algorithms on most parts of the input texts, as we have seen in Sect. 3.1. While one way to enable scalability is to rely only on cheap but less effective algorithms (Pantel et al. 2004; Al-Rfou’ and Skiena 2012), in this chapter we present ways to significantly speed up arbitrary pipelines, by more than one order of magnitude in some cases. As a consequence, more effective algorithms can be employed in large-scale text mining.
In particular, we observe that the schedule of a pipeline’s algorithms affects the pipeline’s efficiency when the pipeline analyzes only relevant portions of text (as achieved by our input control from Sect. 3.5). In Sect. 4.1, we show that the optimal schedule can, in theory, be found with dynamic programming. It depends on the run-times of the algorithms and on the distribution of relevant information in the input texts. The latter in particular varies strongly between different collections and streams of texts, often making optimal scheduling too expensive (Sect. 4.2). In practice, we therefore perform scheduling with informed search on a sample of texts (Sect. 4.3). Where input texts are homogeneous in the distribution of relevant information, this approach reliably finds a near-optimal schedule according to our evaluation. In other cases, there is no single optimal schedule (Sect. 4.4). To optimize efficiency, a pipeline then needs to adapt to the input text at hand. Under high heterogeneity, such adaptive scheduling works well by learning in a self-supervised manner which schedule is fastest for which text (Sect. 4.5). For large-scale text mining, a pipeline can finally be parallelized, as we outline in Sect. 4.6. The contribution of Chap. 4 to our overall approach is shown in Fig. 4.1.
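To give an intuition for why the schedule matters, the following minimal sketch models each pipeline algorithm as a filter with a run-time and a selectivity (the fraction of texts still relevant after it runs, standing in for the distribution of relevant information). The stage names and numbers are purely illustrative assumptions, not taken from the chapter; the sketch uses brute-force enumeration rather than the dynamic programming discussed in Sect. 4.1, which serves the same purpose without the factorial blow-up.

```python
from itertools import permutations

# Hypothetical pipeline stages: (name, run-time per text, selectivity =
# fraction of texts still relevant after the stage). Illustrative values.
stages = [
    ("entity_recognition",  5.0, 0.4),
    ("time_tagging",        1.0, 0.7),
    ("relation_extraction", 9.0, 0.2),
]

def expected_cost(schedule):
    """Expected run-time per input text: each stage processes only the
    fraction of texts that all earlier stages judged relevant."""
    cost, remaining = 0.0, 1.0
    for _, run_time, selectivity in schedule:
        cost += remaining * run_time
        remaining *= selectivity
    return cost

# Exhaustive search is fine for a handful of stages; a dynamic-programming
# formulation avoids enumerating all n! schedules.
best = min(permutations(stages), key=expected_cost)
print([name for name, _, _ in best], expected_cost(best))
```

With these numbers, the cheap, fairly selective time tagger runs first even though the relation extractor is the most selective stage, because scheduling weighs run-time against how many texts a stage filters out.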