
1 Introduction

The information content of the scientific literature is largely represented visually in the figures — charts, diagrams, tables, photographs, etc. [14]. However, this information remains largely inaccessible and unused for document analysis [1, 15] or in academic search portals such as Google Scholar and Microsoft Academic Search. We posit that the structure and content of the visual information in the figures closely relate to its impact, that it can be analyzed to study how different fields of science organize and present data, and, ultimately, that it can be used to improve search and analytics tools [3, 16]. All of these applications share requirements around the ability to extract, classify, manage, and reason about the content of the figures in the papers rather than just the text alone.

Fig. 1.

Dismantling a visualization from the scientific literature. The original source image (a) is first segmented by a splitting method relying on background and layout patterns (b), then the fragments are classified and recursively merged into meaningful visualizations. Image source: (Boone et al., PLoS ONE, 5.)

In an initial investigation of these hypotheses, we extracted all figures from a corpus of PubMed papers and developed a classifier to recognize them, building on the work of Savva et al. [11]. We quickly found that about 35 % of all figures were composite figures that contained multiple sub-figures, and therefore could not be meaningfully classified directly. Finding it difficult to ignore 35 % of the information before we even began, we decided to tackle the problem of “dismantling” these composite figures automatically; our solution to the dismantling problem is described in this paper.

Figure 1(a) illustrates a simple example of a composite figure. The figure includes a diagram of a molecular sequence (A), a set of photographs of electrophoresis gels (B), an accumulation of a specific type of cells represented as a bar chart (C), and an alternative visualization of molecular sequences (D). Some sub-figures include additional substructure: part A breaks the sequences into two zoomed-in sections, and part D includes four distinct (but related) sub-diagrams. The task of extracting the intended sub-figures is hard: the diversity and complexity of the hand-crafted visualizations that appear in the literature resist simple heuristic approaches. (The reader is encouraged to browse some of the real examples in this paper as an illustration of this diversity and complexity.) Basic image segmentation techniques are inapplicable; they cannot distinguish between meaningful sub-figures and auxiliary fragments such as labels, annotations, legends, ticks, titles, etc.

Aside from the applications in literature search services, figure classification, and bibliometrics, exposing these multi-part figures for analysis affords new basic research into the role of visualization in science. Consider the following questions:

  1. What types of visualization are common in each discipline, and how have these preferences evolved over time?

  2. How does the use of visualization in the literature correlate with measures of impact?

  3. Do certain disciplines tend to use specialized visualizations more often than others? Does the use of specialized visualizations tend to be rewarded in terms of impact?

  4. Can we reverse engineer the data used in the paper automatically by analyzing the text and visualizations of the paper alone?

As a first step toward answering these questions, we present a decomposition algorithm to extract the content of multi-part figures and a detector to determine whether an image is a composite figure or a single figure. The algorithm involves three steps: First, we split a composite figure into small components by reasoning about layout patterns and the empty space between sub-figures (Fig. 1(b)). Second, we merge the split fragments by using an SVM-based classifier to distinguish auxiliary elements such as ticks, labels, and legends, which can be safely merged into neighboring blocks, from standalone sub-figures that should remain distinct (Fig. 1(c)). Third, we assign a score to each alternative initial segmentation strategy and select the higher-scoring decomposition as the final output. For the detector, we use the output of the splitting step as image features and train an SVM classifier for the task. To evaluate our approach, we compiled a corpus of 880 multi-chart figures and 1067 single-chart figures chosen randomly from the PubMed database. For the detector we achieved 90.2 % accuracy via 10-fold cross-validation. For the decomposition algorithm we randomly extracted 261 multi-chart figures as a testing set. We manually decomposed these multi-part figures and found that they comprised 1534 individual visualizations. Our algorithm produced 1281 total sub-images, of which 1035 were perfect matches for a manually extracted sub-figure. The remaining 246 incorrect pieces were either multi-chart images that required further subdivision, or meaningless fragments that required further merging. For the 85 % of the images containing eight or fewer sub-figures, we achieved 80.1 % recall and 85.1 % precision of correct sub-images. For the remaining 15 % of densely packed and complex figures, we achieved 42.1 % recall and 68.3 % precision.

2 Related Work

Content-based image retrieval (CBIR) organizes digital image archives by their visual content [2, 9, 13], allowing users to retrieve images sharing similar visual elements with query images. This technology has been widely deployed and is available in multiple online applications. However, CBIR has not been used to enhance scientific and technical document retrieval, despite the importance of figures to the information content of a scientific paper. Current academic search systems are based on annotations of titles, authors, abstracts, key words and references, as well as the text content.

Recognition of data visualizations is a different problem than recognition of photographs or drawn images. In early studies, Futrelle et al. presented a diagram-understanding system utilizing graphics constraint grammars to recognize two-dimensional graphs [4]. Later, they proposed a scheme to classify vector graphics in PDF documents via spatial analysis and graphemes [5, 12]. Yokokura et al. presented a layout-based approach that builds a layout network containing possible chart primitives for the recognition of bar charts [17]. Zhou et al. used Hough-based techniques [18] and Hidden Markov Models [19] for bar chart detection and recognition. Huang et al. proposed a model-based method to recognize several types of chart images [8], and later introduced optical character recognition and question answering for chart classification [7]. In 2007, Prasad et al. applied multiple computer vision techniques, including Histograms of Oriented Gradients, the Scale Invariant Feature Transform, and detection of salient curves, together with a Support Vector Machine (SVM), to classify five commonly used chart types [10]. In 2011, Savva et al. proposed an interesting application of chart recognition [11]. Their system first classifies charts, then extracts the underlying data, and finally re-designs the visualizations to improve graphical perception. They achieved above 90 % accuracy in chart classification for ten commonly used chart types. These works focused on recognizing and understanding individual chart images. None of these efforts worked with figures in the scientific literature, which are considerably more complex than typical visualizations, and none involved multi-chart images.

3 Decomposition Algorithm

Our algorithm comprises three steps: (1) splitting, (2) merging, and (3) selecting. In the first step, we recursively segment the original image into separate sub-images by analyzing empty space and applying assumptions about layout. In the second step, we use an SVM-based classifier to distinguish complete sub-figures from auxiliary fragments (ticks, labels, legends, annotations) and empty regions. In the third step, we compare the results produced by alternative initial segmentation strategies using a scoring function and select the best choice as the final output.

Fig. 2.

(a) Fire lanes. We locate the lanes by using the histogram of columns. Orange dots represent qualified columns that pass the thresholds. (b) Histogram of columns. Image source: (Subramaniam et al., The Journal of cell biology, 165:357-369.) (Color figure online)

3.1 Step 1: Splitting

The splitting algorithm recursively decomposes the original figure into sub-images. Authors assemble multiple visualizations together in a single figure to accommodate a limited space budget or to relate multiple visualizations into a coherent argument. We made a few observations about how these figures are assembled that guide the design of our splitting algorithm: First, the layout typically involves a hierarchical rectangular subdivision as opposed to an arbitrarily unstructured collage. Second, authors often include a narrow blank buffer between two sub-figures as a “fire lane” to ensure that the overall layout is readable (Fig. 2(a)). Third, paper-based figures are typically set against a light-colored background. We will discuss figures that violate these assumptions in Sect. 5.

Based on these assumptions, our splitting algorithm recursively locates empty spaces and divides the multi-chart figure into blocks. Based on our rectangularity assumption, we locate empty spaces by seeking wholly blank rows or columns rather than empty pixels. We first convert the color image into grayscale, and then compute a histogram for rows (and a second histogram for columns) by summing the pixel values of each row (or column). Figure 2(b) gives an example of a figure with its corresponding histogram for the columns. Candidate fire lanes appear as peaks or plateaus in the histogram with a value near the maximum, so we normalize the histogram to its maximum value and apply a high-pass empty threshold \(\theta _{e}\) to obtain a candidate set of “blank” rows (or columns). The maximum value does not necessarily indicate a blank row or column, because there may be no entirely blank rows or columns. For example, the green vertical line in Fig. 3(a) has the maximum pixel-value sum, but is not blank and is not a good choice as a fire lane. To address this issue, we apply a low-pass variance threshold \(\theta _{var1}\) to filter such items by their relatively high variances (Fig. 3(b)). We use a second method to detect empty spaces by applying another, stricter low-pass variance threshold \(\theta _{var2}\) to rows or columns. The first method provides a wider pass window, while the second method is well suited to figures with a dark background.
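As a concrete illustration, the following NumPy sketch shows one way to compute the candidate set of blank columns (or rows) for a single level of the recursion; the function name, the exact way the two tests are combined, and the use of NumPy are illustrative simplifications rather than the exact implementation.

```python
import numpy as np

def candidate_lanes(gray, axis=0, theta_e=0.999, theta_var1=100.0, theta_var2=3.0):
    """Boolean mask of candidate blank columns (axis=0) or rows (axis=1).

    gray: 2-D array of grayscale pixel values (light background = high values).
    A column/row qualifies either by the normalized-sum test combined with the
    looser variance filter, or by the stricter variance-only test.
    """
    sums = gray.sum(axis=axis).astype(float)
    norm = sums / sums.max()                                  # normalize to the maximum
    variances = gray.var(axis=axis)
    blank1 = (norm >= theta_e) & (variances <= theta_var1)    # method 1
    blank2 = variances <= theta_var2                          # method 2 (dark backgrounds)
    return blank1 | blank2
```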

To set the values of the three thresholds, we collected 90 composite figures (avoiding figures with photographs) and ran the splitting step with different combinations of thresholds against this training set. Since our goal is just to tune these parameters, we make a simplifying assumption that finding the correct number of sub-images implies a perfect split; that is, if the number of divided sub-images equals the correct number of sub-figures determined manually, we assume the division was perfect. The reason for this simplifying assumption is to improve automation for repeated experiments; we did not take the time to manually extract perfect splits for each image with which to compare. Under this analysis, the values for the thresholds that produced the best results were \(\theta _{e}=0.999\), \(\theta _{var1} = 100\), and \(\theta _{var2} = 3\).

We group neighboring empty pixel-rows or pixel-columns to create empty “fire lanes” as shown in Fig. 2(a). The width of each fire lane is used in the merge step to determine each sub-image’s nearest neighbor. Half of each fire lane is assigned to each of the two adjacent blocks; each block then becomes a new image to be analyzed recursively. Row-oriented splits and column-oriented splits are performed alternately, recursively, until no fire lane is found within a block. The recursion occurs at least two times to ensure both orientations are considered at least once.
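Building on the candidate_lanes sketch above, the following simplified sketch groups blank rows or columns into fire lanes and performs the alternating recursive split; the real implementation additionally records the lane widths and the split tree that the merging step relies on.

```python
def group_lanes(mask, min_width=1):
    """Group consecutive blank rows/columns (True entries) into (start, end) lanes."""
    lanes, start = [], None
    for i, blank in enumerate(list(mask) + [False]):      # sentinel flushes the last run
        if blank and start is None:
            start = i
        elif not blank and start is not None:
            if i - start >= min_width:
                lanes.append((start, i))
            start = None
    return lanes

def split_once(gray, vertical=True):
    """Cut the image at the midpoint of each fire lane along one orientation."""
    axis = 0 if vertical else 1
    lanes = group_lanes(candidate_lanes(gray, axis=axis))
    cuts = [(s + e) // 2 for s, e in lanes]                # half a lane to each block
    bounds = [0] + cuts + [gray.shape[1 if vertical else 0]]
    blocks = [gray[:, a:b] if vertical else gray[a:b, :]
              for a, b in zip(bounds[:-1], bounds[1:])]
    return [b for b in blocks if b.size]

def split_recursive(gray, vertical=True, min_depth=2):
    """Alternate orientations until no fire lane is found (at least two levels)."""
    blocks = split_once(gray, vertical)
    if len(blocks) == 1 and min_depth <= 0:
        return blocks
    return [sub for b in blocks
            for sub in split_recursive(b, not vertical, min_depth - 1)]
```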

Different initial splitting orientations can result in different final divisions, so the splitting algorithm is performed twice: once beginning vertically and once beginning horizontally. We execute the merging step individually on the two results and automatically evaluate them in step 3. The split with the higher score is taken as the final decomposition.

Fig. 3.

The identification of fire lanes is non-trivial. (a) Locating fire lanes without applying the variance threshold \(\theta _{var1}\) leads to an error: since there are no entirely blank columns, the maximum value (highlighted in green) is not a qualified fire lane. (b) The disqualified column is filtered by applying \(\theta _{var1} = 100\). Image source: (Hong et al., BMC Genetics, 13:78) (Color figure online)

3.2 Step 2: Merging

The merging algorithm receives the splitting result as input and then proceeds in two substeps: First, we use an SVM-based classifier to distinguish standalone sub-figures representing meaningful visualizations from auxiliary blocks that are only present as annotations for one or more standalone sub-figures. Second, we recursively merge auxiliary blocks, assigning each to its nearest block, until all auxiliary blocks are associated with one (or more) standalone sub-figures. We refer to this process as hierarchical merging. If two neighboring blocks have incongruent edges, a non-convex shape may result. In this case, we perform T-merging: we search nearby for sub-figures that can fill the non-convexity in the shape. We will discuss the details of the classifier, Hierarchical Merging, and T-Merging in this section.

Table 1. The features used to classify sub-images as either standalone sub-figures or auxiliary fragments. We used \(k=5\) for our experiment; thus the feature vector consists of 15 elements. We achieved classification accuracy of 98.1 %, suggesting that these geometric and whitespace-oriented features well describe the differences between the two categories.
Fig. 4.

Blank coverage according to blank rows (red) and blank columns (green). We divided the image into 5 sections horizontally and 5 sections vertically and, in each section, computed the percentage of blank rows or blank columns, respectively. These 10 values form a portion of the image feature vector. Image source: (Kapina et al., PLoS ONE, 6.) (Color figure online)

Training SVM-Based Binary Classifier. Figure 5 shows an example of an intermediate state while merging, consisting of 18 sub-images from the composite figure. Sub-images labeled (D, F, H, J, N, O, Q, R) are classified as standalone blocks. All others are classified as auxiliary blocks. The goal of the merging algorithm is to remove auxiliary blocks by assigning them to one or more standalone blocks. To recognize auxiliary blocks, we extract a set of features for each block and train an SVM-based classifier. The features selected are based on the assumption that the authors tend to follow implicit rules about balancing image dimensions and distributing empty space within each figure, and that these rules are violated for auxiliary annotations. To describe the dimensions of the block, we compute proportional area, height and width relative to that of the original image, as well as the aspect ratio. To describe the distribution of empty space, we use the same thresholds from the splitting step to locate entirely blank rows or columns and then compute the proportion of the total area covered by the pixels of these blank elements. We do not consider the overall proportion of empty pixels, because many visualizations use an empty background — consider a scatter plot, where the only non-empty pixels are the axes, the labels, the legend, and the glyphs. As a result, blank rows and columns should be penalized, but blank pixels should not necessarily be penalized.

Blank coverage alone does not sufficiently penalize sub-figures that have large blocks of contiguous empty space, a pattern we see frequently in auxiliary sub-images. For example, an auxiliary legend offset in the upper right corner of a figure will have large contiguous blocks of white space below it and to its left. To describe these cases, where empty space tends to be concentrated in particular areas, we divide each sub-image into k equal-size sections via horizontal cuts and another k sections via vertical cuts. We then extract one feature for each horizontal and vertical section, 2k features in total; each feature \(f_i\) is computed as the proportion of blank rows (for horizontal sections) or blank columns (for vertical sections) in section i. To determine a suitable k, we experimented with values from \(k=0\) to \(k=10\) on the training data and set \(k=5\) based on the results; we do not consider further optimization of this parameter in this paper. Combining the dimensional features and the empty-space features, we obtain a 15-element feature vector for each sub-image. These features are summarized in Table 1.
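Continuing the sketches above, the following function assembles one plausible version of this 15-element feature vector; the exact ordering of the elements and the inclusion-exclusion formula for blank coverage are a reading of the description, not the original code.

```python
def subimage_features(gray, orig_shape, k=5):
    """15-element feature vector for one sub-image (4 dimensional features,
    blank coverage, and 2k per-section blank proportions).
    gray: the sub-image; orig_shape: (H, W) of the original composite figure."""
    h, w = gray.shape
    H, W = orig_shape
    blank_rows = candidate_lanes(gray, axis=1)           # per-row mask
    blank_cols = candidate_lanes(gray, axis=0)           # per-column mask
    dims = [h * w / (H * W), h / H, w / W, w / h]        # area, height, width, aspect
    # area covered by entirely blank rows/columns (the overlap counted once)
    blank_cov = (blank_rows.sum() * w + blank_cols.sum() * h
                 - blank_rows.sum() * blank_cols.sum()) / (h * w)
    sections = [part.mean() for mask in (blank_rows, blank_cols)
                            for part in np.array_split(mask, k)]
    return np.array(dims + [blank_cov] + sections)       # 4 + 1 + 2k = 15 for k = 5
```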

As an example of how these features manifest in practice, consider Fig. 4. This image has 26.6 % blank coverage; blank columns are colored green and blank rows are colored red. As with most visualizations in the literature, the overall percentage of blank pixels is very high, but the percentage of blank rows and columns is relatively low. We divide the image into horizontal and vertical sections as indicated by the green and red dashed lines; the decimals indicate the percentage of blank rows or columns in the nearby sections. The complete 15-element feature vector of this image is {1, 1, 1, 0.4272, 0.2656, 0.5217, 0, 0, 0, 0.0435, 0.0943, 0, 0.5094, 0.1321, 0.1132}.

To evaluate our classifier, we collected another corpus containing 213 composite figures from the same source to ensure training independence. The splitting algorithm produced 7541 sub-images from this corpus, which we manually classified into 6524 standalone sub-images and 1017 auxiliary sub-images. We used LibSVM [6] with all parameters set to their defaults to train the model and achieved a classification accuracy of 98.1 %.
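For reference, a minimal training sketch, using scikit-learn's libsvm-backed SVC as a stand-in (the original work called LibSVM directly, also with default parameters):

```python
from sklearn.svm import SVC

def train_auxiliary_classifier(features, labels):
    """features: (n_subimages, 15) array from subimage_features();
    labels: 1 = standalone sub-figure, 0 = auxiliary fragment."""
    return SVC().fit(features, labels)     # default kernel and parameters
```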

Fig. 5.

The tree structure of a decomposition. This multi-chart image was split starting with the column orientation; the result of the splitting step forms a tree structure. The numbers in parentheses indicate the section the block belongs to at each splitting level. For instance, H(3, 2) means that block H is in the third section when the original multi-chart figure is split, and in the second sub-section when the third section is split again. With the assistance of the classifier, standalone blocks are colored blue and auxiliary blocks red. Image source: (Botella-Soler et al., PLoS ONE, 7.) (Color figure online)

Fig. 6.

Examples of hierarchical merging. In all cases, the goal is to merge all auxiliary blocks (labeled A) into standalone blocks (labeled S). Each merge operation is indicated by a white arrow. (a) An acceptable merge; the new block is the smallest rectangle that covers both merging blocks. (b) Two different merge paths that lead to the same result. (c) Another case of acceptable multi-merging. (d) A forbidden merge: merging the auxiliary block into the standalone block would produce a non-rectangular shape. The operation involves only the blocks with a yellow outline; once the local merging at this level is completed, it is repeated at the next level, which also involves the rightmost standalone block. (e) Another merge forbidden for the same reason. After hierarchical merging completes, residual auxiliary blocks are handled by T-Merging (Color figure online).

Hierarchical Merging. The result of the splitting step forms a tree structure, as shown in Fig. 5. Merging starts from the leaves of the tree, and leaves may only merge with other leaves at the same level. After completing all possible merges among siblings, we promote the newly merged blocks to their parent’s level as new leaves. Hierarchical merging stops after the top level has been processed.

At each level, we re-run the classifier to determine the auxiliary blocks. We then induce a function on the set of blocks, assigning each block to its nearest adjacent neighbor, called its merge target; a block and its merge target form a merge pair. Under the assumption that the width of a fire lane indicates the strength of the relationship between two blocks, the merge target of a block is the adjacent neighbor separated from it by the narrowest lane. Figure 6(a) shows a combination of two blocks: the new block is the smallest rectangle that covers both merging blocks. Only adjacent blocks are allowed to merge. If there are two or more qualified blocks, we break the tie by the shortest distance between block centroids.
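A minimal sketch of merge-target selection and of the bounding-rectangle merge, representing each block as an axis-aligned rectangle; the Block type, its field names, and the adjacency test below are illustrative assumptions rather than the original data structures.

```python
import math
from dataclasses import dataclass

@dataclass
class Block:
    top: int
    bottom: int
    left: int
    right: int
    standalone: bool = False

def gap(a, b):
    """Width of the fire lane between two adjacent blocks, or None if not adjacent."""
    if a.top < b.bottom and b.top < a.bottom:             # side by side
        return max(a.left - b.right, b.left - a.right)
    if a.left < b.right and b.left < a.right:             # stacked
        return max(a.top - b.bottom, b.top - a.bottom)
    return None

def centroid(b):
    return ((b.top + b.bottom) / 2, (b.left + b.right) / 2)

def merge_target(block, siblings):
    """Adjacent neighbor with the narrowest lane; ties broken by centroid distance."""
    cands = [(gap(block, s), s) for s in siblings if s is not block]
    cands = [(g, s) for g, s in cands if g is not None and g >= 0]
    if not cands:
        return None
    return min(cands, key=lambda gs: (gs[0],
               math.dist(centroid(block), centroid(gs[1]))))[1]

def bounding_union(a, b):
    """Smallest rectangle covering both blocks (the result of one merge)."""
    return Block(min(a.top, b.top), max(a.bottom, b.bottom),
                 min(a.left, b.left), max(a.right, b.right),
                 standalone=a.standalone or b.standalone)
```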

If merging a pair of blocks would produce a non-rectangular shape, the merge is forbidden; the auxiliary block is instead labeled as standalone and left for T-Merging (described next) or for a higher level of the tree. For example, the merges in Fig. 6(b) and (c) are allowed since the results are rectangular, whereas those in Fig. 6(d) and (e) are forbidden. We repeat the local merging until all blocks are labeled standalone; we then pass the blocks up to the next level, reclassify them, and repeat the local merging.

T-Merging. T-Merging handles the residual auxiliary blocks left over from Hierarchical Merging. These are usually shared titles, shared axes, or text annotations that apply to multiple sub-figures; e.g., Fig. 6(d) and (e). As shown in Fig. 7, merging the auxiliary block 1 (the “legacy”) into any single adjacent standalone block would generate a non-rectangular shape. We therefore define blocks 2 and 3 as legatees that proportionally share block 1. We find the set of legatees by the following procedure: for each edge e of the legacy, find all blocks that share any part of e and construct a set; if merging this set as a unit produces a rectangle, it is a qualified set. If multiple edges produce qualified sets, we choose the edge with the narrowest fire lane. In Fig. 7, only the set consisting of blocks 2 and 3 satisfies these criteria; blocks 4, 5, and 6 are not proper legatees. Figure 8 illustrates the evolution from a source image through Hierarchical Merging to its T-merged output.
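The following sketch, reusing the Block rectangles from above, shows one way to implement the legatee search; the rectangularity test and the edge-sharing tolerance are an interpretation of the description, and the narrowest-fire-lane tie-break between qualified edges is omitted.

```python
def is_rectangular_tiling(blocks):
    """True if the blocks exactly tile their common bounding rectangle."""
    top = min(b.top for b in blocks)
    bottom = max(b.bottom for b in blocks)
    left = min(b.left for b in blocks)
    right = max(b.right for b in blocks)
    area = sum((b.bottom - b.top) * (b.right - b.left) for b in blocks)
    return area == (bottom - top) * (right - left)        # assumes no overlaps

def shares_edge(legacy, b, side, tol=0):
    """Does block b lie across the given edge of the legacy block?"""
    if side in ("left", "right"):
        touching = (abs(b.right - legacy.left) <= tol if side == "left"
                    else abs(b.left - legacy.right) <= tol)
        return touching and b.top < legacy.bottom and b.bottom > legacy.top
    touching = (abs(b.bottom - legacy.top) <= tol if side == "top"
                else abs(b.top - legacy.bottom) <= tol)
    return touching and b.left < legacy.right and b.right > legacy.left

def find_legatees(legacy, blocks, tol=0):
    """Return a qualified legatee set for a residual auxiliary block, or []."""
    for side in ("top", "bottom", "left", "right"):
        cand = [b for b in blocks
                if b is not legacy and shares_edge(legacy, b, side, tol)]
        if cand and is_rectangular_tiling(cand):
            return cand
    return []
```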

Fig. 7.

Examples of T-Merging. The legacy (block 1) is marked with white text. According to our algorithm, only block 2 and block 3 are qualified to share block 1.

Fig. 8.

(a) Composite figure. (b) Splitting result. (c) Intermediate state of hierarchical merging after completing level 3. (d) Hierarchical merging result; the topmost block and the middle block require T-Merging. (e) T-Merging result. Image source: (Botella-Soler et al., PLoS ONE, 7.)

3.3 Step 3: Selecting

The splitting and merging steps may produce different results from different initial splitting orientations (Fig. 8). Step 3 scores the two different results and selects the one with higher score as the final output. Under the assumption that authors tend to follow implicit rules about balancing image dimensions, the decomposition that produces more sub-images with similar dimensions is given a higher score. To capture this intuition, we define the scoring function as

$$\begin{aligned} S_{decomposition} = 4\sum _{i \in \textsf {blocks}} \sqrt{A_i} - 2\alpha \sum _{(i,j)\in \textsf {Pairs}}\left( |{l^{top}_i - l^{top}_j}| +|{l^{left}_i - l^{left}_j}|\right) , \end{aligned}$$

where \(A_i\) is the area of block i, \(\alpha \) is a penalty coefficient, and \(l^{top}_i\) (respectively \(l^{left}_i\)) is the length of the top (left) edge of block i. Each element of the set Pairs is a pair of blocks (i, j), where j is the block with the most similar dimensions to i, for \(i\ne j\). The two coefficients normalize the two terms to the full perimeter. The formula encodes a geometric property of composite figures: the first term obtains its maximum value when all blocks are equal in size, while the second term subtracts the difference between each block and its most similar neighbor to reward repeating patterns and penalize diversity in the set. The penalty coefficient weights the importance of dimensional differences; we assigned \(\alpha =1\) in our experiment.
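A direct transcription of this scoring function over the Block rectangles used in the earlier sketches; the definition of the most similar peer as the block minimizing the combined edge-length difference is an assumption.

```python
import math

def width(b):
    return b.right - b.left        # length of the top edge

def height(b):
    return b.bottom - b.top        # length of the left edge

def decomposition_score(blocks, alpha=1.0):
    """Score a candidate decomposition (alpha = 1 as in the experiments)."""
    reward = 4 * sum(math.sqrt(width(b) * height(b)) for b in blocks)
    if len(blocks) < 2:
        return reward
    penalty = 0.0
    for i in blocks:
        j = min((b for b in blocks if b is not i),          # most similar peer
                key=lambda b: abs(width(i) - width(b)) + abs(height(i) - height(b)))
        penalty += abs(width(i) - width(j)) + abs(height(i) - height(j))
    return reward - 2 * alpha * penalty
```

The two candidate decompositions (column-first and row-first) are then compared by this score, and the higher-scoring one is returned.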

4 Composite Figure Detection

To make our algorithm useful in the broader setting where a given image is not pre-labeled as a composite figure, we extended the splitting algorithm to recognize composite figures up front, rather than applying the complete decomposition algorithm to this task; this avoids most of the time consumed by the merging step. The output of the splitting algorithm serves as a feature describing the geometric layout of the sub-figures. Figure 9 shows a splitting result in which the fire lanes obtained from all recursive layers are highlighted in light red; this highlighting can be regarded as a binary mask, and the uncovered areas are defined as the effective figure regions (EFR). Next, we subdivide the mask into \(N\times N\) blocks and compute the proportion of EFR in each block, as shown by the EFR density map, and then flatten the values into a 1-D vector with \(N^2\) elements. In addition, we average the heights and widths of all training images and compute the dimension ratios \((height_n / height_{avg})\) and \((width_n / width_{avg})\) for each image n as simple dimensional features. We concatenate the dimensional features and the geometric features to create the final feature vector. Using the same technique used to train the auxiliary classifier, we obtained 90.2 % accuracy from 10-fold cross-validation on the entire corpus comprising 880 composite figures and 1067 single figures.
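A sketch of this feature construction, reusing candidate_lanes from Sect. 3.1; the grid size n, the top-level-only lane mask, and the handling of the dimension ratios are illustrative simplifications (the full method accumulates fire lanes from every recursion layer).

```python
def detector_features(gray, n=8, avg_h=None, avg_w=None):
    """EFR density map (n x n, flattened) plus two dimension ratios."""
    h, w = gray.shape
    lane_mask = np.zeros((h, w), dtype=bool)
    lane_mask[candidate_lanes(gray, axis=1), :] = True    # blank rows
    lane_mask[:, candidate_lanes(gray, axis=0)] = True    # blank columns
    efr = ~lane_mask                                       # effective figure region
    density = [cell.mean() for strip in np.array_split(efr, n, axis=0)
                           for cell in np.array_split(strip, n, axis=1)]
    ratios = [h / avg_h, w / avg_w] if avg_h and avg_w else [1.0, 1.0]
    return np.array(density + ratios)
```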

Fig. 9.

The pipeline from splitting an image to acquiring its feature vector. Image source: (Boone et al., PLoS ONE, 5.)

5 Experimental Evaluation

In this section, we describe experiments designed to answer the following questions: (a) Can our algorithm be used to estimate visualization diversity, a weaker quality metric sufficient for many of our target applications? (Yes; Table 2) (b) Can our algorithm effectively extract correct sub-figures, a stronger quality metric? (Yes; Table 2) (c) Could a simpler method work just as well as our algorithm? (No; Table 2) (d) Is step 3 of the algorithm (selection) necessary and effective? (Yes; Fig. 11).

Fig. 10.

Results of different initial splitting orientations. (a) The split begins horizontally (initially row-oriented) and receives a lower score due to mismatched elements. (b) The split begins vertically (initially column-oriented) and receives a higher score. Image source: (Türumen et al., PLoS Genetics, 5.)

The corpus we used for our experiments was collected from the PubMed database. We selected a random subset of the database by collecting all tar.gz files from 188 folders (from //pub/pmc/ee/00 to //pub/pmc/ee/bb); these files contain the PDF files of the papers as well as the source images of the figures, so figure extraction was straightforward. To filter out non-figure images such as logos, banners, etc., we used only images larger than 8 KB. We manually separated the composite figures from the single-chart figures and divided the composite figures into a testing set and a training set. We trained the classifier and performed cross-validation with the training set, reserving the test set for a final experimental evaluation. The testing set S for the experiments contains 261 composite figures related to biology, biomedicine, or biochemistry. Each figure contains at least two different types of visualizations; e.g., a line plot and a scatter plot, or a photograph and a bar chart. We ignored multi-chart figures composed of a single visualization type in this experiment for the convenience of evaluation, as described under the first question below. We evaluated performance in two ways: (1) type-based evaluation, a simpler metric in which we attempt to count the number of distinct types of visualizations within a single figure, and (2) chart-based evaluation, a stronger metric in which we attempt to perfectly recover all sub-figures within a composite figure.

Can Our Algorithm Be Used to Estimate Visualization Diversity? The motivation for type-based evaluation is that some of our target applications in bibliometrics and search services need only know the presence or absence of particular types of visualizations in each figure to afford improved search or to collect aggregate statistics — it is not always required to precisely extract a perfect sub-figure, as long as we can tell what type of figure it is. For example, the presence or absence of an electrophoresis gel image appears to be a strong predictor of whether the paper is in the area of experimental molecular biology; we need not differentiate between a sub-figure with one gel and a sub-figure with several gels. Moreover, it is not always obvious what the correct answer should be when decomposing collections of sub-figures of homogeneous type: Part of Fig. 8(a) contains a number of repeated small multiples of the same type — it is not clear that the correct answer is to subdivide all of these individually. Intuitively, we are assessing the algorithm's ability to eliminate ambiguity about what types of visualizations are being employed by a given figure, since this task is a primitive in many of our target applications.

To perform type-based evaluations we label a test set by manually counting the number of distinct visualization types in each composite figure. For example, Fig. 2 has two types of visualizations, a line chart and a bar chart; Fig. 5 also has two types of visualizations, a line chart and an area chart; Fig. 10(a) also has two types of visualizations, bar charts and electrophoresis gels. We then run the decomposition algorithm and manually distinguish correct extractions from incorrect extractions. Only homogeneous sub-images — those containing only one type of visualization — are considered correct. For example, the top block in Fig. 10(a) is considered correct, because both sub-figures are the same type of visualization: an electrophoresis gel image. The bottom two blocks of Fig. 10(a) are considered incorrect, since each contains both a bar chart and a gel.

Using only the homogeneous sub-images (the heterogeneous sub-images are considered incorrect), we manually count the number of distinct visualization types found for each figure. We compare this number with the number of distinct visualization types found by manual inspection of the original figure. For example, in Fig. 10(a), the algorithm produced one homogeneous sub-image (the top portion), so only one visualization type was discovered. However, the original image has two distinct visualization types. So our result for this figure would be 50 %.

To determine the overall accuracy, we define a function \(diversity: Figure \rightarrow Int\) as \(diversity(f) = |\{type(s) \mid s \in decompose(f)\}|\), where decompose returns the set of sub-figures and type classifies each sub-figure as a scatter plot, line plot, etc. The value is the number of distinct visualization types that appear in the figure. We sum the diversity scores over all figures in the corpus, computing this total twice: once using our automatic version of the decompose function and once using a manual process. Finally, we divide the total diversity computed automatically by the total diversity computed manually to obtain the overall quality metric. The automatic method is not generally capable of finding more types than are present in the figure, so this metric is bounded above by 1. In our experiment, we obtained diversity scores of 591 and 640 from the automatic decomposition and the manual process, respectively; the accuracy by this metric is therefore 92.3 %.
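In code, the metric is straightforward; here decompose and type_of are placeholders for the decomposition algorithm and a visualization-type classifier (manual in our evaluation), not functions defined elsewhere in this paper.

```python
def diversity(figure, decompose, type_of):
    """Number of distinct visualization types among the sub-images of a figure."""
    return len({type_of(s) for s in decompose(figure)})

def type_based_accuracy(figures, auto_decompose, manual_decompose, type_of):
    """Ratio of total automatic diversity to total manual diversity (bounded by 1)."""
    auto = sum(diversity(f, auto_decompose, type_of) for f in figures)
    manual = sum(diversity(f, manual_decompose, type_of) for f in figures)
    return auto / manual            # 591 / 640 = 92.3 % in our experiment
```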

Can Our Algorithm Effectively Extract Correct Sub-figures? For chart-based evaluation, we attempt to perfectly extract the exact sub-figures found by manual inspection, and measure precision and recall. For instance, Figs. 2, 5, and 10(b) contain 5, 8, and 6 sub-figures respectively. To obtain ground truth, we manually extracted 1534 visualizations from the entire image set S; about 5.88 visualizations per composite figure on average. In this experiment, a sub-image that includes exactly one visualization is defined as correctly extracted (exceptions are described below); conversely, a sub-image that crops a portion of a visualization, includes only auxiliary annotations, or includes two or more visualizations is considered incorrect. These criteria are stricter than necessary for many applications of the algorithm; for example, partial visualizations or visualizations with sub-structure will often still be properly recognized by a visualization-type classifier and can therefore be used for analysis. However, this metric provides a reasonable lower bound on quality.

We make an exception to these criteria: We consider an array of photographic images to be one visualization. This exception is to ensure that we do not artificially improve our results: The algorithm is very effective at decomposing arrays of photos, but it is not obvious that these arrays should always be decomposed; the set is often treated as a unit. In this analysis, we also ignore cases where an auxiliary annotation is incorrectly assigned to one owner instead of another. The reason is that we find a number of ambiguous cases where “ownership” of an auxiliary annotation is not well-defined.

Table 2. Chart-based evaluation. \(S_{all}\) denotes the entire composite-figure set, \(S_{p\le 8}\) the subset of composite figures containing eight or fewer sub-figures, and \(S_{p>8}\) the subset containing nine or more sub-figures. We compare our main approach to a splitting-only method based on our splitting algorithm: the recall and precision of correct sub-images, as well as the decomposition accuracy, are significantly higher for our approach. Our technique also performs better on the subset of composite figures containing eight or fewer sub-figures.

We define a notion of recall and precision based on these correctness criteria. To compute recall, the number of correct sub-images returned by the algorithm is divided by the number of correct sub-figures manually counted in the corpus using the same criteria for correctness. To compute precision, we divide the number of correct sub-images returned by the algorithm by the total number of extracted sub-images. Our algorithm achieves recall of 67.5 % and precision of 80.8 %. In addition, the percentage of figures that are perfectly decomposed — the right number of correct images and no incorrect images — is 57.9 %. Table 2 summarizes the chart-based evaluation in more detail. Later in this section we will analyze the mistakes made by the algorithm.

Fig. 11.

Step 3 of the algorithm, selection, makes correct decisions. Circles represent perfectly decomposed figures and crosses represent imperfectly decomposed figures. The scatter plot shows that figures with perfect decomposition mostly lie near the line of slope 1, indicating that our decomposition algorithm finds similar solutions regardless of the starting orientation. The selection step deals with the grey and red points, which are composite figures whose two initial splitting orientations produce different outputs. Only one mistaken selection was made, giving 95.9 % accuracy (Color figure online).

Fig. 12.

(a) Histogram of perfectly and imperfectly decomposed figures. (b) Histogram of extracted sub-figures. Our decomposition algorithm performs better on composite figures with fewer sub-figures; entanglement and over-merging are common issues for images with densely packed sub-figures.

Does a Simpler Method Work Just as Well? For comparison, we measured the performance of our algorithm against a simpler split-based algorithm. Here, we modified our splitting step (Sect. 3.1) to make it more viable as a complete algorithm. As presented, the splitting step may produce a large number of auxiliary fragments that need to be merged (e.g., Fig. 8(b)), so a reasonable approach is to cap the number of recursive steps and see whether we can avoid the need to merge altogether. We use two recursive steps: one vertical and one horizontal. In addition, as a heuristic to improve the results, we discarded fire lanes narrower than 4 pixels, because most lanes between auxiliary fragments, or between auxiliary fragments and effective sub-figures, are relatively narrow.

Our results show that this splitting-only algorithm extracted 833 correct sub-images and achieved 54.3 % recall and 53.1 % precision. Only 16.1 % of the original composite figures were decomposed perfectly into exact sub-figures without any errors. By both measures, this simpler method performs significantly worse despite optimizations (Table 2).

Is Step 3 of the Algorithm (Selection) Useful and Effective? To evaluate the utility of our selection step, we manually compared the two outputs of the different splitting orientations before our algorithm automatically chose one. For 237 figures, the two initial splitting orientations produce the same result. For the remaining 24 figures that require the selection algorithm, it correctly chose the better output for 23 figures, 11 from the initial column-oriented split and 13 from the initial row-oriented split. Figure 11 shows an overview of all selection scores as computed by the formula in Sect. 3.3. Each point denotes a composite figure; circles are figures decomposed perfectly and crosses are figures decomposed imperfectly. Figures with perfect decomposition mostly appear near the line of slope 1, indicating that our decomposition algorithm often finds similar solutions regardless of the starting orientation. For points where one score differs from the other, however, the selection step plays an important role.

Where Does the Algorithm Make Mistakes? To understand the algorithm’s performance more deeply, we considered whether the complexity of the initial figure had any effect on the measured performance. Figure 12(a) shows a histogram of composite figures, where each category is a different number of sub-figures. The dark portion of each bar indicates the proportion of composite figures that were perfectly decomposed. The curve, which shows the accuracy of perfect decomposition, decays significantly as the number of sub-figures increases; the algorithm tends to perform significantly better on figures with eight or fewer sub-figures.

Figure 12(b) is a histogram of the total number of sub-figures extracted from each category, regardless of whether the entire figure was perfectly decomposed. The black dotted line divides the categories into two subsets. The right subset, comprising composite figures containing nine or more sub-figures, includes only 15.0 % of the source figures but contributes 61.7 % of the unextracted sub-figures (i.e., the sub-figures the algorithm failed to extract properly); accordingly, a relatively low recall of 42.1 % was obtained for this subset (Table 2). In the left subset, comprising 222 composite figures with eight or fewer sub-figures, the recall was much higher at 80.1 % (Table 2). Both histograms show better performance on composite figures with fewer sub-figures.

6 Limitations and Future Work

The current decomposition algorithm is suitable for grid-aligned multi-chart visualizations, where there exists at least one edge-to-edge fire lane that can bootstrap the process. Figure 13 shows two examples that do not satisfy this criterion, and for which our algorithm does not produce a result. Our algorithm is also ill-suited for arrays of similar sub-figures for which it is ambiguous and subjective whether or not they should be considered as one coherent unit. We chose to maximally penalize our algorithm by assuming that every individual element should be considered a separate sub-figure.

We are also working to make the classifier used in the merging phase more robust. Although our current binary classifier achieves 90 % recall in recognizing standalone sub-figures, the probability of perfectly dividing a figure still decays exponentially with the number of sub-figures: with an average of 5.88 charts per image, we expect roughly \(1 - 0.9^{5.88} = 46.2\,\%\) of figures to contain at least one misclassified sub-figure. The misclassifications mostly involve relatively small sub-figures. To improve the binary classifier, we plan to consider additional features derived from color information and from text location (via the use of character recognition algorithms).

In future work, we plan to combine the decomposition algorithm with visualization classification techniques to analyze the use of visualization in the literature by domain, by impact, by year, and by demography. We believe this effort represents a first step toward a new field of viziometrics — the analysis of how visualization techniques are used to convey information in practice.

Fig. 13.

Cases for which the splitting algorithm is not appropriate. (a) The irregular outer bounds of the sub-figures form a zigzag fire lane. (b) There is no end-to-end fire lane.

7 Conclusions

We have presented an algorithm that automatically detects composite figures and dismantles them into sub-figures. The decomposition algorithm first splits an image into blocks via spatial analysis; an SVM-based image classifier with an accuracy of 98.1 % then distinguishes effective visualizations from auxiliary fragments, and the fragments are recursively merged to reconstruct complete visualizations. For detection, we extended the splitting algorithm to obtain geometric features from visualizations and again used an SVM classifier, achieving an accuracy of 90.2 %. For type-based analysis, a weaker metric suitable for understanding the diversity of visualizations used in the literature, 78.1 % of the composite figures in our corpus were completely divided into homogeneous, recognizable visualizations. For chart-based analysis, a stronger metric suitable for extracting the original visualizations exactly, we successfully extracted 67.5 % of the sub-images from 261 composite figures; for the 85 % of images in the corpus containing eight or fewer sub-figures, recall rose to 80.1 %. With this technique, we are now poised to unlock the content of multi-part composite figures in the literature and make it available for use in advanced applications such as bibliometrics and academic search.