1 Introduction

XML document clustering poses two major challenges. First, the explicit manipulation of XML documents to capture content and structural resemblance raises several research issues, namely the alignment of their (sub)structures, the identification of similarities between such (sub)structures and between the textual data nested therein, and the discovery of possible mutual semantic relationships among textual data and (sub)structure labels. Second, resemblance between the structures and textual contents of XML documents should be captured at a semantic (i.e., topical) level.

In this paper, we focus on XML document clustering based on latent topic modeling, with the aim of avoiding the aforementioned issues. Our intuition is to partition a corpus of XML documents by topical similarity rather than by content and structure similarity. In particular, we propose two approaches.

The first approach applies well-known clustering techniques to partition the semantic representations of the XML documents of an input corpus under the MUESLI model. MUESLI (xMl clUstErS from Latent topIcs) is an innovative XML topic model, conceived as an adaptation to the XML domain of the LDA topic model [6] for unstructured text data. Under MUESLI, the semantics of the observed XML documents is modeled as a probability distribution over a number of latent (i.e., a priori unknown) topics. In turn, each such topic consists of two multinomial probability distributions, placed over the word tokens and the root-to-leaf paths, respectively. Both probability distributions are randomly sampled from respective Dirichlet priors. The latent topics are inferred from the observed XML documents by conventional Bayesian reasoning. For this purpose, approximate posterior inference and parameter estimation are derived, and a Gibbs sampling algorithm implementing both is designed. MUESLI differs from previous topic models of documents with text and tags (e.g., [10, 20, 21, 23, 24]) primarily in the generation of document structure. In particular, [20, 21, 23, 24] are not explicitly meant for XML corpora, while [10] proposes the only previous topic model for the XML domain. The latter differs from MUESLI in that its topics are not also characterized by a specific probability distribution over the root-to-leaf paths.

The second approach combines XML document clustering and topic modeling into one unified process. For this purpose, a new generative model of XML corpora, named PAELLA (toPicAl clustEr anaLysis of xmL corporA), is presented. Essentially, PAELLA describes a generative process in which XML document clustering and topic modeling act as interacting latent factors that rule the formation of the observed XML documents. Technically, this is accomplished by incorporating MUESLI into an innovative Bayesian probabilistic model that also associates a latent cluster-membership random variable with each XML document. To the best of our knowledge, the integration of document clustering and topic modeling is unprecedented in the XML domain, and PAELLA is the first effort along this previously unexplored line of research.

A comparative evaluation on real-world XML corpora reveals the superior effectiveness of the devised approaches.

This paper proceeds as follows. Section 2 presents notation and preliminaries. Sections 3 and 4 cover the approaches based on MUESLI and PAELLA, respectively. Section 5 provides a comparative evaluation of our approaches on real-world benchmark XML corpora. Section 6 concludes and highlights future research.

2 Preliminaries

In this section, we introduce the adopted notation and some basic concepts.

2.1 Traditional Tree-Based XML Document Representation

The structure and content of an XML document with no references [1] can be modeled through a suitable XML tree representation, which refines the traditional notion of rooted labeled tree to also capture content and its nesting into structure.

An XML tree is a rooted, labeled tree, represented as a tuple \(\mathbf {t}= (\mathbf {V}_{\mathbf {t}}, r_{\mathbf {t}}, \mathbf {E}_{\mathbf {t}}, \lambda _{\mathbf {t}})\), whose elements have the following meaning. \(\mathbf {V}_{\mathbf {t}} \subseteq \mathbb {N}\) is a set of nodes and \(r_{\mathbf {t}} \in \mathbf {V}_{\mathbf {t}}\) is the root of \(\mathbf {t}\), i.e., the only node with no entering edges. \(\mathbf {E}_{\mathbf {t}} \subseteq \mathbf {V}_{\mathbf {t}} \times \mathbf {V}_{\mathbf {t}}\) is a set of edges, capturing the parent-child relationships between nodes of \(\mathbf {t}\). \(\lambda _{\mathbf {t}}: \mathbf {V}_{\mathbf {t}} \mapsto \varSigma \) is a node labeling function, with \(\varSigma \) being an alphabet of node tags (i.e., labels).

Notice that, in an XML tree, the elements of an XML document are not distinguished from its attributes: both are mapped to nodes in the XML-tree representation.

Let \(\mathbf {t}\) be a generic XML tree. Nodes in \(\mathbf {V}_{\mathbf {t}}\) can be divided into two disjoint subsets: the set \(\mathbf {L}_{\mathbf {t}}\) of leaves and the set \(\mathbf {V}_{\mathbf {t}} - \mathbf {L}_{\mathbf {t}}\) of inner nodes. An inner node has at least one child. A leaf has no children and can only enclose textual items.

A root-to-leaf path \(p^{r_{\mathbf {t}}}_{l}\) in \(\mathbf {t}\) is a sequence of nodes encountered along the path from the root \(r_{\mathbf {t}}\) to a leaf node l in \(\mathbf {L}_{\mathbf {t}}\), i.e., \(p^{r_{\mathbf {t}}}_{l} = {<}r_{\mathbf {t}}, \ldots , l{>}\). Notation \(\lambda _{\mathbf {t}}(p^{r_{\mathbf {t}}}_{l})\) denotes the sequence of labels that are associated in the XML tree \(\mathbf {t}\) with the nodes of path \(p^{r_{\mathbf {t}}}_{l}\), i.e., \(\lambda _{\mathbf {t}}(p^{r_{\mathbf {t}}}_{l}) = {<}\lambda _{\mathbf {t}}(r_{\mathbf {t}}), \ldots , \lambda _{\mathbf {t}}(l){>}\). The set of all root-to-leaf paths in \(\mathbf {t}\) is denoted as \( paths (\mathbf {t}) = \{ p^{r_{\mathbf {t}}}_{l} | l \in \mathbf {L}_{\mathbf {t}}\}\).

Let l be a leaf in \(\mathbf {L}_{\mathbf {t}}\). The set \(\textit{text-items}(l) = \{ w_1, \ldots , w_h\} \) models the text items provided by l, with one element \(w_i\) (for \(i=1, \ldots, h\)) per distinct text item in the context of l. The whole text content of the XML tree \(\mathbf {t}\) is denoted as \(\textit{text-items}(\mathbf {t}) = \cup _{l \in \mathbf {L}_{\mathbf {t}}} \textit{text-items}(l)\).

Notation \(\lambda _{\mathbf {t}}(p^{r_{\mathbf {t}}}_{l}).w_h\) indicates an enriched path and is used to explicitly represent the nested occurrence of the text item \(w_h\) in the structural context of the labeled root-to-leaf path \(p^{r_{\mathbf {t}}}_{l}\). Notice that prefixing a content item with the sequence of labels of the respective root-to-leaf path is an instance of tagging [7, 28]. The collection of all enriched paths in \(\mathbf {t}\) is indicated as \( paths ^{(e)}(\mathbf {t}) = \cup _{{l \in \mathbf {L}_{\mathbf {t}}}, {w \in \textit{text-items}(l)}} \{ \lambda _{\mathbf {t}}(p^{r_{\mathbf {t}}}_{l}).w \}\).

Hereafter, the notions of XML document and XML tree are used interchangeably. Moreover, to avoid cluttering notation, the generic (labeled) root-to-leaf path and (labeled) enriched path are indicated as p and p.w, respectively.

2.2 XML Features for Topic Modeling

The design of topic models for the XML domain benefits from the adoption of a flat representation for the XML documents, since the underlying generative process is relieved of nesting text items into arbitrarily complex tree structures.

The generic XML document \(\mathbf {t}\) can be flattened into a collection \(\mathbf {x}^{(\mathbf {t})}\) of XML features chosen from its tree-based model. In this paper, we represent \(\mathbf {t}\) as a bag of enriched paths, since such XML features preserve the nesting of text items into root-to-leaf paths. Accordingly, we define \(\mathbf {x}^{(\mathbf {t})}\triangleq \{ p.w \mid p.w \in paths ^{(e)}(\mathbf {t}) \}\).
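For illustration purposes, a minimal Python sketch of this flattening is reported below. It relies only on the standard library; the sample document, the whitespace tokenization and the path separator are illustrative assumptions, and attributes are omitted for brevity.

```python
# A minimal sketch of the bag-of-enriched-paths representation; names,
# tokenization and the sample document are illustrative assumptions.
import xml.etree.ElementTree as ET

def enriched_paths(xml_string):
    """Flatten an XML document t into its bag of enriched paths x^(t)."""
    root = ET.fromstring(xml_string)
    bag = []

    def visit(node, labels):
        labels = labels + [node.tag]
        children = list(node)
        if not children:
            # leaf: one enriched path per distinct text item in text-items(l)
            for w in sorted(set((node.text or "").split())):
                bag.append(("/".join(labels), w))
        for child in children:
            visit(child, labels)

    visit(root, [])
    return bag

doc = "<article><title>latent topics</title><body><p>xml clustering</p></body></article>"
print(enriched_paths(doc))
# [('article/title', 'latent'), ('article/title', 'topics'),
#  ('article/body/p', 'clustering'), ('article/body/p', 'xml')]
```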

3 MUESLI: A Topic Model for Clustering XML Corpora

MUESLI (xMl clUstErS from Latent topIcs) is a new hierarchical topic model of XML corpora, conceived as an adaptation of the basic LDA model [6] to the XML domain. More precisely, let \(\mathbf {D} = \{ \mathbf {x}^{(\mathbf {t})}| \mathbf {t}\in \mathcal {D} \}\) be the bag-of-enriched-paths representation of an input XML corpus \(\mathcal {D}\), in which the individual XML documents are characterized as discussed in Sect. 2.2. MUESLI is a Bayesian probabilistic model of the imaginary process that generates \(\mathbf {D}\).

Such a generative process is assumed to be influenced by K latent topics. Each XML document \(\mathbf {x}^{(\mathbf {t})}\) in \(\mathbf {D}\) (or, also, \(\mathbf {t}\) in \(\mathcal {D}\)) exhibits the different topics to distinct degrees. This is captured by associating \(\mathbf {x}^{(\mathbf {t})}\) with an unknown probability distribution \(\varvec{\vartheta }_{\mathbf {t}}\) over the individual topics \(k=1, \ldots , K\), such that \(\varvec{\vartheta }_{\mathbf {t}, k}\) is the probability of topic k within \(\mathbf {x}^{(\mathbf {t})}\). In turn, each topic consists of

  • an unknown probability distribution \(\varvec{\varphi }_k\) over the text items in the vocabulary \(\mathcal {I} \triangleq \cup _{\mathbf {t}\in \mathcal {D}} \textit{text-items}(\mathbf {t})\), such that \(\varvec{\varphi }_{k,w}\) indicates the probability in topic k of the generic text item w from \(\mathcal {I}\);

  • an unknown probability distribution \(\varvec{\psi }_k\) over the root-to-leaf paths in the vocabulary \(\mathcal {R} \triangleq \cup _{\mathbf {t}\in \mathcal {D}} paths (\mathbf {t})\), such that \(\varvec{\psi }_{k,p}\) captures the probability in topic k of the generic root-to-leaf path p from \(\mathcal {R}\).

Figure 1(a) formalizes the conditional (in)dependencies among the random variables of MUESLI through a graphical representation in plate notation. All random variables of MUESLI are represented as nodes. Shaded nodes mark observed random variables, whose values are the observed results of the generation process (i.e., the XML documents in their bag-of-enriched-paths representation). Unshaded nodes indicate hidden random variables, whose values correspond to latent (or unobserved) aspects (i.e., sampled distributions and topic assignments). Plates (i.e., rectangles) indicate replication.

Fig. 1. Graphical representation of MUESLI (a) and PAELLA (b)

Based on the conditional (in)dependencies of Fig. 1(a), the generative probabilistic process assumed by MUESLI realizes all random variables as algorithmically detailed in Fig. 2. Notice that \(\varvec{\alpha }\), \(\varvec{\beta }\) and \(\varvec{\gamma }\) are hyperparameters of the MUESLI model; their role is clarified in Sect. 3.1.

Fig. 2. The probabilistic generative process under MUESLI
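To make the process of Fig. 2 concrete, the following is a minimal sketch of the assumed generation with symmetric hyperparameters; all sizes and names are illustrative assumptions.

```python
# A sketch of the generative process assumed by MUESLI (Fig. 2), with
# symmetric hyperparameters; sizes and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, V, R, D, N = 4, 50, 10, 3, 20   # topics, words, paths, documents, paths/doc
alpha, beta, gamma = 0.1, 0.01, 0.01

Psi = rng.dirichlet(np.full(R, beta), size=K)    # psi_k ~ Dirichlet(beta)
Phi = rng.dirichlet(np.full(V, gamma), size=K)   # phi_k ~ Dirichlet(gamma)

corpus = []
for t in range(D):
    theta = rng.dirichlet(np.full(K, alpha))     # per-document topic mixture
    doc = []
    for n in range(N):
        z = rng.choice(K, p=theta)               # topic assignment z_{t,n}
        p = rng.choice(R, p=Psi[z])              # root-to-leaf path p^(t,n)
        w = rng.choice(V, p=Phi[z])              # text item w^(t,n)
        doc.append((p, w))
    corpus.append(doc)
```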

3.1 Observed-Data Likelihood and Prior Distributions

Let \(\mathbf {x}^{(\mathbf {t})}= \{ p^{(\mathbf {t},1)}.w^{(\mathbf {t},1)}, \ldots , p^{(\mathbf {t},N_{\mathbf {t}})}.w^{(\mathbf {t},N_{\mathbf {t}})} \}\) be the flattened representation of the XML tree \(\mathbf {t}\) from the XML corpus \(\mathcal {D}\), in which \(N_{\mathbf {t}}\) stands for the number of enriched paths in \(\mathbf {t}\). Moreover, let \(\mathbf {z}^{(\mathbf {t})}\triangleq \{ z_{\mathbf {t},1}, \ldots , z_{\mathbf {t},N_{\mathbf {t}}} \}\) be the collection of topic assignments in \(\mathbf {t}\), i.e., the generic element \(z_{\mathbf {t},i}\) is the latent topic of the corresponding enriched path \(p^{(\mathbf {t},i)}.w^{(\mathbf {t},i)}\) in \(\mathbf {x}^{(\mathbf {t})}\) (with \(i = 1, \ldots , N_{\mathbf {t}}\)). In addition, assume that \(\varvec{P}\) and \(\varvec{W}\) denote, respectively, all observed root-to-leaf paths and text items, i.e., \(\varvec{P} \triangleq \cup _{\mathbf {t}\in \mathcal {D}} \{ p^{(\mathbf {t},1)}, \ldots , p^{(\mathbf {t},N_{\mathbf {t}})}\}\) and \(\varvec{W} \triangleq \cup _{\mathbf {t}\in \mathcal {D}}\{ w^{(\mathbf {t},1)}, \ldots , w^{(\mathbf {t},N_{\mathbf {t}})}\}\).

The data likelihood can be formalized as the following conditional probability distributions over \(\varvec{P}\) and \(\varvec{W}\)

$$\begin{aligned} \Pr (\varvec{P}|\varvec{Z},\varvec{\varPsi }) = \prod _{k=1}^K\prod _{p \in \mathcal {R}} \varvec{\psi }_{k,p}^{n_k^{(p)}} \quad \Pr (\varvec{W}|\varvec{Z},\varvec{\varPhi }) = \prod _{k=1}^K\prod _{w \in \mathcal {I}} \varvec{\varphi }_{k,w}^{n_k^{(w)}} \end{aligned}$$

where

  • \(n_k^{(p)}\) stands for the occurrences of the root-to-leaf path p under the topic k;

  • \(n_k^{(w)}\) stands for the occurrences of the text item w under the topic k;

  • \(\varvec{\varPsi }\) is a compact notation denoting all topic-specific root-to-leaf path distributions, i.e., \(\varvec{\varPsi } \triangleq \{ \varvec{\psi }_1, \ldots , \varvec{\psi }_K\}\) (with K being the number of latent topics);

  • \(\varvec{\varPhi }\) is a compact notation denoting all topic-specific word distributions, i.e., \(\varvec{\varPhi } \triangleq \{ \varvec{\varphi }_1, \ldots , \varvec{\varphi }_K\}\);

  • \(\varvec{Z}\) compactly denotes all topic assignments in \(\mathcal {D}\), i.e., \(\varvec{Z} \triangleq \{ \mathbf {z}^{(\mathbf {t})}| \mathbf {x}^{(\mathbf {t})}\in \mathbf {D} \}\).

Furthermore, the conditional probability distribution over \(\varvec{Z}\) is

$$\begin{aligned} \Pr (\varvec{Z} | \varvec{\varTheta }) = \prod _{\mathbf {t}\in \mathcal {D}}\prod _{k=1}^K \vartheta _{\mathbf {t},k}^{n_{\mathbf {t}}^{(k)}} \end{aligned}$$

where

  • \(n_{\mathbf {t}}^{(k)}\) stands for the occurrences of the topic k in the XML document \(\mathbf {t}\);

  • \(\varvec{\varTheta }\) is a compact notation, that stands for the whole set of the topic distributions associated with the individual XML documents, i.e., \(\varvec{\varTheta } \triangleq \{ \varvec{\vartheta }_{\mathbf {t}} | \mathbf {t}\in \mathcal {D} \}\).

In compliance with standard Bayesian modeling, under MUESLI, uncertainty on \(\varvec{\varPsi }\), \(\varvec{\varPhi }\) and \(\varvec{\varTheta }\) is captured by means of the following conjugate Dirichlet priors

$$\begin{aligned} \Pr (\varvec{\varPsi }|\varvec{\beta }) = { \prod _{k=1}^K\frac{1}{\Delta (\varvec{\beta })}\prod _{p \in \mathcal {R}} \varvec{\psi }_{k,p}^{\varvec{\beta }_p - 1}} \quad \Pr (\varvec{\varPhi }| \varvec{\gamma }) = \prod _{k=1}^K \frac{1}{\Delta (\varvec{\gamma })} \prod _{w \in \mathcal {I} } \varvec{\varphi }_{k,w}^{\varvec{\gamma }_w - 1} \quad \Pr (\varvec{\varTheta }|\varvec{\alpha }) = \prod _{\mathbf {t}\in \mathcal {D}} \frac{1}{\Delta (\varvec{\alpha })}\prod _{k=1}^K\varvec{\vartheta }_{\mathbf {t},k}^{\varvec{\alpha }_k - 1} \end{aligned}$$

The above \(\varvec{\beta }=\{ \varvec{\beta }_p | p \in \mathcal {R} \}\), \(\varvec{\alpha }= \{ \varvec{\alpha }_k | k=1, \ldots , K \}\) and \(\varvec{\gamma }=\{ \varvec{\gamma }_w | w \in \mathcal {I}\}\) are three hyperparameters. Their generic elements \(\varvec{\beta }_p\), \(\varvec{\alpha }_k\) and \(\varvec{\gamma }_w\) represent suitable pseudo-counts, enabling the incorporation of domain-specific prior knowledge [17] into the exploratory analysis of the latent topics in \(\mathcal {D}\).

3.2 Approximate Posterior Inference and Parameter Estimation

MUESLI is a generative model of XML corpora given their latent aspects. Essentially, it postulates assumptions explaining how such latent aspects govern the generation of the individual XML documents. Nonetheless, in order to cluster the XML documents by their latent topics, one has to infer the latent aspects (including the aforesaid topic distributions) from the XML documents. Posterior inference is used for this purpose.

As generally happens with probabilistic models of practical interest, exact posterior inference under MUESLI is intractable, due to the complexity of the posterior distribution. Thus, we resort to collapsed Gibbs sampling, a Markov chain Monte Carlo method for approximate inference [3, 5], which enables simple inference algorithms even when the number of hidden variables is very large [5, 17]. The pseudo code of Gibbs sampling under MUESLI is sketched in Algorithm 1. The full conditional below is used for sampling (at step 10) any topic assignment \(\varvec{z}_{\mathbf {t},n}\), given all other topic assignments \(\varvec{Z}_{\lnot (\mathbf {t},n)}\) and the observed data \(\varvec{W}\) and \(\varvec{P}\)

$$\begin{aligned}&\Pr (\varvec{z}_{\mathbf {t},n}=k|\varvec{Z}_{\lnot (\mathbf {t},n)}, \varvec{W}, \varvec{P}, \varvec{\alpha }, \varvec{\beta }, \varvec{\gamma }) \nonumber \\&\qquad = \frac{n_k^{(w)} - 1 + \varvec{\gamma }_w}{\sum _{w' \in \mathcal {I}} (n_k^{(w')} + \varvec{\gamma }_{w'}) - 1}\cdot \frac{n_k^{(p)} - 1 + \varvec{\beta }_p}{\sum _{p' \in \mathcal {R}} (n_k^{(p')} + \varvec{\beta }_{p'}) - 1}\cdot \frac{n^{(k)}_{\mathbf {t}} - 1 + \varvec{\alpha }_k}{\sum _{k'=1}^K (n^{(k')}_{\mathbf {t}} + \varvec{\alpha }_{k'}) - 1} \end{aligned}$$
(1)
Algorithm 1. Collapsed Gibbs sampling under MUESLI
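As an illustration, the following is a minimal sketch of one sweep of the collapsed Gibbs sampler, implementing the full conditional of Eq. (1); the count arrays, the data layout and the use of symmetric priors are assumptions of this sketch, not the exact implementation of Algorithm 1.

```python
# One sweep of collapsed Gibbs sampling under MUESLI, implementing Eq. (1);
# n_kw, n_kp, n_tk are topic-word, topic-path and document-topic counts.
import numpy as np

def gibbs_sweep(docs, z, n_kw, n_kp, n_tk, alpha, beta, gamma, rng):
    """docs[t] lists (path_id, word_id) pairs; z[t][n] is the current topic."""
    K, V = n_kw.shape
    R = n_kp.shape[1]
    for t, doc in enumerate(docs):
        for n, (p, w) in enumerate(doc):
            k = z[t][n]
            # remove the current assignment from the counts
            n_kw[k, w] -= 1; n_kp[k, p] -= 1; n_tk[t, k] -= 1
            # Eq. (1) evaluated for all K topics at once; the per-document
            # denominator is constant in k and cancels after normalization
            prob = ((n_kw[:, w] + gamma) / (n_kw.sum(axis=1) + V * gamma)
                    * (n_kp[:, p] + beta) / (n_kp.sum(axis=1) + R * beta)
                    * (n_tk[t, :] + alpha))
            k = rng.choice(K, p=prob / prob.sum())
            # add the newly sampled assignment back to the counts
            n_kw[k, w] += 1; n_kp[k, p] += 1; n_tk[t, k] += 1
            z[t][n] = k
```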

Concerning parameter estimation, due to conjugacy, \(\Pr (\varvec{\vartheta }_{\mathbf {t}}|\mathbf {z}^{(\mathbf {t})}, \varvec{\alpha })\), \(\Pr (\varvec{\varphi }_{k}|\varvec{Z}, \varvec{W}, \varvec{\gamma })\) and \(\Pr (\varvec{\psi }_k|\varvec{Z}, \varvec{P}, \varvec{\beta })\) are Dirichlet distributions. Thus, by using the expectation of the Dirichlet distribution [17], one can calculate the following parameter estimates

$$\begin{aligned}&\varvec{\vartheta }_{\mathbf {t},k} = \frac{\varvec{n}_{\mathbf {t}}^{(k)} + \varvec{\alpha }_k}{\sum _{k'=1}^K (\varvec{n}_{\mathbf {t}}^{(k')} + \varvec{\alpha }_{k'})}, \quad \mathbf {t}\in \mathcal {D} \wedge k=1,\ldots ,K\end{aligned}$$
(2)
$$\begin{aligned}&\varvec{\varphi }_{k,w} = \frac{\varvec{n}_{k}^{(w)} + \varvec{\gamma }_w}{\sum _{w' \in \mathcal {I}} (\varvec{n}_{k}^{(w')} + \varvec{\gamma }_{w'})}, \quad k=1, \ldots , K \wedge w \in \mathcal {I}\end{aligned}$$
(3)
$$\begin{aligned}&\varvec{\psi }_{k,p} = \frac{\varvec{n}_{k}^{(p)} + \varvec{\beta }_p}{\sum _{p' \in \mathcal {R}} (\varvec{n}_{k}^{(p')} + \varvec{\beta }_{p'})}, \quad k=1, \ldots , K \wedge p \in \mathcal {R} \end{aligned}$$
(4)
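Continuing the count arrays of the sampler sketch above, and under the assumption of symmetric hyperparameters, the estimates of Eqs. (2)-(4) reduce to normalized smoothed counts:

```python
# Point estimates of Eqs. (2)-(4), assuming symmetric hyperparameters and
# the count arrays n_tk, n_kw, n_kp of the Gibbs sampling sketch above.
Theta = (n_tk + alpha) / (n_tk.sum(axis=1, keepdims=True) + K * alpha)   # Eq. (2)
Phi   = (n_kw + gamma) / (n_kw.sum(axis=1, keepdims=True) + V * gamma)   # Eq. (3)
Psi   = (n_kp + beta)  / (n_kp.sum(axis=1, keepdims=True) + R * beta)    # Eq. (4)
```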

3.3 Partitioning Algorithms

The MUESLI topic model produces a lower-dimensional mixed-membership representation \(\varvec{\varTheta }\) of the XML corpus \(\mathcal {D}\), by projecting the individual XML documents into a K-dimensional space of latent topics. The parameters \(\varvec{\varTheta }\) establish the degree of participation of the individual XML documents in the distinct latent topics. We next discuss two techniques for partitioning \(\mathcal {D}\) based on \(\varvec{\varTheta }\).

Naive Partitioning. This technique places each XML document \(\mathbf {t}\) into the cluster \(C^* = \mathop {\mathrm {argmax}}_{k=1, \ldots , K}\, \varvec{\vartheta }_{\mathbf {t},k}\), with \(C^*\) corresponding to the most representative topic of \(\mathbf {t}\) according to MUESLI, as sketched below.
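Under the estimates of Eq. (2), this amounts to a row-wise argmax over the document-topic matrix, e.g.:

```python
# Naive partitioning: each document joins the cluster of its dominant topic;
# Theta is the |D| x K matrix of estimates from Eq. (2).
clusters = Theta.argmax(axis=1)
```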

K-Medoids Partitioning. A more sophisticated technique for separating \(\mathcal {D}\) based on MUESLI consists of partitioning the topic distributions \(\varvec{\varTheta }\). This allows for grouping the XML documents by their cross-topic similarity, as well as for using a number K of latent topics larger than the number \(\overline{K}\) of clusters to find in \(\mathcal {D}\). Both are expected to enable a more accurate separation of \(\mathcal {D}\). k-medoids [16] is a well-known clustering algorithm that can be chosen to partition \(\varvec{\varTheta }\), because of its effectiveness and robustness to noise as well as outliers. k-medoids involves the computation of the intra-cluster divergences. To this end, we use the square root of the Jensen-Shannon divergence, which was shown to be a metric [14].
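A minimal sketch of this strategy is given below; scipy's jensenshannon already returns the square root of the Jensen-Shannon divergence, and the plain alternating refinement is an illustrative stand-in for the actual algorithm of [16].

```python
# k-medoids over the topic distributions Theta under the Jensen-Shannon
# distance; a simple alternating refinement, illustrative of [16].
import numpy as np
from scipy.spatial.distance import jensenshannon

def k_medoids(Theta, k_bar, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(Theta)
    # pairwise distances between per-document topic mixtures (a metric [14])
    dist = np.array([[jensenshannon(a, b) for b in Theta] for a in Theta])
    medoids = rng.choice(n, size=k_bar, replace=False)
    for _ in range(iters):
        labels = dist[:, medoids].argmin(axis=1)   # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k_bar):
            # re-pick the medoid as the member minimizing intra-cluster cost
            members = np.flatnonzero(labels == c)
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels
```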

4 PAELLA: Joint XML Clustering and Topic Modeling

PAELLA (toPicAl clustEr anaLysis of xmL corporA) is an innovative generative model of XML corpora, in which document clustering and topic modeling act as simultaneous and interdependent latent factors in the formation of the individual XML documents. Essentially, PAELLA envisages a scenario in which each XML document \(\mathbf {x}^{(\mathbf {t})}\) is associated with a corresponding latent cluster membership \(c_{\mathbf {t}}\), as in [26]. \(c_{\mathbf {t}}\) is randomly sampled from an unknown cluster distribution \(\varvec{\eta }\). Furthermore, the underlying semantics \(\varvec{\vartheta }_{\mathbf {t}}\) of the XML document \(\mathbf {x}^{(\mathbf {t})}\) is an unknown distribution over K latent topics, which are individually characterized as in the MUESLI topic model of Sect. 3. Figure 1(b) shows the graphical representation of PAELLA. Its generative process is detailed in Fig. 3.

Under PAELLA, collapsed Gibbs sampling is exploited to perform approximate posterior inference of \(c_{\mathbf {t}}\) and \(\mathbf {z}^{(\mathbf {t})}\) for each XML document \(\mathbf {x}^{(\mathbf {t})}\). In addition, parameter estimation is used to calculate the cluster distribution \(\varvec{\eta }\), the topic distribution \(\varvec{\vartheta }_{\mathbf {t}}\) of each XML document \(\mathbf {x}^{(\mathbf {t})}\), as well as the distributions \(\varvec{\varphi }_{k}\) and \(\varvec{\psi }_{k}\) of each topic \(k= 1, \ldots , K\). The mathematical and algorithmic details of collapsed Gibbs sampling and parameter estimation under PAELLA are omitted due to space limitations, being similar to the respective developments in Sect. 3.2.

Fig. 3. The probabilistic generative process under PAELLA
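For illustration, the following fragment hints at the opening steps of such a cluster-aware generation, reusing the names of the MUESLI sketch of Sect. 3; the cluster-specific Dirichlet priors are an assumption of this fragment (one plausible way to let \(c_{\mathbf {t}}\) condition \(\varvec{\vartheta }_{\mathbf {t}}\)), not necessarily the exact parameterization of Fig. 3.

```python
# A hedged fragment of a cluster-aware generative step in the spirit of
# PAELLA; C and alpha_c are assumptions of this illustration, and rng, K,
# D, np are reused from the MUESLI sketch of Sect. 3.
C = 5                                        # number of latent clusters
eta = rng.dirichlet(np.full(C, 1.0))         # unknown cluster distribution
alpha_c = rng.gamma(1.0, 1.0, size=(C, K))   # hypothetical cluster-specific priors

for t in range(D):
    c_t = rng.choice(C, p=eta)               # latent cluster membership c_t
    theta = rng.dirichlet(alpha_c[c_t])      # topic mixture conditioned on c_t
    # ...the rest proceeds as in the MUESLI sketch (topics, paths, words)
```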

5 Evaluation

In this section, we empirically assess the effectiveness of our approaches to XML clustering in comparison to various state-of-the-art competitors. In the following, the naive and K-Medoids clustering techniques adopted in conjunction with MUESLI are named, respectively, Naive and K-Medoids.

5.1 XML Corpora, Competitors and Evaluation Measures

All tests are carried out on Wikipedia and Sigmod, two real-world benchmark XML corpora that are often used in the literature to evaluate techniques for XML classification and clustering.

Wikipedia was adopted as the test-bed for the task of XML clustering by both content and structure in the context of the XML Mining Track at INEX 2007 [13]. The overall corpus consists of 47,397 articles from the online digital encyclopedia, organized into 19 classes (or thematic categories). Each such class corresponds to a different Wikipedia Portal.

The 140 XML documents of the Sigmod corpus represent a portion of the SIGMOD Record issues. The documents comply with two different DTDs and were initially used to evaluate the effectiveness of XML structural clustering techniques (e.g., in [2]). However, the minimal number of structural classes makes this task not truly challenging. Thus, in our experimentation, we consider a rearrangement of Sigmod into the 5 general classes proposed in [19]. These classes were formed, by means of expert knowledge, to reflect corresponding groups of structural and content features of the underlying XML documents.

Interestingly, the choice of Wikipedia and Sigmod allows for assessing the effectiveness of our approaches on XML corpora with diverging features. In particular, while the XML documents in Wikipedia can be viewed as schema-less XML trees with a deep structure and a high branching factor, Sigmod includes a much smaller number of XML trees with two distinct schema definitions [19]. Table 1 summarizes a selection of primary statistics of the chosen XML corpora.

Table 1. Characteristics of the chosen XML corpora

Naive, K-Medoids and PAELLA are compared on Wikipedia and Sigmod against several state-of-the-art competitors, i.e., HPXTD [10], MCXTD [10], XC-NMF [9], XPEC [8], XCFS [19], HCX [18], CRP [27], 4RP [27], SOM [15] and LSK [25].

The clustering effectiveness of all competitors is measured in terms of macro-averaged and micro-averaged purity, according to the standard evaluation guidelines of the XML Mining Track at INEX 2007 [13].
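For reference, a short sketch of both measures is given below; the purity of a cluster is the fraction of its documents belonging to the majority class, macro-averaging takes the unweighted mean over clusters, and micro-averaging weights clusters by size. Array names are illustrative.

```python
# Macro- and micro-averaged purity; `labels` are predicted cluster ids and
# `classes` the ground-truth categories, both numpy integer arrays.
import numpy as np

def purities(labels, classes):
    per_cluster, sizes = [], []
    for c in np.unique(labels):
        counts = np.bincount(classes[labels == c])
        per_cluster.append(counts.max() / counts.sum())  # majority-class share
        sizes.append(counts.sum())
    macro = np.mean(per_cluster)                         # unweighted over clusters
    micro = np.dot(per_cluster, sizes) / np.sum(sizes)   # weighted by cluster size
    return macro, micro

labels = np.array([0, 0, 1, 1, 1])
classes = np.array([2, 2, 0, 0, 1])
print(purities(labels, classes))   # (0.8333..., 0.8)
```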

5.2 Partitioning Effectiveness

All competitors are tested in the discovery of a number of clusters that amounts to the actual number of natural classes in the chosen XML corpora.

Cluster discovery through MUESLI and PAELLA also involves setting a reasonable number of underlying topics. In the context of the Naive clustering strategy, MUESLI was trained to unveil, in both Sigmod and Wikipedia, as many latent topics as the number of natural classes within the respective XML corpora. Instead, a preliminary sensitivity analysis was conducted to determine the number of topics under K-Medoids and PAELLA. This was accomplished by ranging the number of topics in the interval [5, 30] over Sigmod and [10, 60] over Wikipedia. Figure 4 shows the sensitivity of clustering effectiveness under K-Medoids and PAELLA to the number of topics. We fixed the number of topics under K-Medoids and PAELLA so as to maximize their clustering effectiveness.

The clustering effectiveness of all competitors is compared in Fig. 5. PAELLA and K-Medoids deliver superior clustering effectiveness, being aware of the whole semantics of the individual XML documents. Naive achieves a lower effectiveness compared to both PAELLA and K-Medoids, since cluster assignment is determined for each XML document only on the basis of its most pertinent topic. This does not allow for grouping the XML documents on an actual cross-topic similarity basis. Moreover, with Naive, inference under MUESLI is subject to the constraint that the number of topics equal the number of clusters. Such limitations affect neither PAELLA nor K-Medoids, which naturally exploit the specificity of MUESLI (i.e., modeling the semantics of an XML corpus with no prior restrictions on the actual number of underlying topics) in order to group the XML documents by their respective topic mixtures.

The superiority of PAELLA with respect to K-Medoids is due to the fact that the former seamlessly integrates MUESLI as a natural complement with which to enhance XML document clustering.

Noticeably, the better clustering performance delivered by Naive and K-Medoids in comparison with HPXTD and MCXTD, respectively, substantiates the rationality of enriching topics under MUESLI with probability distributions over root-to-leaf paths. Clearly, such a modeling choice also contributes to the performance gain attained by PAELLA.

Fig. 4. Sensitivity of K-Medoids and PAELLA to the number of topics

Fig. 5. Macro-averaged and micro-averaged purity on Sigmod (a) and Wikipedia (b)

Next, we illustrate the behavior of the proposed approaches by inspecting the results output by PAELLA over Sigmod.

Figure 6 shows the topic mixtures associated with the 5 uncovered clusters, obtained by averaging the topic distributions of the individual documents therein. Additionally, Table 2 details two inferred word topics. Each topic is summarized by its top-5 most relevant words, whose clarity, specificity and coherence enable the intuitive interpretations in brackets.

Fig. 6. Topic distribution across clusters

Table 2. Two Sigmod topics

6 Conclusions and Future Research

We proposed two innovative machine-learning approaches to clustering XML corpora by latent topic homogeneity. The empirical evidence from experiments on real-world benchmark XML corpora showed the superiority of the devised approaches over several state-of-the-art competitors.

As future research, it is interesting to refine MUESLI and PAELLA in order to also account for the syntactic and semantic relationships among words [4, 22], which is expected to improve XML clustering effectiveness. Finally, the incorporation of an n-gram topic model for text items is likely beneficial to more accurately capture the meaning of the textual content of the XML documents. In turn, this may further increase clustering effectiveness [11, 12].