1 Introduction

The last decade has witnessed a huge growth in the collection of rating (e.g., Amazon, IMDb, Yelp, Foursquare) or vote (e.g., Parltrack, Voteview) data. Such data depict the opinions (i.e., reviews or votes) of people (e.g., IMDb users, members of the European Parliament) on items (e.g., movies, restaurants, ballots) and need to be analyzed by leveraging contextual information to discover actionable insights that cannot be obtained otherwise. The analysis of such data has risen rapidly in many applications, such as fact checking or lead finding in political journalism, and collaborative rating analysis.

Fact checking has become increasingly common in political journalism. It contributes to the quality of news provided by the media (Footnote 1). For instance, Truth-O-Meter (Footnote 2) was extensively used during the 2016 US presidential debate. Delving into voting sessions makes it possible to shed light on claims about consensus between politicians or to find flashpoints (i.e., contexts that lead to strong disagreement). Similarly, in collaborative rating analysis, the average rating of an item is not informative enough. While some individuals agree on many items, they can strongly disagree on certain types of items. Such information can be used directly for recommendation. For example, in the Movielens dataset, middle-aged women usually agree with middle-aged men w.r.t. their overall ratings, yet these two collections disagree on Comedy movies released in 1998.

The discovery of descriptions that distinguish a group of objects given a target (class) has been widely studied in the data mining and machine learning communities under several names (subgroup discovery, emerging patterns, contrast sets) [14]. We consider here the well-established framework of subgroup discovery (SD) [22]. Given a set of objects described by a vector of attributes (of Boolean, nominal, or numerical type) and a class label as a target, the goal is to efficiently discover subgroups of objects for which the label distribution within the group differs strongly from the distribution within the whole dataset. SD has been extended to a richer framework that handles more complicated target concepts, the so-called Exceptional Model Mining (EMM) [17]. A model is built over the labels of the objects in the subgroup and is compared to the model of the whole dataset using a quality measure. The more the model differs, the more exceptional the subgroup is. Many models have been investigated in the last decade [6, 7, 8, 13, 21]. However, no model in the EMM framework makes it possible to characterize collections of individuals whose pairwise agreement exceptionally deviates on a subset of objects.

In this paper, we introduce the problem of discovering collections of individuals and particular contexts where their pairwise agreement exceptionally differs from their usual one as an instance of EMM. Figure 1 gives an overview of our approach. Based on an aggregation level set a priori, the method begins by constituting collections of individuals (1). Bi-sets of individuals are identified by a description (2) and their global pairwise behavior is computed (3). The method eventually aims to identify subsets of reviewed items (4) for which the related pairwise behavior (5) substantially differs from the global one (6). To discover such patterns, we have to simultaneously explore the search space associated with the reviewed items and the search space associated with the reviewers. To this end, we devise the method DSC (Discovering Similarities Changes) to discover three-set patterns \((context, collection_1, collection_2)\) that identify a context and two collections of individuals where an unexpected strengthening or weakening of pairwise agreement is observed. We define closure operators and effective pruning techniques based on the computation of tight upper bounds on the quality measure to efficiently explore the search space. DSC is able to handle numerical and nominal attributes as well as hierarchical multi-tag attributes. The main contributions of this paper are threefold:

Problem Formulation. We define the novel problem of exceptional pairwise behavior discovery in the EMM framework. This formulation makes it possible to consider several similarity measures to assess the pairwise agreement.

Algorithm and Analysis. We propose a branch-and-bound algorithm that efficiently exploits tight upper bounds and closure operators.

Evaluation. We report a thorough empirical study on real-world datasets that demonstrates the efficiency and the effectiveness of DSC.

Fig. 1. Overview of DSC

The rest of the paper is organized as follows. Section 2 gives the formal definition of the exceptional pairwise behavior discovery problem. Section 3 presents the algorithms. Section 4 provides experimental results. Section 5 reviews the related work. Section 6 concludes and provides future directions.

2 Problem Definition

Data describing individuals' outcomes on items are numerous, ranging from vote data to collaborative ratings on social-media platforms. We model such data as a triple \(\langle E, U, R\rangle \) where E is a collection of objects (e.g., ballots, items, restaurants) and \(\mathcal {A}_E=\{e_1,...,e_n\}\) depicts the schema of the studied objects described by n attributes. U identifies the individuals (e.g., social network users, parliament members) described by m attributes over the schema \(\mathcal {A}_U=\{u_1,...,u_m\}\). Finally, R represents the reviews (e.g., opinions, votes, ratings) of individuals over the objects. Each element of R is a triple \(r~=~(e,u,o)\) where \(o \in O\) is the outcome of a user \(u \in U\) over an item \(e \in E\). The function \(o(e,u)\) returns the outcome o of u over an item e.

A description c over E defines a set of restrictions over the domains of the attributes \(\mathcal {A}_E\). Such a description gives a context and identifies a subgroup of E denoted \(E_c\) which is a collection of objects that fulfill the restrictions of c. We use the symbol \(*\) to refer to the context that covers all the objects, therefore \(E_* = E\). Similarly, a description g over U, which is a set of restrictions over the domains of the attributes \(\mathcal {A}_U\), identifies a collection of individuals denoted \(U_g\).

We aim to discover a context c and collections of individuals \(U_{g'} \subseteq U\), \( U_{g''} \subseteq U\) (labeled respectively by their descriptions \(g', g''\) over the attributes of U) such that their pairwise agreement (similarities) differs exceptionally from the observed pairwise agreement over the whole objects. In other terms, we want to identify patterns \((c,g',g'')\) that suggest an important change in pairwise behavior between \(U_{g'}\) and \(U_{g''}\) within a context c. To this end, the outcomes of \(U_{g'}\) and \(U_{g''}\) have to be compared. Therefore, we need to define a similarity function between individuals over a given subgroup of objects. However, ratings data are generally sparse which limits the set of objects that have been rated by a pair of individuals. To overcome this issue, we have to consider aggregates of individuals and their aggregated outcome. The operator \(\gamma _L\) builds a partition of U according to their values on the attributes \(L \subseteq \mathcal {A}_U\). For instance, if U represents deputies affiliated to national parties depicted by the attribute np, \(\gamma _{\{np\}}(U)=\{G_1,G_2,...\}\) is a partition of U where each \(G_i\) represents a set of individuals affiliated to the same party.

We define an aggregated outcome operator \(\theta : E \times 2^U \rightarrow O\) which maps an aggregate of individuals \(G \subseteq U\) to its aggregated outcome w.r.t. an object e. For example, when dealing with movie ratings, aggregated outcome \(\theta (e,G)\) can be defined as the mean of ratings given by some individuals of G to a movie e. We can compare the similarity between two sets of individuals based on their aggregated outcomes. The similarity measure is thus defined as: \(sim : 2^E \times 2^U \times 2^U \longrightarrow [0,1]\).
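The partition operator \(\gamma _L\) and the mean-based aggregated outcome can be illustrated with a minimal Python sketch. The data layout is an assumption made for illustration only: individuals are stored as a dictionary of attribute dictionaries, outcomes as a dictionary keyed by (object, individual), and the names `gamma` and `theta_mean` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def gamma(users, L):
    """Partition users (dict: user id -> attribute dict) by their values on the attributes in L."""
    groups = defaultdict(set)
    for uid, attrs in users.items():
        groups[tuple(attrs[a] for a in L)].add(uid)
    return dict(groups)

def theta_mean(e, G, outcomes):
    """Mean outcome given to object e by the members of G who rated it (None if nobody did)."""
    ratings = [outcomes[(e, u)] for u in G if (e, u) in outcomes]
    return mean(ratings) if ratings else None

# Toy data: three individuals described by a single attribute np (national party).
users = {"u1": {"np": "A"}, "u2": {"np": "A"}, "u3": {"np": "B"}}
outcomes = {("m1", "u1"): 4, ("m1", "u2"): 2, ("m1", "u3"): 5}

parts = gamma(users, ["np"])                       # one aggregate per party
avg_A = theta_mean("m1", parts[("A",)], outcomes)  # aggregated outcome of party A on m1
```

Returning `None` for an aggregate that never rated the object is one possible convention for sparse data; the text does not prescribe how such cases are handled.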

Our method relies on an EMM vision. Thus, we first need to determine a model class and a quality measure \(\varphi \) over this model class. We use a similarity matrix as a model to capture the pairwise agreement between pairs of user collections (\(U_{g'}\), \(U_{g''}\)). Note that, contrary to common EMM approaches, there is no unique base model on the whole data but a model is related to a pair of descriptions \((g',g'')\) identifying collections of individuals. The base model denoted \(M^{g',g''}_*\), which represents the usual observed pairwise agreement over the whole objects between the candidate subgroups \(U_{g'}, U_{g''}\), is defined as: \(M^{g',g''}_*=\big (sim\left( E_*,i,j\right) \big )_{(i,j)\in \gamma _L(U_{g'}) \times \gamma _L(U_{g''})}\). The model built for a context c depicting a subgroup of objects \(E_c\) is: \(M^{g',g''}_c=\big (sim\left( E_c,i,j\right) \big )_{(i,j)\in \gamma _L(U_{g'}) \times \gamma _L(U_{g''})}\).

The quality measure \(\varphi \) aims to quantify how much the model induced by the subgroup is different from the base model, i.e., how much the pairwise agreement observed over the whole objects differs from the one observed in a particular context between \(U_{g'}\) and \(U_{g''}\). Several quality measures can be defined according to the use case. For example, if we are interested in finding controversial contexts, we define \(\varphi _{dissent}\) that captures the average similarity weakening between pairs of \(\gamma _L(U_{g'}) \times \gamma _L(U_{g''})\):

      \(\varphi _{dissent}(c,g',g'') = \frac{\sum _{(i,j) \in \gamma _L(U_{g'}) \times \gamma _L(U_{g''})} max\left( sim\left( E_*,i,j\right) -sim\left( E_c,i,j\right) ,0\right) }{|\gamma _L(U_{g'})|.|\gamma _L(U_{g''})|}\)

To find patterns \((c,g',g'')\) that suggest an unexpected change of pairwise agreement, we rely on a well-known task, i.e., the discovery of Top-k patterns that fulfill a minimum quality threshold constraint \(\sigma _{\varphi }\). Additional constraints can be taken into account (e.g. \(\langle |E_c| \ge \sigma _{E}\), \(|U_{g'}|\ge \sigma _{U}\), \(|U_{g''}|\ge \sigma _{U}\rangle \)).

3 Discovery of Exceptional Pairwise Behaviors

In this section, we describe the enumeration principle based on closure operators, especially in the case of attributes whose domain is defined as a hierarchy. We then present different aggregates and similarities as well as the quality measures and their related tight upper and lower bounds. We eventually describe the algorithms to discover exceptional pairwise behaviors.

3.1 Candidate Descriptions Enumeration

Description Language. Let \(\mathcal {G}\) be a generic collection of tuples which can be either E or U, and \(\mathcal {A}_{\mathcal {G}}=(a_1,a_2,...,a_n)\) its schema defined over n attributes. We denote by \(dom(a_i)\) the domain of an attribute \(a_i\). A description \(d=\langle r_1,r_2,...,r_n\rangle \) is a conjunction of restrictions over the attribute domains, where each restriction \(r_i\) corresponds to the attribute \(a_i\). The definition of a restriction depends on the attribute type. If an attribute \(a_i\) is nominal, the corresponding restriction \(r_i\) is a membership in a subset of \(dom(a_i)\). If \(a_i\) is numeric, the corresponding restriction \(r_i\) is a membership in an interval. The set of all possible descriptions is denoted \(\mathcal {D}\). A description \(d \in \mathcal {D}\) (the intent) defines a subgroup (the extent) \(\mathcal {G}_d \subseteq \mathcal {G}\) which contains the tuples of \(\mathcal {G}\) satisfying the restrictions of d. In order to bind the descriptions of \(\mathcal {D}\) to subgroups in \(\mathcal {G}\), we define a mapping function \(\delta : \mathcal {G} \rightarrow \mathcal {D}\) that maps each tuple \(g\in \mathcal {G}\) to its description in \(\mathcal {D}\). To define this mapping function, we rely on the corresponding mappings \(\delta _{a_i} : dom(a_i) \rightarrow \mathcal {D}_{a_i}\) that map the values of an attribute \(a_i\) to the corresponding restriction \(r_i \in \mathcal {D}_{a_i}\). Given a tuple g, an attribute \(a_i\) and its value \(a_i^g\) in g, if \(a_i\) is numeric, the restriction is an interval \(\delta _{a_i}(a^g_i)=[a_i^g,a_i^g]\). Otherwise, if \(a_i\) is nominal, the restriction is a singleton \(\delta _{a_i}(a^g_i)=\{a^g_i\}\). Finally, with the former definitions, for a tuple \(g=(a^g_1,...,a^g_n)\) we have \(\delta (g)=\langle \delta _{a_1}(a^g_1),...,\delta _{a_n}(a^g_n) \rangle \).

Description Space Structure. To enumerate candidate descriptions (or candidate subgroups by extent), we traverse the search space \(\mathcal {D}\) in a bottom-up fashion. This search space is commonly depicted as a meet-semilattice structured by an infimum operator denoted by \(\sqcap \) [10], which returns the most specific description that generalizes two given descriptions. The definition of the infimum operator relies on the n infimum operators \(\sqcap _{a_i}\), each corresponding to the type of the attribute \(a_i\). If a is a numeric attribute, the corresponding infimum operator \(\sqcap _a\) computes the minimum interval enclosing two intervals. On the other hand, if a is nominal, the corresponding infimum operator \(\sqcap _a\) is the set union operator. Thus the meet-semilattice \((\mathcal {D},\sqcap )\) is the result of the cartesian product of the meet-semilattices \((\mathcal {D}_a,\sqcap _{a})\), each corresponding to an attribute \(a \in \mathcal {A}_{\mathcal {G}}\). The infimum operator allows us to define a partial order denoted by \(\sqsubseteq \) between descriptions. Given two descriptions c and d, we have \(c \sqsubseteq d \Leftrightarrow c \sqcap d = c\).
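As an illustration of the attribute-wise infimum and of the induced order, the following Python sketch may help. It assumes a hypothetical encoding that is not prescribed by the text: a numeric restriction is a pair `(lo, hi)`, a nominal restriction is a `frozenset` of allowed values, and a description is a tuple of restrictions aligned with a tuple of attribute types.

```python
def meet_numeric(r1, r2):
    """Smallest closed interval enclosing the two intervals r1 and r2."""
    return (min(r1[0], r2[0]), max(r1[1], r2[1]))

def meet_nominal(r1, r2):
    """Union of the two sets of allowed values."""
    return frozenset(r1) | frozenset(r2)

def meet(c, d, types):
    """Attribute-wise infimum of two descriptions c and d."""
    ops = {"numeric": meet_numeric, "nominal": meet_nominal}
    return tuple(ops[t](q, r) for t, q, r in zip(types, c, d))

def is_generalization(c, d, types):
    """c is more general than (or equal to) d iff the infimum of c and d equals c."""
    return meet(c, d, types) == c

# Toy descriptions over (release year, genre).
types = ("numeric", "nominal")
c = ((1990, 2000), frozenset({"Comedy", "Drama"}))
d = ((1995, 1998), frozenset({"Comedy"}))
```

With these toy values, `meet(c, d, types)` gives back `c`, so `d` is a specialization of `c` while the converse does not hold.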

Specialization and Neighborhood Relations. Let \(c=\langle q_1, q_2,..., q_n\rangle \) and \(d = \langle r_1, r_2, ..., r_n\rangle \) be two descriptions of \(\mathcal {D}\), where \(q_i\) and \(r_i\) are two restrictions on the attribute \(a_i\). \(r_i\) is a specialization of \(q_i\) iff \(r_i \Rightarrow q_i\) which is equivalent to \(q_i \sqsubseteq r_i \Leftrightarrow q_i \sqcap _{a_i} r_i = q_i\). A description d is a specialization of c (denoted \(c \sqsubseteq d\)) iff \(\forall i \in [1..n] \,:\, q_i \sqsubseteq r_i\). Obviously, \(c \sqsubseteq d \Longleftrightarrow \mathcal {G}_d \subseteq \mathcal {G}_c\) with \(\mathcal {G}_d\) (resp. \(\mathcal {G}_c\)) the subgroup covered by d (resp. c). When traversing the search space, we extend a description to more complex descriptions by atomic refinements. Thus, we define the neighborhood relationship \(\prec \). We have \(c \prec d\) iff \(c \sqsubset d\;\wedge \;\not \exists e \in \mathcal {D} \;:\;c \sqsubset e \sqsubset d \) and d is said to be an upper neighbor of c. To get the neighbors of a candidate description \(c=\langle q_1, q_2, ..., q_n\rangle \), we rely on a similar neighborhood concept between restrictions. If q is a restriction over a nominal attribute a, materialized by membership in a subset \(s_q \subseteq dom(a)\), the neighbors of q are the candidates r that correspond to singletons of \(s_q\). Similarly, for a numeric attribute, the candidate neighbors of a restriction q are the intervals r resulting from a left-minimal change or a right-minimal change on the bounds of the interval corresponding to q [12]. With these tools, we can easily define a refinement operator \(\eta \;:\;\; \mathcal {D} \rightarrow 2^\mathcal {D}\) which maps each description d to its neighbors in \(\mathcal {D}\) and we have:

$$\begin{aligned} \begin{aligned} \eta (c)&=\{d \in \mathcal {D} \,:\, d=\langle r_1,...,r_n\rangle \succ c=\langle q_1,...,q_n \rangle \} \\&= \{d \in \mathcal {D} \,:\, \exists j \in [1..n]\;|\; r_j \succ q_j \;and\; \forall i \in [1..n] \;|\; i \ne j \Rightarrow r_i = q_i \} \end{aligned} \end{aligned}$$
(1)

Additionally we define \(\eta _f\) that computes the neighbors of a given description c by refining the \(f^{th}\) restriction corresponding to the \(f^{th}\) attribute as follows:

$$\begin{aligned} \eta _f(c)=\{d \in \mathcal {D} \,:\, r_f \succ q_f \;and\; \forall i \in [1..n] \;|\; i \ne f \Rightarrow r_i = q_i \} \end{aligned}$$
(2)
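For numeric attributes, the refinement \(\eta _f\) can be sketched in Python as follows. This is only an illustration under stated assumptions: intervals are pairs `(lo, hi)` whose bounds are values observed in the data, a left-minimal (resp. right-minimal) change moves the lower (resp. upper) bound to the next (resp. previous) observed value, and the names `refine_interval` and `eta_f` are hypothetical. Nominal and HMT restrictions are not covered.

```python
from bisect import bisect_left, bisect_right

def refine_interval(interval, values):
    """Upper neighbors of a closed interval whose bounds belong to the observed values."""
    lo, hi = interval
    vals = sorted(set(values))
    if lo >= hi:                                # a single-value interval cannot be refined
        return []
    nxt = vals[bisect_right(vals, lo)]          # smallest observed value > lo
    prv = vals[bisect_left(vals, hi) - 1]       # largest observed value < hi
    return [(nxt, hi), (lo, prv)]               # left-minimal change, right-minimal change

def eta_f(c, f, domains):
    """Neighbors of description c obtained by refining only its f-th (numeric) restriction."""
    return [c[:f] + (r,) + c[f + 1:] for r in refine_interval(c[f], domains[f])]

# Toy description over two numeric attributes (release year, rating).
domains = [[1990, 1995, 1998, 2000], [1, 2, 3, 4, 5]]
c = ((1990, 2000), (1, 5))
neighbors = eta_f(c, 0, domains)
```

Only the restriction at index `f` changes; all other restrictions are copied unchanged, which mirrors the condition \(i \ne f \Rightarrow r_i = q_i\).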

Closed Descriptions. We rely on the concept of closed descriptions to significantly decrease the number of explored descriptions by avoiding redundancy. A description c is said to be closed iff for every specialization d (i.e. \(c \sqsubset d\)) there is at least one object in \(\mathcal {G}\) covered by c but not by d. More formally, \(\forall d \in \mathcal {D} \, : \, c \sqsubset d \Rightarrow \mathcal {G}_d \subsetneq \mathcal {G}_c\). Two descriptions c and d are considered as equivalent (denoted \(c \equiv d\)) iff \(\mathcal {G}_c = \mathcal {G}_d\). We can adapt the CbO (Close-by-One) algorithm [15] for our use in DSC.

To define the closure operator of a description d of \(\mathcal {D}\), we need to introduce two derivation operators that create a Galois connection between \(2^\mathcal {G}\) and \(\mathcal {D}\):

Given \(S \subseteq \mathcal {G}\), the description \(S^\square \in \mathcal {D}\) covering the subgroup S is:

$$S^\square := \sqcap _{g\in S}\delta (g)= \langle \sqcap _{g\in S}\delta _{a_1}(a^g_1),...,\sqcap _{g\in S}\delta _{a_n}(a^g_n) \rangle $$

Given a description d, the subgroup \(d^\square \) covered by d is:

$$ \mathcal {G}_d = d^\square = \{g \in \mathcal {G} \;|\; d \sqsubseteq \delta (g) \}$$

\((.)^{\square \square }\) is a closure operator and, for every \(d\in \mathcal {D}\), \(d^{\square \square }\) is a closed description.
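The two derivation operators and the resulting closure can be sketched in Python for numeric and nominal attributes. The encoding is an assumption made for illustration: tuples of raw values for objects, pairs `(lo, hi)` for numeric restrictions and `frozenset` objects for nominal ones; the function names are hypothetical.

```python
from functools import reduce

def delta(g, types):
    """Most specific description of a single tuple g."""
    return tuple((v, v) if t == "numeric" else frozenset([v]) for t, v in zip(types, g))

def meet(c, d, types):
    """Attribute-wise infimum: enclosing interval (numeric) or set union (nominal)."""
    return tuple((min(q[0], r[0]), max(q[1], r[1])) if t == "numeric" else q | r
                 for t, q, r in zip(types, c, d))

def covers(d, g, types):
    """True iff the tuple g satisfies every restriction of d."""
    return all(r[0] <= v <= r[1] if t == "numeric" else v in r
               for t, r, v in zip(types, d, g))

def extent(d, data, types):
    """Subgroup covered by the description d."""
    return [g for g in data if covers(d, g, types)]

def intent(S, types):
    """Description covering the subgroup S (assumes S is non-empty)."""
    return reduce(lambda c, d: meet(c, d, types), (delta(g, types) for g in S))

def closure(d, data, types):
    """Closed description equivalent to d: intent of the extent of d."""
    return intent(extent(d, data, types), types)

types = ("numeric", "nominal")
data = [(1995, "Comedy"), (1998, "Comedy"), (2001, "Drama")]
d = ((1990, 2000), frozenset({"Comedy", "Drama"}))
closed = closure(d, data, types)
```

On this toy dataset, `d` covers the two Comedy movies, so its closure tightens the year interval to the observed range and drops the unused genre. Applying `closure` a second time leaves the description unchanged.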

Canonicity Test. An important aspect of the CbO enumeration is the canonicity test, which makes it possible to determine whether a description obtained after closure was already generated and, if so, to discard it. The canonicity test relies on a linear order \(\lessdot \) between descriptions of \(\mathcal {D}\). Given an arbitrary order between attributes \(\mathcal {A}_\mathcal {G}=\{a_1,a_2,...,a_n\}\), if \(d=\langle r_1,...,r_n\rangle \) comes from a closure after a refinement of the \(f^{th}\) restriction of \(c=\langle q_1,...,q_n\rangle \) then we have: \(c \lessdot _f d \Longleftrightarrow \forall i \in [1..f-1] \;|\; q_i=r_i\;\wedge \;q_f \lessdot _{a_f} r_f\). Note that, in our case, the test part \(q_f \lessdot _{a_f} r_f\) is always valid when the \(f^{th}\) attribute is numeric or nominal. However, it needs to be checked when the attribute is more complex, such as the HMT attributes introduced in the next section.

3.2 Hierarchical Multi-tag Attribute (HMT)

Several vote and review datasets contain multi-tagged objects where each tag is part of a hierarchical structure. For instance, ballots in the European Parliament can have multiple tags: the ballot Gender mainstreaming in the work of the European Parliament is tagged by 4.10.04-Gender equality and 8.40.01-European Parliament. Tag 4.10.04 itself identifies a hierarchy, where tag 4.10 depicts Social policy, which is a specialization of tag 4 that covers the ballots related to Economic, social and territorial cohesion. Let \(\mathcal {G}\) be a set of tagged objects. For the sake of simplicity, each object g is described by a unique attribute tags which is a set of tags. Tags form a tree denoted T.

We define the partial order \(\le \) between tags as the usual partial order in a tree structure, where the tree root is the minimum (e.g. \(*< 1 < 1.20\)). This allows us to define the ascendants (resp. descendants) operator \(\uparrow \) (resp. \(\downarrow \)) of a tag \(t \in T\). We have \(\uparrow \!t = \{u \in T | u \le t\}\) and \(\downarrow \!t = \{u \in T | u \ge t\}\). Let t and u be two tags, t is a lower neighbor of u, denoted \(t \prec u\), iff \(t < u\) and \(\not \exists e \in T\;|\; t<e<u\). In this case, t is the parent of u, denoted \(t=p(u)\).

A restriction over an HMT attribute is a membership in a set of tags \(\{t_1,...,t_n\}\). We denote the description domain by \(\mathcal {D}\), which is a subset of \(2^T\). Each object \(g \in \mathcal {G}\) is mapped by \(\delta (g)\) to its corresponding description in \(\mathcal {D}\). Obviously, if \(\delta (g)=\{t_1,t_2\}\), the object g is tagged explicitly by the tags \(t_1\) and \(t_2\) but also implicitly by all their generalizations \(\uparrow \!t_1\) and \(\uparrow \!t_2\), as shown in Fig. 2.

Fig. 2. A tags tree (left), a collection of tagged items (middle) and a vector representation (right)

To handle this attribute among the other attributes in the complex search space defined previously, we need to define the infimum operator \(\sqcap \) between two descriptions of \(\mathcal {D}\). Let \(c=\{t_1,...,t_n\}\) and \(d=\{u_1,...,u_m\}\) be two descriptions of \(\mathcal {D}\), we define \(c \sqcap d\) as: \(c \sqcap d = max\left( \bigcup _{t \in c} \uparrow \!t \;\cap \; \bigcup _{u \in d} \uparrow \!u\right) \) where \(max : 2^T \rightarrow 2^T\) is a function that maps each subset of tags \(s\subseteq T\) to the leaves of the sub-tree composed of the tags of s: \(max(s) = \{t \in s \;|\;(\downarrow \!t \setminus \{t\}) \cap s = \emptyset \}\).

Intuitively, \(c \sqcap d\) depicts the set of the maximum explicit or implicit tags shared by the two descriptions. For instance, if \(c=\{1.10,2\}\) and \(d=\{1.20,2.10\}\), \(c \sqcap d = \{1,2\}\). A description d is said to be a specialization of c, denoted \(c \sqsubseteq d\), iff \(c \sqcap d = c\), which means \(\forall t \in c\;\; \exists u \in d\;\;|\;\; u \in \;\downarrow \!t \). A description c is considered as a lower neighbor of d denoted \(c \prec d\) iff:

$$ {\left\{ \begin{array}{ll} \exists !\;(t,u) \in c \times d\;:\;t \prec u\,\wedge \, \forall t' \in (c \setminus t) \; \exists u' \in d \; : \; t' = u'\;\;\;\;\;\;\;\;\;\mathrm {if}\;\,&{} |d|\!=\!|c|\\ \forall t \in c\;\exists u \in d : t\!=\!u \wedge \exists ! (t,u)\in c\!\times \!d \;\; \exists t'\!\in \;\uparrow \!t : p(u)\!=\!p(t')\!\! &{} |d|\!=\!|c|\!+\!1 \end{array}\right. } $$

Basically, d is an upper neighbor of c if either exactly one tag of c is refined in d through the neighborhood relation between tags, or d adds one new tag that shares its parent with a tag of c or with one of its ascendants. The linear order between two conjunctions of tags \(c=\{t_1,...,t_n\}\) and \(d=\{u_1,...,u_n,...,u_m\}\), given that d comes from a closure after refinement of the \(f^{th}\) tag of c, is defined as: \(c \lessdot _f d \Longleftrightarrow \forall i \in [1..f-1] \; : \; t_i = u_i\;\wedge \; t_f \le u_f\). The linear order between tags can be provided by a depth-first search on T.

Based on the definitions of \(\sqcap \), the neighborhood relation between two sets of tags and the linear order between them, the HMT attribute can be easily handled together with the aforementioned attributes (numeric and nominal) in the complex search space dealing with n attributes.
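The HMT operators can be illustrated with a short Python sketch. It assumes, for illustration only, that tags are dotted strings whose prefixes are their ascendants (e.g. `4.10.04` under `4.10` under `4`), that the root is written `*`, and that the infimum of two descriptions is the set of maximal tags they share explicitly or implicitly, as the intuition given above suggests. All function names are hypothetical.

```python
def up(t):
    """Ascendants of tag t, including t itself and the root '*'."""
    if t == "*":
        return {"*"}
    parts = t.split(".")
    return {"*"} | {".".join(parts[:i]) for i in range(1, len(parts) + 1)}

def up_all(tags):
    """Explicit and implicit tags carried by a set of tags."""
    return set().union(*(up(t) for t in tags))

def maximal(s):
    """Tags of s that have no strict descendant in s."""
    return {t for t in s if not any(u != t and t in up(u) for u in s)}

def meet_hmt(c, d):
    """Maximal tags shared, explicitly or implicitly, by the descriptions c and d."""
    return maximal(up_all(c) & up_all(d))

def is_specialization(c, d):
    """d specializes c iff every tag of c has a descendant (possibly itself) in d."""
    return all(any(t in up(u) for u in d) for t in c)

c = {"1.10", "2"}
d = {"1.20", "2.10"}
shared = meet_hmt(c, d)   # tags 1 and 2 are the deepest tags common to both
```

The `maximal` function also captures the semantic equivalence between a set of tags and the same set stripped of its implied ascendants, e.g. `{"3", "3.10.05"}` reduces to `{"3.10.05"}`.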

3.3 Aggregations, Similarities and Quality Measures

An important aspect of DSC is the similarity measure between aggregates of individuals. Given \(L \subseteq \mathcal {A}_U\), a set of attributes of individuals on which we compute aggregates of individuals, and a collection of individuals \(U_g \subseteq U\) labeled by a description g, \(\gamma _L(U_g)=\{G_1,G_2,...,G_k\}\) is a partition of \(U_g\). The aggregated outcome \(\theta \) is defined according to the application domain. For example, the outcome of an aggregate of reviewers who give scores is defined as: \(\theta _{review}(e,G)=\frac{1}{|G|}\sum _{u \in G} o(e,u)\). The outcome of an aggregate G of European Parliament members w.r.t. a ballot is the vote of the majority (Footnote 3) of its members (see Fig. 3).

Fig. 3. Aggregates outcomes over one reviewed object

In this paper, we consider similarities that convey the average agreement proportion between two aggregates \(G_i\), \(G_j\) based on their pairwise similarity simobj over each object. We define such similarities over \(2^E\times 2^U\times 2^U\):

$$\begin{aligned} sim(E,G_i,G_j)=\frac{1}{|E|}\sum _{e \in E} simobj (e,G_i,G_j) \end{aligned}$$
(3)

The measure simobj, which is defined over \(E\times 2^U\times 2^U\), is adapted to the application domain. For example, if we want to compare deputies whose vote decision can be either for, against or abstain, we define:

$$\begin{aligned} simobj_{votes}(e,G_i,G_j)= {\left\{ \begin{array}{ll} 1 &{} \mathrm {if} \;\theta (e,G_i)=\theta (e,G_j) \\ 0\;\;&{}\mathrm {else} \end{array}\right. } \end{aligned}$$
(4)

For ratings ranging from 1 to 5, the similarity simobj is defined by how close the scores given by the two aggregates are:

$$\begin{aligned} simobj_{review}(e,G_i,G_j)=1-\frac{1}{4}|\theta (e,G_i)-\theta (e,G_j)| \end{aligned}$$
(5)
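Equations (3) to (5) translate directly into a few lines of Python. In this sketch, the aggregated outcomes \(\theta (e,G_i)\) are assumed to be precomputed and stored in dictionaries keyed by object; this layout and the function names are illustrative assumptions.

```python
def simobj_votes(a, b):
    """Eq. (4): 1 if the two aggregated votes are identical, 0 otherwise."""
    return 1.0 if a == b else 0.0

def simobj_review(a, b):
    """Eq. (5): closeness of two aggregated ratings on a 1-to-5 scale."""
    return 1.0 - abs(a - b) / 4.0

def sim(objects, theta_i, theta_j, simobj):
    """Eq. (3): average per-object similarity between two aggregates."""
    return sum(simobj(theta_i[e], theta_j[e]) for e in objects) / len(objects)

# Aggregated votes of two parties over three ballots.
votes_i = {"b1": "for", "b2": "against", "b3": "for"}
votes_j = {"b1": "for", "b2": "for", "b3": "abstain"}
s_votes = sim(["b1", "b2", "b3"], votes_i, votes_j, simobj_votes)   # agree on 1 of 3 ballots

# Aggregated ratings of two groups over two movies.
rates_i = {"m1": 4.0, "m2": 2.0}
rates_j = {"m1": 5.0, "m2": 2.0}
s_rates = sim(["m1", "m2"], rates_i, rates_j, simobj_review)
```

Passing `simobj` as a parameter reflects the remark that the per-object similarity depends on the application domain while the averaging scheme stays the same.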

To discover interpretable patterns \((c,g',g'')\), we define the two following quality measures \(\varphi _{consent}\), \(\varphi _{dissent}\) by relying on the defined similarities. \(\varphi _{consent}\) makes it possible to consider a pattern as “interesting” if there is an important strengthening of similarities between individuals corresponding to \(g'\) and individuals corresponding to \(g''\) for the context c. \(\varphi _{dissent}\) aims to assess the weakening of similarities between individuals. We assume that the attributes \(L \subseteq \mathcal {A}_{U}\) used to build partitions of individuals are given:

  • \( \varphi _{consent}(c,g',g'')\!=\!\frac{\sum _{(i,j) \in \gamma _L(U_{g'}) \times \gamma _L(U_{g''})} max\left( sim\left( E_c,i,j\right) \!-\!sim\left( E_*,i,j\right) ,0\right) }{|\gamma _L(U_{g'})|.|\gamma _L(U_{g''})|}\)

  • \( \varphi _{dissent}(c,g',g'')\!=\!\frac{\sum _{(i,j) \in \gamma _L(U_{g'}) \times \gamma _L(U_{g''})} max\left( sim\left( E_*,i,j\right) \!-\!sim\left( E_c,i,j\right) ,0\right) }{|\gamma _L(U_{g'})|.|\gamma _L(U_{g''})|}\)
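Both quality measures can be computed from two similarity matrices, as in the following Python sketch. It assumes, for illustration, that the base model and the context model are given as dictionaries mapping each pair of aggregates \((i,j)\) to its similarity, with the same keys in both; the name `phi` and its `kind` parameter are hypothetical.

```python
def phi(sim_all, sim_ctx, kind="dissent"):
    """Average positive part of the similarity change between base model and context model.

    sim_all[(i, j)] stands for sim(E_*, i, j) and sim_ctx[(i, j)] for sim(E_c, i, j).
    kind="dissent" rewards weakening, kind="consent" rewards strengthening.
    """
    sign = 1.0 if kind == "dissent" else -1.0
    total = sum(max(sign * (sim_all[p] - sim_ctx[p]), 0.0) for p in sim_all)
    return total / len(sim_all)   # number of pairs of aggregates

# Toy models over two pairs of aggregates.
sim_all = {("A", "B"): 0.8, ("A", "C"): 0.5}
sim_ctx = {("A", "B"): 0.2, ("A", "C"): 0.75}

dissent = phi(sim_all, sim_ctx, "dissent")   # only (A, B) weakens, by 0.6
consent = phi(sim_all, sim_ctx, "consent")   # only (A, C) strengthens, by 0.25
```

Because of the positive part, a pair whose agreement strengthens does not cancel out a pair whose agreement weakens, so the two measures are not simply opposite to each other.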

3.4 Upper Bounds on Quality Measures

To discard unpromising descriptions early, we follow a branch-and-bound approach in which an upper bound on the quality measure \(\varphi \) is computed for a candidate description. We first define a generic upper bound \(U\!B_{sim}\) and a lower bound \(L\!B_{sim}\) on sim. Given a threshold \(\sigma _E\) that fixes the minimum size of a subgroup of objects, and two aggregates of individuals \(G_i\) and \(G_j\), we have:

  • \(L\!B^1_{sim}(E,G_i,G_j)=max\left( \frac{\sigma _E-|E|(1-sim(E,G_i,G_j))}{\sigma _E},0\right) \)

  • \(L\!B^2_{sim}(E,G_i,G_j)=\frac{1}{\sigma _E}\) smallest \((\{simobj(e,G_i,G_j) \;|\; e \in E\}, \sigma _E)\)

  • \(U\!B^1_{sim}(E,G_i,G_j)=min\left( \frac{|E|*sim(E,G_i,G_j)}{\sigma _E},1\right) \)

  • \(U\!B^2_{sim}(E,G_i,G_j)=\frac{1}{\sigma _E}\) largest \((\{simobj(e,G_i,G_j) \;|\; e \in E\}, \sigma _E)\)

where smallest(S, n) (resp. largest(S, n)) computes the sum of the n smallest (resp. largest) values of a given set S of real values. \(L\!B^1_{sim}\) (resp. \(U\!B^1_{sim}\)) is equivalent to \(L\!B^2_{sim}\) (resp. \(U\!B^2_{sim}\)) when simobj gives binary results such as \(simobj_{votes}\).
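The four bounds can be sketched in Python from the list of per-object similarities \(simobj(e,G_i,G_j)\) for \(e \in E\). The function names are hypothetical, and the sketch assumes \(\sigma _E \le |E|\). In `lb1`, the term \(|E|(1-sim)\) is written as \(|E| - \sum simobj\), which is algebraically identical.

```python
def lb1(sims, sigma_E):
    """First lower bound on sim over any sub-context with at least sigma_E objects."""
    return max((sigma_E - (len(sims) - sum(sims))) / sigma_E, 0.0)

def lb2(sims, sigma_E):
    """Second lower bound: average of the sigma_E smallest per-object similarities."""
    return sum(sorted(sims)[:sigma_E]) / sigma_E

def ub1(sims, sigma_E):
    """First upper bound on sim over any sub-context with at least sigma_E objects."""
    return min(sum(sims) / sigma_E, 1.0)

def ub2(sims, sigma_E):
    """Second upper bound: average of the sigma_E largest per-object similarities."""
    return sum(sorted(sims, reverse=True)[:sigma_E]) / sigma_E

binary = [1, 1, 0, 0, 1]            # e.g. simobj_votes over five ballots
graded = [1.0, 0.5, 0.25, 0.75]     # e.g. simobj_review over four movies
```

On the binary list with \(\sigma _E = 2\), the two variants coincide (both lower bounds are 0 and both upper bounds are 1). On the graded list, the second variants are tighter (0.375 versus 0.25 for the lower bound, 0.875 versus 1 for the upper bound), at the price of a sort.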

Given a description \((c,g',g'')\), we define the following upper bounds (Footnote 4) on the quality measure of every specialization d of c (\(\forall d\;|\;c \sqsubseteq d\)):

$$\begin{aligned} \varphi _{consent}(d,g',g'') \le U\!B_{consent}(c,g',g'')\;\wedge \; \varphi _{dissent}(d,g',g'') \le U\!B_{dissent}(c,g',g'') \end{aligned}$$
  • \( U\!B_{consent}(c,g',g'') = \frac{\sum _{(i,j) \in \gamma _L(U_{g'}) \times \gamma _L(U_{g''})} max\left( U\!B_{sim}\left( E_c,i,j\right) -sim\left( E_*,i,j\right) ,0\right) }{|\gamma _L(U_{g'})|.|\gamma _L(U_{g''})|} \)

  • \( U\!B_{dissent}(c,g',g'') = \frac{\sum _{(i,j) \in \gamma _L(U_{g'}) \times \gamma _L(U_{g''})} max\left( sim\left( E_*,i,j\right) -L\!B_{sim}\left( E_c,i,j\right) ,0\right) }{|\gamma _L(U_{g'})|.|\gamma _L(U_{g''})|} \)

where \(U\!B_{consent}\) (resp. \(U\!B_{dissent}\)) corresponds to \(\varphi _{consent}\) (resp. \(\varphi _{dissent}\)).

3.5 Algorithms

Algorithm 1, called EnumCC (Enumerate Closed Candidates), describes the exploration of the search space over a collection of objects \(\mathcal {G}\) defined by the attributes \(\mathcal {A}_\mathcal {G}=\{a_1,...,a_n\}\). Starting from a description d, EnumCC enumerates the closed descriptions c whose corresponding subgroups satisfy the size constraint \(\sigma _{\mathcal {G}}\). Given a description d, EnumCC computes its corresponding subgroup \(S_c\). If the size of this subgroup reaches the threshold, the closure c of d is computed and the linear order between d and c is checked. If the check succeeds, c is returned as a valid candidate. The algorithm then generates the neighbors by refining the attributes \(\{a_f,...,a_n\}\). The flag f determines the attribute that was refined to generate the description d. Finally, a recursive call explores the lattice structure formed by d in a depth-first fashion. The parameter cnt is a Boolean that makes it possible to prune the search space based on the computation of the upper bound on the quality of a candidate description. EnumCC is written as a generator.

Algorithm 1. EnumCC

Algorithm 2. DSC

Algorithm 2 depicts the DSC method, which is based on the closure operator and a branch-and-bound exploration. It addresses the task of finding the top-k patterns with a minimum quality threshold \(\sigma _\varphi \). The algorithm first generates a candidate pattern \((c,g',g'')\), then computes the upper bound of this candidate. If the upper bound does not exceed the threshold \(\sigma _\varphi \), the search space is pruned. Otherwise, the quality measure of the candidate is computed. If its quality exceeds the same threshold \(\sigma _\varphi \), the top-k set is updated. Subsequently, if the size of the top-k set exceeds k, the worst pattern found w.r.t. \(\varphi \) is discarded and \(\sigma _\varphi \) is dynamically updated with the minimal quality of the current top-k set. Note that E defines the objects on which the individuals of \(U_1\) and \(U_2\) (two subsets of U) give outcomes. L determines the attributes on which the individuals are aggregated. Finally, \(\sigma _E\) and \(\sigma _{U}\) determine the size thresholds of the subgroups of E and U, respectively.
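The top-k maintenance with a dynamically raised threshold can be sketched in Python as follows. This is a simplified illustration and not the DSC listing itself: a flat list of precomputed (pattern, upper bound, quality) triples stands in for the lattice traversal, whereas in a real traversal a failed bound would also skip the specializations of the candidate. The function names and the handling of ties are assumptions.

```python
import heapq

def update_topk(topk, pattern, quality, k, sigma):
    """Insert a pattern into a min-heap of (quality, pattern) and return the threshold."""
    if quality >= sigma:
        heapq.heappush(topk, (quality, pattern))
        if len(topk) > k:
            heapq.heappop(topk)     # discard the worst pattern found so far
            sigma = topk[0][0]      # raise the threshold to the current minimum quality
    return sigma

def search(candidates, k, sigma):
    """Scan candidates, skipping those whose upper bound is below the threshold."""
    topk = []
    for pattern, upper_bound, quality in candidates:
        if upper_bound < sigma:
            continue                # pruned: no need to evaluate the quality measure
        sigma = update_topk(topk, pattern, quality, k, sigma)
    return sorted(topk, reverse=True), sigma

candidates = [
    (("c1", "g1", "g2"), 0.9, 0.5),
    (("c2", "g1", "g2"), 0.3, 0.3),
    (("c3", "g1", "g2"), 0.8, 0.7),
    (("c4", "g1", "g2"), 0.6, 0.6),
]
best, final_sigma = search(candidates, k=2, sigma=0.4)
```

In this toy run, the second candidate is pruned by its upper bound, and inserting the fourth candidate evicts the first one and raises the threshold from 0.4 to 0.6. A min-heap keeps the worst retained pattern at the root, so both the eviction and the threshold update are cheap.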

4 Empirical Study

In this section, we report on both quantitative and qualitative experiments over the implemented algorithms. The algorithms were implemented in Python. The experiments were carried out on an Intel Core i7-6700HQ 2.60 GHz machine with 16 GB RAM and were run with PyPy 5.4.1. For reproducibility purposes, the source code and the data are made available on our companion page (Footnote 5). These experiments aim to answer the following questions: Q1 - Is the closure over an HMT attribute more effective than mining closed itemsets? Q2 - Are the closure operators and the tight upper bounds effective and efficient? Q3 - Does our algorithm scale w.r.t. different parameters? Q4 - Does DSC provide actionable patterns?

Experiments were carried out on two real-world datasets: a movie review dataset, Movielens (Footnote 6), and the European parliament dataset, EPD (Footnote 7). The main characteristics of these datasets are reported in Table 1. In Movielens, 18 movie genres are organized through a flat hierarchy.

Table 1. Characteristics of the datasets

4.1 Performance Study

Q1 - We aim to study the performance of the closure operator in the presence of an HMT attribute. To this end, we compare it against the closure over itemsets (i.e., scaling) as illustrated in Fig. 2. A tree of tags is characterized by its height and its branching factor (k-ary tree). A dataset of multi-tagged objects is described by the maximum number of tags (maxtags) that an object can have and by its size. Figure 4 reports the runtime and the number of explored candidates of the two closure operators when varying the branching factor, the tree height, the number of tags and the dataset size. For these experiments, we set the default values of these characteristics to 5, 3, 3 (hierarchy of 125 tags) and 5000 objects, respectively. HMTClosure exploits the structure of the tree and avoids exploring semantically equivalent descriptions (e.g., \(\{3, 3.10.05\} \) is semantically equivalent to \(\{3.10.05\}\)) whereas ISClosure explores them. In all configurations, HMTClosure outperforms ISClosure on both the execution time and the number of explored candidates. These experiments demonstrate that taking hierarchical relations into account makes the closure operator more efficient and effective.

Fig. 4. Behavior of enumeration algorithms considering two closure operators for HMT attributes w.r.t. the branching factor, the height of the hierarchy, the number of tags and the number of objects, which are set by default to 5, 3, 3 and 5000, respectively.

Q2 - A baseline algorithm is obtained by deactivating the pruning techniques based on upper bounds and the closure operators. Thus, the baseline only pushes monotonic constraints. We compare DSC with the baseline and also with closed, which is DSC without upper bound computation, on both Movielens and EPD. Notice that in EPD \(U\!B^1_{dissent}\) and \(U\!B^2_{dissent}\) are equivalent for the considered similarity. Therefore, we only report \(U\!B^1_{dissent}\). We interrupt a method if its execution time exceeds one hour.
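To make the role of upper-bound pruning concrete, the following Python sketch contrasts a search that only pushes a minimum-support constraint with one that also prunes on an optimistic estimate. It is a generic subgroup-discovery illustration under stated assumptions: the toy data, the sum-of-scores quality measure and the bound (sum of the positive scores in the support) are hypothetical and are neither the quality measures nor the \(U\!B_{dissent}\) bounds of DSC.

```python
# Hypothetical toy data: each object has a set of Boolean attributes and a score.
OBJECTS = [
    ({"a", "b"}, 3.0),
    ({"a"}, -2.0),
    ({"b", "c"}, 1.5),
    ({"a", "c"}, -1.0),
    ({"a", "b", "c"}, 2.0),
    ({"c"}, -0.5),
]
ATTRIBUTES = ["a", "b", "c"]


def support(desc):
    """Indices of the objects that carry every attribute of the description."""
    return [i for i, (attrs, _) in enumerate(OBJECTS) if desc <= attrs]


def quality(ids):
    """Toy quality measure (assumption): sum of the scores in the support."""
    return sum(OBJECTS[i][1] for i in ids)


def upper_bound(ids):
    """Optimistic estimate: a specialization keeps a subset of the support,
    so its quality cannot exceed the sum of the positive scores."""
    return sum(max(OBJECTS[i][1], 0.0) for i in ids)


def search(min_support, use_upper_bound):
    """Depth-first top-1 search; returns (best quality, description), #explored."""
    best = (float("-inf"), frozenset())
    explored = 0

    def expand(desc, start):
        nonlocal best, explored
        ids = support(desc)
        if len(ids) < min_support:
            return  # support can only shrink when specializing
        explored += 1
        q = quality(ids)
        if q > best[0]:
            best = (q, desc)
        if use_upper_bound and upper_bound(ids) <= best[0]:
            return  # no specialization can beat the current best
        for j in range(start, len(ATTRIBUTES)):
            expand(desc | {ATTRIBUTES[j]}, j + 1)

    expand(frozenset(), 0)
    return best, explored


baseline_best, baseline_explored = search(min_support=1, use_upper_bound=False)
pruned_best, pruned_explored = search(min_support=1, use_upper_bound=True)
print(baseline_best, baseline_explored)
print(pruned_best, pruned_explored)
```

Both runs return the same best description, but the pruned run visits fewer candidates. The tighter the bound, the earlier a branch is cut, which is the trade-off between computation cost and tightness discussed for the two bounds.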

Figures 5 and 6 report the behavior (i.e., execution time and number of explored candidates) of the different methods when varying the characteristics of the datasets Movielens and EPD. These experiments give evidence that each of the optimizations of DSC is effective. For the Movielens dataset, DSC is most efficient when using \(U\!B^2_{dissent}\) instead of \(U\!B^1_{dissent}\): \(U\!B^2_{dissent}\) is more costly to compute than \(U\!B^1_{dissent}\) but much tighter. The differences between the baseline and DSC are much larger on EPD because the HMT attribute is more complex than in Movielens. The experiments also demonstrate that the number of attributes used in the description of an object or a user heavily impacts the performance of the method, as it increases the size of the search space.

Q3 - Figure 7 reports the behavior of DSC on EPD when varying the input parameters (i.e., the minimum thresholds \(\sigma _{E}\) and \(\sigma _{U}\) and the quality measure). As expected, when the thresholds increase (i.e., become more stringent), the number of explored patterns and thus the execution time decrease. Nevertheless, we observe that when decreasing \(\sigma _{E}\), DSC remains efficient thanks to its pruning abilities based on upper-bound computations and closure operators. The execution time increases in line with the number of dimensions |L| on which the groups of individuals are computed, while the number of explored descriptions remains roughly the same: the computation of the model is simply more costly. Finally, a greater \(\sigma _\varphi \) leads to a substantial reduction of the number of explored candidates and therefore a better execution time. This demonstrates the effectiveness of the pruning properties implemented in DSC. Even if the two quality measures behave similarly, \(\varphi _{consent}\) performs slightly better than \(\varphi _{dissent}\), since by default the relation between the parliament's deputies w.r.t. their voting behavior is rather consensual.

Fig. 5. Effectiveness of DSC (Top-5) according to Movielens dataset characteristics which are set by default to \(|E|=1681\), \(|U|=943\), \(\#attr_{objects} = 2\), \(\#attr_{individuals} = 2\). The default thresholds are \(\sigma _{E}=\sigma _{U}=5\), \(\sigma _\varphi =0\), \(|L|=1\).

Fig. 6. Effectiveness of DSC (Top-5) according to EPD dataset characteristics which are set by default to \(|E|=2471\), \(|U|=778\), \(\#attr_{objects} = 3\), \(\#attr_{individuals} = 2\). The default thresholds are \(\sigma _{E}=\sigma _{U}=15\), \(\sigma _\varphi =0\), \(|L|=1\).

Fig. 7. Effectiveness of DSC (Top-5) over EPD according to constraint thresholds and quality measures. The default thresholds are \(\sigma _{E}=\sigma _{U}=15\), \(\sigma _\varphi =0\), \(|L|=1\).

4.2 Qualitative Results (Q4)

Table 2 describes some patterns found by DSC when looking for contexts that weaken the pairwise agreement between collections of reviewers identified by gender and age group in Movielens. For instance, in the best pattern, middle-aged females tend to be in discord with their male peers on 1998 comedy movies (13 movies, e.g., The Wedding Singer). This can be observed through a significant decrease of similarity between the two aggregates, from 86% to 51% (a drop of 35 percentage points). The diversification is done over the top-100 patterns. Two patterns are considered similar if one covers more than 50% of the reviewed objects contained in the other.
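The similarity criterion used for diversification can be sketched as follows. The 50% overlap test follows the description above; the greedy selection in ranked order, the function names and the toy patterns are assumptions made for illustration, since the exact selection procedure is not detailed here.

```python
def similar(objects_a, objects_b, threshold=0.5):
    """True if one pattern covers more than `threshold` of the other's objects."""
    shared = len(objects_a & objects_b)
    return shared > threshold * len(objects_a) or shared > threshold * len(objects_b)


def diversify(ranked_patterns, k, threshold=0.5):
    """Greedy pass (assumption): keep a pattern only if it is not similar
    to any better-ranked pattern that was already kept."""
    kept = []
    for name, objects in ranked_patterns:
        if all(not similar(objects, other, threshold) for _, other in kept):
            kept.append((name, objects))
            if len(kept) == k:
                break
    return kept


# Hypothetical patterns, best first, each with the set of objects it covers.
ranked = [
    ("p1", {1, 2, 3, 4}),
    ("p2", {1, 2, 3}),     # fully covered by p1, hence redundant
    ("p3", {4, 5, 6, 7}),  # shares a single object with p1
    ("p4", {8, 9}),
]
top = diversify(ranked, k=2)
print([name for name, _ in top])
```

With these toy patterns, `p2` is discarded because `p1` covers all of its objects, so the second retained pattern is `p3`.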

Figure 8 reports the discovered patterns suggesting flash points (particular contexts that lead European groups to an important weakening of similarity). These patterns allow us to make explicit the differences between groups that usually share the same political line. For example, while PPE and S&D vote mostly the same (76% of the cases), the top pattern (1) uncovers the ballots (contextualized by their themes, such as 3.40.16 Raw materials and 6.10.05 Peace preservation, and their time period, Feb. 2015) on which the two groups strongly diverge. This is witnessed by a decrease of pairwise agreement from 76% to 0%. The heatmaps illustrated in Fig. 8 depict the overall pairwise agreement changes observed for the pattern (1). Such results can provide insights for both political analysts and journalists: the analytic tool provided by DSC helps discover ideological idiosyncrasies when comparing deputies against their peers, determine red lines between political groups, or exhibit contexts where a nation's deputies coalesce against others on critical subjects.
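The kind of comparison reported above, overall agreement versus agreement within a context, can be sketched in Python as follows. This is an illustration under explicit assumptions: agreement is taken here to be the share of ballots on which the majority votes of two groups coincide, which may differ from the similarity measure actually used by DSC, and the groups, themes and votes are made-up toy data, not EPD records.

```python
from collections import Counter


def majority(votes):
    """Most frequent vote in a list (the toy data below contains no ties)."""
    return Counter(votes).most_common(1)[0][0]


def agreement(ballots, group_1, group_2):
    """Share of ballots on which the two groups' majority votes coincide.

    Assumption: this stands in for the pairwise similarity; returns None
    when the two groups never voted on a common ballot.
    """
    shared = [b for b in ballots if group_1 in b["votes"] and group_2 in b["votes"]]
    if not shared:
        return None
    same = sum(
        majority(b["votes"][group_1]) == majority(b["votes"][group_2])
        for b in shared
    )
    return same / len(shared)


# Hypothetical ballots: a theme tag and the votes cast by two fictitious groups.
BALLOTS = [
    {"theme": "3.40.16", "votes": {"group_a": ["for", "for", "against"],
                                   "group_b": ["against", "against", "for"]}},
    {"theme": "4.10", "votes": {"group_a": ["for", "for"],
                                "group_b": ["for", "for", "against"]}},
    {"theme": "6.20", "votes": {"group_a": ["against", "against"],
                                "group_b": ["against"]}},
    {"theme": "8.40", "votes": {"group_a": ["for"],
                                "group_b": ["for", "for"]}},
]

overall = agreement(BALLOTS, "group_a", "group_b")
context = [b for b in BALLOTS if b["theme"].startswith("3.40")]
in_context = agreement(context, "group_a", "group_b")
print(overall, in_context, overall - in_context)
```

A context is interesting for dissent when the agreement restricted to it is much lower than the overall agreement, as in the drop from 76% to 0% reported for the top pattern.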

Table 2. Diversified Top-4 patterns discovered over Movielens by grouping on age groups

Fig. 8. Diversified Top-4 patterns over EPD by grouping over political groups. (1) shows the usual pairwise agreement observed between political groups and (2, 3 & 4) illustrate the heatmaps corresponding to the best 3 patterns found in the top-k table.

5 Related Work

The problem of discovering exceptional subgroups based on the definition of a complex target model has been widely investigated in recent years [7, 8, 13, 17, 18, 21]. Interestingly, de Sá et al. [6] use a similar matrix model to support the discovery of subgroups of individuals whose preference relation between ranked objects deviates from the norm. However, in the so-called exceptional preference mining, the dimensions of the model are fixed, i.e., the quality measure takes into account all objects and not a dynamically chosen subset as in DSC. Dynamic EMM (i.e., EMM with a non-fixed model) has recently been investigated for different aims. Bosc et al. [4] propose a method to handle multi-label data where the number of labels per object is much lower than the total number of labels, which prevents the use of the usual EMM models. Other dynamic EMM approaches aim to discover exceptional attributed sub-graphs [3, 13].

Thanks to open data policies, the analysis of political data has received much attention in the past decade. Most of these studies use basic data mining techniques. For instance, [11] uses clustering and PCA to identify cohesion blocs and dissimilarity blocs of voters within the US Senate. Similar work was done on the Finnish [20] and the Italian [1] parliaments. An extensive tool was provided by [9] and applied to Swiss government datasets to detect opinion change of parliamentarians based on their expressed opinions before elections and votes cast afterwards.

Rating analysis has also received wide interest in the last decade. In [5], the authors tackle the problem of rating interpretation by providing two methods (DEM, DIM). While the first one aims to discover groups of users that substantially agree for a given set of items, the second addresses the discovery of groups with an apparent inner discord. These two methods can be formalized as EMM instances with a quality measure that assesses either the average ratings of the identified subgroups or the average balance between positive and negative ratings. While these methods consider a mono-objective measure (rating average), similar work has been done to tackle multi-objective group identification in [19]. It addresses a more complex statistical measure (rating distribution) and additionally coverage and diversity issues. In [2], the authors aim at using rating maps to identify subsets of reviews such that the observed distribution of ratings is similar to the desired distributions.

6 Conclusion

In this paper, we introduced the novel problem of subjective exceptional pairwise behavior discovery in rating or vote data, rooted in the SD/EMM framework. We defined a branch-and-bound algorithm that exploits tight upper bounds and closure operators to efficiently and effectively discover subgroups of interest. Experiments on Movielens and EPD show that these pruning techniques reduce both the number of explored candidates and the execution time, and that the discovered patterns highlight interpretable contexts of disagreement. We believe that this work opens new directions for future work. For example, the interactive discovery of exceptional pairwise behavior would make it possible to take into account prior knowledge. Such an exploration must be supported by instant mining algorithms.