
1 Introduction

Data classification is an important task in the field of data mining. Related classification methods and techniques have been studied extensively, and some of them have been successfully used to solve practical problems [1,2,3]. In the era of big data, the volume and variety of data keep growing. Big data is characterized not only by large volume but also by "inferior quality". The "inferior quality" of data manifests itself in many aspects, two of which are incompleteness and inconsistency. Incompleteness means that the data contains missing values, while inconsistency means that the data contains conflicting descriptions. Incompleteness and inconsistency have many causes, such as subjective and objective factors, noisy data, and the variety of data sources. In fact, the problems caused by incompleteness and inconsistency are unavoidable when extracting knowledge from big data [4]. Incompleteness usually increases the degree of inconsistency. Indeed, the problems of incompleteness and inconsistency are interwoven and cannot be fully separated, which makes knowledge extraction more complicated. Therefore, it is both difficult and meaningful to solve classification problems for "inferior quality" data characterized mainly by incompleteness and inconsistency.

There are many ways to handle missing values. Stefanowski [5] distinguished two different semantics for missing values: the "absent value" semantics and the "missing value" semantics. Grzymala-Busse [6] further divided missing values into three categories according to their comparison range: "do not care" conditions, restricted "do not care" conditions, and attribute-concept values. In 2009 we used a sorting technique to design a fast approach to computing tolerance classes [7], but this approach suffers from data fragmentation once the degree of missing values rises above a certain level. Recently, we presented a new method, called the index method, to quickly compute attribute-value blocks [8]; it completely eliminates data fragmentation. In this paper, we study a fast classification method for incomplete inconsistent data by using and improving such index methods.

2 Preliminaries

2.1 Incomplete Decision Systems (IDSs) and Attribute-Value Blocks

A decision system that contains missing values is called an incomplete decision system (IDS), which can be described as a 4-tuple \( IDS = (U,\;A = C \cup D,\;V = \bigcup_{a \in A} V_{a},\;\{f_{a}\}_{a \in A}) \), where U is a finite nonempty set of objects, called the universe; C and D are finite nonempty sets of attributes (features), called the condition attribute set and the decision attribute set, respectively, with \( C \cap D = \emptyset \); V_a is the domain of attribute a ∈ A, and |V_a| denotes the number of elements in V_a, i.e., the cardinality of V_a; and \( f_{a}: U \to V_{a} \) is an information function that maps each object in U to a value in V_a. Sometimes \( (U,\;A = C \cup D,\;V,\;\{f_{a}\}_{a \in A}) \) is written as (U, C ∪ D) for simplicity when V and f_a are understood. Without loss of generality, we suppose D = {d} in this paper; that is, D is assumed to consist of a single attribute. A missing value is usually denoted by "*". That is, if there exists a ∈ C such that * ∈ V_a, then the decision system (U, C ∪ D) is an incomplete decision system.

For an incomplete decision system \( IDS = (U, C \cup D) \), we define the concept of missing value degree (the degree of missing values) [8], denoted MD(IDS), as follows:

$$ MD(IDS) = \frac{\text{the number of missing attribute values}}{|U|\,|C|}. $$
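For illustration, MD(IDS) can be computed in a few lines of Python; the table layout (a list of rows of condition-attribute values, with "*" marking a missing value) is an assumption of this sketch:

```python
# Sketch: the missing value degree MD(IDS) is the fraction of missing
# condition-attribute values in the data table ('*' marks a missing value).

def missing_degree(rows):
    """rows: list of condition-attribute value lists (decision column excluded)."""
    total = sum(len(row) for row in rows)          # |U| * |C| for a rectangular table
    missing = sum(row.count('*') for row in rows)  # number of missing entries
    return missing / total

# Example: 2 missing entries out of 3 * 3 = 9 gives MD = 2/9.
U = [['a', '*', 'c'],
     ['a', 'b', '*'],
     ['d', 'b', 'c']]
print(missing_degree(U))  # 0.2222...
```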

Obviously, the missing value degree has a great effect on classification performance, but few studies of this problem have been reported.

Attribute-value blocks were presented by Grzymala-Busse to analyze incomplete data [8]. Here we first introduce some concepts related to attribute-value blocks.

In an incomplete decision system (U, C ∪ D), for any a ∈ C and v ∈ V_a, (a, v) is said to be an attribute-value pair, which is an atomic formula of decision logic [9]. Let [(a, v)] denote the set of all objects in U that match (a, v); [(a, v)] is the so-called attribute-value (pair) block [10]. According to the semantics of "do not care" conditions, we have:

$$ [(a,v)] = \begin{cases} \{ y \in U \mid f_{a}(y) = v \;\text{or}\; f_{a}(y) = * \}, & \text{if } v \ne *; \\ U, & \text{otherwise}. \end{cases} $$

Actually, v = f_a(x) for some x ∈ U, and therefore the block [(a, v)] is often denoted by K_a(x) or S_a(x) in the literature, i.e., \( K_{a}(x) = S_{a}(x) = [(a, f_{a}(x))] \). For B ⊆ C, the attribute-value block with respect to B, K_B(x), is defined as follows:

$$ K_{B}(x) = \bigcap_{a \in B} [(a, f_{a}(x))] = \bigcap_{a \in B} K_{a}(x). $$

Property 1.

For B′, B″ ⊆ C, if \( B^{\prime} \subseteq B^{\prime\prime} \), then \( K_{B^{\prime\prime}}(x) \subseteq K_{B^{\prime}}(x) \).

2.2 Incomplete Inconsistent Decision Systems (IIDSs)

In an incomplete decision system (U, C ∪ D), since V_D contains no missing values, D partitions U into a family of equivalence classes, called decision classes. In this paper, we let D_x denote the decision class that contains object x, where x ∈ U.

Definition 1.

For an object x ∈ U, let \( \mu_{B}(x) = \frac{|K_{B}(x) \cap D_{x}|}{|K_{B}(x)|} \) denote the degree to which object x belongs to its decision class D_x with respect to B; μ_B(x) is called the consistency degree of object x with respect to B.

Obviously, 0 < μ_B(x) ≤ 1. If μ_C(x) = 1, object x is said to be a consistent object; otherwise it is an inconsistent object. It is not difficult to see that x being inconsistent means that the block K_C(x) intersects at least two different decision classes, i.e., \( K_{C}(x) \not\subseteq D_{x} \).

For an incomplete decision system (U, C ∪ D), if U contains inconsistent objects, i.e., there exists y ∈ U such that μ_C(y) < 1, then the decision system is said to be an incomplete inconsistent decision system (IIDS), denoted IIDS = (U, C ∪ D).

Definition 2.

Let id(IIDS) denote the ratio of the number of inconsistent objects to the number of all objects in U, i.e., \( id(IIDS) = |\{x \in U \mid \mu_{C}(x) < 1\}| / |U| \); id(IIDS) is called the inconsistency degree of the decision system IIDS.

Obviously, 0 ≤ id(IIDS) ≤ 1. To judge whether an object x is consistent, we need to compute its consistency degree μ_C(x), which can be a time-consuming process because it involves set operations. The following property prepares the ground for computing μ_B(x) efficiently.

Property 2.

For an incomplete inconsistent decision system (U, C ∪ D) and any x ∈ U, \( K_{B}(x) \cap D_{x} = \{ y \in U \mid y \in K_{B}(x) \;\text{and}\; f_{d}(y) = f_{d}(x) \} \), and therefore \( \mu_{B}(x) = \frac{|\{ y \in U \mid y \in K_{B}(x) \wedge f_{d}(y) = f_{d}(x) \}|}{|K_{B}(x)|} \), where B ⊆ C and D = {d}.
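For illustration, the following sketch computes K_B(x) and μ_B(x) directly from the definitions, without any acceleration; rows are lists of values with "*" for missing, and B is a list of column indices (both layout choices, and the function names, are assumptions of the sketch):

```python
# Naive computation of K_B(x) and the consistency degree mu_B(x), straight
# from the definitions ("do not care" semantics); O(|U||B|) per object.

def block(U, x, B):
    """K_B(x): objects agreeing with x on every attribute in B; '*' is a wildcard."""
    return [y for y in range(len(U))
            if all(U[x][a] == '*' or U[y][a] == '*' or U[y][a] == U[x][a]
                   for a in B)]

def consistency(U, d, x, B):
    """mu_B(x) = |{y in K_B(x) : f_d(y) = f_d(x)}| / |K_B(x)| (Property 2)."""
    kb = block(U, x, B)
    return sum(1 for y in kb if d[y] == d[x]) / len(kb)

# Example: objects 0 and 1 fall into the same block but carry different decisions.
U = [['a', 'b'], ['a', '*'], ['c', 'b']]
d = ['yes', 'no', 'yes']
print(consistency(U, d, 0, [0, 1]))  # 0.5 -> object 0 is inconsistent
```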

3 A Granulation Model Based on IIDSs

In order to compute K_B(x), we generally need to traverse all objects in U, so computing K_B(x) for every x ∈ U takes O(|U|²|B|) time, a costly process. However, when considering only one attribute at a time, we can derive some useful properties to accelerate the computation.

Definition 3.

In an IIDS (U, C ∪ D), for a given attribute a ∈ C, let U_a^* denote the set of all objects x such that f_a(x) = *, i.e., \( U_{a}^{*} = \{x \in U \mid f_{a}(x) = *\} \). Attribute a partitions \( U - U_{a}^{*} \) into a family of pairwise disjoint equivalence classes. Let [x]_a denote the equivalence class containing x, i.e., \( [x]_{a} = \{y \in U - U_{a}^{*} \mid f_{a}(y) = f_{a}(x)\} \), where \( x \in U - U_{a}^{*} \), and let Γ_a denote this family, i.e., \( \Gamma_{a} = \{[x]_{a} \mid x \in U - U_{a}^{*}\} \).

Each element of Γ_a is an equivalence class; its objects are grouped together because they share the same value on the corresponding attribute. Sometimes, to emphasize the attribute value, we write Γ_a(v) for the equivalence class in Γ_a whose objects all have attribute value v, i.e., \( \Gamma_{a}(v) = \{y \in U - U_{a}^{*} \mid f_{a}(y) = v\} \).

It is not difficult to see that \( \Gamma_{a} \cup \{U_{a}^{*}\} \) covers U; each of its elements is a subset of U, and the elements are pairwise disjoint.

Property 3.

Given Γ_a and U_a^* for an attribute a ∈ C, for any x ∈ U we have

$$ K_{a}(x) = \begin{cases} [x]_{a} \cup U_{a}^{*}, & \text{if } f_{a}(x) \ne *; \\ U, & \text{otherwise}, \end{cases} $$

where \( [x]_{a} \in \Gamma_{a} \).

Property 3 shows how to use Γ_a and U_a^* to compute K_a(x). We note that |V_a| hardly grows with |U|, which lets us design an efficient algorithm to compute Γ_a and U_a^* for all a ∈ C. The algorithm is described as follows.

[Algorithm 1: pseudocode figure — computes Γ_a and U_a^* for all a ∈ C]
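Since the pseudocode is given as a figure, the following Python sketch conveys the idea as we read it: one pass over each column, bucketing objects by attribute value and collecting objects with missing values; the name granulate and the data layout are ours:

```python
# Sketch of the granulation step (Algorithm 1): for each attribute a,
# bucket the objects by value to obtain the equivalence classes Gamma_a(v),
# and collect the objects with a missing value into U_a^*.
# One pass over each column, so roughly O(|C||U|) with hashing.

def granulate(U):
    """U: list of rows of condition-attribute values; returns (Gamma, missing)."""
    n_attrs = len(U[0])
    Gamma = [dict() for _ in range(n_attrs)]   # Gamma[a][v] = Gamma_a(v)
    missing = [set() for _ in range(n_attrs)]  # missing[a] = U_a^*
    for x, row in enumerate(U):
        for a, v in enumerate(row):
            if v == '*':
                missing[a].add(x)
            else:
                Gamma[a].setdefault(v, set()).add(x)
    return Gamma, missing

U = [['a', 'b'], ['a', '*'], ['c', 'b']]
Gamma, missing = granulate(U)
print(Gamma[0])    # {'a': {0, 1}, 'c': {2}}
print(missing[1])  # {1}
```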

From Algorithm 1, we can see that its time complexity is O(|C||U|t) ≤ O(|C||U||V_a|). As mentioned above, |V_a| hardly grows with |U|, so it can generally be regarded as a constant. Therefore, the time complexity of this algorithm is close to the linear complexity O(|C||U|).

In fact, Algorithm 1 granulates each "column" of an IIDS and thereby constructs a granulation model for the IIDS. Let \( \Gamma_{a}^{*} = \Gamma_{a} \cup \{U_{a}^{*}\} \); the resulting granulation model is denoted 𝒢 = (U, {\( \Gamma_{a}^{*} \)}_{a∈C}, D) in this paper.

With the granulation model, we can compute any block K_B(x) using the formula \( K_{B}(x) = \bigcap_{a \in B} K_{a}(x) \). To compute K_B(x) quickly, we need to know "where K_a(x) is". We therefore construct an index structure that stores the addresses of K_a(x) for all a ∈ C and x ∈ U. This index structure is expressed as a matrix \( \psi = [m(x, a)]_{x \in U, a \in C} \), where m(x, a) is the index (address) of Γ_a(v) with v = f_a(x). The algorithm for constructing the matrix ψ is described as follows.

[Algorithm 2: pseudocode figure — constructs the index matrix ψ]
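A sketch of the same idea in Python, reusing granulate from the previous sketch; here a reference to the class object itself plays the role of the stored address (build_index is our name):

```python
# Sketch of the index matrix psi (Algorithm 2): psi[x][a] points at the
# class of Gamma_a^* that contains object x, so K_a(x) can later be fetched
# without any search. Since the classes in Gamma_a^* are pairwise disjoint
# and together cover U, filling psi costs O(|U||C|).

def build_index(U, Gamma, missing):
    n, m = len(U), len(U[0])
    psi = [[None] * m for _ in range(n)]
    for x, row in enumerate(U):
        for a, v in enumerate(row):
            # In Python, a reference to the set itself serves as the "address".
            psi[x][a] = missing[a] if v == '*' else Gamma[a][v]
    return psi

Gamma, missing = granulate(U)   # from the Algorithm 1 sketch above
psi = build_index(U, Gamma, missing)
```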

Actually, Algorithm 2 traverses all objects in \( \Gamma_{a}^{*} \) for every a ∈ C. Since \( \bigcup \Gamma_{a}^{*} = U \) and the elements (subsets) of \( \Gamma_{a}^{*} \) are pairwise disjoint, the complexity of this algorithm is exactly O(|U||C|), which is linear.

The granulation model and the index matrix ψ together are denoted by the ordered pair [𝒢, ψ]. If no confusion arises, [𝒢, ψ] itself is also called a granulation model. The purpose of constructing [𝒢, ψ] is to provide a way to compute the block K_B(x) quickly for any x ∈ U, with a complexity of about O(|K_B(x)||B|).
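The following sketch, continuing the ones above, shows how K_B(x) can be assembled from [𝒢, ψ] via Property 3; block_fast is our name:

```python
# Fast block computation from the model [G, psi] (Property 3): for each
# a in B, K_a(x) is either the stored class plus U_a^* (value known) or
# U itself (value missing), and K_B(x) is the running intersection.

def block_fast(x, B, psi, missing, n):
    """K_B(x) for object x, attribute list B, and |U| = n."""
    K = None
    for a in B:
        if psi[x][a] is missing[a]:   # f_a(x) = '*', so K_a(x) = U: no constraint
            continue
        Ka = psi[x][a] | missing[a]   # K_a(x) = [x]_a union U_a^*
        K = Ka if K is None else K & Ka
    return set(range(n)) if K is None else K

print(block_fast(0, [0, 1], psi, missing, len(U)))  # {0, 1}
```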

4 A Granulation-Model-Based Method for Constructing Classifier

4.1 An Attribute-Value Block Based Method of Acquiring Classification Rules

A classification rule can be viewed as an implication between two granular worlds, and each object x can derive a classification rule, which forms a "bridge" between these worlds. First, consider the following inclusion relation: K_B(x) ⊆ D_x. K_B(x) and D_x have their own descriptions, which are formulae of decision logic [9]. Suppose their descriptions are ρ and φ, respectively. Then object x derives the rule ρ → φ. By Property 1, removing attributes from B enlarges K_B(x) and therefore strengthens the generalization ability of the rule ρ → φ; however, its consistency degree μ_B(x) may decrease, which increases the rule's uncertainty. Hence, attributes may be removed from B only under a suitable restriction. We now give the concept of an object reduct, which is used to acquire classification rules.

Definition 4.

In an IIDS = (U, C ∪ D), for any object x ∈ U, B ⊆ C is said to be a reduct of object x if the following conditions are satisfied: (a) μ_B(x) ≥ μ_C(x), and (b) for any B′ ⊂ B, μ_{B′}(x) < μ_C(x).

In practice it is time-consuming to find a reduct for an object, because verifying condition (b) requires examining every subset of B. A common alternative is to select attributes from C to form a B such that μ_B(x) ≥ μ_C(x). Such a method is known as feature selection, and from its result the corresponding classification rule is easily obtained. In the following, we give an algorithm that performs feature selection for all objects x ∈ U and derives the corresponding classification rules.

[Algorithm 3: pseudocode figure — performs feature selection for each object and derives the rule set S]
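Since the pseudocode is given as a figure, the following sketch shows one plausible reading of the feature-selection step, built on the block_fast sketch above; the greedy removal order and the function names are our assumptions:

```python
# Sketch of per-object feature selection (our reading of step (5) of
# Algorithm 3): starting from B = C, drop an attribute whenever the
# consistency degree does not fall below mu_C(x); the surviving attributes
# and the decision value of x form the rule derived from x.

def mu(x, B, d, psi, missing, n):
    """Consistency degree mu_B(x), per Property 2, on the fast block."""
    K = block_fast(x, B, psi, missing, n)
    return sum(1 for y in K if d[y] == d[x]) / len(K)

def select_features(x, d, psi, missing, n, n_attrs):
    base = mu(x, range(n_attrs), d, psi, missing, n)   # mu_C(x)
    B = list(range(n_attrs))
    for a in list(B):
        trial = [b for b in B if b != a]
        if trial and mu(x, trial, d, psi, missing, n) >= base:
            B = trial                  # attribute a is redundant for object x
    return B   # rule for x: conjunction of (a, f_a(x)) for a in B  =>  d(x)

d = ['yes', 'no', 'yes']
print(select_features(0, d, psi, missing, len(U), 2))  # e.g. [1]
```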

In Algorithm 3, step (5) removes redundant attributes from C, which is precisely the feature-selection step. Let B_j = B − {a_j}. By Property 2, with [𝒢, ψ], \( \mu_{B_j}(x_i) \) can be computed in \( O\left(\sum_{j = 1}^{m} |K_{B_j}(x_i)||B_j|\right) \) time. Therefore, the complexity of Algorithm 3 is \( O\left(\sum_{i = 1}^{n} \sum_{j = 1}^{m} |K_{B_j}(x_i)||B_j|\right) \). Suppose t is the average size of the blocks \( K_{B - \{a_j\}}(x_i) \) over all a_j ∈ C and x_i ∈ U, and h is the average length of B_j; then \( O\left(\sum_{i = 1}^{n} \sum_{j = 1}^{m} |K_{B_j}(x_i)||B_j|\right) = O(|U||C| \cdot t \cdot h) \). Generally, t ≪ |U| and h ≪ |C|, so O(|U||C|·t·h) ≪ O(|U|²|C|²).

It should be pointed out that Algorithm 3 cannot guarantee that each generated attribute subset is a reduct, but it does remove some redundant attributes from C and thus accomplishes the task of feature selection.

4.2 Rule Set Minimization

Since each rule in S is induced by an object in U, the rules and objects are in one-to-one correspondence. We let r_i denote the rule induced by object x_i, i.e., r_i corresponds to x_i. Note that |S| = |U|, so S contains many redundant rules. We therefore need to remove these redundant rules; this process is called rule set minimization.

Definition 5.

For a rule r: ρ → φ, let coverage(r) denote the set of all objects that match rule r, i.e., \( coverage(r) = \{x \in U \mid x \models r\} \).

For two rules r_x: ρ_x → φ_x and r_y: ρ_y → φ_y, if coverage(r_x) ⊆ coverage(r_y), then rule r_x is redundant and should be removed. Based on this consideration, we design the following algorithm to minimize the rule set S.

[Algorithm 4: pseudocode figure — minimizes the rule set S]
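A sketch of the minimization idea under our reading of Definition 5; the coverage sets are assumed to be precomputed as blocks via [𝒢, ψ], and the processing order is our choice:

```python
# Sketch of rule-set minimization (Algorithm 4, as we read it): a rule is
# dropped when its coverage is contained in the coverage of a kept rule.
# coverage(r_i) equals the block K_{B_i}(x_i), which [G, psi] returns
# cheaply (see the block_fast sketch above).

def minimize(rules, coverages):
    """rules[i] paired with coverages[i] = coverage(r_i) as a set of objects."""
    kept, kept_cov = [], []
    # Processing larger coverages first guarantees every possible
    # containment is checked against an already-kept rule.
    order = sorted(range(len(rules)), key=lambda i: -len(coverages[i]))
    for i in order:
        if not any(coverages[i] <= cov for cov in kept_cov):
            kept.append(rules[i])
            kept_cov.append(coverages[i])
    return kept
```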

In Algorithm 4, the key operation is to compute coverage(r_{i−1}). Let B_{i−1} be the set of attributes contained in rule r_{i−1}; then computing coverage(r_{i−1}) is equivalent to computing the block \( K_{B_{i-1}}(x_{i-1}) \), whose complexity is \( O(|K_{B_{i-1}}(x_{i-1})||B_{i-1}|) \) using [𝒢, ψ]. Suppose the average size of the blocks \( K_{B_{i-1}}(x_{i-1}) \) is p and the average length of B_{i−1} is o; then the total complexity is \( O(|K_{B_1}(x_1)||B_1| + |K_{B_2}(x_2)||B_2| + \ldots + |K_{B_n}(x_n)||B_n|) = O(|U| \cdot p \cdot o) \). Generally, p ≪ |U| and o ≪ |C|, so O(|U|·p·o) ≪ O(|U|²|C|).

4.3 A Classification Algorithm for Constructing Rule-Based Classifier

Using the four algorithms above, we now give a complete algorithm for acquiring a rule set, which serves as a classifier for incomplete inconsistent data. The complete algorithm is described as follows.

[Algorithm 5: pseudocode figure — builds the rule-based classifier from Algorithms 1–4]
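As a usage illustration, the following sketch shows how a minimized rule set can classify a new object. The matching and tie-breaking strategy (fire every matching rule, prefer the most specific one) is our assumption, since the figure is not reproduced:

```python
# Sketch: using the minimized rule set as a classifier. A rule is a pair
# (conditions, decision) with conditions = {attribute index: value}; an
# object matches a rule if it agrees on every condition, with '*' acting
# as a wildcard on either side.

def classify(obj, rules, default=None):
    best, best_len = default, -1
    for conds, decision in rules:
        if all(obj[a] == '*' or v == '*' or obj[a] == v
               for a, v in conds.items()):
            if len(conds) > best_len:      # prefer the most specific rule
                best, best_len = decision, len(conds)
    return best

rules = [({0: 'a', 1: 'b'}, 'yes'), ({0: 'c'}, 'no')]
print(classify(['a', 'b'], rules))  # 'yes'
print(classify(['c', '*'], rules))  # 'no'
```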

As analyzed above, the complexities of Algorithms 1 and 2 are both O(|C||U|), and those of Algorithms 3 and 4 are O(|U||C|·t·h) and O(|U|·p·o), respectively, which are much less than O(|U|²|C|²) and O(|U|²|C|), respectively. It follows that the most time-consuming step is step (3), so the complexity of Algorithm 5 is O(|U||C|·t·h), which is much less than O(|U|²|C|²).

5 Experimental Analysis

To verify the effectiveness of the proposed methods, we conducted several experiments on UCI data sets (http://archive.ics.uci.edu/ml/datasets.html). The experiments ran on a PC with Windows 7, an Intel(R) Xeon(R) E5-1620 v3 CPU, and 8 GB of memory. The data sets are summarized in Table 1, where |U|, |C|, and |V_d| stand for the numbers of samples, condition attributes, and decision classes, respectively.

Table 1. Description of the four data sets.

Because an incomplete decision system contains missing values, the relation between objects is a tolerance relation rather than an equivalence relation, so sorting techniques cannot be used to accelerate block computation; computing a block K_B(x) then requires comparing x with all other objects in U, where x ∈ U. Replacing the granulation-model-based block computation in Algorithm 5 with this brute-force method, while keeping all other parts unchanged, yields another algorithm, denoted Algorithm 5′. To compare running times for varying numbers of data records, we executed Algorithm 5 and Algorithm 5′ on two data sets, Mushroom and Nursery, four times each, randomly extracting a different number of objects each time; the results are shown in Fig. 1.

Fig. 1. Running times of Algorithm 5 and Algorithm 5′ on Mushroom and Nursery. [figure not reproduced]

From Fig. 1 we can see that the running time of Algorithm 5′ grows much more rapidly than that of Algorithm 5. Therefore, the constructed granulation model greatly improves the computational efficiency of Algorithm 5.

Algorithm 5 consists of Algorithms 1, 2, 3 and 4. We measured the time spent in each of them when executed on Mushroom and Nursery; the results are shown in Table 2.

Table 2. Running times of Algorithms 1–4 on Mushroom and Nursery

It can be seen from Table 2 that, compared with Algorithms 3 and 4, Algorithms 1 and 2 take very little time to construct the granulation model. In other words, constructing the granulation model costs little, yet it forms the foundation for fast feature selection and classifier construction. Table 2 also shows that Algorithm 3 is the most time-consuming algorithm.

To verify the classification performance of Algorithm 5, we tested it on Voting-records, Tic-Tac-Toe, and Nursery using 10-fold cross-validation. The results are shown in Table 3.

Table 3. Precision and recall of Algorithm 5

From Table 3 we can see that Algorithm 5 achieves relatively high precision and recall on these data sets, which indicates that the proposed algorithm has good practical value. Table 3 also shows that Algorithm 5 is suitable not only for incomplete inconsistent data but also for complete consistent data; naturally, its classification performance is better on complete consistent data than on incomplete inconsistent data.

6 Conclusion

Extracting rules from data sets and then using the rule set as a classifier has been one of our research goals in recent years. In this paper, oriented to incomplete inconsistent data, we first used the attribute-value block technique to construct a granulation model, which consists of a block-based model and an index matrix; second, based on the constructed granulation model, we presented an algorithm for acquiring classification rules and an algorithm for minimizing rule sets; with these algorithms, we designed a classification algorithm that constructs a rule-based classifier; finally, we conducted experiments to verify the effectiveness of the proposed algorithm. The experimental results are consistent with our theoretical analysis. Therefore, the work in this paper has both theoretical and practical value, and provides a new idea for classifying incomplete inconsistent data.