1 Introduction

Since massive amounts of data can now be stored on cloud platforms, data mining for large datasets has become a hot topic. Parallel computing methods are a natural choice for processing such datasets and for knowledge discovery in large data. MapReduce is a distributed programming model proposed by Google for processing large datasets, so-called Big Data. Users specify the required functions Map and Reduce and an optional function Combine. Every step of the computation takes \(<key, value>\) pairs as input and produces other \(<key', value'>\) pairs as output. In the first step, the Map function reads the input as a set of \(<key, value>\) pairs and applies a user-defined function to each pair. The result is a second set of intermediate \(<key', value'>\) pairs, which is sent to the Combine or Reduce function. The Combine function is a local Reduce, which can lessen the amount of work left for the final computation. It applies a second user-defined function to each intermediate key with all its associated values in order to merge and group the data. The results are sorted, shuffled and sent to the Reduce function, which merges and groups all values associated with each key and produces zero or more outputs.
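To make this flow concrete, the following minimal Python sketch models Map, Combine and Reduce as user-supplied generator functions for a simple counting job; the sorting/shuffling that Hadoop performs between phases is simulated here with dictionaries, and all names are illustrative only, not part of the system described later in this paper.

```python
from collections import defaultdict

def map_fn(key, value):
    # Emit one <word, 1> pair for every word of the input record.
    for word in value.split():
        yield word, 1

def combine_fn(key, values):
    # Local Reduce: pre-aggregate the values emitted for a key
    # (Hadoop applies this per mapper; it is applied once here for simplicity).
    yield key, sum(values)

def reduce_fn(key, values):
    # Global Reduce: merge all partial counts associated with a key.
    yield key, sum(values)

def run_job(records):
    # Simulate the sort/shuffle step between Map, Combine and Reduce.
    grouped = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            grouped[k].append(v)
    combined = defaultdict(list)
    for k, vs in grouped.items():
        for k2, v2 in combine_fn(k, vs):
            combined[k2].append(v2)
    return dict(kv for k, vs in combined.items() for kv in reduce_fn(k, vs))

print(run_job([(0, "a b a"), (1, "b c")]))  # {'a': 2, 'b': 2, 'c': 1}
```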

Rough set theory is a mathematical tool for dealing with incomplete and uncertain information [4]. The notion of the core is very important in rough set theory. In a decision table, some of the condition attributes may be superfluous (in other words, redundant). This means that their removal does not worsen the classification. The set of all indispensable condition attributes is called the core. One can also observe that the core is the intersection of all decision reducts – each element of the core belongs to every reduct. Thus, in a sense, the core is the most important subset of condition attributes: none of its elements can be removed without affecting the classification power of the set of all condition attributes. A much more detailed description of the concept of the core can be found, for example, in the book [6] or in the article [4].

There are several research works combining MapReduce and rough set theory. In [10] a parallel method for computing rough set approximations was proposed. In [9] a method for computing the core based on finding the positive region was proposed; the same authors also presented a parallel algorithm for attribute reduction in [8]. In [2] a design of a Patient-customized Healthcare System based on Hadoop with Text Mining for efficient Disease Management and Prediction is proposed.

In this paper we propose a parallel method for generating the attribute core based on the distributed programming model MapReduce and rough set theory. The paper is organized as follows. Section 2 presents two algorithms for generating the core. A parallel algorithm based on the discernibility measure of a set of attributes and MapReduce is proposed in Sect. 3. The results of the experiments and their analysis are presented in Sect. 4. Conclusions and future work are drawn in the last section.

2 Pseudocode for Generating Core

2.1 Pseudocode for Generating Core Using Discernibility Matrix

In order to compute the core we can use the discernibility matrix introduced by Prof. Andrzej Skowron (see e.g. [4, 6]). The core is the union of all single-element entries of the discernibility matrix.

Let \(DT=(U,A \cup \{d\})\) be a decision table, where U is a set of objects, A is a set of condition attributes and d is a decision attribute. Below one can find pseudocode for an algorithm that calculates the core \(C \subseteq A\) using the discernibility matrix \([DM(x,y)]_{x,y \in U}\), where \(DM(x,y) = \{a \in A: a(x) \ne a(y) \text{ and } d(x) \ne d(y) \}\).

figure a

The main concept of this algorithm is based on a property of singletons, i.e. cells of the discernibility matrix consisting of exactly one attribute. This property states that an attribute forming a singleton cannot be removed without affecting the classification power. The input of the algorithm is the discernibility matrix DM; the output is the core C, a subset of the set of condition attributes A. The core is initialized as the empty set in line 1. Two loops in lines 2 and 3 iterate over all pairs of objects indexing the discernibility matrix. The conditional instruction in line 4 checks whether a matrix cell contains exactly one attribute. If so, this attribute is added to the core C.
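A minimal Python sketch of this algorithm is given below; it assumes the decision table is stored as a list of attribute-value tuples together with a list of decisions, and it is only an illustration of the idea, not the implementation evaluated in this paper.

```python
def discernibility_matrix(objects, decisions, attrs):
    # DM(x, y) = attributes on which x and y differ, for pairs with different decisions.
    dm = {}
    for x in range(len(objects)):
        for y in range(len(objects)):
            if decisions[x] != decisions[y]:
                dm[(x, y)] = {a for a in attrs if objects[x][a] != objects[y][a]}
    return dm

def core_from_dm(dm):
    # The core is the union of all single-element (singleton) cells of the matrix.
    core = set()
    for cell in dm.values():
        if len(cell) == 1:
            core |= cell
    return core

# Tiny hypothetical decision table: three condition attributes (0, 1, 2), binary decision.
objects = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
decisions = [0, 1, 1, 0]
print(core_from_dm(discernibility_matrix(objects, decisions, attrs=range(3))))  # {1, 2}
```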

The main disadvantage of using the discernibility matrix for big datasets is its size. The memory complexity of creating this type of square matrix is equal to \((cardinality(U))^2 \cdot cardinality(A)\). This makes the algorithm shown in this subsection infeasible for big data. In the next subsection we present an approach more feasible for big data.

2.2 Pseudocode for Generating Core Based on Discernibility Measure of a Set of Attributes

Generating the core based on the discernibility measure was discussed in [3]. The counting table CT is a two-dimensional array indexed by values of information vectors (the vector of values of all attributes of a set \(B \subseteq A\)) and by decision values, where

$$ CT(i, j)=cardinality(\{x \in U: \vec {x}_B = i\ \text{ and }\ d(x)=j\}) $$

The pessimistic memory complexity of creating this type of table is equal to \(cardinality(U) \cdot cardinality(V_d)\), where \(V_d\) is the set of all decision values.

The discernibility measure disc(B) of a set of attributes \(B \subseteq A\) can be calculated from the counting table as follows:

$$ disc(B)=\frac{1}{2}\sum _{i,j}\ \sum _{k,l:\ i\ne k,\ j\ne l}CT(i,j)\cdot CT(k,l) $$
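These two definitions can be expressed in a few lines of Python (a hedged sketch under the same assumptions about the data layout as in the previous listing; the helper names are ours, not the paper's). Note that enumerating unordered pairs of counting-table entries already accounts for the factor \(\frac{1}{2}\) in the formula.

```python
from collections import Counter
from itertools import combinations

def counting_table(objects, decisions, B):
    # CT[(information vector over B, decision)] = number of matching objects.
    ct = Counter()
    for values, d in zip(objects, decisions):
        ct[(tuple(values[a] for a in B), d)] += 1
    return ct

def disc(ct):
    # Number of object pairs discerned by B: pairs of entries with different
    # information vectors and different decisions (unordered pairs, hence no 1/2).
    total = 0
    for ((i, j), c1), ((k, l), c2) in combinations(ct.items(), 2):
        if i != k and j != l:
            total += c1 * c2
    return total
```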

Below is the pseudocode for this algorithm:

figure b

The input of the algorithm is a decision table DT, and the output is the core C of DT. At the beginning the core C is initialized as the empty set and all values in the counting table are set to zero. The first loop, in line 3, generates the counting table for the set of all condition attributes: for each object of the decision table, the value in the array indexed by the object's information vector and its decision value is increased by one. In line 6 the value of the discernibility measure of the set of all condition attributes, disc(A), is initialized as zero. The next two loops, in lines 7 and 8, compare subsequent equivalence classes. If the information vectors of two classes are not equal, disc(A) is increased by the product of the two values from the counting table whose indexes are these information vectors and different decisions. The next step is computing the discernibility measure of the set of attributes after removing one of them. In line 14 all values in the counting table are reset to zero. The loop in line 15 takes an attribute a from the set of all condition attributes A, and the set B is initialized as the set of all condition attributes without this attribute. Analogously to the above, disc(B) is calculated in lines 15–27. Finally, the values of the discernibility measure for the set of all attributes and for the set without attribute a are compared in line 28. If they differ, the attribute is added to the core.
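A compact Python sketch of the whole procedure, reusing the counting_table and disc helpers introduced above (again an illustration under assumed data structures, not the authors' pseudocode):

```python
def core_by_disc(objects, decisions, attrs):
    # An attribute belongs to the core iff its removal changes the discernibility measure.
    disc_A = disc(counting_table(objects, decisions, list(attrs)))
    core = set()
    for a in attrs:
        B = [b for b in attrs if b != a]   # all condition attributes except a
        if disc(counting_table(objects, decisions, B)) != disc_A:
            core.add(a)
    return core

# The tiny hypothetical table used earlier yields the same core, {1, 2}.
objects = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
decisions = [0, 1, 1, 0]
print(core_by_disc(objects, decisions, attrs=range(3)))
```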

3 MapReduce Implementation

The main concept of the proposed algorithm is the parallel computation of counting tables. The proposed algorithm consists of four steps: Map, Combine, Reduce and Compute Core.

figure c

The input of the Map function is a pair in which the key is a subtable id stored in HDFS and the value is a decision subtable \(DT_i=(U_i, A \cup \{d\})\). The first loop, in line 1, takes an object x from the decision subtable \(DT_i\) and emits a pair \(<key', value'>\), where \(key'\) is the information vector of x with respect to the set of all condition attributes together with the decision of x, and \(value'\) is the id of this object. The loop in line 5 takes an attribute a from the set of all condition attributes A; the set B is initialized as the set A without this attribute. The next step emits a pair \(<key', value'>\), where \(key'\) is the information vector of x with respect to the set B together with the decision of x, and \(value'\) is the id of this object.
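A hedged Python sketch of this Map step is shown below (streaming style rather than the authors' Java code). The subtable is assumed to be a list of (object id, attribute values, decision) triples, and the index of the removed attribute (None for the full set A) is carried inside the key so that later steps can route pairs to per-attribute outputs; this routing detail is our assumption.

```python
def map_fn(subtable_id, subtable, attrs):
    # subtable: iterable of (object_id, attribute_values, decision) triples.
    for obj_id, values, decision in subtable:
        # Key for the full condition attribute set A (removed attribute = None).
        yield (tuple(values[a] for a in attrs), decision, None), obj_id
        # One key per set B = A \ {a}.
        for a in attrs:
            B = [b for b in attrs if b != a]
            yield (tuple(values[b] for b in B), decision, a), obj_id
```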

figure d

The input of the Combine function consists of \(<key, value>\) pairs where the key is \((\vec {x}_B, d(x))\) and the value is the id of an object. This function accepts a key and the set of values associated with this key by the local Map. It emits pairs \(<key', value'>\), where \(key'\) is \((\vec {x}_B, d(x))\) and \(value'\) is the number of objects associated with this key in the decision subtable \(DT_i\).
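Under the same assumptions, the Combine step reduces the lists of object ids produced by a single mapper to their counts (a sketch only):

```python
def combine_fn(key, obj_ids):
    # key: (information vector, decision, removed attribute); obj_ids: ids from one subtable DT_i.
    yield key, len(list(obj_ids))
```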

figure e

The input of the Reduce function consists of \(<key, value>\) pairs where the key is \((\vec {x}_B, d(x))\) and the value is the number of objects of the decision subtable \(DT_i\) that belong to the equivalence class \([x]_B\) and have the same decision. The function emits pairs \(<key', value'>\), where \(key'\) is \((\vec {x}_B, d(x))\) and \(value'\) is the number of objects associated with this key in the whole decision table DT. Each pair is saved to a file whose name is based on the index of the attribute removed from the set of all condition attributes A.
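A corresponding sketch of the Reduce step; here the removed-attribute index kept in the key stands in for the per-attribute output files of the paper (which could be realized, for example, with Hadoop's MultipleOutputs):

```python
def reduce_fn(key, partial_counts):
    # Sum the per-subtable counts into the global counting-table entry for this key.
    info_vector, decision, removed_attr = key
    # In the paper the resulting pair is appended to a file named after removed_attr;
    # in this sketch it is simply emitted together with that index.
    yield (info_vector, decision, removed_attr), sum(partial_counts)
```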

figure f

The input of the Compute Core function is a directory containing the files whose names are based on the index of the attribute removed from the set of all condition attributes; the output is the core. At the beginning the core C is initialized as the empty set and the discernibility measure of the set of all attributes is set to zero. The first two loops, in lines 3 and 4, compare subsequent lines of the file containing the counting table for all condition attributes. If two lines contain information about different information vectors and different decisions, disc(A) is increased by the product of the two counts in these lines. The same operations are repeated for each file of \(<key, value>\) pairs, whose name indicates the attribute removed from the original set of condition attributes. If the computed discernibility measure of the set of attributes without this attribute is less than the measure for all attributes, the attribute is added to the core.
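Finally, a sketch of the sequential Compute Core step under the same assumptions, with an in-memory list of (information vector, decision, count) entries per removed attribute standing in for the files written by Reduce:

```python
def disc_from_entries(entries):
    # Discernibility measure computed from (information vector, decision, count) lines.
    total = 0
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            (v1, d1, c1), (v2, d2, c2) = entries[i], entries[j]
            if v1 != v2 and d1 != d2:
                total += c1 * c2
    return total

def compute_core(tables):
    # tables: dict mapping removed attribute (None = full set A) to its counting-table entries.
    disc_A = disc_from_entries(tables[None])
    return {a for a, entries in tables.items()
            if a is not None and disc_from_entries(entries) < disc_A}
```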

4 Experimental Results

The algorithm was run on the Hadoop MapReduce platform [1], where Hadoop MapReduce is a YARN-based system for parallel processing of big datasets. In the experiments, Hadoop version 2.5.1 was used. The cluster of compute nodes consisted of 19 PCs, each with four 3.4 GHz cores and 8 GB of memory.

In this paper, we present the results of experiments conducted on data about children with insulin-dependent diabetes mellitus (type 1). Diabetes mellitus is a chronic disease of the body's metabolism characterized by an inability to produce enough insulin to process carbohydrates, fat, and protein efficiently. Treatment requires injections of insulin. The database used in our experiments is described by twelve condition attributes, which include the results of physical and laboratory examinations, and one decision attribute (microalbuminuria). The data collection so far consists of 107 cases. The database is shown at the end of the paper [5]. A detailed analysis of the above data can be found in Chap. 6 of the book [6].

This database was used to generate bigger datasets consisting of \(0.5\cdot 10^6\) to \(30\cdot 10^6\) objects. The new datasets were created by randomly replicating rows of the original dataset. Numerical values were discretized, and each attribute value was encoded using two digits.

4.1 Speedup

In speedup tests, the dataset size is constant and the number of nodes grows in each step of the experiment. To measure speedup, we used the dataset containing \(30\cdot 10^6\) objects. The speedup obtained with an n-times larger system is measured as [7]:

$$ Speedup(n) = \frac{t_1}{t_n} $$

where n is the number of nodes in the cluster, \(t_1\) is the computation time on one node, and \(t_n\) is the computation time on n nodes.

An ideal parallel system with n nodes provides an n-fold speedup. Linear speedup is difficult to achieve because of I/O operations on data stored in HDFS and the cost of communication between nodes. Table 1 shows the computation time of generating the core from the dataset containing \(30\cdot 10^6\) objects, with the number of nodes varying from 1 to 19. As Fig. 1 shows, the proposed algorithm performs well up to about 14 nodes. A mapper does not always run on the node where its block of the HDFS file is stored; in that case the data block has to be transferred from the node where it is stored to the node where it is processed. The overhead of such read and write operations is the main reason why the speedup is not linear. In general, the larger the number of nodes, the better the performance.

Table 1. Speedup experiment results
Fig. 1. Speedup

4.2 Scaleup

Scaleup analysis measures the stability of the system when both the system and the dataset size grow in each step of the experiment. The scaleup coefficient is defined as follows [7]:

$$ Scaleup(DT,n) = \frac{t_{DT_1,1}}{t_{DT_n,n}} $$

where \(t_{DT_1,1}\) is the computation time for dataset DT on one node, and \(t_{DT_n,n}\) is the computation time for the n-times larger dataset on n nodes.

We examine how well the proposed parallel algorithm handles larger datasets when more nodes are available. In the scaleup experiments there is a linear relationship between the number of nodes and the dataset size: the core is generated for a dataset of \(0.5\cdot 10^6\) objects on one node and for \(9.5\cdot 10^6\) objects on 19 nodes. Fig. 2 clearly shows that the proposed algorithm is scalable.

Fig. 2. Scaleup

4.3 Sizeup

In sizeup tests, the number of nodes is constant and the dataset size grows in each step of the experiment. Sizeup measures how much more time is needed for the computation when the dataset is n times larger than the original one. Sizeup is defined as follows [7]:

$$ Sizeup(DT) = \frac{t_{DT_n}}{t_{DT_1}} $$

where \(t_{DT_1}\) is the execution time for a given dataset DT, and \(t_{DT_n}\) is the execution time for a dataset n times larger than DT. Figure 3 shows the results of the sizeup experiments on ten nodes. The results show that the proposed algorithm has very good sizeup performance: when the dataset grows ten times, the computation time grows only 2.3 times.

Fig. 3. Sizeup

5 Conclusions and Future Research

In this paper, generating the core for large datasets based on rough set theory is studied. A parallel attribute reduction algorithm is proposed, based on the MapReduce distributed programming model and the core-computation algorithm using the discernibility measure of a set of attributes. It is worth noting that an interesting element of the paper is the use of a counting table instead of a discernibility matrix. The results of the experiments show that the proposed method is efficient for large data and is a useful tool for data mining and knowledge discovery in big datasets. Our future research will focus on optimizing data placement in Hadoop to improve efficiency and on applications of distributed in-memory computing to attribute reduction.