Advertisement

Descriptive Analysis in Power Query with R

  • Leila Etaati
Chapter

Abstract

This chapter focuses on descriptive analysis in Power BI. A brief introduction explains how we can use descriptive analysis to help decision making. Next comes a brief introduction to clustering, how clustering is performed in Power BI Report, and how we can do clustering in Power Query Editor. Finally, I will cover how to do market basket analysis in Power BI.

This chapter focuses on descriptive analysis in Power BI. A brief introduction explains how we can use descriptive analysis to help decision making. Next comes a brief introduction to clustering, how clustering is performed in Power BI Report, and how we can do clustering in Power Query Editor. Finally, I will cover how to do market basket analysis in Power BI.

K-Means Clustering

Cluster analysis is the task of grouping similar objects in a same group or cluster. The clustered objects have more similarity to each other than other objects. For example, we may be interested in grouping our customers, based on their purchase behavior or demographic information [1]. Or, to cite an example in science, we may want to cluster the number and severity of earthquakes that occurred in New Zealand over the past ten years. For a medical purpose, we may want to classify patients with cancer, based on their laboratory results.

In machine learning, there are many different algorithms to help us do clustering. Some of these algorithms are centroid-based, such as k-means, GMM, Fuzzy c-means, and so forth. The others are based on connectivity [2], such as hierarchical clustering. There are some algorithms that are density-based or based on probability, dimensionality reduction, and neural networks/deep learning.

In this chapter, I am going to show you how we use k-means algorithms for classifying data. There is a data set for household power consumption that is available for free from the UCI Machine Learning Repository [3]. The data set is about the consumption of electric power in one household over a period of almost four years. This data set has such attributes as date, time, global active minute active, and reactive power, average voltage, average current intensity, and watt hours for kitchen, dishwasher, oven, water heater, and so forth. You are going to see how we can naturally classify these data, using two different approaches in Power BI.

Clustering Custom Visuals

The first and easy way to cluster data is to use one of the custom visuals in the Power BI store: Clustering. To access it, you first must log in to a Power BI account. In the Visualizations panel, click the three dots, then choose the From Store option. Under Advanced Analytics, you will find the Clustering option for custom visuals (Figure 8-1).
Figure 8-1

Clustering custom visuals

Add the custom visual to the Power BI environment. If it is your first time adding a Power BI custom visual, some of packages, such as Plotly, XML, and such will be installed. Next, you must choose four different columns from the filed area. As you can see in Figure 8-2, some columns, such as Voltage, Global intensity (household global minute-averaged current intensity), Submetering 3 and 1 (watt-hour activity energy for kitchen, dishwasher, water heater, and so forth) have been selected.
Figure 8-2

Choosing columns for clustering

After choosing the parameters, you must set them up. Click the format icon, then scale data to normalize it. Apply PCA, if you want to choose the number of clusters you can specify, or leave it as the default (clusters will be chosen automatically).

As you can see in Figure 8-3, the result of the clustering is shown in a clustering diagram. K-means clustering found out about four main clusters. As you can see in the picture cluster 4 has more items, while cluster 1 has fewer items than other clusters. However, the limitation of this visualization is that we cannot see the numbers and are not able to compare each cluster according to its attributes.
Figure 8-3

Clustring parameters and results

In the next section, the process of performing clustering analysis inside Power Query will be shown, as well as how we can use Power BI visual to demonstrate the result.

Clustering in Power Query

As explained in previous chapters, it is possible to do machine learning inside Power Query, using R code. In this section, the process of clustering data using R code inside Power Query will be explained, and you will see how we can use a Power BI visualization to better describe the data grouped by the relevant clusters.

First, we must go to the Query Editor, then we have to duplicate the data set that we have, by right-clicking the data set and choosing the Duplicate option (shown in Figure 8-4).
Figure 8-4

Duplicating the household power consumption data set

After duplicating the data set, you will have a copy of the power consumption for household data that you must transform before applying the R code.
  1. 1.

    Remove the date and time columns. Click each column, then right-click and select Remove.

     
  2. 2.

    Remove errors. You may have some errors, so you will have to set a rule for dealing with them, e.g., click the Remove rows and choose Remove Errors or Remove Blank Rows or Remove Missing Values (Figure 8-5).

     
Figure 8-5

Data cleaning

After data cleaning, you can run R scripts by clicking the Transform tab and selecting the Run R Script option. In R editor, first we must bring data in the same scale, using the normalization function in R, as follows:

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x))) }
NormalizedData <- as.data.frame(lapply(dataset, normalize))

The function kmeans clusters data using k-means clustering algorithms. This algorithm requires a data set and the number of clusters.

ClusterData<-kmeans(NormalizedData ,4)

After clustering the data, we must create an output data set. For this, we leverage the data.frame function, to create a data frame data set from the original data set and the number of clusters.

PowerConsumption<-data.frame(dataset,ClusterData$cluster)

After running the code, the data set now has original data, plus a new column that shows the cluster number. Moreover, we must add an index column to the current data set, to have a primary key for further analysis. To add a new column, click the Add Column tab in the top menu, then click Index Column, which starts from 0 by default, but you can change it to 1, by choosing another option.

As you can see in Figure 8-6, our data has now been prepared for visualization and further analysis. You now simply close and apply the changes.
Figure 8-6

Clustering data in Power Query

In the Visualizations panel, we are going to use some simple visualization to compare the clusters (Figure 8-7). The first chart shows the number of items in each cluster. We can see how many items each cluster has.
Figure 8-7

Number of items in each cluster

Also, you can use the Funnel chart option, to show the average of an attribute for a cluster (Figure 8-8).
Figure 8-8

Cluster comparison data

It is also possible to do a what-if analysis. For example, if the voltage changes from 244 to 254, we can check what clusters mainly have that range and which other attributes have that subset in common. As you can see in Figure 8-9, clusters 2 and 3 mainly cover the 244 to 254 voltage range.
Figure 8-9

What-if analysis of clustering results

Market Basket Analysis (Association Rules)

Another approach for descriptive analysis and data mining is market basket analysis, or association rules. Market Basket (association rules) analysis is used to find customer purchasing behavior by stores, to determine the related items that have been purchased together. This approach is not only used for optimizing product placement and marketing investments but also for determining rules in health care, government policies, events management, and so forth [5].

In this section, I will explain how market basket analysis can be used, how to write it in R, and how to come up with good rules.

What Is Market Basket Analysis?

This analysis examines customer purchasing behavior. For example, for a beauty supply store, this analysis might reveal some association related to how frequently customers purchase shampoo and conditioner together. From a marketing perspective, perhaps promoting shampoo will lead customers to buying conditioner as well. From a sales perspective, placing shampoo beside conditioner on a store shelf will encourage customers to purchase both [4].

As stated, “association rules” is another name for market basket analysis. Association rules take the form if X then Y. For example, 60% of those who buy life insurance may then buy health insurance. To cite another example, 80% of those who buy books online may then also buy music online. Also, 50% of those who have high blood pressure and are overweight may then have high cholesterol [3].

Other examples of association rules include searching for interesting and frequently occurring patterns of DNA and protein sequences in cancer data, finding patterns in purchases or medical claims that occur in combination with fraudulent credit card or insurance claims, and identifying combinations of behavior that precede customers dropping their cellular phone service or upgrading their cable television package. You can see that association rules are not just about shopping but also can be applied in health care, insurance, and other industries. The main mathematical calculation of this method is explained here: http://radacad.com/make-business-decisions-market-basket-analysis-part-1 .

There are three main attributes for market basket analysis.
  • Support: The support of an item set or rule measures how frequently it occurs in the data.

  • Confidence: Confidence is about the probability of a rule. For example, if a customer has purchased a flower, what is the probability that they are going to purchase a get-well-soon card.

  • Lifts: Importance of the rules, that is, how much of the variance it explains

The first step in each analysis is to identify the main business problem. In this book, I have followed the examples provided in Lantz [6], which identify the most shopping list items that have been purchased together by customers. The second step is to import data into Power BI. You can download the data from GitHub [7]. This data set contains information about more than 9,000 customer transactions. For market basket analysis in R, we must install the arules package. In addition, we must install two other packages in RStudio related to arules: Matrix and Methods. To install these packages, you must have one R version in your machine, such as RStudio or Microsoft R Open. To install these packages, open R IDE, click New R Scripts, and type “Install.packages(<Package Name>).”

We follow the same procedure described in previous chapters for loading data and running the R scripts in Power Query (Figure 8-10).
Figure 8-10

Importing a data set into Power BI and Run R Script

You must then put the following code into the R scripts editor:

library(Matrix)
library(arules)
library(methods)
groceries <- read.transactions("[your local address]/groceries.csv", sep = ",")
Temp<-apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
output<-inspect(Temp[1:100])

In this code, the apriori function from arules is used.

The result of finding association rules in customer behavior is shown in Figure 8-11. The first column (lhs) lists the main item that people purchased; the third column (rhs) lists the items related to the lhs items. The support, confidence, and lift measures are shown in the fourth to sixth columns, respectively. A total 100 rules have been extracted. Finally, click Close and Apply at the top left side. Now we can create a visualization for showing an item and related items in Power BI.
Figure 8-11

Result of the market basket analysis

Now we can create a visualization for showing an item and related items in Power BI. We use a specific custom visualization, “Forced-Directed Graph,” from the Power BI web site, to show the relationships between items. To get this visual, click Market Place at the top of the Visualizations tab, then search for “Forced-Directed Graph.

After importing the custom visual, the related columns should also be imported (Figure 8-12). For the source, we put lh. For the target, rhs should be used. For the weight, we use the lift attribute.
Figure 8-12

Importing the custom visual and related columns

Some of the visualization parameters, as shown in Figure 8-13, must be changed, to see the visualization better (Figure 8-14).
Figure 8-13

Power Query filter value

Figure 8-14

Custom visual parameters setting

We need another custom visual, Chiclet Slicer, from the Power BI store. This custom visual is a specific slicer that can be used to slice and dice the charts with a specific image. As previously, click Market Place, then search for Chiclet Slicer. After importing the visual, choose the value rhs, copy the slicer, and set the value lhs (Figure 8-15).
Figure 8-15

Importing Chiclet Slicer

You will see a report, as shown in Figure 8-16.
Figure 8-16

Market basket analysis report

By choosing an item, such as sliced cheese, with an rhs value in the slicer, a graph chart shows us the related items that people purchase most frequently with that item, such as sausage and yogurt.

Summary

This chapter gave a brief introduction to descriptive analytics. First, an explanation of the clustering was given. How to use k-mean clustering in Power BI visualization was then explained, as was the process of clustering in Power Query Editor. Finally, an explanation of market basket analysis was presented.

References

  1. [1]

    Ilknur Kaynar-Kabul, “Understanding data mining clustering methods,” The SAS Data Science Blog, http://blogs.sas.com/content/subconsciousmusings/2016/05/26/data-mining-clustering/#prettyPhoto/0/ , May 26, 2016.

     
  2. [2]

    Quora, “What are the best clustering algorithms used in machine learning?” www.quora.com/What-are-the-best-clustering-algorithms-used-in-machine-learning .

     
  3. [3]

    Georges Hebrail and Alice Berard, “Individual household electric power consumption Data Set,” UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption , August 30, 2012.

     
  4. [4]

    Leila Etaati, “Make Business Decisions: Market Basket Analysis Part 1,” RADACAD, http://radacad.com/make-business-decisions-market-basket-analysis-part-1 , February 14, 2017.

     
  5. [5]
     
  6. [6]

    Lantz, Brett. Machine Learning with R. Birmingham, UK: Packt Publishing, 2015.

     
  7. [7]

    Zach Sednick, “Machine Learning with R datasets,” GitHub, https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/groceries.csv , 2019.

     

Copyright information

© Leila Etaati 2019

Authors and Affiliations

  • Leila Etaati
    • 1
  1. 1.Aukland, AucklandNew Zealand

Personalised recommendations