Introduction to Machine Learning and R
Abstract
Beginners to machine learning are often confused by the plethora of algorithms and techniques taught in subjects like statistical learning, data mining, artificial intelligence, soft computing, and data science. It’s natural to wonder how these subjects differ from one another and which is best for solving real-world problems. There is substantial overlap among these subjects, and it’s hard to draw a clear Venn diagram explaining the differences. The foundation for all of them is derived primarily from probability and statistics. However, many statisticians probably won’t agree that machine learning gave life to statistics, which gives rise to never-ending chicken-and-egg discussions. Without spending much effort on the pros and cons of that discussion, it’s wise to believe that the power of statistics needed a pipeline to flow across different industries with challenging problems to solve, and machine learning simply established that high-speed and frictionless pipeline. The other subjects that evolved from statistics and machine learning are simply trying to broaden the scope of these two subjects and put them under a bigger banner.
Except for statistical learning, which is generally offered by mathematics or statistics departments in the majority of universities across the globe, the rest of these subjects (machine learning, data mining, artificial intelligence, and soft computing) are taught by computer science departments.
In recent years, this separation has been disappearing, but the collaboration between the two departments is still not complete. Programmers are intimidated by complex theorems and proofs, and statisticians hate talking (read: coding) to machines all the time. But as more industries become data- and product-driven, the need to get the two departments to speak a common language is strongly emphasized. Roles in industry have been revamped to create openings like machine learning engineers, data engineers, and data scientists within a broad group called the data science team.
The purpose of this chapter is to take one step back and demystify the terminologies as we travel through the history of machine learning and emphasize that putting the ideas from statistics and machine learning into practice by broadening the scope is critical.
At the same time, we elaborate on the importance of learning the fundamentals of machine learning with an approach inspired by contemporary techniques from data science. We have simplified the mathematics as far as possible without compromising the fundamentals and core of the subject. The right balance of statistics and computer science is always required for understanding machine learning, and we have made every effort to help our readers appreciate the elegance of mathematics, which at times is perceived by many to be hard and full of convoluted definitions, theories, and formulas.
1.1 Understanding the Evolution
The first challenge anybody faces when starting to understand how to build intelligent machines is how to mimic human behavior in many ways or, to put it even more appropriately, how to do things even better and more efficiently than humans. Some examples of tasks performed by machines are identifying spam emails, predicting customer churn, classifying documents into respective categories, playing chess, participating in Jeopardy!, cleaning the house, playing football, and much more. Carefully looking at these examples will reveal that humans haven’t perfected these tasks to date and rely heavily on machines to help them. So, now the question remains, where do you start learning to build such intelligent machines? Often, depending on which task you want to take up, experts will point you to machine learning, artificial intelligence (AI), or many such subjects that sound different by name but are intrinsically connected.
In this chapter, we have taken up the task to knit together this evolution and finally put forth the point that machine learning, which is the first block in this evolution, is where you should fundamentally start to later delve deeper into other subjects.
1.1.1 Statistical Learning
The whitepaper Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society by the American Statistical Association (ASA) [1], published in July 2014, describes statistics as “the science of learning from data, and of measuring, controlling, and communicating uncertainty” and calls it “the most mature of the data sciences”. This discipline has been an essential part of the social, natural, biomedical, and physical sciences, engineering, and business analytics, among others. Statistical thinking not only helps make scientific discoveries, but it also quantifies the reliability, reproducibility, and general uncertainty associated with these discoveries. This excerpt from the whitepaper is very precise and powerful in describing the importance of statistics in data analysis.
Tom Mitchell, in his article, “The Discipline of Machine Learning [2],” appropriately points out, “Over the past 50 years, the study of machine learning has grown from the efforts of a handful of computer engineers exploring whether computers could learn to play games, and a field of statistics that largely ignored computational considerations, to a broad discipline that has produced fundamental statistical-computational theories of learning processes.”
This learning process has found its application in a variety of tasks for commercial and profitable systems like computer vision, robotics, speech recognition, and many more. At large, it’s when statistics and computational theories are fused together that machine learning emerges as a new discipline.
1.1.2 Machine Learning (ML)
The Samuel Checkers-Playing Program, which is known to be the first computer program that could learn, was developed in 1959 by Arthur Lee Samuel, one of the fathers of machine learning. Following Samuel, Ryszard S. Michalski, also deemed a father of machine learning, came out with a system for recognizing handwritten alphanumeric characters, working along with Jacek Karpinski in 1962-1970. The subject has since evolved with many facets and led the way for various applications impacting businesses and society for the good.
Tom Mitchell defined the fundamental question machine learning seeks to answer as, “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” He further explains that the defining question of computer science is, “How can we build machines that solve problems, and which problems are inherently tractable/intractable?”, whereas statistics focuses on answering “What can be inferred from data plus a set of modeling assumptions, with what reliability?”
This set of questions clearly shows the difference between statistics and machine learning. As mentioned earlier in the chapter, it might not even be necessary to deal with the chicken and egg conundrum, as we clearly see that one simply complements the other and is paving the path for the future. As we dive deep into the concepts of statistics and machine learning, you will see the differences clearly emerging or at times completely disappearing. Another line of thought, in the paper “Statistical Modeling: The Two Cultures” by Leo Breiman in 2001 [3], argued that statisticians rely too heavily on data modeling, and that machine learning techniques are instead focusing on the predictive accuracy of models.
1.1.3 Artificial Intelligence (AI)
AI also sits at the core of robotics. The 1971 Turing Award winner, John McCarthy, a well-known American computer scientist, is believed to have coined this term, and in his article titled “What Is Artificial Intelligence?” he defined it as “the science and engineering of making intelligent machines [4]”. So, if you relate back to what we said about machine learning, you instantly sense a connection between the two, but AI goes the extra mile to bring together a number of sciences and professions, including linguistics, philosophy, psychology, neuroscience, mathematics, and computer science, as well as other specialized fields such as artificial psychology. It should also be pointed out that machine learning is often considered to be a subset of AI.
1.1.4 Data Mining
Knowledge Discovery and Data Mining (KDD), a premier forum for data mining, states its goal to be the advancement, education, and adoption of the “science” of knowledge discovery and data mining. Data mining, like ML and AI, has emerged as an interdisciplinary subfield of computer science, and for this reason KDD commonly presents data mining methods as the intersection of AI, ML, statistics, and database systems. Data mining techniques were integrated into many database systems and business intelligence tools when the adoption of analytic services was starting to explode in many industries.
The research paper, “WEKA Experiences with a Java Open-Source Project”[5] (WEKA is one of the most widely adopted tools for doing research and projects using data mining), published in the Journal of Machine Learning Research, talked about how the classic book, Data Mining: Practical Machine Learning Tools and Techniques with Java,[6] was originally named just Practical Machine Learning, and the term data mining was only added for marketing reasons. Eibe Frank and Mark A. Hall, who wrote this research paper, are two coauthors of the book, so we have a strong rationale to believe this reason for the name change. Once again, we see ML fundamentally being at the core of data mining.
1.1.5 Data Science
It’s not wrong to call data science a big umbrella that brought everything with a potential to show insight from data and build intelligent systems inside it. In the book, Data Science for Business [7], Foster Provost and Tom Fawcett introduced the notion of viewing data and data science capability as a strategic asset, which will help businesses think explicitly about the extent to which one should invest in them. In a way, data science has emphasized the importance of data more than the algorithms of learning.
We strongly believe the fundamentals of these different fields of study are all derived from statistics and machine learning, but different flavors, each justifiable in its own context, were given to them, which helped the subject be molded into various systems and areas of research. This book will help trim down the number of different terminologies being used to describe the same set of algorithms and tools. It will present, in a simple-to-understand and coherent way, the algorithms of machine learning and their practical use with R. Wherever it’s appropriate, we will emphasize the need to go outside the scope of this book and guide our readers to the relevant materials. By doing so, we are reemphasizing the need for mastering traditional approaches in machine learning and, at the same time, staying abreast of the latest developments in tools and technologies in this space.
Our design of topics in this book is strongly influenced by the data science framework, but instead of wandering through the vast pool of tools and techniques you would find in the world of data science, we have kept our focus strictly on teaching practical ways of applying machine learning algorithms with R.
The rest of this chapter is organized to help readers understand the elements of probability and statistics and programming skills in R. Both of these will form the foundations for understanding and putting machine learning into practical use. The chapter ends with a discussion of technologies that apply ML to a real-world problem. Also, a generic machine learning process flow will be presented showing how to connect the dots, starting from a given problem statement to deploying ML models to working with real-world systems.
1.2 Probability and Statistics
Coin toss:
 1. What are the chances of a fair coin coming up heads 10 times in a row?
 2. If my friend flips a coin 10 times and gets 10 heads, is she playing a trick on me?
Deck of cards:
 1. How likely is it that five cards drawn from a perfectly shuffled deck will all be hearts?
 2. If five cards off the top of the deck are all hearts, how likely is it that the deck was shuffled?
In the case of the coin toss, the first question can be answered if we know the coin is fair: there’s a 50% chance that any individual coin flip will come up heads or, in probability notation, P(heads) = 0.5. So, our probability is P(heads 10 times in a row) = 0.0009765625 (since all 10 coin tosses are independent of each other, we can simply compute (0.5)^10 to arrive at this value). The probability value 0.0009765625 quantifies the chances of a fair coin coming up heads 10 times in a row.
On the other side, such a small probability means the event (heads 10 times in a row) is very rare, which suggests that my friend was playing some trick on me when she got all heads. Think about this: does tossing a coin 10 times give you strong evidence for doubting your friend? Maybe not; you may ask her to repeat the process several times. The more data we generate, the better the inference will be. The second set of questions follows the same thought process but is applied to a different problem. We encourage you to perform the calculations yourself to answer the questions.
So, fundamentally, probability could be used as a tool in statistics to help us answer many such real-world questions using a model. We will explore some basics of both these worlds, and it will become evident that both converge at a point where it’s hard to observe many differences between the two.
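The first question in each set can be worked out in a couple of lines of R. The following is a minimal sketch; the variable names are ours and chosen only for illustration.

```r
# Probability of 10 heads in a row with a fair coin.
# The tosses are independent, so the probabilities multiply.
p_heads <- 0.5
p_ten_heads <- p_heads^10
print(p_ten_heads)  # 0.0009765625

# Probability that five cards drawn from a well-shuffled 52-card deck
# are all hearts: ways to pick 5 of the 13 hearts divided by the
# ways to pick any 5 of the 52 cards.
p_five_hearts <- choose(13, 5) / choose(52, 5)
print(p_five_hearts)  # roughly 0.0005
```

The second question in each set is an inference question and cannot be answered by these numbers alone; it also depends on how plausible a trick or an unshuffled deck was to begin with.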
1.2.1 Counting and Probability Definition
It’s easy to count the total number of possible outcomes in such a simple example with three coins, but as the size and complexity of the problem increase, manual counting is not an option. A more formal approach is to use combinations and permutations. If the order is of significance, we call it a permutation; otherwise, the term combination is generally used. For instance, if it doesn’t matter which of the three coins gets heads or tails and we are only interested in the number of heads, the order has no significance, and the possible combinations are {HHH, HHT, HTT, TTT}. This means HHT and HTH are the same, since both outcomes have two heads. A more formal way to obtain the number of possible outcomes is shown in Table 1-1. It’s easy to see that, for n = 2 (heads and tails) and k = 3 (three coins), we get eight possible permutations and four combinations.
Table 1-1. Permutations and Combinations
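The counts of eight and four can be checked by brute force in R. The sketch below assumes the standard with-repetition counting formulas, n^k for ordered outcomes and choose(n + k - 1, k) for unordered ones; we believe these are the ones the table refers to, but treat that as our assumption.

```r
n <- 2  # possible results per coin: heads (H) or tails (T)
k <- 3  # number of coins

# Enumerate every ordered outcome of three coin tosses.
outcomes <- expand.grid(coin1 = c("H", "T"),
                        coin2 = c("H", "T"),
                        coin3 = c("H", "T"),
                        stringsAsFactors = FALSE)
n_permutations <- nrow(outcomes)

# When order does not matter, only the number of heads distinguishes
# one outcome from another, so count the distinct head totals.
heads <- rowSums(outcomes == "H")
n_combinations <- length(unique(heads))

print(c(permutations = n_permutations, combinations = n_combinations))

# Compare with the closed-form counts.
print(c(n^k, choose(n + k - 1, k)))
```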
This way of calculating the probability using the counts or frequency of occurrence is also known as frequentist probability. There is another class called Bayesian probability or conditional probability, which we will explore later in the chapter.
1.2.2 Events and Relationships
In the previous section, we saw an example of an event. Let’s go a step further and set a formal notion around various events and their relationship with each other.
1.2.2.1 Independent Events
A and B are independent if the occurrence of A gives no additional information about whether B occurred. Imagine that Facebook enhances its Nearby Friends feature and tells you the probability of your friend visiting, on weekends, the same cineplex you frequent for movies. In the absence of such a feature in Facebook, the information that you are a very frequent visitor to this cineplex doesn’t really increase or decrease the probability of you meeting your friend at the cineplex. This is because the two events (A, you visiting the cineplex for a movie, and B, your friend visiting the cineplex for a movie) are independent.
On the other hand, if such a feature exists, we can’t deny you would try your best to increase or decrease your probability of meeting your friend depending on if he or she is close to you or not. And this is only possible because the two events are now linked by a feature in Facebook.
Let’s take another example, this time of a dependent event. When the sun is out in Pleasantville it never rains; however, if the sun is not out, it will definitely rain. Farmer John cannot harvest crops in the rain. Therefore, any individual harvest is dependent on it not raining.
Formally, two events A and B are independent if the following conditions hold:
 1. The probability of events A and B occurring at the same time is equal to the product of the probability of event A and the probability of event B: P(A∩B) = P(A)P(B)
 2. The probability of event A given B has already occurred is equal to the probability of A: P(A | B) = P(A)
 3. Similarly, the probability of event B given A has already occurred is equal to the probability of B: P(B | A) = P(B)
For example, in three coin tosses let event A = tossing at least two heads and event B = tossing a head on the first coin. Then P(A∩B) = 3/8 = 0.375, whereas P(A)P(B) = 4/8 * 4/8 = 0.25, which is not equal to P(A∩B), so A and B are not independent. The other two conditions can be checked in the same way.
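These numbers can be verified by enumerating the eight outcomes of three coin tosses. The sketch below reads event A as "at least two heads", which is the reading that reproduces the 4/8 and 3/8 used in the text; the variable names are ours.

```r
# All eight equally likely outcomes of three coin tosses.
outcomes <- expand.grid(c1 = c("H", "T"),
                        c2 = c("H", "T"),
                        c3 = c("H", "T"),
                        stringsAsFactors = FALSE)
heads <- rowSums(outcomes == "H")

A <- heads >= 2          # event A: at least two heads
B <- outcomes$c1 == "H"  # event B: head on the first coin

p_A  <- mean(A)          # 4/8
p_B  <- mean(B)          # 4/8
p_AB <- mean(A & B)      # 3/8

# Condition 1: does P(A and B) equal P(A) * P(B)?
print(c(p_AB = p_AB, product = p_A * p_B))

# Condition 2: does P(A | B) equal P(A)?
p_A_given_B <- p_AB / p_B
print(c(p_A_given_B = p_A_given_B, p_A = p_A))
```

Both comparisons fail, which confirms that knowing the first coin came up heads changes the chance of seeing at least two heads.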
1.2.2.2 Conditional Independence
In the Facebook Nearby Friends example, we were able to ascertain that the probability of you and your friend both visiting the cineplex at the same time has something to do with your location and intentions. Though intentions are very hard to quantify, that is not the case with location. So, if we define the event C as being in a location near the cineplex, it’s not difficult to calculate the probability. But even when you are both nearby, you and your friend will not necessarily visit the cineplex. This is where we formally define conditional independence: A and B are conditionally independent given C if P(A∩B | C) = P(A | C)P(B | C).
Note here that independence does not imply conditional independence, and conditional independence does not imply independence. In a way, conditional independence says that once C is known to have occurred, learning about A gives no additional information about B.
1.2.2.3 Bayes Theorem
Bayes theorem relates the two conditional probabilities as P(A | B) = P(B | A)P(A) / P(B), where P(B) ≠ 0. P(A) is then called the prior probability and P(A | B) the posterior probability, which is the measure we get after the additional information B is known. Let’s look at Table 1-2, a two-way contingency table for our Facebook Nearby example, to explain this better.
Table 1-2. Facebook Nearby Example of Two-Way Contingency Table
This example is based on the two-way contingency table and provides a good intuition around conditional probability. We will dive deep into the machine learning algorithm called Naive Bayes, which is based on the Bayes theorem, as applied to a real-world problem later in Chapter 6.
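To make the prior-to-posterior update concrete, here is a small R sketch built on a two-way contingency table. The counts are hypothetical numbers we made up for illustration; they are not the values from the table above.

```r
# HYPOTHETICAL counts (not the book's table): rows say whether your
# friend is nearby, columns whether your friend visits the cineplex.
counts <- matrix(c(10, 15,
                    2, 73),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(nearby = c("yes", "no"),
                                 visit  = c("yes", "no")))
total <- sum(counts)

p_A <- sum(counts[, "yes"]) / total   # prior P(A): friend visits
p_B <- sum(counts["yes", ]) / total   # P(B): friend is nearby
p_B_given_A <- counts["yes", "yes"] / sum(counts[, "yes"])

# Bayes theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_A_given_B <- p_B_given_A * p_A / p_B

# The same posterior read directly off the "nearby = yes" row.
p_direct <- counts["yes", "yes"] / sum(counts["yes", ])

print(c(prior = p_A, posterior = p_A_given_B, direct = p_direct))
```

With these made-up counts, knowing that the friend is nearby raises the probability of a visit from 0.12 to 0.4.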
1.2.3 Randomness, Probability, and Distributions
David S. Moore et al.’s book, Introduction to the Practice of Statistics [9], is an easy-to-comprehend book with simple mathematics but conceptually rich ideas from statistics. It very aptly points out that “random” in statistics is not a synonym for “haphazard” but a description of a kind of order that emerges in the long run. The authors further explain that we often deal with unpredictable events in our daily lives that we generally term random, like the example of Facebook’s Nearby Friends, but we rarely see enough repetition of the same random phenomenon to observe the long-term regularity that probability describes.

“We call a phenomenon random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions. The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.”
This leads us to define a random variable that stores such a random phenomenon numerically. In any experiment involving random events, a random variable, say X, is assigned a numerical value based on the outcomes of the events. The probability distribution of X helps in finding the probability of a value being assigned to X.
For example, if we define X = {number of heads in three coin tosses}, then X can take the values 0, 1, 2, and 3. Here we call X a discrete random variable. However, if we define X = {all values between 0 and 2}, there can be infinitely many possible values, so X is called a continuous random variable.
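Both kinds of random variable can be explored with R's built-in distribution functions. The sketch below assumes a fair coin for the discrete case and, purely as our own assumption, a uniform distribution on the interval from 0 to 2 for the continuous case (the text does not name a distribution).

```r
# Discrete: X = number of heads in three fair coin tosses.
x_values <- 0:3
pmf <- dbinom(x_values, size = 3, prob = 0.5)
names(pmf) <- x_values
print(pmf)  # probabilities 1/8, 3/8, 3/8, 1/8

# Long-run regularity: the proportion of heads in many simulated
# flips settles near 0.5, even though single flips are uncertain.
set.seed(42)
flips <- rbinom(10000, size = 1, prob = 0.5)
prop_heads <- mean(flips)
print(prop_heads)

# Continuous: X uniform between 0 and 2 (assumed distribution).
# Single points have probability zero, so we ask about intervals.
p_below_half <- punif(0.5, min = 0, max = 2)
print(p_below_half)  # 0.25
```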
1.2.4 Confidence Interval and Hypothesis Testing
Suppose you were running a socioeconomic survey for your state among a chosen sample from the entire population (assuming it’s chosen totally at random). As the data starts to pour in, you feel excited and, at the same time, a little confused on how you should analyze the data. There could be many insights that can come from data and it’s possible that every insight may not be completely valid, as the survey is only based on a small randomly chosen sample.
The Law of Large Numbers (discussed in more detail in Chapter 3) tells us that the sample mean must approach the population mean as the sample size increases. In other words, it’s not required that you survey each and every individual in your state; rather, you choose a sample large enough to be a close representative of the entire population. Even though measuring uncertainty gives us the power to make better decisions, in order to make our insights statistically significant, we need to create a hypothesis and perform certain tests.
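A short simulation shows the sample mean closing in on the population mean. The population here is simulated with a hypothetical mean of 50 and standard deviation of 10; none of these numbers come from a real survey.

```r
set.seed(123)
population_mean <- 50  # hypothetical value for illustration
responses <- rnorm(100000, mean = population_mean, sd = 10)

# Mean of the first n responses for increasingly large n.
sample_sizes <- c(10, 100, 1000, 10000, 100000)
sample_means <- sapply(sample_sizes, function(n) mean(responses[1:n]))

print(data.frame(n = sample_sizes, sample_mean = sample_means))
```

The small samples can land noticeably away from 50, while the largest ones sit very close to it.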
1.2.4.1 Confidence Interval
A confidence interval generally takes a form like this:
estimate ± margin of error
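As a minimal sketch of this form, the R code below builds a 95% confidence interval for a mean. It assumes a hypothetical sample and a known population standard deviation, so that the margin of error is z* times sigma / sqrt(n); all the numbers are invented for illustration.

```r
# Hypothetical sample and an assumed known population sd.
x <- c(52, 48, 55, 60, 45, 50, 58, 47, 53, 51)
sigma <- 5
conf_level <- 0.95

estimate <- mean(x)
# Critical value z* leaving (1 - conf_level) / 2 in each tail.
z_star <- qnorm(1 - (1 - conf_level) / 2)
margin_of_error <- z_star * sigma / sqrt(length(x))

ci <- c(lower = estimate - margin_of_error,
        upper = estimate + margin_of_error)
print(ci)
```

When the population standard deviation is not known, the usual alternative is a t-based interval, which `t.test(x)$conf.int` reports.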
1.2.4.2 Hypothesis Testing
Hypothesis testing is sometimes also known as a test of significance. Although a CI is a strong representative of the population estimate, we need a more robust and formal procedure for testing and comparing an assumption about population parameters of the observed data. The application of hypothesis testing is widespread, from assessing the reliability of a sample used in a survey for an opinion poll to finding out the efficacy of a new drug over an existing drug for curing a disease. In general, hypothesis tests are tools for checking the validity of a statement around certain statistics relating to an experiment design. If you recall, the high-level architecture of IBM’s DeepQA has an important step called hypothesis generation in coming out with the most relevant answer for a given question.
Hypothesis testing consists of two statements that are framed on the population parameter, one of which we want to reject. As we saw while discussing CI, the sampling distribution of the sample mean \( \overline{\mathrm{x}} \) follows a normal distribution \( \mathrm{N}\left(\mu, \sigma /\sqrt{\mathrm{n}}\right) \). One of the most important concepts is the Central Limit Theorem (a more detailed discussion on this topic is in Chapter 3), which tells us that for large samples, the sampling distribution is approximately normal. Since the normal distribution is one of the most explored distributions with all of its properties well known, this approximation is vital for every hypothesis test we would like to perform.

H_{o}: There is no difference in the true mean income

H_{a}: The true mean incomes are not the same
In case we reject H_{o}, we must choose which alternative to test: \( \overline{\mathrm{x}} \) > 0, \( \overline{\mathrm{x}} \) < 0, or simply \( \overline{\mathrm{x}} \) ≠ 0 without regard to direction, which is called a two-sided test. If you are clear about the direction, a one-sided test is preferred.
where Z has the standard normal distribution N(0, 1). This probability is called the p-value. We will use this value quite often in regression models.
Since the probability is large enough, we have no other choice but to stick with our null hypothesis. In other words, we don’t have enough evidence to reject the null hypothesis. It could also be stated as follows: there is a 35% chance of observing a difference as extreme as the $1400 in our sample if the true population difference is zero. A note here, though: there are numerous other ways to state our result, all of which mean the same thing.
Finally, in many practical situations, it’s not enough to say that the probability is large or small; instead, it’s compared to a significance or confidence level. So, if we are given a 95% confidence interval (in other words, the interval that includes the true value of μ with 0.95 probability), values of μ that are not included in this interval would be incompatible with the data. Now, using the threshold α = 0.05 (95% confidence), we observe that the p-value is greater than 0.05 (or 5%), which means we still do not have enough evidence to reject H_{o}. Hence, we cannot conclude that the mean income differs between the years 2005 and 2015.
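The decision rule can be sketched in R as a two-sided z-test. The observed difference of $1400 comes from the example above, but the standard error used here is a hypothetical value of our own, picked only so that the p-value lands near the 35% quoted in the text; the book's actual sample figures are not reproduced.

```r
observed_diff  <- 1400  # difference in sample mean income (from the text)
standard_error <- 1500  # HYPOTHETICAL standard error, for illustration only

# Test statistic under H_o (true difference is zero).
z <- observed_diff / standard_error

# Two-sided p-value: probability of a difference at least this
# extreme in either direction when Z ~ N(0, 1).
p_value <- 2 * pnorm(-abs(z))

alpha <- 0.05
reject_null <- p_value < alpha

print(c(z = z, p_value = p_value))
print(reject_null)  # FALSE: not enough evidence to reject H_o
```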
There are many other ways to perform hypothesis testing, which we leave for the interested readers to refer to detailed text on the subject. Our major focus in the coming chapters is to do hypothesis testing using R for various applications in sampling and regression.
We have introduced the fields of probability and statistics, both of which form the foundation of data exploration and of our broader goal of understanding predictive modeling using machine learning.
1.3 Getting Started with R
R is GNU S, a freely available language and environment for statistical computing and graphics that provides a wide variety of statistical and graphical techniques: linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, and a lot more than you might imagine.
Although covering the complete topics of R is beyond the scope of this book, we will keep our focus intact by looking at the end goal of this book. The getting started material here is just to provide the familiarity to readers who don’t have any previous exposure to programming or scripting languages. We strongly advise that the readers follow R’s official website for instructions on installing and some standard textbook for more technical discussion on topics.
1.3.1 Basic Building Blocks
This section provides a quick overview of the building blocks of R, which make R one of the most sought-after programming languages among statisticians, analysts, and scientists. R is easy to learn and an excellent tool for developing prototype models very quickly.
1.3.1.1 Calculations
As you would expect, R provides all the arithmetic operations you would find in a scientific calculator and much more. All kinds of comparisons like >, >=, <, and <=, and functions such as acos, asin, atan, ceiling, floor, min, max, cumsum, mean, and median are readily available for all possible computations.
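A few of these operations, tried on made-up numbers at the R console, look like this. The variable names are ours.

```r
# Arithmetic and comparisons
basic <- (5 + 3) * 2 / 4   # 4
power <- 2^10              # 1024
is_greater <- 5 > 3        # TRUE

# Functions applied to a small numeric vector
x <- c(4.2, 1.7, 9.5, 3.3)
rounded_up   <- ceiling(x)  # 5 2 10 4
rounded_down <- floor(x)    # 4 1 9 3
running_sum  <- cumsum(c(1, 2, 3, 4))  # 1 3 6 10

print(c(min = min(x), max = max(x), mean = mean(x), median = median(x)))

# Trigonometric functions work in radians
angle <- acos(1)            # 0
```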
1.3.1.2 Statistics with R
R is one such language that’s very friendly to academicians and people with less programming background. The ease of computing statistical properties of data has also given it widespread popularity among data analysts and statisticians. Functions are provided for computing quantiles and ranks, sorting data, and matrix manipulation like crossprod, eigen, and svd. There are also some really easy-to-use functions for building linear models quite quickly. A detailed discussion on such models will follow in later chapters.
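The functions named above can be tried on small made-up inputs, as in the following sketch. The data, including the nearly linear y used for the model, are invented for illustration.

```r
x <- c(12, 7, 3, 9, 15, 7)
print(quantile(x, probs = c(0.25, 0.5, 0.75)))
ranks <- rank(x)    # ties share the average rank
sorted <- sort(x)

# Matrix manipulation on a small diagonal matrix
m <- matrix(c(2, 0, 0, 3), nrow = 2)
cp <- crossprod(m)            # same as t(m) %*% m
eig_values <- eigen(m)$values # eigenvalues, largest first
sing_values <- svd(m)$d       # singular values, largest first

# A quick linear model: y is roughly 2 * pred + 1 plus small noise
pred <- 1:10
y <- 2 * pred + 1 + rep(c(0.1, -0.1), 5)
fit <- lm(y ~ pred)
print(coef(fit))
```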
1.3.1.3 Packages
The strength of R lies with its community of contributors from various domains. The developers bind everything into one single piece called a package. A simple package can contain a few functions for implementing an algorithm, or it can be as big as the base package itself, which comes with the R installers. We will use many packages throughout the book as we cover new topics.
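Working with a package usually involves two steps: installing it once and loading it in each session. The sketch below uses the stats package, which ships with R, so it runs without a network connection; the commented install line uses data.table only as an example of a contributed package.

```r
# One-time installation of a contributed package from CRAN
# (commented out because it needs network access):
# install.packages("data.table")

# Load a package into the current session.
library(stats)

# Check that it is attached and look at what it provides.
is_attached <- "package:stats" %in% search()
exported <- ls("package:stats")
has_median <- "median" %in% exported

print(is_attached)
print(head(exported))
```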
1.3.2 Data Structures in R
Fundamentally, there are only five types of data structures in R that are used most often. Almost all other data structures are built on these five. Hadley Wickham, in his book Advanced R [10], provides an easy-to-comprehend segregation of these five data structures, as shown in Table 1-3.
Table 1-3. Data Structures in R

Factors: This one is derived from a vector

Data tables: This one is derived from a data frame
The homogeneous type allows for only a single data type to be stored in a vector, matrix, or array, whereas the heterogeneous type allows for mixed types as well.
1.3.2.1 Vectors
Vectors are the simplest form of data structure in R and yet are very useful. Each vector stores elements of the same type. A vector can be thought of as a one-dimensional array, similar to those found in programming languages like C/C++.
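A few vectors can be created with c(), as sketched below. The car names, colors, and engine sizes are made-up values for illustration.

```r
car_name  <- c("Honda", "BMW", "Ferrari")
car_color <- c("Black", "Blue", "Red")
car_cc    <- c(2000, 3400, 4000)

print(class(car_name))  # "character"
print(class(car_cc))    # "numeric"
print(length(car_cc))   # 3

# Mixing types in one vector coerces everything to a common type,
# here character, because a vector is homogeneous.
mixed <- c(1, "a", TRUE)
print(mixed)
```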
1.3.2.2 Lists
Lists internally in R are collections of generic vectors. For instance, a list of automobiles with name, color, and cc could be defined as a list named cars, with a collection of vectors named name, color, and cc inside it.
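The automobile list described above could be built as follows. The values are made up, and note that naming the list cars hides R's built-in cars dataset for the rest of the session.

```r
# A list holding three vectors of different types.
cars <- list(name  = c("Honda", "BMW", "Ferrari"),
             color = c("Black", "Blue", "Red"),
             cc    = c(2000, 3400, 4000))

str(cars)           # compact view of the structure
print(names(cars))  # "name" "color" "cc"
```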
1.3.2.3 Matrixes
Matrixes are data structures that store two-dimensional arrays with many rows and columns. For all practical purposes, this data structure helps store data in a format where every row represents an observation and the columns hold the information that defines the observation.
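A small matrix with named rows and columns can be created with matrix(). The values and the R.x/C.x names below are our own choices for illustration.

```r
# Fill a 3 x 3 matrix row by row and name its dimensions.
mdat <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,
               dimnames = list(c("R.1", "R.2", "R.3"),
                               c("C.1", "C.2", "C.3")))
print(mdat)
print(dim(mdat))  # 3 3
```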
1.3.2.4 Data Frames
Data frames extend matrixes with the added capability of holding heterogeneous types of data. In a data frame, you can store character, numeric, and factor variables in different columns of the same data frame. In almost every data analysis task, with rows and columns of data, a data frame comes as a natural choice for storing the data. The following example shows how numeric and factor columns are stored in the same data frame.
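A data frame mixing numeric and factor columns might look like the sketch below. The column names x, y, and fac and their values are made up.

```r
df <- data.frame(x   = 1:5,
                 y   = c(2.5, 3.1, 4.8, 5.0, 6.2),
                 fac = factor(c("A", "B", "A", "B", "B")))

str(df)                  # one line per column with its type
print(sapply(df, class)) # "integer" "numeric" "factor"
```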
1.3.3 Subsetting
R has one of the most advanced, powerful, and fast sets of subsetting operators among programming languages. It’s powerful to the extent that, except for a few cases, which we will discuss in the next section, no looping construct like for or while is required, even though R explicitly provides them if needed. Though it’s very powerful, the syntax can sometimes turn into a nightmare, and gross errors can pop up if careful attention is not paid to placing the required number of parentheses, brackets, and commas. The operators [, [[, and $ are used for subsetting, depending on which data structure is holding the data. It’s also possible to combine subsetting with assignment to perform some really complicated operations with very few lines of code.
1.3.3.1 Vectors
For vectors, subsetting is done by referring to the index of the elements stored in the vector. For example, car_name[c(1,2)] returns the elements stored at index 1 and 2, and car_name[-2] returns all the elements except the second. It’s also possible to use logical (TRUE/FALSE) values to instruct the vector to retrieve or not retrieve an element.
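The three styles of vector subsetting can be sketched as follows; the contents of car_name are made-up values.

```r
car_name <- c("Honda", "BMW", "Ferrari")

first_two      <- car_name[c(1, 2)]               # positive index: keep
all_but_second <- car_name[-2]                    # negative index: drop
by_logical     <- car_name[c(TRUE, FALSE, TRUE)]  # TRUE keeps, FALSE drops

print(first_two)
print(all_but_second)
print(by_logical)
```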
1.3.3.2 Lists
Subsetting a list is similar to subsetting a vector; however, since a list is a collection of many vectors, you must use double square brackets to retrieve an element from the list. For example, cars[2] retrieves the second vector of the list (wrapped in a list of length one), whereas cars[[c(2,1)]] retrieves the first element of the second vector.
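The difference between single and double brackets is easiest to see side by side. The list contents below are made-up values.

```r
cars <- list(name  = c("Honda", "BMW", "Ferrari"),
             color = c("Black", "Blue", "Red"),
             cc    = c(2000, 3400, 4000))

second_as_list  <- cars[2]          # still a list, of length one
second_vector   <- cars[[2]]        # the character vector itself
first_of_second <- cars[[c(2, 1)]]  # recursive: 2nd element, then its 1st
by_name         <- cars$cc          # $ extracts an element by name

print(class(second_as_list))  # "list"
print(second_vector)
print(first_of_second)        # "Black"
```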
1.3.3.3 Matrixes
Matrixes are subset much like vectors. However, instead of specifying one index to retrieve the data, we need two indexes here: one for the row and the other for the column. For example, mdat[1:2,] retrieves all the columns of the first two rows, whereas mdat[1:2,"C.1"] retrieves the first two rows of the C.1 column.
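Both forms of matrix subsetting are shown in the sketch below; the matrix values and dimension names are made up.

```r
mdat <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,
               dimnames = list(c("R.1", "R.2", "R.3"),
                               c("C.1", "C.2", "C.3")))

# Leaving the column index empty keeps every column.
first_two_rows <- mdat[1:2, ]

# Rows by position, column by name.
first_two_c1 <- mdat[1:2, "C.1"]

print(first_two_rows)
print(first_two_c1)
```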
1.3.3.4 Data Frames
Data frames work similarly to matrixes, but they have far more advanced subsetting operations. For example, it’s possible to provide a conditional statement like df[df$fac == "A", ], which retrieves only the rows where the column fac has the value A. The operator $ is used to refer to a column.
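Conditional subsetting of a data frame can be sketched as follows; the data frame and its columns x, y, and fac hold made-up values.

```r
df <- data.frame(x   = 1:5,
                 y   = c(2.5, 3.1, 4.8, 5.0, 6.2),
                 fac = factor(c("A", "B", "A", "B", "B")))

# Rows where fac is "A"; the empty column index keeps all columns.
rows_a <- df[df$fac == "A", ]

# Combine conditions and pick columns by name.
rows_b_large <- df[df$x > 2 & df$fac == "B", c("x", "y")]

print(rows_a)
print(rows_b_large)
```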
1.3.4 Functions and the Apply Family
As the standard definition goes, functions are the fundamental building blocks of any programming language and R is no different. Every single library in R has a rich set of functions used to achieve a particular task without writing the same piece of code repeatedly. Rather, all that is required is a function call. The following simple example is a function that returns the nth root of a number with two arguments, num and nroot, and contains a function body for calculating the nth root of a real positive number.
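A function of this kind could be written as below. The arguments num and nroot follow the description above, while the function name nthroot and the input check are our own choices.

```r
# Return the nth root of a real positive number.
nthroot <- function(num, nroot) {
  # Guard against inputs outside the intended use.
  stopifnot(is.numeric(num), num > 0, nroot != 0)
  num^(1 / nroot)
}

print(nthroot(27, 3))  # cube root of 27
print(nthroot(16, 4))  # fourth root of 16
```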
This example is a userdefined function, but there are so many such functions across the vast collection of packages contributed by R community worldwide. We will next discuss a very useful function family from the base package of R, which has found its application in numerous scenarios.

lapply returns a list of the same length as the input X, each element of which is the result of applying a function to the corresponding element of X.

sapply is a user-friendly version and wrapper of lapply, by default returning a vector, matrix or, if you use simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).

vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.

tapply applies a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors.
As you can see, every operation in the list involves logic that would otherwise need a loop (for or while) to traverse the data. By using the apply family of functions, we can keep the code we write to a minimum and instead call a single-line function with the appropriate arguments. Functions like these make R approachable even for less experienced programmers.
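The four functions described above can be compared on a small example. The data below (marks, score, and group) are assumed purely for illustration.

```r
# Hypothetical input: a list of numeric vectors of different lengths
marks <- list(math = c(70, 80, 90), art = c(60, 100))

# lapply always returns a list, one element per input element
means_list <- lapply(marks, mean)

# sapply simplifies the result to a named numeric vector when it can
means_vec <- sapply(marks, mean)

# vapply does the same but checks each result against a declared type
means_safe <- vapply(marks, mean, FUN.VALUE = numeric(1))

# tapply applies a function to each group defined by a factor
score <- c(10, 20, 30, 40)
group <- factor(c("a", "b", "a", "b"))
group_means <- tapply(score, group, mean)
```

Each call replaces an explicit loop over the elements or groups. vapply is the stricter choice when the type of the result matters, since it stops with an error if a result does not match FUN.VALUE.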
1.4 Machine Learning Process Flow
The process flow has four main phases, which we will from here on refer to as PEBE (Plan, Explore, Build, and Evaluate), as shown in Figure 1-7. Let’s get into the details of each of these.
1.4.1 Plan
This phase forms the key component of the entire process flow. A lot of energy and effort needs to be spent on understanding the requirements, identifying every data source available at our disposal, and framing an approach for solving the problems identified from the requirements. While gathering data is at the core of the entire process flow, considerable effort also has to be spent on cleaning the data to maintain the integrity and veracity of the final outputs of the analysis and model building. We will discuss many approaches for gathering various types of data and cleaning them up in Chapter 2.
1.4.2 Explore
Exploration sets the ground for analytic projects to take flight. Possibilities, insights, scope, hidden patterns, challenges, and errors in the data are first discovered through detailed analysis in this phase. A lot of statistical and visualization tools are employed to carry it out. To allow greater flexibility for modification in later parts of the project, this phase is divided into two parts. The first is a quick initial analysis that’s carried out to assess the data structure, including checking naming conventions, identifying duplicates, merging data, and further cleaning the data if required. Initial data analysis will help identify any additional data requirement, which is why you see a small feedback loop built into the process flow.
In the second part, a more rigorous analysis is done by creating hypotheses, sampling data using various techniques, checking the statistical properties of the sample, and performing statistical tests to reject or accept the hypotheses. Chapters 2, 3, and 4 discuss these topics in detail.
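As a rough sketch of the two parts of this phase, the snippet below runs a quick structural check and then a simple hypothesis test. It uses the mtcars dataset that ships with R; the choice of dataset, the sample size of 20, the seed, and the hypothesized mean of 20 miles per gallon are all assumptions made for illustration.

```r
# Built-in dataset, used here only for illustration
data(mtcars)

# Part one: quick initial analysis of structure and duplicates
str(mtcars)
n_duplicates <- sum(duplicated(mtcars))

# Part two: draw a random sample and check its statistical properties
set.seed(42)  # arbitrary seed, only for reproducibility
sampled <- mtcars[sample(nrow(mtcars), 20), ]
mpg_summary <- summary(sampled$mpg)

# One-sample t-test of the hypothesis that the mean mpg equals 20
test_result <- t.test(sampled$mpg, mu = 20)
p_value <- test_result$p.value
```

A small p_value would suggest rejecting the hypothesized mean; otherwise the sample gives no evidence against it.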
1.4.3 Build
Most analytic projects die out in the first or second phase; however, one that reaches this phase has great potential to be converted into a data product. This phase requires a careful study of whether a machine learning model is required or whether the simple descriptive analysis done in the first two phases is more than sufficient. In industry, unless you show an ROI on the effort, time, and money required to build an ML model, approval from management is hard to come by. And since many ML algorithms are something of a black box whose output is difficult to interpret, the business often rejects them outright at the very beginning.
So, if you pass all these criteria and still decide to build the ML model, then comes the time to understand the technicalities of each algorithm and how it works on a particular set of data, which we will take up in Chapter 6. Once the model is built, it’s always good to ask whether the model agrees with your findings from the initial data analysis. If not, it’s advisable to take the small feedback loop.
One reason Build Data Product appears in the process flow before the evaluation phase is to have a minimal viable output directed toward building a data product (not a full-fledged product; it could even be a small Excel sheet presenting all the analysis done until this point). We are not suggesting that you always build an ML model; the output could be a descriptive model that articulates the way you approached the problem and presents the analysis. This approach helps the evaluation phase decide whether the model is good enough to be considered for building a more futuristic predictive model (or a data product) using ML, whether there is still scope for refinement, or whether it should be dropped completely.
1.4.4 Evaluate
This phase determines either the rise of another revolutionary disruption in the traditional scheme of things or the disappointment of starting from scratch once again. The big feedback loop is sometimes unavoidable in real-world projects because of the complexity of the project or the inability of the data to answer certain questions. If you have diligently followed all the steps in the process flow, it’s likely that you will only need to spend some further effort tuning the model rather than taking the big leap of starting from scratch.
It’s highly unlikely that you can build a powerful ML model in just one iteration. We will explore in detail all the criteria for evaluating the model’s goodness in Chapter 7 and further fine-tune the model in Chapter 8.
1.5 Other Technologies
While we place a lot of emphasis on the key role played by programming languages and technologies like R in simplifying many ML process flow tasks that are otherwise complex and time consuming, it would not be wise to ignore the competing technologies in the same space. Python is another preferred programming language that has gained good traction in industry for building production-ready ML process flows. There is also an increasing demand for algorithms and technologies capable of scaling ML models or analytical tasks to much larger datasets and executing them at real-time speed. The latter needs a much more detailed discussion of big data and related technologies, which is beyond the scope of this book.
Chapter 9 will briefly discuss such scalable approaches and other technologies that can help you build the same ML process flows robustly and to industry standards. However, do remember that every approach or technology has its own pros and cons, so choosing wisely before the start of any analytic project is vital for successful completion.
1.6 Summary
In this chapter, you learned about the evolution of machine learning from statistics to contemporary data science. We also looked at fundamental subjects like probability and statistics, which form the foundations of ML. You had an introduction to the R programming language, with some basic demonstrations in R. We concluded the chapter with the machine learning process flow, the PEBE framework.
In the coming chapters, we will go into the details of data exploration for better understanding and take a deep dive into some real-world datasets.