Abstract
In this chapter, we introduce methods for modeling data. We refer to these methods as “basics” since they form the foundation for many of the more advanced ideas and concepts discussed in later chapters. The most basic concept is that of a model itself. You may ask: “What is a model? And why do we need models at all?” We answer these fundamental questions in Section 3.1. In Section 3.2, we discuss linear regression models, one of the most widespread and versatile types of models. The name “linear” implies that these models assume that the relationship between an input and an output follows a straight line. For instance, you may argue that the more you eat, the more weight you gain – and that you gain an additional pound of body weight for every pound of food that you eat. That is exactly what we mean by a linear model: every unit of “input” has the same (proportional) impact on the “output.” If you eat 2 pounds of food, you gain 2 pounds in body weight; if you eat 4 pounds, you gain 4 pounds, and so on – the relationship between input and output is always the same (1 to 1 in this case). You will quickly realize that while the concept of linear models is extremely powerful, it also has its limitations. Do you really believe that the entire world follows linear relationships? If human growth were linear and we grew by the same amount every year, why are we not 50 feet tall by the time we reach the age of 50? We therefore also discuss the limitations of linear regression throughout this chapter. Some of these limitations are addressed immediately, while others serve as motivation for the more advanced methods discussed in subsequent chapters.
Notes
1. These words are usually attributed to the famous statistician George Box.
2. The answer is that sales now increase at a slower rate: for each additional advertising dollar, sales increase by only two dollars, compared with five dollars in the previous model.
3. Rather than squaring residuals, we could also take their absolute values. Both approaches have the effect that negative discrepancies are considered as bad as positive ones. So why do we square rather than take absolute values? The answer probably depends on who you ask. One answer is grounded in history. Least squares regression goes back to the famous mathematician Carl Friedrich Gauss in the late eighteenth century. Back then, computers were not available, so in order to compute a regression line, one had to do the calculations by hand. We can determine the minimum of the sum of the squared residuals manually because it only involves minimizing a quadratic function, which can be done by setting its first derivative to zero. In contrast, minimizing a function that involves absolute values is much more involved and requires iterative (i.e., computer-driven) calculations.
4. Note that both sales and advertising are recorded in thousands of dollars, so the more accurate interpretation of the value of a is that, in the absence of any advertising, the company still records sales of $51,849 on average.
5. As in the previous footnote, we remind the reader that since both variables are recorded in thousands of dollars, the more appropriate interpretation would be “for every increase in advertising by $1,000, sales increase by $7,527,” which reflects the scaling of the recorded data.
6. In fact, if we divide SST by (n − 1), we obtain the sample variance of the response variable.
7. We are applying the concept of confidence intervals below in a slightly inaccurate way: a 95% confidence interval is computed as the mean plus and minus 1.96 times the standard error; in the confidence interval calculations below, we use a factor of 2 instead of 1.96. We believe that for quick-and-dirty manual calculations, this rounding does not make much of a difference. However, in order to obtain a precise answer, one should use computerized calculations rather than manual ones.
8. We are again using the factor 2 instead of the more accurate 1.96 in the calculations below.
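
The calculations behind Notes 3, 6, and 7 can be sketched in a few lines. The Python snippet below is a minimal illustration, not the chapter's own code. The data values and the names `advertising`, `sales`, and `fit_line` are made up for this example, and the interval is formed around the sample mean of `sales` using the standard error of the mean, which is an assumption about the quantity being estimated.

```python
import math
import statistics

# Hypothetical data (in thousands of dollars); not taken from the chapter.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [58.0, 69.0, 73.0, 84.0, 88.0]


def fit_line(x, y):
    """Closed-form least squares fit of y = a + b * x (see Note 3).

    Setting the first derivatives of sum((y - a - b*x)^2) to zero
    gives b = Sxy / Sxx and a = mean(y) - b * mean(x).
    """
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(advertising, sales)

# Note 6: SST divided by (n - 1) is the sample variance of the response.
n = len(sales)
y_bar = statistics.mean(sales)
sst = sum((yi - y_bar) ** 2 for yi in sales)
sample_variance = sst / (n - 1)

# Notes 7 and 8: interval built with the factor 2 versus the factor 1.96.
standard_error = math.sqrt(sample_variance / n)
rough_interval = (y_bar - 2 * standard_error, y_bar + 2 * standard_error)
precise_interval = (y_bar - 1.96 * standard_error, y_bar + 1.96 * standard_error)

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"SST / (n - 1) = {sample_variance:.3f}")
print(f"interval with factor 2:    {rough_interval}")
print(f"interval with factor 1.96: {precise_interval}")
```

Replacing 1.96 with 2 widens the interval by roughly 2% on each side, which is why the rounding rarely matters for quick manual calculations.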
Copyright information
© 2011 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Jank, W. (2011). Data Modeling I – Basics. In: Business Analytics for Managers. Use R. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-0406-4_3
DOI: https://doi.org/10.1007/978-1-4614-0406-4_3
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-0405-7
Online ISBN: 978-1-4614-0406-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)