Data Modeling I – Basics

Jank, Wolfgang

doi:10.1007/978-1-4614-0406-4_3


Chapter
First Online: 01 January 2011

Part of the book series: Use R (USE R)

Abstract

In this chapter, we introduce methods for modeling data. We refer to these methods as “basics” since they form the foundation for many of the more advanced ideas and concepts discussed in later chapters. The most basic concept is that of a model itself. You may ask: “What is a model? And why do we need models at all?” We will answer these fundamental questions in Section 3.1. In Section 3.2, we will discuss linear regression models as one of the most widespread and versatile types of models. The name “linear” implies that these models assume that the relationship between input and output follows a straight line. For instance, you may argue that the more you eat, the more weight you gain, and that you gain an additional pound of body weight for every pound of food that you eat. That is exactly what we mean by a linear model: every unit of “input” has the same (proportional) impact on the “output.” If I eat 2 pounds of food, I will gain 2 pounds in body weight; if I eat 4 pounds, I will gain 4 pounds of body weight, and so on. The relationship between input and output is always the same (1 to 1 in this case). You will quickly realize that while the concept of linear models is extremely powerful, it also has its limitations. For instance, do you really believe that the entire world follows linear relationships? If human growth were linear and increased by the same amount every year, why is it that by the time we reach the age of 50, we are not 50 feet tall? In that sense, we will also discuss limitations of linear regression throughout this chapter. Some of these limitations will be addressed immediately, while others will be our motivation for more advanced methods discussed in subsequent chapters.
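
The one-to-one example in the abstract can be sketched in a few lines of Python. This is only an illustration of what “linear” means here: the function name `linear_model` and the growth rate of one foot per year are hypothetical choices made for this sketch, not values taken from the chapter.

```python
def linear_model(x, slope, intercept=0.0):
    """Return the output of a straight-line model: intercept + slope * x."""
    return intercept + slope * x


# The abstract's example: each pound of food adds one pound of body weight.
for food_lb in (1, 2, 4):
    gained = linear_model(food_lb, slope=1.0)
    print(food_lb, "lb eaten ->", gained, "lb gained")

# The limitation: assuming (hypothetically) one foot of growth per year,
# a strictly linear model predicts a 50-foot person at age 50.
height_at_50 = linear_model(50, slope=1.0)
print("Predicted height at age 50:", height_at_50, "feet")
```

Every extra unit of input changes the output by exactly `slope`, no matter how large the input already is. That is what makes the model simple, and also what makes the age-50 prediction implausible.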


Notes

1. These words are usually attributed to the famous statistician George Box.
2. The answer is: Sales now increase at a slower rate. In fact, for each additional advertising dollar, sales now increase by only two dollars, compared with an increase of five dollars in the previous model.
3. Rather than squaring residuals, we could also take their absolute values. Both approaches have the effect that negative discrepancies are considered as bad as positive ones. So why do we square rather than take absolute values? The answer probably depends on whom you ask. One answer is grounded in history. Least squares regression goes back to the famous mathematician Carl Friedrich Gauss in the eighteenth century. Computers were not available then, so in order to compute a regression line, one had to do the calculations by hand. We can determine the minimum of the sum of the squared residuals manually because it only involves minimizing a quadratic function, which can be done by setting its first derivative to zero. In contrast, minimizing a function that involves absolute values is much more involved and requires iterative (i.e., computer-driven) calculations.
4. Note that both sales and advertising are recorded in thousands of dollars, so the more accurate interpretation of the value of a is that, in the absence of any advertising, the company still records sales of $51,849 on average.
5. Similar to the previous footnote, we remind the reader that since both variables are recorded in thousands of dollars, the more appropriate interpretation would be “for every increase in advertising by $1,000, sales increase by $7,527,” which reflects the scaling of the recorded data.
6. In fact, if we divide SST by (n − 1), we obtain the sample variance.
7. We are applying the concept of confidence intervals below in a slightly inaccurate way: a 95% confidence interval is computed by adding and subtracting 1.96 times the standard error from the mean; in the confidence interval calculations below, we are using a factor of 2 instead of 1.96. We believe that for quick-and-dirty manual calculations, this rounding does not make much of a difference. However, in order to obtain a precise answer, one should use computerized calculations rather than manual ones.
8. We are again using the factor 2 instead of the more accurate 1.96 in the calculations below.
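
Notes 3, 6, and 7 can be checked numerically. The following is a minimal sketch in Python (standard library only) on a small made-up dataset. The numbers, variable names, and the helper `least_squares` are hypothetical and are not the chapter's sales and advertising data. The sketch computes the least squares line from the closed-form solution obtained by setting the first derivatives of the sum of squared residuals to zero. It then confirms that SST divided by (n − 1) equals the sample variance. Finally, it compares the factors 2 and 1.96, using the standard error of a sample mean as an assumed stand-in for whichever standard error the chapter applies them to.

```python
import math
import statistics

# Hypothetical toy data (NOT the chapter's data): x = input, y = output.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)


def least_squares(x, y):
    """Closed-form intercept a and slope b that minimize the sum of squared residuals."""
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


a, b = least_squares(x, y)

# Note 3: squaring and taking absolute values both treat negative and
# positive discrepancies as equally bad.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sum_squared = sum(r ** 2 for r in residuals)
sum_absolute = sum(abs(r) for r in residuals)

# Note 6: SST divided by (n - 1) is the sample variance of y.
y_bar = statistics.mean(y)
sst = sum((yi - y_bar) ** 2 for yi in y)
sample_variance = sst / (n - 1)

# Note 7: interval half-width with the rounded factor 2 versus 1.96.
std_error = math.sqrt(sample_variance / n)
half_width_rough = 2 * std_error
half_width_precise = 1.96 * std_error

print("intercept a:", a, "slope b:", b)
print("sum of squared residuals:", sum_squared)
print("sum of absolute residuals:", sum_absolute)
print("SST / (n - 1):", sample_variance)
print("statistics.variance(y):", statistics.variance(y))
print("half-width with factor 2:", half_width_rough)
print("half-width with factor 1.96:", half_width_precise)
```

Because 2 / 1.96 is roughly 1.02, the rounded interval is about 2% wider than the one based on 1.96, which is consistent with note 7's remark that the rounding makes little difference for quick manual calculations.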

Author information

Authors and Affiliations

Department of Decision and Information Technologies, Robert H. Smith School of Business, University of Maryland, Van Munching Hall, College Park, MD, 20742-1815, USA
Wolfgang Jank

Corresponding author

Correspondence to Wolfgang Jank.

About this chapter

Cite this chapter

Jank, W. (2011). Data Modeling I – Basics. In: Business Analytics for Managers. Use R. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-0406-4_3

DOI: https://doi.org/10.1007/978-1-4614-0406-4_3
Published: 18 July 2011
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-0405-7
Online ISBN: 978-1-4614-0406-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
