Introduction
This entry will attempt to cover the basics of simple linear regression (SLR) and what underpins it. This includes the mathematical derivations, the main assumptions and the interpretation of the results. What this entry will not cover is the programmatic implementation, nor the workarounds that statistical packages like Statsmodels use for certain mathematical processes. SLR in a nutshell is a predictive model, and like all other predictive models, its main purpose is to model the relationship between an input and a response, so that we can predict the response for unseen observations.
What separates SLR from other models is that SLR attempts to achieve this with a straight line. It is one of the oldest and simplest parametric statistical models (or supervised machine learning algorithms), and it forms the basis for many more complex ones. By understanding the inner workings and shortcomings of SLR, you will have a good foundation for expanding your knowledge to other machine learning algorithms, as well as for applying the concepts learnt in SLR as best practices in future modelling exercises. This article is broken up into several distinct sections, and depending on your familiarity with these components, you can choose to skip accordingly.
Given that the main feature of SLR is to use a straight line to model phenomena, it would be sensible to revisit the mathematical definition of a line. Recall that a line in a 2-dimensional grid can be defined algebraically as: $$ { y = mx + b } $$ Here, x and y are the horizontal and vertical coordinates on some 2-dimensional grid, m is the slope of the line, and b is the intercept. The intercept is simply where the line crosses the y-axis. The slope, or the gradient of the line, can be calculated from any two points on the line as: $${ m = \frac{ y_2 - y_1 }{ x_2 - x_1 } }$$ With these two parameters, we can construct and characterize any 2-dimensional line. We can better visualize what this means by playing around with the graphing widget in Desmos. Have a play with the sliders on the left for different values of m and b, and see how they affect the resulting line.
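Although this entry deliberately stays away from implementation, a minimal Python sketch of the slope and intercept calculation may help make the algebra concrete; the two points used here are made up purely for illustration.

```python
# Two arbitrary points (x1, y1) and (x2, y2) on a line; the values are illustrative only
x1, y1 = 1.0, 3.0
x2, y2 = 4.0, 9.0

# Slope: rise over run, m = (y2 - y1) / (x2 - x1)
m = (y2 - y1) / (x2 - x1)

# Intercept: rearrange y = m*x + b at either point, b = y - m*x
b = y1 - m * x1

print(f"y = {m:.2f}x + {b:.2f}")  # y = 2.00x + 1.00
```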
Depending on the context, the interpretation of the slope and intercept will differ.
The Regression model
1. Residuals are normally distributed
This assumes that the residuals, defined as \(y_{actual} - y_{modelled}\) or \(\epsilon\), are normally distributed. We can validate this assumption visually by plotting QQ-plots or histograms after fitting the model, which we will explore in a later section, or more numerically with the Jarque-Bera (JB) test. If the results of these checks suggest otherwise, this could be evidence that the SLR model is not sufficiently explaining the data, leaving an unexplained "pattern" in the residuals rather than random noise.
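As a rough sketch of the numerical check mentioned above (not part of the article's original scope), the Jarque-Bera test can be run with scipy, assuming it is installed; the residuals here are simulated stand-ins rather than output from a real fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Stand-in residuals: in practice these would come from a fitted SLR model
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Jarque-Bera test: H0 is that the sample has the skewness and kurtosis of a normal distribution
jb_stat, p_value = stats.jarque_bera(residuals)

# A large p-value means we have no evidence against normality of the residuals
print(f"JB statistic = {jb_stat:.3f}, p-value = {p_value:.3f}")
```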
2. Variance of residuals is constant
The official term for this is homoscedasticity. It means that the variability of \(y_i\) remains constant across different values of \(x_i\). This can be validated visually by plotting a scatterplot of the model predictions against the residual values. If the variability changes across the \(X\) values, we say that the data is heteroscedastic. When this happens, the model results become less reliable: heteroscedasticity inflates the variance of the regression coefficient estimates, but the regression model does not account for this. This makes it much more likely for a regression model to declare that a term in the model is statistically significant when in fact it is not.
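Below is a minimal sketch of the visual check described above, assuming matplotlib is available; the fitted values and residuals are simulated stand-ins for a real SLR fit.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated stand-ins for the fitted values and residuals of a hypothetical SLR fit
fitted = np.linspace(0, 10, 100)
residuals = rng.normal(scale=1.0, size=100)  # constant spread, i.e. homoscedastic

# A heteroscedastic version would instead look like rng.normal(scale=0.2 * fitted)

plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Spread of residuals should look roughly constant")
plt.show()
```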
3. Residuals are independent of each other
It is assumed in the SLR model that the error terms are independent of each other and of the response. If we observe some dependent relationship between our residuals, it means that the SLR model is not capturing or modelling the dependence between observations well enough, and that structure is leaking into the residuals.
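The article does not name a specific test here, but one common numerical check for correlated residuals is the Durbin-Watson statistic; the sketch below assumes statsmodels is installed and again uses simulated residuals in place of real model output.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)

# Stand-in residuals; in practice take these from the fitted SLR model
residuals = rng.normal(size=200)

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation,
# values near 0 or 4 suggest positive or negative autocorrelation respectively
dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic = {dw:.3f}")
```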
4. Linearity between \(X\) and \(Y\)
SLR is, after all, a linear model. If there is any numerical or visual evidence from the data that suggests otherwise, we could either rescale the data (for example to a log scale) or use a different model altogether.
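As a small, hedged illustration of the log-rescaling idea, the sketch below fits a straight line to \(\log y\) instead of \(y\) on made-up exponential-looking data; the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data with an exponential (non-linear) relationship between x and y
x = np.linspace(1, 10, 50)
y = np.exp(0.5 * x) * rng.lognormal(sigma=0.1, size=50)

# Fitting a line to log(y) linearises the relationship: log(y) is roughly 0.5 * x plus noise
slope, intercept = np.polyfit(x, np.log(y), deg=1)
print(f"slope ≈ {slope:.2f}, intercept ≈ {intercept:.2f}")  # slope close to 0.5
```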
Least Squares Method
The idea behind the least squares method is to find values of \(\beta_0\) and \(\beta_1\) that minimize the errors between the actual \(y_i\) terms observed in the dataset and the values produced by the linear model \(\beta_0 + \beta_1 x_i\). The more mathematically savvy reader will have picked up that this translates into an optimization problem.
Optimization problems in mathematics are a class of problems where, given some function \(f(x)\), we want to find the point \(x\) at which \(f(x)\) takes its highest or lowest possible value. To do this, we need to find the point where: $${ \frac{dy}{dx} = f'(x) = 0 }$$ In other words, we need the point on the graph where the slope is flat. If we take the function \(f(x) = x^2\), its derivative, or slope, is \(f'(x) = 2x\), as shown on the right. Setting the slope to 0 gives: $${ 2x = 0 \implies x = 0 }$$ which implies that when \(x = 0\), the function \(x^2\) is at its lowest value. Have a look at the plots below and see if the 0 point on the \(f'(x)\) graph does indeed correspond to the lowest point on the \(f(x)\) graph on the left.
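To make the derivative step concrete, here is a tiny symbolic sketch (assuming sympy is available) that reproduces the \(f'(x) = 2x = 0\) calculation:

```python
import sympy as sp

x = sp.symbols("x")
f = x**2

# Differentiate f(x) = x^2 to get the slope function f'(x) = 2x
f_prime = sp.diff(f, x)

# Setting the slope to zero and solving recovers the minimiser x = 0
critical_points = sp.solve(sp.Eq(f_prime, 0), x)
print(f_prime, critical_points)  # 2*x [0]
```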
Back in the regression setting, the function that we are trying to minimize is actually embedded in the name of the method: least squares! We are trying to find the values \(\beta_0\) and \(\beta_1\) that give us the least amount of error, or squared error to be exact: $${ \Sigma (y_{predicted} - y_{actual})^2 }$$
By applying the concepts of optimizing the squared errors and some algebraic manipulation, we end up with 2 formulae that can be solved to obtain the values of the 2 unknowns. This is also known as a system of 2 equations with 2 unknowns: $${ \beta_1 = \frac{\Sigma x_i \Sigma y_i - n \Sigma x_i y_i }{(\Sigma x_i)^2 - n \Sigma {x_i}^2} }$$ $${ \beta_0 = \frac{1}{n}(\Sigma y_i - \beta_1 \Sigma x_i) }$$ While this seems somewhat cryptic at first, if we take a closer look, we see that each of the terms can be calculated directly from our dataset. Play around with the sliders below for the two SLR parameters and see what the resulting line and least squares results look like; a small worked sketch using the table's data follows below.
| \(x_i\) | \(y_i\) | \(f(x_i)\) | \(\epsilon^2\) |
|---|---|---|---|
| 0 | 0 | | |
| 1 | 2 | | |
| 2 | 1 | | |
| 3 | 4 | | |
| 4 | 2 | | |
| 5 | 4 | | |
| 6 | 2 | | |
| 7 | 7 | | |
| 8 | 5 | | |
| 9 | 6 | | |
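As a rough check of the closed-form formulae, the sketch below plugs the ten \((x_i, y_i)\) pairs from the table into them using numpy (which the article itself does not rely on); up to rounding, it recovers the coefficients quoted in the next section.

```python
import numpy as np

# The (x_i, y_i) pairs from the table above
x = np.arange(10)
y = np.array([0, 2, 1, 4, 2, 4, 2, 7, 5, 6])
n = len(x)

# Closed-form least squares estimates, using the same algebra as the formulae above
beta_1 = (x.sum() * y.sum() - n * (x * y).sum()) / (x.sum() ** 2 - n * (x ** 2).sum())
beta_0 = (y.sum() - beta_1 * x.sum()) / n

print(f"beta_0 ≈ {beta_0:.2f}, beta_1 ≈ {beta_1:.2f}")  # ≈ 0.55 and 0.61
```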
Interpretation of the SLR model
We can first discuss what an example SLR model looks like and the kind of statements we can make from it. From there, we can start to discuss the other statistics reported by a typical SLR model and the kind of statements we can make about those. From the example above, our SLR model looks like this: $${ y_i = 0.54 + 0.61 x_i + \epsilon }$$ A common misconception is to say that "if my observation \(x\) is 1, then \(y\) would be 1.15". While the math does add up, the statement is not totally accurate and does not reflect what the model result should actually be conveying. To be more accurate, given the assumptions we have made, the model actually tells us that \(y_i \sim{} N(\beta_0 + \beta_1 x_i, \sigma)\) or \(y_i \sim{} N(0.54 + 0.61 x_i, \sigma)\). This means that if we theoretically repeated the scenario where \(x = 1\) many times, the model assumes that the observed values of \(y\) would not all equal 1.15, but would instead be scattered around 1.15 with spread \(\sigma\).
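To see what this distributional statement means in practice, the sketch below simulates many hypothetical repeats of the scenario \(x = 1\) under the fitted model; the noise level \(\sigma = 1\) is an arbitrary illustrative choice, not a value estimated anywhere in this article.

```python
import numpy as np

rng = np.random.default_rng(3)

beta_0, beta_1 = 0.54, 0.61
sigma = 1.0  # illustrative noise level; in practice it is estimated from the residuals

# Repeat the scenario x = 1 many times under the assumed model y ~ N(beta_0 + beta_1 * x, sigma)
x = 1.0
y_repeats = rng.normal(loc=beta_0 + beta_1 * x, scale=sigma, size=10_000)

# The simulated observations scatter around 1.15 rather than all equalling it
print(f"mean of repeats ≈ {y_repeats.mean():.2f}, spread ≈ {y_repeats.std():.2f}")
```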
The p-value is simply the probability of observing the given results under the assumption that the null hypothesis is correct. Keeping that in mind, we would expect that a smaller value provides stronger evidence that the resulting coefficient from our SLR model is a meaningful addition, since changes in the term are related to changes in the response. A common benchmark is to deem a term meaningful when its p-value is lower than 5%. This can generally be translated into a statement such as:
Given the assumption that (insert null hypothesis here) and (insert distributional assumptions here), the probability of observing a value at least as extreme as the one estimated by the model is (p-value)%.
\(R^2\), as it is more commonly known, is generally a measure of how well the SLR model fits the data. More specifically, it is a statistical measure that tells us the proportion of the variance in the response that is explained by the term in the SLR, and it is defined by the following formula: $${ R^2 = 1 - \frac{SS_{residual}}{SS_{total}} }$$ Some things to note:
The QQ-plot plots the quantiles of 2 different distributions against each other. This is very useful in statistical analyses where we want visual evidence that a particular result follows a certain distribution. Refer to the image from Sherry Towers below for an overview.
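A minimal sketch of such a plot, assuming statsmodels and matplotlib are installed, with simulated residuals standing in for real model output:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)

# Stand-in residuals; in practice use the residuals from the fitted SLR model
residuals = rng.normal(size=200)

# QQ-plot of the sample quantiles against theoretical normal quantiles;
# points hugging the 45-degree line support the normality assumption
sm.qqplot(residuals, line="45")
plt.show()
```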
The JB test is defined as: $${ JB = \frac{n}{6}(S^2 + \frac{1}{4}(K-3)^2) }$$ where \(n\) is the number of samples, \(S\) is the sample skewness, and \(K\) is the sample kurtosis, defined in terms of the third and fourth central moments \(\mu_3\) and \(\mu_4\): $${ S = \frac{\mu_3}{\sigma^3} }$$ $${ K = \frac{\mu_4}{\sigma^4} }$$ This is a goodness-of-fit test that evaluates whether the data has the same skewness and kurtosis as a normal distribution. The test statistic is non-negative, and the larger its value, the stronger the evidence that the sample does not come from a normal distribution. Under the null hypothesis that the data is normally distributed (skewness of 0 and excess kurtosis of 0), the JB statistic asymptotically follows a \(\chi^2(2)\) distribution, which can then be tested against at different confidence levels. However, for smaller samples this approximation is very sensitive and often leads to rejecting the null hypothesis when it is in fact true, leading to a large type I error rate.
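The formula above can be written out directly; the sketch below computes the JB statistic by hand from the sample moments and compares it against scipy's built-in version (assuming scipy is installed, and using a simulated sample).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sample = rng.normal(size=500)
n = len(sample)

# Central moments of the sample
mu = sample.mean()
sigma = sample.std()                      # population (ddof=0) standard deviation
mu3 = ((sample - mu) ** 3).mean()
mu4 = ((sample - mu) ** 4).mean()

# Skewness S and kurtosis K as defined above
S = mu3 / sigma ** 3
K = mu4 / sigma ** 4

# Jarque-Bera statistic: large values are evidence against normality
jb_manual = n / 6 * (S ** 2 + 0.25 * (K - 3) ** 2)
jb_scipy, p_value = stats.jarque_bera(sample)

print(f"manual JB = {jb_manual:.4f}, scipy JB = {jb_scipy:.4f}, p-value = {p_value:.3f}")
```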
Limitations and uses
While arguably not many real-world relationships are linear in nature, SLR still proves useful for short-term approximations like: