Interactive learning: The Basics of Simple Linear Regression


Introduction

This entry will attempt to cover the basics of simple linear regression (SLR) and the concepts it depends on. This includes the mathematical derivations, main assumptions and interpretation of the results. What this entry will not cover is the programmatic implementation, nor the workarounds that statistical packages like Statsmodels use for certain mathematical processes. SLR in a nutshell is a predictive model, and like all other predictive models, its main purpose is to try and approximate the relationship between an observed input and a response, so that we can explain or predict that response.

What separates SLR from other models is that SLR attempts to achieve this with a straight line. It is one of the oldest and simplest parametric statistical models (or supervised machine learning algorithms), and forms the basis for many more complex ones. By understanding the inner workings and shortcomings of SLR, one would have a good foundation to expand their knowledge to other machine learning algorithms, as well as apply the concepts learnt in SLR to implement best practices in future modelling exercises. This article is broken up into several distinct sections, and depending on your familiarity with these components, you can choose to skip accordingly.

Given that the main feature of SLR is to use a straight line to model phenomena, it would be sensible to revisit the mathematical definition of a line. Recall that a line in a 2-dimensional grid can be defined algebraically as: $$ { y = mx + b } $$ Here, x and y are the horizontal and vertical coordinates on some 2-dimensional grid, m is the slope of the line, and b is the intercept. The intercept is basically where the line crosses the y-axis. The slope, or the gradient of the line, can be calculated from any two points on the line as: $${ m = \frac{ y_2 - y_1 }{ x_2 - x_1 } }$$ Now with these parameters, we can construct and characterize any 2-dimensional line. We can better visualize what this means by playing around with the graphing widget in Desmos. Have a play around with the sliders on the left for different values of m and b, and see how it affects the resulting line.
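As a tiny illustrative sketch (the two points below are made up, not taken from any dataset in this article), the slope and intercept can be computed directly in Python:

```python
# Two arbitrary points on a 2-dimensional grid (illustrative values)
x1, y1 = 1.0, 3.0
x2, y2 = 4.0, 9.0

m = (y2 - y1) / (x2 - x1)  # slope: rise over run
b = y1 - m * x1            # intercept: rearrange y = m*x + b at the point (x1, y1)

print(f"y = {m}x + {b}")   # y = 2.0x + 1.0
```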

Depending on the context, the interpretation of the slope and intercept will differ.

The Regression model

The simple linear regression model is defined as follows: $${ y_i = \beta_0 + \beta_1 x_i + \epsilon }$$ We can see that this closely resembles the equation of a straight line, with an extra \(\epsilon\) term. The \(\epsilon\) component is also known as the error term, and represents the inevitable deviations between the actual data and our linear model (it is rarely the case that a straight line perfectly represents what is happening in the real world). Despite the deviations, we can make some useful statements about these errors as a result of using the SLR model, which we will explore. The rest of the terms make sense, since the goal of an SLR model is to model some data with a line in the first place. The \(x_i\) and \(y_i\) terms here represent our observed data points and response values respectively. In simple terms, we can think of \(y_i\) as the results that we want to explain, and \(x_i\) as the features that we believe influence \(y_i\). The \(\beta_0\) and \(\beta_1\) terms represent the intercept and slope of the line, and their interpretation will change with the context of the data. These 2 terms are also called the parameters of the model, and like before, any SLR model can be defined with different combinations of \(\beta_0\) and \(\beta_1\). While we note that these 2 terms can take any real numerical value and we'd still end up with a valid linear model, what we really want to do is to find the values for these 2 parameters such that the resulting line approximates our data the "best". This is done by using the "Least Squares Method".
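To make the role of the error term concrete, here is a small sketch that simulates data from an SLR model; the "true" parameter values and the normally distributed errors are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed "true" parameters, chosen only for this illustration
beta_0, beta_1, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 10, 50)                   # observed feature values
epsilon = rng.normal(0, sigma, size=x.size)  # the error term: deviations from the line
y = beta_0 + beta_1 * x + epsilon            # response = straight line + noise

print(y[:5])
```

Before looking at how these parameters are estimated from real data, the SLR model also rests on a few key assumptions, listed below.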

1. Residuals are normally distributed

This assumes that the residuals, defined as \(y_{actual} - y_{modelled}\) (i.e. \(\epsilon\)), are normally distributed. We can validate this assumption visually by plotting QQ-plots or histograms after fitting the model, which we will explore in a later section. For a more numerical method, we can use the Jarque-Bera (JB) test to check this assumption. If the test results show otherwise, this could be evidence that the SLR model is not sufficiently explaining the data, leaving an unexplained "pattern" in the residuals instead of random noise.

2. Variance of residuals is constant

The official term for this is homoscedasticity. This means that the variability of \(y_i\) remains constant across different values of \(x_i\). This can be validated visually by plotting a scatterplot of the model predictions against the residual values. If the variability changes across the \(X\) values, we say that the data is heteroscedastic. When this happens, the model results become less reliable. Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model does not pick up on this. This makes it much more likely for a regression model to declare that a term in the model is statistically significant when in fact it is not.

3. Residuals are independent of each other

It is assumed in the SLR model that the error terms are independent of each other and of the predictor. If we observe some dependence between our residuals (for example, residuals that trend together across adjacent observations), it means that there is structure in the data that the SLR model is not capturing well enough.

4. Linearity between \(X\) and \(Y\)

SLR is, after all, a linear model. If there is any numerical or visual evidence from the data that suggests otherwise, we could either rescale the data (for example to a log scale) or use a different model altogether.
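The checks for assumptions 1 to 3 can be sketched out in a few lines of Python. The snippet below is an illustrative sketch on made-up data using statsmodels, scipy and matplotlib, not a prescribed workflow:

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Placeholder data for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)

# Fit the SLR model and pull out the residuals
model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Assumption 1: normality of residuals (QQ-plot and Jarque-Bera test)
sm.qqplot(residuals, line="45")
jb_stat, jb_pvalue = stats.jarque_bera(residuals)
print(f"JB statistic = {jb_stat:.3f}, p-value = {jb_pvalue:.3f}")

# Assumption 2: constant variance (residuals vs fitted values should look like a flat band)
plt.figure()
plt.scatter(model.fittedvalues, residuals)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```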

Least Squares Method

The idea behind the least squares method is to find the values of \(\beta_0\) and \(\beta_1\) that minimize the errors between the actual \(y_i\) terms observed in the dataset and the values produced by the linear model, \(\beta_0 + \beta_1 x_i\). A more mathematically savvy reader will have picked up that this translates into an optimization problem.

Optimization problems in mathematics are a class of problems where, given some mathematical function \(f(x)\), we want to find the point \(x\) such that \(f(x)\) returns the highest or lowest possible value. In order to do this, we need to find the point where: $${ \frac{dy}{dx} = f'(x) = 0 }$$ Or in other words, we need to find the point on the graph where the slope of the function is flat. If we refer to the function below where \(f(x) = x^2\), we note that its derivative, or slope, is defined as \(f'(x) = 2x\), as shown on the right. If we follow along with this logic and set the slope to 0, we get: $${ 2x = 0, \quad x = 0}$$ Which implies that when \(x = 0\), the function \(x^2\) will be at its lowest value. Have a look at the plots below and see if the 0 point on the \(f'(x)\) graph does indeed correspond to the lowest point on the \(f(x)\) graph on the left.
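As a quick symbolic sanity check of this logic (purely illustrative), sympy reproduces the same result:

```python
import sympy as sp

x = sp.symbols("x")
f = x**2

f_prime = sp.diff(f, x)                            # derivative of x^2 is 2*x
critical_points = sp.solve(sp.Eq(f_prime, 0), x)   # solve 2*x = 0

print(f_prime, critical_points)                    # 2*x [0]
```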

Here, the function that we are trying to minimize is actually embedded in the name of the method, least squares! We're trying to find the values of \(\beta_0\) and \(\beta_1\) that return the smallest total error, or squared error to be exact: $${ (y_{predicted} - y_{actual})^2 }$$

$${ Error = f(y) = \Sigma_i (y_{i} - \hat{y}_{i})^2 }$$ $${ Error = f(\beta_0, \beta_1) = \Sigma_i (y_{i} - (\beta_0 + \beta_1 x_i))^2 }$$ $${ Error = f(\beta_0, \beta_1) = \Sigma_i (y_{i} - \beta_0 - \beta_1 x_i)^2 }$$ Here, the output of our function (whatever shape it has) changes with different values of \(\beta_0\) and \(\beta_1\), and it is our job to find the values of \(\beta_0\) and \(\beta_1\) that give the smallest output, or the smallest error. This can be done with the concept of the zero-slope point discussed above. We first take the partial derivatives with respect to both \(\beta_0\) and \(\beta_1\). $${ \frac{\partial f(\beta_0, \beta_1)}{\partial \beta_0} = -2\Sigma_i (y_{i} - \beta_0 - \beta_1 x_i)}$$ $${ \frac{\partial f(\beta_0, \beta_1)}{\partial \beta_1} = -2\Sigma_i x_i (y_{i} - \beta_0 - \beta_1 x_i)}$$ We know that the minimum value of the error will occur when both: $${ \frac{\partial f(\beta_0, \beta_1)}{\partial \beta_0} = 0 }$$ $${ \frac{\partial f(\beta_0, \beta_1)}{\partial \beta_1} = 0 }$$ Which means that we now have 2 equations: $${ \Sigma_i (y_{i} - \beta_0 - \beta_1 x_i) = 0}$$ $${ \Sigma_i x_i (y_{i} - \beta_0 - \beta_1 x_i) = 0}$$ These can be solved algebraically as a system of equations, to give us the formulae: $${ \beta_1 = \frac{\Sigma_i x_i \Sigma_i y_i - n \Sigma_i x_i y_i }{(\Sigma_i x_i)^2 - n\Sigma_i {x_i}^2} }$$ $${ \beta_0 = \frac{1}{n}(\Sigma_i y_i - \beta_1 \Sigma_i x_i) }$$

The same least squares solution can also be derived more compactly in matrix form, which generalizes to models with multiple features. Given the hypothesis: $${ f_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + ... + \theta_{n}x_{n} }$$ (where \(x_0 = 1\), so that \(\theta_0\) plays the role of the intercept), the function we are trying to minimize is: $${ K(\theta) = \frac{1}{2m}\Sigma (f_{\theta}(x_i) - y_{i})^2 }$$ This is basically the sum of squared errors. The extra \(\frac{1}{2m}\) term in front is there to make the algebra more tractable; minimizing a function multiplied by a positive constant gives the same extrema, so we can drop it in what follows. For a single observation the hypothesis is \(f_{\theta}(x) = \theta^Tx\); stacking the observations into a matrix \(X\) (one row per observation), the vector of predictions becomes \(X\theta\) and the quantity to minimize can be rewritten as: $${ K(\theta) = (X\theta - y)^T(X\theta - y) }$$ $${ K(\theta) = ((X\theta)^T - y^T)(X\theta - y) }$$ $${ K(\theta) = (X\theta)^T X\theta - (X\theta)^Ty - y^T(X\theta) + y^Ty }$$ $${ K(\theta) = \theta^T X^T X\theta - 2(X\theta)^Ty + y^Ty }$$ Which is the expression we are trying to minimize. We take the first order condition: $${ \frac{\partial K}{\partial \theta} = 2X^TX\theta - 2X^Ty = 0 }$$ $${ X^TX\theta = X^Ty }$$ Assuming \(X^TX\) is invertible (for sklearn, a pseudo-inverse is calculated), we get: $${ \theta = (X^TX)^{-1}X^Ty }$$
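A minimal sketch of this closed-form solution with numpy, on made-up data; using a pseudo-inverse here instead of an explicit inverse is an implementation choice rather than part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: one feature plus a column of ones for the intercept
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)
X = np.column_stack([np.ones_like(x), x])   # x_0 = 1 handles theta_0 (the intercept)

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y

print(theta)   # approximately [2.0, 0.5]
```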

By applying the concept of optimizing the squared errors and some algebraic manipulation, we end up with 2 formulae which can be solved to obtain the values for the 2 parameters. This is also known as a system of 2 equations with 2 unknowns: $${ \beta_1 = \frac{\Sigma x_i \Sigma y_i - n \Sigma x_i y_i }{(\Sigma x_i)^2 - n\Sigma {x_i}^2} }$$ $${ \beta_0 = \frac{1}{n}(\Sigma y_i - \beta_1 \Sigma x_i) }$$ While this seems somewhat cryptic at first, if we take a closer look at the results, we see that each of the terms can actually be calculated from our dataset. Play around with the sliders below for the two SLR parameters and see what the resulting line and least squares results look like.

Sample data used in the interactive example (in the original widget, the \(f(x_i)\) and \(\epsilon^2\) columns are filled in from the chosen parameter values):

\(x_{i}\)  \(y_{i}\)
0  0
1  2
2  1
3  4
4  2
5  4
6  2
7  7
8  5
9  6
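As an illustrative check (not part of the original widget), the closed-form formulae can be applied directly to the data in the table above:

```python
import numpy as np

# Data from the table above
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0, 2, 1, 4, 2, 4, 2, 7, 5, 6], dtype=float)
n = len(x)

# Closed-form least squares estimates
beta_1 = (x.sum() * y.sum() - n * (x * y).sum()) / (x.sum() ** 2 - n * (x ** 2).sum())
beta_0 = (y.sum() - beta_1 * x.sum()) / n

print(beta_0, beta_1)   # approximately 0.545 and 0.612, i.e. the fitted line quoted in the next section
```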

Interpretation of the SLR model

We can first discuss what an example SLR model looks like and the kind of statements we can make from it. From there, we can start to discuss the other statistics from a typical SLR model and the kind of statements we can make about those. From the example above, our SLR model looks like this: $${ y_i = 0.54 + 0.61 x_i + \epsilon }$$ A common misconception is saying that "if my observation \(x\) is 1, then \(y\) would be 1.15". While the math does add up, the statement is not totally accurate and does not reflect what the model result should actually be conveying. To be more accurate, given the assumptions we have made, the model actually tells us that \(y_i \sim{} N(\beta_0 + \beta_1 x_i, \sigma)\) or \(y_i \sim{} N(0.54 + 0.61 x_i, \sigma)\), where \(\sigma\) is the standard deviation of the error term. This means that if we theoretically repeat the scenario where \(x = 1\) many times, the model assumes that:

  1. \(y\) will probably take a different value in each scenario, but is expected to average out to 1.15
  2. The frequency of the different values \(y\) can take across the scenarios can be approximated by a bell curve centered at 1.15
  3. This 1.15 figure will change depending on the observed value of \(x\), with a minimum of 0.54 when \(x\) is 0
  4. Roughly 95% of those \(y\) values would fall within \(1.15 \pm{} 1.96\sigma \)
Now that we've understood the main resulting formula and what the \(\beta_0\) and \(\beta_1\) coefficients mean, the SLR output provides us with more pieces of information which we can use to evaluate how "good" a job our model is actually doing of approximating the data. Note that some of these are more useful and more commonly seen in programming libraries than others. Along with the explanations below, there is an included spreadsheet here that uses the sample data from the example above to reverse engineer all the regression outputs.
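As a hedged illustration of where these outputs usually come from (separate from the spreadsheet mentioned above), a library such as statsmodels reports all of them in one go:

```python
import numpy as np
import statsmodels.api as sm

# Sample data from the table above
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0, 2, 1, 4, 2, 4, 2, 7, 5, 6], dtype=float)

# Fit the SLR model; add_constant appends the intercept column
model = sm.OLS(y, sm.add_constant(x)).fit()

# The summary includes the coefficients, their p-values, R^2, adjusted R^2 and the Jarque-Bera test
print(model.summary())
```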

The p-value is simply the probability of observing the given results under the assumption that the null hypothesis is correct. Keeping that in mind, a smaller value provides stronger evidence that the corresponding coefficient in our SLR model is a meaningful addition, i.e. that changes in the term are related to changes in the response. A common benchmark is to deem a term meaningful when its p-value is lower than 5%. This can generally be translated into a statement such as:

Given that (insert null hypothesis here) holds and (insert distributional assumptions here), the probability of observing a result at least as extreme as the estimated coefficient is (p-value)%.

The coefficient of determination, or \(R^2\) as it is more commonly known, is generally a measure of how well the SLR model fits the data. More specifically, it is a statistical measure that tells us the proportion of the variance of the response that is explained by the term in the SLR, and is defined by the following formula: $${ R^2 = 1 - \frac{SS_{residual}}{SS_{total}} }$$ Some things to note (a small numerical check follows this list):

  • Taking the square root of this value gives the magnitude of the linear correlation coefficient, which is bounded by -1 and 1 (its sign matches the sign of the slope)
  • Excel's \({R^2}_{adj}\) just adjusts the value for the number of terms in the model. This is a guard to discourage addition of useless features into the model
  • This is calculated by: \({R^2}_{adj} = 1 - [\frac{(1-R^2)(n - 1)}{(n - k - 1)}]\) and penalizes the user for adding in unnecessary features where \(R^2\) will not
  • While \(R^2\) does give us information on how well the model is fitting the data linearly, it may not always be the most appropriate in justifying the model selected
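A small numerical check of the \(R^2\) and \({R^2}_{adj}\) formulae, reusing the sample data from the table above (purely illustrative):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0, 2, 1, 4, 2, 4, 2, 7, 5, 6], dtype=float)
n, k = len(x), 1   # k = number of features in the model

# Fit via the closed-form least squares estimates
beta_1 = (x.sum() * y.sum() - n * (x * y).sum()) / (x.sum() ** 2 - n * (x ** 2).sum())
beta_0 = (y.sum() - beta_1 * x.sum()) / n
y_hat = beta_0 + beta_1 * x

# R^2 = 1 - SS_residual / SS_total, then adjust for the number of terms
ss_residual = ((y - y_hat) ** 2).sum()
ss_total = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_residual / ss_total
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(r2, r2_adj)
```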

The QQ-Plot basically plots the quantiles of 2 different distributions against each other. This is very useful in statistical analyses where we want visual evidence that a particular result follows a certain distribution. Refer to the image from Sherry Towers below for an overview.

Some important things to note:
  • If the data follows the distribution perfectly, the QQ-plot will fall on a straight line, with the green line signifying that the theoretical quantiles exactly equal what is seen in the data
  • For a normal distribution, we're basically plotting the sorted data against the Z-scores implied by the rank of each data point
  • These Z-scores are calculated by solving for \(z\) in \(P(Z \leq z) = p\), given that \(Z\sim N(0, 1)\) and \(p\) is the percentile of the particular data point
  • Now say that \(p\) is 5%; we know that the point should sit at a theoretical value of roughly -1.645. If our points fall below that, as seen in the 2nd row of pictures, what we're saying is that 5% of a standard normal distribution falls below -1.645, but in our (standardized) dataset, 5% of the data falls below e.g. -20 instead. This tells us that more than 5% of the points are less than -1.645, meaning our data is either left skewed or heavier-tailed (more peaked) than a normal distribution
While we can generally compare against any theoretical distribution on the x-axis, the normal distribution is a popular choice because many statistical tests and models assume that errors are normally distributed, and having evidence that our modelled errors are indeed normally distributed supports the claim that the selected model works well enough in explaining the trends and biases in our dataset. For an example of how to create QQ-plots from scratch in Excel, refer to this article.
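A rough sketch of the same from-scratch idea in Python rather than Excel, using scipy for the theoretical quantiles (the sample data below is made up):

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = rng.normal(0, 1, size=200)   # placeholder sample to check for normality

# Sort the data and compute each point's plotting-position percentile
sample_quantiles = np.sort(data)
n = len(data)
percentiles = (np.arange(1, n + 1) - 0.5) / n

# Theoretical quantiles: solve P(Z <= z) = p for a standard normal
theoretical_quantiles = stats.norm.ppf(percentiles)

plt.scatter(theoretical_quantiles, sample_quantiles)
plt.plot(theoretical_quantiles, theoretical_quantiles)   # 45-degree reference line
plt.xlabel("Theoretical quantiles (standard normal)")
plt.ylabel("Sample quantiles")
plt.show()
```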

The JB test is defined as: $${ JB = \frac{n}{6}(S^2 + \frac{1}{4}(K-3)^2) }$$ where \(n\) is the number of samples, \(S\) is the sample skewness, and \(K\) is the sample kurtosis: $${ S = \frac{\mu_3}{\sigma^3} }$$ $${ K = \frac{\mu_4}{\sigma^4} }$$ with \(\mu_3\) and \(\mu_4\) the third and fourth central moments. This is a goodness-of-fit test that evaluates whether the data has the same skewness and kurtosis as a normal distribution. The test statistic is non-negative, and the larger the value, the stronger the evidence that the sample data does not come from a normal distribution. If the underlying data is normally distributed, the JB statistic asymptotically follows a \(\chi^2\) distribution with 2 degrees of freedom, which can then be tested against at differing confidence levels. However, it is noted that for smaller samples this approximation is very sensitive and often leads to rejecting the null hypothesis (skewness is 0 and excess kurtosis is 0) when it is in fact true, leading to a large type 1 error rate.
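A minimal sketch computing the JB statistic from its definition and comparing it with scipy's built-in version (the residuals below are simulated placeholders):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)
residuals = rng.normal(0, 1, size=500)   # placeholder residuals

n = len(residuals)
centred = residuals - residuals.mean()
sigma = centred.std()                    # population standard deviation (ddof=0)

S = (centred ** 3).mean() / sigma ** 3   # sample skewness
K = (centred ** 4).mean() / sigma ** 4   # sample kurtosis

jb_manual = n / 6 * (S ** 2 + 0.25 * (K - 3) ** 2)
jb_scipy, p_value = stats.jarque_bera(residuals)

print(jb_manual, jb_scipy, p_value)      # the two statistics should agree
```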

Limitations and uses

While arguably not many real-world relationships are linear in nature, SLR still proves useful for short-term, local approximations of more complex relationships.

Furthermore, the concepts in SLR are applied in various more complex models. For example, some explanation models use a local linear model in an attempt to explain more complex machine learning results. A feed-forward neural network with 1 bias node and 1 weight node is essentially an SLR model, given squared errors and the identity activation function. The SLR model is also a special case of a larger class of linear models called "Generalized Linear Models", or GLMs.