Image by Chris Liverani on Unsplash

To not miss out on any new articles, consider subscribing

Regression is a statistical method that attempts to determine the strength and behaviour of the relationship between one dependent variable (usually denoted by Y) and a set of one or more other variables (known as independent variables). Ordinary least squares (OLS) regression is a statistical method of analysis that estimates the relationship between the variables by minimizing the sum of squared differences between the observed and predicted values of the dependent variable.

If your data shows a linear relationship between the X and Y variables, it is useful to find the line that best fits that relationship. The least squares regression line is the line that makes the vertical distances from the data points to the line as small as possible. It’s called “least squares” because the best-fitting line is the one that minimizes the sum of the squared errors (the residual sum of squares). Another name for the line is the “linear regression equation” (because the result is a linear equation). The line has the equation ŷ = a + bx, where a is the intercept, b is the slope, x is the independent variable and ŷ is the predicted value of the dependent variable. R² measures how well this line fits the data. Once the intercept and slope have been estimated using least squares, various indices are studied to determine the reliability of these estimates. One of the most popular of these reliability indices is the correlation coefficient.
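As a minimal sketch of how the intercept and slope can be estimated, the closed-form least squares formulas can be written out with NumPy. The data points here are made up for illustration and are not from the article's dataset.

```python
import numpy as np

# Hypothetical toy data, chosen only to illustrate the formulas.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form least squares estimates for y_hat = a + b * x.
x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
a = y_mean - b * x_mean                                              # intercept

# Predicted values and the quantity that least squares minimizes.
y_hat = a + b * x
sse = np.sum((y - y_hat) ** 2)

print(f"intercept a = {a:.3f}, slope b = {b:.3f}, SSE = {sse:.4f}")
```

Any other choice of a and b gives a larger sum of squared errors on the same data, which is what "best fit" means here.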

Correlation quantifies the direction and strength of the linear relationship between two numeric variables, X and Y. The correlation coefficient, or simply the correlation, is an index that always lies between -1 and 1. When the value is near zero, there is little or no linear relationship. As the correlation gets closer to plus or minus one, the relationship is stronger. A value of +1 indicates a perfect positive linear relationship and -1 indicates a perfect negative linear relationship between two variables.¹
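To make the range of the correlation coefficient concrete, here is a small sketch using `np.corrcoef` on made-up series. The helper name `pearson_r` is just for this example.

```python
import numpy as np

def pearson_r(u, v):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_up = 2.0 * x + 1.0       # perfect positive linear relationship
y_down = -3.0 * x + 10.0   # perfect negative linear relationship
y_curve = (x - 3.0) ** 2   # clear relationship, but not a linear one

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)

print(r_up, r_down, r_curve)
```

The last case is a reminder that a correlation near zero rules out a linear relationship, not every kind of relationship.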

The correlation squared (R²) has a special meaning in simple linear regression. It represents the proportion of the variation in Y that is accounted for by the variation in X. It is defined as the sum of squares due to the regression divided by the total sum of squares of Y adjusted for the mean. R² does not measure the magnitude of the slope, and it does not measure the appropriateness of a linear model; it measures the strength of the linear component of the model. When there is an intercept in the regression, |corr| = sqrt(R²) and sign(corr) = sign(regression slope of Y on X). So if the correlation is positive, the regression slope of Y on X is positive too.
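The link between R², the correlation and the slope can be checked numerically. This is a sketch on invented data with a downward trend, using `np.polyfit` as a stand-in for the regression fit.

```python
import numpy as np

# Hypothetical data with a negative trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.8, 8.1, 6.2, 3.9, 2.0])

# Fit y_hat = a + b * x (polyfit returns the highest-degree coefficient first).
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# R^2 = regression sum of squares / mean-adjusted total sum of squares.
ss_total = np.sum((y - y.mean()) ** 2)
ss_regression = np.sum((y_hat - y.mean()) ** 2)
r_squared = ss_regression / ss_total

r = np.corrcoef(x, y)[0, 1]

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
print(f"sign(r) = {np.sign(r)}, sign(slope) = {np.sign(b)}")
```

Here the slope and the correlation are both negative, while R² is positive and equal to the squared correlation.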

Ordinary least squares implementation in Python

OLS can be carried out with various Python packages, such as statsmodels, NumPy, pandas and SciPy. For this article, we will be exploring the statsmodels package.

The data used is the Life Expectancy data from Kaggle. It has 22 columns, including Year, Country, Life Expectancy and features that might affect life expectancy. For this project, we studied the influence of Alcohol on Life Expectancy for Nigeria from 2005 to 2013. Before fitting with statsmodels, we work with the output of pandas' pct_change() or diff() rather than the original numbers, because the raw levels of two series can rise or fall together over time without the series actually being correlated. The pct_change() function computes the fractional change from the immediately previous row by default (multiply by 100 to get a percentage). This is useful for comparing the rate of change across the elements of a time series.
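A short sketch of what `pct_change()` and `diff()` do to a yearly table. The numbers and column names below are invented for illustration and are not the Kaggle values.

```python
import pandas as pd

# Hypothetical yearly values; the real dataset's columns and numbers differ.
df = pd.DataFrame(
    {
        "Year": [2005, 2006, 2007, 2008],
        "Alcohol": [8.0, 10.0, 9.0, 9.9],
        "Life expectancy": [50.0, 51.0, 51.5, 52.0],
    }
).set_index("Year")

# Fractional change from the previous row; the first row has no predecessor,
# so it becomes NaN and is dropped.
changes = df.pct_change().dropna()

# Absolute change from the previous row, as an alternative.
differences = df.diff().dropna()

print(changes)
print(differences)
```

Both transforms shorten the series by one row, so nine years of data give eight observations to regress on.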

Use the add_constant function of sm (sm.add_constant) to add a column of 1s so that the model can estimate an intercept. The magnitude of the correlation coefficient is unchanged if either X or Y is multiplied by a nonzero constant or if a constant is added; multiplying by a negative constant only flips its sign.


The first element in the parameters array, alpha, signifies the intercept and the second element is the slope.

To see more details about the model such as the R², adjusted R², F-statistic, log-likelihood and other relevant statistics, you can print a summary of the model.


For the full notebook for this article, check out the GitHub gist:

I hope this was helpful and you are able to apply OLS on your time series data to see how they correlate over time. Feel free to reach out on LinkedIn, Twitter or send an email: contactaniekan at gmail dot com if you want to chat about this or anything.

Stay safe. 😷


[1] Chapter 300, NCSS Statistical Software
