What is linear regression?

Linear regression is a predictive analysis method that lets researchers explore data and quantify relationships between variables. In practice, it is carried out with linear regression models, which can be built and explored in a variety of programming languages such as Python, JavaScript, and R. Many different relationships can be studied, such as whether placing boxes of cereal on the bottom shelves or at eye level increases the number of cereal boxes purchased. Statisticians use regression models to describe and visually showcase data. Feel free to explore About Linear Regression to learn more.


How is linear regression used?

In the simplest terms, a simple linear regression model is a mathematical representation of the relationship between one dependent variable and one independent variable (e.g., analyzing the effect that being a first-generation student can have on academic performance). One of the most common uses of these models is prediction. However, several conditions need to be taken into account before those predictions can be trusted.
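
To make that concrete, here is a minimal sketch of fitting a simple linear regression in R, using simulated data and hypothetical variable names (x and y) rather than a real student data set:

set.seed(42)
x <- runif(100, 0, 10)                   # hypothetical independent variable
y <- 3 + 2 * x + rnorm(100, sd = 1.5)    # hypothetical dependent variable: y = 3 + 2x + noise

fit <- lm(y ~ x)    # fit the line y = beta0 + beta1 * x
summary(fit)        # estimated intercept, slope, and fit statistics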

Assumptions needed for valid linear regression analysis

  1. Linearity: When the data are plotted, the points need to follow a roughly linear trend.
  2. Independence: Observations must be independent of one another; the value of one observation can't influence the value of another.
  3. Normality: The residuals (the vertical distances from the points to the regression line) should be approximately normally distributed.
  4. Equal Variance: The spread of points about the line should be roughly constant across the range of the independent variable (a quick way to eyeball these checks is sketched right after this list).
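
As a quick, optional way to eyeball all four checks at once, R's built-in diagnostic plots can be run on a fitted model; here data, variable1, and variable2 are placeholders matching the example code later in this post:

slr <- lm(variable2 ~ variable1, data = data)   # fit the model on your data set
par(mfrow = c(2, 2))                            # arrange the four diagnostic plots in a grid
plot(slr)                                       # residuals vs fitted, Q-Q plot, scale-location, leverage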

Transformations

If the data don't meet the linearity assumption, data scientists can transform the data so that the relationship becomes linear by applying a transformation to the response and/or explanatory variables in the regression formula. Common options include log, square root, and power transformations. It's up to the analyst to decide which transformation best allows the linearity assumption to be met. Transformations in Regression provides a helpful table laying out each transformation if you'd like to learn more.
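
For illustration, these transformations can be written directly into the model formula; the variable names here are the same placeholders used in the example code below:

log_model   <- lm(log(variable2) ~ log(variable1), data = data)   # log transformation
sqrt_model  <- lm(sqrt(variable2) ~ variable1, data = data)       # square root transformation
power_model <- lm(I(variable2^2) ~ variable1, data = data)        # power transformation (here, squaring the response)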

Covariance and correlation

Covariance and correlation are numerical summaries used to better understand the relationships between variables, and they enhance linear regression analysis. The sign of the covariance tells you whether a relationship is negative or positive, while the correlation captures both direction and strength on a scale from -1, indicating a strong negative relationship, to 1, indicating a strong positive relationship. For example, if the correlation between being a minority student and academic success were 0.86, data scientists would have reason to conclude that there is a strong positive association between the two, though correlation alone does not establish that one causes the other.
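
Here is a small illustration with simulated data (the names study_hours and exam_score are hypothetical): covariance and correlation share a sign, but only correlation is bounded between -1 and 1.

set.seed(1)
study_hours <- runif(50, 0, 20)                             # hypothetical predictor
exam_score  <- 60 + 1.5 * study_hours + rnorm(50, sd = 5)   # hypothetical response

cov(study_hours, exam_score)   # positive, but its magnitude depends on the units
cor(study_hours, exam_score)   # positive and always between -1 and 1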

Cautions to take when using linear regression

  • All four of the above assumptions must be met in order for valid use of a linear regression model.
  • These models should only be used to predict values within the range of the observed data; extrapolating beyond that range is unreliable.
  • If a transformation is used, make sure to back-transform predictions so they return to the original scale (a short sketch follows this list).
  • The magnitude of a covariance is hard to interpret on its own. Statisticians mainly care about whether the covariance is negative or positive.
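
As a rough sketch of back-transforming predictions from a log-transformed model (data, variable1, and variable2 are the same placeholders used in the example code below):

log_slr  <- lm(log(variable2) ~ log(variable1), data = data)   # log-transformed model
new_vals <- data.frame(variable1 = c(5, 10))                   # hypothetical values inside the data's range
exp(predict(log_slr, newdata = new_vals))                      # exp() undoes the log, returning the original scale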

How do I apply it to a data set?

Below you will find code written in R that explores simple linear regression analysis.

Example code

Load needed libraries.

library(tidyverse)   # read_table(), ggplot(), tibble()
library(lmtest)      # bptest() for the Breusch-Pagan test
library(normtest)    # additional normality tests (optional here)
library(MASS)        # stdres() for standardized residuals

Read in the data and explore it through visualization, correlation, and covariance. Use the plot to also check whether the linearity assumption is met.

data <- read_table("data_file.txt")

cor(data$variable2,data$variable1)
cov(data$variable2,data$variable1)

ggplot(data, aes(x = variable1 ,y = variable2)) +
  geom_point()+
  geom_smooth(se=F)+
  labs(title = "Plot 1")

Create a simple linear regression (slr) model, then make a fitted-values-versus-residuals plot to check whether the independence assumption is met.

slr <- lm(variable2 ~ variable1, data = data)

ggplot(mapping=aes(x=slr$fitted.values, y=slr$residuals)) +
  geom_point() +
  geom_hline(yintercept=0,linetype="dashed")+
  labs(y="Residuals", x = "Fitted Values", title = "Plot 2")

Create a histogram of the residuals to check whether the normality assumption is met. Run a Kolmogorov-Smirnov (KS) test on the standardized residuals to double-check; a large p-value means there is no evidence against normality.

ggplot() +
  geom_histogram(mapping=aes(x=stdres(slr)))+
  labs(x="Residuals", title = "Plot 3")

ks.test(stdres(slr), "pnorm")

Check whether the equal variance assumption is met by referring to the fitted values and residuals plot (Plot 2) as well as running a Breusch-Pagan (BP) test; a large p-value means there is no evidence against equal variance. (Optional) Identify any outliers in the data by calculating Cook's distance.

bptest(slr)

cd <- cooks.distance(slr)
data[cd>(4/nrow(data)),]   # observations above the common 4/n Cook's distance cutoff

Apply a transformation to the data if needed. In this example, a log transformation is applied to both variables.

log_slr <- lm(log(variable2)~log(variable1), data=data)

Check whether the four assumptions are met with the applied transformation. Strictly speaking, assumptions that were already met don't have to be rechecked, but doing so is a good safety net. This code closely mirrors the code used to check the assumptions the first time.

ggplot(data, aes(x=log(variable1),log(variable2))) +
  geom_point()+
  geom_smooth(se=F)+
  labs(title = "Plot 4")

ggplot(mapping=aes(x=log_slr$fitted.values, y=log_slr$residuals)) +
  geom_point() +
  geom_hline(yintercept=0,linetype="dashed")+
  labs(y="Residuals", x = "Fitted Values", title = "Plot 5")
  
ggplot() +
  geom_histogram(mapping=aes(x=stdres(log_slr)))+
  labs(x="Residuals", title = "Plot 6")

ks.test(stdres(log_slr), "pnorm")
 
bptest(log_slr)

Use cross-validation to assess how well the transformed slr model predicts and whether it should be used for prediction.

num.cv <- 100              # number of cross-validation repetitions
bias <- rep(NA, num.cv)    # storage for the bias from each repetition
RPMSE <- rep(NA, num.cv)   # storage for the root predictive mean squared error
num.test <- 10             # number of observations held out in each test set

for(i in 1:num.cv){
  test.obs <- sample(1:nrow(data), num.test)   # randomly choose which rows go in the test set

  test.set <- data[test.obs,]
  train.set <- data[-test.obs,]

  train.lm <- lm(log(variable2)~log(variable1), data=train.set)   # fit the transformed model on the training set

  test.preds <- predict.lm(train.lm, newdata=test.set)   # predict on the log scale

  test.preds <- exp(test.preds)   # back-transform predictions to the original scale

  bias[i] <- mean(test.preds-test.set$variable2)   # average over- or under-prediction

  RPMSE[i] <- sqrt(mean((test.preds-test.set$variable2)^2))   # root predictive mean squared error
}

tibble("Average Bias"=mean(bias), "Average Error" = mean(RPMSE))

Generate predictions across a grid of variable1 values, back-transform them to the original scale, and plot the fitted curve of the transformed model over the original data.

pred <- data.frame(variable1 = seq(4,40,by = 1))      # grid of variable1 values (choose a range covered by your data)
pred$variable2 <- predict.lm(log_slr, newdata=pred)   # predictions on the log scale
pred$variable2 <- exp(pred$variable2)                 # back-transform to the original scale

ggplot(data=data, mapping=aes(x=variable1, y=variable2)) +
  geom_point() + 
  geom_line(data=pred, aes(x=variable1,y=variable2)) +
  labs(title = "Plot 7: Transformed Model on Original Scale")

Make the desired predictions of variable2, making sure that the chosen variable1 values fall within the span of the data.

pred <- data.frame(variable1 = c(value1,value2))
exp(predict.lm(log_slr, newdata=pred))

I invite you to choose a dataset you’d like to play around with and use the above code to create a simple linear regression model! Feel free to leave any questions, concerns, or new insights you discover in the comments.