Simple Linear Regression

Simple Linear Regression in R (2018) [STHDA]

Formula and basics

Simple linear regression predicts a quantitative outcome variable y on the basis of a single predictor variable x.

An important assumption of the linear regression model is that the relationship between the predictor variables and the outcome is linear and additive.
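
In formula terms, the model can be written as

  y = b0 + b1*x + e

where b0 is the intercept (the expected value of y when x = 0), b1 is the slope (the change in y for a one-unit increase in x), and e is the error term.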

Residual Sum of Squares (RSS)

  • The sum of the squares of the residual errors

Ordinary least squares regression (OLS)

  • b0 and b1 are determined so that the RSS is minimized

Residual Standard Error (RSE)

  • The average variation of points around the fitted regression line

  • The lower the RSE, the better the fitted regression model
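
As a rough sketch of how these quantities relate in R (assuming a fitted lm object called model, as created later in these notes):

rss <- sum(residuals(model)^2)         ## residual sum of squares
rse <- sqrt(rss / df.residual(model))  ## residual standard error: sqrt(RSS / residual degrees of freedom)
sigma(model)                           ## the same RSE, as reported by R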

Load & preview the data

library(tidyverse) ## For data manipulation and visualization
library(ggpubr)    ## For publication-ready plots
theme_set(theme_pubr())

The marketing data set describes the impact of YouTube, Facebook and newspaper advertising on sales.

#devtools::install_github('kassambara/datarium')
library(datarium)
data("marketing", package = "datarium") 
head(marketing)

Visualize the data

Create a scatter plot displaying sales (in thousands of dollars) vs. the YouTube advertising budget:

ggplot(marketing, aes(x = youtube, y = sales)) + 
  geom_point() + 
  stat_smooth()

Correlation coefficient

  • Measures the level of association between X and Y

  • -1: perfect negative correlation

  • +1: perfect positive correlation

  • ~0: weak relationship between the variables

cor(marketing$sales, marketing$youtube)
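
A related sketch: cor.test() from base R reports the same correlation along with a test of whether it differs from zero.

cor.test(marketing$sales, marketing$youtube)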

Fit a linear regression model

The jtools package is very useful for summarizing and visualizing regression results; see the package documentation for more.

library(jtools)
model <- lm(sales ~ youtube, data = marketing) 
summ(model)
  • An intercept of 8.43 means that when the YouTube advertising budget is $0, we can expect about $8,430 in sales
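
A small illustration of using the fitted coefficients for prediction (the budget value 150 is an arbitrary example, not taken from the source):

coef(model)  ## b0 (intercept) and b1 (slope)
## Predicted sales when youtube = 150, with a confidence interval
predict(model, newdata = data.frame(youtube = 150), interval = "confidence")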

Fit the linear regression line

The confidence band reflects the uncertainty about the fitted line.

ggplot(marketing, aes(youtube, sales)) + 
  geom_point() + 
  stat_smooth(method = lm)
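
Since jtools is already loaded, its effect_plot() gives a similar view of the fitted line (a sketch; plot.points overlays the raw data):

effect_plot(model, pred = youtube, interval = TRUE, plot.points = TRUE)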

Model assessment

Model summary

  • Call: the function used to compute the regression model

  • Residuals: distribution of the residuals (the median should be ~0, and the min and max should be roughly symmetric in magnitude)

  • Coefficients: the regression beta coefficients and their statistical significance. Statistical significance is indicated by asterisks.

  • RSE, R^2 and F-statistic: metrics of how well the model fits the data

summary(model)

Coefficients significance

Standard errors

Measures the variability/accuracy of the beta coefficients. Can be used to compute the confidence intervals of the coefficients.
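
For example, 95% confidence intervals for the coefficients can be obtained directly:

confint(model, level = 0.95)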

t-value

A t-test is performed to check whether or not these coefficients are significantly different from zero. High t-statistics (which go with low p-values near 0) indicate that a predictor should be retained in a model.
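
The estimates, standard errors, t-values and p-values can also be pulled out of the summary as a matrix:

summary(model)$coefficients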

Model accuracy

RSE (the closer to zero, the better)

  • Whether the RSE is acceptable depends on the problem

R^2 (the higher, the better)

  • Represents the proportion of information (i.e. variation) in the data that can be explained by the model, or how well the model fits the data

  • For a simple linear regression, R^2 is the square of the Pearson correlation coefficient (see the sketch after this list)

  • The adjusted R^2 is adjusted for the degrees of freedom. Since R^2 tends to increase as more predictors are added, the adjusted version is the metric to consider for multiple linear regression models
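
A quick check of the points above (assuming the model and data from earlier):

summary(model)$r.squared                   ## R^2 reported by the model
cor(marketing$sales, marketing$youtube)^2  ## squared Pearson correlation -- the same value
summary(model)$adj.r.squared               ## adjusted R^2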

F-statistic (the higher, the better)

  • Gives the overall significance of the model, assessing whether at least one predictor variable has a non-zero coefficient

  • For a simple linear regression, this just duplicates the information of the t-test in the coefficient table

  • More important in multiple linear regression

  • A large F-statistic corresponds to a statistically significant p-value
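
The F-statistic and its p-value can be recovered from the summary object (a minimal sketch):

fstat <- summary(model)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  ## overall model p-value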
