Simple Linear Regression

Simple Linear Regression in R (2018) [STHDA]

Formula and basics

Simple linear regression predicts a quantitative outcome variable y on the basis of a single predictor variable x.

An important assumption of the linear regression model is that the relationship between the predictor variables and the outcome is linear and additive.

Residual sum of squares (RSS)

• The sum of the squares of the residual errors

Ordinary least squares regression (OLS)

• $b_0$ and $b_1$ are determined so that the RSS is minimized
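
For reference, the quantities named above can be written out as follows. This is the standard textbook formulation, not something taken from the original notes; the symbols $e_i$, $\bar{x}$ and $\bar{y}$ (residual and sample means) are notation assumed here.

```latex
\begin{aligned}
y_i &= b_0 + b_1 x_i + e_i \\
\mathrm{RSS} &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
\end{aligned}
```

The last line gives the values of $b_0$ and $b_1$ that minimize the RSS.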

Residual Standard Error (RSE)

• The average variation of points around the fitted regression line

• The lower the RSE, the better the fitted regression model
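
A minimal sketch of how the OLS estimates, the RSS and the RSE relate, assuming the usual definition RSE = sqrt(RSS / (n - 2)) for a model with one predictor. It uses R's built-in `cars` data as a stand-in (an assumption made so the block runs with base R only); the variable names are illustrative.

```r
# Stand-in data: built-in `cars` (speed vs. stopping distance)
x <- cars$speed
y <- cars$dist
n <- length(y)

# Closed-form OLS estimates, i.e. the values that minimize the RSS
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Residual sum of squares and residual standard error
rss <- sum((y - (b0 + b1 * x))^2)
rse <- sqrt(rss / (n - 2))  # n - 2 because two coefficients were estimated

# Compare with what lm() reports
fit <- lm(dist ~ speed, data = cars)
print(c(b0 = b0, b1 = b1, rse = rse))
print(c(coef(fit), sigma = sigma(fit)))
```

The two printed vectors should agree up to floating-point error.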

library(tidyverse) ## For data manipulation and visualization
library(ggpubr)    ## For publication-ready plots
theme_set(theme_pubr())

The marketing data set describes the impact of YouTube, Facebook and newspaper advertising on sales.

#devtools::install_github('kassambara/datarium')
library(datarium)
data("marketing", package = "datarium")
head(marketing)

Visualize the data

Create a scatterplot displaying the sales in thousands of dollars vs. the YouTube advertising budget:

ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()

Correlation coefficient

• Measures the strength of the linear association between x and y

• -1: perfect negative correlation

• +1: perfect positive correlation

• ~0: weak linear relationship between the variables

cor(marketing$sales, marketing$youtube)
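
As a rough illustration of what `cor()` computes, the Pearson coefficient can be reproduced from its definition. The built-in `cars` data is used here as a stand-in (an assumption, so the block runs without `datarium`); `r_manual` and `r_builtin` are illustrative names.

```r
x <- cars$speed
y <- cars$dist

# Pearson correlation: covariation of x and y, scaled by their spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)  # method = "pearson" is the default
print(c(manual = r_manual, builtin = r_builtin))

# cor.test() adds a confidence interval and a test of zero correlation
print(cor.test(x, y))
```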

Fit a linear regression model

The jtools package is useful for summarizing and visualizing regression results.

library(jtools)
model <- lm(sales ~ youtube, data = marketing)
summ(model)

• Intercept = 8.43 means that when the YouTube advertising budget is zero, we can expect about $8,430 in sales (sales are recorded in thousands of dollars)
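
The reading of the intercept (and, by the same logic, the slope) can be checked with `predict()`. This is a sketch on the built-in `cars` data, used as a stand-in so the block runs with base R only; the same calls should apply to a model fitted on `marketing`.

```r
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)

# Intercept: the predicted outcome when the predictor equals 0
pred_at_zero <- predict(fit, newdata = data.frame(speed = 0))

# Slope: the change in the prediction for a one-unit increase in the predictor
pred_10 <- predict(fit, newdata = data.frame(speed = 10))
pred_11 <- predict(fit, newdata = data.frame(speed = 11))

print(b)
print(c(at_zero = unname(pred_at_zero), per_unit = unname(pred_11 - pred_10)))
```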

Fit the linear regression line

The confidence band reflects the uncertainty about the fitted line.

ggplot(marketing, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = lm)
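
The band drawn by the plot should correspond to pointwise confidence intervals for the fitted mean, which can be obtained as numbers with `predict(..., interval = "confidence")`. A sketch on the built-in `cars` data (a stand-in, assumed here so the block needs no extra packages); `grid` and `band` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)

# A few predictor values spanning the observed range
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 5))

# Columns: fit (fitted line), lwr and upr (95% confidence limits)
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
print(cbind(grid, band))

# The band is narrowest near the mean of the predictor
width <- band[, "upr"] - band[, "lwr"]
print(grid$speed[which.min(width)])
```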

Model assessment

Model summary

• Call: the function used to compute the regression model

• Residuals: distribution of the residuals (the median should be close to 0, and the minimum and maximum should be roughly equal in absolute value)

• Coefficients: the regression beta coefficients and their statistical significance. Statistical significance is indicated by asterisks.

• RSE, $R^2$ and F-statistic: metrics of how well the model fits the data

summary(model)
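
Each part of the printed summary can also be pulled out of the summary object. A minimal sketch, again on the built-in `cars` data as a stand-in (an assumption, to keep the block self-contained):

```r
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

print(s$call)                    # Call
print(quantile(residuals(fit)))  # Residuals: min, quartiles, max
print(s$coefficients)            # Estimate, Std. Error, t value, Pr(>|t|)
print(c(RSE = s$sigma, R2 = s$r.squared, adjR2 = s$adj.r.squared))
print(s$fstatistic)              # value, numdf, dendf
```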

Coefficients significance

Standard errors

Measure the variability (precision) of the estimated beta coefficients. They can be used to compute confidence intervals for the coefficients.
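
As a sketch of how standard errors turn into confidence intervals, a 95% interval is the estimate plus or minus a t quantile times the standard error. The example uses the built-in `cars` data as a stand-in (an assumption); `ci_manual` is an illustrative name.

```r
fit <- lm(dist ~ speed, data = cars)
tab <- coef(summary(fit))

est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]

# Critical value from the t distribution with n - 2 degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci_manual)
print(confint(fit, level = 0.95))  # built-in equivalent
```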

t-value

A t-test is performed to check whether each coefficient is significantly different from zero. A t-statistic that is large in absolute value (which goes with a p-value near 0) indicates that the predictor should be retained in the model.
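
The t-value and its p-value can be reproduced from the coefficient table: t is the estimate divided by its standard error, and the two-sided p-value comes from the t distribution. A sketch on the built-in `cars` data (a stand-in, assumed here):

```r
fit <- lm(dist ~ speed, data = cars)
tab <- coef(summary(fit))

# t statistic: how many standard errors the estimate is away from zero
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value with n - 2 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(cbind(t_manual, p_manual))
print(tab[, c("t value", "Pr(>|t|)")])
```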

Model accuracy

RSE (the closer to zero, the better)

• Whether the RSE is acceptable depends on the problem

$R^2$ (the higher, the better)

• Represents the proportion of information (i.e. variation) in the data that can be explained by the model, or how well the model fits the data

• For a simple linear regression, $R^2$ is the square of the Pearson correlation coefficient

• The adjusted $R^2$ is adjusted for the degrees of freedom. Because $R^2$ tends to increase as more predictors are added, the adjusted $R^2$ is the metric to consider for a multiple linear regression model
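
The three statements above can be checked numerically. This sketch assumes the usual definitions $R^2 = 1 - RSS/TSS$ and adjusted $R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)$ with p predictors, and uses the built-in `cars` data as a stand-in.

```r
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

n <- nrow(cars)
p <- 1  # number of predictors in a simple linear regression

rss <- sum(residuals(fit)^2)                  # unexplained variation
tss <- sum((cars$dist - mean(cars$dist))^2)   # total variation in the outcome

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
r <- cor(cars$speed, cars$dist)

print(c(r2 = r2, r_squared = r^2, adj_r2 = adj_r2))
print(c(r2 = s$r.squared, adj_r2 = s$adj.r.squared))
```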

F statistic (the higher, the better)

• Gives the overall significance of the model, assessing whether at least one predictor variable has a non-zero coefficient

• For a simple linear regression, this just duplicates the information of the t-test in the coefficient table

• More important in multiple linear regression

• A large F statistic corresponds to a statistically significant p-value
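
With a single predictor, the duplication mentioned above can be seen directly: the F statistic equals the square of the slope's t-value, and both tests give the same p-value. A sketch on the built-in `cars` data (a stand-in, assumed here so the block runs with base R only):

```r
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

f <- s$fstatistic                        # value, numdf, dendf
t_slope <- coef(s)["speed", "t value"]

# p-value of the F test from the F distribution
p_f <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
p_t <- coef(s)["speed", "Pr(>|t|)"]

print(c(F = unname(f["value"]), t_squared = t_slope^2))
print(c(p_from_F = unname(p_f), p_from_t = p_t))
```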