Simple Linear Regression

Simple Linear Regression in R (2018) [STHDA]

Formula and basics

Simple linear regression predicts a quantitative outcome variable y, on the basis of a single predictor variable x.

An important assumption of the linear regression model is that the relationship between the predictor variables and the outcome is linear and additive.

Residual Sum of Squares (RSS)

  • The sum of the squares of the residual errors

Ordinary least squares regression (OLS)

  • b0 and b1 (the intercept and slope) are determined so that the RSS is minimized

Residual Standard Error (RSE)

  • The average variation of points around the fitted regression line

  • The lower the RSE, the better the fitted regression model
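A minimal sketch of computing RSS and RSE by hand; since the marketing data is loaded later, the built-in mtcars data (mpg vs. wt) stands in here so the snippet is self-contained.

```r
## mtcars stands in for the marketing data (loaded below)
model <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(model)^2)   ## Residual Sum of Squares
n   <- nrow(mtcars)
rse <- sqrt(rss / (n - 2))       ## Residual Standard Error (n - 2 df for simple regression)

## sigma() reports the same RSE that summary.lm() prints
all.equal(rse, sigma(model))     ## TRUE
```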

Load & preview the data

library(tidyverse) ## For data manipulation and visualization
library(ggpubr)    ## For publication-ready plots

The marketing data set describes the impact of YouTube, Facebook, and newspaper advertising on sales.

data("marketing", package = "datarium") 

Visualize the data

Create a scatterplot displaying the sales in thousands of dollars vs. the YouTube advertising budget:

ggplot(marketing, aes(x = youtube, y = sales)) + 
  geom_point()

Correlation coefficient

  • Measures the level of association between X and Y

  • -1: perfect negative correlation

  • +1: perfect positive correlation

  • ~0: weak relationship between the variables

cor(marketing$sales, marketing$youtube)

Fit a linear regression model

The jtools package is very useful for summarizing and visualizing regression results.

model <- lm(sales ~ youtube, data = marketing) 
  • The intercept (8.43) means that when the YouTube advertising budget is $0, we can expect $8,430 in sales
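A hedged sketch of extracting and interpreting the fitted coefficients; mtcars stands in for the marketing data so the snippet runs on its own.

```r
## mtcars stands in for the marketing data
model <- lm(mpg ~ wt, data = mtcars)

b <- coef(model)
b["(Intercept)"]   ## expected outcome when the predictor is 0
b["wt"]            ## change in outcome per one-unit increase in the predictor

## The fitted value at x = 0 is exactly the intercept
all.equal(unname(predict(model, newdata = data.frame(wt = 0))),
          unname(b["(Intercept)"]))   ## TRUE
```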

Fit the linear regression line

The confidence band reflects the uncertainty about the fitted line

ggplot(marketing, aes(youtube, sales)) + 
  geom_point() + 
  stat_smooth(method = lm)

Model assessment

Model summary

  • Call: the function used to compute the regression model

  • Residuals: the distribution of the residuals, which should have a median near 0 and roughly symmetric min and max

  • Coefficients: the regression beta coefficients and their statistical significance (indicated by asterisks)

  • RSE, R² and F-statistic: metrics of how well the model fits the data

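The four blocks above can be inspected with summary(); mtcars stands in for the marketing data so the snippet is self-contained.

```r
## mtcars stands in for the marketing data
model <- lm(mpg ~ wt, data = mtcars)

## Prints: Call, Residuals, Coefficients, and RSE / R-squared / F-statistic
summary(model)
```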

Coefficients significance

Standard errors

Measures the variability/accuracy of the beta coefficients. Can be used to compute the confidence intervals of the coefficients.


A t-test is performed to check whether or not these coefficients are significantly different from zero. High t-statistics (which go with low p-values near 0) indicate that a predictor should be retained in a model.
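A sketch of how the t statistics relate to the standard errors, and how the standard errors yield confidence intervals; mtcars stands in for the marketing data.

```r
## mtcars stands in for the marketing data
model <- lm(mpg ~ wt, data = mtcars)
ct <- summary(model)$coefficients   ## Estimate, Std. Error, t value, Pr(>|t|)

## The t statistic is the estimate divided by its standard error
all.equal(ct[, "t value"], ct[, "Estimate"] / ct[, "Std. Error"])   ## TRUE

## 95% confidence intervals computed from the standard errors
confint(model)
```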

Model accuracy

RSE (Closer to zero, the better)

  • Whether the RSE is acceptable depends on the problem

R² (Higher the better)

  • Represents the proportion of information (i.e. variation) in the data that can be explained by the model, or how well the model fits the data

  • For a simple linear regression, R² is the square of the Pearson correlation coefficient

  • The adjusted R² corrects for the degrees of freedom. Since R² tends to increase as more predictors are added, the adjusted R² is the metric to use for multiple linear regression models
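The equivalence above is easy to verify in code; as elsewhere, mtcars stands in for the marketing data so the snippet runs stand-alone.

```r
## For simple linear regression, R-squared equals the squared Pearson
## correlation between the outcome and the predictor.
model <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(model)$r.squared

all.equal(r2, cor(mtcars$mpg, mtcars$wt)^2)   ## TRUE
```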

F statistic (Higher the better)

  • Gives the overall significance of the model, assessing whether at least one predictor variable has a non-zero coefficient

  • For a simple linear regression, this just duplicates the information of the t-test in the coefficient table

  • More important in multiple linear regression

  • A large F statistic corresponds to a statistically significant P value
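The duplication noted above can be checked directly: with a single predictor, the overall F statistic is the square of the slope's t statistic. mtcars again stands in for the marketing data.

```r
## mtcars stands in for the marketing data
model <- lm(mpg ~ wt, data = mtcars)

f <- summary(model)$fstatistic["value"]
t_slope <- summary(model)$coefficients["wt", "t value"]

all.equal(unname(f), t_slope^2)   ## TRUE: F = t^2 for one predictor
```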
