Simple Linear Regression

Simple Linear Regression in R (2018) [STHDA]

Formula and basics

Simple linear regression predicts a quantitative outcome variable y, on the basis of a single predictor variable x.

An important assumption of the linear regression model is that the relationship between the predictor variables and the outcome is linear and additive.

Residual Sum of Squares (RSS)

  • The sum of the squares of the residual errors

Ordinary least squares regression (OLS)

  • b0 and b1 (the intercept and slope) are determined so that the RSS is minimized

Residual Standard Error (RSE)

  • The average variation of points around the fitted regression line

  • The lower the RSE, the better the fitted regression model
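A minimal sketch of computing RSS and RSE by hand; since the marketing data is loaded later, the built-in mtcars data (mpg vs. wt) stands in here so the snippet is self-contained.

```r
## mtcars stands in for the marketing data (loaded below)
model <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(model)^2)   ## Residual Sum of Squares
n   <- nrow(mtcars)
rse <- sqrt(rss / (n - 2))       ## Residual Standard Error (n - 2 df for simple regression)

## sigma() reports the same RSE that summary.lm() prints
all.equal(rse, sigma(model))     ## TRUE
```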

Load & preview the data

library(tidyverse) ## For data manipulation and visualization
library(ggpubr)    ## For publication-ready plots

The marketing data set describes the impact of YouTube, Facebook, and newspaper advertising on sales.

data("marketing", package = "datarium") 

Visualize the data

Create a scatterplot displaying the sales in thousands of dollars vs. the YouTube advertising budget:

ggplot(marketing, aes(x = youtube, y = sales)) + 
  geom_point()

Correlation coefficient

  • Measures the level of association between X and Y

  • -1: perfect negative correlation

  • +1: perfect positive correlation

  • ~0: weak relationship between the variables

cor(marketing$sales, marketing$youtube)

Fit a linear regression model

The jtools package is very useful for summarizing and visualizing regression results.

model <- lm(sales ~ youtube, data = marketing) 
  • The intercept (8.43) means that when the YouTube advertising budget is $0, we can expect $8,430 in sales
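A hedged sketch of extracting and interpreting the fitted coefficients; mtcars stands in for the marketing data so the snippet runs on its own.

```r
## mtcars stands in for the marketing data
model <- lm(mpg ~ wt, data = mtcars)

b <- coef(model)
b["(Intercept)"]   ## expected outcome when the predictor is 0
b["wt"]            ## change in outcome per one-unit increase in the predictor

## The fitted value at x = 0 is exactly the intercept
all.equal(unname(predict(model, newdata = data.frame(wt = 0))),
          unname(b["(Intercept)"]))   ## TRUE
```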

Fit the linear regression line

The confidence band reflects the uncertainty about the fitted line

ggplot(marketing, aes(youtube, sales)) + 
  geom_point() + 
  stat_smooth(method = lm)

Model assessment

Model summary

  • Call: the function used to compute the regression model

  • Residuals: the distribution of the residuals, which should have a median near 0 and roughly symmetric min and max

  • Coefficients: the regression beta coefficients and their statistical significance (indicated by asterisks)

  • RSE, R² and F-statistic: metrics of how well the model fits the data

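The four blocks above can be inspected with summary(); mtcars stands in for the marketing data so the snippet is self-contained.

```r
## mtcars stands in for the marketing data
model <- lm(mpg ~ wt, data = mtcars)

## Prints: Call, Residuals, Coefficients, and RSE / R-squared / F-statistic
summary(model)
```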

Coefficients significance

Standard errors

Measures the variability/accuracy of the beta coefficients. Can be used to compute the confidence intervals of the coefficients.


A t-test is performed to check whether or not these coefficients are significantly different from zero. High t-statistics (which go with low p-values near 0) indicate that a predictor should be retained in a model.
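A sketch of how the t statistics relate to the standard errors, and how the standard errors yield confidence intervals; mtcars stands in for the marketing data.

```r
## mtcars stands in for the marketing data
model <- lm(mpg ~ wt, data = mtcars)
ct <- summary(model)$coefficients   ## Estimate, Std. Error, t value, Pr(>|t|)

## The t statistic is the estimate divided by its standard error
all.equal(ct[, "t value"], ct[, "Estimate"] / ct[, "Std. Error"])   ## TRUE

## 95% confidence intervals computed from the standard errors
confint(model)
```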

Model accuracy

RSE (Closer to zero, the better)

  • Whether the RSE is acceptable depends on the problem

R² (Higher the better)

  • Represents the proportion of information (i.e. variation) in the data that can be explained by the model, or how well the model fits the data

  • For a simple linear regression, R² is the square of the Pearson correlation coefficient

  • The adjusted R² corrects for the degrees of freedom. Since R² tends to increase as more predictors are added, the adjusted R² is the metric to use for multiple linear regression models
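The equivalence above is easy to verify in code; as elsewhere, mtcars stands in for the marketing data so the snippet runs stand-alone.

```r
## For simple linear regression, R-squared equals the squared Pearson
## correlation between the outcome and the predictor.
model <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(model)$r.squared

all.equal(r2, cor(mtcars$mpg, mtcars$wt)^2)   ## TRUE
```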

F statistic (Higher the better)

  • Gives the overall significance of the model, assessing whether at least one predictor variable has a non-zero coefficient

  • For a simple linear regression, this just duplicates the information of the t-test in the coefficient table

  • More important in multiple linear regression

  • A large F statistic corresponds to a statistically significant P value
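The duplication noted above can be checked directly: with a single predictor, the overall F statistic is the square of the slope's t statistic. mtcars again stands in for the marketing data.

```r
## mtcars stands in for the marketing data
model <- lm(mpg ~ wt, data = mtcars)

f <- summary(model)$fstatistic["value"]
t_slope <- summary(model)$coefficients["wt", "t value"]

all.equal(unname(f), t_slope^2)   ## TRUE: F = t^2 for one predictor
```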
