Simple Linear Regression
Simple Linear Regression in R (2018) [STHDA]
Formula and basics
Simple linear regression predicts a quantitative outcome variable y on the basis of a single predictor variable x
An important assumption of the linear regression model is that the relationship between the predictor variables and the outcome is linear and additive.
Residual Sum of Squares (RSS)
Ordinary least squares regression (OLS)
Residual Standard Error (RSE)
The average variation of the points around the fitted regression line
The lower the RSE, the better the model fits the data
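A minimal sketch of these definitions in base R, using a small made-up data set (not the marketing data): an OLS fit minimizes the RSS, and the RSE is the square root of RSS divided by the residual degrees of freedom (n - 2 for simple regression).

```r
# Toy data: a noisy linear relationship (made-up values, for illustration only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

fit <- lm(y ~ x)                # OLS fit: minimizes the residual sum of squares
rss <- sum(residuals(fit)^2)    # RSS = sum of squared residuals
n   <- length(y)
rse <- sqrt(rss / (n - 2))      # RSE for simple regression: df = n - 2
rse
sigma(fit)                      # lm() reports the same residual standard error
```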
Load & preview the data
The marketing data set (from the datarium package) describes the impact of YouTube, Facebook, and newspaper advertising on sales.
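One way to load and preview it, assuming the datarium package (which ships this data set) is installed:

```r
# Load the marketing data set from the datarium package
data("marketing", package = "datarium")
head(marketing, 4)    # columns: youtube, facebook, newspaper, sales
```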
Visualize the data
Create a scatterplot displaying sales in thousands of dollars vs. the YouTube advertising budget:
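A minimal sketch with ggplot2, assuming the marketing data from the datarium package is available:

```r
library(ggplot2)
data("marketing", package = "datarium")

# Sales (in thousands of dollars) vs. YouTube advertising budget
p <- ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  labs(x = "YouTube advertising budget",
       y = "Sales (thousands of dollars)")
p
```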
Measures the level of association between X and Y
-1: perfect negative correlation
+1: perfect positive correlation
~0: weak relationship between the variables
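A quick check of this association with base R's cor(), again assuming the marketing data from the datarium package is available:

```r
data("marketing", package = "datarium")

# Pearson correlation between sales and the YouTube advertising budget
cor(marketing$sales, marketing$youtube)
```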
Fit a linear regression model
The jtools package is very useful for summarizing and visualizing regression results.
Fit the linear regression line
The confidence band reflects the uncertainty about the fitted line
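A sketch of the fit and its confidence band, assuming the marketing data from the datarium package and ggplot2 are available; geom_smooth(method = "lm", se = TRUE) draws the fitted line together with its 95% confidence band.

```r
library(ggplot2)
data("marketing", package = "datarium")

# Fit sales as a linear function of the YouTube budget
model <- lm(sales ~ youtube, data = marketing)

# Scatterplot with the fitted regression line and its confidence band
p <- ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
p

summary(model)   # Call, Residuals, Coefficients, RSE, R-squared, F-statistic
```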
Call: the function used to compute the regression model
Residuals: distribution of the residuals (the median should be near 0, and the min and max should be roughly symmetric in magnitude)
Coefficients: the regression beta coefficients and their statistical significance. Statistical significance indicated by asterisks.
RSE, and F-statistic: metrics of how well the model fits to the data
Measures the variability/accuracy of the beta coefficients. Can be used to compute the confidence intervals of the coefficients.
A t-test is performed to check whether or not these coefficients are significantly different from zero. High t-statistics (which go with low p-values near 0) indicate that a predictor should be retained in a model.
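These quantities can be read directly off the fitted model; a sketch, assuming the marketing data from the datarium package is available:

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)

# Coefficient table: estimates, standard errors, t statistics, p-values
coef(summary(model))

# 95% confidence intervals for the intercept and slope
confint(model, level = 0.95)
```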
RSE (the closer to zero, the better)
R² (the higher the better)
Represents the proportion of information (i.e. variation) in the data that can be explained by the model, or how well the model fits the data
For a simple linear regression, R² is the square of the Pearson correlation coefficient
The adjusted R² is adjusted for the degrees of freedom. As R² tends to increase when more predictors are added, this metric should be preferred for multiple linear regression models
F statistic (Higher the better)
Gives the overall significance of the model, assessing whether at least one predictor variable has a non-zero coefficient
For a simple linear regression, this just duplicates the information of the t-test in the coefficient table
More important in multiple linear regression
A large F statistic corresponds to a statistically significant p-value
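These model-fit metrics can all be extracted from the summary object; a sketch, assuming the marketing data from the datarium package is available, which also checks that R² equals the squared Pearson correlation in the simple-regression case:

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)
s <- summary(model)

s$r.squared        # proportion of variation in sales explained by the model
s$adj.r.squared    # R-squared penalized for the number of predictors
s$fstatistic       # overall F statistic and its degrees of freedom

# In simple regression, R-squared equals the squared Pearson correlation
all.equal(s$r.squared, cor(marketing$sales, marketing$youtube)^2)
```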